Understanding Usage and Limits
This guide explains how to monitor, control, and optimize usage of Azure OpenAI Service.
Key Concepts
- Tokens – The unit of consumption. Both input (prompt) and output (completion) tokens count toward usage.
- Rate Limits – The maximum requests per minute and tokens per minute allowed for each deployment.
- Quota – The tokens-per-minute (TPM) capacity assigned to a subscription per model, per region, which you allocate across your deployments.
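Because both input and output tokens are billable, it helps to budget them before sending a request. The sketch below uses the common rough heuristic of about four characters per token; this is an approximation only, and accurate counts require the model's actual tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters per token heuristic.

    An approximation for budgeting only; real counts come from the
    model's tokenizer.
    """
    return max(1, len(text) // 4)

prompt = "Summarize the quarterly sales report in three bullet points."
completion = "Sales rose 12%. Margins held steady. APAC led growth."

# Both input and output tokens count toward usage.
total = estimate_tokens(prompt) + estimate_tokens(completion)
print(f"Estimated billable tokens: {total}")
```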
Monitoring Usage
Use the Azure portal or the REST API to retrieve usage metrics.
Azure Portal

Navigate to Azure OpenAI → Deployments → Usage to view daily token consumption, request counts, and throttling events.

REST API

Call the account's usages endpoint to retrieve JSON metrics:

```
GET https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{rg}/providers/Microsoft.CognitiveServices/accounts/{accountName}/usages?api-version=2023-08-01
```
Sample Code: Retrieve Token Usage (Python)
```python
import os

from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient

# Authenticates via environment variables, managed identity, or an Azure CLI login.
credential = DefaultAzureCredential()
client = CognitiveServicesManagementClient(credential, os.environ["AZURE_SUBSCRIPTION_ID"])

# List usage metrics for the account; the optional filter narrows the window.
usages = client.accounts.list_usages(
    resource_group_name="myResourceGroup",
    account_name="myOpenAIAccount",
    filter="properties/usageStart eq '2024-01-01'",
)
for u in usages.value:
    print(f"Metric: {u.name.value}, Current: {u.current_value}, Limit: {u.limit}")
```
Sample Code: Retrieve Token Usage (JavaScript)
```javascript
const { DefaultAzureCredential } = require("@azure/identity");
const { CognitiveServicesManagementClient } = require("@azure/arm-cognitiveservices");

async function getUsage() {
  const credential = new DefaultAzureCredential();
  const client = new CognitiveServicesManagementClient(
    credential,
    process.env.AZURE_SUBSCRIPTION_ID
  );

  // List usage metrics for the account; the optional filter narrows the window.
  const usage = await client.accounts.listUsages("myResourceGroup", "myOpenAIAccount", {
    filter: "properties/usageStart eq '2024-01-01'",
  });
  (usage.value ?? []).forEach((u) =>
    console.log(`Metric: ${u.name.value}, Current: ${u.currentValue}, Limit: ${u.limit}`)
  );
}

getUsage().catch(console.error);
```
Rate Limits
| Deployment | Requests/Minute | Tokens/Minute |
|---|---|---|
| gpt-35-turbo | 60 | 900,000 |
| gpt-4 | 20 | 300,000 |
If you exceed either limit, the service returns a 429 Too Many Requests response. Implement exponential backoff with jitter to retry.
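A minimal retry loop with exponential backoff and jitter might look like the following sketch. Here `call_model` stands in for whatever function issues your API request, and `RateLimitError` for the 429 condition your client surfaces; both names are placeholders, not part of any SDK:

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder for the 429 error your client library raises."""

def call_with_backoff(call_model, max_retries=5, base_delay=1.0):
    """Retry call_model on rate-limit errors, up to max_retries attempts."""
    for attempt in range(max_retries):
        try:
            return call_model()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            # Delay grows geometrically (~1s, 2s, 4s, ...) plus proportional jitter
            # so concurrent clients don't all retry at the same instant.
            delay = base_delay * (2 ** attempt + random.random())
            time.sleep(delay)
```

Jitter matters in practice: without it, many throttled clients retry in lockstep and hit the limit again together.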
Best Practices to Optimize Usage
- Trim prompts and use system messages wisely.
- Set `max_tokens` to the smallest value that satisfies your output requirements.
- Cache frequent responses when possible.
- Enable logprobs only when needed for analysis.