Understanding Usage and Limits
This guide explains how to monitor, control, and optimize usage of Azure OpenAI Service.
Key Concepts
- Tokens – The unit of consumption. Both input (prompt) and output (completion) tokens count toward usage.
- Rate Limits – The maximum requests per minute and tokens per minute allowed for each deployment.
- Quota – The tokens-per-minute (TPM) capacity assigned to a subscription per model, per region, which you allocate across your deployments.
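Because both input and output tokens are billable, it helps to budget them before sending a request. The sketch below uses the common rough heuristic of about four characters per token; this is an approximation only, and accurate counts require the model's actual tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters per token heuristic.

    An approximation for budgeting only; real counts come from the
    model's tokenizer.
    """
    return max(1, len(text) // 4)

prompt = "Summarize the quarterly sales report in three bullet points."
completion = "Sales rose 12%. Margins held steady. APAC led growth."

# Both input and output tokens count toward usage.
total = estimate_tokens(prompt) + estimate_tokens(completion)
print(f"Estimated billable tokens: {total}")
```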
Monitoring Usage
Use the Azure portal or the REST API to retrieve usage metrics.
Azure Portal

Navigate to Azure OpenAI → Deployments → Usage to view daily token consumption, request counts, and throttling events.

REST API

Call the account's usages endpoint to retrieve JSON metrics:

```
GET https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{rg}/providers/Microsoft.CognitiveServices/accounts/{accountName}/usages?api-version=2023-08-01
```
Sample Code: Retrieve Token Usage (Python)
```python
import os

from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient

# Authenticates via environment variables, managed identity, or an Azure CLI login.
credential = DefaultAzureCredential()
client = CognitiveServicesManagementClient(credential, os.environ["AZURE_SUBSCRIPTION_ID"])

# List usage metrics for the account; the optional filter narrows the window.
usages = client.accounts.list_usages(
    resource_group_name="myResourceGroup",
    account_name="myOpenAIAccount",
    filter="properties/usageStart eq '2024-01-01'",
)
for u in usages.value:
    print(f"Metric: {u.name.value}, Current: {u.current_value}, Limit: {u.limit}")
```
Sample Code: Retrieve Token Usage (JavaScript)
```javascript
const { DefaultAzureCredential } = require("@azure/identity");
const { CognitiveServicesManagementClient } = require("@azure/arm-cognitiveservices");

async function getUsage() {
  const credential = new DefaultAzureCredential();
  const client = new CognitiveServicesManagementClient(
    credential,
    process.env.AZURE_SUBSCRIPTION_ID
  );

  // List usage metrics for the account; the optional filter narrows the window.
  const usage = await client.accounts.listUsages("myResourceGroup", "myOpenAIAccount", {
    filter: "properties/usageStart eq '2024-01-01'",
  });
  (usage.value ?? []).forEach((u) =>
    console.log(`Metric: ${u.name.value}, Current: ${u.currentValue}, Limit: ${u.limit}`)
  );
}

getUsage().catch(console.error);
```
Rate Limits
| Deployment | Requests/Minute | Tokens/Minute |
|---|---|---|
| gpt-35-turbo | 60 | 900,000 |
| gpt-4 | 20 | 300,000 |
If you exceed either limit, the service returns a 429 Too Many Requests response. Implement exponential backoff with jitter to retry.
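A minimal retry loop with exponential backoff and jitter might look like the following sketch. Here `call_model` stands in for whatever function issues your API request, and `RateLimitError` for the 429 condition your client surfaces; both names are placeholders, not part of any SDK:

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder for the 429 error your client library raises."""

def call_with_backoff(call_model, max_retries=5, base_delay=1.0):
    """Retry call_model on rate-limit errors, up to max_retries attempts."""
    for attempt in range(max_retries):
        try:
            return call_model()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            # Delay grows geometrically (~1s, 2s, 4s, ...) plus proportional jitter
            # so concurrent clients don't all retry at the same instant.
            delay = base_delay * (2 ** attempt + random.random())
            time.sleep(delay)
```

Jitter matters in practice: without it, many throttled clients retry in lockstep and hit the limit again together.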
Best Practices to Optimize Usage
- Trim prompts and use system messages wisely.
- Set `max_tokens` to the smallest value that satisfies your output requirements.
- Cache frequent responses when possible.
- Enable logprobs only when needed for analysis.