Azure AI Services – OpenAI

Understanding Usage and Limits

This guide explains how to monitor, control, and optimize usage of Azure OpenAI Service.

Key Concepts

Monitoring Usage

Use the Azure portal or the REST API to retrieve usage metrics.

Azure Portal

Navigate to Azure OpenAI → Deployments → Usage to view daily token consumption, request counts, and throttling events.

REST API

Call the /usage endpoint to retrieve JSON metrics:

GET https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{rg}/providers/Microsoft.CognitiveServices/accounts/{accountName}/usage?api-version=2023-08-01
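As a rough sketch, the endpoint above can be called directly with the azure-identity and requests packages (both must be installed; the subscription, resource group, and account names below are placeholders):

```python
def usage_url(subscription_id, resource_group, account_name,
              api_version="2023-08-01"):
    # Build the ARM usage URL for a Cognitive Services account.
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.CognitiveServices"
        f"/accounts/{account_name}/usage"
        f"?api-version={api_version}"
    )

def fetch_usage(subscription_id, resource_group, account_name):
    # Imported here so the URL helper above works without these packages.
    import requests
    from azure.identity import DefaultAzureCredential

    # ARM requests are authorized with a bearer token for the management scope.
    token = DefaultAzureCredential().get_token("https://management.azure.com/.default")
    resp = requests.get(
        usage_url(subscription_id, resource_group, account_name),
        headers={"Authorization": f"Bearer {token.token}"},
    )
    resp.raise_for_status()
    return resp.json()
```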

Sample Code: Retrieve Token Usage (Python)

import os

from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient

credential = DefaultAzureCredential()
client = CognitiveServicesManagementClient(credential, os.getenv("AZURE_SUBSCRIPTION_ID"))

usage = client.accounts.list_usages(
    resource_group_name="myResourceGroup",
    account_name="myOpenAIAccount",
    filter="properties/usageStart eq '2024-01-01'",
)

for u in usage.value:
    print(f"Metric: {u.name.value}, Current: {u.current_value}, Limit: {u.limit}")

Sample Code: Retrieve Token Usage (JavaScript)

const { DefaultAzureCredential } = require("@azure/identity");
const { CognitiveServicesManagementClient } = require("@azure/arm-cognitiveservices");

async function getUsage() {
  const credential = new DefaultAzureCredential();
  const client = new CognitiveServicesManagementClient(credential, process.env.AZURE_SUBSCRIPTION_ID);
  const usage = await client.accounts.listUsages("myResourceGroup", "myOpenAIAccount", {
    filter: "properties/usageStart eq '2024-01-01'",
  });
  for (const u of usage.value) {
    console.log(`Metric: ${u.name.value}, Current: ${u.currentValue}, Limit: ${u.limit}`);
  }
}

getUsage();

Rate Limits

Deployment       Requests/Minute    Tokens/Minute
gpt-35-turbo     60                 900,000
gpt-4            20                 300,000

If you exceed limits, you’ll receive a 429 Too Many Requests response. Implement exponential back‑off to retry.
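A minimal sketch of an exponential back-off retry loop; `RateLimitError` is a placeholder for whatever exception your client library raises on HTTP 429:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the HTTP 429 error raised by your client library."""

def with_retries(call, max_attempts=5, base_delay=1.0):
    """Retry a zero-argument callable with exponential back-off plus jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # Give up after the final attempt.
            # Sleep base_delay * 1, 2, 4, ... plus a little random jitter
            # so concurrent clients do not retry in lockstep.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Honor a `Retry-After` header when the response includes one, rather than relying on the computed delay alone.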

Best Practices to Optimize Usage

  1. Trim prompts and use system messages wisely.
  2. Set max_tokens to the smallest value that satisfies your output requirements.
  3. Cache frequent responses when possible.
  4. Enable logprobs only when needed for analysis.
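Practice 3 can be sketched with functools.lru_cache; `_complete` below is a hypothetical stub standing in for your real chat-completion call, and caching is only safe for deterministic settings (e.g. temperature=0) where identical prompts yield identical answers:

```python
from functools import lru_cache

def _complete(prompt: str, max_tokens: int = 64) -> str:
    # Placeholder: call your deployed Azure OpenAI model here.
    return ""

@lru_cache(maxsize=1024)
def cached_complete(prompt: str, max_tokens: int = 64) -> str:
    # Identical (prompt, max_tokens) pairs are served from memory,
    # spending no requests or tokens against the deployment's quota.
    return _complete(prompt, max_tokens)
```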
