Troubleshooting Azure Cosmos DB

This guide provides common troubleshooting steps for issues you might encounter with Azure Cosmos DB.

Common Issues and Solutions

Problem: Requests are failing with HTTP status code 429 (Too Many Requests) or experiencing high latency.

Cause: Your requests are consuming more Request Units per second (RU/s) than the throughput provisioned for your container or database.

Solution:
- Scale Up Throughput: Increase the throughput (RU/s) provisioned for your container or database in the Azure portal, using either manual scaling or autoscale.
- Optimize Queries: Ensure your queries are efficient. Avoid `SELECT *` and select only the necessary fields. Use parameterized queries and appropriate indexing.
- Batching: For high-volume operations, consider the Cosmos DB SDK's bulk support, or Transactional Batch, which groups several operations on the same partition key value into a single request.
- Partition Key Strategy: Ensure your partition key is well-chosen and distributes requests evenly across partitions. Hot partitions can lead to 429 errors.

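Tip: In the Azure portal, compare the "Provisioned Throughput" and "Total Request Units" metrics to understand your throughput utilization, and check "Normalized RU Consumption" to see how close the busiest partition is to its limit.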
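The Azure Cosmos DB SDKs already retry throttled requests a limited number of times, and a 429 response carries an `x-ms-retry-after-ms` hint. When those built-in retries are exhausted, an application-level wrapper can back off further. The following is a minimal sketch of that idea, not SDK code: `ThrottledError` and `call_with_backoff` are hypothetical names introduced for illustration, and a real application would catch its SDK's own exception type instead.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for an SDK error carrying HTTP status 429."""

    def __init__(self, retry_after_ms=None):
        super().__init__("429 Too Many Requests")
        self.retry_after_ms = retry_after_ms


def call_with_backoff(operation, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep):
    """Run operation(), retrying on ThrottledError with a capped delay."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ThrottledError as err:
            if attempt == max_attempts:
                raise  # out of attempts: surface the throttling error
            if err.retry_after_ms is not None:
                # Prefer the server's hint when one is available.
                delay = err.retry_after_ms / 1000.0
            else:
                # Otherwise use exponential backoff with full jitter.
                ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
                delay = random.uniform(0, ceiling)
            sleep(min(delay, max_delay))
```

The `sleep` parameter is injectable so the wrapper can be exercised in tests without waiting. Retrying only hides short bursts; sustained 429s still call for more throughput or a better partition key.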

Problem: Applications are unable to connect to the Cosmos DB endpoint.

Cause: Network configuration, firewall rules, incorrect connection string, or SDK version issues.

Solution:
- Verify Connection String: Double-check your Cosmos DB account's connection string, including the endpoint and primary/secondary key.
- Network Access: Ensure your application's network can reach the Cosmos DB endpoint. Check firewall rules, Network Security Groups (NSGs), and Private Endpoint configurations if applicable.
- SDK Version: Use the latest stable version of the Azure Cosmos DB SDK for your programming language. Older versions might have compatibility issues.
- DNS Resolution: Ensure your application can resolve the Cosmos DB endpoint's DNS name.

Note: If using Private Endpoints, verify the DNS resolution and NSG rules applied to the Private Endpoint.
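The connection string, DNS, and network checks above can be scripted from the machine that hosts the application. The sketch below assumes the usual `AccountEndpoint=...;AccountKey=...;` connection string layout and uses only the Python standard library; the function names are illustrative and not part of any SDK.

```python
import socket
from urllib.parse import urlparse


def parse_connection_string(conn_str):
    """Split 'Key=Value;Key=Value;' pairs into a dict."""
    parts = {}
    for segment in conn_str.split(";"):
        if not segment.strip():
            continue
        # Split on the first '=' only: keys are base64 and may end in '='.
        key, sep, value = segment.partition("=")
        if not sep:
            raise ValueError(f"malformed segment: {segment!r}")
        parts[key.strip()] = value.strip()
    return parts


def endpoint_host_port(endpoint):
    """Return (host, port) for an HTTPS endpoint URI, defaulting to 443."""
    parsed = urlparse(endpoint)
    if parsed.scheme != "https" or not parsed.hostname:
        raise ValueError(f"unexpected endpoint: {endpoint!r}")
    return parsed.hostname, parsed.port or 443


def check_reachable(host, port, timeout=3.0):
    """Resolve the host, then try a TCP connection; report each step."""
    result = {"resolved": [], "tcp_ok": False, "error": None}
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        result["resolved"] = sorted({info[4][0] for info in infos})
        with socket.create_connection((host, port), timeout=timeout):
            result["tcp_ok"] = True
    except OSError as err:  # covers DNS failures, refusals, and timeouts
        result["error"] = str(err)
    return result
```

An empty `resolved` list points at DNS; resolved addresses with `tcp_ok` still false point at a firewall, NSG, or routing problem. With a Private Endpoint, the resolved address would be expected to be a private IP. This only exercises the HTTPS port; SDKs configured for direct connectivity may need additional ports open.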

Problem: Recently written data is not immediately available when performing a read operation.

Cause: This is often due to the consistency level configured for your Cosmos DB account.

Solution:
- Understand Consistency Levels: Cosmos DB offers various consistency levels (Strong, Bounded Staleness, Session, Consistent Prefix, Eventual). Session consistency, the default, guarantees that reads within a single client session will see their own writes. Reads from other sessions might experience a slight delay.
- Adjust Consistency Level (if appropriate): If strong consistency is required, set it as the default on the account; a client SDK can relax the account's default level for its own requests but cannot strengthen it. Be aware that stronger consistency levels can increase latency and reduce throughput.
- Retry Logic: Implement robust retry logic in your application, especially for transient errors or when dealing with less strict consistency levels.
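To see why a read from another session can miss a recent write, a toy model helps. The sketch below is not how Cosmos DB is implemented: `Replica` and `ToyStore` are invented for illustration, with one primary, one lagging secondary, and a write sequence number standing in for a session token.

```python
class Replica:
    """Toy replica: applies writes in order and tracks the last applied LSN."""

    def __init__(self):
        self.lsn = 0
        self.items = {}

    def apply(self, lsn, key, value):
        self.items[key] = value
        self.lsn = lsn


class ToyStore:
    """Toy model: a primary plus a secondary that lags until replicate()."""

    def __init__(self):
        self.primary = Replica()
        self.secondary = Replica()
        self.pending = []  # writes not yet applied to the secondary

    def write(self, key, value):
        lsn = self.primary.lsn + 1
        self.primary.apply(lsn, key, value)
        self.pending.append((lsn, key, value))
        return lsn  # plays the role of a session token for this write

    def replicate(self):
        for lsn, key, value in self.pending:
            self.secondary.apply(lsn, key, value)
        self.pending.clear()

    def read(self, key, session_token=0):
        # Serve from the secondary only if it has caught up to the token;
        # otherwise fall back to the primary, which always has the write.
        if self.secondary.lsn >= session_token:
            return self.secondary.items.get(key)
        return self.primary.items.get(key)


store = ToyStore()
token = store.write("order-1", "created")
stale = store.read("order-1")                       # another session, no token
fresh = store.read("order-1", session_token=token)  # the writer's own session
```

In this model the read without a token is served by the lagging secondary and misses the write, while the read that carries the token sees it. A second client that needs read-your-writes behaviour therefore has to either share the writer's token or tolerate a short delay and retry.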

Problem: Query performance degrades over time, or index transformations are slow.

Cause: Inefficient indexing policies or large amounts of data being indexed.

Solution:
- Optimize Indexing Policy: Include only the paths that are necessary for your queries. Exclude paths that are not queried. Use range indexing for numerical and string properties and composite indexes for queries that filter on multiple fields.
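- Indexing Mode: Understand the difference between `consistent` and `lazy` indexing. With `consistent`, the index is updated synchronously as items are written. `lazy` is a legacy mode that updates the index at lower priority; it can reduce write cost, but queries can return incomplete or inconsistent results until indexing catches up, so it is generally not recommended.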
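- Rebuild the Index: Updating a container's indexing policy (for example, in the Azure portal) triggers an index transformation that rebuilds the index in the background. The container stays online, but the transformation consumes RUs and can take a long time on large containers, so schedule it outside peak load.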

Important: A poorly optimized indexing policy can significantly impact both RU consumption and query latency.
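As an illustration of an opt-in policy, the sketch below builds an indexing policy document that excludes every path by default, includes only the properties that are queried, and adds one composite index. The property names `category` and `price` are hypothetical, and `build_indexing_policy` is a helper written for this example rather than an SDK function.

```python
import json


def build_indexing_policy(queried_paths, composite=None):
    """Build an opt-in policy: exclude everything, include only queried paths."""
    policy = {
        "indexingMode": "consistent",
        "automatic": True,
        # "/?" targets the scalar value stored at that property.
        "includedPaths": [{"path": f"/{name}/?"} for name in queried_paths],
        "excludedPaths": [{"path": "/*"}],
    }
    if composite:
        # One composite index: an ordered list of (property, order) pairs.
        policy["compositeIndexes"] = [
            [{"path": f"/{name}", "order": order} for name, order in composite]
        ]
    return policy


policy = build_indexing_policy(
    queried_paths=["category", "price"],  # hypothetical property names
    composite=[("category", "ascending"), ("price", "descending")],
)
print(json.dumps(policy, indent=2))
```

The printed JSON can be compared with the policy shown for the container in the Azure portal. The composite index here would suit a query that orders by `category` ascending and `price` descending; the order of the paths matters.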

Problem: Specific partitions consistently consume a disproportionate amount of RUs, leading to 429 errors.

Cause: Uneven distribution of requests or data across logical partitions due to an inappropriate partition key choice.

Solution:
- Re-evaluate Partition Key: Choose a partition key with high cardinality that evenly distributes your workload. For example, use a User ID for a social app or an Order ID for an e-commerce app.
- Higher Partition Count: Each physical partition serves at most 10,000 RU/s, so provisioning more than 10,000 RU/s on a container spreads it across additional physical partitions. Provisioned throughput is divided evenly among physical partitions, so this helps only when the partition key already distributes requests well; it does not relieve a single hot logical partition.
- Data Migration: In extreme cases, you might need to migrate data to a new container with a better partition key strategy.
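Before changing the partition key, it helps to quantify the skew. The sketch below assumes you have already exported per-request pairs of partition key value and RU charge (for example, from diagnostic logs); the input shape, the function names, and the 50% threshold are assumptions made for illustration.

```python
from collections import Counter


def ru_share_by_key(requests):
    """Sum RU charges per partition key value and return each key's share."""
    totals = Counter()
    for key, charge in requests:
        totals[key] += charge
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {key: total / grand_total for key, total in totals.items()}


def find_hot_keys(requests, threshold=0.5):
    """Return keys whose share of total RUs is at or above the threshold."""
    shares = ru_share_by_key(requests)
    return sorted(key for key, share in shares.items() if share >= threshold)


# Hypothetical sample: (partition key value, RU charge) per request.
sample = [
    ("tenant-a", 10.0),
    ("tenant-a", 30.0),
    ("tenant-b", 5.0),
    ("tenant-c", 5.0),
]
hot = find_hot_keys(sample)
```

In this sample, `tenant-a` accounts for 40 of the 50 RUs consumed, so it is the only key reported as hot. A key that dominates like this suggests the partition key has too few distinct values or that traffic is concentrated on one of them.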

Azure Cosmos DB provides several tools and metrics to help diagnose issues:

- Azure Portal Metrics: Monitor key metrics like Request Units consumed, latency, throttled requests (429s), document count, and storage.
- Diagnostic Logs: Enable diagnostic logs for your Cosmos DB account to capture detailed operational information.
- Cosmos DB Data Explorer: Use the Data Explorer in the Azure portal to run queries, view data, and monitor performance for specific operations.
- Azure Monitor Logs: Integrate Cosmos DB logs with Azure Monitor Logs for advanced querying and alerting.
- SDK Logging: Configure verbose logging for your Cosmos DB SDK to capture client-side details.
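For client-side logging, the details depend on the SDK. As one example, the Azure SDKs for Python log through the standard `logging` module under the `azure` logger namespace, so a sketch like the following raises that logger to DEBUG. The helper name `enable_sdk_debug_logging` is made up for this example, and the SDK may also need request logging switched on separately (the Python SDK documents a `logging_enable` keyword for that; check your SDK version).

```python
import logging
import sys


def enable_sdk_debug_logging(logger_name="azure", stream=sys.stderr):
    """Attach a DEBUG-level stream handler to the given logger and return it."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

Call it once at startup, before creating the client. DEBUG output can be verbose and may include request details, so it is best enabled only while diagnosing a problem.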

If you are still facing issues, consider the following resources: