Connectivity Issues
This section addresses common problems encountered when connecting to your Azure Cosmos DB account.
Connection String Problems
Problem: Unable to connect to the database, often reporting "Authentication failed" or "Network path not found."
Solution:
- Verify that your connection string is correctly copied from the Azure portal.
- Ensure the endpoint and primary/secondary keys are accurate.
- Check if your application's network allows outbound connections to the Cosmos DB endpoint.
- If using private endpoints, confirm network configuration and DNS resolution.
Example Connection String:
AccountEndpoint=https://your-cosmosdb-account.documents.azure.com:443/;AccountKey=your_account_key_here=;
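Before investigating the network, it can help to rule out a mis-copied connection string. The following is a minimal sketch of a local sanity check, assuming the `AccountEndpoint=...;AccountKey=...;` layout shown above. The function names `parse_connection_string` and `validate_connection_string` are illustrative and not part of any SDK.

```python
def parse_connection_string(conn_str: str) -> dict:
    """Split a 'Name=Value;Name=Value;' connection string into a dict."""
    parts = {}
    for segment in conn_str.split(";"):
        segment = segment.strip()
        if not segment:
            continue  # skip the empty piece after the trailing ';'
        # Split on the first '=' only: base64 account keys end in '=' padding.
        name, sep, value = segment.partition("=")
        if not sep or not value:
            # Report only the name so that key material never reaches logs.
            raise ValueError(f"Malformed segment: {name!r}")
        parts[name] = value
    return parts


def validate_connection_string(conn_str: str) -> list:
    """Return a list of problems found; an empty list means no obvious issue."""
    try:
        parts = parse_connection_string(conn_str)
    except ValueError as exc:
        return [str(exc)]
    problems = []
    if not parts.get("AccountEndpoint", "").startswith("https://"):
        problems.append("AccountEndpoint is missing or does not use https://")
    if not parts.get("AccountKey"):
        problems.append("AccountKey is missing")
    return problems
```

This check only catches structural mistakes such as a truncated key or a missing endpoint. It cannot tell whether the key is actually valid for the account.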
Firewall and VNet Configuration
Problem: Connections are intermittently failing or blocked, especially when Cosmos DB is configured with firewall rules or VNet integration.
Solution:
- If using firewall rules, ensure that your application's IP address or subnet is added to the allowed list in the Cosmos DB account settings.
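- If using VNet service endpoints, verify that the Microsoft.AzureCosmosDB service endpoint is enabled on the subnet and that the subnet is listed among the account's allowed virtual networks. If using private endpoints, verify the VNet configuration and routing rules.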
- For private endpoints, confirm that DNS is correctly configured to resolve the Cosmos DB endpoint to the private IP address.
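The DNS check for private endpoints can be sketched with the standard library alone. This example assumes it runs on a machine inside the VNet and that the hostname you pass is your own account endpoint. The helper names `resolve_addresses` and `all_private` are illustrative.

```python
import ipaddress
import socket


def resolve_addresses(hostname: str, port: int = 443) -> list:
    """Return the distinct IP addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def all_private(addresses: list) -> bool:
    """True if the list is non-empty and every address is non-public.

    Note: ipaddress treats loopback and link-local ranges as private too.
    """
    return bool(addresses) and all(
        ipaddress.ip_address(addr).is_private for addr in addresses
    )
```

Calling `all_private(resolve_addresses("your-cosmosdb-account.documents.azure.com"))` from inside the VNet should return `True` when the private endpoint's DNS is in effect. A public address suggests that the private DNS zone is not linked or that a custom DNS server is not forwarding correctly.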
Performance Degradation
Troubleshooting slow query execution, high RU/s consumption, and latency issues.
High Request Units (RU/s) Consumption
Problem: Application experiences throttling (HTTP 429) or slow response times due to exceeding provisioned throughput.
Solution:
- Analyze Query Performance: Use the Azure portal's Query Metrics to identify expensive queries. Optimize queries by filtering early, using appropriate indexes, and avoiding cross-partition queries where possible.
- Partition Key Design: Ensure your partition key distributes requests evenly and avoids "hot partitions." A good partition key has high cardinality and is frequently used in query filters.
- Indexing Policy: Review your indexing policy. Indexing all paths (`/*`) increases index storage and the RU cost of every write. Consider including only the paths your queries filter or sort on.
- Autoscale Provisioning: If your workload is variable, consider using autoscale throughput to automatically scale RU/s based on demand.
- Batching: For operations involving many small documents, consider batching requests to reduce the number of round trips. Batching mainly cuts network and per-request overhead; each operation in the batch still consumes RUs.
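To make the "hot partition" check above concrete, here is a rough sketch that summarizes how evenly a sample of requests spreads across partition key values. It assumes you can export the partition key value of each sampled request (for example, from application logs). The function name `partition_skew` and the output fields are illustrative.

```python
from collections import Counter


def partition_skew(key_values: list, top_n: int = 3) -> dict:
    """Summarize how evenly a sample of requests spreads across key values."""
    if not key_values:
        raise ValueError("key_values must not be empty")
    counts = Counter(key_values)
    hottest = counts.most_common(top_n)
    return {
        # Number of distinct partition key values seen (cardinality).
        "distinct_keys": len(counts),
        # Share of all sampled requests that hit the single busiest key value.
        "hottest_share": hottest[0][1] / len(key_values),
        # The busiest key values with their request counts.
        "top": hottest,
    }
```

A high `hottest_share` combined with a low `distinct_keys` count suggests that a few key values attract most of the traffic. What counts as "too high" depends on the workload, so treat the numbers as a prompt for investigation rather than a verdict.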
High Latency
Problem: Increased latency for read and write operations.
Solution:
- Geographic Distribution: If your application has users across multiple regions, consider enabling multi-region writes and configuring read regions closer to your users.
- Indexing Latency: While indexing improves query speed, it adds latency to writes. Exclude paths you never query so that each write updates fewer index entries.
- SDK Version: Use the latest version of the Azure Cosmos DB SDK, as newer releases often include performance improvements and bug fixes.
- Connection Pooling: Reuse a single client instance for the lifetime of the application so that connections stay open and are pooled by the SDK, rather than creating a new client for each request.
- Resource Constraints: Check if your application's host (VM, container) is experiencing CPU, memory, or network bottlenecks.
Common Error Codes
Understanding and resolving frequently encountered HTTP status codes.
HTTP 400 Bad Request
Problem: Invalid request syntax, missing required headers, or malformed JSON body.
Solution: Validate your request payload and headers. Ensure you are sending valid JSON and all required fields are present. Check SDK documentation for correct request formatting.
HTTP 401 Unauthorized
Problem: Invalid or missing authentication credentials (master key, resource token, or Azure AD token).
Solution: Double-check your account key or resource token. Ensure the token has the necessary permissions and hasn't expired. If using Azure AD, verify token validity and scopes.
HTTP 403 Forbidden
Problem: The authenticated principal does not have permission to perform the requested operation.
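Solution: Verify the permissions assigned to your user or application. Ensure they have the necessary roles (e.g., "Cosmos DB Account Reader Role" or "DocumentDB Account Contributor" for account management, "Cosmos DB Built-in Data Reader" or "Cosmos DB Built-in Data Contributor" for data access) or resource token permissions.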
HTTP 404 Not Found
Problem: The requested resource (e.g., database, container, document) does not exist.
Solution: Confirm the names of your databases and containers are spelled correctly and exist in your account. Ensure you are targeting the correct resource ID.
HTTP 429 Too Many Requests
Problem: Exceeded provisioned or available RU/s for the request. Throttling occurred.
Solution: Implement retry logic with exponential backoff in your application. Increase provisioned RU/s for the container or database, or consider autoscale. Optimize queries and partition key design to reduce RU/s consumption.
HTTP 503 Service Unavailable
Problem: Temporary service unavailability or overload.
Solution: Implement retry logic with exponential backoff. Monitor Cosmos DB health in the Azure portal. If persistent, contact Azure support.
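The retry advice for HTTP 429 and HTTP 503 can be sketched as follows. This is a generic illustration, not the SDK's own mechanism: `TransientError` is a hypothetical stand-in for whatever exception your client raises, and the delay values are assumptions to tune for your workload. The Cosmos DB SDKs ship with their own retry handling for throttled requests, so check what your SDK already does before layering another retry loop on top.

```python
import random
import time
from typing import Optional


class TransientError(Exception):
    """Hypothetical stand-in for an SDK error carrying an HTTP status code."""

    def __init__(self, status_code: int, retry_after_s: Optional[float] = None):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code
        self.retry_after_s = retry_after_s  # server-suggested wait, if any


RETRYABLE = {429, 503}


def call_with_backoff(operation, max_attempts=5, base_delay_s=0.5,
                      max_delay_s=30.0, sleep=time.sleep):
    """Run operation(), retrying 429/503 with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError as err:
            # Give up on non-retryable codes or when attempts are exhausted.
            if err.status_code not in RETRYABLE or attempt == max_attempts:
                raise
            # Double the delay on each attempt, capped at max_delay_s.
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            # Wait at least as long as the server asked for, if it said so.
            if err.retry_after_s is not None:
                delay = max(delay, err.retry_after_s)
            # Add up to 10% jitter so that clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter is injectable so that the function can be exercised in tests without real waiting.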
Authentication & Authorization
Managing access and permissions for your Cosmos DB resources.
Master Key Rotation
Problem: Need to rotate primary or secondary master keys for security compliance.
Solution:
- Go to your Cosmos DB account in the Azure portal.
- Navigate to "Keys" under "Settings."
- Click "Regenerate primary key" or "Regenerate secondary key."
- Update your application's connection string with the new key. To avoid downtime, first switch the application to the secondary key, then regenerate the primary key, switch the application to the new primary key, and finally regenerate the secondary key.
Resource Tokens
Problem: Securely granting limited access to specific resources for client-side applications without exposing master keys.
Solution:
- Use the Cosmos DB server-side SDK in your backend to generate resource tokens for specific users or entities.
- These tokens grant read/write or read-only access to a specific container or even documents.
- Manage token generation and expiry carefully in your backend application.
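One way to manage token expiry in the backend is to cache each issued token and request a fresh one shortly before it lapses. The sketch below assumes a hypothetical `fetch_token(key)` callable that wraps your SDK call for creating a resource token. The class name `TokenCache`, the lifetime, and the refresh margin are all illustrative and should match the validity period your backend actually configures.

```python
import time


class TokenCache:
    """Caches one token per key and refetches it shortly before it expires."""

    def __init__(self, fetch_token, ttl_s=3600.0, refresh_margin_s=300.0,
                 clock=time.monotonic):
        self._fetch_token = fetch_token    # hypothetical: key -> token string
        self._ttl_s = ttl_s                # assumed token lifetime in seconds
        self._margin_s = refresh_margin_s  # refresh this long before expiry
        self._clock = clock
        self._entries = {}                 # key -> (token, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        # Reuse the cached token while it is safely inside its lifetime.
        if entry is not None and now < entry[1] - self._margin_s:
            return entry[0]
        token = self._fetch_token(key)
        self._entries[key] = (token, now + self._ttl_s)
        return token
```

Refreshing with a margin avoids handing a client a token that expires moments after it is received.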
Replication & Consistency
Understanding and troubleshooting data replication and consistency models.
Failover Operations
Problem: Unexpected failovers occur, or you need to ensure data availability during regional outages.
Solution:
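- Multi-Region Writes: Enable multi-region writes so that the remaining regions keep accepting writes if one region becomes unavailable. For accounts with a single write region, enable service-managed failover instead.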
- Read Regions: Configure your application to read from the nearest available region.
- Application Logic: Design your application to handle transient network errors and potentially retry operations against different regions if primary region access fails.
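The application-level fallback described above can be sketched as a loop over regions in order of preference. This is an illustration under stated assumptions: `operation(region)` and `is_transient(err)` are hypothetical callables you would supply, and the region names in the usage note are examples. The SDKs also expose a preferred-regions setting that covers much of this behavior, so a hand-written loop is mostly useful where that setting does not apply.

```python
def first_successful(regions, operation, is_transient):
    """Try operation(region) on each region in preference order.

    Returns (region, result) for the first region that succeeds.
    Non-transient errors are re-raised immediately.
    """
    last_error = None
    for region in regions:
        try:
            return region, operation(region)
        except Exception as err:
            if not is_transient(err):
                raise  # not a connectivity problem, so do not mask it
            last_error = err  # remember the failure and try the next region
    raise RuntimeError("all regions failed") from last_error
```

For example, `first_successful(["West US", "East US"], read_item, lambda e: isinstance(e, ConnectionError))` would try the hypothetical `read_item` against West US first and fall back to East US only on a connection error.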
Consistency Level Issues
Problem: Seeing stale data or experiencing unexpected data consistency behavior.
Solution:
- Understand Models: Be aware of the different consistency levels (Strong, Bounded Staleness, Session, Consistent Prefix, Eventual) and their trade-offs between consistency, availability, and latency.
- Session Consistency: This is the default and generally recommended. It ensures that within a single client session, reads will always see writes from that session.
- Testing: If you suspect consistency issues, explicitly set the consistency level in your SDK configuration and test thoroughly.