Azure Cosmos DB SDK Connection Errors

Understanding and Resolving SDK Connection Issues

Connection errors with the Azure Cosmos DB SDK can be frustrating. This guide provides a systematic approach to identify, diagnose, and resolve common connection problems, ensuring your applications can reliably interact with your Cosmos DB data.

Tip: Always ensure your SDK version is up-to-date. Newer versions often contain bug fixes and performance improvements for connection handling.

Common Causes of Connection Errors

  • Network connectivity issues (firewall, DNS, proxy).
  • Incorrect connection string or endpoint configuration.
  • Resource limitations (e.g., connection or port exhaustion, throttling).
  • Improper SDK client instantiation and management (e.g., creating a new client for every request).
  • Azure service health or regional outages.

Step-by-Step Troubleshooting Guide

Verify Network Connectivity

Ensure your application environment can reach the Cosmos DB endpoint.

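  • Firewall Rules: Check whether any firewall (local, network, or Azure Network Security Group) blocks outbound traffic on port 443 to your Cosmos DB endpoint. Direct mode additionally needs outbound TCP access to ports in the 10000 to 20000 range.
  • DNS Resolution: Verify that your application can resolve the Cosmos DB endpoint name to an IP address, for example with nslookup. A failed ping alone is not conclusive, because ICMP is often blocked even when HTTPS traffic works.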
  • Proxy Settings: If your environment uses a proxy, ensure it's configured correctly and not interfering with the connection.

Diagnostic Command (Example using `curl`):

curl -v https://your-cosmosdb-account.documents.azure.com:443

A successful response will show connection details and HTTP status codes (e.g., 401 Unauthorized, which is expected if not authenticated, but indicates a connection was made).
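The same check can be scripted from inside the application's own runtime, which helps when the application sees a different DNS or firewall configuration than your shell. The following is a minimal Java sketch using only the standard library. The class name ConnectivityCheck and the 5-second timeout are illustrative choices, and the endpoint is the placeholder used above. It tests only DNS resolution and a plain TCP connect, not TLS, proxies, or authentication.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.URI;
import java.net.UnknownHostException;

final class ConnectivityCheck {
    /** Extracts the host name from an endpoint URL such as https://host:443/. */
    static String hostOf(String endpoint) {
        return URI.create(endpoint).getHost();
    }

    /** Returns the port from the URL, falling back to 443 when none is given. */
    static int portOf(String endpoint) {
        int port = URI.create(endpoint).getPort();
        return port == -1 ? 443 : port;
    }

    /** Tries a plain TCP connect; true means the port accepted the connection. */
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder endpoint; pass your own as the first argument.
        String endpoint = args.length > 0 ? args[0]
                : "https://your-cosmosdb-account.documents.azure.com:443/";
        String host = hostOf(endpoint);
        int port = portOf(endpoint);
        try {
            InetAddress address = InetAddress.getByName(host);
            System.out.println("DNS ok: " + host + " -> " + address.getHostAddress());
        } catch (UnknownHostException e) {
            System.out.println("DNS failed for " + host);
            return;
        }
        System.out.println(canConnect(host, port, 5000)
                ? "TCP connect ok on port " + port
                : "TCP connect failed on port " + port);
    }
}
```

If DNS succeeds but the TCP connect fails, a firewall or routing rule is the more likely cause than a wrong endpoint name.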

Check the Connection String and Endpoint Configuration

A malformed or incorrect connection string is a frequent culprit.

  • Endpoint URL: Double-check the endpoint URL in your configuration. It should follow the pattern https://<your-cosmosdb-account>.documents.azure.com:443/.
  • Primary/Secondary Key: Ensure you are using a valid read-write or read-only key. Keys are case-sensitive.
  • Database Name: Although the database name is not part of the connection itself, make sure it is correct wherever your SDK configuration references it.

Example Connection String Format:

AccountEndpoint=https://your-cosmosdb-account.documents.azure.com:443/;AccountKey=your_primary_or_secondary_key=;
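A quick sanity check of the connection string before handing it to the SDK can catch malformed values early. The sketch below is an illustrative helper, not part of any Cosmos DB SDK: the class ConnectionStringParser and its methods are hypothetical names. It splits each segment on the first "=" only, because keys are base64 values that usually end in "=" themselves, and it never prints the key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ConnectionStringParser {
    /** Splits "Name=Value;Name=Value;" into a map, keeping '=' inside values. */
    static Map<String, String> parse(String connectionString) {
        Map<String, String> parts = new LinkedHashMap<>();
        for (String segment : connectionString.split(";")) {
            String trimmed = segment.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate the trailing ';'
            }
            int eq = trimmed.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Segment without a name=value pair");
            }
            parts.put(trimmed.substring(0, eq), trimmed.substring(eq + 1));
        }
        return parts;
    }

    /** Returns a list of problems; an empty list means the basic checks passed. */
    static List<String> validate(Map<String, String> parts) {
        List<String> problems = new ArrayList<>();
        String endpoint = parts.get("AccountEndpoint");
        if (endpoint == null || endpoint.isEmpty()) {
            problems.add("AccountEndpoint is missing");
        } else if (!endpoint.startsWith("https://")) {
            problems.add("AccountEndpoint should start with https://");
        }
        String key = parts.get("AccountKey");
        if (key == null || key.isEmpty()) {
            problems.add("AccountKey is missing");
        }
        return problems;
    }

    public static void main(String[] args) {
        // Placeholder value; do not hard-code real keys.
        String sample = "AccountEndpoint=https://your-cosmosdb-account.documents.azure.com:443/;"
                + "AccountKey=your_primary_or_secondary_key=;";
        List<String> problems = validate(parse(sample));
        System.out.println(problems.isEmpty() ? "Connection string looks well-formed" : problems);
    }
}
```

These checks only confirm the shape of the string. A well-formed string with a wrong or regenerated key will still fail at authentication time.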

Tip: Consider using Azure Key Vault to securely store and retrieve your Cosmos DB keys.

Review SDK Client Instantiation and Management

Proper client management is crucial for efficient connection handling.

  • Singleton Pattern: It is highly recommended to instantiate your Cosmos DB client once and reuse it throughout your application's lifecycle. Creating new clients for each request can lead to resource exhaustion and connection pool issues.
  • Dispose of Clients: When your application is shutting down, ensure you properly dispose of the Cosmos DB client instance to release its resources and connections.

Example (C# - .NET SDK):


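using Microsoft.Azure.Cosmos;

public static class CosmosClientProvider
{
    private static readonly object Sync = new object();
    private static CosmosClient cosmosClient;

    // Call at application startup; later calls return the same instance.
    // The lock prevents two threads from creating two clients at once.
    public static CosmosClient InitializeCosmosClient(string endpoint, string key)
    {
        lock (Sync)
        {
            if (cosmosClient == null)
            {
                cosmosClient = new CosmosClient(endpoint, key, new CosmosClientOptions
                {
                    ConnectionMode = ConnectionMode.Gateway, // Or ConnectionMode.Direct
                    // Other options...
                });
            }
            return cosmosClient;
        }
    }

    // Call at application shutdown to release connections.
    public static void DisposeCosmosClient()
    {
        lock (Sync)
        {
            cosmosClient?.Dispose();
            cosmosClient = null;
        }
    }
}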
                        

Example (Java SDK):


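import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;

// Initialize once and reuse for the lifetime of the application
CosmosClient client = new CosmosClientBuilder()
    .endpoint("https://your-cosmosdb-account.documents.azure.com:443/")
    .key("your-key")
    .directMode() // Or gatewayMode()
    .buildClient();

// When the application exits
client.close();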
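To make the "create once, reuse, close on shutdown" rule harder to break in Java, the client can sit behind a small holder. The sketch below is generic and uses only the standard library; ClientHolder is a hypothetical name, and the demo uses a fake closeable resource instead of a real Cosmos DB client so that it runs without the SDK on the classpath.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class ClientHolder<T extends AutoCloseable> implements AutoCloseable {
    private final Supplier<T> factory;
    private T client; // guarded by "this"

    ClientHolder(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Creates the client on first use, then always returns the same instance. */
    synchronized T get() {
        if (client == null) {
            client = factory.get();
        }
        return client;
    }

    /** Closes the client if one was created; safe to call more than once. */
    @Override
    public synchronized void close() throws Exception {
        if (client != null) {
            client.close();
            client = null;
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger created = new AtomicInteger();
        // Fake resource standing in for a real SDK client.
        Supplier<AutoCloseable> factory = () -> {
            created.incrementAndGet();
            AutoCloseable fake = () -> System.out.println("client closed");
            return fake;
        };
        try (ClientHolder<AutoCloseable> holder = new ClientHolder<AutoCloseable>(factory)) {
            holder.get();
            holder.get(); // reuses the instance created above
            System.out.println("clients created: " + created.get());
        }
    }
}
```

With the Java SDK, the factory would presumably be a lambda that calls the CosmosClientBuilder chain shown above, since CosmosClient is closeable. Treat that wiring as an assumption to verify against your SDK version.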
                        

Check Azure Service Health

Sometimes, the issue might be with Azure itself.

  • Azure Status Page: Visit the Azure Status page to check for any ongoing incidents or outages in your region.
  • Resource Health: Within the Azure portal for your Cosmos DB account, check the "Resource health" blade for any specific issues reported for your database.

Analyze Logs and Exceptions

Detailed logging and error messages are invaluable.

  • Enable Logging: Configure your application to log detailed information from the Cosmos DB SDK. This often involves setting up logging providers (e.g., Serilog, NLog, Log4j).
  • Exception Details: Pay close attention to the full exception message, inner exceptions, stack traces, and any specific error codes returned by the SDK. These often point directly to the root cause.

Common Exception Types:

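  • Microsoft.Azure.Cosmos.CosmosException: Errors returned by the service or raised by the SDK. It carries the HTTP status code, sub-status code, activity ID, and diagnostics.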
  • System.Net.Http.HttpRequestException: Network-related issues during HTTP requests.
  • System.Net.Sockets.SocketException: Lower-level network socket errors.
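  • Direct mode transport (RNTBD) failures: In the .NET SDK these typically surface as a CosmosException with status 503 (Service Unavailable) or 408 (Request Timeout), often wrapping a lower-level transport exception. The exact inner exception type varies by SDK and version.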

Example of Capturing SDK Diagnostics in .NET SDK:


using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// The v3 .NET SDK attaches diagnostics to every response and to CosmosException.
// "container", "id", and "partitionKey" are placeholders supplied by the caller.
public static async Task ReadWithDiagnosticsAsync(Container container, string id, string partitionKey)
{
    try
    {
        ItemResponse<dynamic> response =
            await container.ReadItemAsync<dynamic>(id, new PartitionKey(partitionKey));

        // Log full diagnostics only for slow requests to keep log volume low.
        if (response.Diagnostics.GetClientElapsedTime() > TimeSpan.FromSeconds(1))
        {
            Console.WriteLine($"Slow request diagnostics: {response.Diagnostics}");
        }
    }
    catch (CosmosException ex)
    {
        Console.WriteLine("--- Cosmos Diagnostics ---");
        Console.WriteLine($"Status Code: {ex.StatusCode}");
        Console.WriteLine($"Sub-Status Code: {ex.SubStatusCode}");
        Console.WriteLine($"Activity Id: {ex.ActivityId}");
        Console.WriteLine($"Diagnostics: {ex.Diagnostics}");
        Console.WriteLine("--- End Cosmos Diagnostics ---");
    }
}
                        

Check Resource Limits and Throttling

Exceeding provisioned throughput (RUs) or exhausting client-side connections and ports can cause errors that look like connection issues.

  • Request Units (RUs): Monitor your Cosmos DB account's RU consumption in the Azure portal. If you frequently hit your provisioned throughput, requests are throttled with HTTP 429 (Too Many Requests), and the resulting retries and added latency can resemble connection instability.
  • Connection and Port Limits: A very large number of client instances or connections, especially in Direct mode, can exhaust the outbound ports or sockets available to your host (for example, SNAT port exhaustion on Azure compute). Reuse a single client instance, and consider Gateway mode or a review of your network configuration if the problem persists.

Warning: Throttling errors (HTTP 429) are typically for operations, but persistent high load can indirectly impact connection health and stability.
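The SDKs already retry throttled requests on their own, and a 429 response carries a retry-after hint (the x-ms-retry-after-ms header). If your application adds its own retry layer on top, for example after the SDK's retries are exhausted, it should wait at least as long as that hint and back off further on repeated failures. The following Java sketch computes such a delay with exponential backoff and jitter. The class name Backoff, the base of 100 ms, and the cap of 5 seconds are illustrative assumptions, not SDK defaults.

```java
import java.util.concurrent.ThreadLocalRandom;

final class Backoff {
    /**
     * Computes how long to wait before retry number "attempt" (0-based).
     * The result is never shorter than the server's retry-after hint.
     */
    static long delayMillis(int attempt, long retryAfterMillis, long baseMillis, long capMillis) {
        // Exponential growth: base * 2^attempt, limited to the cap.
        // The shift is clamped so small bases cannot overflow a long.
        long exponential = Math.min(capMillis, baseMillis << Math.min(attempt, 20));
        // Jitter: pick a random value in [exponential / 2, exponential]
        // so many clients do not retry at the same instant.
        long jittered = ThreadLocalRandom.current().nextLong(exponential / 2, exponential + 1);
        return Math.max(retryAfterMillis, jittered);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 5; attempt++) {
            System.out.println("attempt " + attempt + " -> wait "
                    + delayMillis(attempt, 0, 100, 5_000) + " ms");
        }
    }
}
```

Frequent retries at this layer are a signal to raise provisioned throughput or reduce request volume, not something to paper over with longer waits.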

Choose the Right Connection Mode

The SDK offers two primary connection modes:

  • Gateway Mode: Uses HTTPS on port 443 through a single endpoint. It is simpler to configure and works in most network environments, including those with strict firewall rules. Each request takes an extra network hop through the gateway, so latency can be higher than in Direct mode.
  • Direct Mode: Uses TCP (secured with TLS) to connect directly to the backend replicas. It offers lower latency and higher throughput but requires outbound access to ports in the 10000 to 20000 range in addition to 443. This mode is more sensitive to network configuration issues because the SDK manages many connections itself.

Tip: For most applications, Gateway mode is a good starting point. If you experience performance bottlenecks and have full control over your network environment, Direct mode might offer benefits, but troubleshooting connection issues in Direct mode can be more complex.

Advanced Troubleshooting

If the above steps don't resolve your issue, consider:

  • Packet Capture: Use network analysis tools (like Wireshark or `tcpdump`) to capture traffic between your client and the Cosmos DB endpoint. This can reveal low-level network problems.
  • Azure Support: If you suspect a platform issue or have exhausted all troubleshooting steps, open a support request with Azure. Provide them with detailed logs, error messages, and steps you've already taken.

Common Error Messages and Meanings

  • Name or service not known / UnknownHostException: DNS resolution failed. Check your endpoint URL and network DNS settings.
  • Connection timed out / ConnectException: Network issue, firewall blocking, or endpoint unreachable.
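  • 403 Forbidden: Often indicates that the request was blocked by the account's firewall or virtual network rules, or that the key or token lacks permission for the operation. An invalid key usually produces 401 Unauthorized instead.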
  • RntbdChannel open failed: Specific to Direct mode, indicates a problem with the underlying TCP connection.
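SDK exceptions often wrap the underlying network error, so the useful message may sit several levels down the cause chain. The following Java sketch walks that chain and maps the standard java.net exception types named above to a short hint. The class ConnectionErrorHints and the hint texts are illustrative, and the mapping covers only network-level causes, not HTTP status codes.

```java
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;

final class ConnectionErrorHints {
    /** Walks the cause chain and returns a hint for the first known network error. */
    static String hintFor(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof UnknownHostException) {
                return "DNS: the endpoint name did not resolve; check the URL and DNS settings.";
            }
            if (t instanceof ConnectException) {
                return "Connect: the endpoint refused or blocked the connection; check firewalls and ports.";
            }
            if (t instanceof SocketTimeoutException) {
                return "Timeout: the network path is slow or silently dropping packets.";
            }
        }
        return "No network-level cause found; inspect the status code and SDK diagnostics.";
    }

    public static void main(String[] args) {
        // Simulated failure: a DNS error wrapped the way a client library might wrap it.
        Throwable wrapped = new RuntimeException("request failed",
                new UnknownHostException("your-cosmosdb-account.documents.azure.com"));
        System.out.println(hintFor(wrapped));
    }
}
```

Logging the hint alongside the full stack trace keeps the original detail available for a support request.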