Azure Cosmos DB Monitoring Tutorials

Welcome to this collection of tutorials focused on monitoring your Azure Cosmos DB resources. Effective monitoring is crucial for ensuring the performance, availability, and cost-effectiveness of your NoSQL database solutions.

Introduction to Azure Cosmos DB Monitoring

Azure Cosmos DB offers robust built-in monitoring capabilities. This section provides an overview of the key metrics and logs you should be tracking to gain insights into your database's health and performance.

Key performance indicators (KPIs) such as Request Units (RUs) consumed per second, request latency, and provisioned throughput.
Understanding diagnostic logs for detailed event tracking.
Leveraging Azure Monitor for centralized monitoring.

Tutorial 1: Monitoring Throughput and Request Units (RUs)

This tutorial guides you through monitoring your provisioned and consumed Request Units. Understanding RU consumption is vital for cost optimization and preventing throttling.

Steps:

Navigate to your Azure Cosmos DB account in the Azure portal.
Access the "Metrics" blade.
Select the "Total Request Units" and "Normalized RU Consumption" metrics, and add "Total Requests" to see request volume.
Analyze the graphs to identify peak usage times and potential bottlenecks.
Set up alerts that fire when RU consumption exceeds predefined thresholds.
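
The analysis in the steps above can also be done offline on exported metric values. The following is a minimal Python sketch, not part of any Azure SDK: the `RuSample` type, the `utilization_report` function, the 80% threshold, and the sample numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RuSample:
    hour: str            # label for the time bucket, e.g. "09:00"
    consumed_rus: float  # highest RU/s consumed during that bucket


def utilization_report(samples, provisioned_rus, threshold=0.8):
    """Return (peak_sample, hot_buckets) for the given samples.

    hot_buckets lists every bucket whose consumption is at or above
    threshold * provisioned_rus, i.e. the periods most at risk of throttling.
    """
    if provisioned_rus <= 0:
        raise ValueError("provisioned_rus must be positive")
    if not samples:
        raise ValueError("samples must not be empty")
    peak = max(samples, key=lambda s: s.consumed_rus)
    limit = threshold * provisioned_rus
    hot = [s.hour for s in samples if s.consumed_rus >= limit]
    return peak, hot


# Hypothetical hourly values for a container provisioned with 400 RU/s.
samples = [
    RuSample("08:00", 120.0),
    RuSample("09:00", 340.0),
    RuSample("10:00", 390.0),
    RuSample("11:00", 150.0),
]
peak, hot = utilization_report(samples, provisioned_rus=400)
print("peak bucket:", peak.hour)
print("buckets at or above 80%:", hot)
```

Buckets that repeatedly appear in the second list are reasonable candidates for either more provisioned throughput or a closer look at the workload running at that time.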

Example metric query using the Azure CLI:


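# Replace <resource-id> with the full resource ID of your Cosmos DB account.
# Returns the hourly request count for the last 24 hours.
az monitor metrics list --resource "<resource-id>" \
    --metric "TotalRequests" \
    --aggregation Count \
    --interval PT1H \
    --offset 24h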
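
The JSON printed by the command can be post-processed to find the busiest interval. The sketch below is a rough illustration in Python: the sample document is made up and trimmed, the assumed shape (`value` → `timeseries` → `data` → `timeStamp`) should be checked against real output, and the `field` to read depends on the aggregation you requested (`count` is assumed here). `busiest_interval` is a hypothetical helper name.

```python
import json

# Trimmed, made-up sample; real CLI output carries additional fields.
SAMPLE_OUTPUT = """
{
  "value": [
    {
      "name": {"value": "TotalRequests"},
      "timeseries": [
        {
          "data": [
            {"timeStamp": "2024-01-01T00:00:00Z", "count": 1200.0},
            {"timeStamp": "2024-01-01T01:00:00Z", "count": 4800.0},
            {"timeStamp": "2024-01-01T02:00:00Z"}
          ]
        }
      ]
    }
  ]
}
"""


def busiest_interval(cli_json, field="count"):
    """Return (timestamp, value) for the interval with the highest value.

    Returns None when no interval carries the requested field.
    """
    doc = json.loads(cli_json)
    points = [
        (point["timeStamp"], point[field])
        for metric in doc.get("value", [])
        for series in metric.get("timeseries", [])
        for point in series.get("data", [])
        if point.get(field) is not None  # skip intervals without data
    ]
    if not points:
        return None
    return max(points, key=lambda p: p[1])


print(busiest_interval(SAMPLE_OUTPUT))
```

In practice the string would come from the command's standard output rather than from an inline constant.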

Tutorial 2: Analyzing Request Latency

High latency can significantly impact your application's user experience. This tutorial shows you how to monitor and diagnose request latency issues.

Key Metrics:

NormalizedRUConsumption
ProvisionedThroughput
Client-side latency
Server-side latency

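Server-side latency is the time Cosmos DB spends processing a request. Client-side latency is the end-to-end time your application observes, including network round trips and SDK overhead. Comparing the two shows whether a delay originates in the service or between your application and the service.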
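
One way to compare the two kinds of latency is to collect both values for the same requests and look at high percentiles. The Python sketch below assumes you already have such pairs (for example from application logs); the function names, the nearest-rank percentile method, and the sample values are illustrative assumptions, not anything provided by Cosmos DB.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (pct between 0 and 100) of a non-empty list."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_breakdown(requests):
    """Summarize (client_ms, server_ms) pairs measured for the same requests.

    The overhead is the part of client-side latency that is not explained
    by server-side processing (network, serialization, client queuing).
    """
    client = [c for c, _ in requests]
    server = [s for _, s in requests]
    overhead = [c - s for c, s in requests]
    return {
        "client_p95_ms": percentile(client, 95),
        "server_p95_ms": percentile(server, 95),
        "overhead_p95_ms": percentile(overhead, 95),
    }


# Hypothetical measurements in milliseconds: (client-side, server-side).
measurements = [(12, 4), (15, 5), (90, 6), (14, 5)]
report = latency_breakdown(measurements)
print(report)
```

A large overhead combined with a low server-side value suggests investigating the network path or the client, whereas a high server-side value points toward the query, indexing, or throughput configuration.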

Troubleshooting Tips:

Check network connectivity between your application and Cosmos DB.
Review partitioning strategy for even data distribution.
Ensure sufficient RU provisioning.
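
To make the partitioning tip concrete, the following Python sketch checks whether one partition key consumes a disproportionate share of RUs. It assumes you have per-partition-key RU totals from some source; the function names, the "twice an even share" tolerance, and the tenant keys are hypothetical choices for illustration only.

```python
def partition_skew(ru_by_partition):
    """Return (hottest_key, share), where share is that key's fraction of all RUs."""
    total = sum(ru_by_partition.values())
    if total <= 0:
        raise ValueError("no RU consumption recorded")
    hottest = max(ru_by_partition, key=ru_by_partition.get)
    return hottest, ru_by_partition[hottest] / total


def is_hot(ru_by_partition, tolerance=2.0):
    """True if the hottest key uses more than tolerance times an even share."""
    _, share = partition_skew(ru_by_partition)
    even_share = 1 / len(ru_by_partition)
    return share > tolerance * even_share


# Hypothetical RU totals per partition key over the same time window.
skewed = {"tenant-a": 700, "tenant-b": 100, "tenant-c": 100, "tenant-d": 100}
balanced = {"tenant-a": 250, "tenant-b": 250, "tenant-c": 250, "tenant-d": 250}

print(partition_skew(skewed), is_hot(skewed))
print(partition_skew(balanced), is_hot(balanced))
```

A skewed result means a single key can hit its limit and be throttled even while overall consumption stays well below the provisioned throughput.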

Tutorial 3: Setting Up Alerts for Critical Events

Proactive alerting is essential for quickly responding to issues. This tutorial covers setting up alerts for key metrics and diagnostic events.

Alerting Scenarios:

RU consumption approaching the provisioned limit (e.g., above 80% of provisioned RUs).
High request latency (e.g., average latency > 100ms).
Availability issues (e.g., API errors).
Cost anomalies.

Notification actions, such as sending emails or calling webhooks, are configured through Azure Monitor action groups attached to each alert rule.
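
The threshold scenarios above amount to comparing metric values against limits and notifying on a match. The Python sketch below illustrates that logic only: `AlertRule`, the metric keys, and the payload layout are invented for this example and are not the Azure Monitor alert rule schema or its webhook format.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str       # key in the metrics snapshot (hypothetical names)
    threshold: float  # the rule fires when the value is strictly greater


RULES = [
    AlertRule("High RU consumption", "normalized_ru_percent", 80.0),
    AlertRule("High request latency", "avg_latency_ms", 100.0),
    AlertRule("API errors", "failed_requests", 0.0),
]


def evaluate(rules, snapshot):
    """Return the rules whose metric in the snapshot exceeds its threshold."""
    return [r for r in rules if snapshot.get(r.metric, 0.0) > r.threshold]


def webhook_payload(fired, snapshot):
    """Build a JSON body for a webhook receiver (made-up layout)."""
    return json.dumps({
        "alerts": [
            {
                "name": r.name,
                "metric": r.metric,
                "value": snapshot[r.metric],
                "threshold": r.threshold,
            }
            for r in fired
        ]
    })


# Hypothetical metric values for one evaluation window.
snapshot = {"normalized_ru_percent": 92.5, "avg_latency_ms": 40.0, "failed_requests": 3}
fired = evaluate(RULES, snapshot)
payload = webhook_payload(fired, snapshot)
print(payload)
```

The sketch stops at building the payload; sending it, retrying, and de-duplicating repeated alerts are left to the notification mechanism you choose.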

Tutorial 4: Using Diagnostic Logs and Log Analytics

Diagnostic logs provide granular details about operations performed on your Cosmos DB account. This tutorial demonstrates how to collect and analyze these logs using Azure Log Analytics.

Log Categories:

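DataPlaneRequests: Records data plane operations (such as create, read, update, delete, and query) with their status code, duration, and request charge.
ControlPlaneRequests: Records control plane operations, such as changes to account settings and provisioned throughput.
QueryRuntimeStatistics: Records the text of the queries that were executed.
PartitionKeyRUConsumption: Records RU consumption aggregated per partition key.

There is no separate category for exceptions or autoscale events: failed operations appear in DataPlaneRequests with a non-success status code, and changes to throughput settings are recorded in ControlPlaneRequests.

Example Kusto Query Language (KQL) query to find all failed requests (column names assume logs are sent to the AzureDiagnostics table):


AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DOCUMENTDB" and Category == "DataPlaneRequests"
// Treat any 4xx or 5xx status code as a failed request.
| where toint(statusCode_s) >= 400
| project TimeGenerated, OperationName, requestResourceId_s, statusCode_s, activityId_s
| order by TimeGenerated desc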
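
Once query results are exported, grouping failures by operation and status code often shows the dominant problem quickly. The Python sketch below assumes each row has already been converted to a dictionary with hypothetical `operation` and `status_code` keys (these are not the column names Log Analytics returns), and it treats any status code of 400 or above as a failure.

```python
from collections import Counter


def summarize_failures(rows):
    """Count failed requests per (operation, status_code), most frequent first."""
    counts = Counter(
        (row["operation"], row["status_code"])
        for row in rows
        if row["status_code"] >= 400
    )
    return counts.most_common()


# Hypothetical exported rows.
rows = [
    {"operation": "Query", "status_code": 429},
    {"operation": "Query", "status_code": 429},
    {"operation": "Create", "status_code": 409},
    {"operation": "Read", "status_code": 200},
]
summary = summarize_failures(rows)
for (operation, status), count in summary:
    print(operation, status, count)
```

In this made-up data, status 429 on queries dominates, which would point toward throttling rather than application errors.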

Best Practices for Monitoring

The following practices help turn the individual tutorials into a consistent monitoring strategy:

Regularly review dashboards.
Define clear SLAs and monitor against them.
Automate responses to common alerts where possible.
Use Application Insights for end-to-end application monitoring, including Cosmos DB interactions.
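
Monitoring against an SLA can be expressed as a simple calculation over request counts. The Python sketch below uses a request-based availability definition and a hypothetical 99.99% target chosen for illustration; it does not reproduce how any published SLA is measured, and the function names are assumptions.

```python
def availability_percent(total_requests, failed_requests):
    """Percentage of requests that succeeded in the measurement window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100.0 * (total_requests - failed_requests) / total_requests


def error_budget_remaining(total_requests, failed_requests, target_percent):
    """Failures still allowed in this window before the target is missed.

    A negative result means the target has already been missed.
    """
    allowed_failures = total_requests * (100.0 - target_percent) / 100.0
    return allowed_failures - failed_requests


# Hypothetical month: one million requests, 50 of them failed.
total, failed, target = 1_000_000, 50, 99.99
achieved = availability_percent(total, failed)
remaining = error_budget_remaining(total, failed, target)
print(f"availability: {achieved:.4f}%")
print(f"failures left in budget: {remaining:.0f}")
```

Tracking the remaining budget rather than only the achieved percentage makes it easier to decide when an alert deserves an immediate response.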