Cosmos DB Partitioning Overview - Azure Documentation

This document provides an overview of partitioning in Azure Cosmos DB, a critical concept for scaling your database effectively.

What is Partitioning?

Partitioning, also known as sharding, is the process of horizontally scaling your database by distributing data across multiple partitions. Items are grouped into logical partitions by their partition key value, and Azure Cosmos DB stores those logical partitions on physical partitions, the units that actually hold storage and serve throughput. Azure Cosmos DB handles the complexities of partitioning, replication, and load balancing for you.

Partition Key

The cornerstone of partitioning is the partition key. When you create a container (e.g., a container in the SQL API, a collection in the API for MongoDB, a graph in the Gremlin API), you must choose a property from your items to act as the partition key.

The value of the partition key property for an item determines which logical partition that item resides in.
Azure Cosmos DB uses the partition key value to deterministically route requests to the correct partition.
A good partition key is essential for efficient query performance and even distribution of request throughput (Request Units) across partitions.
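
The relationship between a partition key value and a logical partition can be sketched in a few lines of Python. This is only an illustration of the grouping rule described above, not the service's implementation; the helper name group_into_logical_partitions and the sample items are assumptions made for the example.

```python
from collections import defaultdict

def group_into_logical_partitions(items, partition_key_path):
    """Group item ids by the value of the partition key property.

    Illustrative only: each distinct value stands for one logical partition.
    """
    prop = partition_key_path.lstrip("/")  # "/customerId" -> "customerId"
    partitions = defaultdict(list)
    for item in items:
        partitions[item[prop]].append(item["id"])
    return dict(partitions)

# Hypothetical sample items.
items = [
    {"id": "1", "customerId": "CUST-ABC123"},
    {"id": "2", "customerId": "CUST-XYZ789"},
    {"id": "3", "customerId": "CUST-ABC123"},
]
logical_partitions = group_into_logical_partitions(items, "/customerId")
```

Items 1 and 3 share a partition key value, so they end up in the same group; a request that supplies that value only needs to look in that one group.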

Choosing a Good Partition Key

The ideal partition key should have:

High Cardinality: A large number of distinct values to ensure data is spread across many logical partitions.
Even Distribution: Values that are relatively evenly distributed across your items to avoid "hot partitions."
Query Alignment: If possible, choose a key that is frequently used in query filters (e.g., /customerId or /tenantId) so queries can be routed directly to the relevant partitions.

Note: Avoid partition keys with very few distinct values (e.g., /status with values like 'active', 'inactive') as this can lead to hot partitions and underutilization of provisioned throughput.
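
One way to compare candidate keys before committing to one is to measure cardinality and skew on a sample of your data. The sketch below assumes you can export a representative sample as a list of dictionaries; the function name key_stats and the generated sample are hypothetical.

```python
from collections import Counter

def key_stats(items, prop):
    """Return the number of distinct values and the share held by the most common one."""
    counts = Counter(item[prop] for item in items)
    total = sum(counts.values())
    return {
        "distinct_values": len(counts),
        "largest_share": max(counts.values()) / total,
    }

# Hypothetical sample: 1,000 items, 50 customers, two status values.
sample = [
    {"customerId": f"CUST-{i % 50}", "status": "inactive" if i % 10 == 0 else "active"}
    for i in range(1000)
]
customer_stats = key_stats(sample, "customerId")
status_stats = key_stats(sample, "status")
```

In this sample, customerId has 50 distinct values with no value holding more than 2% of the items, while status has only two values and one of them holds 90% of the items, which is the pattern the note above warns against.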

Logical and Physical Partitions

Azure Cosmos DB abstracts the complexity of physical storage. You design around logical partitions, and the service decides where each one is physically stored.

Each logical partition contains items with the same partition key value.
Azure Cosmos DB maps logical partitions to underlying physical partitions for storage and throughput.
As your data grows or your throughput needs increase, Azure Cosmos DB automatically scales the number of physical partitions.
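
The mapping from logical to physical partitions can be pictured as hash ranges that are split as the container grows. The following is a simplified model under stated assumptions: SHA-256, a 32-bit hash space, and the PhysicalLayout class are stand-ins chosen for the example, and the hash function and range management Azure Cosmos DB actually uses are internal to the service.

```python
import hashlib
from bisect import bisect_right

HASH_SPACE = 2 ** 32  # size of the illustrative hash space

def key_hash(partition_key_value: str) -> int:
    """Hash a partition key value into [0, HASH_SPACE). SHA-256 is a stand-in."""
    digest = hashlib.sha256(partition_key_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

class PhysicalLayout:
    """Tracks which hash range each simulated physical partition owns."""

    def __init__(self) -> None:
        # range_starts[i] is the first hash owned by physical partition i.
        self.range_starts = [0]

    def partition_for(self, partition_key_value: str) -> int:
        """Return the index of the physical partition that owns this key's hash."""
        return bisect_right(self.range_starts, key_hash(partition_key_value)) - 1

    def split(self, index: int) -> None:
        """Split one physical partition's range in half, adding a partition."""
        start = self.range_starts[index]
        is_last = index + 1 == len(self.range_starts)
        end = HASH_SPACE if is_last else self.range_starts[index + 1]
        self.range_starts.insert(index + 1, (start + end) // 2)

layout = PhysicalLayout()
keys = [f"CUST-{n}" for n in range(100)]
before = {k: layout.partition_for(k) for k in keys}  # every key on partition 0
layout.split(0)
after = {k: layout.partition_for(k) for k in keys}   # each key on partition 0 or 1
```

Because every item with the same partition key value produces the same hash, a logical partition never straddles two physical partitions in this model; a split only changes which physical partition owns it.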

Partitioning Strategy

A well-defined partitioning strategy is crucial for performance and scalability.

1. Select the Partition Key

Based on your data model and common query patterns, select an appropriate partition key. For example:

// Example: SQL API Item Structure
{
    "id": "12345",
    "orderId": "ORD-98765",
    "customerId": "CUST-ABC123",
    "orderDate": "2023-10-27T10:00:00Z",
    "totalAmount": 150.75
}

In this example, /customerId or /orderDate could be potential partition keys. /customerId suits queries that fetch all orders for a specific customer, because each such query targets a single logical partition. /orderDate might be useful for time-series analysis, but if the value is truncated to the day, every order placed on a given day lands in the same logical partition, which can create a hot partition on busy days.

2. Understand Throughput Distribution

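The total provisioned Request Units (RUs) for a container are distributed evenly across all its physical partitions, so each physical partition can serve only its own share, regardless of how idle the others are. A hot partition (one that receives a disproportionately high share of traffic) can exhaust its share and become a bottleneck while the container as a whole is still under its provisioned total.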
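
A short calculation makes the effect concrete. The figures below (10,000 RU/s, five physical partitions, and the per-partition demand) are hypothetical, and the function names are made up for this sketch; the only rule taken from the text is the even split of provisioned throughput.

```python
def per_partition_budget(total_ru_per_sec: float, physical_partitions: int) -> float:
    """Even share of the container's provisioned RU/s for each physical partition."""
    return total_ru_per_sec / physical_partitions

def excess_demand(demand_by_partition, total_ru_per_sec):
    """RU/s by which each partition's demand exceeds its even share (0 if within it)."""
    budget = per_partition_budget(total_ru_per_sec, len(demand_by_partition))
    return [max(0.0, demand - budget) for demand in demand_by_partition]

total = 10_000                          # provisioned RU/s (hypothetical)
demand = [500, 500, 500, 500, 4_000]    # RU/s requested per physical partition
budget = per_partition_budget(total, len(demand))
excess = excess_demand(demand, total)
```

Total demand here is 6,000 RU/s, well under the 10,000 RU/s provisioned, yet the fifth partition asks for 2,000 RU/s more than its 2,000 RU/s share. Requests beyond a partition's share are the ones that get rate limited.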

3. Use a Wide Range of Partition Key Values

To avoid hot partitions, ensure your chosen partition key has a sufficient number of distinct values. For example, if you have 10,000 customers, using /customerId as the partition key will likely result in good distribution, provided no single customer accounts for a large share of the data or traffic.

4. Synthetic Partition Keys

In scenarios where no single property has enough distinct values, you can create a "synthetic" partition key by combining two or more properties into a new property and using that property as the partition key. For instance, you could concatenate customerId and the date portion of orderDate to create a synthetic key of the form customerId-orderDate.

// Example Synthetic Partition Key Value
"CUST-ABC123-2023-10-27"
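
A minimal sketch of how an application might compute such a value before writing an item. The property name partitionKey and both helper functions are assumptions for illustration; the container would need to be created with that property's path as its partition key.

```python
def synthetic_key(item: dict) -> str:
    """Combine customerId with the date portion of orderDate."""
    order_day = item["orderDate"][:10]  # "2023-10-27T10:00:00Z" -> "2023-10-27"
    return f'{item["customerId"]}-{order_day}'

def with_synthetic_key(item: dict, key_property: str = "partitionKey") -> dict:
    """Return a copy of the item with the synthetic key stored as its own property."""
    return {**item, key_property: synthetic_key(item)}

order = {
    "id": "12345",
    "orderId": "ORD-98765",
    "customerId": "CUST-ABC123",
    "orderDate": "2023-10-27T10:00:00Z",
    "totalAmount": 150.75,
}
stored_item = with_synthetic_key(order)
```

The trade-off is that readers must be able to rebuild the same value: a query for one customer's orders on one day can target a single logical partition, while a query for all of a customer's orders now spans several.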

Partition Key Limits

Each logical partition has a maximum storage size. It is good practice to design with this limit in mind, because a partition key value whose items outgrow it cannot accept more data:

Logical Partition Size: 20 GB
Number of Logical Partitions: No fixed limit; the number of physical partitions is scaled automatically by Azure Cosmos DB
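
To see which partition key values are approaching the size limit, you can estimate per-key storage from a data export. This is a rough sketch: serialized JSON length only approximates stored size (it ignores index and metadata overhead), the 20 GB limit is treated as 20 GiB, and the 80% warning fraction and function names are assumptions.

```python
import json
from collections import defaultdict

LOGICAL_PARTITION_LIMIT_BYTES = 20 * 1024 ** 3  # 20 GB limit, taken here as GiB

def estimated_bytes_by_key(items, prop):
    """Sum the serialized JSON size of items per partition key value (approximate)."""
    sizes = defaultdict(int)
    for item in items:
        sizes[item[prop]] += len(json.dumps(item).encode("utf-8"))
    return dict(sizes)

def keys_near_limit(sizes, fraction=0.8):
    """Return key values whose estimated size is at least `fraction` of the limit."""
    threshold = fraction * LOGICAL_PARTITION_LIMIT_BYTES
    return sorted(key for key, size in sizes.items() if size >= threshold)

# Hypothetical per-key estimates, e.g. produced by estimated_bytes_by_key on an export.
sizes = {"CUST-BIG": 17 * 1024 ** 3, "CUST-SMALL": 1 * 1024 ** 3}
at_risk = keys_near_limit(sizes)
```

A key value that shows up in at_risk is a candidate for a finer-grained or synthetic partition key.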

Key Takeaway: The partition key is the most critical decision for scaling and performance in Azure Cosmos DB. Choose wisely based on your data and access patterns.

Common Scenarios

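Multi-tenant Applications: Use /tenantId as the partition key so that each tenant's data stays in its own logical partition and tenant-scoped queries target a single partition. Watch for very large or very busy tenants, which can approach the logical partition size limit or become hot partitions.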
IoT Data: Partitioning by /deviceId is common, but consider time-based partitioning or synthetic keys if device activity is highly variable.
E-commerce Orders: /customerId or a synthetic key combining customer and date ranges can work well.