Partitioning Overview
This document provides an overview of partitioning in Azure Cosmos DB, a critical concept for scaling your database effectively.
What is Partitioning?
Partitioning, also known as sharding, is the process of horizontally scaling your database by distributing data across multiple partitions. Items are grouped into logical partitions by their partition key value, and each logical partition is stored on a physical partition that supplies the storage and throughput. Azure Cosmos DB handles the complexities of partitioning, replication, and load balancing for you.
Partition Key
The cornerstone of partitioning is the partition key. When you create a container (e.g., a container in SQL API, a graph in Gremlin API), you must choose a property from your items to act as the partition key.
- The value of the partition key property for an item determines which logical partition that item resides in.
- Azure Cosmos DB uses the partition key value to deterministically route requests to the correct partition.
- A good partition key is essential for efficient query performance and even distribution of request throughput (Request Units) across partitions.
Choosing a Good Partition Key
The ideal partition key should have:
- High Cardinality: A large number of distinct values to ensure data is spread across many logical partitions.
- Even Distribution: Values that are relatively evenly distributed across your items to avoid "hot partitions."
- Query Alignment: If possible, choose a key that is frequently used in query filters (e.g., /customerId or /tenantId) so queries can be routed directly to the relevant partitions.
Note: Avoid partition keys with very few distinct values (e.g., /status with values like 'active', 'inactive') as this can lead to hot partitions and underutilization of provisioned throughput.
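The cardinality and distribution criteria above can be checked against a sample of your own data before you commit to a key. The following is a minimal sketch in plain Python, not part of any Azure SDK; the function name key_stats and the sample items are hypothetical and exist only for illustration.

```python
from collections import Counter


def key_stats(items, key):
    """Summarize how a candidate partition key spreads a sample of items.

    Returns the number of distinct key values (cardinality) and the share
    of items held by the most common value (a rough skew indicator).
    """
    counts = Counter(item[key] for item in items)
    total = sum(counts.values())
    return {
        "distinct_values": len(counts),
        "largest_share": max(counts.values()) / total,
    }


# Hypothetical sample data, for illustration only.
sample_items = [
    {"customerId": "CUST-1", "status": "active"},
    {"customerId": "CUST-2", "status": "active"},
    {"customerId": "CUST-3", "status": "active"},
    {"customerId": "CUST-4", "status": "inactive"},
]

customer_stats = key_stats(sample_items, "customerId")
status_stats = key_stats(sample_items, "status")
```

In this sample, customerId yields four distinct values with no value holding more than a quarter of the items, while status yields only two values with one of them holding three quarters of the items, which is the pattern the note above warns about.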
Logical and Physical Partitions
Azure Cosmos DB abstracts the complexity of physical storage. You interact with logical partitions, which are collections of items that share the same partition key value.
- Each logical partition contains items with the same partition key value.
- Azure Cosmos DB maps logical partitions to underlying physical partitions for storage and throughput.
- As your data grows or your throughput needs increase, Azure Cosmos DB automatically scales the number of physical partitions.
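The relationship between the two levels can be sketched as a two-step grouping: items with the same partition key value form one logical partition, and each logical partition is placed on exactly one physical partition. The code below is an illustration only. The hash function that Azure Cosmos DB uses is internal to the service, so SHA-256 with a modulo is an assumed stand-in, and the names physical_partition_for and place_items are hypothetical.

```python
import hashlib
from collections import defaultdict


def physical_partition_for(key_value, physical_count):
    """Deterministically map a partition key value to a physical partition index.

    Assumption: SHA-256 plus modulo stands in for the service's internal
    hashing scheme, which is not reproduced here.
    """
    digest = hashlib.sha256(key_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % physical_count


def place_items(items, key, physical_count):
    """Group items into logical partitions, then assign each to a physical one.

    Result shape: {physical_index: {partition_key_value: [items...]}}
    """
    placement = defaultdict(lambda: defaultdict(list))
    for item in items:
        value = item[key]
        placement[physical_partition_for(value, physical_count)][value].append(item)
    return placement


# Hypothetical orders, for illustration only.
orders = [
    {"id": "1", "customerId": "CUST-A"},
    {"id": "2", "customerId": "CUST-B"},
    {"id": "3", "customerId": "CUST-A"},
    {"id": "4", "customerId": "CUST-C"},
]
placement = place_items(orders, "customerId", physical_count=3)
```

Because the mapping depends only on the key value, every item for a given customerId lands in the same logical partition, and that logical partition lives on a single physical partition.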
Partitioning Strategy
A well-defined partitioning strategy is crucial for performance and scalability.
1. Select the Partition Key
Based on your data model and common query patterns, select an appropriate partition key. For example:
// Example: SQL API Item Structure
{
  "id": "12345",
  "orderId": "ORD-98765",
  "customerId": "CUST-ABC123",
  "orderDate": "2023-10-27T10:00:00Z",
  "totalAmount": 150.75
}
In this example, /customerId or /orderDate could be potential partition keys. /customerId would be suitable for queries that fetch all orders for a specific customer. /orderDate might be useful for time-series analysis but could lead to hot partitions if most orders occur on the same day.
2. Understand Throughput Distribution
The total provisioned Request Units (RUs) for a container are distributed evenly across all its physical partitions. If you have a hot partition (one that receives a disproportionately high amount of traffic), it can exhaust its own share and become a bottleneck: its requests are rate-limited while throughput on the other partitions goes unused.
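The arithmetic behind even distribution is simple enough to sketch. The helper names and all numbers below are hypothetical, chosen only to show how one hot partition can exceed its share even when total demand stays within the provisioned amount.

```python
def per_partition_throughput(provisioned_rus, physical_partitions):
    """Return the RU/s share of each physical partition under even distribution."""
    if physical_partitions <= 0:
        raise ValueError("physical_partitions must be positive")
    return provisioned_rus / physical_partitions


def excess_demand(demand_by_partition, provisioned_rus):
    """Return, per physical partition, the RU/s demanded beyond its even share."""
    share = per_partition_throughput(provisioned_rus, len(demand_by_partition))
    return [max(0.0, demand - share) for demand in demand_by_partition]


# Hypothetical container: 10,000 RU/s provisioned across 5 physical partitions.
share = per_partition_throughput(10_000, 5)

# Total demand is 10,000 RU/s, but it is concentrated on the first partition.
excess = excess_demand([6_000, 1_000, 1_000, 1_000, 1_000], 10_000)
```

Here each partition receives 2,000 RU/s. The first partition asks for 4,000 RU/s more than its share, while the other four each leave 1,000 RU/s idle.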
3. Broad Partition Key Values
To avoid hot partitions, ensure your chosen partition key has a sufficient number of distinct values. For example, if you have 10,000 customers, using /customerId as the partition key will likely result in good distribution.
4. Synthetic Partition Keys
In scenarios where natural partition keys have low cardinality, you can create a "synthetic" partition key by combining two or more properties. For instance, you could combine customerId and orderDate to create a synthetic key like customerId-orderDate.
// Example Synthetic Partition Key Value
"CUST-ABC123-2023-10-27"
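One way to build such a value is to compute it in application code before writing the item. The sketch below assumes the order item shape shown earlier and stores the result in a property named partitionKey; that property name and both function names are hypothetical choices, not something the service requires.

```python
from datetime import datetime


def synthetic_partition_key(customer_id, order_date):
    """Combine a customer ID with the calendar day of an ISO 8601 timestamp."""
    # fromisoformat() on older Python versions rejects a trailing "Z",
    # so normalize it to an explicit UTC offset first.
    parsed = datetime.fromisoformat(order_date.replace("Z", "+00:00"))
    return f"{customer_id}-{parsed.date().isoformat()}"


def with_partition_key(item):
    """Return a copy of the item with the synthetic key stored on it.

    Assumption: the container would be created with /partitionKey as its
    partition key path.
    """
    enriched = dict(item)
    enriched["partitionKey"] = synthetic_partition_key(
        item["customerId"], item["orderDate"]
    )
    return enriched


order = {
    "id": "12345",
    "orderId": "ORD-98765",
    "customerId": "CUST-ABC123",
    "orderDate": "2023-10-27T10:00:00Z",
    "totalAmount": 150.75,
}
enriched_order = with_partition_key(order)
```

Truncating the timestamp to the day keeps all of a customer's orders for that day in one logical partition. A finer or coarser bucket is a design choice that trades distribution against how many partitions a query must touch.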
Partition Key Limits
Each logical partition has a maximum storage size. It's good practice to keep this limit in mind when choosing a key, because all items that share one partition key value count toward it:
- Logical Partition Size: 20 GB
- Number of Logical Partitions: No fixed limit; scaled automatically by Azure Cosmos DB
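A rough way to spot key values that are heading toward the size limit is to estimate bytes per logical partition from a sample. The sketch below uses the length of each item's JSON serialization as an approximation. Actual stored size also includes overhead such as indexes, the 0.8 warning threshold is an arbitrary assumption, reading 20 GB as 20 * 1024^3 bytes is an assumption, and the function names are hypothetical.

```python
import json
from collections import defaultdict

# 20 GB limit from the text, interpreted here as 20 * 1024**3 bytes (assumption).
LOGICAL_PARTITION_LIMIT_BYTES = 20 * 1024**3


def logical_partition_sizes(items, key):
    """Estimate bytes per logical partition from serialized JSON length."""
    sizes = defaultdict(int)
    for item in items:
        sizes[item[key]] += len(json.dumps(item).encode("utf-8"))
    return dict(sizes)


def partitions_near_limit(sizes, threshold=0.8):
    """Return key values whose estimated size is at or above threshold * limit."""
    cutoff = threshold * LOGICAL_PARTITION_LIMIT_BYTES
    return [key_value for key_value, size in sizes.items() if size >= cutoff]


# Hypothetical sample items, for illustration only.
sample = [
    {"id": "1", "tenantId": "T-1", "payload": "x" * 100},
    {"id": "2", "tenantId": "T-1", "payload": "x" * 100},
    {"id": "3", "tenantId": "T-2", "payload": "x" * 10},
]
sizes = logical_partition_sizes(sample, "tenantId")
```

If one key value dominates the estimate, a synthetic key that splits it further may be worth considering.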
Key Takeaway: The partition key is the most critical decision for scaling and performance in Azure Cosmos DB. Choose wisely based on your data and access patterns.
Common Scenarios
- Multi-tenant Applications: Use /tenantId as the partition key for excellent data isolation and query efficiency.
- IoT Data: Partitioning by /deviceId is common, but consider time-based partitioning or synthetic keys if device activity is highly variable.
- E-commerce Orders: /customerId or a synthetic key combining customer and date ranges can work well.