Partitioning in Azure Cosmos DB
Introduction to Partitioning
Azure Cosmos DB is a globally distributed, multi-model database service that enables you to create and query document, key/value, and graph databases. It offers elastic and independent scaling of storage and throughput across a given geographic distribution. Partitioning is a fundamental concept in Cosmos DB that enables this horizontal scalability.
Every container in Azure Cosmos DB is a unit of scalability composed of one or more logical partitions. Each logical partition contains the subset of the container's items that share a single partition key value. Azure Cosmos DB scales a container by distributing its logical partitions across physical partitions, which it manages automatically.
Key Concept: Logical partitions are the building blocks of horizontal scalability. A good partition key is crucial for performance and scalability.
When you choose a partition key for your container, Cosmos DB uses the value of this key for each item to determine which logical partition the item belongs to. This ensures that items with the same partition key value are collocated on the same logical partition. This collocation is vital for efficient transaction processing and querying.
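The grouping described above can be illustrated with a minimal sketch. It only models the idea that items sharing a partition key value land in the same logical partition; the function name `group_by_partition_key` and the sample items are hypothetical, and nothing here reflects how the service stores data internally.

```python
from collections import defaultdict


def group_by_partition_key(items, key_property):
    """Group items into logical partitions, one per distinct key value."""
    partitions = defaultdict(list)
    for item in items:
        partitions[item[key_property]].append(item)
    return dict(partitions)


# Hypothetical items; "userId" plays the role of the partition key.
items = [
    {"id": "1", "userId": "alice", "type": "order"},
    {"id": "2", "userId": "bob", "type": "order"},
    {"id": "3", "userId": "alice", "type": "profile"},
]

logical_partitions = group_by_partition_key(items, "userId")
for key_value, members in logical_partitions.items():
    print(key_value, [m["id"] for m in members])
```

Both of alice's items end up in one group, which is what makes single-partition queries and transactions over her data possible.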
Choosing a Partition Key
The partition key is a property within your items whose value is used to determine the logical partition that an item is stored in. The choice of a partition key significantly impacts your application's performance, scalability, and cost. A well-chosen partition key distributes requests and storage evenly across logical partitions.
Key Characteristics of a Good Partition Key:
- High Cardinality: The partition key should have a large number of distinct values to enable effective distribution of data and requests.
- Even Distribution: The values of the partition key should be evenly distributed across your data to avoid "hot partitions" (partitions that receive a disproportionate amount of traffic or storage).
- Query Patterns Alignment: Ideally, the partition key should be included in the filter predicates of most of your queries. This allows Cosmos DB to efficiently route queries to the relevant logical partitions.
For example, if you are storing user profiles, a userId could be a good partition key if each user has a unique ID and queries often target specific users.
{
  "id": "user123",
  "userId": "a7b4f9c1-e8d3-4a2b-8f0e-1c9a3b5d7e8f",
  "name": "Jane Doe",
  "email": "jane.doe@example.com",
  "registrationDate": "2023-10-27T10:00:00Z"
}
Partition Key Best Practices
To maximize performance and avoid common pitfalls, adhere to these best practices when selecting and using partition keys:
- Avoid Hot Partitions: A hot partition occurs when a small number of logical partitions receive a disproportionately high share of requests or data. This can be due to a partition key with low cardinality or a key that is frequently used in queries but has unevenly distributed values.
- Choose Keys Based on Query Patterns: Select a partition key that is commonly used in your application's queries. If your application frequently queries data by tenant ID, then tenantId is a strong candidate.
- Leverage a Single Property: For most scenarios, partitioning on a single property is recommended. Hierarchical partition keys, which combine several properties, are supported, but they add complexity.
- Consider Data Size and Request Rate: A logical partition can store at most 20 GB. Because it lives on a single physical partition, it also cannot be served more than that physical partition's maximum of 10,000 RU/s. If the data or traffic for a single partition key value could approach these limits, choose a key with higher cardinality.
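- Immutable Partition Keys: Once an item is written to a logical partition, its partition key value cannot be changed in place. To move an item to a different partition key value, you must delete it and recreate it. Choose a property whose value does not change over the item's lifetime.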
- Use System Generated IDs Carefully: While id is unique for each item, it's often not a good partition key unless your data access patterns specifically target it with high frequency and even distribution.
Example of a Poor Partition Key Choice:
If you partition by registrationDate and most users register on the same day, you'll likely create a hot partition.
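One way to compare candidate keys before committing to one is to measure, on a sample of your data, how many distinct values each candidate has and how large a share the most common value takes. The sketch below does this for userId and registrationDate; the helper `key_stats` and the generated sample (1,000 users, 900 of them registered on the same day) are hypothetical and exist only to illustrate the skew.

```python
from collections import Counter


def key_stats(items, key_property):
    """Return the cardinality and the share of items held by the largest key value."""
    counts = Counter(item[key_property] for item in items)
    return {
        "distinct_values": len(counts),
        "largest_share": max(counts.values()) / len(items),
    }


# Hypothetical sample: 900 of 1,000 users registered on 2023-10-27.
users = [
    {
        "userId": f"user-{i}",
        "registrationDate": "2023-10-27" if i < 900 else f"2023-11-{(i % 28) + 1:02d}",
    }
    for i in range(1000)
]

by_user = key_stats(users, "userId")
by_date = key_stats(users, "registrationDate")
print("userId:", by_user)
print("registrationDate:", by_date)
```

In this sample, userId spreads items across 1,000 values with no value holding more than 0.1% of the data, while a single registrationDate value holds 90% of it, which is the signature of a hot partition.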
Partition Scalability
Azure Cosmos DB automatically manages the number of physical partitions behind your logical partitions to scale with your data and throughput needs. As your data grows or your request rate increases, Cosmos DB can:
- Add More Physical Partitions: To handle increased load and larger data volumes, Cosmos DB will add more physical partitions.
- Rebalance Data: Data is automatically rebalanced across these physical partitions to maintain even distribution.
The total throughput of your container is distributed across its physical partitions. If you have a high throughput requirement, ensuring your data is spread across many logical partitions (via a good partition key) is essential to utilize the available physical partitions effectively.
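The arithmetic behind this can be sketched as follows, assuming the provisioned throughput is divided evenly among physical partitions. The function `overloaded_partitions`, the partition names, and all numbers are hypothetical; the point is only that a container can be under its total budget while one partition is over its share.

```python
def overloaded_partitions(load_by_partition, container_rus):
    """Return the partitions whose load exceeds an even share of the container's RU/s."""
    share = container_rus / len(load_by_partition)
    return {name: load for name, load in load_by_partition.items() if load > share}


# Hypothetical container: 12,000 RU/s spread over three physical partitions,
# so each partition can serve 4,000 RU/s.
load = {"p0": 9000, "p1": 500, "p2": 500}
hot = overloaded_partitions(load, 12_000)

print("total load:", sum(load.values()))  # below the 12,000 RU/s provisioned
print("over their share:", hot)
```

Here the total load of 10,000 RU/s fits within the provisioned 12,000 RU/s, yet p0 would still be throttled because it alone needs more than its 4,000 RU/s share.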
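Partition Size Limit: Each physical partition can store up to 50 GB and serve up to 10,000 RU/s, while a single logical partition can store up to 20 GB. A logical partition is never split: all items with the same partition key value stay together. When a physical partition approaches its limits, Cosmos DB automatically splits it into two physical partitions and redistributes its logical partitions between them. This is a key mechanism for maintaining performance as your data grows.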
Query Performance and Partitioning
The way you structure your queries, especially the inclusion of the partition key in your filters, has a significant impact on query performance and cost.
- Partition Key Routing: When a query includes the partition key in its filter clause (e.g., WHERE c.userId = 'some-user-id'), Cosmos DB routes it only to the physical partition that holds the matching logical partition, which significantly reduces the number of RUs consumed.
- Broad Queries: Queries that do not include the partition key in their filter clause are "fan-out" (cross-partition) queries. Cosmos DB must send the query to every physical partition to retrieve the results, which is less efficient and consumes more RUs.
- Transactions Across Partitions: Cross-partition queries are possible but generally more expensive and slower than single-partition queries. Transactions (stored procedures, triggers, and transactional batch) are scoped to a single logical partition and cannot span partition key values. Design your data model and partition key so that data that must be read or updated together lives in the same logical partition.
Example of an Efficient Query:
SELECT * FROM c WHERE c.userId = 'a7b4f9c1-e8d3-4a2b-8f0e-1c9a3b5d7e8f'
This query uses the partition key and will be routed efficiently.
Example of a Less Efficient Query:
SELECT * FROM c WHERE c.email = 'jane.doe@example.com'
If email is not the partition key, this query will fan out to all partitions.
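The difference between the two queries can be modeled with a small sketch that decides which partitions a query must visit. Everything in it is an assumption made for illustration: the SHA-256 modulo hash stands in for the service's own (different) hashing scheme, the partition count of 4 is arbitrary, and `partitions_to_query` is a hypothetical helper, not part of any SDK.

```python
import hashlib


def physical_partition_for(key_value, partition_count):
    """Map a partition key value to a partition index (illustrative hash only)."""
    digest = hashlib.sha256(str(key_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def partitions_to_query(filter_properties, partition_key_property, partition_count):
    """Return the partition indexes a query with these equality filters must visit."""
    if partition_key_property in filter_properties:
        key_value = filter_properties[partition_key_property]
        return [physical_partition_for(key_value, partition_count)]
    # No partition key in the filter: every partition has to be asked.
    return list(range(partition_count))


targeted = partitions_to_query(
    {"userId": "a7b4f9c1-e8d3-4a2b-8f0e-1c9a3b5d7e8f"}, "userId", 4
)
fan_out = partitions_to_query({"email": "jane.doe@example.com"}, "userId", 4)

print("filter on userId visits", len(targeted), "partition")
print("filter on email visits", len(fan_out), "partitions")
```

The cost of the fan-out query grows with the number of physical partitions, whereas the targeted query always visits exactly one.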
Indexing Considerations:
Ensure your indexing policy is configured appropriately for your partition key and common query patterns to further optimize performance.