Partitioning Overview

This document provides an overview of partitioning in Azure Cosmos DB, a critical concept for scaling your database effectively.

What is Partitioning?

Partitioning, also known as sharding, scales your database horizontally by distributing data across multiple partitions. Items are grouped into logical partitions by their partition key value, and each logical partition is stored on a physical partition that holds a subset of your data. Azure Cosmos DB handles the complexities of partitioning, replication, and load balancing for you.

Partition Key

The cornerstone of partitioning is the partition key. When you create a container (e.g., a collection in SQL API, a graph in Gremlin API), you must choose a property from your items to act as the partition key.

Choosing a Good Partition Key

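The ideal partition key should have many distinct values (high cardinality) and should spread both storage and request traffic evenly across those values.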

Note: Avoid partition keys with very few distinct values (e.g., /status with values like 'active', 'inactive') as this can lead to hot partitions and underutilization of provisioned throughput.
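
One way to sanity-check a candidate key before committing to it is to measure how many distinct values it has and how large the biggest group is. The following is a minimal sketch on made-up sample data; the helper name key_distribution and the sample items are assumptions for illustration, not part of any Cosmos DB SDK.

```python
from collections import Counter

def key_distribution(items, key):
    """Return (distinct_values, largest_share) for a candidate partition key property."""
    counts = Counter(item[key] for item in items)
    total = sum(counts.values())
    largest_share = max(counts.values()) / total
    return len(counts), largest_share

# Hypothetical sample: 1,000 items, 90% of them 'active', spread over 250 customers.
items = [
    {"status": "active" if i % 10 else "inactive", "customerId": f"CUST-{i % 250:04d}"}
    for i in range(1000)
]

status_stats = key_distribution(items, "status")        # few values, one dominant
customer_stats = key_distribution(items, "customerId")  # many values, evenly spread
print(status_stats, customer_stats)
```

A key whose largest group holds most of the items, as /status does here, concentrates both storage and traffic on one logical partition.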

Logical and Physical Partitions

Azure Cosmos DB abstracts the complexity of physical storage. You interact with logical partitions, which are collections of items that share the same partition key value. The service maps these logical partitions onto physical partitions, which it manages for you.
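
To make the idea concrete, the sketch below hashes a partition key value to pick a physical partition. Azure Cosmos DB manages this mapping internally; the SHA-256 hash, the modulo scheme, and the function name physical_partition_for are stand-ins chosen for illustration and do not describe the service's actual algorithm.

```python
import hashlib

def physical_partition_for(partition_key_value: str, physical_partition_count: int) -> int:
    """Map a partition key value to a physical partition index (illustrative only)."""
    digest = hashlib.sha256(partition_key_value.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % physical_partition_count

# Every item with the same partition key value lands on the same physical partition.
first = physical_partition_for("CUST-ABC123", 4)
second = physical_partition_for("CUST-ABC123", 4)
print(first == second)
```

The property worth noticing is determinism: all items in one logical partition are always stored together, which is why a single oversized or overly busy key value cannot be spread out.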

Partitioning Strategy

A well-defined partitioning strategy is crucial for performance and scalability.

1. Select the Partition Key

Based on your data model and common query patterns, select an appropriate partition key. For example:

// Example: SQL API Item Structure
{
    "id": "12345",
    "orderId": "ORD-98765",
    "customerId": "CUST-ABC123",
    "orderDate": "2023-10-27T10:00:00Z",
    "totalAmount": 150.75
}

In this example, /customerId or /orderDate could be potential partition keys. /customerId would be suitable for queries that fetch all orders for a specific customer. /orderDate might be useful for time-series analysis but could lead to hot partitions if most orders occur on the same day.
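
The trade-off can be sketched in memory by grouping a few orders under each candidate key and counting how many logical partitions a "all orders for one customer" query would have to read. The sample orders and the helper names group_by_partition_key and partitions_touched are hypothetical, and the dictionary is only a model of logical partitions, not how the service stores data.

```python
from collections import defaultdict

# Hypothetical orders; orderDate is shortened to the day for readability.
orders = [
    {"id": "1", "customerId": "CUST-ABC123", "orderDate": "2023-10-27"},
    {"id": "2", "customerId": "CUST-ABC123", "orderDate": "2023-10-28"},
    {"id": "3", "customerId": "CUST-XYZ789", "orderDate": "2023-10-27"},
    {"id": "4", "customerId": "CUST-XYZ789", "orderDate": "2023-10-27"},
]

def group_by_partition_key(items, key):
    """Group items into logical partitions keyed by the chosen property."""
    partitions = defaultdict(list)
    for item in items:
        partitions[item[key]].append(item)
    return partitions

def partitions_touched(partitions, customer_id):
    """Count logical partitions that hold at least one order for the customer."""
    return sum(
        1 for items in partitions.values()
        if any(order["customerId"] == customer_id for order in items)
    )

by_customer = group_by_partition_key(orders, "customerId")
by_date = group_by_partition_key(orders, "orderDate")

print(partitions_touched(by_customer, "CUST-ABC123"))  # one partition holds everything
print(partitions_touched(by_date, "CUST-ABC123"))      # the orders span several partitions
print(len(by_date["2023-10-27"]))                      # a busy day collects most items
```

With /customerId the customer query stays inside one logical partition, while with /orderDate it fans out and a busy day accumulates a disproportionate share of the items.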

2. Understand Throughput Distribution

The total provisioned Request Units (RUs) for a container are distributed evenly across all its physical partitions. If you have a hot partition (one that receives a disproportionately high amount of traffic), it can exhaust its share of the throughput and become a bottleneck, even while other partitions have capacity to spare.
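
The arithmetic behind this is simple enough to sketch. The figures below (8,000 RU/s over four physical partitions, and the per-partition demand) are made-up numbers, and the function names are hypothetical.

```python
def per_partition_throughput(provisioned_rus: float, physical_partitions: int) -> float:
    """Even split of the container's provisioned RU/s across physical partitions."""
    return provisioned_rus / physical_partitions

def excess_demand(demand_per_partition, provisioned_rus):
    """RU/s that each partition is asked for beyond its even share."""
    share = per_partition_throughput(provisioned_rus, len(demand_per_partition))
    return [max(0.0, demand - share) for demand in demand_per_partition]

# Hypothetical workload: total demand is 5,000 RU/s against 8,000 RU/s provisioned,
# but one partition receives most of the traffic.
demand = [500, 600, 400, 3500]
print(per_partition_throughput(8000, 4))  # 2000.0
print(excess_demand(demand, 8000))        # [0.0, 0.0, 0.0, 1500.0]
```

Even though total demand is well under the provisioned amount, the hot partition is asked for 1,500 RU/s more than its share, while the other three leave most of theirs unused.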

3. Broad Partition Key Values

To avoid hot partitions, ensure your chosen partition key has a sufficient number of distinct values. For example, if you have 10,000 customers, using /customerId as the partition key will likely result in good distribution.

4. Synthetic Partition Keys

In scenarios where natural partition keys have low cardinality, you can create a "synthetic" partition key by combining two or more properties. For instance, you could combine customerId and orderDate to create a synthetic key like customerId-orderDate.

// Example Synthetic Partition Key Value
"CUST-ABC123-2023-10-27"

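A synthetic key like the one above is typically computed by the application and stored as an extra property on the item before it is written. The sketch below assumes the day portion of orderDate is used and that the value is stored in a property named partitionKey; both choices, and the function name, are illustrative assumptions.

```python
def synthetic_partition_key(item: dict) -> str:
    """Combine customerId with the day portion of orderDate."""
    order_day = item["orderDate"][:10]  # "2023-10-27T10:00:00Z" -> "2023-10-27"
    return f'{item["customerId"]}-{order_day}'

order = {
    "id": "12345",
    "orderId": "ORD-98765",
    "customerId": "CUST-ABC123",
    "orderDate": "2023-10-27T10:00:00Z",
    "totalAmount": 150.75,
}

# Store the computed value on the item so it can serve as the partition key.
order["partitionKey"] = synthetic_partition_key(order)
print(order["partitionKey"])  # CUST-ABC123-2023-10-27
```

The cost of this design is that a query must know both parts of the key to target a single logical partition; fetching all orders for a customer across many days then spans several partitions.
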
Partition Key Limits

Each logical partition has a maximum storage size (20 GB at the time of writing), so a single partition key value cannot hold an unbounded amount of data. Check the current service quotas before designing around a specific limit.

Key Takeaway: The partition key is the most critical decision for scaling and performance in Azure Cosmos DB. Choose wisely based on your data and access patterns.

Common Scenarios