Azure Cosmos DB Storage Overview

Key Takeaways

Azure Cosmos DB is a globally distributed, multi-model database service that supports NoSQL data. This overview covers its storage aspects, including how data is structured, managed, and optimized for performance and scalability.

Introduction to Azure Cosmos DB Storage

Azure Cosmos DB is built as a cloud-native database service. Its architecture is designed for elastic scalability, high availability, and low latency, making it suitable for a wide range of applications. Understanding its storage model is crucial for optimizing performance and cost.

Unlike traditional relational databases that store data in tables with predefined schemas, Azure Cosmos DB stores data in items. These items are typically represented as JSON documents. The database service manages the physical storage, indexing, and distribution of these items across multiple regions and partitions.

Data Model and Structure

Azure Cosmos DB supports multiple data models, including document, key-value, graph, and column-family. Regardless of the model, data is stored as items, the smallest unit of data in Cosmos DB. For document data, items are represented as JSON documents.

Items: The atomic units of data.
Containers: Logical groupings of items. Depending on the API, a container is surfaced as a collection (for document and key-value data), a table, or a graph (for graph data).
Databases: Logical namespaces that each contain one or more containers.

Items within a container do not require a predefined schema. This schema-agnostic approach allows for flexible data modeling and iteration.
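
The hierarchy above can be pictured with a small in-memory sketch. The Database and Container classes below are hypothetical stand-ins written for illustration, not the Azure SDK. The sketch assumes each item carries an id property and only shows that items of different shapes can share one container.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Container:
    """Hypothetical in-memory stand-in for a container: items keyed by id."""
    name: str
    items: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def upsert(self, item: Dict[str, Any]) -> None:
        # The only structural requirement modeled here is an "id" property.
        if "id" not in item:
            raise ValueError("item needs an 'id' property")
        self.items[item["id"]] = item


@dataclass
class Database:
    """Hypothetical namespace that groups containers by name."""
    name: str
    containers: Dict[str, Container] = field(default_factory=dict)

    def container(self, name: str) -> Container:
        # Return the existing container or create an empty one.
        if name not in self.containers:
            self.containers[name] = Container(name)
        return self.containers[name]


db = Database("appdb")
todos = db.container("todos")
# Two items with different properties live side by side; no schema is declared.
todos.upsert({"id": "todo1", "category": "personal", "name": "Buy groceries"})
todos.upsert({"id": "todo2", "category": "work", "dueDate": "2024-01-31", "tags": ["q1"]})
print(sorted(todos.items))
```

The second item adds dueDate and tags without any change to the container, which is the flexibility the schema-agnostic model is meant to provide.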

Partitioning for Scalability

To achieve horizontal scalability, Azure Cosmos DB uses partitioning. Data within a container is divided into logical partitions. Each logical partition is the set of items that share the same partition key value.

A partition key is a property within your item that Azure Cosmos DB uses to distribute data and requests across logical partitions. Choosing an effective partition key is critical for performance and scalability. A good partition key should have a high cardinality and distribute requests evenly across partitions.

Example of an item with a partition key:

{
    "id": "todo1",
    "category": "personal",
    "name": "Buy groceries",
    "description": "Milk, Bread, Eggs",
    "isComplete": false,
    "partitionKey": "personal"
}

In this example, the container's partition key path would be /partitionKey, so the item falls in the logical partition whose key value is "personal". In practice, an existing property such as category or userId often serves as the partition key.
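
To make the cardinality advice concrete, the following sketch compares how evenly two candidate keys spread synthetic items. It is only an illustration: the SHA-256 bucketing, the bucket count of 8, and the generated data are assumptions made here, not the hash function or partition layout the service actually uses.

```python
import hashlib
from collections import Counter
from typing import Any, Dict, Iterable


def bucket_for(key_value: str, bucket_count: int) -> int:
    """Map a partition key value to a bucket with a stable hash (illustrative only)."""
    digest = hashlib.sha256(key_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % bucket_count


def distribution(items: Iterable[Dict[str, Any]], key_property: str, bucket_count: int) -> Counter:
    """Count how many items land in each bucket for a candidate partition key."""
    counts: Counter = Counter()
    for item in items:
        counts[bucket_for(str(item[key_property]), bucket_count)] += 1
    return counts


# Synthetic data: 1,000 items, 2 distinct categories, 1,000 distinct users.
items = [
    {"id": f"todo{i}", "category": "personal" if i % 10 else "work", "userId": f"user{i}"}
    for i in range(1000)
]

by_category = distribution(items, "category", 8)
by_user = distribution(items, "userId", 8)
print("category: buckets used =", len(by_category), "largest =", max(by_category.values()))
print("userId:   buckets used =", len(by_user), "largest =", max(by_user.values()))
```

With only two distinct values, category can occupy at most two buckets and most items pile into one of them, while userId spreads across many more. The same reasoning applies to request volume, not just item counts.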

Indexing

By default, Azure Cosmos DB indexes every property of every item in a container as the item is written. No manual schema definitions or index management are required.

The indexing policy can be customized to include or exclude property paths, set the indexing mode (consistent or none; a lazy mode also exists but is discouraged in current documentation), and define composite indexes for specific query patterns.

Indexing Policy Example:

{
    "indexingMode": "consistent",
    "automatic": true,
    "includedPaths": [
        { "path": "/*" }
    ],
    "excludedPaths": [
        { "path": "/path/to/exclude/*" }
    ]
}
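
A policy like the one above can also be assembled in code before it is submitted to the service. The sketch below is a minimal illustration: build_indexing_policy is a hypothetical helper, and the excluded description path and the category/name composite index are assumptions chosen to match the to-do item shown earlier.

```python
import json
from typing import Any, Dict, List, Sequence, Tuple


def build_indexing_policy(
    excluded_paths: Sequence[str],
    composite_indexes: Sequence[Sequence[Tuple[str, str]]] = (),
) -> Dict[str, Any]:
    """Build an indexing policy dict: index everything except the excluded paths."""
    for path in excluded_paths:
        if not path.startswith("/"):
            raise ValueError(f"path must start with '/': {path!r}")

    policy: Dict[str, Any] = {
        "indexingMode": "consistent",
        "automatic": True,
        "includedPaths": [{"path": "/*"}],
        "excludedPaths": [{"path": path} for path in excluded_paths],
    }

    composites: List[List[Dict[str, str]]] = []
    for index in composite_indexes:
        entries = []
        for path, order in index:
            if order not in ("ascending", "descending"):
                raise ValueError(f"unknown sort order: {order!r}")
            entries.append({"path": path, "order": order})
        composites.append(entries)
    if composites:
        policy["compositeIndexes"] = composites
    return policy


# Hypothetical choices: skip indexing the free-text description, and support
# queries that sort by category and then by name.
policy = build_indexing_policy(
    excluded_paths=["/description/?"],
    composite_indexes=[[("/category", "ascending"), ("/name", "ascending")]],
)
print(json.dumps(policy, indent=4))
```

Excluding properties that are never filtered or sorted on reduces the work done on each write, while a composite index targets a specific multi-property query pattern.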
                
            

Request Units (RUs)

Request Units (RUs) are a normalized measure of the throughput that Azure Cosmos DB provides. They abstract the CPU, memory, and I/O resources required to perform database operations, so every operation has a cost expressed in RUs. For example, a point read of a 1 KB item by its id and partition key costs 1 RU.

Understanding RU consumption is key to managing costs and performance. You provision throughput for your containers or database, and this throughput is measured in RUs per second (RU/s).
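
As a rough illustration of how provisioned throughput relates to a workload, the sketch below sums operation rates multiplied by per-operation charges and adds headroom. The required_throughput helper, the 20 percent headroom, and the write and query charges are assumptions for illustration; real charges should be measured from the request charge the service reports for each operation.

```python
import math
from typing import Iterable, Tuple


def required_throughput(
    workload: Iterable[Tuple[float, float]], headroom_percent: int = 20
) -> int:
    """Estimate RU/s to provision from (operations_per_second, ru_per_operation) pairs."""
    base = sum(ops * ru for ops, ru in workload)
    return math.ceil(base * (100 + headroom_percent) / 100)


# Hypothetical workload; the write and query charges are placeholder values.
workload = [
    (500, 1),   # point reads of roughly 1 KB items
    (100, 6),   # writes (assumed charge)
    (20, 10),   # queries (assumed charge)
]
estimate = required_throughput(workload)
print(estimate)
```

Here the base load is 500 + 600 + 200 = 1300 RU/s, and the headroom raises the estimate to 1560 RU/s.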

Storage Management and Optimization

Azure Cosmos DB handles the underlying storage complexity. However, you can optimize storage and performance through:

Partition Key Selection: A well-chosen partition key ensures even data distribution and avoids hot partitions.
Indexing Policies: Optimize indexing by excluding unnecessary paths or using composite indexes for frequent query patterns.
Item Size: Items can be as large as 2 MB, but large items increase RU consumption and can affect latency.
Lease Management: For certain application patterns (like change feed processing), efficient lease management is important.
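
For the item size point above, a quick client-side check can flag items that are growing large before they are written. This is a sketch under assumptions: item_size_bytes and oversized are hypothetical helpers, the 4 KB threshold is arbitrary, and the size of the compact UTF-8 JSON encoding is only an approximation of what the service measures.

```python
import json
from typing import Any, Dict, Iterable, List


def item_size_bytes(item: Dict[str, Any]) -> int:
    """Approximate an item's size as the length of its compact UTF-8 JSON encoding."""
    encoded = json.dumps(item, separators=(",", ":"), ensure_ascii=False)
    return len(encoded.encode("utf-8"))


def oversized(items: Iterable[Dict[str, Any]], threshold: int) -> List[str]:
    """Return the ids of items whose approximate size exceeds the threshold in bytes."""
    return [item["id"] for item in items if item_size_bytes(item) > threshold]


small = {"id": "small", "payload": "x"}
big = {"id": "big", "payload": "x" * 5000}

# 4096 bytes is an arbitrary review threshold chosen for this example.
print(item_size_bytes(small), item_size_bytes(big))
print(oversized([small, big], threshold=4096))
```

Items that exceed the chosen threshold are candidates for splitting into smaller items or for moving large payloads elsewhere and storing a reference.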