Azure Cosmos DB Concepts

Azure Cosmos DB is a globally distributed, multi-model database service for creating and querying NoSQL, relational, and graph databases with minimal development effort. It is designed to be massively scalable, highly available, and low-latency.

Introduction to Azure Cosmos DB

Azure Cosmos DB is Microsoft's globally distributed, multi-model database service. It offers a variety of data models and APIs, including SQL (formerly DocumentDB), MongoDB, Cassandra, Gremlin (graph), and Table. You can use the API that best suits your application while still benefiting from the core features of Cosmos DB, such as global distribution and elastic throughput.

Data Model

Azure Cosmos DB supports several data models:

Document: Data is stored as JSON documents, as in other document databases.
Key-Value: Data is stored as key-value pairs, suitable for simple lookups.
Graph: Data is stored as nodes and edges, representing relationships, for graph traversal.
Column-Family: Data is organized into column families, suitable for wide-column store use cases.
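
To make the four models concrete, the sketch below expresses one hypothetical customer record in each shape using plain Python structures. Every field name and value is an illustrative assumption, not a Cosmos DB schema.

```python
# One hypothetical customer, expressed in each of the four data models.
# All names and values here are illustrative assumptions.

# Document: a self-contained JSON-like object with nested fields.
document = {
    "id": "c1",
    "name": "Ada",
    "address": {"city": "Oslo", "country": "NO"},
}

# Key-value: an opaque value looked up by a single key.
key_value = {"customer:c1": '{"name": "Ada", "city": "Oslo"}'}

# Graph: vertices plus edges that describe relationships between them.
vertices = [
    {"id": "c1", "label": "customer"},
    {"id": "oslo", "label": "city"},
]
edges = [{"from": "c1", "to": "oslo", "label": "livesIn"}]

# Column-family: rows keyed by an ID, with columns grouped into families.
column_family = {
    "c1": {
        "profile": {"name": "Ada"},
        "location": {"city": "Oslo", "country": "NO"},
    }
}


def neighbors(vertex_id: str, label: str) -> list:
    """Follow outgoing edges with the given label (a one-hop traversal)."""
    return [e["to"] for e in edges if e["from"] == vertex_id and e["label"] == label]
```

Here `neighbors("c1", "livesIn")` walks the single edge from the customer to the city, which is the kind of lookup the graph model is meant for.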

APIs

Cosmos DB provides multiple APIs to interact with your data:

Core (SQL) API: The native API for Azure Cosmos DB, offering rich querying capabilities over JSON documents.
MongoDB API: Compatible with the MongoDB wire protocol, allowing existing MongoDB applications to connect to Cosmos DB.
Cassandra API: Compatible with the Cassandra Query Language (CQL), enabling Cassandra applications to leverage Cosmos DB's global distribution and scalability.
Gremlin API: Implements the Apache TinkerPop Gremlin standard for graph databases.
Table API: Compatible with the Azure Table storage data model.

Accounts, Databases, Containers, and Items

The core hierarchical structure in Azure Cosmos DB is:

Account: The top-level resource, representing your Cosmos DB deployment. It can contain multiple databases.
Database: A logical namespace that groups containers.
Container: The fundamental unit of scalability and throughput in Cosmos DB. It stores items and their associated index. Containers can be provisioned with dedicated throughput or use serverless capacity.
Item: The basic unit of data stored in a container. For the SQL API, items are JSON documents.
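
The hierarchy above can be pictured with a small in-memory model. This is a toy sketch, not the Cosmos DB SDK: the class names, the `create_*` and `upsert` helpers, and the sample `shop`/`orders` names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Container:
    """Holds items, keyed by their "id" property."""
    name: str
    partition_key_path: str
    items: Dict[str, dict] = field(default_factory=dict)

    def upsert(self, item: dict) -> None:
        # Insert the item, or overwrite an existing item with the same id.
        self.items[item["id"]] = item


@dataclass
class Database:
    """A logical namespace that groups containers."""
    name: str
    containers: Dict[str, Container] = field(default_factory=dict)

    def create_container(self, name: str, partition_key_path: str) -> Container:
        # Return the existing container if one with this name already exists.
        return self.containers.setdefault(name, Container(name, partition_key_path))


@dataclass
class Account:
    """The top-level resource; contains databases."""
    name: str
    databases: Dict[str, Database] = field(default_factory=dict)

    def create_database(self, name: str) -> Database:
        return self.databases.setdefault(name, Database(name))


account = Account("demo-account")
container = account.create_database("shop").create_container("orders", "/customerId")
container.upsert({"id": "o-1", "customerId": "c1", "total": 42})
```

Each level only knows about the level directly beneath it, which mirrors how an item is always addressed through its account, database, and container.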

Partitioning

To achieve massive scalability, Azure Cosmos DB partitions data horizontally. You select a partition key when you create a container. The key is a property path in each item's JSON document, and Cosmos DB hashes that property's value to decide which logical partition stores the item, so all items that share a key value live in the same logical partition.

Choosing an effective partition key is essential for performance and scalability. A key with many distinct values that are accessed evenly distributes data and requests uniformly across partitions, while a skewed key concentrates load on a few hot partitions.
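
The effect of key choice on distribution can be sketched with a simple hash-and-modulo model. This is only an approximation under stated assumptions: MD5 stands in for the hash Cosmos DB uses internally, the fixed partition count is arbitrary, and the `customerId` and `country` properties are hypothetical.

```python
import hashlib
from collections import Counter
from typing import List


def partition_for(key_value: str, partition_count: int) -> int:
    """Map a partition key value to a partition index.

    MD5 is a stand-in; the hash Cosmos DB uses internally is not modeled here.
    """
    digest = hashlib.md5(key_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def distribution(items: List[dict], key: str, partition_count: int) -> Counter:
    """Count how many items land in each partition for a given key property."""
    return Counter(partition_for(str(item[key]), partition_count) for item in items)


# Hypothetical orders: 1,000 distinct customers, but only 2 distinct countries.
orders = [
    {"id": str(i), "customerId": f"c{i}", "country": "NO" if i % 2 else "SE"}
    for i in range(1000)
]

by_customer = distribution(orders, "customerId", partition_count=10)
by_country = distribution(orders, "country", partition_count=10)
```

Partitioning by `customerId` spreads the items over many partitions, whereas partitioning by `country` can occupy at most two of them because there are only two distinct key values.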

Indexing

Azure Cosmos DB automatically indexes every property of every item stored in a container. The indexing policy, which is configurable, dictates how this indexing is performed. The default indexing policy applies range indexes to all properties, providing broad query support. Composite and spatial indexes are not part of the default policy and must be added explicitly.

The default policy looks like this:

{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    {
      "path": "/*"
    }
  ],
  "excludedPaths": [
    {
      "path": "/\"_etag\"/?"
    }
  ]
}
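
Policies are often customized to exclude large paths that are never queried, or to add a composite index for queries that order by more than one property. The sketch below builds such a policy as a plain dictionary and prints it as JSON. The `build_indexing_policy` helper and the `/payload`, `/customerId`, and `/orderDate` paths are hypothetical; only the policy field names follow the Cosmos DB indexing policy format.

```python
import json
from typing import List, Tuple


def build_indexing_policy(
    excluded: List[str],
    composite: List[List[Tuple[str, str]]],
) -> dict:
    """Build a policy that indexes everything except the `excluded` paths.

    Each composite index is a list of (path, order) pairs, where order is
    "ascending" or "descending".
    """
    return {
        "indexingMode": "consistent",
        "automatic": True,
        "includedPaths": [{"path": "/*"}],
        "excludedPaths": [{"path": p} for p in excluded],
        "compositeIndexes": [
            [{"path": path, "order": order} for path, order in index]
            for index in composite
        ],
    }


policy = build_indexing_policy(
    excluded=["/payload/*"],
    composite=[[("/customerId", "ascending"), ("/orderDate", "descending")]],
)
print(json.dumps(policy, indent=2))
```

Note that composite indexes sit in their own top-level `compositeIndexes` array, separate from `includedPaths`.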

Consistency Models

Azure Cosmos DB offers five well-defined consistency levels, allowing you to balance consistency, availability, and latency:

Strong: Guarantees that reads always return the most recent committed write.
Bounded Staleness: Reads may lag behind writes by at most a configured number of versions or a configured time interval.
Session: Within a single client session, guarantees read-your-writes and monotonic reads. This is the default level.
Consistent Prefix: Reads never see writes out of order. They return some prefix of the write sequence, which may not include the latest writes.
Eventual: The weakest level. Reads may return stale data with no ordering guarantee, although replicas converge over time.
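
The difference between Session and Eventual reads can be illustrated with a toy replicated log. This is a simplified model, not how Cosmos DB is implemented: the `ToyStore` class is hypothetical, the session token is just a write sequence number, and "catching up" a replica stands in for waiting or retrying against another replica.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Replica:
    """A replica that has applied the first `lsn` writes of the shared log."""
    lsn: int = 0
    data: Dict[str, str] = field(default_factory=dict)


class ToyStore:
    def __init__(self, replica_count: int) -> None:
        self.log: List[Tuple[str, str]] = []
        self.replicas = [Replica() for _ in range(replica_count)]

    def write(self, key: str, value: str) -> int:
        """Append to the log, apply to replica 0 only, return a session token."""
        self.log.append((key, value))
        self._catch_up(self.replicas[0])
        return len(self.log)  # token = sequence number of this write

    def _catch_up(self, replica: Replica) -> None:
        # Apply, in order, every logged write the replica has not seen yet.
        for key, value in self.log[replica.lsn:]:
            replica.data[key] = value
        replica.lsn = len(self.log)

    def read_eventual(self, key: str, replica_index: int) -> Optional[str]:
        """Read whatever the chosen replica has, however stale."""
        return self.replicas[replica_index].data.get(key)

    def read_session(self, key: str, replica_index: int, token: int) -> Optional[str]:
        """Read only after the replica has caught up to the session token."""
        replica = self.replicas[replica_index]
        if replica.lsn < token:
            self._catch_up(replica)  # stand-in for waiting or retrying elsewhere
        return replica.data.get(key)


store = ToyStore(replica_count=2)
token = store.write("cart", "3 items")
stale = store.read_eventual("cart", replica_index=1)  # replica 1 still lags
fresh = store.read_session("cart", replica_index=1, token=token)
```

The eventual read hits a lagging replica and misses the write, while the session read uses the token to make sure the client sees its own write.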

Throughput

Throughput in Azure Cosmos DB is measured in Request Units (RUs) per second. You can provision throughput in two ways:

Manual: You specify the exact number of RUs you need for a container or database.
Autoscale: You set a maximum throughput, and the system scales the provisioned throughput between 10% of that maximum and the maximum based on usage.

Each operation (e.g., reading an item, querying) consumes a certain number of RUs, depending on the operation type, item size, and indexing. Understanding RU consumption is key to cost management and performance tuning.
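
A rough capacity estimate multiplies expected operations per second by their RU cost. The sketch below shows that arithmetic under stated assumptions: a point read of a 1 KB item costs about 1 RU, but the write and query costs are illustrative placeholders, and the rounding to multiples of 100 RU/s and the 10% autoscale floor are modeled as assumptions. Measure the request charge of your real operations before relying on any of these figures.

```python
import math
from typing import Tuple

# Assumed per-operation costs. Only the point-read figure is a common rule of
# thumb; the write and query figures are illustrative placeholders.
RU_PER_POINT_READ = 1.0
RU_PER_WRITE = 5.0
RU_PER_QUERY = 10.0


def required_rus(reads_per_s: float, writes_per_s: float, queries_per_s: float) -> int:
    """Estimate peak RU/s and round up to the next multiple of 100."""
    raw = (
        reads_per_s * RU_PER_POINT_READ
        + writes_per_s * RU_PER_WRITE
        + queries_per_s * RU_PER_QUERY
    )
    return max(100, math.ceil(raw / 100) * 100)


def autoscale_range(max_rus: int) -> Tuple[int, int]:
    """Return the (min, max) RU/s an autoscale setting scales between.

    Assumes the floor is 10% of the configured maximum.
    """
    return max_rus // 10, max_rus


# 500 reads + 100 writes + 20 queries per second at the assumed costs.
peak = required_rus(reads_per_s=500, writes_per_s=100, queries_per_s=20)
low, high = autoscale_range(4000)
```

With the assumed costs, the example workload needs 500 + 500 + 200 = 1,200 RU/s at peak, and an autoscale maximum of 4,000 RU/s would scale between 400 and 4,000 RU/s.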

Regions and Replication

Azure Cosmos DB is a globally distributed service. You can replicate your data across multiple Azure regions for high availability and low-latency access for users worldwide. You choose which regions your account is deployed in, and you can enable multi-region writes (formerly called multi-master) for active-active global distribution.

Global distribution and multi-region writes are key differentiators for building highly available and responsive applications.
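
One way to picture client-side routing across regions is an ordered list of preferred regions with fallback. This is a simplified sketch, not SDK behavior: the `pick_region` function is hypothetical, and the region list and availability sets are made up for illustration.

```python
from typing import List, Set


def pick_region(preferred: List[str], available: Set[str]) -> str:
    """Return the first preferred region that is currently available."""
    for region in preferred:
        if region in available:
            return region
    raise RuntimeError("no preferred region is available")


# Hypothetical preference order for a client located in Europe.
preferred_regions = ["West Europe", "North Europe", "East US"]

# All regions healthy: the closest preferred region is used.
normal = pick_region(preferred_regions, {"West Europe", "North Europe", "East US"})

# West Europe unavailable: requests fall back to the next preferred region.
failover = pick_region(preferred_regions, {"North Europe", "East US"})
```

Listing regions in order of proximity keeps latency low in the normal case while still giving requests somewhere to go during a regional outage.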