Core Concepts
Event Hubs: The Foundation
Azure Event Hubs is a highly scalable event ingestion and data streaming service that can receive and process millions of events per second, letting you process and analyze streaming data in real time. Think of it as a massive, distributed log that ingests data from a wide variety of sources.
Events
An event is a small unit of information. In Event Hubs, an event is typically a serialized record of telemetry, application logs, or any other type of data. Events have a body (the payload) and a set of properties that provide metadata about the event.
Producers and Consumers
Producers are applications or services that send events to an Event Hub. These can be anything from IoT devices generating sensor data to web applications logging user activity.
Consumers are applications or services that read events from an Event Hub. They process the incoming stream of data for various purposes like real-time analytics, data warehousing, or triggering subsequent actions.
Namespaces, Event Hubs, and Partitions
The structure of Event Hubs follows a hierarchical model:
- Namespace: A logical container for one or more Event Hubs. It provides a unique scope and management boundary. All Event Hubs within a namespace share common settings, such as region and authentication methods.
- Event Hub: The actual data stream within a namespace. This is where events are sent and from where they are consumed. An Event Hub can have one or more partitions.
- Partition: An Event Hub is divided into one or more partitions: ordered, append-only sequences of events. New events are always appended to the end of a partition. This partitioning is key to Event Hubs' scalability and high throughput, because consumers can read from different partitions independently and in parallel.
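The hierarchy above can be pictured with a minimal in-memory model. This is an illustrative sketch only (the class names and structure are invented for the example, not the real service implementation):

```python
# Minimal in-memory model of an Event Hub and its partitions.
# Illustrative sketch only -- not how the actual service is implemented.

class Partition:
    """An ordered, append-only sequence of events."""
    def __init__(self):
        self.events = []

    def append(self, event):
        offset = len(self.events)  # position of this event within the partition
        self.events.append(event)
        return offset

class EventHub:
    """A named stream made up of one or more partitions."""
    def __init__(self, partition_count):
        self.partitions = [Partition() for _ in range(partition_count)]

hub = EventHub(partition_count=4)
offset = hub.partitions[0].append({"sensorId": "sensor-123", "value": 25.5})
print(offset)  # 0 -- the first event appended to partition 0
```

Note how an event's position is fixed at append time: once written, events in a partition are never reordered.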
Consumer Groups
To allow multiple applications to independently consume events from the same Event Hub without interfering with each other, Event Hubs uses the concept of consumer groups. Each consumer group maintains its own read position within each partition. This means different applications can process the data stream at their own pace and with their own logic.
When you create an Event Hub, a default consumer group named $Default is automatically created. You can create additional consumer groups for specific application needs.
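The effect of consumer groups can be sketched as each group holding its own checkpoint per partition. The structure below is a simplified illustration (real consumers typically persist checkpoints to external storage):

```python
# Sketch of consumer groups: each group keeps its own read position
# (checkpoint) in a partition, so groups consume independently.

partition = ["e0", "e1", "e2", "e3"]           # one partition's events

checkpoints = {"$Default": 0, "analytics": 0}  # read position per consumer group

def read_next(group):
    """Return the next event for a consumer group and advance its checkpoint."""
    pos = checkpoints[group]
    if pos >= len(partition):
        return None  # caught up; nothing new to read
    checkpoints[group] = pos + 1
    return partition[pos]

read_next("$Default")   # "e0"
read_next("$Default")   # "e1"
read_next("analytics")  # "e0" -- unaffected by $Default's progress
```

Because checkpoints are per-group, a slow analytics job never holds back a fast real-time dashboard reading the same stream.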
Partition Keys
Producers can specify a partition key when sending an event. If a partition key is provided, Event Hubs uses a hash of the partition key to deterministically select a partition for the event. This ensures that all events with the same partition key are sent to the same partition. This is crucial for maintaining order within a specific entity (e.g., all events for a specific device should go to the same partition).
If no partition key is specified, Event Hubs distributes events across partitions in a round-robin fashion, which maximizes throughput but doesn't guarantee ordering for specific entities.
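Both routing behaviors can be sketched in a few lines. The hash function here is a stand-in (the service uses its own internal hash), but the key property holds: the same key always maps to the same partition.

```python
# Sketch of partition selection: hash the partition key when one is given,
# otherwise rotate round-robin. The real service's hash function differs.
import itertools

PARTITION_COUNT = 4
_round_robin = itertools.cycle(range(PARTITION_COUNT))

def choose_partition(partition_key=None):
    if partition_key is not None:
        # Same key always hashes to the same partition, preserving per-key order.
        return hash(partition_key) % PARTITION_COUNT
    # No key: spread events evenly across all partitions.
    return next(_round_robin)

# All events for the same device land in the same partition:
assert choose_partition("device-42") == choose_partition("device-42")
```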
Offsets
Each event within a partition is assigned an identifier called an offset, which marks the event's position within that partition. Consumers use offsets to track their reading progress, and can restart reading from a specific offset after a failure or restart.
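Resuming from a saved offset can be sketched as follows (a simplified model where the offset is an index into the partition; real Event Hubs offsets are opaque position markers):

```python
# Sketch of offset-based reads: a consumer resumes from any saved offset.
# Here the offset is modeled as a simple list index for clarity.

partition = ["e0", "e1", "e2", "e3", "e4"]

def read_from(offset):
    """Yield (offset, event) pairs starting at the given offset."""
    for pos in range(offset, len(partition)):
        yield pos, partition[pos]

# After a restart, pick up from a previously checkpointed offset:
resumed = list(read_from(3))
print(resumed)  # [(3, 'e3'), (4, 'e4')]
```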
// Example of event structure
{
  "body": {
    "sensorId": "sensor-123",
    "timestamp": "2023-10-27T10:00:00Z",
    "value": 25.5
  },
  "properties": {
    "contentType": "application/json"
  }
}
Throughput and Quotas
Event Hubs is designed for high throughput. The specific limits depend on the Event Hubs tier (Basic and Standard are provisioned in throughput units, Premium in processing units) and on the number of partitions. Understanding these limits is important when designing scalable, performant streaming solutions.
Retention Policy
Event Hubs retains events for a configurable period, known as the retention period; once it expires, events are automatically deleted. The maximum retention period depends on the tier. This policy helps manage storage costs and ensures that only relevant data is kept.
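The retention policy amounts to dropping events whose timestamp falls outside a sliding window. A minimal sketch (the seven-day window is an example value, and the service performs this expiry automatically, not via user code):

```python
# Sketch of a retention policy: events older than the retention window
# are dropped. Illustrative only; Event Hubs expires events automatically.
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)  # example window; actual limits vary by tier

def prune(events, now):
    """Keep only events whose timestamp falls inside the retention window."""
    cutoff = now - RETENTION
    return [e for e in events if e["timestamp"] >= cutoff]

now = datetime(2023, 10, 27, tzinfo=timezone.utc)
events = [
    {"id": 1, "timestamp": now - timedelta(days=10)},  # expired
    {"id": 2, "timestamp": now - timedelta(days=1)},   # retained
]
print([e["id"] for e in prune(events, now)])  # [2]
```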