Core Concepts of Azure Event Hubs
Event Hubs Overview
Azure Event Hubs is a big data streaming platform and event-ingestion service. It can be used for real-time analytics, continuous data processing, and command-and-control scenarios. Event Hubs handles millions of events per second, allowing you to build applications that react to events as they happen.
Key characteristics of Event Hubs include:
- High Throughput: Designed to ingest and process massive volumes of streaming data.
- Low Latency: Enables near real-time event processing.
- Scalability: Easily scales to accommodate varying data loads.
- Durability: Provides reliable data storage and replay capabilities.
- Integration: Seamlessly integrates with other Azure services like Stream Analytics, Azure Functions, and Databricks.
Producer-Consumer Model
Event Hubs operates on a classic producer-consumer model:
- Producers: Applications or devices that send (publish) data to Event Hubs. These can be web servers, IoT devices, application logs, etc.
- Consumers: Applications that read (subscribe) data from Event Hubs. These applications process the incoming data for various purposes like analytics, alerting, or further storage.
[Diagram: the Producer-Consumer model for Event Hubs]
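For concreteness, here is a minimal producer-side sketch using the azure-eventhub Python SDK (v5). The connection string and the hub name "telemetry" are placeholders, not values from this article; a matching consumer sketch appears under Consumer Groups below.

```python
from azure.eventhub import EventHubProducerClient, EventData

# Placeholder: the connection string of your Event Hubs namespace.
CONN_STR = "Endpoint=sb://<namespace>.servicebus.windows.net/;..."

producer = EventHubProducerClient.from_connection_string(
    conn_str=CONN_STR, eventhub_name="telemetry")

with producer:
    # A batch keeps the payload within the service's size limit and
    # is sent in a single network call.
    batch = producer.create_batch()
    batch.add(EventData('{"device": "sensor-7", "reading": 21.5}'))
    batch.add(EventData('{"device": "sensor-8", "reading": 19.2}'))
    producer.send_batch(batch)
```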
Event Hub
An event hub is the central entity in Event Hubs. It acts as a highly scalable publish-subscribe message broker that holds a collection of event data. Think of it as an append-only event log rather than a database table: data sent to an event hub is retained for a configurable period and then deleted or archived.
When you create an Event Hubs namespace, you can then create one or more event hubs within that namespace. Each event hub has its own configuration, including retention period and partitioning strategy.
Partitions
Partitions are the fundamental unit of ordering and data storage within an event hub. An event hub is divided into one or more partitions. Data sent to an event hub is distributed across these partitions.
- Ordering: Events within a single partition are ordered. The order of events across different partitions is not guaranteed.
- Scalability: Partitions enable parallel processing. Consumers can read from different partitions simultaneously, increasing throughput.
- Load Balancing: Producers can send events to specific partitions or let Event Hubs decide. Consumers are typically assigned to read from a subset of partitions.
- Partition Key: Producers can use a partition key to ensure that events with the same key are always routed to the same partition. This is useful for maintaining order for related events (e.g., all events from a specific device). If no partition key is specified, Event Hubs distributes events in a round-robin fashion.
The number of partitions is chosen when the event hub is created and affects both scalability and cost. You can never decrease the partition count, and increasing it later is possible only in the premium and dedicated tiers; in the basic and standard tiers it is fixed after creation.
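As a sketch of partition-key routing (same hypothetical hub and connection string as the earlier example): passing partition_key when creating a batch makes Event Hubs hash the key and deliver every event in the batch to the same partition.

```python
from azure.eventhub import EventHubProducerClient, EventData

CONN_STR = "Endpoint=sb://<namespace>.servicebus.windows.net/;..."  # placeholder

producer = EventHubProducerClient.from_connection_string(
    conn_str=CONN_STR, eventhub_name="telemetry")

with producer:
    # Every event in this batch shares the key "device-42", so they all
    # land on one partition and their relative order is preserved.
    # Omitting partition_key lets the service spread events round-robin.
    batch = producer.create_batch(partition_key="device-42")
    batch.add(EventData('{"device": "device-42", "temp": 20.1}'))
    batch.add(EventData('{"device": "device-42", "temp": 20.4}'))
    producer.send_batch(batch)
```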
Consumer Groups
A consumer group is an abstraction that allows multiple independent applications or services to read from the same event hub without interfering with each other. Each consumer group maintains its own offset (a pointer to the last read event) within each partition.
- Independent Consumption: Different applications can consume the same stream of events for different purposes. For example, one consumer group might process events for real-time dashboards, while another archives them to a data lake.
- Read Progress: Each consumer group tracks its own reading progress independently.
- Default Consumer Group: Every event hub is created with a default consumer group named $Default. You can create additional consumer groups as needed.
Tip: A common pattern is to have one consumer group for each application that needs to process the event stream.
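A minimal consumer sketch, again with the azure-eventhub Python SDK: "dashboards" is a hypothetical consumer group you would have created on the hub (in addition to $Default), and the connection string is a placeholder.

```python
from azure.eventhub import EventHubConsumerClient

CONN_STR = "Endpoint=sb://<namespace>.servicebus.windows.net/;..."  # placeholder

def on_event(partition_context, event):
    # Called once per event, for each partition this client is assigned.
    print(partition_context.partition_id, event.body_as_str())

client = EventHubConsumerClient.from_connection_string(
    conn_str=CONN_STR,
    consumer_group="dashboards",   # hypothetical group; "$Default" also works
    eventhub_name="telemetry")

with client:
    # starting_position="-1" begins at the start of each partition's
    # retained data; this group's progress is independent of other groups.
    client.receive(on_event=on_event, starting_position="-1")
```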
Events
An event is the fundamental unit of data processed by Event Hubs: a record of something that happened. The maximum size of a single event (or of a batch sent in one publish operation) depends on the tier: 256 KB in the basic tier and 1 MB in the standard and higher tiers.
An event typically consists of:
- Body: The actual data payload of the event. This can be in any format, such as JSON, Avro, Protobuf, or plain text.
- Properties: Optional metadata associated with the event, such as content type, origin, timestamp, or custom application-specific attributes.
When an event is sent to Event Hubs, it is assigned an offset within a specific partition, which acts as its unique identifier within that partition.
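To illustrate the body/properties split, a small sketch of constructing an event with the Python SDK; the property names are arbitrary application choices, not required keys:

```python
from azure.eventhub import EventData

# Body: an opaque payload (JSON here). Properties: optional metadata.
event = EventData(b'{"temp": 21.5, "unit": "C"}')
event.properties = {"origin": "sensor-7", "content-type": "application/json"}

# On the receiving side, service-assigned metadata is also exposed:
#   event.offset           -- the event's position within its partition
#   event.sequence_number  -- increases monotonically per partition
#   event.enqueued_time    -- when Event Hubs accepted the event
```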
Throughput Units (TUs)
Throughput Units (TUs) are the primary mechanism for provisioning capacity and managing throughput in Event Hubs. A TU is a pre-purchased unit of capacity that covers both ingress (incoming) and egress (outgoing) traffic:
- Ingress: Up to 1 MB per second or 1,000 events per second (whichever comes first).
- Egress: Up to 2 MB per second or 4,096 events per second.
TUs are provisioned at the namespace level and shared by all event hubs in that namespace. The number of partitions you choose also influences how this capacity can be consumed in parallel. For example, with 4 TUs the total ingress capacity is 4 MB/sec or 4,000 events/sec, and that capacity is distributed across an event hub's partitions (for instance, 10 of them).
Note: For very high-scale scenarios, Event Hubs offers auto-inflate, which can automatically increase the number of TUs as needed, and dedicated clusters for predictable, high-demand workloads.
Event Hubs Capture
Event Hubs Capture is a built-in feature that automatically and continuously streams event data from an event hub to an Azure Storage account (Blob Storage or Data Lake Storage Gen2).
Key features of Event Hubs Capture:
- Automatic Archiving: No custom code is needed to archive data.
- Configurable: You specify the destination storage account and container, plus a capture window defined by a time interval and a size limit; a capture is written whenever either threshold is reached first.
- Avro Format: Data is captured in the Apache Avro format, a compact binary format with an embedded schema that is efficient for large-scale data processing.
- Long-Term Storage: Enables cost-effective long-term storage and subsequent analysis of event data.
This feature is ideal for scenarios where you need to retain historical event data for compliance, auditing, or batch analytics.
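As an illustration of working with captured data, here is a sketch that reads one downloaded capture file with the third-party fastavro package; the file name is a placeholder, and the record field names are those Event Hubs Capture writes into its Avro output:

```python
from fastavro import reader  # pip install fastavro

# Placeholder path to a capture blob downloaded from the storage container.
with open("capture-blob.avro", "rb") as f:
    for record in reader(f):
        # Each record carries the original payload plus service metadata.
        body = record["Body"]  # raw event payload (bytes)
        print(record["EnqueuedTimeUtc"], record["SequenceNumber"], body[:80])
```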