Introduction
Azure Event Hubs Capture is a built-in feature that enables you to automatically and incrementally capture the output of your Event Hubs stream into a specified Azure Storage account (either Azure Blob Storage or Azure Data Lake Storage Gen2). This capability allows you to archive your event data for longer-term storage, replayability, batch analytics, and compliance purposes without requiring custom code.
Capture addresses a common need: retaining streaming event data in a batch-friendly format. Instead of building and maintaining custom pipelines to offload data, you let Event Hubs Capture handle this in the background.
How It Works
When Event Hubs Capture is enabled for an event hub, it continuously reads events from the event hub and batches them together. These batches are then written to your configured storage account in a format suitable for batch processing.
- Continuous Archiving: Capture runs constantly, ensuring that events are archived as they arrive.
- Batching: Events are batched based on time or size thresholds before being written to storage. This optimizes storage efficiency and reduces the number of files.
- Automatic Formatting: Data is automatically formatted into Apache Avro files. Avro is a compact and efficient binary serialization format well-suited for big data analytics.
- Configurable Storage: You specify the target storage account and container where the captured data will be saved.
- File Naming Convention: Captured files follow a predictable naming convention, which makes them easy to discover and process (a parsing helper is sketched after this list). The default format is:
{Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}
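As an example, a small, hypothetical helper (assuming the default naming format above and the .avro file extension) can split a captured blob path back into its components:

def parse_capture_path(blob_name: str) -> dict:
    # Strip the ".avro" extension, then split on "/" and pair each
    # segment with its meaning in the default naming convention.
    parts = blob_name.rsplit(".", 1)[0].split("/")
    keys = ["namespace", "eventhub", "partition_id",
            "year", "month", "day", "hour", "minute", "second"]
    return dict(zip(keys, parts))

print(parse_capture_path("myns/myhub/0/2024/05/31/14/05/33.avro"))
# {'namespace': 'myns', 'eventhub': 'myhub', 'partition_id': '0', ...}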
File Format (Apache Avro)
Event Hubs Capture uses Apache Avro for its efficiency and schema evolution capabilities. Each Avro file contains events from a specific partition within a defined time window. The schema of the Avro files includes metadata about the events, such as offset, sequence number, and timestamp, along with the event body itself.
The Avro schema of the captured files looks like this (simplified):
{
  "type": "record",
  "name": "EventData",
  "namespace": "Microsoft.ServiceBus.Messaging",
  "fields": [
    { "name": "SequenceNumber", "type": "long" },
    { "name": "Offset", "type": "string" },
    { "name": "EnqueuedTimeUtc", "type": "string" },
    { "name": "SystemProperties", "type": { "type": "map", "values": ["long", "double", "string", "bytes"] } },
    { "name": "Properties", "type": { "type": "map", "values": ["long", "double", "string", "bytes"] } },
    { "name": "Body", "type": ["null", "bytes"] }
  ]
}
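Any Avro library can read these files. A minimal sketch in Python, assuming the fastavro package is installed and a captured file has been downloaded locally as capture.avro:

from fastavro import reader

with open("capture.avro", "rb") as f:
    for record in reader(f):
        body = record["Body"]  # raw event payload; may be null
        print(record["SequenceNumber"],
              record["EnqueuedTimeUtc"],
              body.decode("utf-8") if body else None)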
Key Features
- Fully Managed: No infrastructure to manage for the capture process.
- Automatic and Seamless: Works without custom code or external services.
- Scalable: Scales automatically with your Event Hubs throughput.
- Configurable: Control capture interval, file size, and destination.
- Integration: Designed to work seamlessly with Azure Blob Storage and Azure Data Lake Storage Gen2.
- Cost-Effective: Compact Avro encoding and batched files keep storage costs low.
Benefits
- Long-Term Archiving: Satisfy compliance and regulatory requirements by retaining event data.
- Batch Analytics: Enable big data processing and machine learning on historical event data using services like Azure Databricks, Azure Synapse Analytics, or HDInsight.
- Data Replay: Replay events for debugging, testing, or reprocessing after downstream failures (a replay sketch follows this list).
- Simplified Architecture: Eliminates the need for custom data offloading solutions.
- Disaster Recovery: Provides a historical record of events in case of primary system issues.
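To illustrate replay, the sketch below re-publishes the bodies of captured events to another event hub. It assumes the fastavro and azure-eventhub packages; the connection string, hub name, and file name are placeholders.

from fastavro import reader
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    "<connection-string>", eventhub_name="myhub-replay")

with open("capture.avro", "rb") as f, producer:
    batch = producer.create_batch()
    in_batch = 0
    for record in reader(f):
        body = record["Body"]
        if body is None:
            continue
        try:
            batch.add(EventData(body))
        except ValueError:  # current batch is full: send it, start a new one
            producer.send_batch(batch)
            batch = producer.create_batch()
            batch.add(EventData(body))
            in_batch = 0
        in_batch += 1
    if in_batch:
        producer.send_batch(batch)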
Supported Storage
Event Hubs Capture supports the following Azure storage services:
- Azure Blob Storage: A cost-effective object storage solution for unstructured data.
- Azure Data Lake Storage Gen2 (ADLS Gen2): Optimized for big data analytics workloads, offering hierarchical namespace and excellent performance.
When configuring Capture, you will need to provide the connection string or managed identity details for your storage account and specify the target container.
Configuration
Event Hubs Capture can be enabled and configured directly from the Azure portal or programmatically using Azure SDKs, ARM templates, or Bicep.
Key configuration parameters include:
- Storage Account: The Azure Storage account where data will be captured.
- Storage Container: The specific container within the storage account.
- Capture Interval: The time window (1 to 15 minutes) after which a batch of events is written to storage.
- File Size Limit: The size window (10 to 500 MB) for each captured file. A file is written when either the time or size window is reached, whichever comes first.
- Encoding: Avro by default; a Parquet option is also available through the Azure portal's no-code editor experience, which is backed by Azure Stream Analytics.
When configuring via the Azure portal, you navigate to your Event Hubs namespace, select the Event Hub you want to enable Capture for, and find the "Capture" setting.
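Programmatically, the same parameters map onto the captureDescription properties of the event hub resource. Below is a sketch using the azure-mgmt-eventhub and azure-identity Python packages; the subscription ID, resource names, and storage resource ID are placeholders.

from azure.identity import DefaultAzureCredential
from azure.mgmt.eventhub import EventHubManagementClient
from azure.mgmt.eventhub.models import (
    CaptureDescription, Destination, EncodingCaptureDescription, Eventhub)

client = EventHubManagementClient(DefaultAzureCredential(), "<subscription-id>")
client.event_hubs.create_or_update(
    resource_group_name="my-rg",
    namespace_name="myns",
    event_hub_name="myhub",
    parameters=Eventhub(
        partition_count=4,
        capture_description=CaptureDescription(
            enabled=True,
            encoding=EncodingCaptureDescription.AVRO,
            interval_in_seconds=300,        # time window: 5 minutes
            size_limit_in_bytes=314572800,  # size window: 300 MB
            destination=Destination(
                name="EventHubArchive.AzureBlockBlob",
                storage_account_resource_id="<storage-account-resource-id>",
                blob_container="capture",
                archive_name_format="{Namespace}/{EventHub}/{PartitionId}/"
                                    "{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}",
            ),
        ),
    ),
)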
Use Cases
- IoT Data Archiving: Store massive amounts of telemetry data from IoT devices for later analysis or auditing.
- Log Aggregation: Capture application and system logs streamed via Event Hubs for security analysis or troubleshooting.
- Financial Transaction Streaming: Archive transaction data for compliance and historical analysis.
- Real-time Dashboard Backends: Store raw data used to populate real-time dashboards for long-term retention and deeper analysis.
- Machine Learning Training Data: Collect and archive event data to train and retrain machine learning models.
Limitations
While powerful, Event Hubs Capture has some considerations:
- Not for Real-time Processing: Capture is designed for batch archiving, not for immediate real-time consumption. For real-time processing, use Event Hubs SDKs directly or services like Azure Stream Analytics or Azure Functions.
- Avro Format: Data is captured in Avro, which may require conversion for downstream tools that do not read Avro natively (a conversion sketch follows this list).
- No Data Transformation: Capture does not perform any data transformation; it simply archives the events as received.
- Storage Cost: While efficient, archiving large volumes of data will incur storage costs.
- Partition Granularity: Files are created per partition, which can lead to a large number of small files if you have many partitions and low data volume per partition.
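For the Avro conversion noted above, a short script can flatten captured files into JSON Lines. This sketch assumes fastavro and that event bodies are UTF-8 JSON; adjust the decode step for other payloads.

import json
from fastavro import reader

with open("capture.avro", "rb") as src, open("capture.jsonl", "w") as dst:
    for record in reader(src):
        body = record["Body"]
        if body is None:  # skip records without a payload
            continue
        dst.write(json.dumps(json.loads(body.decode("utf-8"))) + "\n")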