What is Event Hubs Capture?
Azure Event Hubs Capture is a built-in feature that automatically and incrementally captures the stream of events in your Event Hub to a specified Azure Blob Storage account or Azure Data Lake Storage Gen2 account.
Capture is designed for scenarios where you need to archive your event streams for subsequent processing, batch analytics, or compliance reasons. It provides a fully managed, scalable, and cost-effective way to retain your event data.
How Capture Works
When Capture is enabled for an Event Hub, it continuously monitors the incoming event stream. Events are buffered and then written to your chosen storage destination (Azure Blob Storage or Azure Data Lake Storage Gen2) in batches based on configurable capture time and size intervals.
The data is written to storage in Apache Avro format, which is a compact, efficient binary format suitable for large-scale data processing.
Capture operates independently of your application's event producers and consumers, meaning it doesn't impact the throughput or latency of your real-time event processing pipelines.
Capture Process Flow:
- Events are sent to the Event Hub.
- Capture buffers the incoming events.
- Buffered events are periodically serialized into Avro format.
- Avro files are uploaded to the configured storage account.
- Files are organized within the destination container in a date-time partitioned folder structure. By default, paths follow the archive name format {Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}.
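As an illustration, a small helper can compute the blob path Capture produces for a given partition and timestamp under the default archive name format {Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}. The namespace and hub names below are hypothetical.

```python
from datetime import datetime, timezone

def capture_blob_path(namespace: str, event_hub: str, partition_id: int,
                      ts: datetime) -> str:
    """Build the blob path Capture uses under the default
    {Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}
    archive name format. All date-time components are zero-padded."""
    return "/".join([
        namespace,
        event_hub,
        str(partition_id),
        f"{ts.year:04d}", f"{ts.month:02d}", f"{ts.day:02d}",
        f"{ts.hour:02d}", f"{ts.minute:02d}", f"{ts.second:02d}",
    ])

# Hypothetical names:
ts = datetime(2024, 3, 7, 9, 5, 30, tzinfo=timezone.utc)
print(capture_blob_path("my-namespace", "telemetry", 0, ts))
# → my-namespace/telemetry/0/2024/03/07/09/05/30
```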
Enabling Capture
Capture can be enabled and configured directly from the Azure portal when creating or updating an event hub within a namespace. You'll need to provide:
- A destination storage account (Azure Blob Storage or Azure Data Lake Storage Gen2).
- A destination container within the storage account.
- Permissions for Event Hubs to write to the storage account (typically via a Managed Identity or connection string).
You can also enable Capture using Azure Resource Manager (ARM) templates, Azure CLI, or PowerShell.
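In an ARM template, these settings map onto the captureDescription block of the event hub resource. A minimal sketch, assuming a 5-minute capture window, a roughly 300 MB size limit, and placeholder subscription, resource group, and storage account names:

```json
{
  "captureDescription": {
    "enabled": true,
    "encoding": "Avro",
    "intervalInSeconds": 300,
    "sizeLimitInBytes": 314572800,
    "destination": {
      "name": "EventHubArchive.AzureBlockBlob",
      "properties": {
        "storageAccountResourceId": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>",
        "blobContainer": "capture",
        "archiveNameFormat": "{Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}"
      }
    }
  }
}
```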
Capture Settings
You can fine-tune Capture's behavior through several configuration options:
Capture Frequency
Determines how often Capture closes the current file and starts a new one:
- Capture Time: The maximum time interval between captures (e.g., every 60 seconds).
Capture Size
Determines the maximum size of a capture file. Once this size is reached, Capture will create a new file, even if the time interval hasn't elapsed.
- Capture Size: The maximum size limit for a capture file (e.g., 100 MB).
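The interaction between the two settings can be sketched as follows: Capture cuts a new file when either the time window or the size limit is reached, whichever comes first. The thresholds below are illustrative defaults for the sketch, not the service's own defaults.

```python
def should_roll_file(elapsed_seconds: float, file_bytes: int,
                     interval_seconds: int = 300,
                     size_limit_bytes: int = 100 * 1024 * 1024) -> bool:
    """Capture closes the current file when EITHER the configured time
    window or the size limit is reached, whichever comes first."""
    return elapsed_seconds >= interval_seconds or file_bytes >= size_limit_bytes

# Size limit hit before the time window elapses:
print(should_roll_file(elapsed_seconds=42, file_bytes=100 * 1024 * 1024))  # True
# Neither threshold reached yet:
print(should_roll_file(elapsed_seconds=42, file_bytes=1024))               # False
```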
Capture Format
The data format written by Capture is standardized.
- Avro: The default and only supported format for Capture output. Avro is a row-based binary format ideal for big data processing.
Destination Storage
Capture supports writing event data to two Azure storage services:
- Azure Blob Storage: A general-purpose object storage solution. Ideal for general archiving and compatibility with various tools.
- Azure Data Lake Storage Gen2: Built on Azure Blob Storage, it provides hierarchical namespace capabilities, optimized for big data analytics workloads.
When configuring Capture, you specify the storage account and container where the Avro files will be stored. The folder structure within the container is automatically managed by Capture to facilitate easy querying and retrieval.
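The date-time partitioning pays off at read time: a consumer can list only the slice of the container that can contain events from a given window. A minimal sketch, assuming the default archive name format {Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second} and hypothetical names, that enumerates the hour-level prefixes to scan:

```python
from datetime import datetime, timedelta, timezone

def hourly_prefixes(namespace: str, event_hub: str, partition_id: int,
                    start: datetime, end: datetime):
    """Yield hour-granularity blob prefixes covering [start, end], so a
    reader can list only the capture files relevant to that window."""
    t = start.replace(minute=0, second=0, microsecond=0)
    while t <= end:
        yield (f"{namespace}/{event_hub}/{partition_id}/"
               f"{t.year:04d}/{t.month:02d}/{t.day:02d}/{t.hour:02d}/")
        t += timedelta(hours=1)

# Hypothetical names; 09:30–11:10 UTC spans three hourly prefixes:
start = datetime(2024, 3, 7, 9, 30, tzinfo=timezone.utc)
end = datetime(2024, 3, 7, 11, 10, tzinfo=timezone.utc)
for prefix in hourly_prefixes("my-namespace", "telemetry", 0, start, end):
    print(prefix)
```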
Use Cases
Event Hubs Capture is invaluable for a wide range of scenarios:
- Data Archiving and Compliance: Retain event data for regulatory compliance or audit purposes.
- Batch Analytics: Process large volumes of historical event data using services like Azure Databricks, Azure Synapse Analytics, or Azure HDInsight.
- Machine Learning Training: Use historical event streams as training data for machine learning models.
- Debugging and Replay: Replay events from storage to debug issues or test new processing logic.
- Data Lake Integration: Seamlessly feed real-time event data into a data lake for unified data governance and analysis.