Event Hubs Capture
Azure Event Hubs Capture is a built-in feature that automatically and incrementally batches the streaming data in your event hubs, saving it to an Azure Blob Storage or Azure Data Lake Storage Gen2 account of your choice.
Capture is designed to offload the task of archiving event data from Event Hubs. It's fully managed and continuously archives the data written to your event hubs, making it ideal for scenarios where you need to keep all incoming events for downstream processing, historical analysis, or compliance purposes.
Key Features of Event Hubs Capture
- Automatic Archiving: No code required to capture events. It's a seamless integration.
- Incremental Batching: Data is batched and written to storage incrementally based on configurable time and size windows.
- Multiple Storage Options: Supports Azure Blob Storage and Azure Data Lake Storage Gen2.
- Avro Format: Captured data is saved in Avro format, which is efficient and widely supported.
- Configurable Capture Settings: You can customize the capture window size and interval.
- Scalability: Scales automatically with your Event Hubs throughput.
How Event Hubs Capture Works
When you enable Event Hubs Capture on an event hub, you specify:
- The target storage account: Either Azure Blob Storage or Azure Data Lake Storage Gen2.
- The destination container/directory: Where the captured data will be stored.
- The capture interval: The time window (in seconds) after which Capture closes the current blob or file and starts a new one.
- The capture size: The maximum size (in megabytes) a blob or file can reach before Capture closes it and starts a new one. Whichever window fills first triggers the write.
Event Hubs Capture operates by:
- Monitoring incoming events in your Event Hubs.
- Aggregating events into batches based on the configured time and size intervals.
- Writing these batches as Avro files to the specified Azure Storage account.
- Organizing the files in a hierarchical, time-based folder structure within the storage account, typically including year, month, day, hour, and minute, for easy querying and management (illustrated in the sketch below).
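To make that layout concrete, the following sketch lists the capture files written for one partition during a single hour, using the azure-storage-blob Python package. It assumes the default Capture name format of {Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}; the connection string, container, namespace, and event hub names are placeholders.

```python
# Minimal sketch: list the Avro files Capture wrote for one partition and hour.
# Assumes the default name format {Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}.
# The connection string, container, namespace, and event hub below are placeholders.
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    conn_str="<storage-account-connection-string>",
    container_name="capture-container",  # the container selected when Capture was enabled
)

# Namespace "my-namespace", event hub "telemetry", partition 0, 2024-06-01 13:00 UTC.
prefix = "my-namespace/telemetry/0/2024/06/01/13/"

for blob in container.list_blobs(name_starts_with=prefix):
    print(f"{blob.name}  ({blob.size} bytes)")
```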
Enabling Event Hubs Capture
You can enable Event Hubs Capture through the Azure portal, the Azure CLI, PowerShell, or the management SDKs; a Python SDK sketch follows the portal steps below.
Azure Portal Steps:
- Navigate to your Event Hubs namespace in the Azure portal and open the event hub you want to capture (Capture is configured per event hub).
- In the event hub's left-hand menu, select "Capture".
- Enable the Capture toggle.
- Select or create your Storage account and Blob container (or Data Lake Storage Gen2 file system).
- Configure the Capture interval (e.g., 300 seconds) and Capture size (e.g., 100 MB).
- Click "Save".
Important: Event Hubs Capture begins archiving events only *after* the capture setting is enabled. It does not capture historical data that arrived before the feature was turned on.
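If you prefer automation over the portal, the same settings can be applied through the management SDK. The sketch below uses the azure-mgmt-eventhub Python package to enable Capture on an existing event hub; the subscription, resource group, namespace, event hub, and storage account IDs are placeholders, and model names can differ slightly between SDK versions, so treat it as a starting point rather than a definitive script.

```python
# Sketch: enable Capture on an existing event hub with the azure-mgmt-eventhub package.
# Subscription, resource group, namespace, event hub, and storage IDs are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.eventhub import EventHubManagementClient
from azure.mgmt.eventhub.models import CaptureDescription, Destination, Eventhub

client = EventHubManagementClient(DefaultAzureCredential(), "<subscription-id>")

capture = CaptureDescription(
    enabled=True,
    encoding="Avro",                          # Capture writes Avro files
    interval_in_seconds=300,                  # time window (5 minutes)
    size_limit_in_bytes=100 * 1024 * 1024,    # size window (100 MB); whichever fills first wins
    destination=Destination(
        name="EventHubArchive.AzureBlockBlob",  # blob storage destination type
        storage_account_resource_id=(
            "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
            "/providers/Microsoft.Storage/storageAccounts/<storage-account>"
        ),
        blob_container="capture-container",
        archive_name_format=(
            "{Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}"
        ),
    ),
)

client.event_hubs.create_or_update(
    resource_group_name="<resource-group>",
    namespace_name="my-namespace",
    event_hub_name="telemetry",
    parameters=Eventhub(capture_description=capture),
)
```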
Use Cases for Event Hubs Capture
- Batch Analytics: Process large volumes of captured event data using tools like Azure Databricks, Apache Spark, or Azure Synapse Analytics (a PySpark sketch follows this list).
- Archival and Compliance: Maintain a historical record of all events for regulatory compliance or auditing purposes.
- Machine Learning Training: Use captured data to train machine learning models that predict patterns or anomalies.
- Data Lake Integration: Feed raw event data directly into your data lake for unified access and advanced analytics.
- Replay Scenarios: In some cases, captured data can be used to reprocess events if an error occurs in a downstream consumer.
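As an illustration of the batch-analytics use case, the PySpark sketch below reads one day of captured Avro files from a Data Lake Storage Gen2 path. It assumes a Spark environment with the Avro data source available (as in Azure Databricks or Synapse Spark) and storage access already configured; the storage account, container, namespace, event hub, and date in the path are placeholders.

```python
# Sketch: batch-process one day of captured events with PySpark.
# Assumes the Avro data source is available (bundled with Databricks/Synapse Spark)
# and that access to the storage account is already configured; paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("capture-batch").getOrCreate()

# Wildcards cover all partitions, hours, minutes, and seconds for 2024-06-01
# under the default Capture folder layout.
path = (
    "abfss://capture-container@mystorageaccount.dfs.core.windows.net/"
    "my-namespace/telemetry/*/2024/06/01/*/*/*"
)

df = spark.read.format("avro").load(path)

# Capture stores the payload as bytes in the Body column alongside event metadata.
events = df.select(
    F.col("SequenceNumber"),
    F.col("EnqueuedTimeUtc"),
    F.col("Body").cast("string").alias("body"),
)

print(events.count())
```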
Data Format: Avro
Event Hubs Capture saves data in the Apache Avro format. Avro is a row-based data serialization system that provides rich data structures and a compact, fast, binary data format. Each captured file will contain multiple events. The Avro schema includes metadata about the event, such as offset, sequence number, timestamp, and properties, in addition to the event body.
You can use various tools and libraries to read Avro files, including:
- Apache Spark
- Apache Hive
- Azure Databricks
- Azure Synapse Analytics
- Python with the fastavro or avro-python3 libraries (see the example below)
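For example, a single capture file downloaded from the container can be inspected with fastavro. The file name below is a placeholder; the record fields shown (SequenceNumber, Offset, EnqueuedTimeUtc, Body) come from the schema Capture writes for each event.

```python
# Sketch: inspect a downloaded capture file with fastavro (pip install fastavro).
# The file name is a placeholder; each record holds event metadata plus the raw Body bytes.
from fastavro import reader

with open("05.avro", "rb") as avro_file:
    for record in reader(avro_file):
        print(
            record["SequenceNumber"],
            record["Offset"],
            record["EnqueuedTimeUtc"],
            record["Body"][:50],  # payload bytes; decode as appropriate (e.g., UTF-8 JSON)
        )
```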
Considerations
- Capture is configured per event hub, not at the namespace level, so enable it on each event hub whose data you want to archive.
- The capture process incurs additional costs for the Azure Storage it consumes (and, in the Standard tier, an hourly Capture charge on the Event Hubs side).
- Monitor the storage account's consumed capacity; Capture does not delete old files, so apply storage lifecycle policies if you need retention limits.
- While Capture is automatic, understanding the batching windows helps you predict when data becomes available in storage; with a 300-second interval, for example, an event may not appear in a captured file until roughly five minutes after it was ingested.
By leveraging Event Hubs Capture, you can efficiently integrate your real-time event streaming data with robust batch processing and analytical services.