Event Hubs Capture
Azure Event Hubs Capture is a fully managed feature that automatically lands streaming data from Event Hubs as Apache Avro or Apache Parquet files in Azure Blob Storage or Azure Data Lake Storage Gen1/Gen2.
How Event Hubs Capture Works
Capture is an opt-in feature that you enable on an individual event hub. Once enabled, it automatically and periodically writes that event hub's data to your chosen storage account. Data is captured in windows defined by a size limit and a time interval; whichever threshold is reached first closes the current file and starts a new one.
Key Features and Benefits:
- Automatic Archiving: Seamlessly archives streaming data for later processing, analysis, or compliance.
- Scalable: Scales automatically with your Event Hubs throughput.
- Managed Service: No infrastructure to manage. Azure handles the underlying compute and storage provisioning.
- Integration with Azure Ecosystem: Works natively with Blob Storage, Data Lake Storage, and downstream analytics services like Azure Databricks, Azure Synapse Analytics, and Azure HDInsight.
- File Formats: Supports industry-standard formats like Apache Parquet and Apache Avro, optimized for analytical workloads.
- Configurable: Allows customization of capture intervals and file sizes.
Enabling Event Hubs Capture
You can enable Capture via the Azure portal, Azure CLI, PowerShell, or ARM templates. The process involves:
- Creating or selecting an existing event hub inside a namespace.
- Configuring a destination storage account (Blob Storage or Data Lake Storage).
- Specifying the capture settings:
  - Destination Type: Blob Storage or Data Lake Storage.
  - Storage Account Name: The name of your storage account.
  - Container Name: The container within the storage account to store captured data.
  - File Format: Avro (the native Capture format) or Parquet (written via the Stream Analytics no-code editor).
  - Capture Interval (in seconds): The maximum time to wait before closing the current file, between 60 and 900 seconds (e.g., 300 seconds).
  - Capture Size (in MB): The maximum size a file can reach before it's closed and a new one is started, between 10 MB and 500 MB (e.g., 100 MB).
Azure Portal Example:
Navigate to your event hub (under its Event Hubs namespace) in the Azure portal, select Capture from the left-hand menu, and follow the prompts to configure your destination and settings.
Azure CLI Example:
# Capture is configured on the event hub itself, not the namespace.
# Flag names vary across CLI versions; older releases use
# --capture-interval-seconds and --capture-size-limit-bytes instead.
# Capture via the CLI writes Avro (there is no file-format flag).
az eventhubs eventhub update \
    --resource-group "MyResourceGroup" \
    --namespace-name "myEventHubNamespace" \
    --name "myEventHub" \
    --enable-capture true \
    --destination-name "EventHubArchive.AzureBlockBlob" \
    --storage-account "myStorageAccount" \
    --blob-container "eventhub-capture" \
    --capture-interval 300 \
    --capture-size-limit 104857600   # 100 MB, expressed in bytes
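If you prefer to script this in Python rather than the CLI, the Azure management SDK exposes the same settings. The following is a minimal sketch, assuming the azure-mgmt-eventhub and azure-identity packages; all resource names and the subscription ID are placeholders, and exact model and enum names can vary between SDK releases:

from azure.identity import DefaultAzureCredential
from azure.mgmt.eventhub import EventHubManagementClient
from azure.mgmt.eventhub.models import (
    CaptureDescription, Destination, EncodingCaptureDescription,
)

# Placeholder names throughout.
client = EventHubManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Fetch the existing event hub, attach a capture description, and write it back.
eh = client.event_hubs.get("MyResourceGroup", "myEventHubNamespace", "myEventHub")
eh.capture_description = CaptureDescription(
    enabled=True,
    encoding=EncodingCaptureDescription.AVRO,
    interval_in_seconds=300,          # time window
    size_limit_in_bytes=104857600,    # 100 MB size window
    destination=Destination(
        name="EventHubArchive.AzureBlockBlob",
        storage_account_resource_id=(
            "/subscriptions/<subscription-id>/resourceGroups/MyResourceGroup"
            "/providers/Microsoft.Storage/storageAccounts/myStorageAccount"
        ),
        blob_container="eventhub-capture",
    ),
)
client.event_hubs.create_or_update(
    "MyResourceGroup", "myEventHubNamespace", "myEventHub", eh
)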
Understanding Captured Data
Captured data is organized in a hierarchical directory structure within your storage account. The default naming convention is:

{container}/{namespace}/{eventhub}/{partition-id}/{year}/{month}/{day}/{hour}/{minute}/{second}.avro

Or for Parquet:

{container}/{namespace}/{eventhub}/{partition-id}/{year}/{month}/{day}/{hour}/{minute}/{second}.parquet

For example:

eventhub-capture/myEventHubNamespace/myeventhub/0/2024/05/17/09/15/00.avro

Where:

{partition-id} is the event hub partition the events were read from, and the date/time segments reflect when the file was captured (UTC). The path layout can be customized with the archive name format setting when you enable Capture.
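Because the layout is deterministic, you can enumerate captured files directly. Here is a minimal sketch using the azure-storage-blob package; the account, container, and prefix are placeholder names matching the example path above:

from azure.identity import DefaultAzureCredential
from azure.storage.blob import ContainerClient

# Placeholder storage account and container names.
container = ContainerClient(
    account_url="https://mystorageaccount.blob.core.windows.net",
    container_name="eventhub-capture",
    credential=DefaultAzureCredential(),
)

# List every file captured for partition 0 during a single hour.
prefix = "myEventHubNamespace/myeventhub/0/2024/05/17/09/"
for blob in container.list_blobs(name_starts_with=prefix):
    print(blob.name, blob.size)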
Data Schema:
The captured files contain event metadata along with the event body. For Avro, each record carries the fields SequenceNumber, Offset, EnqueuedTimeUtc, SystemProperties, Properties, and Body, where Body holds the raw event payload as bytes. For Parquet, the columns depend on the schema you configure.
Note on Data Format
When capturing JSON payloads, the entire JSON object is serialized into the event body. With Avro, the payload lands as raw bytes in the record's Body field, so consumers must decode it downstream; with Parquet, events are written as columns according to the configured Parquet schema.
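Any standard Avro library can read captured files. Here is a minimal sketch using the fastavro package, assuming the producer sent JSON payloads (the file name is a placeholder):

import json

from fastavro import reader  # pip install fastavro

# Placeholder path to a capture file downloaded from the storage account.
with open("00.avro", "rb") as f:
    for record in reader(f):
        # Each record exposes the Capture schema fields: SequenceNumber,
        # Offset, EnqueuedTimeUtc, SystemProperties, Properties, and Body.
        payload = json.loads(record["Body"])  # assumes the producer sent JSON
        print(record["EnqueuedTimeUtc"], payload)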
Use Cases for Captured Data
- Batch Analytics: Process historical data using tools like Spark, Databricks, or Synapse Analytics (see the PySpark sketch after this list).
- Machine Learning Training: Use captured data to train predictive models.
- Compliance and Auditing: Maintain long-term records of events for regulatory purposes.
- Data Warehousing: Load captured data into data warehouses for business intelligence.
- Replay Scenarios: Replay historical events for debugging or testing new applications.
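For the batch-analytics case, Spark can read a whole capture directory tree at once. A minimal PySpark sketch, assuming the spark-avro package is available (it is bundled in Databricks and Synapse Spark pools) and reusing the placeholder storage path from earlier examples:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("capture-batch").getOrCreate()

# Read the whole capture tree for one event hub (placeholder storage path).
df = (
    spark.read.format("avro")
    .option("recursiveFileLookup", "true")
    .load("abfss://eventhub-capture@mystorageaccount.dfs.core.windows.net/"
          "myEventHubNamespace/myeventhub/")
)

# Project the enqueue time and decode the raw Body bytes as text.
df.select(
    col("EnqueuedTimeUtc"),
    col("Body").cast("string").alias("payload"),
).show(truncate=False)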
Important Considerations:
- Capture is enabled per event hub, so you can turn it on only for the streams you need to archive.
- The data is captured in its raw format. If you need to transform or process it, you'll need downstream services.
- Monitor your storage account to ensure sufficient capacity and manage costs.
- Ensure the managed identity or access key used for Capture has the necessary permissions on the storage account.
By leveraging Event Hubs Capture, you can reliably archive your real-time data streams and unlock their value for a wide range of batch processing and analytical scenarios.