Event Hubs Capture

Azure Event Hubs Capture is a fully managed feature that automatically delivers the streaming data flowing through an event hub to Azure Blob Storage or Azure Data Lake Storage Gen1/Gen2. Data is written natively as Apache Avro files; Apache Parquet output is also available when Capture is configured through the Stream Analytics no-code editor.

How Event Hubs Capture Works

Capture is an opt-in feature that you enable on an individual event hub rather than on the namespace as a whole. Once enabled, it automatically and periodically writes the data flowing through the event hub to your chosen storage account, with no changes required to your producers or consumers. Data is captured in batches governed by two configurable windows, a size window and a time window, and a file is closed as soon as either threshold is reached.
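
Because Capture runs entirely on the service side, publishing code is unaffected. The sketch below (Python, using the azure-eventhub package; the connection string and event hub name are placeholders) sends a few JSON events that Capture would then batch into files in the background:

from azure.eventhub import EventHubProducerClient, EventData

# Placeholders: substitute your own connection string and event hub name.
CONNECTION_STR = "<your-event-hubs-connection-string>"
EVENT_HUB_NAME = "myEventHub"

producer = EventHubProducerClient.from_connection_string(
    CONNECTION_STR, eventhub_name=EVENT_HUB_NAME
)

with producer:
    batch = producer.create_batch()
    for i in range(10):
        # Nothing Capture-specific here: events are published as usual,
        # and the service copies them to storage on its own schedule.
        batch.add(EventData(f'{{"deviceId": {i}, "reading": {i * 1.5}}}'))
    producer.send_batch(batch)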

Key Features and Benefits:

  • Fully managed: no capture jobs, clusters, or custom code to run; the service writes the files for you.
  • Scales automatically with your event hub's throughput units, with no administration overhead.
  • Flexible batching: files are closed on a size window or a time window, whichever is reached first.
  • Non-intrusive: captured data is a copy of the stream, so real-time consumers are unaffected.

Enabling Event Hubs Capture

You can enable Capture via the Azure portal, Azure CLI, PowerShell, or ARM templates. The process involves:

  1. Creating or selecting an existing event hub within a namespace.
  2. Configuring a destination storage account (Blob Storage or Data Lake Storage).
  3. Specifying the capture settings:
    • Destination Type: Blob Storage or Data Lake Storage.
    • Storage Account Name: The name of your storage account.
    • Container Name: The container within the storage account to store captured data.
    • File Format: Avro (the native Capture format) or Parquet (provisioned through the Stream Analytics no-code editor).
    • Capture Interval (in seconds): The maximum time to wait before a file is closed and a new one is started, between 60 and 900 seconds (e.g., 60 seconds).
    • Capture Size (in MB): The maximum amount of data collected before a file is closed and a new one is started, between 10 MB and 500 MB (e.g., 100 MB).

Azure Portal Example:

Navigate to the specific event hub inside your Event Hubs namespace in the Azure portal, select Capture from its menu, and follow the prompts to configure your destination and settings.

Azure CLI Example:

Because Capture is a property of the event hub entity, it is configured with az eventhubs eventhub update rather than at the namespace level. The CLI configures the native Avro format; note that --capture-size-limit takes bytes, not megabytes:

az eventhubs eventhub update \
    --resource-group "MyResourceGroup" \
    --namespace-name "myEventHubNamespace" \
    --name "myEventHub" \
    --enable-capture true \
    --capture-interval 300 \
    --capture-size-limit 104857600 \
    --destination-name "EventHubArchive.AzureBlockBlob" \
    --storage-account "myStorageAccount" \
    --blob-container "eventhub-capture"
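
The same settings can also be applied from Python through the management plane. The following is a minimal sketch assuming the azure-mgmt-eventhub and azure-identity packages; the subscription ID, resource names, and storage account resource ID are placeholders, and exact model names can vary between SDK versions:

from azure.identity import DefaultAzureCredential
from azure.mgmt.eventhub import EventHubManagementClient
from azure.mgmt.eventhub.models import Eventhub, CaptureDescription, Destination

# Placeholders: substitute your own subscription and resource names.
SUBSCRIPTION_ID = "<subscription-id>"
client = EventHubManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Note: create_or_update replaces the event hub definition, so a real
# script should also carry over existing settings such as partition count.
client.event_hubs.create_or_update(
    resource_group_name="MyResourceGroup",
    namespace_name="myEventHubNamespace",
    event_hub_name="myEventHub",
    parameters=Eventhub(
        capture_description=CaptureDescription(
            enabled=True,
            encoding="Avro",                # native Capture format
            interval_in_seconds=300,        # time window
            size_limit_in_bytes=104857600,  # size window (100 MB)
            destination=Destination(
                name="EventHubArchive.AzureBlockBlob",
                storage_account_resource_id=(
                    "/subscriptions/<subscription-id>/resourceGroups/"
                    "MyResourceGroup/providers/Microsoft.Storage/"
                    "storageAccounts/myStorageAccount"
                ),
                blob_container="eventhub-capture",
            ),
        )
    ),
)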

Understanding Captured Data

Captured data is organized in a hierarchical directory structure within your storage container. The default naming convention is:

{Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}

with one Avro file per partition per capture window. You can customize this pattern through the archive name format setting, but it must contain all of these fields in some order. Parquet files written through the no-code editor follow the path pattern configured there.

Where:

  • {Namespace} and {EventHub}: the names of the namespace and the event hub.
  • {PartitionId}: the event hub partition the events were read from.
  • {Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}: the UTC timestamp of the capture window.
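
Because the path embeds the timestamp, you can enumerate the files for a particular hour with a simple prefix query. Here is a minimal sketch using the azure-storage-blob Python package (the connection string, container, and resource names are placeholders):

from azure.storage.blob import ContainerClient

# Placeholders: substitute your own storage connection string and container.
container = ContainerClient.from_connection_string(
    "<storage-connection-string>", container_name="eventhub-capture"
)

# Prefix matching one hour of one partition's data under the default convention.
prefix = "myEventHubNamespace/myEventHub/0/2024/01/15/09/"
for blob in container.list_blobs(name_starts_with=prefix):
    print(blob.name, blob.size)
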
Data Schema:

The captured files contain event metadata along with the event body. For Avro, each record carries the fields SequenceNumber, Offset, EnqueuedTimeUtc, SystemProperties, Properties, and Body; the Body field holds the raw event payload as bytes. For Parquet, the data is structured according to the schema you define in the no-code editor.
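
For reference, the Avro schema of captured files, as published in the Event Hubs documentation, looks like the following (paraphrased; inspect a captured file for the authoritative version):

{
    "type": "record",
    "name": "EventData",
    "namespace": "Microsoft.ServiceBus.Messaging",
    "fields": [
        { "name": "SequenceNumber", "type": "long" },
        { "name": "Offset", "type": "string" },
        { "name": "EnqueuedTimeUtc", "type": "string" },
        { "name": "SystemProperties", "type": { "type": "map", "values": ["long", "double", "string", "bytes"] } },
        { "name": "Properties", "type": { "type": "map", "values": ["long", "double", "string", "bytes", "null"] } },
        { "name": "Body", "type": ["null", "bytes"] }
    ]
}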

Note on Data Format

Capture does not interpret your payloads. If your producers send JSON, each captured Avro record's Body field contains the raw JSON bytes, which you decode downstream; the Avro schema above describes only the envelope, not your payload. With the Parquet option, the no-code editor deserializes the JSON payload into columns according to the schema you define.
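
For example, decoding JSON payloads out of a captured file might look like the following minimal sketch, assuming the avro Python package and a file already downloaded locally (the filename is a placeholder):

import json

from avro.datafile import DataFileReader
from avro.io import DatumReader

# Placeholder: a captured file downloaded from your storage container.
reader = DataFileReader(open("09_30_15.avro", "rb"), DatumReader())
for record in reader:
    # Envelope fields written by Capture.
    print(record["SequenceNumber"], record["EnqueuedTimeUtc"])
    # Body holds the raw bytes your producer sent; here we assume JSON.
    if record["Body"] is not None:
        payload = json.loads(record["Body"])
        print(payload)
reader.close()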

Use Cases for Captured Data

  • Long-term archival: retain a durable copy of the stream beyond the event hub's retention period.
  • Batch analytics: process captured files with Azure Synapse Analytics, Azure Databricks, HDInsight, or any engine that reads Avro or Parquet.
  • Replay and backfill: reprocess historical events when a downstream bug or a new consumer requires it.
  • Machine learning: accumulate training datasets from live telemetry.

Important Considerations:

  • Capture incurs an additional hourly charge in the Standard tier; it is included in the Premium and Dedicated tiers.
  • Capture windows use a first-wins policy: the first trigger reached, size or time, closes the file.
  • By default a file is written every interval even when no events arrive; enable the "Do not emit empty files" option (--skip-empty-archives in the CLI) to suppress empty files.
  • Capturing does not remove events from the event hub; normal retention still applies, and real-time consumers are unaffected.

By leveraging Event Hubs Capture, you can reliably archive your real-time data streams and unlock their value for a wide range of batch processing and analytical scenarios.
