How to Manage Data in Azure Machine Learning
Effective data management is crucial for building reliable and scalable machine‑learning solutions. Azure Machine Learning provides several services and patterns to help you ingest, store, version, and preprocess data.
In this article:
- Data Stores
- Data Versioning
- Data Preparation with Data Wrangler
Data Stores
You can register Azure storage resources as Data Stores in your workspace and reference them in pipelines.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AzureBlobDatastore, AccountKeyConfiguration
from azure.identity import DefaultAzureCredential

# Connect to the workspace (subscription_id, resource_group, and
# workspace_name are assumed to be defined)
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace_name
)

# Register a Blob container as a datastore
blob_datastore = AzureBlobDatastore(
    name="myblobstore",
    account_name="myaccount",
    container_name="datasets",
    credentials=AccountKeyConfiguration(account_key="<account-key>"),
)
ml_client.datastores.create_or_update(blob_datastore)
Supported store types:
- Azure Blob Storage
- Azure Data Lake Storage Gen2
- Azure Files
- Azure SQL Database
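Other store types follow the same pattern. As a minimal sketch, registering an Azure Data Lake Storage Gen2 store with service-principal credentials might look like this (the account, filesystem, and credential values are placeholders):

from azure.ai.ml.entities import AzureDataLakeGen2Datastore, ServicePrincipalConfiguration

adls_datastore = AzureDataLakeGen2Datastore(
    name="myadlsstore",
    account_name="myaccount",
    filesystem="datasets",  # the ADLS Gen2 filesystem (container)
    credentials=ServicePrincipalConfiguration(
        tenant_id="<tenant-id>",
        client_id="<client-id>",
        client_secret="<client-secret>",
    ),
)
ml_client.datastores.create_or_update(adls_datastore)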
Data Versioning
Use Data Assets to version datasets. Each version is immutable and can be referenced by name and version number.
from azure.ai.ml import Input, command
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Register a dataset version
churn_data = Data(
    name="customer_churn",
    version="1",
    path="azureml://datastores/myblobstore/paths/churn.csv",
    type=AssetTypes.URI_FILE,
)
ml_client.data.create_or_update(churn_data)
# Reference the registered asset in a pipeline step
step = command(
    name="train",
    inputs={"training_data": Input(type="uri_file", path="azureml:customer_churn:1")},
    ...
)
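To always consume the newest registered version instead of pinning one, reference the asset with the @latest label:

Input(type="uri_file", path="azureml:customer_churn@latest")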
Benefits:
- Traceability of data used in experiments
- Easy rollback to previous dataset versions (see the sketch after this list)
- Integration with MLflow tracking
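For example, rolling back is just a matter of fetching and referencing an earlier version (the asset name and version follow the registration above):

# Fetch a specific earlier version of the asset
previous = ml_client.data.get(name="customer_churn", version="1")
print(previous.path)

# Enumerate all registered versions of the asset
for version in ml_client.data.list(name="customer_churn"):
    print(version.version)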
Data Preparation with Data Wrangler
Data Wrangler offers a visual UI for cleaning, transforming, and profiling data. It also generates reusable Python scripts.
- Connect to registered data stores
- Apply operations: filter, split, impute, encode
- Export as a pipeline step or a reusable component
Example of the kind of script Data Wrangler generates (illustrative only; the actual code and column names depend on the operations you apply):

import pandas as pd

df = pd.read_csv("churn.csv")
# Impute missing values in a numeric column (hypothetical column name)
df["tenure"] = df["tenure"].fillna(df["tenure"].median())
# One-hot encode a categorical column (hypothetical column name)
df = pd.get_dummies(df, columns=["contract_type"])
# Write the cleaned file; it can then be registered as a new data asset
df.to_csv("clean_churn.csv", index=False)
Best Practices
- Store raw data in immutable locations (e.g., a dedicated Blob container).
- Use Data Assets for every processed version.
- Tag datasets with metadata (e.g., source, collection date).
- Automate data validation in the pipeline (schema checks, statistic thresholds); a minimal sketch follows this list.
- Use Microsoft Purview for data governance across the organization.
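A minimal validation step might look like the following (the expected columns, dtypes, and the 5% null threshold are hypothetical):

import pandas as pd

# Hypothetical expected schema: column name -> pandas dtype
EXPECTED_COLUMNS = {"customer_id": "int64", "tenure": "float64", "churned": "bool"}

def validate(df: pd.DataFrame) -> None:
    # Schema check: every expected column must exist with the expected dtype
    for col, dtype in EXPECTED_COLUMNS.items():
        assert col in df.columns, f"missing column: {col}"
        assert str(df[col].dtype) == dtype, f"{col}: got {df[col].dtype}, expected {dtype}"
    # Statistic threshold: fail if more than 5% of rows contain missing values
    null_ratio = df.isna().any(axis=1).mean()
    assert null_ratio <= 0.05, f"{null_ratio:.1%} of rows have missing values"

validate(pd.read_csv("churn.csv"))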
For more detailed guidance, see the Data Schema Validation and Machine Learning Pipelines topics.