How to Manage Data in Azure Machine Learning

Effective data management is crucial for building reliable and scalable machine‑learning solutions. Azure Machine Learning provides several services and patterns to help you ingest, store, version, and preprocess data.

Data Stores
Data Versioning
Data Preparation with Data Wrangler
Best Practices

Data Stores

You can register Azure storage resources as datastores in your workspace and reference them in jobs and pipelines without embedding storage credentials in your code.

from azure.ai.ml import MLClient
from azure.ai.ml.entities import AccountKeyConfiguration, AzureBlobDatastore
from azure.identity import DefaultAzureCredential

# Authenticate against the workspace
# (subscription_id, resource_group, and workspace_name are defined elsewhere)
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace_name
)

# Register a Blob container as a datastore
blob_datastore = AzureBlobDatastore(
    name="myblobstore",
    account_name="myaccount",
    container_name="datasets",
    credentials=AccountKeyConfiguration(account_key="<account-key>"),
)
ml_client.datastores.create_or_update(blob_datastore)
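
Once registered, a datastore can be retrieved by name, and files inside it are addressed with azureml:// URIs, so jobs never handle raw storage keys. A minimal sketch continuing the example above (the raw/churn.csv path is hypothetical):

from azure.ai.ml import Input

# Confirm the registration by fetching the datastore
datastore = ml_client.datastores.get("myblobstore")
print(datastore.container_name)

# Reference a file in the datastore as a job input;
# datastores/<name>/paths/<relative-path> is the standard URI layout
training_input = Input(
    type="uri_file",
    path="azureml://datastores/myblobstore/paths/raw/churn.csv",
)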

Supported store types:

  • Azure Blob Storage
  • Azure Data Lake Storage Gen1 and Gen2
  • Azure Files

Data Versioning

Use Data Assets to version datasets. Each version is immutable and can be referenced by name and version number.

from azure.ai.ml import Input, command
from azure.ai.ml.entities import Data

# Register a new dataset version as a Data Asset
data_asset = Data(
    name="customer_churn",
    version="1",
    path="azureml://datastores/myblobstore/paths/churn.csv",
    type="uri_file",
)
ml_client.data.create_or_update(data_asset)

# Reference the registered version (azureml:<name>:<version>) in a pipeline step
step = command(
    name="train",
    inputs={"training_data": Input(type="uri_file", path="azureml:customer_churn:1")},
    ...
)

Benefits:

  • Traceability of data used in experiments
  • Easy rollback to previous dataset versions (see the sketch after this list)
  • Integration with MLflow tracking
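
Rolling back amounts to pointing a job at an earlier registered version. A short sketch, reusing the customer_churn asset registered above:

# Fetch a specific version of the data asset for inspection or rollback
old_asset = ml_client.data.get(name="customer_churn", version="1")
print(old_asset.path)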

Data Preparation with Data Wrangler

Data Wrangler offers a visual UI for cleaning, transforming, and profiling data. It also generates reusable Python scripts.

  • Connect to registered data stores
  • Apply operations: filter, split, impute, encode
  • Export as a pipeline step or a reusable component

Example export (illustrative pseudocode; data_wrangler is not a documented SDK object, so treat the shape below as a sketch rather than a callable API):

# Illustrative only: export the wrangling steps as a job that reads one
# registered data asset version and writes another
wrangler_job = data_wrangler.create_job(
    source="azureml:customer_churn:2",
    steps=[...],
    output="azureml:clean_churn:1"
)
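
As noted above, the scripts Data Wrangler generates are ordinary pandas code that can be reviewed and reused outside the UI. A hypothetical generated transform (the column names are assumptions, not part of any real dataset):

import pandas as pd

def clean_churn(df: pd.DataFrame) -> pd.DataFrame:
    # Drop rows that are missing the label
    df = df.dropna(subset=["churned"])
    # Impute missing tenure values with the median
    df["tenure"] = df["tenure"].fillna(df["tenure"].median())
    # One-hot encode the contract type
    return pd.get_dummies(df, columns=["contract_type"])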

Best Practices

  1. Store raw data in immutable locations (e.g., a dedicated Blob container).
  2. Use Data Assets for every processed version.
  3. Tag datasets with metadata (e.g., source, collection date).
  4. Automate data validation in the pipeline (schema checks, statistical thresholds); a minimal sketch follows this list.
  5. Use Microsoft Purview (formerly Azure Purview) for data governance across the organization.
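
Practices 3 and 4 are straightforward to automate in code. A minimal validation sketch that could run as the first step of a pipeline (the column names and the 1% threshold are assumptions):

import pandas as pd

EXPECTED_COLUMNS = {"customer_id", "tenure", "contract_type", "churned"}

def validate(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    # Schema check: fail fast if expected columns are missing
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    # Statistical threshold: reject batches with too many null labels
    if df["churned"].isna().mean() > 0.01:
        raise ValueError("More than 1% of labels are null")
    return df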

For more detailed guidance, see the Data Schema Validation and Machine Learning Pipelines topics.