Azure Machine Learning Service – Data Management

Overview

Data management in Azure Machine Learning (AML) provides a unified experience for registering, versioning, and accessing data assets throughout the ML lifecycle. It supports cloud and on-premises data stores, integrates with Azure Blob Storage, Azure Data Lake Storage, and Azure SQL Database, and allows seamless data movement between compute targets.

Key Concepts

  • Datastores – Secure references to storage accounts, databases, or file shares; connection credentials are kept in the workspace rather than in code.
  • Datasets – Named, versioned pointers to data files or tables used for training and inference.
  • Data Versioning – Immutable snapshots of a dataset that enable reproducibility.
  • Data Ingestion – Automated pipelines to pull data from external sources into AML.
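These concepts can be illustrated without touching the cloud. The toy in-memory registry below is purely illustrative (not part of the SDK); it mimics the immutability guarantee that AML's dataset versioning provides:

```python
class ToyDatasetRegistry:
    """In-memory stand-in for AML's dataset registry (illustration only)."""

    def __init__(self):
        self._versions = {}  # (name, version) -> path

    def register(self, name, version, path):
        key = (name, version)
        if key in self._versions:
            # Mirrors AML behaviour: an existing version is never overwritten.
            raise ValueError(f"{name}:{version} already exists; register a new version")
        self._versions[key] = path

    def get(self, name, version):
        return self._versions[(name, version)]


registry = ToyDatasetRegistry()
registry.register("customer-churn", "1", "churn_v1.csv")
registry.register("customer-churn", "2", "churn_v2.csv")
print(registry.get("customer-churn", "2"))  # churn_v2.csv
```

Attempting to re-register version "1" would raise, which is exactly the discipline the real service enforces for reproducibility.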

Register a Data Store

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azure.ai.ml.entities import AzureBlobDatastore

credential = DefaultAzureCredential()

# Replace the placeholders with your own workspace details.
ml_client = MLClient(
    credential,
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

blob_datastore = AzureBlobDatastore(
    name="myblobstore",
    account_name="myblobaccount",
    container_name="ml-data",
    # endpoint defaults to "core.windows.net"; a full URL is not expected here
)

ml_client.datastores.create_or_update(blob_datastore)

Create a Versioned Dataset

from azure.ai.ml.entities import Data

dataset = Data(
    name="customer-churn",
    version="1",
    path="azureml://datastores/myblobstore/paths/churn.csv",
    type="uri_file",
    description="Customer churn dataset for classification"
)

ml_client.data.create_or_update(dataset)
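The `path` value above follows the `azureml://datastores/<datastore-name>/paths/<relative-path>` convention. A small helper (hypothetical, not an SDK function) makes that construction explicit:

```python
def datastore_uri(datastore: str, relative_path: str) -> str:
    """Build an azureml:// URI for a file or folder in a registered datastore.

    Illustrative helper only; the SDK simply accepts the string directly.
    """
    return f"azureml://datastores/{datastore}/paths/{relative_path.lstrip('/')}"


print(datastore_uri("myblobstore", "churn.csv"))
# azureml://datastores/myblobstore/paths/churn.csv
```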

Accessing Data in a Training Script

With the v2 SDK, a job passes the dataset path to the script as a command-line argument (bound through ${{inputs.<name>}} in the job's command), rather than through the v1 AZUREML_DATAREFERENCE_* environment variables:

import argparse

import pandas as pd


def main():
    # The job binds this argument to the mounted/downloaded input path.
    parser = argparse.ArgumentParser()
    parser.add_argument("--training-data", type=str, required=True)
    args = parser.parse_args()

    df = pd.read_csv(args.training_data)
    print(df.head())


if __name__ == "__main__":
    main()
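Whichever mechanism hands over the path, the training script ultimately sees an ordinary local file once AML mounts or downloads the input. A dependency-free sketch of reading such a file with the stdlib `csv` module (the sample file is created inline so the snippet runs anywhere):

```python
import csv
import os
import tempfile

# Create a tiny stand-in for the mounted churn.csv (illustration only).
sample = "customer_id,churned\n1,0\n2,1\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write(sample)
    data_path = f.name  # in a real job, this path comes from the job input

with open(data_path, newline="") as f:
    rows = list(csv.DictReader(f))

print(rows[0])  # {'customer_id': '1', 'churned': '0'}
os.remove(data_path)
```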

Data Versioning Best Practices

  • Immutable versions – Never overwrite an existing version; create a new one.
  • Semantic versioning – Use major.minor.patch to reflect data changes.
  • Metadata tagging – Add tags such as raw, processed, sample for discoverability.
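The semantic-versioning practice lends itself to a tiny helper. The `bump_version` function below is hypothetical (version bumping is not an SDK feature); it simply computes the next `major.minor.patch` string to use when registering a new dataset version:

```python
def bump_version(version: str, level: str = "patch") -> str:
    """Return the next major.minor.patch version string.

    level is one of "major", "minor", "patch" (illustrative helper).
    """
    major, minor, patch = (int(p) for p in version.split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    if level == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown level: {level}")


print(bump_version("1.2.3", "minor"))  # 1.3.0
```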

Sample Pipeline with Data Ingestion

from azure.ai.ml import dsl, Input, Output
from azure.ai.ml.entities import CommandComponent

ingest_component = CommandComponent(
    name="ingest_raw_data",
    version="1",
    inputs={"source_url": Input(type="string")},
    outputs={"raw_dataset": Output(type="uri_folder")},
    command="python ingest.py --source ${{inputs.source_url}} --dest ${{outputs.raw_dataset}}",
    code="./src",  # folder containing ingest.py
    environment="azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",  # any environment with Python
)

# train_component is assumed to be defined elsewhere, with a `training_data`
# input and an mlflow_model output named `model_output`.
@dsl.pipeline(name="churn-pipeline", description="End-to-end churn training")
def churn_pipeline(source_url: str):
    ingest = ingest_component(source_url=source_url)
    train = train_component(training_data=ingest.outputs.raw_dataset)
    return {"model": train.outputs.model_output}

pipeline_job = churn_pipeline(source_url="https://example.com/churn.csv")
ml_client.jobs.create_or_update(pipeline_job, experiment_name="churn-experiment")