Overview
Data management in Azure Machine Learning (AML) provides a unified experience for registering, versioning, and accessing data assets throughout the ML lifecycle. It supports both cloud and on-premises storage, integrates with Azure Blob Storage, Azure Data Lake Storage, and Azure SQL Database, and makes registered data available to any compute target without manual copying.
Key Concepts
- Data Stores – Secure references to storage accounts, databases, or file shares.
- Datasets – Named, versioned pointers to data files or tables used for training and inference.
- Data Versioning – Immutable snapshots of a dataset that enable reproducibility.
- Data Ingestion – Automated pipelines to pull data from external sources into AML.
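Each of these concepts maps onto an operation of the `MLClient` object used throughout this article. As a quick orientation, the sketch below (assuming an authenticated `ml_client`, created as shown in the next section) lists what is already registered in a workspace:

```python
# Enumerate existing datastores and data assets in the workspace
for ds in ml_client.datastores.list():
    print("datastore:", ds.name)

for asset in ml_client.data.list():
    print("data asset:", asset.name)
```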
Register a Data Store
```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azure.ai.ml.entities import AzureBlobDatastore

# Replace the placeholders with your own workspace details
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
workspace_name = "<workspace-name>"

credential = DefaultAzureCredential()
ml_client = MLClient(credential, subscription_id, resource_group, workspace_name)

# Register a blob container as a datastore in the workspace
blob_datastore = AzureBlobDatastore(
    name="myblobstore",
    account_name="myblobaccount",
    container_name="ml-data",
)

ml_client.datastores.create_or_update(blob_datastore)
```
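To confirm the registration, or to look up a datastore created earlier, fetch it back by name. A minimal sketch:

```python
# Retrieve the registered datastore and inspect its settings
dstore = ml_client.datastores.get("myblobstore")
print(dstore.name, dstore.account_name, dstore.container_name)
```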
Create a Versioned Dataset
```python
from azure.ai.ml.entities import Data

dataset = Data(
    name="customer-churn",
    version="1",
    path="azureml://datastores/myblobstore/paths/churn.csv",
    type="uri_file",
    description="Customer churn dataset for classification",
)

ml_client.data.create_or_update(dataset)
```
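Once registered, consumers pin the exact version they need, which is what makes training runs reproducible. A minimal sketch of retrieving and inspecting the asset:

```python
# Retrieve a specific, immutable version of the data asset
churn_v1 = ml_client.data.get(name="customer-churn", version="1")
print(churn_v1.path)

# List every registered version of the asset
for v in ml_client.data.list(name="customer-churn"):
    print(v.name, v.version)
```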
Accessing Data in a Training Script
```python
import argparse
import pandas as pd

def main():
    # The job definition passes the mounted data path to the script as a
    # command-line argument (see the submission sketch below)
    parser = argparse.ArgumentParser()
    parser.add_argument("--training_data", type=str, help="Path to the input data")
    args = parser.parse_args()

    df = pd.read_csv(args.training_data)
    print(df.head())

if __name__ == "__main__":
    main()
```
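The script above receives its `--training_data` argument from the job that runs it. The following sketch submits such a command job, binding version 1 of the registered data asset to that input; the `train.py` file name, environment, and compute values are placeholders:

```python
from azure.ai.ml import command, Input

job = command(
    code="./src",  # placeholder: folder containing the training script (train.py)
    command="python train.py --training_data ${{inputs.training_data}}",
    inputs={
        "training_data": Input(type="uri_file", path="azureml:customer-churn:1"),
    },
    environment="azureml:<environment-name>@latest",  # placeholder: an environment that includes pandas
    compute="<compute-cluster-name>",                 # placeholder: your compute target
)
ml_client.jobs.create_or_update(job, experiment_name="churn-experiment")
```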
Data Versioning Best Practices
| Practice | Description |
|---|---|
| Immutable versions | Never overwrite an existing version; create a new one instead. |
| Semantic versioning | Use major.minor.patch version strings to reflect the scope of data changes. |
| Metadata tagging | Add tags such as `raw`, `processed`, and `sample` for discoverability. |
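To illustrate the tagging practice, tags are plain key/value pairs supplied when a data asset (or a new version of it) is registered. A sketch, reusing the `customer-churn` asset with an illustrative path and tag values:

```python
from azure.ai.ml.entities import Data

# Register a processed variant of the data with discoverability tags
processed = Data(
    name="customer-churn",
    version="2",
    path="azureml://datastores/myblobstore/paths/churn_processed.csv",  # illustrative path
    type="uri_file",
    tags={"stage": "processed", "source": "crm-export"},  # illustrative tag values
)
ml_client.data.create_or_update(processed)
```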
Sample Pipeline with Data Ingestion
```python
from azure.ai.ml import dsl, Input, Output
from azure.ai.ml.entities import CommandComponent

ingest_component = CommandComponent(
    name="ingest_raw_data",
    version="1",
    inputs={"source_url": Input(type="string")},
    outputs={"raw_dataset": Output(type="uri_folder")},
    command="python ingest.py --source ${{inputs.source_url}} --dest ${{outputs.raw_dataset}}",
    code="./src",  # placeholder: folder containing ingest.py
    environment="azureml:<environment-name>@latest",  # placeholder: an environment able to run ingest.py
)

@dsl.pipeline(name="churn-pipeline", description="End-to-end churn training")
def churn_pipeline(source_url: str):
    ingest = ingest_component(source_url=source_url)
    # train_component is assumed to be defined or loaded elsewhere (e.g. with load_component)
    train = train_component(
        training_data=ingest.outputs.raw_dataset,
        model_output=Output(type="mlflow_model"),
    )
    return {"model": train.outputs.model_output}

pipeline_job = churn_pipeline(source_url="https://example.com/churn.csv")
ml_client.jobs.create_or_update(pipeline_job, experiment_name="churn-experiment")
```
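The `ingest_component` command assumes an `ingest.py` script that copies data from `--source` into the `--dest` output folder; that script is not shown above, so the following is only a minimal sketch of what it might look like:

```python
# ingest.py: hypothetical ingestion script matching the component's command
import argparse
import os
import urllib.request

parser = argparse.ArgumentParser()
parser.add_argument("--source", type=str, help="URL of the raw data")
parser.add_argument("--dest", type=str, help="Output folder provided by the pipeline")
args = parser.parse_args()

os.makedirs(args.dest, exist_ok=True)
urllib.request.urlretrieve(args.source, os.path.join(args.dest, "churn.csv"))
```

The job object returned by `create_or_update` carries the run name, which can be passed to `ml_client.jobs.stream()` to follow the pipeline's logs from the console.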