Overview
Data management in Azure Machine Learning (AML) provides a unified experience for registering, versioning, and accessing data assets throughout the ML lifecycle. It supports both cloud and on-premises storage, integrates with Azure Blob Storage, Azure Data Lake Storage, and Azure SQL Database, and makes registered data available to any compute target without manual copying.
Key Concepts
- Data Stores – Secure references to storage accounts, databases, or file shares.
- Datasets – Named, versioned pointers to data files or tables used for training and inference.
- Data Versioning – Immutable snapshots of a dataset that enable reproducibility.
- Data Ingestion – Automated pipelines to pull data from external sources into AML.
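Each of these concepts maps onto an operation of the `MLClient` object used throughout this article. As a quick orientation, the sketch below (assuming an authenticated `ml_client`, created as shown in the next section) lists what is already registered in a workspace:

```python
# Enumerate existing datastores and data assets in the workspace
for ds in ml_client.datastores.list():
    print("datastore:", ds.name)

for asset in ml_client.data.list():
    print("data asset:", asset.name)
```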
Register a Data Store
```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azure.ai.ml.entities import AzureBlobDatastore

# Replace the placeholders with your own workspace details
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
workspace_name = "<workspace-name>"

credential = DefaultAzureCredential()
ml_client = MLClient(credential, subscription_id, resource_group, workspace_name)

# Register a blob container as a datastore in the workspace
blob_datastore = AzureBlobDatastore(
    name="myblobstore",
    account_name="myblobaccount",
    container_name="ml-data",
)

ml_client.datastores.create_or_update(blob_datastore)
```
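To confirm the registration, or to look up a datastore created earlier, fetch it back by name. A minimal sketch:

```python
# Retrieve the registered datastore and inspect its settings
dstore = ml_client.datastores.get("myblobstore")
print(dstore.name, dstore.account_name, dstore.container_name)
```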
Create a Versioned Dataset
```python
from azure.ai.ml.entities import Data

dataset = Data(
    name="customer-churn",
    version="1",
    path="azureml://datastores/myblobstore/paths/churn.csv",
    type="uri_file",
    description="Customer churn dataset for classification",
)

ml_client.data.create_or_update(dataset)
```
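Once registered, consumers pin the exact version they need, which is what makes training runs reproducible. A minimal sketch of retrieving and inspecting the asset:

```python
# Retrieve a specific, immutable version of the data asset
churn_v1 = ml_client.data.get(name="customer-churn", version="1")
print(churn_v1.path)

# List every registered version of the asset
for v in ml_client.data.list(name="customer-churn"):
    print(v.name, v.version)
```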
Accessing Data in a Training Script
```python
import argparse
import pandas as pd

def main():
    # The job definition passes the mounted data path to the script as a
    # command-line argument (see the submission sketch below)
    parser = argparse.ArgumentParser()
    parser.add_argument("--training_data", type=str, help="Path to the input data")
    args = parser.parse_args()

    df = pd.read_csv(args.training_data)
    print(df.head())

if __name__ == "__main__":
    main()
```
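The script above receives its `--training_data` argument from the job that runs it. The following sketch submits such a command job, binding version 1 of the registered data asset to that input; the `train.py` file name, environment, and compute values are placeholders:

```python
from azure.ai.ml import command, Input

job = command(
    code="./src",  # placeholder: folder containing the training script (train.py)
    command="python train.py --training_data ${{inputs.training_data}}",
    inputs={
        "training_data": Input(type="uri_file", path="azureml:customer-churn:1"),
    },
    environment="azureml:<environment-name>@latest",  # placeholder: an environment that includes pandas
    compute="<compute-cluster-name>",                 # placeholder: your compute target
)
ml_client.jobs.create_or_update(job, experiment_name="churn-experiment")
```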
Data Versioning Best Practices
| Practice | Description |
|---|---|
| Immutable versions | Never overwrite an existing version; create a new one instead. |
| Semantic versioning | Use major.minor.patch version strings to reflect the scope of data changes. |
| Metadata tagging | Add tags such as `raw`, `processed`, and `sample` for discoverability. |
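To illustrate the tagging practice, tags are plain key/value pairs supplied when a data asset (or a new version of it) is registered. A sketch, reusing the `customer-churn` asset with an illustrative path and tag values:

```python
from azure.ai.ml.entities import Data

# Register a processed variant of the data with discoverability tags
processed = Data(
    name="customer-churn",
    version="2",
    path="azureml://datastores/myblobstore/paths/churn_processed.csv",  # illustrative path
    type="uri_file",
    tags={"stage": "processed", "source": "crm-export"},  # illustrative tag values
)
ml_client.data.create_or_update(processed)
```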
Sample Pipeline with Data Ingestion
```python
from azure.ai.ml import dsl, Input, Output
from azure.ai.ml.entities import CommandComponent

ingest_component = CommandComponent(
    name="ingest_raw_data",
    version="1",
    inputs={"source_url": Input(type="string")},
    outputs={"raw_dataset": Output(type="uri_folder")},
    command="python ingest.py --source ${{inputs.source_url}} --dest ${{outputs.raw_dataset}}",
    code="./src",  # placeholder: folder containing ingest.py
    environment="azureml:<environment-name>@latest",  # placeholder: an environment able to run ingest.py
)

@dsl.pipeline(name="churn-pipeline", description="End-to-end churn training")
def churn_pipeline(source_url: str):
    ingest = ingest_component(source_url=source_url)
    # train_component is assumed to be defined or loaded elsewhere (e.g. with load_component)
    train = train_component(
        training_data=ingest.outputs.raw_dataset,
        model_output=Output(type="mlflow_model"),
    )
    return {"model": train.outputs.model_output}

pipeline_job = churn_pipeline(source_url="https://example.com/churn.csv")
ml_client.jobs.create_or_update(pipeline_job, experiment_name="churn-experiment")
```
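The `ingest_component` command assumes an `ingest.py` script that copies data from `--source` into the `--dest` output folder; that script is not shown above, so the following is only a minimal sketch of what it might look like:

```python
# ingest.py: hypothetical ingestion script matching the component's command
import argparse
import os
import urllib.request

parser = argparse.ArgumentParser()
parser.add_argument("--source", type=str, help="URL of the raw data")
parser.add_argument("--dest", type=str, help="Output folder provided by the pipeline")
args = parser.parse_args()

os.makedirs(args.dest, exist_ok=True)
urllib.request.urlretrieve(args.source, os.path.join(args.dest, "churn.csv"))
```

The job object returned by `create_or_update` carries the run name, which can be passed to `ml_client.jobs.stream()` to follow the pipeline's logs from the console.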