How to Manage Data in Azure Machine Learning
Effective data management is crucial for building reliable and scalable machine‑learning solutions. Azure Machine Learning provides several services and patterns to help you ingest, store, version, and preprocess data.
In this article:
- Data Stores
- Data Versioning
- Data Preparation with Data Wrangler
Data Stores
You can register Azure storage resources as Data Stores in your workspace and reference them in pipelines.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AzureBlobDatastore, AccountKeyConfiguration
from azure.identity import DefaultAzureCredential

# Connect to the workspace (subscription_id, resource_group, and
# workspace_name are assumed to be defined)
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace_name
)

# Register a Blob container as a datastore
blob_datastore = AzureBlobDatastore(
    name="myblobstore",
    account_name="myaccount",
    container_name="datasets",
    credentials=AccountKeyConfiguration(account_key="<account-key>"),
)
ml_client.datastores.create_or_update(blob_datastore)
Supported store types:
- Azure Blob Storage
- Azure Data Lake Storage Gen2
- Azure Files
- Azure SQL Database
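Other store types follow the same pattern. As a minimal sketch, registering an Azure Data Lake Storage Gen2 store with service-principal credentials might look like this (the account, filesystem, and credential values are placeholders):

from azure.ai.ml.entities import AzureDataLakeGen2Datastore, ServicePrincipalConfiguration

adls_datastore = AzureDataLakeGen2Datastore(
    name="myadlsstore",
    account_name="myaccount",
    filesystem="datasets",  # the ADLS Gen2 filesystem (container)
    credentials=ServicePrincipalConfiguration(
        tenant_id="<tenant-id>",
        client_id="<client-id>",
        client_secret="<client-secret>",
    ),
)
ml_client.datastores.create_or_update(adls_datastore)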
Data Versioning
Use Data Assets to version datasets. Each version is immutable and can be referenced by name and version number.
from azure.ai.ml import Input, command
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Register a dataset version
churn_data = Data(
    name="customer_churn",
    version="1",
    path="azureml://datastores/myblobstore/paths/churn.csv",
    type=AssetTypes.URI_FILE,
)
ml_client.data.create_or_update(churn_data)
# Reference the registered asset in a pipeline step
step = command(
    name="train",
    inputs={"training_data": Input(type="uri_file", path="azureml:customer_churn:1")},
    ...
)
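To always consume the newest registered version instead of pinning one, reference the asset with the @latest label:

Input(type="uri_file", path="azureml:customer_churn@latest")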
Benefits:
- Traceability of data used in experiments
- Easy rollback to previous dataset versions (see the sketch after this list)
- Integration with MLflow tracking
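For example, rolling back is just a matter of fetching and referencing an earlier version (the asset name and version follow the registration above):

# Fetch a specific earlier version of the asset
previous = ml_client.data.get(name="customer_churn", version="1")
print(previous.path)

# Enumerate all registered versions of the asset
for version in ml_client.data.list(name="customer_churn"):
    print(version.version)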
Data Preparation with Data Wrangler
Data Wrangler offers a visual UI for cleaning, transforming, and profiling data. It also generates reusable Python scripts.
- Connect to registered data stores
- Apply operations: filter, split, impute, encode
- Export as a pipeline step or a reusable component
Example of the kind of script Data Wrangler generates (illustrative only; the actual code and column names depend on the operations you apply):

import pandas as pd

df = pd.read_csv("churn.csv")
# Impute missing values in a numeric column (hypothetical column name)
df["tenure"] = df["tenure"].fillna(df["tenure"].median())
# One-hot encode a categorical column (hypothetical column name)
df = pd.get_dummies(df, columns=["contract_type"])
# Write the cleaned file; it can then be registered as a new data asset
df.to_csv("clean_churn.csv", index=False)
Best Practices
- Store raw data in immutable locations (e.g., a dedicated Blob container).
- Use Data Assets for every processed version.
- Tag datasets with metadata (e.g., source, collection date).
- Automate data validation in the pipeline (schema checks, statistic thresholds); a minimal sketch follows this list.
- Use Microsoft Purview for data governance across the organization.
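A minimal validation step might look like the following (the expected columns, dtypes, and the 5% null threshold are hypothetical):

import pandas as pd

# Hypothetical expected schema: column name -> pandas dtype
EXPECTED_COLUMNS = {"customer_id": "int64", "tenure": "float64", "churned": "bool"}

def validate(df: pd.DataFrame) -> None:
    # Schema check: every expected column must exist with the expected dtype
    for col, dtype in EXPECTED_COLUMNS.items():
        assert col in df.columns, f"missing column: {col}"
        assert str(df[col].dtype) == dtype, f"{col}: got {df[col].dtype}, expected {dtype}"
    # Statistic threshold: fail if more than 5% of rows contain missing values
    null_ratio = df.isna().any(axis=1).mean()
    assert null_ratio <= 0.05, f"{null_ratio:.1%} of rows have missing values"

validate(pd.read_csv("churn.csv"))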
For more detailed guidance, see the Data Schema Validation and Machine Learning Pipelines topics.