Introduction
This guide walks you through building reproducible, scalable, and automated Machine Learning pipelines using Azure Machine Learning. Pipelines let you stitch together data processing, model training, evaluation, and deployment steps into a single workflow.
Prerequisites
- Azure subscription with Machine Learning workspace provisioned.
- Python 3.8+ installed locally.
- Azure ML SDK for Python (azureml-core, azureml-pipeline).
- Basic understanding of Azure ML concepts such as workspaces, experiments, and compute targets.

Install the required packages:
pip install azureml-core azureml-pipeline azureml-dataprep pandas scikit-learn
Create a Pipeline
First, create a workspace object and define an Experiment that will hold your pipeline runs.
from azureml.core import Workspace, Experiment

# Load the workspace from the config.json downloaded from the Azure portal
ws = Workspace.from_config()

# The experiment groups all runs of this pipeline
experiment = Experiment(ws, name="pipeline-demo")
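The steps defined in the next section run on compute targets named cpu-cluster and gpu-cluster. Those names are placeholders and must match clusters that already exist in your workspace; here is a minimal sketch for creating one if it is missing (the VM size and node count are illustrative assumptions):

from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

try:
    # Reuse the cluster if it already exists in the workspace
    cpu_cluster = ComputeTarget(workspace=ws, name="cpu-cluster")
except ComputeTargetException:
    # Provision a new cluster; pick a VM size that fits your workload
    config = AmlCompute.provisioning_configuration(vm_size="STANDARD_DS3_V2",
                                                   max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, "cpu-cluster", config)
    cpu_cluster.wait_for_completion(show_output=True)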
Add Steps
Define individual steps (for example, data preparation and training) using PythonScriptStep, and use PipelineData objects to pass intermediate data between them.

from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

datastore = ws.get_default_datastore()

# Intermediate data passed from the prep step to the training step
prepared_data = PipelineData("prepared_data", datastore=datastore)
# Trained model produced by the training step
model_output = PipelineData("model", datastore=datastore)

prep_step = PythonScriptStep(
    name="DataPrep",
    source_directory="scripts",
    script_name="prepare_data.py",
    compute_target="cpu-cluster",
    arguments=["--output", prepared_data],
    outputs=[prepared_data]
)

train_step = PythonScriptStep(
    name="TrainModel",
    source_directory="scripts",
    script_name="train.py",
    compute_target="gpu-cluster",
    arguments=["--input", prepared_data, "--model", model_output],
    inputs=[prepared_data],
    outputs=[model_output],
    allow_reuse=False
)

pipeline = Pipeline(workspace=ws, steps=[prep_step, train_step])
pipeline.validate()
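For reference, here is a minimal sketch of what scripts/train.py might look like. The argument names match those passed above, but the data schema (a prepared.csv file with a label column written by the prep step) and the scikit-learn model are illustrative assumptions:

# scripts/train.py -- illustrative sketch; data schema and model are assumptions
import argparse
import os

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

parser = argparse.ArgumentParser()
parser.add_argument("--input", type=str)   # mount path of the prepared_data PipelineData
parser.add_argument("--model", type=str)   # path where the model output is written
args = parser.parse_args()

# Read the file the prep step is assumed to have written
df = pd.read_csv(os.path.join(args.input, "prepared.csv"))
X, y = df.drop(columns=["label"]), df["label"]

model = LogisticRegression().fit(X, y)

# The PipelineData path's parent directory may not exist yet
os.makedirs(os.path.dirname(args.model), exist_ok=True)
joblib.dump(model, args.model)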
Run the Pipeline
Submit the pipeline to Azure ML and monitor the run.
# regenerate_outputs=False lets steps reuse cached outputs where allowed
run = experiment.submit(pipeline, regenerate_outputs=False)
run.wait_for_completion(show_output=True)
Monitoring & Logging
Use the Azure portal or SDK to track progress, view logs, and retrieve artifacts.
# List all runs in the experiment
for r in experiment.get_runs():
    print(r.id, r.status)

# The model is an output of the TrainModel step, so fetch that step run
# and download its "model" output rather than downloading from the pipeline run
train_step_run = run.find_step_run("TrainModel")[0]
train_step_run.get_output_data("model").download(local_path=".")
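You can also pull up the run directly in Azure ML studio from the SDK:

# Direct link to the run page in Azure ML studio
print(run.get_portal_url())

# Status and log file locations for the run
details = run.get_details()
print(details["status"])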
Best Practices
- Version your data and code using PipelineData and Git.
- Enable allow_reuse=True for steps that don’t change often to speed up subsequent runs.
- Use PipelineParameter objects to make pipelines reusable across environments (see the sketch after this list).
- Leverage Azure ML Compute Clusters for scalable parallel execution.
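A minimal sketch of parameterization with PipelineParameter; the parameter name and the training script's flag are assumptions:

from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

# Default value can be overridden per submission without editing the pipeline
lr_param = PipelineParameter(name="learning_rate", default_value=0.01)

train_step = PythonScriptStep(
    name="TrainModel",
    source_directory="scripts",
    script_name="train.py",
    compute_target="gpu-cluster",
    arguments=["--learning-rate", lr_param]
)

# Override the default at submission time:
# experiment.submit(pipeline, pipeline_parameters={"learning_rate": 0.1})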
For more advanced scenarios, explore Datasets, Model deployment, and AutoML pipelines.