PyTorch Distributed Training with Azure Machine Learning
This guide explains how to run PyTorch Distributed workloads on Azure Machine Learning (AML) using the torchrun launcher and AML compute clusters.
Prerequisites
- An Azure subscription with access to Azure Machine Learning.
- An AML workspace (you can follow the Create workspace guide to set one up).
- Azure CLI 2.30+ with the az ml extension installed (see the command below).
- Python 3.8-3.11 and torch>=2.0 in your environment.
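If the ml extension isn't installed yet, it can be added (or upgraded) directly from the Azure CLI:

# sign in, install the v2 ml extension, and confirm the CLI version
az login
az extension add --name ml --upgrade
az version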
Set up a Distributed Environment
Create a compute cluster capable of multi‑node training:
az ml compute create \
  --name pytorch-cluster \
  --type AmlCompute \
  --min-instances 2 \
  --max-instances 5 \
  --size Standard_NC6s_v3
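The command above assumes a default resource group and workspace are already configured for the Azure CLI; if they aren't, set them once (the values here are placeholders):

az configure --defaults group=YOUR_RG workspace=YOUR_WS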
Define an environment.yml conda specification for the training environment:
name: pytorch-distributed
channels:
  - defaults
dependencies:
  - python=3.10
  - pip
  - pip:
      - torch
      - torchvision
      - azureml-core
      - azureml-defaults
      - pandas
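The conda file by itself doesn't register anything in the workspace. The job later in this guide references the environment as pytorch-distributed:1, so register it first. A minimal sketch using the CLI; the base image shown is one example of an Azure ML CUDA base image, so substitute whichever curated image matches your CUDA and torch versions:

az ml environment create \
  --name pytorch-distributed \
  --version 1 \
  --conda-file environment.yml \
  --image mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.8-cudnn8-ubuntu22.04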
Example workflow
Below is a minimal script (train.py) that uses torch.distributed to train a simple CNN on MNIST across two nodes.
import os

import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torchvision import datasets, transforms


def setup(rank, world_size):
    # torchrun / Azure ML inject MASTER_ADDR and MASTER_PORT for multi-node jobs;
    # only fall back to localhost for single-node debugging runs.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "12355")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)


def cleanup():
    dist.destroy_process_group()


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 32, 3, 1)
        # 28x28 input -> 26x26 after the 3x3 conv: 32 * 26 * 26 = 21632 features
        self.fc = nn.Linear(32 * 26 * 26, 10)

    def forward(self, x):
        x = self.conv(x).relu()
        x = torch.flatten(x, 1)
        return self.fc(x)


def main():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    setup(rank, world_size)

    torch.manual_seed(0)
    # Pick the GPU by local rank: the global rank exceeds the per-node GPU count
    # as soon as more than one node is involved.
    device = torch.device(f"cuda:{local_rank}")

    model = Net().to(device)
    ddp_model = DDP(model, device_ids=[local_rank])

    transform = transforms.Compose([transforms.ToTensor()])
    dataset = datasets.MNIST(root=".", train=True, download=True, transform=transform)
    sampler = torch.utils.data.distributed.DistributedSampler(
        dataset, num_replicas=world_size, rank=rank
    )
    loader = torch.utils.data.DataLoader(dataset, batch_size=64, sampler=sampler)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.01)

    for epoch in range(5):
        sampler.set_epoch(epoch)
        for batch_idx, (data, target) in enumerate(loader):
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = ddp_model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            if batch_idx % 100 == 0 and rank == 0:
                print(f"Epoch {epoch} [{batch_idx}/{len(loader)}] Loss: {loss.item():.4f}")

    cleanup()


if __name__ == "__main__":
    main()
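Before submitting to the cluster, a quick single-process smoke test on a local machine can catch obvious errors (this assumes one visible GPU; switch the backend in setup() to gloo for a CPU-only box):

torchrun --standalone --nproc_per_node 1 train.py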
Submit the training job with the AML Python SDK (v2). When distribution is set to pytorch, AML injects MASTER_ADDR, MASTER_PORT, and NODE_RANK on every node, and the torchrun command below uses those values for its rendezvous:
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="YOUR_SUB_ID",
    resource_group_name="YOUR_RG",
    workspace_name="YOUR_WS",
)

job = command(
    code=".",  # folder that contains train.py
    command=(
        "torchrun --nnodes 2 --nproc_per_node 1 "
        "--node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT "
        "train.py"
    ),
    experiment_name="pytorch-distributed-mnist",
    display_name="PyTorch Distributed MNIST",
    description="Train MNIST with DDP on Azure ML",
    compute="pytorch-cluster",
    instance_count=2,  # two nodes, matching --nnodes above
    environment="pytorch-distributed:1",
    distribution={"type": "pytorch", "process_count_per_node": 1},
    environment_variables={"PYTHONUNBUFFERED": "1"},
)

returned_job = ml_client.jobs.create_or_update(job)
Note: Replace YOUR_SUB_ID, YOUR_RG, and YOUR_WS with your subscription ID, resource group, and workspace name.
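To follow the run while it executes, you can print the studio link and stream the logs from the job object returned above (a minimal sketch; jobs.stream() blocks until the run finishes):

print(returned_job.studio_url)            # link to the run in Azure ML studio
ml_client.jobs.stream(returned_job.name)  # tail the driver logs until completion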
Tips & best practices
- Use the nccl backend for GPU clusters; fall back to gloo for CPU-only runs (see the sketch after this list).
- Set torch.backends.cudnn.benchmark = True for faster convolutions when input sizes are fixed.
- Don't hardcode MASTER_ADDR and MASTER_PORT in multi-node jobs; AML injects them automatically.
- Use DistributedSampler so each worker sees a distinct shard of the data rather than duplicates.
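A minimal sketch of the backend selection mentioned in the first tip, so the same entry point works on both GPU and CPU clusters (init_distributed is an illustrative helper, not part of train.py above):

import torch
import torch.distributed as dist

def init_distributed(rank: int, world_size: int) -> str:
    # Prefer NCCL when CUDA devices are visible; fall back to Gloo for CPU-only runs.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    return backend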