PyTorch Distributed Training with Azure Machine Learning

This guide explains how to run distributed PyTorch (DDP) workloads on Azure Machine Learning (AML) compute clusters. Locally you can launch the training script with torchrun; on AML, the job's pytorch distribution setting performs the equivalent per-process launch.

Prerequisites
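
To follow along you need:

- An Azure subscription and an Azure Machine Learning workspace
- The Azure CLI with the ml extension (az extension add -n ml) for the compute and environment commands below
- Python with the azure-ai-ml and azure-identity packages installed for job submission
- Sufficient GPU quota in your region for the Standard_NC6s_v3 VM size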

Setup a Distributed Environment

Create a compute cluster capable of multi‑node training:

az ml compute create \
    --name pytorch-cluster \
    --type AmlCompute \
    --min-instances 2 \
    --max-instances 5 \
    --size Standard_NC6s_v3
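
You can confirm the cluster provisioned correctly before submitting work (this assumes your CLI defaults point at the target resource group and workspace):

az ml compute show --name pytorch-cluster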

Define an environment.yml describing the conda environment for the training container:

name: pytorch-distributed
channels:
  - defaults
dependencies:
  - python=3.10
  - pip
  - pip:
    - torch
    - torchvision
    - azureml-core
    - azureml-defaults
    - pandas
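
The job below references this environment as pytorch-distributed:1, so register it first. A minimal sketch; the --image shown is one of Microsoft's curated CUDA base images and is an assumption here, so substitute whichever GPU base image matches your CUDA and PyTorch versions:

az ml environment create \
    --name pytorch-distributed \
    --conda-file environment.yml \
    --image mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.8-cudnn8-ubuntu22.04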

Example workflow

Below is a minimal script (train.py) that uses torch.distributed and DistributedDataParallel to train a simple CNN on MNIST across two nodes. It reads its rank, local rank, and world size from environment variables set by the launcher.

import os
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torchvision import datasets, transforms

def setup():
    # The launcher (torchrun, or AML's pytorch distribution) sets MASTER_ADDR,
    # MASTER_PORT, RANK, and WORLD_SIZE. Do not hardcode them here, or
    # multi-node rendezvous will break.
    dist.init_process_group(backend="nccl")

def cleanup():
    dist.destroy_process_group()

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 32, 3, 1)
        self.fc = nn.Linear(32 * 26 * 26, 10)  # 3x3 conv on 28x28 input -> 32 x 26 x 26

    def forward(self, x):
        x = self.conv(x).relu()
        x = torch.flatten(x, 1)
        return self.fc(x)

def main():
    rank = int(os.environ["RANK"])              # global rank across all nodes
    local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this node
    world_size = int(os.environ["WORLD_SIZE"])
    setup()

    torch.manual_seed(0)
    # Use the local rank, not the global rank, to pick the GPU; the global
    # rank exceeds the per-node GPU count as soon as you use a second node.
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")
    model = Net().to(device)
    ddp_model = DDP(model, device_ids=[local_rank])

    transform = transforms.Compose([transforms.ToTensor()])
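    # For simplicity, every rank downloads MNIST below; in a real job, stage
    # the dataset once (for example as an AML data asset) instead of
    # downloading it in each process.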
    dataset = datasets.MNIST(root=".", train=True, download=True, transform=transform)
    sampler = torch.utils.data.distributed.DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = torch.utils.data.DataLoader(dataset, batch_size=64, sampler=sampler)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.01)

    for epoch in range(5):
        sampler.set_epoch(epoch)
        for batch_idx, (data, target) in enumerate(loader):
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = ddp_model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            if batch_idx % 100 == 0 and rank == 0:
                print(f"Epoch {epoch} [{batch_idx}/{len(loader)}] Loss: {loss.item():.4f}")

    cleanup()

if __name__ == "__main__":
    main()
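
Before submitting to AML, you can smoke-test the script on a single-GPU machine with torchrun, which sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT for you:

torchrun --standalone --nproc_per_node 1 train.py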

Submit the training job using the AML Python SDK (v2). With distribution set to type pytorch, AML launches process_count_per_node processes on each of instance_count nodes and injects the RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT variables that train.py reads, so no separate torchrun invocation is needed inside the job:

from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="YOUR_SUB_ID",
    resource_group_name="YOUR_RG",
    workspace_name="YOUR_WS",
)

job = command(
    name="pytorch-distributed-mnist",
    display_name="PyTorch Distributed MNIST",
    description="Train MNIST with DDP on Azure ML",
    code=".",  # local folder containing train.py; uploaded with the job
    command="python train.py",
    compute="pytorch-cluster",
    environment="pytorch-distributed:1",
    instance_count=2,  # number of nodes
    distribution={"type": "pytorch", "process_count_per_node": 1},
    environment_variables={"PYTHONUNBUFFERED": "1"},
)

ml_client.jobs.create_or_update(job)

Note: Replace YOUR_SUB_ID, YOUR_RG, and YOUR_WS with your actual Azure subscription details.
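
If you prefer to tail logs from your terminal rather than the studio UI, capture the returned job and stream it; jobs.stream blocks until the job finishes:

returned_job = ml_client.jobs.create_or_update(job)
ml_client.jobs.stream(returned_job.name)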

Tips & best practices
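
- Call sampler.set_epoch(epoch) at the start of every epoch (as train.py does) so the DistributedSampler reshuffles data differently each epoch.
- Use the nccl backend for GPU training; fall back to gloo only for CPU-only runs.
- Restrict logging and checkpointing to rank 0 to avoid duplicate output and concurrent writes to the same file.
- Consider --min-instances 0 on the cluster so idle nodes scale down; the trade-off is cold-start latency on the next job.
- Keep PYTHONUNBUFFERED=1 set (as the job above does) so worker output appears in the logs promptly.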

Additional resources
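
- PyTorch DDP tutorial: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
- torch.distributed documentation: https://pytorch.org/docs/stable/distributed.html
- torchrun reference: https://pytorch.org/docs/stable/elastic/run.html
- Azure ML distributed GPU training guide: https://learn.microsoft.com/azure/machine-learning/how-to-train-distributed-gpu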