PyTorch Distributed Training with Azure Machine Learning
This guide explains how to run PyTorch Distributed workloads on Azure Machine Learning (AML) using the torchrun launcher and AML compute clusters.
Prerequisites
- An Azure subscription with access to Azure Machine Learning.
- An AML workspace (you can follow the Create workspace guide to set one up).
- Azure CLI 2.30+ with the az ml extension installed (see the command below).
- Python 3.8-3.11 and torch>=2.0 in your environment.
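If the ml extension isn't installed yet, it can be added (or upgraded) directly from the Azure CLI:

# sign in, install the v2 ml extension, and confirm the CLI version
az login
az extension add --name ml --upgrade
az version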
Set up a Distributed Environment
Create a compute cluster capable of multi‑node training:
az ml compute create \
  --name pytorch-cluster \
  --type AmlCompute \
  --min-instances 2 \
  --max-instances 5 \
  --size Standard_NC6s_v3
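The command above assumes a default resource group and workspace are already configured for the Azure CLI; if they aren't, set them once (the values here are placeholders):

az configure --defaults group=YOUR_RG workspace=YOUR_WS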
Define an environment.yml conda specification for the training environment:
name: pytorch-distributed
channels:
  - defaults
dependencies:
  - python=3.10
  - pip
  - pip:
      - torch
      - torchvision
      - azureml-core
      - azureml-defaults
      - pandas
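The conda file by itself doesn't register anything in the workspace. The job later in this guide references the environment as pytorch-distributed:1, so register it first. A minimal sketch using the CLI; the base image shown is one example of an Azure ML CUDA base image, so substitute whichever curated image matches your CUDA and torch versions:

az ml environment create \
  --name pytorch-distributed \
  --version 1 \
  --conda-file environment.yml \
  --image mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.8-cudnn8-ubuntu22.04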
Example workflow
Below is a minimal script (train.py) that uses torch.distributed to train a simple CNN on MNIST across two nodes.
import os

import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torchvision import datasets, transforms


def setup(rank, world_size):
    # torchrun / Azure ML inject MASTER_ADDR and MASTER_PORT for multi-node jobs;
    # only fall back to localhost for single-node debugging runs.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "12355")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)


def cleanup():
    dist.destroy_process_group()


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 32, 3, 1)
        # 28x28 input -> 26x26 after the 3x3 conv: 32 * 26 * 26 = 21632 features
        self.fc = nn.Linear(32 * 26 * 26, 10)

    def forward(self, x):
        x = self.conv(x).relu()
        x = torch.flatten(x, 1)
        return self.fc(x)


def main():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    setup(rank, world_size)

    torch.manual_seed(0)
    # Pick the GPU by local rank: the global rank exceeds the per-node GPU count
    # as soon as more than one node is involved.
    device = torch.device(f"cuda:{local_rank}")

    model = Net().to(device)
    ddp_model = DDP(model, device_ids=[local_rank])

    transform = transforms.Compose([transforms.ToTensor()])
    dataset = datasets.MNIST(root=".", train=True, download=True, transform=transform)
    sampler = torch.utils.data.distributed.DistributedSampler(
        dataset, num_replicas=world_size, rank=rank
    )
    loader = torch.utils.data.DataLoader(dataset, batch_size=64, sampler=sampler)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.01)

    for epoch in range(5):
        sampler.set_epoch(epoch)
        for batch_idx, (data, target) in enumerate(loader):
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = ddp_model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            if batch_idx % 100 == 0 and rank == 0:
                print(f"Epoch {epoch} [{batch_idx}/{len(loader)}] Loss: {loss.item():.4f}")

    cleanup()


if __name__ == "__main__":
    main()
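Before submitting to the cluster, a quick single-process smoke test on a local machine can catch obvious errors (this assumes one visible GPU; switch the backend in setup() to gloo for a CPU-only box):

torchrun --standalone --nproc_per_node 1 train.py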
Submit the training job with the AML Python SDK (v2). When distribution is set to pytorch, AML injects MASTER_ADDR, MASTER_PORT, and NODE_RANK on every node, and the torchrun command below uses those values for its rendezvous:
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="YOUR_SUB_ID",
    resource_group_name="YOUR_RG",
    workspace_name="YOUR_WS",
)

job = command(
    code=".",  # folder that contains train.py
    command=(
        "torchrun --nnodes 2 --nproc_per_node 1 "
        "--node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT "
        "train.py"
    ),
    experiment_name="pytorch-distributed-mnist",
    display_name="PyTorch Distributed MNIST",
    description="Train MNIST with DDP on Azure ML",
    compute="pytorch-cluster",
    instance_count=2,  # two nodes, matching --nnodes above
    environment="pytorch-distributed:1",
    distribution={"type": "pytorch", "process_count_per_node": 1},
    environment_variables={"PYTHONUNBUFFERED": "1"},
)

returned_job = ml_client.jobs.create_or_update(job)
Note: Replace YOUR_SUB_ID, YOUR_RG, and YOUR_WS with your subscription ID, resource group, and workspace name.
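To follow the run while it executes, you can print the studio link and stream the logs from the job object returned above (a minimal sketch; jobs.stream() blocks until the run finishes):

print(returned_job.studio_url)            # link to the run in Azure ML studio
ml_client.jobs.stream(returned_job.name)  # tail the driver logs until completion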
Tips & best practices
- Use the nccl backend for GPU clusters; fall back to gloo for CPU-only runs (see the sketch after this list).
- Set torch.backends.cudnn.benchmark = True for faster convolutions when input sizes are fixed.
- Don't hardcode MASTER_ADDR and MASTER_PORT in multi-node jobs; AML injects them automatically.
- Use DistributedSampler so each worker sees a distinct shard of the data rather than duplicates.
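A minimal sketch of the backend selection mentioned in the first tip, so the same entry point works on both GPU and CPU clusters (init_distributed is an illustrative helper, not part of train.py above):

import torch
import torch.distributed as dist

def init_distributed(rank: int, world_size: int) -> str:
    # Prefer NCCL when CUDA devices are visible; fall back to Gloo for CPU-only runs.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    return backend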