Introduction to Azure Machine Learning Compute
Azure Machine Learning compute is a fully managed cloud service that allows you to easily provision and manage compute resources for your machine learning workflows. It integrates seamlessly with Azure Machine Learning workspaces, providing scalable, reliable, and cost-effective compute for training, batch inference, and real-time inference.
Choosing the right compute target is crucial for optimizing performance and managing costs. Azure Machine Learning offers several types of compute resources, each suited for different stages of the ML lifecycle.
Types of Compute Targets
Azure Machine Learning supports a variety of compute targets, including:
- Compute Clusters: Scalable clusters of VMs for training and batch inference.
- Compute Instances: Dedicated development workstations in the cloud for experimentation and testing.
- Inference Clusters: Kubernetes clusters for deploying ML models for real-time inference.
- Attached Compute: Integration with existing Azure compute resources like Azure HDInsight, Azure Databricks, or Azure Kubernetes Service (AKS).
Compute Clusters for Training and Batch Inference
Compute clusters are the go-to compute target for training machine learning models and running large-scale batch inference jobs. They offer the following benefits:
- Auto-scaling: Clusters automatically scale up or down based on job demand, ensuring you only pay for what you use.
- Variety of VM Sizes: Choose from a wide range of VM sizes, including CPU and GPU-enabled instances, to match your workload requirements.
- Node-exclusive Reservation: Ensure your training jobs have dedicated compute nodes for consistent performance.
Creating a Compute Cluster:
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

ws = Workspace.from_config()

# Choose a name for your cluster
cluster_name = "cpu-cluster"

try:
    # Reuse the cluster if it already exists in the workspace
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    # ComputeTarget.create expects a provisioning configuration
    # object, not a plain dictionary
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_DS3_V2",
        max_nodes=4
    )
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

print(f"Compute target {cluster_name} is ready.")
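Once provisioned, the cluster can be referenced as the compute target for a training run. A minimal sketch, assuming the `ws` and `compute_target` objects from the snippet above; the experiment name and the `train.py` script are placeholders:

```python
from azureml.core import Experiment, ScriptRunConfig, Workspace
from azureml.core.compute import ComputeTarget

ws = Workspace.from_config()
compute_target = ComputeTarget(workspace=ws, name="cpu-cluster")

# "train.py" is a placeholder for your training script in the
# current directory
src = ScriptRunConfig(
    source_directory=".",
    script="train.py",
    compute_target=compute_target,
)

run = Experiment(workspace=ws, name="demo-experiment").submit(src)
run.wait_for_completion(show_output=True)
```

The cluster scales up from zero nodes when the run is submitted and scales back down after the job finishes.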
Compute Instances for Development
Compute instances are fully managed cloud-based workstations designed for ML development. They provide a convenient and powerful environment for:
- Running experiments directly in the cloud.
- Developing and debugging code.
- Connecting to notebooks and IDEs.
Key features include:
- Pre-installed ML tools and libraries.
- Access to GPUs for accelerated development.
- Secure connection to your Azure ML workspace.
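A compute instance can also be created through the SDK. A hedged sketch, assuming the `azureml-core` package; the instance name `dev-instance` is a placeholder and must be unique within the Azure region:

```python
from azureml.core import Workspace
from azureml.core.compute import ComputeInstance, ComputeTarget

ws = Workspace.from_config()

# Single-node workstation; pick a GPU VM size for accelerated
# development if needed
instance_config = ComputeInstance.provisioning_configuration(
    vm_size="STANDARD_DS3_V2"
)
instance = ComputeTarget.create(ws, "dev-instance", instance_config)
instance.wait_for_completion(show_output=True)
```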
Inference Clusters for Deployment
For deploying your trained models to serve real-time predictions, inference clusters are the recommended target. Azure Machine Learning supports deployment to Azure Kubernetes Service (AKS) clusters, either provisioned through the workspace or attached from clusters you manage yourself.
These clusters are optimized for low-latency, high-throughput serving of your models.
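Provisioning an AKS inference cluster through the workspace can be sketched as follows; this assumes the `azureml-core` SDK and an existing `ws` workspace object, and the cluster name is a placeholder:

```python
from azureml.core import Workspace
from azureml.core.compute import AksCompute, ComputeTarget

ws = Workspace.from_config()

# Default configuration; node count and VM size can be customized
aks_config = AksCompute.provisioning_configuration()
aks_target = ComputeTarget.create(ws, "aks-inference", aks_config)
aks_target.wait_for_completion(show_output=True)
```

Once the cluster is ready, registered models can be deployed to it as web services for real-time scoring.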
Attached Compute Resources
You can also attach existing Azure compute resources to your Azure ML workspace, such as:
- Azure Databricks: For large-scale data engineering and distributed training.
- Azure HDInsight: For big data analytics.
- Azure Virtual Machines: For custom compute environments.
This allows you to leverage your existing Azure investments within Azure Machine Learning.
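As an illustration, attaching an Azure Databricks workspace can be sketched with the SDK. All names and the access token below are placeholders, and an existing `ws` workspace object is assumed:

```python
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, DatabricksCompute

ws = Workspace.from_config()

# Placeholder resource group, Databricks workspace name, and token
attach_config = DatabricksCompute.attach_configuration(
    resource_group="my-rg",
    workspace_name="my-databricks-ws",
    access_token="<databricks-access-token>",
)
databricks_target = ComputeTarget.attach(ws, "databricks-compute", attach_config)
databricks_target.wait_for_completion(show_output=True)
```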
Managing Compute Resources
You can manage all your compute targets through the Azure portal, the Azure ML SDK, or the Azure CLI.
- Monitoring: Track the status, usage, and costs of your compute resources.
- Scaling: Adjust the number of nodes in your compute clusters based on demand.
- Deletion: Remove compute resources when they are no longer needed.
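From the command line, these tasks map to the `az ml compute` commands (Azure CLI with the `ml` extension); the resource group, workspace, and compute names below are placeholders:

```shell
# List all compute targets in a workspace
az ml compute list --resource-group my-rg --workspace-name my-ws --output table

# Inspect one compute target's status and configuration
az ml compute show --name cpu-cluster \
    --resource-group my-rg --workspace-name my-ws

# Delete a compute target that is no longer needed
az ml compute delete --name cpu-cluster \
    --resource-group my-rg --workspace-name my-ws --yes
```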
Best Practices for Compute Management
- Right-sizing: Select VM sizes that match your workload's CPU, memory, and GPU requirements.
- Auto-scaling: Configure auto-scaling rules effectively for compute clusters to balance cost and performance.
- Idle Compute: Set policies to shut down compute instances or scale down clusters when idle to save costs.
- Region Selection: Deploy compute resources in the same Azure region as your data and other services to minimize latency.
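To see why idle scale-down matters for cost, a back-of-the-envelope comparison; the hourly rate below is hypothetical, not a published Azure price:

```python
def estimated_monthly_cost(hourly_rate, nodes, active_hours_per_day, days=30):
    """Rough monthly cost for a cluster that is billed only
    for active node-hours (i.e. it scales to zero when idle)."""
    return hourly_rate * nodes * active_hours_per_day * days

# Hypothetical rate for a STANDARD_DS3_V2-class VM
RATE = 0.27

always_on = estimated_monthly_cost(RATE, nodes=4, active_hours_per_day=24)
autoscaled = estimated_monthly_cost(RATE, nodes=4, active_hours_per_day=6)

print(f"Always-on:   ${always_on:.2f}/month")   # 4 nodes, 24h/day
print(f"Auto-scaled: ${autoscaled:.2f}/month")  # busy ~6h/day
```

Even under these rough assumptions, scaling to zero outside working hours cuts the bill by a factor of four, which is why configuring a minimum node count of zero is usually the right default for training clusters.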