Introduction to Azure Machine Learning Compute
Azure Machine Learning compute is a fully managed cloud service that allows you to easily provision and manage compute resources for your machine learning workflows. It integrates seamlessly with Azure Machine Learning workspaces, providing scalable, reliable, and cost-effective compute for training, batch inference, and real-time inference.
Choosing the right compute target is crucial for optimizing performance and managing costs. Azure Machine Learning offers several types of compute resources, each suited for different stages of the ML lifecycle.
Types of Compute Targets
Azure Machine Learning supports a variety of compute targets, including:
- Compute Clusters: Scalable clusters of VMs for training and batch inference.
- Compute Instances: Dedicated development workstations in the cloud for experimentation and testing.
- Inference Clusters: Kubernetes clusters for deploying ML models for real-time inference.
- Attached Compute: Integration with existing Azure compute resources like Azure HDInsight, Azure Databricks, or Azure Kubernetes Service (AKS).
Compute Clusters for Training and Batch Inference
Compute clusters are the go-to compute target for training machine learning models and running large-scale batch inference jobs. They offer the following benefits:
- Auto-scaling: Clusters automatically scale up or down based on job demand, ensuring you only pay for what you use.
- Variety of VM Sizes: Choose from a wide range of VM sizes, including CPU and GPU-enabled instances, to match your workload requirements.
- Node-exclusive Reservation: Ensure your training jobs have dedicated compute nodes for consistent performance.
Creating a Compute Cluster:
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

ws = Workspace.from_config()

# Choose a name for your cluster
cluster_name = "cpu-cluster"

try:
    # Reuse the cluster if it already exists in the workspace
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    # ComputeTarget.create expects a provisioning configuration
    # object, not a plain dictionary
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_DS3_V2",
        max_nodes=4
    )
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

print(f"Compute target {cluster_name} is ready.")
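Once provisioned, the cluster can be referenced as the compute target for a training run. A minimal sketch, assuming the `ws` and `compute_target` objects from the snippet above; the experiment name and the `train.py` script are placeholders:

```python
from azureml.core import Experiment, ScriptRunConfig, Workspace
from azureml.core.compute import ComputeTarget

ws = Workspace.from_config()
compute_target = ComputeTarget(workspace=ws, name="cpu-cluster")

# "train.py" is a placeholder for your training script in the
# current directory
src = ScriptRunConfig(
    source_directory=".",
    script="train.py",
    compute_target=compute_target,
)

run = Experiment(workspace=ws, name="demo-experiment").submit(src)
run.wait_for_completion(show_output=True)
```

The cluster scales up from zero nodes when the run is submitted and scales back down after the job finishes.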
Compute Instances for Development
Compute instances are fully managed cloud-based workstations designed for ML development. They provide a convenient and powerful environment for:
- Running experiments directly in the cloud.
- Developing and debugging code.
- Connecting to notebooks and IDEs.
Key features include:
- Pre-installed ML tools and libraries.
- Access to GPUs for accelerated development.
- Secure connection to your Azure ML workspace.
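A compute instance can also be created through the SDK. A hedged sketch, assuming the `azureml-core` package; the instance name `dev-instance` is a placeholder and must be unique within the Azure region:

```python
from azureml.core import Workspace
from azureml.core.compute import ComputeInstance, ComputeTarget

ws = Workspace.from_config()

# Single-node workstation; pick a GPU VM size for accelerated
# development if needed
instance_config = ComputeInstance.provisioning_configuration(
    vm_size="STANDARD_DS3_V2"
)
instance = ComputeTarget.create(ws, "dev-instance", instance_config)
instance.wait_for_completion(show_output=True)
```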
Inference Clusters for Deployment
For deploying your trained models to serve real-time predictions, inference clusters are the recommended target. Azure Machine Learning supports deployment to Azure Kubernetes Service (AKS) clusters, either provisioned through the workspace or attached from clusters you manage yourself.
These clusters are optimized for low-latency, high-throughput serving of your models.
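Provisioning an AKS inference cluster through the workspace can be sketched as follows; this assumes the `azureml-core` SDK and an existing `ws` workspace object, and the cluster name is a placeholder:

```python
from azureml.core import Workspace
from azureml.core.compute import AksCompute, ComputeTarget

ws = Workspace.from_config()

# Default configuration; node count and VM size can be customized
aks_config = AksCompute.provisioning_configuration()
aks_target = ComputeTarget.create(ws, "aks-inference", aks_config)
aks_target.wait_for_completion(show_output=True)
```

Once the cluster is ready, registered models can be deployed to it as web services for real-time scoring.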
Attached Compute Resources
You can also attach existing Azure compute resources to your Azure ML workspace, such as:
- Azure Databricks: For large-scale data engineering and distributed training.
- Azure HDInsight: For big data analytics.
- Azure Virtual Machines: For custom compute environments.
This allows you to leverage your existing Azure investments within Azure Machine Learning.
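As an illustration, attaching an Azure Databricks workspace can be sketched with the SDK. All names and the access token below are placeholders, and an existing `ws` workspace object is assumed:

```python
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, DatabricksCompute

ws = Workspace.from_config()

# Placeholder resource group, Databricks workspace name, and token
attach_config = DatabricksCompute.attach_configuration(
    resource_group="my-rg",
    workspace_name="my-databricks-ws",
    access_token="<databricks-access-token>",
)
databricks_target = ComputeTarget.attach(ws, "databricks-compute", attach_config)
databricks_target.wait_for_completion(show_output=True)
```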
Managing Compute Resources
You can manage all your compute targets through the Azure portal, the Azure ML SDK, or the Azure CLI.
- Monitoring: Track the status, usage, and costs of your compute resources.
- Scaling: Adjust the number of nodes in your compute clusters based on demand.
- Deletion: Remove compute resources when they are no longer needed.
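From the command line, these tasks map to the `az ml compute` commands (Azure CLI with the `ml` extension); the resource group, workspace, and compute names below are placeholders:

```shell
# List all compute targets in a workspace
az ml compute list --resource-group my-rg --workspace-name my-ws --output table

# Inspect one compute target's status and configuration
az ml compute show --name cpu-cluster \
    --resource-group my-rg --workspace-name my-ws

# Delete a compute target that is no longer needed
az ml compute delete --name cpu-cluster \
    --resource-group my-rg --workspace-name my-ws --yes
```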
Best Practices for Compute Management
- Right-sizing: Select VM sizes that match your workload's CPU, memory, and GPU requirements.
- Auto-scaling: Configure auto-scaling rules effectively for compute clusters to balance cost and performance.
- Idle Compute: Set policies to shut down compute instances or scale down clusters when idle to save costs.
- Region Selection: Deploy compute resources in the same Azure region as your data and other services to minimize latency.
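To see why idle scale-down matters for cost, a back-of-the-envelope comparison; the hourly rate below is hypothetical, not a published Azure price:

```python
def estimated_monthly_cost(hourly_rate, nodes, active_hours_per_day, days=30):
    """Rough monthly cost for a cluster that is billed only
    for active node-hours (i.e. it scales to zero when idle)."""
    return hourly_rate * nodes * active_hours_per_day * days

# Hypothetical rate for a STANDARD_DS3_V2-class VM
RATE = 0.27

always_on = estimated_monthly_cost(RATE, nodes=4, active_hours_per_day=24)
autoscaled = estimated_monthly_cost(RATE, nodes=4, active_hours_per_day=6)

print(f"Always-on:   ${always_on:.2f}/month")   # 4 nodes, 24h/day
print(f"Auto-scaled: ${autoscaled:.2f}/month")  # busy ~6h/day
```

Even under these rough assumptions, scaling to zero outside working hours cuts the bill by a factor of four, which is why configuring a minimum node count of zero is usually the right default for training clusters.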