Scaling Azure Batch Workloads

Understand how to effectively scale your applications with Azure Batch, ensuring optimal performance and cost efficiency.

Introduction to Batch Scaling

Azure Batch is designed to run large-scale parallel and high-performance computing (HPC) applications efficiently. A key aspect of achieving this efficiency is through effective scaling, which involves adjusting the number of compute resources (virtual machines) allocated to your Batch pool based on the workload demands.

Batch offers two primary scaling modes:

Manual Scaling: You explicitly set the number of compute nodes in a pool.
Automatic Scaling: Batch periodically adjusts the node count by evaluating an autoscale formula that you define against task and resource metrics.

Manual Scaling

Manual scaling provides direct control over your compute resources. You can increase or decrease the number of nodes in a pool at any time. This is often used for predictable workloads or when you need precise resource management.

When to Use Manual Scaling:

Predictable job durations and resource requirements.
Testing and development scenarios where fine-grained control is essential.
Workloads with strict budget constraints that require manual oversight.

How to Implement Manual Scaling:

You can adjust the node count for a pool using the Azure portal, Azure CLI, PowerShell, or the Batch Management SDKs.

# Example using Azure CLI
az batch pool resize --pool-id <your-pool-id> --target-dedicated-nodes <number-of-nodes>

Automatic Scaling

Automatic scaling lets Batch manage the number of compute nodes in a pool by periodically evaluating an autoscale formula against metrics that the service collects. This is ideal for dynamic or unpredictable workloads where the demand for compute resources fluctuates.

Key Components of Automatic Scaling:

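Scaling Formula: An expression, written in the Batch autoscale formula language, that computes the target number of nodes (for example, $TargetDedicatedNodes) from specific metrics.
Metrics: Service-defined variables that Batch samples for the pool, such as the number of pending tasks ($PendingTasks), the current node count ($CurrentDedicatedNodes), or CPU usage ($CPUPercent).
Evaluation Interval: How often Batch evaluates the scaling formula. The default is 15 minutes; the minimum is 5 minutes and the maximum is 168 hours.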
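
The evaluation interval is passed to Batch as an ISO 8601 duration such as PT5M. The following is a minimal sketch of a helper that checks an interval and formats it, assuming the documented bounds of 5 minutes and 168 hours; the function name to_iso8601 and the constants are hypothetical and not part of any Azure SDK.

```python
from datetime import timedelta

# Assumed bounds for the Batch autoscale evaluation interval.
MIN_INTERVAL = timedelta(minutes=5)
MAX_INTERVAL = timedelta(hours=168)


def to_iso8601(interval: timedelta) -> str:
    """Validate an evaluation interval and format it as an ISO 8601 duration."""
    if not MIN_INTERVAL <= interval <= MAX_INTERVAL:
        raise ValueError("interval must be between 5 minutes and 168 hours")
    total_minutes, seconds = divmod(int(interval.total_seconds()), 60)
    hours, minutes = divmod(total_minutes, 60)
    text = "PT"
    if hours:
        text += f"{hours}H"
    if minutes:
        text += f"{minutes}M"
    if seconds:
        text += f"{seconds}S"
    return text


print(to_iso8601(timedelta(minutes=5)))   # PT5M
print(to_iso8601(timedelta(minutes=90)))  # PT1H30M
```

Rejecting out-of-range values before calling the service gives a clearer error than a failed pool update.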

Common Scaling Formulas:

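A typical automatic scaling formula uses the number of pending tasks and the number of task slots per node to determine the target node count. For example:

$TargetDedicatedNodes = ceil(max(0, $PendingTasks.GetSample(1)) / $TaskSlotsPerNode);

$PendingTasks is the sum of active (queued) and running tasks, so this formula aims to provision enough nodes to run every queued and running task, given the number of task slots on each node. GetSample(1) reads only the most recent sample; formulas for production pools often average samples over a time window to smooth out spikes. You can also incorporate CPU load, for example by growing the pool by 10 percent when CPU usage over the last 10 minutes stayed above 70 percent (confirm that $CPUPercent is still listed among the supported service-defined variables before relying on it):

$TargetDedicatedNodes = (min($CPUPercent.GetSample(TimeInterval_Minute * 10)) > 0.7) ? ($CurrentDedicatedNodes * 1.1) : $CurrentDedicatedNodes;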
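
The task-count formula above divides the number of outstanding tasks by the per-node task capacity and rounds up. The arithmetic can be checked offline with a small sketch like the one below; the function name and the sample numbers are illustrative assumptions, and the code only mirrors the calculation, it does not call Batch.

```python
import math


def target_dedicated_nodes(pending_tasks: int, task_slots_per_node: int) -> int:
    """Return the node count needed so every pending task gets a slot."""
    if task_slots_per_node < 1:
        raise ValueError("task_slots_per_node must be at least 1")
    # Negative sample values are treated as zero, then rounded up to whole nodes.
    return math.ceil(max(0, pending_tasks) / task_slots_per_node)


# 37 outstanding tasks at 4 slots per node need 9.25 nodes, rounded up to 10.
print(target_dedicated_nodes(pending_tasks=37, task_slots_per_node=4))  # 10
```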

Configuring Automatic Scaling:

Automatic scaling is configured by creating or updating a pool with an autoscale formula and specifying the evaluation interval.

# Example using Azure CLI
# Single quotes keep the shell from expanding the $ variables inside the formula.
az batch pool autoscale enable --pool-id <your-pool-id> --auto-scale-formula '$TargetDedicatedNodes = ceil(max(0, $PendingTasks.GetSample(1)) / $TaskSlotsPerNode);' --auto-scale-evaluation-interval "PT5M"
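
Because autoscale formulas contain $ characters, quoting matters when the command is generated by a script. The sketch below builds the command line with Python's shlex module so that the formula reaches the CLI unmodified. It assumes the az batch pool autoscale enable subcommand with the --auto-scale-formula and --auto-scale-evaluation-interval options (check az batch pool autoscale enable --help for your CLI version); the helper name and the pool ID "my-pool" are hypothetical.

```python
import shlex


def autoscale_enable_command(pool_id: str, formula: str, interval: str = "PT5M") -> str:
    """Build a shell-safe command line that enables autoscale on a pool."""
    args = [
        "az", "batch", "pool", "autoscale", "enable",
        "--pool-id", pool_id,
        "--auto-scale-formula", formula,
        "--auto-scale-evaluation-interval", interval,
    ]
    # shlex.join single-quotes any argument containing shell metacharacters.
    return shlex.join(args)


formula = "$TargetDedicatedNodes = ceil(max(0, $PendingTasks.GetSample(1)) / $TaskSlotsPerNode);"
print(autoscale_enable_command("my-pool", formula))
```

Passing the argument list directly to subprocess.run avoids the shell entirely and is usually the safer choice; the joined string is mainly useful for logging or for pasting into a terminal.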

Best Practices for Scaling

Tip: Start with automatic scaling for most workloads. Monitor its performance and adjust the scaling formula as needed.

Understand Your Workload: Analyze your application's resource requirements and how they vary over time.
Choose Appropriate Metrics: Select metrics that accurately reflect your workload's demand for compute resources.
Set Realistic Limits: Bound the node count inside your formula, for example by wrapping the computed value in min() and max(), to control costs and prevent over-provisioning.
Monitor and Tune: Regularly review your scaling performance and adjust formulas and intervals to optimize resource utilization and costs.
Consider Spot Instances: For fault-tolerant workloads, consider using Spot or low-priority VMs, which an autoscale formula controls through $TargetLowPriorityNodes, for significant cost savings on compute.
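
As an illustration of setting limits, the following sketch generates an autoscale formula whose result is clamped between a minimum and a maximum node count. The helper name bounded_formula and the intermediate variable $needed are hypothetical; the formula itself relies on the service-defined variables $PendingTasks and $TaskSlotsPerNode and on the min, max, and ceil functions of the autoscale formula language.

```python
def bounded_formula(min_nodes: int, max_nodes: int) -> str:
    """Return an autoscale formula clamped to [min_nodes, max_nodes]."""
    if not 0 <= min_nodes <= max_nodes:
        raise ValueError("require 0 <= min_nodes <= max_nodes")
    return (
        # Nodes needed to give every pending task a slot.
        "$needed = ceil(max(0, $PendingTasks.GetSample(1)) / $TaskSlotsPerNode);\n"
        # Clamp the result so the pool never leaves the allowed range.
        f"$TargetDedicatedNodes = max({min_nodes}, min({max_nodes}, $needed));"
    )


print(bounded_formula(min_nodes=2, max_nodes=50))
```

Keeping the bounds as parameters makes it easy to tighten the maximum for a test pool without editing the formula text by hand.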

Scaling with Task Dependencies

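When your tasks have dependencies, Batch manages their execution order. When sizing a pool for such jobs, check how your scaling formula treats tasks that are still waiting for prerequisites: the service-defined $ActiveTasks variable counts only tasks whose dependencies are already satisfied, so a formula built on it sizes the pool for work that can run now rather than for the whole dependency graph.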

Example Scenario: Image Processing

Imagine an image processing application that receives jobs in batches. The number of incoming images can vary greatly.

Configuration:

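Pool: Configured with automatic scaling.
Metric: Number of pending (queued plus running) tasks, read from $PendingTasks.
Scaling Formula: $TargetDedicatedNodes = max(2, min(50, ceil(max(0, $PendingTasks.GetSample(1)) / $TaskSlotsPerNode)));
Maximum Nodes: 50 (enforced by min() in the formula)
Minimum Nodes: 2 (enforced by max() in the formula)
Evaluation Interval: 5 minutes (the shortest interval Batch allows)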

With this setup, the pool will automatically scale up as more images are added to the queue and scale down as tasks are completed, ensuring efficient resource usage.
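
To see how such a configuration behaves, the sketch below replays a made-up sequence of queue lengths through the same arithmetic: tasks divided by slots per node, rounded up, then clamped to the range 2 to 50. The queue lengths and the value of 4 task slots per node are assumptions chosen for illustration; each list entry stands for one formula evaluation.

```python
import math


def simulate_targets(queue_lengths, slots_per_node=4, min_nodes=2, max_nodes=50):
    """Return the target node count for each observed queue length."""
    targets = []
    for pending in queue_lengths:
        needed = math.ceil(max(0, pending) / slots_per_node)
        # Clamp to the configured minimum and maximum pool size.
        targets.append(max(min_nodes, min(max_nodes, needed)))
    return targets


# Queue grows as images arrive, then drains as tasks complete.
print(simulate_targets([0, 10, 120, 400, 30, 0]))  # [2, 3, 30, 50, 8, 2]
```

The pool stays at the 2-node floor when idle, tracks demand in between, and is capped at 50 nodes even when 400 tasks would otherwise call for 100.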

Proper scaling is critical for maximizing the ROI of your Azure Batch deployments. It ensures that your applications have the necessary compute power when needed, without incurring unnecessary costs.
