# Azure Databricks Compute Reference
This document provides detailed information about compute resources and configurations available within Azure Databricks.
## Databricks Compute Options
Azure Databricks offers flexible compute options to meet the diverse needs of big data analytics and machine learning workloads. You can choose between pre-defined instance types or customize them to optimize for cost, performance, and specific task requirements.
## Virtual Machine (VM) Instance Types
Databricks clusters run on Azure Virtual Machines. The choice of VM instance type significantly impacts performance and cost. Here are some common categories:
### General Purpose Instances
Balanced CPU-to-memory ratio, suitable for most workloads.
| VM Series | vCPUs | Memory (GiB) | Temp Storage (GiB) | Recommended Use Cases |
|---|---|---|---|---|
| Dsv3-series | 2 - 64 | 8 - 256 | - | General web servers, small to medium databases |
| Dsv4-series | 2 - 80 | 8 - 320 | - | Enhanced performance for general purpose |
### Memory Optimized Instances
High memory-to-CPU ratio, ideal for in-memory analytics, large caches, and Big Data processing.
| VM Series | vCPUs | Memory (GiB) | Temp Storage (GiB) | Recommended Use Cases |
|---|---|---|---|---|
| Esv3-series | 2 - 64 | 16 - 432 | - | In-memory analytics, large caches |
| Esv4-series | 2 - 80 | 16 - 640 | - | Enhanced performance for memory-intensive tasks |
### Compute Optimized Instances
High CPU-to-memory ratio, suitable for CPU-bound applications, batch processing, and HPC.
| VM Series | vCPUs | Memory (GiB) | Temp Storage (GiB) | Recommended Use Cases |
|---|---|---|---|---|
| Fsv2-series | 2 - 64 | 4 - 224 | - | Web servers, batch processing, HPC |
### Storage Optimized Instances
High disk throughput and IOPS, suitable for Big Data analytics that require extensive local storage.
| VM Series | vCPUs | Memory (GiB) | Temp Storage (GiB) | Recommended Use Cases |
|---|---|---|---|---|
| Lsv2-series | 4 - 80 | 8 - 640 | Up to 1900 | Big Data analytics, large NoSQL databases |
## Cluster Configuration Options
When creating an Azure Databricks cluster, you can configure various settings to tailor it to your needs:
### Cluster Mode
- Standard: The default mode, recommended for single-user workloads; a Standard cluster has a driver node and one or more worker nodes. (Driver-only clusters use the separate Single Node mode.)
- High Concurrency: Designed for workloads with many users and many concurrent queries, optimizing for multi-tenancy and query latency.
### Autoscaling
Enable autoscaling to automatically adjust the number of worker nodes in your cluster based on workload demands. This helps optimize costs by scaling down when idle and scaling up when busy.
Key settings:
- Min Workers: The minimum number of worker nodes.
- Max Workers: The maximum number of worker nodes.
### Autotermination
Configure autotermination to shut down your cluster after a specified period of inactivity. This prevents incurring costs for idle clusters.
- Termination Minutes: The time in minutes after which the cluster will terminate if idle.
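To get a feel for the impact, here is a back-of-the-envelope estimate of the cost autotermination avoids; the hourly rate and idle time are purely illustrative assumptions, not real Azure prices:

```python
# Illustrative only: cost avoided by autotermination, assuming a hypothetical
# $2.50/hour all-in cluster cost and 4 idle hours/day without autotermination.
hourly_cost = 2.50          # assumed cluster cost per hour (USD)
idle_hours_per_day = 4      # assumed daily idle time without autotermination
days_per_month = 30

monthly_savings = hourly_cost * idle_hours_per_day * days_per_month
print(f"Estimated monthly savings: ${monthly_savings:.2f}")
# → Estimated monthly savings: $300.00
```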
### Spark Version & Runtime
Select the appropriate Apache Spark version and Databricks Runtime (DBR). DBR includes optimized Spark, ML libraries, and other components.
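In the Clusters API, the runtime is identified by a version string such as `11.3.x-scala2.12`, which encodes both the DBR version and the Scala version it was built against. A small sketch of pulling those parts apart:

```python
# Sketch: split a Databricks Runtime version string (as used by the
# Clusters API) into its runtime and Scala components.
def parse_spark_version(spark_version: str) -> dict:
    runtime, _, scala = spark_version.partition("-scala")
    return {"runtime": runtime, "scala": scala}

print(parse_spark_version("11.3.x-scala2.12"))
# → {'runtime': '11.3.x', 'scala': '2.12'}
```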
### Worker Type and Driver Type
Choose the VM instance type for your worker nodes and the driver node. They can be the same or different based on workload characteristics.
### Spot Instances
Utilize Azure Spot Instances to leverage unused Azure capacity at a significant discount. Spot instances can be evicted when Azure reclaims capacity; Databricks can be configured to fall back to on-demand instances when that happens, so workloads keep running.
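In a cluster spec, spot usage is controlled through `azure_attributes`. The fragment below is a hedged sketch (field names per the Databricks Clusters API; verify values against your workspace's API version):

```python
# Sketch of the `azure_attributes` fragment for a cluster that uses spot
# capacity with fallback to on-demand instances.
azure_attributes = {
    "availability": "SPOT_WITH_FALLBACK_AZURE",  # fall back if spot is evicted
    "first_on_demand": 1,      # keep the first node (the driver) on-demand
    "spot_bid_max_price": -1,  # -1 = pay up to the current on-demand price
}
print(azure_attributes["availability"])
# → SPOT_WITH_FALLBACK_AZURE
```

Keeping the driver on-demand (`first_on_demand: 1`) is a common choice, since losing the driver terminates the whole cluster while a lost worker can be replaced.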
### Example Cluster Configuration (JSON)
Here's a simplified example of a cluster definition as accepted by the Databricks Clusters API (field names per that API; verify exact values against your workspace):

```json
{
  "cluster_name": "my-databricks-cluster",
  "spark_version": "11.3.x-scala2.12",
  "azure_attributes": {
    "availability": "ON_DEMAND_AZURE",
    "first_on_demand": 1
  },
  "spark_conf": {
    "spark.databricks.delta.preview.enabled": "true"
  },
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  },
  "autotermination_minutes": 60,
  "driver_node_type_id": "Standard_D8s_v4",
  "node_type_id": "Standard_E8s_v4"
}
```

This configuration defines a cluster named "my-databricks-cluster" on a specific Databricks Runtime version, using on-demand Azure instances (`azure_attributes.availability` set to `ON_DEMAND_AZURE`; on Azure there is no `aws_attributes` block, which belongs to AWS deployments), with autoscaling enabled between 2 and 8 workers, autotermination after 60 minutes of inactivity, and separate driver (`Standard_D8s_v4`) and worker (`Standard_E8s_v4`) node types. Note that `autoscale` and a fixed `num_workers` are mutually exclusive; specify one or the other.
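To actually create a cluster from such a spec, you POST it to the Clusters API. The sketch below uses only the Python standard library; the workspace URL and token are placeholders, and the request is constructed but not sent:

```python
import json
import urllib.request

# Hedged sketch: build a request against the Databricks Clusters API
# (POST /api/2.0/clusters/create). URL and token below are placeholders.
WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "dapi-REDACTED"  # placeholder personal access token

cluster_spec = {
    "cluster_name": "my-databricks-cluster",
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "Standard_E8s_v4",
    "driver_node_type_id": "Standard_D8s_v4",
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 60,
}

request = urllib.request.Request(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    data=json.dumps(cluster_spec).encode(),
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    },
)
# urllib.request.urlopen(request) would submit the spec; on success the API
# responds with a JSON body containing the new cluster's "cluster_id".
```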
For more in-depth information on specific instance types, pricing, and advanced configurations, please refer to the official Azure VM Pricing and Databricks Cluster Management Guide.