# Azure Databricks Compute Reference
This document provides detailed information about compute resources and configurations available within Azure Databricks.
## Databricks Compute Options
Azure Databricks offers flexible compute options to meet the diverse needs of big data analytics and machine learning workloads. You can choose between pre-defined instance types or customize them to optimize for cost, performance, and specific task requirements.
## Virtual Machine (VM) Instance Types
Databricks clusters run on Azure Virtual Machines. The choice of VM instance type significantly impacts performance and cost. Here are some common categories:
### General Purpose Instances
Balanced CPU-to-memory ratio, suitable for most workloads.
| VM Series | vCPUs | Memory (GiB) | Temp Storage (GiB) | Recommended Use Cases |
|---|---|---|---|---|
| Dsv3-series | 2 - 64 | 8 - 256 | - | General web servers, small to medium databases |
| Dsv4-series | 2 - 80 | 8 - 320 | - | Enhanced performance for general purpose |
### Memory Optimized Instances
High memory-to-CPU ratio, ideal for in-memory analytics, large caches, and Big Data processing.
| VM Series | vCPUs | Memory (GiB) | Temp Storage (GiB) | Recommended Use Cases |
|---|---|---|---|---|
| Esv3-series | 2 - 64 | 16 - 432 | - | In-memory analytics, large caches |
| Esv4-series | 2 - 80 | 16 - 640 | - | Enhanced performance for memory-intensive tasks |
### Compute Optimized Instances
High CPU-to-memory ratio, suitable for CPU-bound applications, batch processing, and HPC.
| VM Series | vCPUs | Memory (GiB) | Temp Storage (GiB) | Recommended Use Cases |
|---|---|---|---|---|
| Fsv2-series | 2 - 64 | 4 - 224 | - | Web servers, batch processing, HPC |
### Storage Optimized Instances
High disk throughput and IOPS, suitable for Big Data analytics that require extensive local storage.
| VM Series | vCPUs | Memory (GiB) | Temp Storage (GiB) | Recommended Use Cases |
|---|---|---|---|---|
| Lsv2-series | 4 - 80 | 8 - 640 | Up to 1900 | Big Data analytics, large NoSQL databases |
## Cluster Configuration Options
When creating an Azure Databricks cluster, you can configure various settings to tailor it to your needs:
### Cluster Mode
- Standard: The default mode, recommended for single-user workloads; a Standard cluster has a driver node and one or more worker nodes. (Driver-only clusters use the separate Single Node mode.)
- High Concurrency: Designed for workloads with many users and many concurrent queries, optimizing for multi-tenancy and query latency.
### Autoscaling
Enable autoscaling to automatically adjust the number of worker nodes in your cluster based on workload demands. This helps optimize costs by scaling down when idle and scaling up when busy.
Key settings:
- Min Workers: The minimum number of worker nodes.
- Max Workers: The maximum number of worker nodes.
### Autotermination
Configure autotermination to shut down your cluster after a specified period of inactivity. This prevents incurring costs for idle clusters.
- Termination Minutes: The time in minutes after which the cluster will terminate if idle.
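To get a feel for the impact, here is a back-of-the-envelope estimate of the cost autotermination avoids; the hourly rate and idle time are purely illustrative assumptions, not real Azure prices:

```python
# Illustrative only: cost avoided by autotermination, assuming a hypothetical
# $2.50/hour all-in cluster cost and 4 idle hours/day without autotermination.
hourly_cost = 2.50          # assumed cluster cost per hour (USD)
idle_hours_per_day = 4      # assumed daily idle time without autotermination
days_per_month = 30

monthly_savings = hourly_cost * idle_hours_per_day * days_per_month
print(f"Estimated monthly savings: ${monthly_savings:.2f}")
# → Estimated monthly savings: $300.00
```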
### Spark Version & Runtime
Select the appropriate Apache Spark version and Databricks Runtime (DBR). DBR includes optimized Spark, ML libraries, and other components.
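In the Clusters API, the runtime is identified by a version string such as `11.3.x-scala2.12`, which encodes both the DBR version and the Scala version it was built against. A small sketch of pulling those parts apart:

```python
# Sketch: split a Databricks Runtime version string (as used by the
# Clusters API) into its runtime and Scala components.
def parse_spark_version(spark_version: str) -> dict:
    runtime, _, scala = spark_version.partition("-scala")
    return {"runtime": runtime, "scala": scala}

print(parse_spark_version("11.3.x-scala2.12"))
# → {'runtime': '11.3.x', 'scala': '2.12'}
```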
### Worker Type and Driver Type
Choose the VM instance type for your worker nodes and the driver node. They can be the same or different based on workload characteristics.
### Spot Instances
Utilize Azure Spot Instances to leverage unused Azure capacity at a significant discount. Spot instances can be evicted when Azure reclaims capacity; Databricks can be configured to fall back to on-demand instances when that happens, so workloads keep running.
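In a cluster spec, spot usage is controlled through `azure_attributes`. The fragment below is a hedged sketch (field names per the Databricks Clusters API; verify values against your workspace's API version):

```python
# Sketch of the `azure_attributes` fragment for a cluster that uses spot
# capacity with fallback to on-demand instances.
azure_attributes = {
    "availability": "SPOT_WITH_FALLBACK_AZURE",  # fall back if spot is evicted
    "first_on_demand": 1,      # keep the first node (the driver) on-demand
    "spot_bid_max_price": -1,  # -1 = pay up to the current on-demand price
}
print(azure_attributes["availability"])
# → SPOT_WITH_FALLBACK_AZURE
```

Keeping the driver on-demand (`first_on_demand: 1`) is a common choice, since losing the driver terminates the whole cluster while a lost worker can be replaced.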
### Example Cluster Configuration (JSON)
Here's a simplified example of a cluster definition as accepted by the Databricks Clusters API (field names per that API; verify exact values against your workspace):

```json
{
  "cluster_name": "my-databricks-cluster",
  "spark_version": "11.3.x-scala2.12",
  "azure_attributes": {
    "availability": "ON_DEMAND_AZURE",
    "first_on_demand": 1
  },
  "spark_conf": {
    "spark.databricks.delta.preview.enabled": "true"
  },
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  },
  "autotermination_minutes": 60,
  "driver_node_type_id": "Standard_D8s_v4",
  "node_type_id": "Standard_E8s_v4"
}
```

This configuration defines a cluster named "my-databricks-cluster" on a specific Databricks Runtime version, using on-demand Azure instances (`azure_attributes.availability` set to `ON_DEMAND_AZURE`; on Azure there is no `aws_attributes` block, which belongs to AWS deployments), with autoscaling enabled between 2 and 8 workers, autotermination after 60 minutes of inactivity, and separate driver (`Standard_D8s_v4`) and worker (`Standard_E8s_v4`) node types. Note that `autoscale` and a fixed `num_workers` are mutually exclusive; specify one or the other.
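To actually create a cluster from such a spec, you POST it to the Clusters API. The sketch below uses only the Python standard library; the workspace URL and token are placeholders, and the request is constructed but not sent:

```python
import json
import urllib.request

# Hedged sketch: build a request against the Databricks Clusters API
# (POST /api/2.0/clusters/create). URL and token below are placeholders.
WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "dapi-REDACTED"  # placeholder personal access token

cluster_spec = {
    "cluster_name": "my-databricks-cluster",
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "Standard_E8s_v4",
    "driver_node_type_id": "Standard_D8s_v4",
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 60,
}

request = urllib.request.Request(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    data=json.dumps(cluster_spec).encode(),
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    },
)
# urllib.request.urlopen(request) would submit the spec; on success the API
# responds with a JSON body containing the new cluster's "cluster_id".
```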
For more in-depth information on specific instance types, pricing, and advanced configurations, please refer to the official Azure VM Pricing and Databricks Cluster Management Guide.