Introduction to Azure Databricks Security

Azure Databricks is a cloud-based platform built on Apache Spark. Security is a fundamental aspect of managing your data and analytics workloads. This document provides an overview of the security features and best practices available within Azure Databricks to protect your sensitive data and ensure compliance.

Key Security Pillars

Azure Databricks security is built around several key pillars:

  • Network Security: Controlling access to your Databricks environment from your virtual network.
  • Data Protection: Securing data at rest and in transit.
  • Identity and Access Management: Managing user access and permissions.
  • Compliance and Governance: Meeting regulatory requirements and maintaining audit trails.
  • Compute Security: Securing the compute clusters that process your data.

Network Security

Azure Databricks offers robust network isolation capabilities:

  • Virtual Network Injection: Deploy your Databricks workspace into your own Azure Virtual Network (VNet). This allows you to control network traffic using Network Security Groups (NSGs) and User Defined Routes (UDRs).
  • Private Link: Use Azure Private Link to reach your Databricks workspace over private endpoints, keeping traffic between users, clusters, and the Databricks control plane off the public internet.
  • IP Access Lists: Restrict access to the workspace UI and REST APIs to approved IP addresses or CIDR ranges.
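As a hedged sketch of the last point, the snippet below creates an allow-list entry through the workspace's REST API (the /api/2.0/ip-access-lists endpoint). The workspace URL, token source, and CIDR range are placeholders, and IP access lists must first be enabled for the workspace.

```python
import os
import requests

# Placeholder workspace URL; IP access lists must already be enabled for
# the workspace before entries can be added.
workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"
token = os.environ["DATABRICKS_TOKEN"]  # e.g. a personal access token

resp = requests.post(
    f"{workspace_url}/api/2.0/ip-access-lists",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "label": "corp-office",              # descriptive name for the rule
        "list_type": "ALLOW",                # or "BLOCK"
        "ip_addresses": ["203.0.113.0/24"],  # example CIDR range
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```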

Data Protection

Protecting your data is paramount:

  • Encryption at Rest: Data stored in Azure Data Lake Storage Gen2 or Azure Blob Storage can be encrypted using Microsoft-managed keys or customer-managed keys (CMK).
  • Encryption in Transit: Traffic between users, the Databricks control plane, clusters, and external services is encrypted using TLS.
  • Azure Key Vault Integration: Securely manage your secrets and encryption keys using Azure Key Vault.
Tip: Always use Azure Key Vault to manage sensitive information like database credentials and API keys used by your Databricks jobs.
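As a minimal sketch of the tip above, the snippet below reads a database password from a Key Vault-backed secret scope inside a notebook (where spark and dbutils are predefined); the scope name, key name, server, and table are illustrative placeholders.

```python
# Fetch the credential from the secret scope at runtime; Databricks redacts
# secret values in notebook output, so the password never appears in results.
password = dbutils.secrets.get(scope="kv-scope", key="sql-password")

jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=sales"

orders_df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.orders")
    .option("user", "etl_user")
    .option("password", password)  # never hardcode this literal in the notebook
    .load()
)
```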

Identity and Access Management (IAM)

Fine-grained access control is crucial for managing who can do what:

  • Microsoft Entra ID (formerly Azure Active Directory) Integration: Integrate your Databricks workspace with Microsoft Entra ID for single sign-on (SSO) and centralized user management.
  • Workspace Access Control: Control which users and groups can view, edit, and run notebooks, folders, and jobs.
  • Cluster Access Control: Define which users or groups can create, manage, and attach to specific clusters.
  • Table Access Control: Use Unity Catalog (or legacy table ACLs on the Hive metastore) to manage permissions on catalogs, schemas, tables, and views.
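As a minimal sketch of Unity Catalog grants, the statements below give an account-level group read-only access to a single table; the catalog, schema, table, and group names are placeholders, and the commands assume a notebook attached to a Unity Catalog-enabled cluster.

```python
# Grant an account-level group read-only access to one table.
# USE CATALOG / USE SCHEMA are required so the group can reach the table.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `data_analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `data_analysts`")
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `data_analysts`")

# Inspect the effective grants on the table.
display(spark.sql("SHOW GRANTS ON TABLE main.sales.orders"))
```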

Compute Security

Securing your compute resources:

  • Cluster Isolation: Compute resources are isolated per workspace, and cluster access modes (single user, or shared with user isolation) keep users separated on shared clusters.
  • Managed Services: Azure Databricks manages the underlying infrastructure, patching, and security updates for the Databricks Runtime environment.
  • Secrets Management: Use Databricks secret scopes (ideally backed by Azure Key Vault) so that credentials are never hardcoded in notebooks or scripts.
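As one illustration of the last point, the sketch below configures OAuth access to an ADLS Gen2 account with a service principal whose client secret is pulled from a secret scope at runtime; the storage account, scope and key names, tenant ID, and client ID are placeholders.

```python
storage_account = "mystorageacct"
tenant_id = "00000000-0000-0000-0000-000000000000"  # placeholder tenant ID

# Pull the service principal's client secret from a secret scope instead of
# embedding it in the notebook source.
client_secret = dbutils.secrets.get(scope="kv-scope", key="sp-client-secret")

spark.conf.set(
    f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth"
)
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
    "11111111-1111-1111-1111-111111111111",  # placeholder application (client) ID
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
    client_secret,
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
)

# Read data over ABFS using the service principal credentials configured above.
df = spark.read.parquet(f"abfss://raw@{storage_account}.dfs.core.windows.net/events/")
```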

Compliance and Governance

Meeting regulatory standards:

  • Audit Logs: Azure Databricks provides comprehensive audit logs that capture user activities and system events. These diagnostic logs can be delivered to a Log Analytics workspace, storage account, or event hub via Azure diagnostic settings, and are essential for security monitoring and compliance.
  • Data Governance with Unity Catalog: Unity Catalog provides a centralized governance solution for data and AI assets on your data lake, including lineage, access control, and auditing.
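Where Unity Catalog system tables are enabled, audit events can also be queried directly with SQL; the sketch below assumes the documented system.access.audit schema (event_time, event_date, user_identity, service_name, action_name) and simply pulls the latest week of events.

```python
# Review the most recent week of audit events; runs in a notebook where
# spark and display are predefined.
recent_audit = spark.sql("""
    SELECT event_time, user_identity.email AS user_email, service_name, action_name
    FROM system.access.audit
    WHERE event_date >= date_sub(current_date(), 7)
    ORDER BY event_time DESC
    LIMIT 100
""")
display(recent_audit)
```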

By leveraging these security features, you can build a secure and compliant data analytics environment on Azure Databricks.