Access Control in Azure Databricks
This document provides a comprehensive guide to implementing robust access control mechanisms within Azure Databricks to ensure data security and manage user permissions effectively.
Introduction
Azure Databricks offers granular control over who can access what resources within your workspace. This is crucial for maintaining data governance, preventing unauthorized access, and ensuring compliance with organizational policies.
Workspace Access Control
The foundation of access control in Azure Databricks lies in managing users and groups and assigning them appropriate permissions within the workspace.
Users and Groups
Azure Databricks integrates with Microsoft Entra ID (formerly Azure Active Directory, or Azure AD) for identity management. You can synchronize users and groups from Entra ID to your Databricks account and workspaces.
- Users: Individual identities that can be granted permissions.
- Groups: Collections of users that simplify permission management. Granting permissions to a group automatically applies those permissions to all its members.
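To illustrate why group-based grants scale better than per-user grants, here is a minimal Python model (purely illustrative, not a Databricks API; all user and group names are hypothetical) that resolves a user's effective permissions through group membership:

```python
# Illustrative model only: Databricks resolves group membership server-side.
# This sketch just shows that one grant to a group covers every member.
groups = {
    "data-analysts": {"alice@example.com", "bob@example.com"},
}
grants = {
    "data-analysts": {"CAN_VIEW", "CAN_RUN"},   # one grant, many users
    "alice@example.com": {"CAN_EDIT"},          # direct per-user grant
}

def effective_permissions(user: str) -> set[str]:
    """Union of the user's direct grants and every group grant they inherit."""
    perms = set(grants.get(user, set()))
    for group, members in groups.items():
        if user in members:
            perms |= grants.get(group, set())
    return perms

print(effective_permissions("bob@example.com"))  # inherited via the group only
```

Adding a new analyst to the group is a single membership change; no individual permissions need to be re-granted.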
Permissions Overview
Permissions in Databricks can be categorized as follows:
- Can View: Allows read-only access to a resource.
- Can Run: Allows running a notebook or job.
- Can Edit: Allows modifying a notebook or job.
- Can Manage: Allows full control, including deleting and changing permissions.
These permissions are applied to various workspace objects:
- Notebooks
- Folders
- Clusters
- Jobs
- Pools
- Models (MLflow)
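These levels are cumulative: a stronger level implies every weaker one (Can Manage includes Can Edit, which includes Can Run, and so on). A small Python sketch of that ordering (illustrative only, not part of any Databricks SDK):

```python
# Illustrative sketch: permission levels are cumulative, so a higher
# level satisfies any check for a lower one.
LEVELS = ["CAN_VIEW", "CAN_RUN", "CAN_EDIT", "CAN_MANAGE"]  # weakest -> strongest

def allows(granted: str, required: str) -> bool:
    """True if the granted level is at least as strong as the required one."""
    return LEVELS.index(granted) >= LEVELS.index(required)

print(allows("CAN_EDIT", "CAN_RUN"))   # True: editing implies running
print(allows("CAN_VIEW", "CAN_EDIT"))  # False: viewers cannot edit
```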
Data Access Control
Controlling access to the data itself is paramount. Azure Databricks provides mechanisms to secure data stored in various locations.
Table ACLs (Access Control Lists)
For data stored in Unity Catalog or the Hive Metastore, Table ACLs allow you to define permissions on tables, views, and schemas. This enables fine-grained control over data access directly within Databricks.
- Grant/Revoke Permissions: You can grant privileges such as SELECT, MODIFY, and CREATE on specific securable objects (the exact privilege set differs between Unity Catalog and the legacy Hive metastore).
- Data Owners: Can manage permissions for their data assets.
Example SQL command:
GRANT SELECT ON TABLE sales_data TO `data-analysts-group`;
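Complementary statements let you revoke a privilege or inspect the current grants (the table and group names here are the same hypothetical ones as in the example above):

```sql
-- Revoke a previously granted privilege
REVOKE SELECT ON TABLE sales_data FROM `data-analysts-group`;

-- List the privileges currently granted on the table
SHOW GRANTS ON TABLE sales_data;
```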
External Data Sources
When accessing data from external sources like Azure Data Lake Storage (ADLS) Gen2 or Azure Blob Storage, access control is managed through:
- Service Principals: Use Azure AD Service Principals with appropriate RBAC roles (e.g., Storage Blob Data Reader) for secure access to storage accounts.
- Credential Passthrough (legacy): Inherits the identity of the user running the Databricks workload to access data in ADLS Gen2. This feature is deprecated; Unity Catalog is the recommended replacement.
- Unity Catalog Volumes: Provides a secure way to access files in cloud object storage, managed by Databricks.
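For the Service Principal approach, Spark is typically configured with the ABFS OAuth properties for the storage account. The sketch below only builds that configuration as a plain dict; the keys follow the standard Hadoop ABFS OAuth settings, while the account name and credentials are placeholders (in practice, read the secret from a secret scope, never from source code):

```python
def adls_oauth_conf(storage_account: str, client_id: str,
                    client_secret: str, tenant_id: str) -> dict[str, str]:
    """Spark conf entries for service-principal OAuth access to ADLS Gen2.

    All argument values are placeholders; apply each returned pair with
    spark.conf.set(key, value) in a real workspace, sourcing the secret
    from a secret scope (e.g. dbutils.secrets.get).
    """
    suffix = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }
```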
Cluster Access Control
Controlling who can create, manage, and use clusters is essential for resource governance and cost management.
Cluster Permissions
Cluster permission levels determine what a user can do with a given cluster:
- Can Attach To: Allows users to attach notebooks to the cluster.
- Can Restart: Allows users to restart and terminate the cluster.
- Can Manage: Allows full control over the cluster, including editing its configuration, changing its permissions, and deleting it.
Note that creating clusters is governed separately, by a workspace-level cluster-creation entitlement rather than by a per-cluster permission.
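Cluster permissions can be set in the UI or programmatically via the REST Permissions API (e.g. a PATCH to /api/2.0/permissions/clusters/<cluster_id>). The sketch below only constructs the JSON request body; the field names follow the public Permissions API, while the group names are hypothetical:

```python
import json

def cluster_acl_payload(grants: dict[str, str]) -> str:
    """Build a Permissions API request body mapping each group name to a
    cluster permission level (CAN_ATTACH_TO, CAN_RESTART, or CAN_MANAGE)."""
    acl = [{"group_name": g, "permission_level": lvl} for g, lvl in grants.items()]
    return json.dumps({"access_control_list": acl}, indent=2)

print(cluster_acl_payload({
    "data-engineers": "CAN_MANAGE",    # hypothetical group names
    "data-analysts": "CAN_ATTACH_TO",
}))
```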
Pool Permissions
Permissions can also be applied to cluster pools, controlling who can use them to launch clusters.
Notebook and Job Access Control
Secure your analytical workflows by controlling access to notebooks and jobs.
- Notebook Permissions: Manage who can view, run, edit, or manage notebooks. This prevents unauthorized modifications or execution of sensitive code.
- Job Permissions: Control who can view, run, edit, or manage scheduled jobs. This is critical for maintaining the integrity of automated data pipelines.
Best Practices
To effectively manage access control in Azure Databricks, consider the following best practices:
- Principle of Least Privilege: Grant users only the permissions they need to perform their tasks.
- Use Groups Extensively: Manage permissions via Azure AD groups for simplified administration.
- Leverage Unity Catalog: For unified governance, discoverability, and fine-grained access control to data.
- Regularly Audit Permissions: Periodically review user and group permissions to ensure they are still appropriate.
- Secure Cluster Creation: Restrict who can create clusters and configure appropriate instance types and sizes to control costs.
- Utilize Service Principals for Automation: For programmatic access to resources, use Service Principals with limited scopes.
- Implement Data Masking and Row-Level Security: For highly sensitive data, consider these advanced techniques where applicable.
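A SQL statement that supports the permission-audit practice above (this assumes Unity Catalog system tables are enabled in your workspace; the column selection is a sketch and may need adjusting to your schema version):

```sql
-- Review recent access events from the audit system table
SELECT event_time, user_identity.email, action_name
FROM system.access.audit
WHERE event_date >= current_date() - INTERVAL 7 DAYS
ORDER BY event_time DESC;
```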