Access Control in Azure Databricks
This document provides a comprehensive guide to implementing robust access control mechanisms within Azure Databricks to ensure data security and manage user permissions effectively.
Introduction
Azure Databricks offers granular control over who can access what resources within your workspace. This is crucial for maintaining data governance, preventing unauthorized access, and ensuring compliance with organizational policies.
Workspace Access Control
The foundation of access control in Azure Databricks lies in managing users and groups and assigning them appropriate permissions within the workspace.
Users and Groups
Azure Databricks integrates with Microsoft Entra ID (formerly Azure Active Directory, or Azure AD) for identity management. You can synchronize users and groups from Entra ID to your Databricks account and workspaces.
- Users: Individual identities that can be granted permissions.
- Groups: Collections of users that simplify permission management. Granting permissions to a group automatically applies those permissions to all its members.
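To illustrate why group-based grants scale better than per-user grants, here is a minimal Python model (purely illustrative, not a Databricks API; all user and group names are hypothetical) that resolves a user's effective permissions through group membership:

```python
# Illustrative model only: Databricks resolves group membership server-side.
# This sketch just shows that one grant to a group covers every member.
groups = {
    "data-analysts": {"alice@example.com", "bob@example.com"},
}
grants = {
    "data-analysts": {"CAN_VIEW", "CAN_RUN"},   # one grant, many users
    "alice@example.com": {"CAN_EDIT"},          # direct per-user grant
}

def effective_permissions(user: str) -> set[str]:
    """Union of the user's direct grants and every group grant they inherit."""
    perms = set(grants.get(user, set()))
    for group, members in groups.items():
        if user in members:
            perms |= grants.get(group, set())
    return perms

print(effective_permissions("bob@example.com"))  # inherited via the group only
```

Adding a new analyst to the group is a single membership change; no individual permissions need to be re-granted.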
Permissions Overview
Permissions in Databricks can be categorized as follows:
- Can View: Allows read-only access to a resource.
- Can Run: Allows running a notebook or job.
- Can Edit: Allows modifying a notebook or job.
- Can Manage: Allows full control, including deleting and changing permissions.
These permissions are applied to various workspace objects:
- Notebooks
- Folders
- Clusters
- Jobs
- Pools
- Models (MLflow)
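These levels are cumulative: a stronger level implies every weaker one (Can Manage includes Can Edit, which includes Can Run, and so on). A small Python sketch of that ordering (illustrative only, not part of any Databricks SDK):

```python
# Illustrative sketch: permission levels are cumulative, so a higher
# level satisfies any check for a lower one.
LEVELS = ["CAN_VIEW", "CAN_RUN", "CAN_EDIT", "CAN_MANAGE"]  # weakest -> strongest

def allows(granted: str, required: str) -> bool:
    """True if the granted level is at least as strong as the required one."""
    return LEVELS.index(granted) >= LEVELS.index(required)

print(allows("CAN_EDIT", "CAN_RUN"))   # True: editing implies running
print(allows("CAN_VIEW", "CAN_EDIT"))  # False: viewers cannot edit
```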
Data Access Control
Controlling access to the data itself is paramount. Azure Databricks provides mechanisms to secure data stored in various locations.
Table ACLs (Access Control Lists)
For data stored in Unity Catalog or the Hive Metastore, Table ACLs allow you to define permissions on tables, views, and schemas. This enables fine-grained control over data access directly within Databricks.
- Grant/Revoke Permissions: You can grant privileges such as SELECT, MODIFY, and CREATE on specific securable objects (the exact privilege set differs between Unity Catalog and the legacy Hive metastore).
- Data Owners: Can manage permissions for their data assets.
Example SQL command:
GRANT SELECT ON TABLE sales_data TO `data-analysts-group`;
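Complementary statements let you revoke a privilege or inspect the current grants (the table and group names here are the same hypothetical ones as in the example above):

```sql
-- Revoke a previously granted privilege
REVOKE SELECT ON TABLE sales_data FROM `data-analysts-group`;

-- List the privileges currently granted on the table
SHOW GRANTS ON TABLE sales_data;
```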
External Data Sources
When accessing data from external sources like Azure Data Lake Storage (ADLS) Gen2 or Azure Blob Storage, access control is managed through:
- Service Principals: Use Azure AD Service Principals with appropriate RBAC roles (e.g., Storage Blob Data Reader) for secure access to storage accounts.
- Credential Passthrough (legacy): Inherits the identity of the user running the Databricks workload to access data in ADLS Gen2. This feature is deprecated; Unity Catalog is the recommended replacement.
- Unity Catalog Volumes: Provides a secure way to access files in cloud object storage, managed by Databricks.
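For the Service Principal approach, Spark is typically configured with the ABFS OAuth properties for the storage account. The sketch below only builds that configuration as a plain dict; the keys follow the standard Hadoop ABFS OAuth settings, while the account name and credentials are placeholders (in practice, read the secret from a secret scope, never from source code):

```python
def adls_oauth_conf(storage_account: str, client_id: str,
                    client_secret: str, tenant_id: str) -> dict[str, str]:
    """Spark conf entries for service-principal OAuth access to ADLS Gen2.

    All argument values are placeholders; apply each returned pair with
    spark.conf.set(key, value) in a real workspace, sourcing the secret
    from a secret scope (e.g. dbutils.secrets.get).
    """
    suffix = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }
```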
Cluster Access Control
Controlling who can create, manage, and use clusters is essential for resource governance and cost management.
Cluster Permissions
Cluster permission levels determine what a user can do with a given cluster:
- Can Attach To: Allows users to attach notebooks to the cluster.
- Can Restart: Allows users to restart and terminate the cluster.
- Can Manage: Allows full control over the cluster, including editing its configuration, changing its permissions, and deleting it.
Note that creating clusters is governed separately, by a workspace-level cluster-creation entitlement rather than by a per-cluster permission.
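Cluster permissions can be set in the UI or programmatically via the REST Permissions API (e.g. a PATCH to /api/2.0/permissions/clusters/<cluster_id>). The sketch below only constructs the JSON request body; the field names follow the public Permissions API, while the group names are hypothetical:

```python
import json

def cluster_acl_payload(grants: dict[str, str]) -> str:
    """Build a Permissions API request body mapping each group name to a
    cluster permission level (CAN_ATTACH_TO, CAN_RESTART, or CAN_MANAGE)."""
    acl = [{"group_name": g, "permission_level": lvl} for g, lvl in grants.items()]
    return json.dumps({"access_control_list": acl}, indent=2)

print(cluster_acl_payload({
    "data-engineers": "CAN_MANAGE",    # hypothetical group names
    "data-analysts": "CAN_ATTACH_TO",
}))
```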
Pool Permissions
Permissions can also be applied to cluster pools, controlling who can use them to launch clusters.
Notebook and Job Access Control
Secure your analytical workflows by controlling access to notebooks and jobs.
- Notebook Permissions: Manage who can view, run, edit, or manage notebooks. This prevents unauthorized modifications or execution of sensitive code.
- Job Permissions: Control who can view, run, edit, or manage scheduled jobs. This is critical for maintaining the integrity of automated data pipelines.
Best Practices
To effectively manage access control in Azure Databricks, consider the following best practices:
- Principle of Least Privilege: Grant users only the permissions they need to perform their tasks.
- Use Groups Extensively: Manage permissions via Azure AD groups for simplified administration.
- Leverage Unity Catalog: For unified governance, discoverability, and fine-grained access control to data.
- Regularly Audit Permissions: Periodically review user and group permissions to ensure they are still appropriate.
- Secure Cluster Creation: Restrict who can create clusters and configure appropriate instance types and sizes to control costs.
- Utilize Service Principals for Automation: For programmatic access to resources, use Service Principals with limited scopes.
- Implement Data Masking and Row-Level Security: For highly sensitive data, consider these advanced techniques where applicable.
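A SQL statement that supports the permission-audit practice above (this assumes Unity Catalog system tables are enabled in your workspace; the column selection is a sketch and may need adjusting to your schema version):

```sql
-- Review recent access events from the audit system table
SELECT event_time, user_identity.email, action_name
FROM system.access.audit
WHERE event_date >= current_date() - INTERVAL 7 DAYS
ORDER BY event_time DESC;
```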