Security Reference for Azure Data Lake Storage Gen2
This document provides a comprehensive overview of the security features and best practices for Azure Data Lake Storage Gen2. Data Lake Storage Gen2 is built on Azure Blob Storage and inherits its security model while adding hierarchical namespace capabilities for big data analytics.
Core Security Concepts
Data Lake Storage Gen2 security is layered: Azure platform controls combine with the settings you configure on each storage account. Key areas include:
- Authentication: How users and applications prove their identity to access data.
- Authorization: What actions authenticated users and applications are permitted to perform.
- Encryption: Protecting data at rest and in transit.
- Network Security: Controlling access to your storage account over the network.
- Auditing and Monitoring: Tracking access and operations for compliance and security analysis.
Access Control Mechanisms
Data Lake Storage Gen2 supports multiple access control models, allowing you to choose the most suitable option for your needs:
1. Azure Role-Based Access Control (RBAC)
Azure RBAC manages access to Azure resources, including storage accounts and containers. You assign roles (e.g., Storage Blob Data Owner, Storage Blob Data Contributor, Storage Blob Data Reader) to users, groups, service principals, or managed identities at a chosen scope: management group, subscription, resource group, storage account, or container.
Key Benefits:
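- Coarse-grained control: a role applies to every file and directory under the scope where it is assigned.
- Integration with Microsoft Entra ID (formerly Azure Active Directory) for centralized identity management.
- Applies to requests made through both the Blob endpoint (blob.core.windows.net) and the Data Lake Storage endpoint (dfs.core.windows.net).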
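As an illustration of a role assignment at container scope, the sketch below wraps `az role assignment create` in a small shell function. The function name and its argument order are hypothetical; the role name and scope format follow the Azure resource ID convention.

```shell
# Hypothetical helper: grant read-only data access on a single container.
# $1 = assignee (sign-in name, object ID, or service principal name)
# $2 = subscription ID   $3 = resource group
# $4 = storage account   $5 = container (filesystem) name
grant_container_reader() {
  # Container-level scope; remove the trailing segments to widen it to the account.
  scope="/subscriptions/$2/resourceGroups/$3/providers/Microsoft.Storage/storageAccounts/$4/blobServices/default/containers/$5"
  az role assignment create \
    --assignee "$1" \
    --role "Storage Blob Data Reader" \
    --scope "$scope"
}
```

A call might look like `grant_container_reader analyst@example.com <subscription-id> yourresourcegroup yourstorageaccountname raw`, where every value is a placeholder.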
2. Access Control Lists (ACLs)
ACLs provide POSIX-like permissions for individual files and directories within a Data Lake Storage Gen2 filesystem. This allows granular control over who can read, write, or execute specific items; on a directory, execute permission is what allows a caller to traverse it to reach child items.
Key Benefits:
- Hierarchical permissions mirroring the filesystem structure.
- Supports owning user, owning group, named user, named group, mask, and other entries, plus default ACLs on directories that new child items inherit.
- Essential for big data analytics workloads where different users or groups need access to specific datasets.
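A minimal sketch of granting one user read access to a directory tree with the Azure CLI. The function name and arguments are hypothetical; `--auth-mode login` assumes the caller is signed in with an identity allowed to change ACLs (for example, one holding Storage Blob Data Owner).

```shell
# Hypothetical helper: add a named-user ACL entry to a directory and its children.
# $1 = storage account   $2 = filesystem (container)
# $3 = directory path    $4 = object ID of the user in the directory tenant
grant_directory_read() {
  # update-recursive merges this entry into existing ACLs instead of replacing them.
  az storage fs access update-recursive \
    --account-name "$1" \
    --file-system "$2" \
    --path "$3" \
    --acl "user:$4:r-x" \
    --auth-mode login
}
```

The user also needs execute permission on each parent directory up to the filesystem root; this sketch does not set that.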
Recommendation: Use RBAC for broad access management and ACLs for fine-grained control over data within the filesystem.
Data Encryption
Data Lake Storage Gen2 always encrypts data at rest, and encrypts data in transit whenever clients connect over HTTPS.
Encryption at Rest
All data stored in Data Lake Storage Gen2 is automatically encrypted using AES-256. You have two options for managing the encryption keys:
- Microsoft-managed keys: Azure automatically manages the keys used for encryption. This is the default and requires no configuration.
- Customer-managed keys: You can use your own encryption keys stored in Azure Key Vault. This provides greater control over key rotation and access policies.
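The following sketch shows one way to point a storage account at a key in Azure Key Vault. The function name and argument order are hypothetical. It assumes the account already has a managed identity with get, wrap key, and unwrap key permissions on the vault, and that the vault has soft delete and purge protection enabled.

```shell
# Hypothetical helper: switch a storage account to a customer-managed key.
# $1 = storage account   $2 = resource group
# $3 = key vault URI (https://<vault-name>.vault.azure.net)
# $4 = key name          $5 = key version
use_customer_managed_key() {
  az storage account update \
    --name "$1" \
    --resource-group "$2" \
    --encryption-key-source Microsoft.Keyvault \
    --encryption-key-vault "$3" \
    --encryption-key-name "$4" \
    --encryption-key-version "$5"
}
```

Pinning a key version as shown means a rotated key takes effect only after the account is updated with the new version.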
Encryption in Transit
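When the "Secure transfer required" setting is enabled (the default for new storage accounts), Data Lake Storage Gen2 rejects requests made over plain HTTP, so data is encrypted while it crosses the network. The minimum TLS version is a separate account setting; set it to TLS 1.2 so that clients using older protocol versions cannot connect.
# Example of requiring secure transfer (HTTPS) and TLS 1.2 or later for a storage account
az storage account update --name yourstorageaccountname --resource-group yourresourcegroup --https-only true --min-tls-version TLS1_2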
Network Security
Control how your storage account is accessed from the network.
Firewalls and Virtual Networks
You can restrict access to your storage account by configuring firewall rules to allow traffic only from specific IP addresses or virtual networks. This is crucial for preventing unauthorized access from public networks.
Steps:
- Navigate to your storage account in the Azure portal.
- Under "Security + networking", select "Networking".
- Configure "Firewalls and virtual networks" to allow access from your trusted sources.
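The portal steps above have a rough Azure CLI equivalent, sketched here for a single IP range. The function name and arguments are hypothetical.

```shell
# Hypothetical helper: allow one public IP address or CIDR range, deny everything else.
# $1 = storage account   $2 = resource group   $3 = IP address or CIDR range
restrict_to_ip_range() {
  # Add the allow rule first so the caller is not locked out
  # when the default action changes to Deny.
  az storage account network-rule add \
    --account-name "$1" \
    --resource-group "$2" \
    --ip-address "$3"
  az storage account update \
    --name "$1" \
    --resource-group "$2" \
    --default-action Deny
}
```

To allow a virtual network subnet instead, `az storage account network-rule add` also accepts `--vnet-name` and `--subnet` in place of `--ip-address`.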
Private Endpoints
Private endpoints let clients reach your Data Lake Storage Gen2 account over a private IP address within your virtual network. A private endpoint does not block the public endpoint by itself; to keep the account off the public internet, also disable public network access or set the firewall default action to deny.
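A sketch of creating a private endpoint for the Data Lake Storage (`dfs`) sub-resource and then closing the public endpoint. The function name, argument order, and the endpoint and connection names derived from the account name are all hypothetical.

```shell
# Hypothetical helper: private endpoint for the dfs sub-resource, public access off.
# $1 = resource group   $2 = storage account
# $3 = virtual network  $4 = subnet
create_dfs_private_endpoint() {
  # Look up the storage account's resource ID for the private link connection.
  account_id=$(az storage account show \
    --name "$2" --resource-group "$1" --query id --output tsv)
  az network private-endpoint create \
    --name "$2-dfs-pe" \
    --resource-group "$1" \
    --vnet-name "$3" \
    --subnet "$4" \
    --private-connection-resource-id "$account_id" \
    --group-id dfs \
    --connection-name "$2-dfs-connection"
  # Without this step the public endpoint stays reachable.
  az storage account update \
    --name "$2" --resource-group "$1" \
    --public-network-access Disabled
}
```

Clients inside the virtual network still need DNS to resolve the account name to the private IP, typically through a `privatelink.dfs.core.windows.net` private DNS zone. A second endpoint for the `blob` sub-resource is often needed as well, because some tools call the Blob endpoint.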
Important Note:
Azure RBAC and ACLs are evaluated together, in a fixed order. Role assignments are checked first: if a role permits the requested operation, access is granted and ACLs are not consulted. ACLs are evaluated only when no role assignment grants the operation, so they can extend access to specific files and directories but cannot restrict access that a role has already granted.
Best Practices for Securing Data Lake Storage Gen2
- Principle of Least Privilege: Grant only the necessary permissions to users and applications.
- Use Microsoft Entra ID (Azure AD) for Identity Management: Centralize user and group management.
- Implement RBAC and ACLs together: Leverage the strengths of both for granular control.
- Enable Firewall Rules and Private Endpoints: Restrict network access to authorized sources.
- Use Customer-Managed Keys (CMK) for Sensitive Data: Enhance control over encryption keys.
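- Regularly Audit and Monitor Access: Send Azure Storage resource logs to Azure Monitor; classic Storage Analytics logs are the older alternative.
- Secure Service Principals and Managed Identities: Prefer them over Shared Key authorization, and keep their role assignments narrowly scoped.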
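Two of the practices above can be sketched with the Azure CLI: routing read, write, and delete logs to a Log Analytics workspace, and turning off Shared Key authorization. The function names, argument order, and the diagnostic setting name `storage-audit` are hypothetical.

```shell
# Hypothetical helper: send blob service resource logs to Log Analytics.
# $1 = storage account   $2 = resource group
# $3 = Log Analytics workspace name or resource ID
enable_audit_logging() {
  account_id=$(az storage account show \
    --name "$1" --resource-group "$2" --query id --output tsv)
  # Data Lake Storage Gen2 requests are logged under the blob service resource.
  az monitor diagnostic-settings create \
    --name "storage-audit" \
    --resource "$account_id/blobServices/default" \
    --workspace "$3" \
    --logs '[{"category":"StorageRead","enabled":true},{"category":"StorageWrite","enabled":true},{"category":"StorageDelete","enabled":true}]'
}

# Hypothetical helper: reject requests authorized with the account access keys.
# $1 = storage account   $2 = resource group
disable_shared_key_access() {
  az storage account update \
    --name "$1" --resource-group "$2" \
    --allow-shared-key-access false
}
```

Disabling Shared Key access also breaks SAS tokens signed with the account key, so confirm that no client depends on them before calling `disable_shared_key_access`.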
For more detailed information and configuration guides, please refer to the official Azure documentation on Data Lake Storage Gen2 security.