Azure Data Lake Storage Gen2 Python SDK
This reference covers the Azure Data Lake Storage Gen2 SDK for Python, which lets you work with Data Lake Storage Gen2 accounts programmatically: creating file systems, managing directories and files, and controlling access.
The SDK simplifies building applications that rely on the scalability and performance of Azure Data Lake Storage Gen2.
Getting Started
Before you begin, ensure you have the following:
- An Azure subscription.
- An Azure Storage account with hierarchical namespace enabled (Data Lake Storage Gen2).
- Python 3.6 or later installed.
Installation
Install the Azure Data Lake Storage Gen2 SDK for Python using pip:
pip install azure-storage-file-datalake
Authentication
You can authenticate to Azure Data Lake Storage Gen2 using various methods. The most common are:
- Connection String: A connection string provides all the information needed to connect to your storage account.
- Azure Active Directory (Azure AD): Using service principals or managed identities for more secure authentication.
Using a Connection String
from azure.storage.filedatalake import DataLakeServiceClient
connection_string = "YOUR_CONNECTION_STRING"
datalake_service_client = DataLakeServiceClient.from_connection_string(connection_string)
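A hard-coded connection string can leak the account key through source control. A minimal sketch of reading it from the environment instead; the variable name AZURE_STORAGE_CONNECTION_STRING and both helper functions are illustrative assumptions, not part of the SDK:

```python
import os


def connection_string_from_env(var_name="AZURE_STORAGE_CONNECTION_STRING"):
    """Return the connection string stored in the given environment variable."""
    value = os.environ.get(var_name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return value


def service_client_from_env(var_name="AZURE_STORAGE_CONNECTION_STRING"):
    """Build a DataLakeServiceClient from the environment (hypothetical helper)."""
    # Imported lazily so this module still loads where the SDK is not installed.
    from azure.storage.filedatalake import DataLakeServiceClient

    return DataLakeServiceClient.from_connection_string(
        connection_string_from_env(var_name)
    )
```

Calling service_client_from_env() then replaces the literal "YOUR_CONNECTION_STRING" in application code.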
Using Azure AD (DefaultAzureCredential)
Ensure you have the necessary Azure AD configurations and the azure-identity package installed:
pip install azure-identity
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient
account_url = "https://youraccountname.dfs.core.windows.net"
credential = DefaultAzureCredential()
datalake_service_client = DataLakeServiceClient(account_url, credential)
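The account URL above follows a fixed pattern, so it can be derived from the account name. A small sketch, assuming the public Azure cloud endpoint suffix (sovereign clouds use a different suffix); account_url_for is a hypothetical helper, not an SDK function:

```python
import re

# Storage account names are 3-24 characters, lowercase letters and digits only.
_ACCOUNT_NAME = re.compile(r"[a-z0-9]{3,24}")


def account_url_for(account_name):
    """Return the Data Lake (dfs) endpoint URL for a storage account name."""
    if not _ACCOUNT_NAME.fullmatch(account_name):
        raise ValueError(f"Invalid storage account name: {account_name!r}")
    return f"https://{account_name}.dfs.core.windows.net"


print(account_url_for("youraccountname"))
# prints https://youraccountname.dfs.core.windows.net
```

Validating the name up front turns a typo into an immediate ValueError rather than a connection failure on the first request.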
Core Concepts
The SDK revolves around several key objects:
- DataLakeServiceClient: The top-level client for interacting with Data Lake Storage Gen2.
- FileSystemClient: Represents a file system (container) within your storage account. The Python package names this class FileSystemClient; DataLakeFileSystemClient is the name used by the .NET and Java SDKs.
- DataLakeDirectoryClient: Represents a directory within a file system.
- DataLakeFileClient: Represents a file within a directory or file system.
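The clients form a hierarchy: each one hands out clients for the level below it. A minimal sketch of walking that hierarchy from a single path string; resolve_file_client and the '<file system>/<path to file>' convention are assumptions made for illustration, while get_file_system_client and get_file_client are the SDK calls being chained:

```python
def resolve_file_client(service_client, full_path):
    """Return a file client for a path of the form '<file system>/<path to file>'."""
    file_system_name, _, file_path = full_path.strip("/").partition("/")
    if not file_system_name or not file_path:
        raise ValueError("Expected '<file system>/<path to file>'")
    # Service client -> file system client -> file client.
    file_system_client = service_client.get_file_system_client(file_system_name)
    return file_system_client.get_file_client(file_path)
```

For example, resolve_file_client(datalake_service_client, "myfilesystem/mydata/sample.txt") would return a client for sample.txt in the mydata directory. Getting a client does not contact the service, so no existence check happens at this point.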
API Reference
DataLakeServiceClient
Provides methods to manage file systems and access Data Lake Storage Gen2.
Methods
- create_file_system(file_system: str, **kwargs)
  Creates a new file system. The call fails if a file system with the same name already exists.
  Parameters:
    file_system - Required. The name of the file system to create.
  Returns:
    FileSystemClient - A client for the newly created file system.
- get_file_system_client(file_system: str)
  Gets a client for the specified file system. No request is sent to the service, so the file system is not checked for existence.
  Parameters:
    file_system - Required. The name of the file system.
  Returns:
    FileSystemClient - A client for the specified file system.
- list_file_systems(**kwargs)
  Lists all file systems in the storage account.
  Returns:
    ItemPaged[FileSystemProperties] - An iterable of file system properties.
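The three methods above combine naturally into a "get or create" step. A minimal sketch, assuming the properties yielded by list_file_systems expose a name attribute; ensure_file_system is a hypothetical helper, and the check is not atomic:

```python
def ensure_file_system(service_client, name):
    """Return a client for the named file system, creating it when it is missing."""
    existing = {fs.name for fs in service_client.list_file_systems()}
    if name in existing:
        return service_client.get_file_system_client(name)
    # Not atomic: another process could create the same name between the
    # listing above and this call, in which case create_file_system raises.
    return service_client.create_file_system(name)
```

Listing every file system is wasteful on accounts with many of them; this form is chosen only because it uses nothing beyond the three methods documented in this section.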
FileSystemClient
Represents a file system (container) and provides methods for managing its contents. Returned by DataLakeServiceClient.create_file_system and DataLakeServiceClient.get_file_system_client.
Methods
- create_directory(directory: str, **kwargs)
  Creates a new directory within the file system.
  Parameters:
    directory - Required. The name of the directory to create.
  Returns:
    DataLakeDirectoryClient - A client for the newly created directory.
- get_directory_client(directory: str)
  Gets a client for the specified directory.
  Parameters:
    directory - Required. The name of the directory.
  Returns:
    DataLakeDirectoryClient - A client for the specified directory.
- get_file_client(file_path: str)
  Gets a client for a file, identified by its path within the file system.
  Parameters:
    file_path - Required. The path of the file, for example sample.txt or mydata/sample.txt.
  Returns:
    DataLakeFileClient - A client for the specified file.
- get_paths(path: str = None, recursive: bool = True, **kwargs)
  Lists files and directories within the file system.
  Parameters:
    path - Optional. Restricts the listing to this directory.
    recursive - Optional. Defaults to True. If False, only the immediate children are listed.
  Returns:
    ItemPaged[PathProperties] - An iterable of path properties (name, is_directory, and so on).
- delete_file(file: str, **kwargs)
  Deletes a file from the file system.
  Parameters:
    file - Required. The path of the file to delete.
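Listing results mix files and directories, so callers usually separate them. A minimal sketch that uses get_paths, the listing call on the file system client in the azure-storage-file-datalake package, and relies on the name and is_directory attributes of each returned item; split_paths is a hypothetical helper:

```python
def split_paths(file_system_client, path=None, recursive=False):
    """Return (directories, files) as two sorted lists of path names."""
    directories, files = [], []
    for item in file_system_client.get_paths(path=path, recursive=recursive):
        (directories if item.is_directory else files).append(item.name)
    return sorted(directories), sorted(files)
```

With recursive=False the result describes one level only, which keeps the listing cheap on large file systems; pass recursive=True to walk the whole subtree.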
DataLakeDirectoryClient
Represents a directory and provides methods for managing its contents.
Methods
- create_sub_directory(sub_directory: str, **kwargs)
  Creates a new subdirectory.
  Parameters:
    sub_directory - Required. The name of the subdirectory to create.
  Returns:
    DataLakeDirectoryClient - A client for the newly created subdirectory.
- get_sub_directory_client(sub_directory: str)
  Gets a client for the specified subdirectory.
  Parameters:
    sub_directory - Required. The name of the subdirectory.
  Returns:
    DataLakeDirectoryClient - A client for the specified subdirectory.
- get_file_client(file: str)
  Gets a client for a file within this directory.
  Parameters:
    file - Required. The name of the file.
  Returns:
    DataLakeFileClient - A client for the specified file.
- get_paths(**kwargs)
  Lists files and subdirectories within this directory. Available in recent releases of the package.
  Returns:
    ItemPaged[PathProperties] - An iterable of path properties.
- Deleting a file
  The directory client has no delete_file method of its own. To delete a file within this directory, get its client and delete through it: get_file_client(file).delete_file().
- delete_directory(**kwargs)
  Deletes the directory together with all of its contents.
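Directory clients can be chained to reach nested locations. A minimal sketch that follows a sequence of subdirectory names; it uses get_sub_directory_client, which is how the azure-storage-file-datalake package spells the method, and both helper names are hypothetical:

```python
def descend(directory_client, *parts):
    """Follow a chain of subdirectory names and return the final directory client."""
    client = directory_client
    for part in parts:
        client = client.get_sub_directory_client(part)
    return client


def file_in(directory_client, *parts):
    """Return a file client for the last name in parts, under the preceding names."""
    if not parts:
        raise ValueError("At least a file name is required")
    *directories, file_name = parts
    return descend(directory_client, *directories).get_file_client(file_name)
```

For example, file_in(directory_client, "2024", "01", "events.json") would return a client for 2024/01/events.json below the starting directory. No request is sent until the returned client is used.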
DataLakeFileClient
Represents a file and provides methods for reading, writing, and managing its properties.
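Methods
- upload_data(data, length: int = None, overwrite: bool = False, **kwargs)
  Uploads data to the file. By default the call fails if the file already exists; pass overwrite=True to replace it.
  Parameters:
    data - Required. The content to upload, for example bytes.
    overwrite - Optional. Defaults to False. If True, replaces an existing file.
- append_data(data, offset: int, length: int = None, **kwargs)
  Stages data for appending to the file at the given offset. The data becomes part of the file only after flush_data(offset) is called with the resulting total length.
  Parameters:
    data - Required. The bytes to append.
    offset - Required. The position in the file at which the data starts.
    length - Optional. The size of the data in bytes.
- download_file(**kwargs)
  Downloads the file. Call readall() on the result to get the entire content.
  Returns:
    StorageStreamDownloader - A downloader whose readall() method returns the content of the file as bytes.
- delete_file(**kwargs)
  Deletes the file.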
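Appending is a two-step operation in the azure-storage-file-datalake package: append_data takes the offset at which the data starts, and the staged data is committed by flush_data with the final length. A minimal sketch of writing a file in chunks under those rules; append_chunks is a hypothetical helper, and it assumes the file should be created fresh:

```python
def append_chunks(file_client, chunks):
    """Create the file, append each chunk in order, then commit with one flush."""
    file_client.create_file()
    offset = 0
    for chunk in chunks:
        # Each append must start at the current end of the staged data.
        file_client.append_data(chunk, offset=offset, length=len(chunk))
        offset += len(chunk)
    if offset:
        # Staged data becomes part of the file only after flush_data.
        file_client.flush_data(offset)
    return offset
```

For content that already fits in memory, a single upload_data call with overwrite=True is simpler; the chunked form is useful when data arrives incrementally.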
Examples
Create a file system, upload a file, and read it
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Replace with your storage account URL
account_url = "https://youraccountname.dfs.core.windows.net"
file_system_name = "myfilesystem"
directory_name = "mydata"
file_name = "sample.txt"
file_content = b"Hello, Azure Data Lake Storage Gen2!"

try:
    # Authenticate
    credential = DefaultAzureCredential()
    datalake_service_client = DataLakeServiceClient(account_url, credential)

    # Create the file system (raises an error if it already exists)
    file_system_client = datalake_service_client.create_file_system(file_system_name)
    print(f"File system '{file_system_name}' created.")

    # Create the directory
    directory_client = file_system_client.create_directory(directory_name)
    print(f"Directory '{directory_name}' created.")

    # Get file client
    file_client = directory_client.get_file_client(file_name)

    # Upload data, replacing the file if it already exists
    file_client.upload_data(file_content, overwrite=True)
    print(f"File '{file_name}' uploaded successfully.")

    # Read data: download_file() returns a downloader, readall() returns bytes
    downloaded_data = file_client.download_file().readall()
    print(f"Content of '{file_name}': {downloaded_data.decode()}")
except Exception as ex:
    print(f"An error occurred: {ex}")
List files and directories
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

account_url = "https://youraccountname.dfs.core.windows.net"
file_system_name = "myfilesystem"

try:
    credential = DefaultAzureCredential()
    datalake_service_client = DataLakeServiceClient(account_url, credential)
    file_system_client = datalake_service_client.get_file_system_client(file_system_name)

    print(f"Listing items in file system '{file_system_name}':")
    # get_paths() lists recursively by default
    for item in file_system_client.get_paths():
        print(f"- Name: {item.name}, Is Directory: {item.is_directory}")
except Exception as ex:
    print(f"An error occurred: {ex}")