Azure Data Lake Storage Gen2 Python SDK

This documentation provides a comprehensive reference for using the Azure Data Lake Storage Gen2 SDK for Python. This SDK allows you to interact with your Data Lake Storage Gen2 accounts programmatically, enabling operations such as creating file systems, managing directories and files, and controlling access.

The Azure Data Lake Storage Gen2 SDK for Python simplifies the development of applications that leverage the scalability and performance of Azure Data Lake Storage Gen2.

Getting Started

Before you begin, ensure you have the following:

- An Azure subscription.
- An Azure Storage account with the hierarchical namespace enabled (Data Lake Storage Gen2).
- Python 3 with pip installed.

Installation

Install the Azure Data Lake Storage Gen2 SDK for Python using pip:

pip install azure-storage-file-datalake

Authentication

You can authenticate to Azure Data Lake Storage Gen2 using various methods. The two most common, shown below, are a connection string and an Azure Active Directory (Azure AD) credential such as DefaultAzureCredential.

Using a Connection String

from azure.storage.filedatalake import DataLakeServiceClient

connection_string = "YOUR_CONNECTION_STRING"
datalake_service_client = DataLakeServiceClient.from_connection_string(connection_string)

Using Azure AD (DefaultAzureCredential)

Ensure you have the necessary Azure AD configurations and the azure-identity package installed:

pip install azure-identity

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

account_url = "https://youraccountname.dfs.core.windows.net"
credential = DefaultAzureCredential()
datalake_service_client = DataLakeServiceClient(account_url, credential)

Core Concepts

The SDK revolves around several key objects:

- DataLakeServiceClient: the account-level client, used to manage file systems.
- DataLakeFileSystemClient: represents a file system (container) and its contents.
- DataLakeDirectoryClient: represents a directory within a file system.
- DataLakeFileClient: represents a file and is used to upload, append, and read data.
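
The clients form a hierarchy, and each can be obtained from the one above it. A minimal sketch, assuming a placeholder account URL and item names that already exist in your account:

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder values: replace with your account URL and existing item names
account_url = "https://youraccountname.dfs.core.windows.net"
service_client = DataLakeServiceClient(account_url, DefaultAzureCredential())

# Account -> file system -> directory -> file
file_system_client = service_client.get_file_system_client("myfilesystem")
directory_client = file_system_client.get_directory_client("mydata")
file_client = directory_client.get_file_client("sample.txt")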

API Reference

DataLakeServiceClient

The top-level client for a storage account. Provides methods to manage file systems and to obtain clients for file systems, directories, and files.

Methods

- from_connection_string(conn_str): create the client from a storage account connection string.
- create_file_system(file_system): create a new file system and return a DataLakeFileSystemClient.
- delete_file_system(file_system): delete an existing file system.
- get_file_system_client(file_system): get a client for an existing file system.
- list_file_systems(): list the file systems in the account.
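
A minimal sketch of account-level operations, assuming the placeholder account URL used above and a throwaway file system name:

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

account_url = "https://youraccountname.dfs.core.windows.net"  # assumed placeholder
service_client = DataLakeServiceClient(account_url, DefaultAzureCredential())

# Create a file system, list all file systems in the account, then remove it
service_client.create_file_system("scratch-data")
for fs in service_client.list_file_systems():
    print(fs.name)
service_client.delete_file_system("scratch-data")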

DataLakeFileSystemClient

Represents a file system (container) and provides methods for managing its contents.

Methods

- create_directory(directory): create a directory and return a DataLakeDirectoryClient.
- create_file(file): create a file and return a DataLakeFileClient.
- delete_directory(directory) / delete_file(file): delete a directory or file.
- get_directory_client(directory) / get_file_client(file_path): get clients for existing paths.
- get_paths(path=None, recursive=True): enumerate the files and directories in the file system.
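
A minimal sketch of file-system-level operations, assuming a file system named "myfilesystem" already exists:

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

account_url = "https://youraccountname.dfs.core.windows.net"  # assumed placeholder
service_client = DataLakeServiceClient(account_url, DefaultAzureCredential())
file_system_client = service_client.get_file_system_client("myfilesystem")

# Create a directory and an empty file, then enumerate everything under "raw"
file_system_client.create_directory("raw")
file_system_client.create_file("raw/placeholder.txt")
for path in file_system_client.get_paths(path="raw"):
    print(path.name, path.is_directory)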

DataLakeDirectoryClient

Represents a directory and provides methods for managing its contents.

Methods

- create_directory(): create the directory this client points to.
- create_sub_directory(sub_directory): create a sub-directory and return its client.
- create_file(file) / get_file_client(file): create or reference a file inside the directory.
- rename_directory(new_name): rename or move the directory (new_name includes the file system name).
- delete_directory(): delete the directory.
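
A minimal sketch of directory-level operations, assuming the same placeholder account URL and an existing file system named "myfilesystem":

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

account_url = "https://youraccountname.dfs.core.windows.net"  # assumed placeholder
service_client = DataLakeServiceClient(account_url, DefaultAzureCredential())
file_system_client = service_client.get_file_system_client("myfilesystem")

# Create a directory, then add a sub-directory and a file inside it
directory_client = file_system_client.create_directory("staging")
directory_client.create_sub_directory("2024")
directory_client.create_file("notes.txt")

# rename_directory takes the new path prefixed with the file system name
directory_client.rename_directory("myfilesystem/archive")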

DataLakeFileClient

Represents a file and provides methods for reading, writing, and managing its properties.

Methods

- create_file(): create (or overwrite) the file.
- upload_data(data, overwrite=False): upload data to the file in a single call.
- append_data(data, offset, length=None) / flush_data(offset): append data and commit it.
- download_file(): return a downloader whose readall() method reads the file contents.
- get_file_properties(): retrieve the file's properties (size, timestamps, metadata).
- delete_file() / rename_file(new_name): delete or rename the file.
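
A minimal sketch of the append/flush pattern and a read-back, assuming the placeholder account URL and an existing file system and directory:

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

account_url = "https://youraccountname.dfs.core.windows.net"  # assumed placeholder
service_client = DataLakeServiceClient(account_url, DefaultAzureCredential())
file_client = service_client.get_file_system_client("myfilesystem").get_file_client("mydata/log.txt")

# Create the file, append two chunks, then flush to commit the appended data
file_client.create_file()
chunk1 = b"first line\n"
chunk2 = b"second line\n"
file_client.append_data(chunk1, offset=0, length=len(chunk1))
file_client.append_data(chunk2, offset=len(chunk1), length=len(chunk2))
file_client.flush_data(len(chunk1) + len(chunk2))

# Read the committed contents back
print(file_client.download_file().readall().decode())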

Examples

Create a file system, upload a file, and read it

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Replace with your storage account URL
account_url = "https://youraccountname.dfs.core.windows.net"
file_system_name = "myfilesystem"
directory_name = "mydata"
file_name = "sample.txt"
file_content = b"Hello, Azure Data Lake Storage Gen2!"

try:
    # Authenticate
    credential = DefaultAzureCredential()
    datalake_service_client = DataLakeServiceClient(account_url, credential)

    # Create the file system if it doesn't already exist
    file_system_client = datalake_service_client.get_file_system_client(file_system_name)
    if not file_system_client.exists():
        file_system_client.create_file_system()
        print(f"File system '{file_system_name}' created.")

    # Create the directory
    directory_client = file_system_client.create_directory(directory_name)
    print(f"Directory '{directory_name}' created.")

    # Get file client
    file_client = directory_client.get_file_client(file_name)

    # Upload data
    file_client.upload_data(file_content, overwrite=True)
    print(f"File '{file_name}' uploaded successfully.")

    # Read the data back
    downloaded_data = file_client.download_file().readall()
    print(f"Content of '{file_name}': {downloaded_data.decode()}")

except Exception as ex:
    print(f"An error occurred: {ex}")

List files and directories

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

account_url = "https://youraccountname.dfs.core.windows.net"
file_system_name = "myfilesystem"

try:
    credential = DefaultAzureCredential()
    datalake_service_client = DataLakeServiceClient(account_url, credential)
    file_system_client = datalake_service_client.get_file_system_client(file_system_name)

    print(f"Listing items in file system '{file_system_name}':")
    for path in file_system_client.get_paths():
        print(f"- Name: {path.name}, Is Directory: {path.is_directory}")

except Exception as ex:
    print(f"An error occurred: {ex}")