Azure Data Lake Storage Gen2

Java SDK Reference

Introduction to Azure Data Lake Storage Gen2 Java SDK

This document provides a comprehensive reference for the Azure Data Lake Storage Gen2 (ADLS Gen2) Java SDK. ADLS Gen2 is a set of capabilities built on Azure Blob Storage, designed for big data analytics workloads. The Java SDK allows you to interact with your ADLS Gen2 data programmatically from your Java applications.

Getting Started

To begin using the ADLS Gen2 Java SDK, you need to include the necessary dependency in your project.

Maven Dependency

<dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-storage-file-datalake</artifactId>
    <version>12.18.0</version> <!-- Check for the latest version -->
</dependency>

Gradle Dependency

implementation 'com.azure:azure-storage-file-datalake:12.18.0' // Check for the latest version

Once the dependency is added, you can start writing Java code to interact with ADLS Gen2.

Core Concepts

  • Storage Account: The top-level container for all Azure Storage objects.
  • File System: A container for data within a storage account (the ADLS Gen2 equivalent of a blob container). It holds the directory and file hierarchy.
  • Directory: A hierarchical structure within a file system, used for organizing files.
  • File: The data object stored within a directory.
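
These concepts map directly onto the SDK's client classes. The fragment below is a minimal sketch of that hierarchy; it assumes an already authenticated DataLakeServiceClient named serviceClient (see Authentication below), and the file system, directory, and file names are placeholders.

// Account-level operations go through the DataLakeServiceClient (assumed to exist here).
DataLakeFileSystemClient fileSystemClient = serviceClient.getFileSystemClient("myfilesystem");

// Directory-level operations within that file system.
DataLakeDirectoryClient directoryClient = fileSystemClient.getDirectoryClient("raw/sales");

// File-level operations within that directory.
DataLakeFileClient fileClient = directoryClient.getFileClient("2024-01.csv");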

Authentication

You can authenticate with ADLS Gen2 using several methods:

  • Connection String: A simple way to connect using a pre-defined connection string.
  • Shared Key Access: Using your storage account's access keys.
  • Azure Identity: Recommended for production environments, using managed identities or service principals for secure authentication.

The SDK provides builders such as DataLakeServiceClientBuilder to configure these authentication methods; the examples in this document supply the account's DFS endpoint together with either a StorageSharedKeyCredential or a TokenCredential from the Azure Identity library.
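
As an illustration, here is a minimal sketch of the Azure Identity approach. It assumes the com.azure:azure-identity dependency is also on the classpath; the environment variable name and the file system name are placeholders.

import com.azure.identity.DefaultAzureCredentialBuilder;
import com.azure.storage.file.datalake.DataLakeFileSystemClient;
import com.azure.storage.file.datalake.DataLakeServiceClient;
import com.azure.storage.file.datalake.DataLakeServiceClientBuilder;

public class AzureIdentityAuthExample {
    public static void main(String[] args) {
        // Placeholder environment variable holding the storage account name.
        final String accountName = System.getenv("AZURE_STORAGE_ACCOUNT_NAME");

        // DefaultAzureCredential tries managed identities, environment credentials,
        // Azure CLI logins, and other sources in a fixed order.
        DataLakeServiceClient serviceClient = new DataLakeServiceClientBuilder()
            .endpoint("https://" + accountName + ".dfs.core.windows.net")
            .credential(new DefaultAzureCredentialBuilder().build())
            .buildClient();

        DataLakeFileSystemClient fileSystemClient = serviceClient.getFileSystemClient("myfilesystem");
        System.out.println("File system exists: " + fileSystemClient.exists());
    }
}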

Data Lake Service Client

The primary client for interacting with ADLS Gen2 is the DataLakeServiceClient, created with a DataLakeServiceClientBuilder. From it you obtain a file system client (DataLakeFileSystemClient) for a specific file system.

import com.azure.storage.common.StorageSharedKeyCredential;
import com.azure.storage.file.datalake.DataLakeFileSystemClient;
import com.azure.storage.file.datalake.DataLakeServiceClient;
import com.azure.storage.file.datalake.DataLakeServiceClientBuilder;
import com.azure.storage.file.datalake.models.DataLakeStorageException;

public class AdlsClientExample {
    public static void main(String[] args) {
        final String accountName = System.getenv("AZURE_STORAGE_ACCOUNT_NAME");
        final String accountKey = System.getenv("AZURE_STORAGE_ACCOUNT_KEY");
        final String fileSystemName = "myfilesystem";

        if (accountName == null || accountName.isEmpty() || accountKey == null || accountKey.isEmpty()) {
            System.err.println("Please set the AZURE_STORAGE_ACCOUNT_NAME and AZURE_STORAGE_ACCOUNT_KEY environment variables.");
            return;
        }

        // Shared key authentication: the account name plus one of its access keys.
        DataLakeServiceClient serviceClient = new DataLakeServiceClientBuilder()
            .endpoint("https://" + accountName + ".dfs.core.windows.net")
            .credential(new StorageSharedKeyCredential(accountName, accountKey))
            .buildClient();

        DataLakeFileSystemClient fileSystemClient = serviceClient.getFileSystemClient(fileSystemName);

        try {
            if (!fileSystemClient.exists()) {
                System.out.println("Creating file system: " + fileSystemName);
                fileSystemClient.create();
            } else {
                System.out.println("File system " + fileSystemName + " already exists.");
            }
        } catch (DataLakeStorageException e) {
            if (e.getStatusCode() == 409) {
                System.out.println("File system already exists.");
            } else {
                System.err.println("Error creating file system: " + e.getMessage());
                e.printStackTrace();
            }
        }
    }
}

Filesystem Operations

Manage your file systems using the DataLakeFileSystemClient.

Create File System

Creates a new file system.

fileSystemClient.create();

List File Systems

Retrieves a list of all file systems in the storage account.

for (var fileSystem : serviceClient.listFileSystems()) {
    System.out.println(fileSystem.getName());
}

Delete File System

Deletes an existing file system and all its contents.

fileSystemClient.delete();

Directory Operations

Organize your data using directories. Get a DataLakeDirectoryClient from the file system client.

Create Directory

Creates a new directory within a file system.

final String directoryName = "my/nested/directory";
DataLakeDirectoryClient directoryClient = fileSystemClient.getDirectoryClient(directoryName);
directoryClient.create();

List Directories

Lists directories within a specified path.

ListPathsOptions options = new ListPathsOptions().setPath("my/nested");
for (var path : fileSystemClient.listPaths(options, null)) {
    if (path.isDirectory()) {
        System.out.println(path.getName());
    }
}

Delete Directory

Deletes a directory and all its contents. Use recursive=true to delete non-empty directories.

// Delete an empty directory
directoryClient.delete();

// Delete a non-empty directory and all its contents (recursive = true)
fileSystemClient.deleteDirectoryWithResponse("my/directory", true, null, null, Context.NONE);

File Operations

Handle file uploads, downloads, and reads. Get a DataLakeFileClient from the file system client.

Upload File

Uploads a local file to ADLS Gen2.

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// ... within a method ...
final String fileName = "mydata.txt";      // destination path in ADLS Gen2
final String filePath = "upload/data.txt"; // local file to upload
File localFile = new File(filePath);

try (InputStream fileStream = new BufferedInputStream(new FileInputStream(localFile))) {
    DataLakeFileClient fileClient = fileSystemClient.getFileClient(fileName);
    // Overwrite the destination file if it already exists.
    fileClient.upload(fileStream, localFile.length(), true);
    System.out.println("Successfully uploaded " + fileName);
} catch (IOException e) {
    e.printStackTrace();
}

Download File

Downloads a file from ADLS Gen2 to a local path.

import java.io.FileOutputStream;
import java.io.IOException;

// ... within a method ...
final String downloadPath = "download/data.txt";
final String remoteFileName = "mydata.txt";

try (FileOutputStream outputStream = new FileOutputStream(downloadPath)) {
    DataLakeFileClient fileClient = fileSystemClient.getFileClient(remoteFileName);
    fileClient.read(outputStream);
    System.out.println("Successfully downloaded " + remoteFileName);
} catch (IOException e) {
    e.printStackTrace();
}

Read File Content

Reads the content of a file directly into a string or stream.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// ... within a method ...
DataLakeFileClient fileClient = fileSystemClient.getFileClient("path/to/your/file.txt");

// Read as a String
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
fileClient.read(outputStream);
String fileContent = new String(outputStream.toByteArray(), StandardCharsets.UTF_8);
System.out.println("File content: " + fileContent);

// Read through an InputStream
try (InputStream inputStream = fileClient.openInputStream().getInputStream()) {
    // Process the inputStream
} catch (IOException e) {
    e.printStackTrace();
}

Delete File

Deletes a file from ADLS Gen2.

fileSystemClient.deleteFile("path/to/your/file.txt");

Advanced Features

  • Append and Flush: For large files, upload data in chunks by appending it and then flushing (committing) it, as shown in the sketch after this list.
  • Access Control Lists (ACLs): Manage POSIX-style permissions for files and directories.
  • Leasing: Acquire a lease to obtain exclusive write access to a file or file system.
  • Metadata: Attach custom key-value metadata to files and directories.
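
The following is a minimal sketch of a chunked upload with append and flush, with a metadata call at the end. It assumes an existing DataLakeFileSystemClient named fileSystemClient; the local and remote paths, chunk size, and metadata values are placeholders.

import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Collections;

// ... within a method ...
final String remotePath = "big/large-data.bin"; // placeholder destination path
final String localPath = "large-data.bin";      // placeholder local file
final int chunkSize = 4 * 1024 * 1024;          // 4 MiB per append

// Create (or overwrite) the destination file, then stage its data in chunks.
DataLakeFileClient fileClient = fileSystemClient.createFile(remotePath, true);
long position = 0;

try (InputStream in = new BufferedInputStream(new FileInputStream(localPath))) {
    byte[] buffer = new byte[chunkSize];
    int bytesRead;
    while ((bytesRead = in.read(buffer)) != -1) {
        // Stage this chunk at the current offset; staged data is not visible until flushed.
        fileClient.append(new ByteArrayInputStream(buffer, 0, bytesRead), position, bytesRead);
        position += bytesRead;
    }
    // Commit everything staged so far.
    fileClient.flush(position);
} catch (IOException e) {
    e.printStackTrace();
}

// Attach custom metadata to the file (placeholder key and value).
fileClient.setMetadata(Collections.singletonMap("source", "example"));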

Code Examples

Explore more detailed code examples in the official Azure SDK for Java GitHub repository (https://github.com/Azure/azure-sdk-for-java), under sdk/storage/azure-storage-file-datalake.

API Reference

For a complete list of classes, methods, and their parameters, please refer to the Microsoft Learn API documentation.