Spark Notebook ETL Tutorial
Learn how to perform Extract, Transform, Load (ETL) operations using Apache Spark notebooks in Azure Synapse Analytics. This tutorial guides you through the entire process, from setting up your environment to loading transformed data.
Introduction
This tutorial walks you through a common ETL scenario in Azure Synapse Analytics. We'll use Apache Spark within Synapse notebooks to process data efficiently. This guide assumes basic familiarity with Spark and Python.
Note: Ensure you have an Azure subscription and an Azure Synapse workspace ready. If not, follow the setup guide first.
Step 1: Set up your environment
Before starting the ETL process, make sure your development environment is configured correctly. This involves setting up a Spark pool in your Synapse workspace and creating a new notebook attached to it.
Creating a Spark Pool
1. Navigate to your Azure Synapse workspace in the Azure portal.
2. Under "Analytics pools", click on "+ New".
3. Select "Apache Spark pool".
4. Configure the node size, scale, and auto-pausing settings as per your requirements.
5. Click "Review + create" and then "Create".
Creating a New Notebook
1. Open your Synapse workspace and navigate to the "Develop" hub.
2. Click the "+" sign and select "Notebook".
3. Attach the notebook to the Spark pool you just created.
4. Choose your preferred language (Python is recommended for this tutorial).
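Once the notebook is attached, you can run a quick sanity-check cell to confirm the session starts on your pool. A minimal sketch; in Synapse notebooks the spark session variable is created for you, so no imports are needed:
# Quick sanity check: the first cell you run starts the Spark session on the attached pool.
# In Synapse notebooks, `spark` is pre-defined.
print(f"Spark version: {spark.version}")
print(f"Application name: {spark.sparkContext.appName}")
print(f"Default parallelism: {spark.sparkContext.defaultParallelism}")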
Step 2: Ingest Data
The first phase of ETL is Extract. We'll ingest data from a source into our Synapse workspace. For this tutorial, we'll assume data is available in Azure Data Lake Storage Gen2.
Connecting to Data Lake Storage
You can access the data with Spark's built-in Azure Data Lake Storage Gen2 (abfss://) support or through a Synapse linked service. Here's an example using PySpark to read a CSV file:
from pyspark.sql import SparkSession
# Initialize the Spark session (created automatically as `spark` in Synapse notebooks)
spark = SparkSession.builder.appName("ETLTutorialIngest").getOrCreate()
# Define the path to your data in ADLS Gen2
# Replace the storage account, container, and file path with your own values
data_path = "abfss://your-container@your-storage-account.dfs.core.windows.net/path/to/your/data.csv"
# Read the CSV file into a Spark DataFrame
try:
    df_raw = spark.read.csv(data_path, header=True, inferSchema=True)
    print("Successfully ingested data. Schema:")
    df_raw.printSchema()
    df_raw.show(5)
except Exception as e:
    print(f"Error ingesting data: {e}")
This code snippet demonstrates reading a CSV file. You can adapt it for other formats like Parquet or JSON.
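For example, if the same data were stored as Parquet or JSON, the read calls would look like the sketch below. The paths are placeholders for illustration, not files created earlier in this tutorial:
# Reading other formats from ADLS Gen2 (placeholder paths; adjust to your storage layout)
parquet_path = "abfss://your-container@your-storage-account.dfs.core.windows.net/path/to/your/data.parquet"
json_path = "abfss://your-container@your-storage-account.dfs.core.windows.net/path/to/your/data.json"
df_parquet = spark.read.parquet(parquet_path)  # Parquet stores the schema, so no inferSchema is needed
df_json = spark.read.json(json_path)           # Spark infers the JSON schema by default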
Step 3: Explore and Transform Data
This is the core of the ETL process: Transform. We'll clean, enrich, and restructure the data to meet our analytical needs.
Data Cleaning and Filtering
Let's assume our raw data has missing values or incorrect data types that need to be handled.
# Example: Handle missing values
# Replace 'column_name' with actual column names from your data
df_cleaned = df_raw.na.fill({"column_name_with_nulls": 0})
# Example: Filter out unwanted rows
df_filtered = df_cleaned.filter(df_cleaned["relevant_column"] > 100)
# Example: Cast column to a different type
# from pyspark.sql.functions import col
# df_transformed = df_filtered.withColumn("column_to_cast", col("column_to_cast").cast("integer"))
print("Data after cleaning and filtering:")
df_filtered.show(5)
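Before deciding which values to fill or which rows to drop, it can help to count the nulls in each column. A small sketch using standard PySpark functions; the column names come from your own data:
from pyspark.sql.functions import col, count, when
# Count null values per column to decide which ones need filling or dropping
null_counts = df_raw.select(
    [count(when(col(c).isNull(), c)).alias(c) for c in df_raw.columns]
)
null_counts.show()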
Data Enrichment
Sometimes, you might want to join your data with other sources or add calculated fields.
# Example: Adding a new calculated column
# from pyspark.sql.functions import expr
# df_enriched = df_filtered.withColumn("new_calculated_column", expr("column_a * column_b"))
# Example: Joining with another DataFrame (assuming df_lookup is loaded)
# df_final = df_enriched.join(df_lookup, df_enriched["join_key"] == df_lookup["lookup_key"], "left")
# For this tutorial, we'll just show the filtered data
df_final = df_filtered
df_final.show(5)
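If you do need an enrichment join, the pattern looks like the sketch below. The lookup DataFrame and the column names (join_key, column_a, column_b) are hypothetical; in a real pipeline you would read the lookup data from storage just like the raw data:
from pyspark.sql.functions import expr
# Hypothetical lookup data; in practice this would come from another file or table
df_lookup = spark.createDataFrame(
    [(1, "North"), (2, "South"), (3, "West")],
    ["lookup_key", "region_name"],
)
# Left join on a hypothetical key column, then add a calculated column
df_enriched_example = (
    df_filtered
    .join(df_lookup, df_filtered["join_key"] == df_lookup["lookup_key"], "left")
    .withColumn("new_calculated_column", expr("column_a * column_b"))
)
df_enriched_example.show(5)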
Step 4: Load Transformed Data
The final phase of ETL is Load. We'll save our transformed data to a desired destination, such as Azure Data Lake Storage Gen2 in a different format (e.g., Parquet) or a Synapse SQL pool.
Loading to ADLS Gen2 (Parquet Format)
Parquet is often preferred for analytical workloads due to its columnar storage and compression capabilities.
# Define the output path
output_path = "abfss://your-container@your-storage-account.dfs.core.windows.net/path/to/output/transformed_data.parquet"
# Write the DataFrame to Parquet format
try:
    df_final.write.mode("overwrite").parquet(output_path)
    print(f"Successfully loaded transformed data to {output_path}")
except Exception as e:
    print(f"Error loading data: {e}")
Loading to Synapse SQL Pool (Optional)
You can also load data directly into a dedicated SQL pool for traditional data warehousing scenarios. This typically involves the Azure Synapse Dedicated SQL Pool Connector for Spark or a COPY statement, and may require additional permissions and setup.
# Example using the Azure Synapse Dedicated SQL Pool Connector for Spark (requires a dedicated SQL pool)
# Note: This is a simplified representation. The actual call needs the connector's write options
# and appropriate permissions; see the connector documentation before using it.
# df_final.write.synapsesql("your_sql_pool_name.your_schema.your_table", "append")
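If your downstream consumers can query a Spark (lake) database instead of a dedicated SQL pool, a simpler alternative is to register the result as a managed Spark table using standard Spark APIs. The database and table names below are hypothetical:
# Create (or reuse) a database and save the result as a managed Spark table.
# "etl_tutorial" and "transformed_data" are hypothetical names.
spark.sql("CREATE DATABASE IF NOT EXISTS etl_tutorial")
df_final.write.mode("overwrite").saveAsTable("etl_tutorial.transformed_data")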
Conclusion
Congratulations! You have successfully completed the Spark Notebook ETL tutorial in Azure Synapse Analytics. You've learned how to ingest data, perform transformations using Spark, and load the results. This foundational knowledge can be extended to more complex data pipelines and advanced analytics scenarios.
Explore other Synapse features like pipelines for orchestration, data flows for low-code transformations, and integrations with other Azure services.