Azure Synapse Analytics Pipelines
This document provides a comprehensive reference for Azure Synapse Analytics Pipelines, including activities, triggers, datasets, linked services, and best practices for building data integration and orchestration solutions within Azure Synapse Analytics.
Overview
Azure Synapse Analytics pipelines are logical groupings of activities that together perform a task. Pipelines are used to automate processes, orchestrate data movement, and transform data. They offer a powerful way to build complex data workflows in the cloud.
Key Concepts
- Activities: The individual processing steps within a pipeline (e.g., Copy Data, Execute SQL, Azure Function).
- Triggers: Define when a pipeline execution should occur (e.g., schedule-based, event-based, manual).
- Datasets: Represent the data structures within the data stores, which pipelines consume as inputs and produce as outputs.
- Linked Services: Define the connection information needed for Synapse to connect to external resources.
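These four concepts come together in a pipeline's JSON definition: a pipeline contains activities, activities reference datasets, and each dataset points at a linked service that holds the connection details. A minimal sketch (all names such as CopyDailySales and SalesBlobDataset are hypothetical):

```json
{
  "name": "CopyDailySales",
  "properties": {
    "activities": [
      {
        "name": "CopySalesData",
        "type": "Copy",
        "inputs": [ { "referenceName": "SalesBlobDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "SalesSqlDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "SqlDWSink" }
        }
      }
    ]
  }
}
```

Because credentials live in the linked services rather than in the pipeline itself, the same pipeline definition can be promoted across environments unchanged.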
Pipeline Activities
Activities in Synapse pipelines fall into three broad categories:
- Data Movement Activities
- Data Transformation Activities
- Control Flow Activities
Copy Data Activity
The Copy Data activity is used to copy data from a source data store to a sink data store. It supports a wide range of connectors and data formats.
Properties
- Source: Configuration for the source data store.
- Sink: Configuration for the sink data store.
- Parallelism: Controls the degree of parallelism used when copying data (the `parallelCopies` setting).
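As a sketch, the three properties map onto the activity's JSON roughly as follows (dataset and activity names are hypothetical):

```json
{
  "name": "CopyFromBlobToSqlPool",
  "type": "Copy",
  "inputs": [ { "referenceName": "InputBlobDataset", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "OutputSqlDataset", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": { "type": "DelimitedTextSource" },
    "sink": { "type": "SqlDWSink" },
    "parallelCopies": 8
  }
}
```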
Execute SQL Script Activity
This activity executes a SQL script against a relational database.
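In the JSON definition this is typically expressed with the Script activity type. A sketch, assuming a linked service named AzureSqlLinkedService:

```json
{
  "name": "RunCleanupScript",
  "type": "Script",
  "linkedServiceName": { "referenceName": "AzureSqlLinkedService", "type": "LinkedServiceReference" },
  "typeProperties": {
    "scripts": [
      { "type": "NonQuery", "text": "TRUNCATE TABLE staging.DailySales;" }
    ]
  }
}
```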
Databricks Notebook Activity
Allows you to execute a Databricks notebook as part of your pipeline.
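A sketch of the activity JSON; the notebook path, parameter name, and linked service are hypothetical:

```json
{
  "name": "RunTransformNotebook",
  "type": "DatabricksNotebook",
  "linkedServiceName": { "referenceName": "AzureDatabricksLinkedService", "type": "LinkedServiceReference" },
  "typeProperties": {
    "notebookPath": "/Shared/transform-sales",
    "baseParameters": { "runDate": "@pipeline().parameters.runDate" }
  }
}
```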
Get Metadata Activity
Retrieves metadata from a data store, such as file names, sizes, and last modified dates.
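A sketch of the JSON, assuming a folder dataset named InputFolderDataset:

```json
{
  "name": "GetFolderMetadata",
  "type": "GetMetadata",
  "typeProperties": {
    "dataset": { "referenceName": "InputFolderDataset", "type": "DatasetReference" },
    "fieldList": [ "childItems", "lastModified" ]
  }
}
```

Downstream activities can read the results with an expression such as @activity('GetFolderMetadata').output.childItems.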
Delete Activity
Deletes files or folders from a data store.
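A sketch, assuming a dataset named StagingFolderDataset that points at the folder to clean up:

```json
{
  "name": "DeleteStagedFiles",
  "type": "Delete",
  "typeProperties": {
    "dataset": { "referenceName": "StagingFolderDataset", "type": "DatasetReference" },
    "recursive": true
  }
}
```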
Stored Procedure Activity
Executes a stored procedure in a data store.
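A sketch for a SQL-based store; the procedure, parameter, and linked-service names are hypothetical:

```json
{
  "name": "UpsertDimCustomer",
  "type": "SqlServerStoredProcedure",
  "linkedServiceName": { "referenceName": "AzureSqlLinkedService", "type": "LinkedServiceReference" },
  "typeProperties": {
    "storedProcedureName": "dbo.UpsertDimCustomer",
    "storedProcedureParameters": {
      "LoadDate": { "value": "@pipeline().TriggerTime", "type": "DateTime" }
    }
  }
}
```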
Azure Function Activity
Executes an Azure Function as a custom activity.
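A sketch; the function name and linked service are hypothetical, and the body here passes pipeline system variables to the function:

```json
{
  "name": "NotifyOnLoad",
  "type": "AzureFunctionActivity",
  "linkedServiceName": { "referenceName": "AzureFunctionLinkedService", "type": "LinkedServiceReference" },
  "typeProperties": {
    "functionName": "NotifyLoadComplete",
    "method": "POST",
    "body": { "pipeline": "@pipeline().Pipeline", "runId": "@pipeline().RunId" }
  }
}
```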
Pipeline Triggers
Schedule Trigger
Runs a pipeline at a specified time interval.
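A sketch of a daily trigger; the trigger and pipeline names are hypothetical:

```json
{
  "name": "DailyTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2024-01-01T02:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      { "pipelineReference": { "referenceName": "CopyDailySales", "type": "PipelineReference" } }
    ]
  }
}
```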
Event Trigger
Triggers a pipeline based on an event, such as a file arriving in Blob Storage.
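A sketch of a blob-created trigger; the paths are hypothetical and the scope placeholders must be replaced with a real storage account resource ID:

```json
{
  "name": "OnFileArrival",
  "properties": {
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "blobPathBeginsWith": "/landing/blobs/sales/",
      "blobPathEndsWith": ".csv",
      "events": [ "Microsoft.Storage.BlobCreated" ],
      "scope": "/subscriptions/<subscription-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"
    },
    "pipelines": [
      { "pipelineReference": { "referenceName": "CopyDailySales", "type": "PipelineReference" } }
    ]
  }
}
```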
Tumbling Window Trigger
A time-windowed trigger that processes data in discrete, non-overlapping intervals.
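Unlike a schedule trigger, a tumbling window trigger attaches to a single pipeline and exposes the window boundaries to it. A sketch (trigger, pipeline, and parameter names hypothetical):

```json
{
  "name": "HourlyWindowTrigger",
  "properties": {
    "type": "TumblingWindowTrigger",
    "typeProperties": {
      "frequency": "Hour",
      "interval": 1,
      "startTime": "2024-01-01T00:00:00Z",
      "maxConcurrency": 2
    },
    "pipeline": {
      "pipelineReference": { "referenceName": "LoadHourlySlice", "type": "PipelineReference" },
      "parameters": {
        "windowStart": "@trigger().outputs.windowStartTime",
        "windowEnd": "@trigger().outputs.windowEndTime"
      }
    }
  }
}
```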
Control Flow Activities
For Each Activity
Iterates over a collection of items and executes a set of activities for each item.
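A sketch of the JSON shape; a Wait activity stands in for the real per-item work, and the items expression assumes a hypothetical upstream Get Metadata activity:

```json
{
  "name": "ForEachFile",
  "type": "ForEach",
  "typeProperties": {
    "items": {
      "value": "@activity('GetFolderMetadata').output.childItems",
      "type": "Expression"
    },
    "isSequential": false,
    "batchCount": 10,
    "activities": [
      { "name": "ProcessOneItem", "type": "Wait", "typeProperties": { "waitTimeInSeconds": 1 } }
    ]
  }
}
```

Inside the loop, the current element is available through the @item() expression.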
If Condition Activity
Executes a set of activities based on a specified condition.
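A sketch of the JSON shape, with a Wait placeholder standing in for the real branch activities (activity names hypothetical):

```json
{
  "name": "IfFilesExist",
  "type": "IfCondition",
  "typeProperties": {
    "expression": {
      "value": "@greater(length(activity('GetFolderMetadata').output.childItems), 0)",
      "type": "Expression"
    },
    "ifTrueActivities": [
      { "name": "ProceedWithLoad", "type": "Wait", "typeProperties": { "waitTimeInSeconds": 1 } }
    ],
    "ifFalseActivities": []
  }
}
```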
Wait Activity
Pauses the execution of a pipeline for a specified duration.
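The definition is minimal; this sketch pauses for five minutes:

```json
{
  "name": "PauseBeforeRetry",
  "type": "Wait",
  "typeProperties": { "waitTimeInSeconds": 300 }
}
```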
Execute Pipeline Activity
Allows one pipeline to call another pipeline, enabling pipeline composition.
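A sketch; the child pipeline and parameter names are hypothetical, and waitOnCompletion makes the parent block until the child finishes:

```json
{
  "name": "RunChildLoad",
  "type": "ExecutePipeline",
  "typeProperties": {
    "pipeline": { "referenceName": "LoadDimensionTables", "type": "PipelineReference" },
    "parameters": { "runDate": "@pipeline().parameters.runDate" },
    "waitOnCompletion": true
  }
}
```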
Datasets
Datasets represent the data within the linked data stores. They specify the data format, location, and schema.
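A sketch of a delimited-text dataset over blob storage; the dataset, linked service, container, and folder names are hypothetical:

```json
{
  "name": "SalesBlobDataset",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": { "referenceName": "BlobStorageLinkedService", "type": "LinkedServiceReference" },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "landing",
        "folderPath": "sales"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```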
Linked Services
Linked services define the connection information to external resources such as databases, file systems, and cloud services.
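A sketch of a blob storage linked service; the names are hypothetical, and the connection string is pulled from Azure Key Vault rather than stored in the definition:

```json
{
  "name": "BlobStorageLinkedService",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": {
        "type": "AzureKeyVaultSecret",
        "store": { "referenceName": "KeyVaultLinkedService", "type": "LinkedServiceReference" },
        "secretName": "blob-connection-string"
      }
    }
  }
}
```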
Best Practices
- Organize your pipelines logically.
- Use descriptive names for all components.
- Implement robust error handling and logging.
- Parameterize pipelines for reusability.
- Monitor pipeline executions regularly.
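As an illustration of the parameterization point, a pipeline can declare parameters with defaults and reference them anywhere expressions are allowed; a sketch with hypothetical names:

```json
{
  "name": "LoadSalesForDate",
  "properties": {
    "parameters": {
      "runDate": { "type": "string", "defaultValue": "2024-01-01" }
    },
    "activities": [
      {
        "name": "LoadSales",
        "type": "SqlServerStoredProcedure",
        "linkedServiceName": { "referenceName": "AzureSqlLinkedService", "type": "LinkedServiceReference" },
        "typeProperties": {
          "storedProcedureName": "dbo.LoadSales",
          "storedProcedureParameters": {
            "RunDate": { "value": "@pipeline().parameters.runDate", "type": "String" }
          }
        }
      }
    ]
  }
}
```

The same pipeline can then be invoked with different runDate values from triggers or from an Execute Pipeline activity.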