Connections

Airflow Connections allow you to manage hostnames, ports, logins, passwords, and other parameters for external services that Airflow needs to interact with.

Connections are a fundamental part of how Airflow interacts with external systems, such as databases, cloud services, messaging queues, and more. Instead of hardcoding credentials and host information directly into your DAGs, you define these details as connections within Airflow. This approach enhances security, maintainability, and reusability.

What is a Connection?

A connection in Airflow is a named entry that stores configuration details for a specific external resource. Each connection has a unique identifier (often called a `conn_id`) and contains attributes like:

- Connection Type (e.g., `postgres`, `http`, `aws`)
- Host
- Schema (often the database name)
- Login
- Password
- Port
- Extra (a JSON field for additional, service-specific parameters)
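
For illustration, these attributes can also be read programmatically. The following is a minimal sketch, assuming a connection with the ID `my_postgres_db` has already been defined:

from airflow.hooks.base import BaseHook

# Look up a connection by its conn_id (assumes "my_postgres_db" exists).
conn = BaseHook.get_connection("my_postgres_db")

# Standard attributes are exposed as plain properties.
print(conn.host, conn.port, conn.login, conn.schema)

# The 'extra' field stores JSON and can be read as a Python dict.
print(conn.extra_dejson)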

Managing Connections

Connections can be managed through the Airflow UI or via the command-line interface (CLI). They can also be defined through environment variables (using the `AIRFLOW_CONN_<CONN_ID>` naming convention) or loaded from a local file such as `connections.yaml` via the local filesystem secrets backend.
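
For example, a connection can be supplied through an environment variable named after the connection ID in uppercase. This is a minimal sketch with placeholder values:

export AIRFLOW_CONN_MY_POSTGRES_DB='postgresql://user:password@host:5432/database'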

Via the Airflow UI

Navigate to the 'Admin' -> 'Connections' section in the Airflow UI. From there, you can:

- List existing connections
- Add a new connection
- Edit an existing connection's details
- Delete connections that are no longer needed

Note: Passwords and sensitive information should be handled with care. Consider using secrets backends for more robust security.

Via the Airflow CLI

The Airflow CLI provides commands to manage connections, including:

- `airflow connections list` -- list connections stored in the metadata database
- `airflow connections add` -- add a new connection
- `airflow connections delete` -- delete a connection
- `airflow connections get` -- show the details of a single connection
- `airflow connections export` -- export connections to a file

Example CLI command to add a PostgreSQL connection:

airflow connections add my_postgres_db \
    --conn-uri 'postgresql://user:password@host:port/database'
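
Once added, the connection can be inspected or removed with the same CLI, for example:

airflow connections get my_postgres_db
airflow connections delete my_postgres_db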

Using Connections in DAGs

In your DAGs, you reference connections by their `conn_id`. Most operators and hooks in Airflow will have a `conn_id` parameter where you can specify the connection to use.

For example, using the PostgreSQL operator:

from __future__ import annotations

import pendulum

from airflow.models.dag import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="example_postgres_connection",
    schedule=None,
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
    tags=["example", "postgres"],
) as dag:
    run_this_first = PostgresOperator(
        task_id="create_table",
        postgres_conn_id="my_postgres_db",  # Reference to your connection ID
        sql="CREATE TABLE IF NOT EXISTS my_table (id SERIAL PRIMARY KEY, name VARCHAR(255));",
    )

    run_this_second = PostgresOperator(
        task_id="insert_data",
        postgres_conn_id="my_postgres_db",
        sql="INSERT INTO my_table (name) VALUES ('Airflow User');",
    )

    run_this_first >> run_this_second
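
Hooks reference connections in the same way. The snippet below is a sketch using `PostgresHook` inside a TaskFlow task; it assumes the `apache-airflow-providers-postgres` package is installed and reuses the hypothetical `my_postgres_db` connection:

from airflow.decorators import task
from airflow.providers.postgres.hooks.postgres import PostgresHook

@task
def count_rows():
    # The hook resolves host, credentials, and schema from the connection.
    hook = PostgresHook(postgres_conn_id="my_postgres_db")
    rows = hook.get_records("SELECT COUNT(*) FROM my_table;")
    print(f"my_table currently has {rows[0][0]} rows")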

Connection URI Format

Connections can often be represented using a standard URI format:

scheme://login:password@host:port/database?extra_param1=value1&extra_param2=value2

For example, a PostgreSQL connection might be written as:

postgresql://airflow_user:airflow_pass@db.example.com:5432/airflow_db

The 'extra' field can be used to store JSON data for service-specific parameters that don't fit into the standard URI components. For instance, for an HTTP connection, you might store:

{"use_proxy": "True", "proxy_url": "http://proxy.example.com:8080"}

Secrets Management

For enhanced security, Airflow integrates with external secrets management systems like HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager, and Azure Key Vault. Instead of storing secrets directly in Airflow Connections, you can configure Airflow to retrieve them dynamically from these services. This is highly recommended for production environments.

Tip: When a secrets backend is configured, Airflow looks up a connection by its `conn_id` in the backend first, before checking environment variables and the metadata database, so sensitive values do not have to be stored in the Airflow database at all.
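
As an illustration, a minimal `airflow.cfg` sketch for the HashiCorp Vault secrets backend might look like this (the URL, mount point, and connections path are placeholders):

[secrets]
backend = airflow.providers.hashicorp.secrets.vault.VaultBackend
backend_kwargs = {"connections_path": "connections", "mount_point": "airflow", "url": "http://127.0.0.1:8200"}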

Connection Types

Airflow supports a wide range of built-in connection types, and you can define custom ones or use types provided by Airflow providers. Common types include:

| Connection Type             | Description                        |
|-----------------------------|------------------------------------|
| Amazon Web Services (AWS)   | For connecting to AWS services.    |
| Azure Resource Manager (ARM)| For connecting to Azure services.  |
| Google Cloud Platform (GCP) | For connecting to GCP services.    |
| PostgreSQL                  | For PostgreSQL databases.          |
| MySQL                       | For MySQL databases.               |
| HTTP                        | For generic HTTP endpoints.        |
| SFTP                        | For Secure File Transfer Protocol. |
| Kafka                       | For Apache Kafka.                  |
| RabbitMQ                    | For RabbitMQ message broker.       |
| Docker                      | For interacting with Docker.       |

The exact list of available connection types depends on your Airflow installation and installed providers.
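
Additional connection types are typically added by installing provider packages; you can check what is available in your environment, for example:

pip install apache-airflow-providers-postgres
airflow providers list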

Conclusion

Mastering Airflow Connections is crucial for building robust and secure data pipelines. By centralizing external service configurations, you ensure consistency, simplify management, and improve the overall reliability of your Airflow workflows.