Reference Documentation

This section provides a comprehensive reference for Apache Airflow, covering its core components, functionalities, and best practices. Navigate through the sidebar to find detailed information on specific topics.

Introduction

Apache Airflow is an open-source platform to programmatically author, schedule, and monitor workflows. Airflow's architecture is designed to be flexible and scalable, allowing you to manage complex data pipelines with ease.

Key features include:

  • Dynamic Pipeline Generation: DAGs are defined in Python, allowing for dynamic generation.
  • Rich User Interface: A user-friendly web interface for monitoring and managing workflows.
  • Extensibility: A plugin architecture for adding custom operators, hooks, and more.
  • Scalability: Supports various executors to scale from a single machine to a distributed cluster.

Core Concepts

Directed Acyclic Graphs (DAGs)

A DAG is a collection of tasks that you want to run, organized in a way that reflects their relationships and dependencies. Tasks are the actual units of work in Airflow, and DAGs define the order and dependencies between them.

Example DAG structure:


from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(
    dag_id='my_first_dag',
    start_date=datetime(2023, 1, 1),
    schedule_interval='@daily',
    catchup=False
) as dag:
    task1 = BashOperator(
        task_id='print_date',
        bash_command='date'
    )

    task2 = BashOperator(
        task_id='sleep',
        bash_command='sleep 5'
    )

    task1 >> task2

Tasks and Operators

A Task is a unit of work within a DAG. An Operator is a template for a type of task. Airflow provides a rich set of built-in operators, and you can create custom operators to fit your needs.
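
As a sketch of the pattern (the callable and operator names here are illustrative, not part of Airflow), a PythonOperator wraps a Python function, while a custom operator subclasses BaseOperator and does its work in execute():

from airflow.models.baseoperator import BaseOperator
from airflow.operators.python import PythonOperator


def print_run_date(**context):
    # Airflow passes the task context (logical date, task instance, ...) as keyword arguments.
    print(f"Running for {context['ds']}")


class GreetOperator(BaseOperator):
    """Minimal custom operator: all of the work happens in execute()."""

    def __init__(self, name: str, **kwargs):
        super().__init__(**kwargs)
        self.name = name

    def execute(self, context):
        self.log.info("Hello, %s", self.name)

# Inside a DAG context these become tasks, e.g.:
# PythonOperator(task_id='print_run_date', python_callable=print_run_date)
# GreetOperator(task_id='greet', name='Airflow')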

Executors

Executors define how tasks are run. Common executors include:

  • SequentialExecutor: Runs tasks one after another (useful for debugging).
  • LocalExecutor: Runs tasks in parallel on the same machine.
  • CeleryExecutor: Distributes tasks to Celery workers.
  • KubernetesExecutor: Runs tasks as Kubernetes pods.
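
The executor is selected in the [core] section of airflow.cfg (or via the matching environment variable); a minimal sketch:

[core]
# SQLite only supports SequentialExecutor; LocalExecutor and the distributed
# executors need a MySQL or PostgreSQL metadata database.
executor = LocalExecutor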

Connections

Connections store information about how to interact with external services and databases. They are managed via the Airflow UI or the CLI.
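
For example, a connection can be created from the CLI (the connection id and URI below are placeholders):

airflow connections add 'my_postgres' \
    --conn-uri 'postgresql://user:password@localhost:5432/mydb'

Connections can also be supplied as environment variables of the form AIRFLOW_CONN_{CONN_ID}.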

Variables

Variables are key-value pairs that can be used to store configuration settings or dynamic information that can be accessed within DAGs.
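
For example, a variable set from the CLI (airflow variables set environment prod) can be read inside a DAG file; the key name is only illustrative:

from airflow.models import Variable

# default_var avoids a hard failure while the key has not been set yet
environment = Variable.get("environment", default_var="dev")

Note that Variable.get at the top level of a DAG file runs on every scheduler parse, so it is usually better to resolve variables inside tasks or through templates.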

Operators and Hooks

Operators define the actual work that tasks perform. Hooks provide an interface to external platforms and databases.

Common Operators and Hooks:

Operator / Hook     Description
BashOperator        Executes a bash command.
PythonOperator      Executes a Python callable.
PostgresOperator    Executes a SQL command on a PostgreSQL database.
S3Hook              A hook (rather than an operator) for interacting with Amazon S3.

For a full list of operators and hooks, please refer to the Operators and Hooks documentation.
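
As a sketch of how a hook is typically used from inside a task (this assumes the Amazon provider is installed, an aws_default connection exists, and the bucket name is a placeholder):

from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def upload_report(**context):
    # The hook resolves credentials from the Airflow connection.
    s3 = S3Hook(aws_conn_id="aws_default")
    s3.load_string(
        string_data="hello from airflow",
        key=f"reports/{context['ds']}.txt",
        bucket_name="my-example-bucket",  # placeholder
        replace=True,
    )

# Wired into a DAG as:
# PythonOperator(task_id='upload_report', python_callable=upload_report)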

Authoring DAGs

DAGs are written in Python. They define the tasks, their dependencies, schedule, and other parameters.

Key Parameters:

  • dag_id: A unique identifier for the DAG.
  • start_date: The date from which the DAG should start running.
  • schedule_interval: How often the DAG should run (a cron expression, a preset such as @daily, or a timedelta).
  • catchup: Whether to run missed DAG runs since the start_date.
  • default_args: A dictionary of default arguments for all tasks in the DAG.
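
Putting these parameters together, a common pattern is to centralize retry behaviour and ownership in default_args (all values below are illustrative):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-team",
    "retries": 2,                        # retry each failed task twice
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="nightly_cleanup",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    cleanup = BashOperator(task_id="cleanup", bash_command="echo 'cleanup placeholder'")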

Configuration

Airflow's behavior is controlled by a configuration file (airflow.cfg) and environment variables. Key configuration sections include:

  • [core]: General Airflow settings.
  • [scheduler]: Scheduler-specific settings.
  • [webserver]: Webserver-specific settings.
  • [celery], [kubernetes_executor]: Executor-specific settings (the executor itself is selected with the executor option under [core]).
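
Any option can also be overridden with an environment variable of the form AIRFLOW__{SECTION}__{KEY}; for example:

export AIRFLOW__CORE__EXECUTOR=LocalExecutor
export AIRFLOW__WEBSERVER__WEB_SERVER_PORT=8081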

See the Configuration page for more details.

Installation and Setup

Airflow can be installed using pip. It's recommended to use a virtual environment.


pip install apache-airflow
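
A typical setup pins Airflow against the official constraint files from inside a virtual environment (the version numbers below are only illustrative):

python -m venv airflow-venv
source airflow-venv/bin/activate
pip install "apache-airflow==2.7.3" \
    --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.7.3/constraints-3.10.txt"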

After installation, initialize the database and start the webserver and scheduler:


airflow db init
airflow webserver --port 8080
airflow scheduler
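
In a default Airflow 2.x installation the web UI requires a login user, which can be created from the CLI (all values below are placeholders; you will be prompted for a password):

airflow users create \
    --username admin \
    --firstname Ada \
    --lastname Admin \
    --role Admin \
    --email admin@example.com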

For detailed installation instructions, please visit the Installation Guide.

Command-Line Interface (CLI)

The Airflow CLI is a powerful tool for managing your Airflow environment.

Common Commands:

  • airflow dags list: List all DAGs.
  • airflow dags list-runs --dag-id your_dag_id: List runs for a specific DAG.
  • airflow tasks list --dag-id your_dag_id: List tasks in a DAG.
  • airflow tasks test your_dag_id your_task_id execution_date: Run a single task for a given date without recording state (see the example after this list).
  • airflow db upgrade: Upgrade the database schema.

Explore all available commands with airflow --help or refer to the CLI Reference.
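
For example, using the DAG defined earlier, a single task can be exercised like this:

airflow tasks test my_first_dag print_date 2023-01-01

tasks test runs the task in-process and logs to stdout without recording state in the metadata database, which makes it convenient while iterating on a DAG.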

API Reference

Airflow exposes a REST API for programmatic interaction with the Airflow environment. This API allows you to trigger DAGs, check task statuses, and retrieve metadata.
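
As a sketch, assuming the stable REST API under /api/v1 is reachable and basic authentication is enabled (the credentials below are placeholders), a DAG run could be triggered like this:

curl -X POST "http://localhost:8080/api/v1/dags/my_first_dag/dagRuns" \
    -H "Content-Type: application/json" \
    --user "admin:admin-password" \
    -d '{"conf": {}}'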

Refer to the API Reference for detailed endpoint documentation.

Providers

Providers are packages that bundle integrations with external services. Instead of shipping every integration in the Airflow core package, these integrations are distributed as separate provider packages that can be installed independently.

Examples include:

  • apache-airflow-providers-amazon
  • apache-airflow-providers-google
  • apache-airflow-providers-microsoft-azure

You can find a list of available providers and how to install them in the Providers documentation.
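
For example, the Amazon provider is installed like any other Python package:

pip install apache-airflow-providers-amazon

Once installed, its classes, such as the S3Hook used in the hook example above, become importable from airflow.providers.amazon.aws.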

Web UI

The Airflow Web UI provides a visual interface for:

  • Viewing DAGs and their structure.
  • Monitoring DAG runs and task statuses.
  • Manually triggering DAG runs.
  • Managing Connections, Variables, and Pools.
  • Viewing logs.

Access the UI by running airflow webserver and navigating to http://localhost:8080 (default port).

Troubleshooting and Best Practices

This section offers guidance on common issues and recommended practices for using Airflow effectively.

  • Monitoring: Regularly monitor your DAGs and tasks for failures or long runtimes.
  • Idempotency: Design tasks to be idempotent so they can be retried safely.
  • Resource Management: Configure executors and workers appropriately to manage system resources.
  • Logging: Ensure proper logging is configured to aid in debugging.
  • Testing: Write comprehensive tests for your DAGs and custom operators (a minimal DAG import check is sketched after this list).
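
A common first test simply checks that every file in the DAGs folder parses without import errors; a minimal pytest-style sketch:

from airflow.models import DagBag


def test_dags_import_cleanly():
    # Parses all DAG files in the configured DAGs folder; example DAGs are excluded.
    dag_bag = DagBag(include_examples=False)
    assert dag_bag.import_errors == {}, f"DAG import errors: {dag_bag.import_errors}"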

More detailed information can be found in the Best Practices and FAQ sections.