Apache Airflow Documentation

Creating and Managing DAGs

This guide provides practical steps and best practices for creating, managing, and organizing Directed Acyclic Graphs (DAGs) in Apache Airflow.

What is a DAG?

A DAG is a collection of tasks you want to run, organized in a way that reflects their relationships and dependencies. Airflow's core functionality is to orchestrate DAGs.

Key Concepts

  • Tasks: The smallest unit of work in Airflow.
  • Operators: Templates for tasks. You instantiate an operator to create a task.
  • Dependencies: The relationships between tasks, defining the order of execution.
  • Schedule: When the DAG should run.
  • DAG Runs: An instance of a DAG running for a specific schedule interval.

Creating Your First DAG

DAGs are defined in Python files. Here's a simple example:


from __future__ import annotations

import pendulum

from airflow.models.dag import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="my_first_dag",
    schedule="@daily",
    start_date=pendulum.datetime(2023, 10, 26, tz="UTC"),
    catchup=False,
    tags=["example", "getting_started"],
) as dag:
    start_task = EmptyOperator(task_id="start")
    end_task = EmptyOperator(task_id="end")

    start_task >> end_task

Understanding DAG Arguments

  • dag_id: A unique identifier for your DAG.
  • schedule: How often the DAG should run (e.g., "@daily", "0 0 * * *", or a timedelta).
  • start_date: The logical date from which scheduling begins. The scheduler creates the first DAG run after the first interval following start_date has completed, so this value is crucial for scheduling.
  • catchup: If True, Airflow schedules runs for every missed interval between start_date and the present when the DAG is enabled. Defaults to True in Airflow 2 (via the catchup_by_default configuration option), which is why examples often set catchup=False explicitly.
  • tags: A list of strings to categorize your DAGs in the UI.
  • default_args: A dictionary of arguments to be applied to all tasks within the DAG.
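Task-level arguments take precedence over default_args. Conceptually, the merge behaves like a dictionary update, as in this illustrative sketch (the helper function and key names below are for illustration and are not Airflow internals):

```python
# Illustrative sketch of how DAG-level default_args combine with
# per-task arguments: task-level values override the defaults.
# (effective_task_args is a made-up helper, not an Airflow API.)

default_args = {
    "owner": "data-team",
    "retries": 3,
    "retry_delay_seconds": 300,
}

def effective_task_args(task_args: dict) -> dict:
    """Merge DAG defaults with task-level overrides (task wins)."""
    return {**default_args, **task_args}

# A task that overrides retries but inherits owner and retry delay:
args = effective_task_args({"task_id": "load", "retries": 1})
```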

Defining Task Dependencies

You can define dependencies using bitshift operators:

  • task1 >> task2: task2 runs after task1.
  • task1 << task2: task1 runs after task2.
  • [task1, task2] >> task3: task3 runs after both task1 and task2 complete.
  • task1 >> [task2, task3]: task2 and task3 both run after task1.
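This syntax works because Airflow operators overload Python's bitshift operators. A minimal toy model (not Airflow's actual classes) illustrates the mechanism:

```python
# Toy model of how >> and << set dependencies via Python's
# __rshift__/__lshift__/__rrshift__ operator overloading.
# This illustrates the mechanism only; it is not Airflow's implementation.

class Task:
    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = []  # tasks this task depends on

    def __rshift__(self, other):
        # self >> other  =>  other (or each task in a list) depends on self
        targets = other if isinstance(other, list) else [other]
        for t in targets:
            t.upstream.append(self)
        return other  # returning `other` is what allows a >> b >> c chains

    def __lshift__(self, other):
        # self << other  =>  self depends on other (or each task in a list)
        sources = other if isinstance(other, list) else [other]
        for s in sources:
            self.upstream.append(s)
        return other

    def __rrshift__(self, other):
        # [t1, t2] >> self: lists have no __rshift__, so Python asks us
        return self.__lshift__(other)

a, b, c, d = Task("a"), Task("b"), Task("c"), Task("d")
a >> [b, c]      # b and c run after a
[b, c] >> d      # d runs after both b and c
```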

Common DAG Patterns

Sequential DAG


task_a >> task_b >> task_c

Parallel Branches


task_a >> [task_b, task_c] >> task_d

Fork-Join Pattern


task_a >> task_b
task_a >> task_c
[task_b, task_c] >> task_d

Organizing DAGs

Keep your DAG files organized in the dags folder. For larger deployments, consider subdirectories for logical grouping.

Best Practice: Ensure your start_date is set to a past date. If your start_date is in the future, the DAG will not be scheduled until that date.
Tip: Use the tags argument to easily filter and group DAGs in the Airflow UI's DAGs view.

The TaskFlow API

For a more Pythonic way to build DAGs, explore the TaskFlow API. It lets you write tasks as plain Python functions decorated with @task, while Airflow handles XCom passing and operator instantiation for you.

Refer to the TaskFlow API documentation for more details.

Dynamic DAG Generation

Airflow scans your dags folder periodically. You can generate DAGs dynamically within a Python file using loops or other logic.
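The usual pattern is a loop that builds one DAG per configuration entry and registers each under a unique name at module top level, where the DAG parser looks for DAG objects. Sketched here with a placeholder make_dag factory standing in for real Airflow objects:

```python
# Sketch of the dynamic-generation pattern: one DAG per config entry,
# each registered in module globals so the DAG parser discovers it.
# `make_dag` is a placeholder factory, not an Airflow API.

SOURCES = ["orders", "customers", "invoices"]

def make_dag(dag_id, source):
    """Placeholder for code that builds and returns a DAG object."""
    return {"dag_id": dag_id, "source": source, "schedule": "@daily"}

for source in SOURCES:
    dag_id = f"ingest_{source}"
    # Assigning into globals() puts each generated DAG at module
    # top level under a unique name.
    globals()[dag_id] = make_dag(dag_id, source)
```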

Note: Be cautious with dynamic DAG generation. Complex logic can impact DAG parsing performance. Ensure your DAG generation code is efficient and error-free.

This section provides a foundational understanding of DAGs. For advanced configurations and specific operator usage, please consult other relevant documentation sections.