DAGs

A Directed Acyclic Graph (DAG) is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. DAGs are the core of Airflow. They define the workflows you want to automate.

What is a DAG?

In Airflow, a DAG is defined in a Python script that describes the tasks in your workflow, how the workflow should be scheduled, and the dependencies between tasks. "Directed" means that each dependency points from an upstream task to a downstream task, so execution order is well defined; "acyclic" means the graph cannot contain cycles, so no task can depend, directly or indirectly, on itself.
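The acyclicity requirement is what makes a valid execution order possible at all. A quick illustration using Python's standard graphlib (plain Python, independent of Airflow; the task names are made up):

```python
from graphlib import TopologicalSorter, CycleError

# Toy dependency graph: each key maps to the set of tasks it depends on.
# These task names are illustrative only.
deps = {"end": {"transform"}, "transform": {"extract"}, "extract": set()}

# An acyclic graph has at least one valid execution order.
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['extract', 'transform', 'end']

# Introduce a cycle: extract now depends on end. No valid order exists.
deps["extract"] = {"end"}
try:
    list(TopologicalSorter(deps).static_order())
except CycleError:
    print("cycle detected: no longer a valid DAG")
```

Airflow performs an equivalent check when it parses your DAG file and refuses to load a graph that contains a cycle.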

Key Components of a DAG

A DAG brings together a few core pieces:

- Tasks: the units of work, each defined by an operator (for example, BashOperator or PythonOperator).
- Dependencies: the edges of the graph, defining which tasks must complete before others can start.
- Schedule: when and how often the DAG should run.
- DAG-level settings: arguments such as dag_id, start_date, catchup, and tags that control the DAG's identity and behavior.

Creating a DAG

DAGs are defined in Python files, typically placed in the Airflow DAGs folder. Here's a simple example:


from __future__ import annotations

import pendulum

from airflow.models.dag import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="my_first_dag",
    schedule="@daily",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
    tags=["example", "first"],
) as dag:
    start = EmptyOperator(task_id="start")
    end = EmptyOperator(task_id="end")

    start >> end

In this example:

- dag_id uniquely identifies the DAG within your Airflow deployment.
- schedule="@daily" runs the DAG once per day.
- start_date marks the first date the DAG is eligible to run, and catchup=False tells Airflow not to backfill runs for past dates.
- tags make the DAG easier to find and filter in the Airflow UI.
- The two EmptyOperator tasks do nothing; they are useful placeholders marking the start and end of a workflow.
- start >> end declares that end runs after start.

Task Dependencies

You can define dependencies between tasks in several ways:

- Bitshift operators: task_a >> task_b (task_b runs after task_a) and task_a << task_b (task_a runs after task_b).
- Methods: task_a.set_downstream(task_b) and task_b.set_upstream(task_a), which are equivalent to the operators above.
- Helpers: chain() and cross_downstream() from airflow.models.baseoperator, for longer or denser dependency patterns.
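The bitshift syntax works because Airflow operators override Python's __rshift__ and __lshift__ methods. A minimal toy sketch of the idea (these classes are illustrative only, not Airflow's actual implementation):

```python
class Task:
    """Toy stand-in for an Airflow operator, only to show how
    the >> syntax can record dependencies."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # a >> b records b as downstream of a and returns the
        # right-hand side, so chains read left to right.
        targets = other if isinstance(other, list) else [other]
        self.downstream.extend(targets)
        return other

a, b, c = Task("a"), Task("b"), Task("c")
a >> [b, c]
print([t.task_id for t in a.downstream])  # ['b', 'c']
```

Airflow's real operators also implement the reflected variants (so a list on the left-hand side works), but the principle is the same: each use of >> or << records an edge in the graph.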

Example with multiple dependencies:


import pendulum

from airflow.models.dag import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="complex_dependencies",
    schedule=None,
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    task_a = BashOperator(task_id="task_a", bash_command="echo 'A'")
    task_b = BashOperator(task_id="task_b", bash_command="echo 'B'")
    task_c = BashOperator(task_id="task_c", bash_command="echo 'C'")
    task_d = BashOperator(task_id="task_d", bash_command="echo 'D'")

    task_a >> [task_b, task_c] >> task_d

In this example, task_b and task_c will run after task_a, and task_d will run after both task_b and task_c have completed successfully.

Scheduling DAGs

Airflow offers flexible scheduling options:

- Preset strings such as @once, @hourly, @daily, @weekly, @monthly, and @yearly.
- Cron expressions, for example schedule="0 6 * * *" to run at 6:00 AM every day.
- datetime.timedelta objects for fixed intervals, for example timedelta(hours=6).
- schedule=None for DAGs that are only triggered manually or via the API.
- Custom timetables and data-aware scheduling for more advanced cases.
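For reference, the common presets correspond to cron expressions as documented by Airflow. The snippet below lists that mapping and a timedelta-based interval; it is plain Python and needs no Airflow installation:

```python
from datetime import timedelta

# Airflow's documented schedule presets and their cron equivalents.
PRESETS = {
    "@hourly": "0 * * * *",   # once an hour, on the hour
    "@daily": "0 0 * * *",    # once a day at midnight
    "@weekly": "0 0 * * 0",   # once a week, Sunday at midnight
    "@monthly": "0 0 1 * *",  # first day of the month at midnight
    "@yearly": "0 0 1 1 *",   # January 1st at midnight
}

# A timedelta schedules runs at a fixed interval from the start_date,
# rather than at calendar-aligned times.
every_six_hours = timedelta(hours=6)
```

Either form can be passed directly as the schedule argument of the DAG constructor.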

DAG Runs and Task Instances

When a DAG is scheduled to run, it creates a DAG Run. Each time a specific task within a DAG Run executes, it's called a Task Instance. The state of DAG Runs and Task Instances (e.g., running, success, failed, skipped) can be monitored in the Airflow UI.
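To make the hierarchy concrete, here is a toy model of a DAG Run holding Task Instances. The class names mirror Airflow concepts, but this is an illustrative sketch, not Airflow's internals:

```python
from dataclasses import dataclass, field

@dataclass
class TaskInstance:
    """One execution of one task within a particular DAG Run."""
    task_id: str
    state: str = "queued"  # e.g. queued, running, success, failed, skipped

@dataclass
class DagRun:
    """One execution of the whole DAG for a given logical date."""
    dag_id: str
    logical_date: str
    task_instances: list = field(default_factory=list)

run = DagRun("my_first_dag", "2023-01-01")
run.task_instances = [TaskInstance("start"), TaskInstance("end")]
run.task_instances[0].state = "success"
print([(ti.task_id, ti.state) for ti in run.task_instances])
```

Each scheduled interval produces one DAG Run, and every task in the DAG gets its own Task Instance within that run, with its own state you can inspect in the UI.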

Tip: Understanding DAGs is fundamental to using Airflow effectively. Familiarize yourself with defining tasks, dependencies, and scheduling to build robust data pipelines.

Further Reading