DAGs (Directed Acyclic Graphs)
A Directed Acyclic Graph (DAG) is a collection of tasks with dependencies defined between them. Airflow represents all of its workflows as DAGs, and each DAG is defined in a Python file that describes its tasks, their dependencies, and its scheduling configuration.
What is a DAG?
A DAG is a structured set of tasks that Airflow runs. Tasks are organized in a way that reflects their relationships and dependencies. The graph is directed because each dependency points from an upstream task to the downstream task that depends on it, and it is acyclic because no task can depend on itself, directly or through a chain of other tasks, so there are no cycles.
Key Components of a DAG
- Tasks: The fundamental units of work in Airflow. Each task is an instance of an Operator.
- Dependencies: The relationships between tasks that define the order of execution.
- Operators: Pre-built templates for tasks, such as BashOperator, PythonOperator, PostgresOperator, etc.
- Task Instances: A specific run of a task for a specific DAG run.
- DAG Runs: A specific execution of a DAG.
Defining a DAG
DAGs are defined in Python files. Here's a simple example:
from __future__ import annotations

import pendulum

from airflow.models.dag import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="my_simple_dag",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule=None,
    catchup=False,
    tags=["example", "core-concepts"],
) as dag:
    # Define tasks
    task_1 = BashOperator(
        task_id="say_hello",
        bash_command="echo 'Hello, World!'",
    )

    task_2 = BashOperator(
        task_id="say_goodbye",
        bash_command="echo 'Goodbye, Airflow!'",
    )

    # Define dependencies
    task_1 >> task_2
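Once this file is in your DAGs folder, you can exercise it without waiting for the scheduler, for example with `airflow dags test my_simple_dag 2023-01-01`, which runs the tasks of a single DAG run locally, or by triggering the DAG manually from the UI.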
Common DAG Arguments
| Argument | Description | Default |
|---|---|---|
| `dag_id` | A unique identifier for the DAG. | Required |
| `start_date` | The date from which the DAG should start running. Airflow uses this to determine when to schedule the first DAG run. | Required |
| `schedule` | The schedule interval for the DAG. Can be a cron expression, a `timedelta` object, or `None` for manual runs. | `@once` |
| `catchup` | If `True`, Airflow will schedule DAG runs for all missing past intervals between `start_date` and the current date. | `True` |
| `tags` | A list of tags to associate with the DAG for organization and filtering in the UI. | `[]` |
| `default_args` | A dictionary of default arguments to apply to all tasks within the DAG. | `{}` |
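To make the table concrete, here is a minimal sketch that combines several of these arguments; the `dag_id`, the cron expression, and the `default_args` values are illustrative choices, not Airflow defaults.

```python
from __future__ import annotations

import datetime

import pendulum

from airflow.models.dag import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_report",                # hypothetical dag_id for illustration
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule="0 6 * * *",                 # cron expression: every day at 06:00 UTC
    catchup=False,                        # do not backfill missed past intervals
    tags=["example"],
    default_args={
        "owner": "data-team",             # applied to every task in this DAG
        "retries": 2,
        "retry_delay": datetime.timedelta(minutes=5),
    },
) as dag:
    build_report = BashOperator(
        task_id="build_report",
        bash_command="echo 'building report'",
    )
```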
Task Dependencies
Dependencies define the execution order of tasks. Airflow supports several ways to define these relationships:
- Bitshift Operators: The most common and readable way.
  - `task1 >> task2`: `task1` must complete successfully before `task2` starts.
  - `task2 << task1`: Same as above, written in the other direction.
  - `task1 >> [task2, task3]`: `task1` must complete before both `task2` and `task3` start.
  - `[task1, task2] >> task3`: Both `task1` and `task2` must complete before `task3` starts.
- `set_upstream()` and `set_downstream()` methods: More verbose but explicit, e.g. `task1.set_downstream(task2)` or `task2.set_upstream(task1)`.
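As a sketch of the fan-out and fan-in forms above, assuming Airflow 2.3+ for EmptyOperator (older releases use DummyOperator); the `dag_id` and task names are illustrative:

```python
from __future__ import annotations

import pendulum

from airflow.models.dag import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="fan_out_fan_in_example",   # hypothetical dag_id for illustration
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule=None,
    catchup=False,
) as dag:
    start = EmptyOperator(task_id="start")
    branch_a = EmptyOperator(task_id="branch_a")
    branch_b = EmptyOperator(task_id="branch_b")
    end = EmptyOperator(task_id="end")

    # Fan out: start must finish before branch_a and branch_b run.
    start >> [branch_a, branch_b]
    # Fan in: both branches must finish before end runs.
    [branch_a, branch_b] >> end
```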
Tip
For complex dependency graphs, consider using the chain() utility from airflow.models.baseoperator, which links a sequence of tasks (or lists of tasks) in order.
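A minimal sketch of chain(), assuming the Airflow 2 import location airflow.models.baseoperator; the `dag_id` and task names are illustrative:

```python
from __future__ import annotations

import pendulum

from airflow.models.baseoperator import chain
from airflow.models.dag import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="chain_example",   # hypothetical dag_id for illustration
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule=None,
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    # Equivalent to extract >> transform >> load, but easier to maintain
    # for long sequences of tasks.
    chain(extract, transform, load)
```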
Task States
Tasks and DAG runs move through various states during their lifecycle.

Common task instance states:
- queued: The task is waiting to be picked up by an executor.
- running: The task is currently executing.
- success: The task completed successfully.
- failed: The task failed during execution.
- upstream_failed: The task did not run because one of its upstream dependencies failed.
- skipped: The task was skipped (e.g., due to branching logic).

Common DAG run states:
- scheduled: The DAG run is scheduled but not yet running.
- running: The DAG run is actively executing.
- success: The DAG run completed successfully.
- failed: The DAG run failed.
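To show how the skipped state arises from branching logic, here is a sketch using Airflow 2's BranchPythonOperator; the `dag_id`, task names, and the hard-coded branch choice are illustrative:

```python
from __future__ import annotations

import pendulum

from airflow.models.dag import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import BranchPythonOperator

with DAG(
    dag_id="branching_states_example",   # hypothetical dag_id for illustration
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule=None,
    catchup=False,
) as dag:
    # The callable returns the task_id of the branch to follow;
    # the branch that is not chosen ends the run in the skipped state.
    choose = BranchPythonOperator(
        task_id="choose_branch",
        python_callable=lambda: "run_this",
    )
    run_this = BashOperator(task_id="run_this", bash_command="echo chosen")
    skip_this = BashOperator(task_id="skip_this", bash_command="echo never runs")

    choose >> [run_this, skip_this]
```

After one run, `run_this` ends in success and `skip_this` ends in skipped.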
The Airflow UI for DAGs
The Airflow Web UI provides a visual representation of your DAGs, their structure, and their current status. You can monitor task progress, view logs, and trigger DAG runs from the UI.
Note
DAGs are parsed by the Airflow scheduler at regular intervals. Ensure your DAG files are placed in the DAGs folder configured in your Airflow environment.
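In a default installation, this folder is set by the dags_folder option under the [core] section of airflow.cfg, which can also be overridden with the AIRFLOW__CORE__DAGS_FOLDER environment variable.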