Glossary of Terms
-
DAG (Directed Acyclic Graph)
A collection of tasks with defined dependencies between them. It is a Directed graph because the direction of dependencies matters, and Acyclic because the graph cannot contain cycles, meaning a task cannot depend on itself directly or indirectly.
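As an illustrative sketch (the dag_id, task_ids, and commands are hypothetical, and the schedule argument assumes Airflow 2.4+, where it replaces schedule_interval), a minimal DAG with one dependency might look like this:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Two tasks with a directed dependency: "load" runs only after "extract" succeeds.
with DAG(
    dag_id="example_etl",            # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    extract >> load  # the directed edge; cycles are not allowed
```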
-
Task
A unit of work within a DAG. Tasks are typically Python functions or operators that perform specific actions. For example, running a SQL query, sending an email, or executing a bash command.
-
Operator
A template for a type of task. Operators define the logic for executing a specific kind of operation. Airflow provides a wide range of built-in operators (e.g., BashOperator, PythonOperator, PostgresOperator) and allows for custom operator creation.
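As a rough sketch of how operators become tasks (all identifiers hypothetical), a BashOperator and a PythonOperator might be instantiated inside a DAG definition like this:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def _greet():
    print("hello from a Python task")

with DAG(dag_id="operator_demo", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    # Each operator is a template; instantiating it inside a DAG creates a task.
    run_script = BashOperator(task_id="run_script", bash_command="echo running a script")
    greet = PythonOperator(task_id="greet", python_callable=_greet)
```
-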
Task Instance
A specific run of a task for a particular DAG run at a specific point in time. Each task instance has a state (e.g., running, success, failed, skipped).
-
DAG Run
A specific execution of a DAG. A DAG run represents an instance of the entire workflow for a given logical date or interval. It is composed of multiple task instances.
-
Logical Date / Execution Date
The timestamp that identifies the interval of data the DAG run is processing. For example, a daily DAG run that starts today typically has yesterday's date as its logical date, because it is processing the data interval that just ended.
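For instance, the {{ ds }} template variable renders as the run's logical date in YYYY-MM-DD form, so a task can parameterize its work by interval; the script name below is hypothetical:

```python
from airflow.operators.bash import BashOperator

# {{ ds }} is replaced at runtime with the logical date of the DAG run,
# e.g. "2024-01-01" for the run covering that day's data.
process_day = BashOperator(
    task_id="process_day",
    bash_command="python process.py --date {{ ds }}",  # process.py is hypothetical
)
```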
-
Schedule Interval
The frequency at which a DAG is supposed to run. This can be defined using cron expressions, timedelta objects, or presets like @daily and @hourly.
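As a hedged sketch (dag_ids are hypothetical, and the schedule argument assumes Airflow 2.4+, where it replaces schedule_interval), here are three roughly equivalent ways to express a daily cadence:

```python
from datetime import datetime, timedelta

from airflow import DAG

# Three ways to express (roughly) the same daily schedule.
daily_preset = DAG(dag_id="daily_preset", start_date=datetime(2024, 1, 1), schedule="@daily")
daily_cron = DAG(dag_id="daily_cron", start_date=datetime(2024, 1, 1), schedule="0 0 * * *")
daily_delta = DAG(dag_id="daily_delta", start_date=datetime(2024, 1, 1), schedule=timedelta(days=1))
```
-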
Hook
A generic interface to communicate with external services or databases. Hooks provide a common way to manage connections and execute commands against these systems, often used by operators.
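For example, a task callable might use PostgresHook (from the postgres provider package, which must be installed) to run a query against a configured connection; the connection ID and SQL below are hypothetical:

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook

def fetch_recent_users():
    # "analytics_db" must match a Connection defined in Airflow (hypothetical ID).
    hook = PostgresHook(postgres_conn_id="analytics_db")
    rows = hook.get_records("SELECT id, email FROM users ORDER BY created_at DESC LIMIT 10")
    return rows
```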
-
Connection
Configuration details (e.g., host, port, username, password) for connecting to external systems. Connections are stored by Airflow, managed through the UI, CLI, or environment variables, and referenced by hooks and operators via a connection ID.
-
XCom (Cross-communication)
A mechanism for tasks to exchange small amounts of data. A task can "push" data using XComs, and downstream tasks can "pull" that data. It's ideal for passing small metadata, not large datasets.
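A minimal sketch of the push/pull pattern (task IDs and the key are hypothetical); the ti (task instance) argument is supplied by Airflow at runtime:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def _push(ti):
    # Push a small piece of metadata for downstream tasks.
    ti.xcom_push(key="row_count", value=42)

def _pull(ti):
    # Pull the value pushed by the upstream task.
    count = ti.xcom_pull(task_ids="push_metadata", key="row_count")
    print(f"upstream reported {count} rows")

with DAG(dag_id="xcom_demo", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    push = PythonOperator(task_id="push_metadata", python_callable=_push)
    pull = PythonOperator(task_id="pull_metadata", python_callable=_pull)
    push >> pull
```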
-
Task Group
A UI-friendly way to group related tasks within a DAG. Task Groups help organize complex DAGs by visually clustering tasks in the Graph View.
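As an illustrative sketch (identifiers hypothetical), related tasks can be clustered with TaskGroup so they collapse into a single node in the Graph View:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.task_group import TaskGroup

with DAG(dag_id="taskgroup_demo", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    start = BashOperator(task_id="start", bash_command="echo start")

    # The two transform tasks render as one collapsible "transform" group.
    with TaskGroup(group_id="transform") as transform:
        clean = BashOperator(task_id="clean", bash_command="echo clean")
        enrich = BashOperator(task_id="enrich", bash_command="echo enrich")
        clean >> enrich

    start >> transform
```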
-
Provider Package
A distribution mechanism for Airflow integrations with external services. Provider packages contain operators, hooks, sensors, and other components for specific technologies (e.g., apache-airflow-providers-amazon).
-
Executor
The component responsible for managing task execution. Airflow supports various executors, such as LocalExecutor, CeleryExecutor, and KubernetesExecutor, which determine how tasks are run (e.g., on the same machine or distributed across workers).
-
Sensor
A special type of operator that waits for a certain condition to be met before succeeding. Examples include waiting for a file to appear, a row to appear in a database, or a specific time to be reached.
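For instance, a FileSensor (the file path and timings below are hypothetical) succeeds only once the expected file exists:

```python
from airflow.sensors.filesystem import FileSensor

# Waits (polling every 60 seconds, for up to an hour) until the file appears.
wait_for_report = FileSensor(
    task_id="wait_for_report",
    filepath="/data/incoming/report.csv",  # hypothetical path
    poke_interval=60,
    timeout=60 * 60,
)
```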
-
Pool
A mechanism to limit task concurrency. A pool has a fixed number of slots, and tasks assigned to the same pool share those slots, capping how many of them can run at once against a given resource.
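For example, assigning tasks to a pool (the "api_calls" pool is hypothetical and must first be created in the Airflow UI or CLI) caps how many of them run concurrently:

```python
from airflow.operators.bash import BashOperator

# Every task that sets pool="api_calls" competes for that pool's slots,
# so no more than the pool's slot count run at the same time.
call_api = BashOperator(
    task_id="call_api",
    bash_command="curl -s https://example.com/api",  # hypothetical command
    pool="api_calls",
)
```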