Pools

Pools are a mechanism in Apache Airflow to limit the concurrency of tasks. They are useful when you need to control how many instances of a particular type of task can run at the same time, or when you want to limit the number of external resources that Airflow can access concurrently.

What is a Pool?

A pool is essentially a named counter. When a task is assigned to a pool, it occupies one slot in that pool. If the pool is full (i.e., all slots are occupied by running tasks), any new tasks assigned to that pool will wait until a slot becomes available. This ensures that you don't overload external systems or exhaust your resources. Tasks that are not explicitly assigned to a pool run in Airflow's default_pool, which has 128 slots by default.
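Conceptually, a pool behaves like a counting semaphore. The following standalone Python sketch (not Airflow code; Airflow enforces pools in its scheduler) illustrates the acquire/wait/release cycle, with five "tasks" contending for two slots:

```python
import threading
import time

# A pool with 2 slots, modeled as a counting semaphore (illustrative only).
POOL_SLOTS = 2
pool = threading.BoundedSemaphore(POOL_SLOTS)

active = 0  # tasks currently holding a slot
peak = 0    # highest concurrency observed
lock = threading.Lock()

def run_in_pool(task_id: int) -> None:
    global active, peak
    with pool:  # blocks until a slot is free, like a queued pooled task
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.05)  # simulate work while holding the slot
        with lock:
            active -= 1

threads = [threading.Thread(target=run_in_pool, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# peak never exceeds POOL_SLOTS, no matter how many tasks are submitted
```

The semaphore guarantees that at most POOL_SLOTS workers hold a slot at once; the remaining "tasks" simply wait, which is exactly how queued task instances behave against a full pool.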

Creating and Managing Pools

Pools can be created and managed through the Airflow UI or programmatically. The UI provides a user-friendly interface for this purpose.

Using the Airflow UI

  1. Navigate to the Admin menu in the Airflow UI.
  2. Click on Pools.
  3. To create a new pool, click the + button.
  4. Enter a Pool Name (e.g., my_external_service).
  5. Set the Slots. This is the maximum number of concurrent tasks allowed for this pool.
  6. Optionally, provide a Description for the pool.
  7. Click Save.

You can also edit or delete existing pools from this page.

Programmatic Creation

You can also create pools using the Airflow CLI:

airflow pools set my_external_service 5 "Pool for my external service"

This command creates a pool named my_external_service with 5 slots and a description.
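A few related subcommands round out pool management from the CLI (shown as a sketch; run airflow pools --help for the exact options available in your Airflow version):

```shell
# List all pools and their slot counts
airflow pools list

# Show a single pool
airflow pools get my_external_service

# Delete a pool
airflow pools delete my_external_service

# Export / import pool definitions as JSON
airflow pools export pools.json
airflow pools import pools.json
```

Exporting and importing pools as JSON is handy for keeping pool definitions in version control alongside your DAGs.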

Assigning Tasks to Pools

To assign a task to a pool, you need to specify the pool parameter when defining your operator:

from __future__ import annotations

import pendulum

from airflow.models.dag import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="pool_example",
    schedule=None,
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
    tags=["example", "pools"],
) as dag:
    start = EmptyOperator(task_id="start")

    # This task will be limited by the 'my_external_service' pool
    task_in_pool = EmptyOperator(
        task_id="task_in_pool",
        pool="my_external_service",
        # You can also specify a pool_slots value if you need more than 1 slot
        # pool_slots=2,
    )

    end = EmptyOperator(task_id="end")

    start >> task_in_pool >> end

In this example, task_in_pool will only run if there is an available slot in the my_external_service pool. If the pool has a capacity of 5, up to 5 instances of tasks assigned to this pool can run concurrently.

The pool_slots Parameter

By default, a task occupies one slot in its assigned pool. However, you can override this by setting the pool_slots parameter within the operator. For instance, if you set pool_slots=2 for a task, that task will consume two slots from the pool. This is useful for tasks that are more resource-intensive or have a greater impact on external systems.
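As a quick sanity check on the arithmetic, a small standalone helper (illustrative only, not part of Airflow's API) shows how pool_slots caps concurrency:

```python
def max_concurrent_tasks(pool_capacity: int, slots_per_task: int) -> int:
    """How many tasks that each consume `slots_per_task` slots can run
    at the same time in a pool with `pool_capacity` total slots."""
    return pool_capacity // slots_per_task

# A 5-slot pool runs at most 2 tasks that each set pool_slots=2:
# two running tasks hold 4 slots, and a third would need 6.
print(max_concurrent_tasks(5, 2))  # prints 2
print(max_concurrent_tasks(5, 1))  # prints 5
```

Note that with pool_slots=2 in a 5-slot pool, one slot is left over; a third two-slot task stays queued, but a one-slot task from the same pool could still run.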

Important Note on Pool Slot Calculation

The total number of slots used by all running tasks for a given pool at any point in time must not exceed the pool's defined Slots value. Airflow manages this by queuing tasks that would otherwise exceed the pool's capacity.

Pools and Task Execution Order

Pools influence when a task can start, but they do not dictate the order in which tasks within a DAG are executed relative to each other. Airflow's scheduling logic still applies: tasks run based on their dependencies and the scheduler's decisions. When more tasks are queued for a pool than there are free slots, the scheduler uses each task's priority_weight to decide which queued task receives the next available slot. Pools act as a gatekeeper for resource availability, not as an ordering mechanism.

Use Cases for Pools

Typical scenarios where pools help include:

  - Rate-limiting calls to an external API that only tolerates a handful of concurrent requests.
  - Capping the number of simultaneous connections to a shared database or data warehouse.
  - Throttling resource-intensive tasks (large queries, heavy transforms) so they cannot monopolize your workers.
  - Preventing a backfill from flooding an external system with many task instances at once.

Summary

Pools are a powerful feature in Apache Airflow for managing concurrency and controlling resource utilization. By defining named pools with specific slot capacities, you can effectively safeguard your external systems and ensure stable operation of your data pipelines.