Executor Guide

Introduction to Airflow Executors

The executor is a core component of Apache Airflow responsible for determining how and where your tasks are run. When a task needs to be executed, Airflow delegates this responsibility to the configured executor. The choice of executor significantly impacts Airflow's scalability, performance, and deployment complexity.

This guide provides a comprehensive overview of Airflow's executors, helping you understand their functionalities, choose the appropriate one for your needs, and configure them effectively.

Understanding Executor Types

Airflow offers a variety of built-in executors, each suited for different use cases and infrastructure setups.

SequentialExecutor

The SequentialExecutor runs tasks one after another in a single process. It is the default executor in a fresh installation and the only executor compatible with a SQLite metadata database (SQLite does not support concurrent writes). It is useful for local development and testing, but not for any deployment that needs parallelism.

# Not recommended for production
executor = SequentialExecutor

LocalExecutor

The LocalExecutor runs tasks in parallel on the same machine as the Airflow scheduler, using a pool of worker processes. It is a good option for development and small-scale production environments. Parallelism is governed by the [core] options parallelism and max_active_tasks_per_dag (named dag_concurrency before Airflow 2.2).

[core]
executor = LocalExecutor
parallelism = 32
max_active_tasks_per_dag = 16

CeleryExecutor

The CeleryExecutor distributes task execution across multiple worker nodes. It leverages the Celery distributed task queue. This executor is highly scalable and suitable for production environments with moderate to high workloads.

To use the CeleryExecutor, you need a running message broker such as Redis or RabbitMQ. Workers connect to this broker to pick up tasks; each worker is started with the airflow celery worker command.

[core]
executor = CeleryExecutor

[celery]
broker_url = redis://localhost:6379/1
result_backend = redis://localhost:6379/1

KubernetesExecutor

The KubernetesExecutor is designed for cloud-native environments. It launches each task in its own Kubernetes pod. This provides excellent isolation, scalability, and resource management, making it ideal for dynamic and containerized workloads.

Each task gets its own container environment, ensuring clean execution and eliminating dependency conflicts between tasks.

[core]
executor = KubernetesExecutor

[kubernetes]
namespace = airflow
worker_container_repository = apache/airflow
; pin a specific image tag instead of "latest" in production
worker_container_tag = 2.8.1
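
Beyond the image settings above, worker pods are usually customized through a pod template file referenced by the pod_template_file option. The sketch below is a minimal, illustrative template; the metadata name and resource values are assumptions. Note that Airflow requires the task container to be named base.

```yaml
# Minimal pod template sketch (illustrative values).
# Airflow requires the task container to be named "base".
apiVersion: v1
kind: Pod
metadata:
  name: airflow-worker-template
spec:
  containers:
    - name: base
      image: apache/airflow:2.8.1
      resources:
        requests:
          cpu: "500m"
          memory: 512Mi
```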

DaskExecutor

The DaskExecutor distributes task execution across a Dask cluster. Dask is a flexible library for parallel computing in Python. Note that the DaskExecutor was deprecated in the Airflow 2.x series and has been removed from recent releases, so prefer the Celery or Kubernetes executors for new deployments.

[core]
executor = DaskExecutor

[dask]
cluster_address = tcp://dask-scheduler:8786

Other Executors

Airflow also supports hybrid executors such as the CeleryKubernetesExecutor, which lets a single deployment run most tasks on Celery workers while routing selected tasks (those assigned to the configured kubernetes_queue, by default "kubernetes") to their own Kubernetes pods. Furthermore, the pluggable executor architecture allows for the development of custom executors.

Choosing the Right Executor

Selecting the appropriate executor depends on several factors:

  • Scalability Needs: For high scalability, CeleryExecutor or KubernetesExecutor are recommended.
  • Infrastructure: If you are running on Kubernetes, the KubernetesExecutor is a natural fit. For on-premises or simpler cloud setups, CeleryExecutor with a message broker is common.
  • Resource Isolation: The KubernetesExecutor provides the strongest resource isolation.
  • Simplicity: For development and testing, LocalExecutor is often sufficient.
Note: SequentialExecutor should almost never be used in production environments.

Executor Configuration

Executors are configured in Airflow's main configuration file, airflow.cfg, or via environment variables.

Configuration in airflow.cfg

The primary configuration for the executor is set under the [core] section:

[core]
executor = CeleryExecutor
; Executor-specific options live in dedicated sections, e.g.:
; [celery]
; broker_url = redis://localhost:6379/1

Configuration via Environment Variables

You can override airflow.cfg settings with environment variables of the form AIRFLOW__<SECTION>__<KEY> (note the double underscores).

export AIRFLOW__CORE__EXECUTOR=KubernetesExecutor
export AIRFLOW__KUBERNETES__WORKER_CONTAINER_REPOSITORY=my-custom-airflow-image

Tip: Environment variables are the recommended approach for containerized deployments (e.g., Docker, Kubernetes), since configuration can be managed without modifying files inside the image. You can check the effective value with airflow config get-value core executor.

Advanced Topics

Explore more advanced aspects of Airflow executors to optimize your workflow.

Executor Scaling

Scaling an Airflow deployment usually means scaling the workers behind your chosen executor. For the CeleryExecutor, you add Celery worker processes or machines. The KubernetesExecutor launches one pod per task rather than keeping a fixed worker pool, so capacity is governed by the cluster itself: rely on the Kubernetes cluster autoscaler and tune Airflow's parallelism settings.
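
For Celery specifically, per-worker parallelism is controlled by the worker_concurrency option in the [celery] section. The value below is an assumed starting point, not a recommendation:

```
[celery]
worker_concurrency = 16
```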

Monitoring Executors

Effective monitoring is crucial for maintaining a healthy Airflow instance. Key metrics to watch include:

  • Task queuing and execution times.
  • Worker resource utilization (CPU, memory).
  • Broker health (for Celery).
  • Kubernetes pod status (for KubernetesExecutor).

Airflow's UI provides basic task status, and integrating with external monitoring tools like Prometheus, Grafana, or Datadog is highly recommended.
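
For example, Airflow can emit metrics to a StatsD server, which tools such as Prometheus (via an exporter) or Datadog can then consume. The host and port below are assumptions for a local StatsD instance:

```
[metrics]
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow
```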

Custom Executors

If the built-in executors do not meet your requirements, Airflow's architecture lets you develop and integrate custom executors. This involves subclassing BaseExecutor (from airflow.executors.base_executor), overriding its task-submission and synchronization hooks, and pointing the executor setting at your class.

Important: Developing custom executors requires a deep understanding of Airflow's internal workings and your target execution environment.
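
To illustrate the shape of that contract without pulling in Airflow itself, the toy class below mimics the three key hooks of BaseExecutor (execute_async, sync, end) using plain subprocesses. It is a sketch of the interface, not a drop-in executor; the class name and internal bookkeeping are invented for the example.

```python
import subprocess
import sys
import time


class ToyExecutor:
    """Toy model of Airflow's executor contract (not a real executor)."""

    def __init__(self):
        self.queued = {}    # task key -> command (argv list)
        self.running = {}   # task key -> subprocess.Popen handle
        self.results = {}   # task key -> "success" or "failed"

    def execute_async(self, key, command):
        # Airflow calls this once per task instance it wants to run.
        # The real hook also receives queue= and executor_config= arguments.
        self.queued[key] = command

    def sync(self):
        # Called on every scheduler heartbeat: start queued work,
        # then reap any processes that have finished.
        for key, command in list(self.queued.items()):
            self.running[key] = subprocess.Popen(command)
            del self.queued[key]
        for key, proc in list(self.running.items()):
            returncode = proc.poll()
            if returncode is not None:
                self.results[key] = "success" if returncode == 0 else "failed"
                del self.running[key]

    def end(self):
        # Drain all outstanding work before shutdown.
        while self.queued or self.running:
            self.sync()
            time.sleep(0.05)
```

A real executor subclasses airflow.executors.base_executor.BaseExecutor, whose execute_async additionally receives queue and executor_config parameters, and is activated by pointing the executor setting at the class's import path.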