Running Airflow on Kubernetes

This guide provides instructions and best practices for deploying and running Apache Airflow on a Kubernetes cluster.

Introduction

Kubernetes offers a powerful and scalable platform for running containerized applications. Airflow's architecture is well-suited for deployment on Kubernetes, allowing for dynamic scaling of workers and robust management of tasks.

Prerequisites

  • A running Kubernetes cluster.
  • kubectl configured to communicate with your cluster.
  • Helm installed (recommended for easier deployment).
  • Docker images for Airflow (we'll cover building or using pre-built ones).

Deployment Methods

There are several ways to deploy Airflow on Kubernetes:

1. Using the Official Helm Chart (Recommended)

The Apache Airflow project maintains an official Helm chart that simplifies the deployment process significantly. This is the most common and recommended method.

  1. Add the Airflow Helm repository:
    helm repo add apache-airflow https://airflow.apache.org
    helm repo update
  2. Configure your Airflow deployment: Create a values.yaml file to customize your Airflow installation. You can find a comprehensive example in the Helm chart repository or use helm show values apache-airflow/airflow > values.yaml. Key configurations include:
    • executor: Set to KubernetesExecutor.
    • dags.persistence.enabled: Set to true to keep DAG files on a persistent volume.
    • fernetKey: A Fernet key used to encrypt connection credentials (see the generation sketch below).
    • redis.enabled: Only needed as the Celery message broker; it can be disabled when using the KubernetesExecutor.
    • webserver.service.type: e.g., LoadBalancer or NodePort for accessing the UI.
  3. Install the Helm chart:
    helm install my-airflow apache-airflow/airflow -f values.yaml --namespace airflow --create-namespace

For detailed configuration options, refer to the official Helm chart documentation.
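
The fernetKey value can be generated with the cryptography package, which is installed alongside Airflow; a minimal sketch:

# Generate a Fernet key suitable for the fernetKey value in values.yaml.
# Requires the cryptography package, which ships as an Airflow dependency.
from cryptography.fernet import Fernet

print(Fernet.generate_key().decode())  # paste the printed string into values.yaml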

2. Manual Kubernetes Deployment

Though more complex, you can deploy Airflow by writing the Kubernetes resources yourself (Deployments, Services, PersistentVolumeClaims, etc.) as YAML files. This method offers maximum control but requires a deeper understanding of Kubernetes and of Airflow's components.

You will need to define:

  • Deployments for the Airflow Webserver and Scheduler (typically one each).
  • A Deployment for the Airflow Workers (or configure the KubernetesExecutor to create Pods dynamically).
  • A Service for the Airflow Webserver UI.
  • A PersistentVolumeClaim for DAGs and logs.
  • A Secret for the Fernet key and any database credentials.
  • A ConfigMap for Airflow's configuration.

Refer to the Airflow source code for examples of Kubernetes manifests, but be aware that these might be less actively maintained than the Helm chart.
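
These resources are normally written as YAML manifests and applied with kubectl. Purely as an illustration of one of them, the sketch below creates a Service for the Webserver with the Kubernetes Python client; the airflow namespace and the selector label app: airflow-webserver are assumptions to adapt to your own Deployment.

# Minimal sketch: a ClusterIP Service in front of the Airflow Webserver.
# The namespace and selector label are assumptions; adjust to your manifests.
# Requires: pip install kubernetes
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a pod

webserver_service = client.V1Service(
    metadata=client.V1ObjectMeta(name="airflow-webserver", namespace="airflow"),
    spec=client.V1ServiceSpec(
        selector={"app": "airflow-webserver"},  # hypothetical label
        ports=[client.V1ServicePort(port=8080, target_port=8080)],  # Webserver default port
        type="ClusterIP",
    ),
)

client.CoreV1Api().create_namespaced_service(namespace="airflow", body=webserver_service)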

Key Airflow Components on Kubernetes

  • Webserver: The user interface for Airflow.
  • Scheduler: Monitors DAGs and triggers task instances.
  • Executor: How tasks are run. The KubernetesExecutor is crucial for dynamic task scaling.
  • Metadata Database: Stores Airflow's state (e.g., PostgreSQL, MySQL).
  • DAGs & Logs: Need persistent storage, typically via PersistentVolumes.

KubernetesExecutor

The KubernetesExecutor is designed to launch each task instance as a separate Kubernetes Pod. This provides:

  • Scalability: Automatically scales worker pods based on task demand.
  • Isolation: Each task runs in its own environment.
  • Resource Management: Leverage Kubernetes resource requests and limits.

To use the KubernetesExecutor, set executor = KubernetesExecutor in the [core] section of airflow.cfg or via the AIRFLOW__CORE__EXECUTOR environment variable; when deploying with the Helm chart, setting executor in values.yaml takes care of this for you.
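
A quick way to confirm which executor an environment is actually using is to read it back through Airflow's configuration API, for example from a Python shell inside any Airflow container:

# Print the executor the current Airflow installation is configured with.
from airflow.configuration import conf

print(conf.get("core", "executor"))  # expect "KubernetesExecutor"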

When a task needs to be executed, the Airflow scheduler running in its own pod will:

  1. Construct a Kubernetes Pod definition based on the task's requirements and your Airflow configuration.
  2. Submit this Pod definition to the Kubernetes API.
  3. Kubernetes then schedules and runs this Pod.
  4. The Airflow worker image within the Pod executes the task.
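
A simple way to observe this behaviour is to list the pods in the Airflow namespace while a DAG is running; the sketch below assumes the airflow namespace from the helm install example above.

# List pods in the Airflow namespace to watch dynamically launched task pods.
# Assumes the "airflow" namespace; requires: pip install kubernetes
from kubernetes import client, config

config.load_kube_config()
for pod in client.CoreV1Api().list_namespaced_pod(namespace="airflow").items:
    print(f"{pod.metadata.name}: {pod.status.phase}")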

Important Considerations

  • Image Management: Ensure your Airflow worker images contain all necessary dependencies for your tasks.
  • Secrets Management: Use Kubernetes Secrets for sensitive information like database passwords and API keys (see the sketch after this list).
  • Resource Allocation: Define appropriate CPU and memory requests/limits for your Airflow components and task pods to ensure stability and efficiency.
  • Persistent Storage: Configure persistent volumes for DAGs, logs, and potentially the Airflow metadata database if it's not managed externally.
  • Network Policies: Implement Kubernetes Network Policies to secure communication between Airflow components and other services in your cluster.
  • Monitoring & Logging: Integrate Airflow's logging with your Kubernetes logging solution (e.g., Elasticsearch, Fluentd, Kibana - EFK stack) for centralized log aggregation.
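
As an example of the secrets point above, a task pod can pull a database password from a Kubernetes Secret instead of hard-coding it, using the executor_config mechanism described in the next section. The secret name airflow-db-credentials and key password below are hypothetical; create the Secret with kubectl beforehand.

# Sketch: expose a value from a Kubernetes Secret as an environment variable
# in a task pod via executor_config. Secret name and key are hypothetical.
from kubernetes.client import models as k8s

secret_env = k8s.V1EnvVar(
    name="DB_PASSWORD",
    value_from=k8s.V1EnvVarSource(
        secret_key_ref=k8s.V1SecretKeySelector(name="airflow-db-credentials", key="password")
    ),
)

executor_config = {
    "pod_override": k8s.V1Pod(
        spec=k8s.V1PodSpec(containers=[k8s.V1Container(name="base", env=[secret_env])])
    )
}
# Pass executor_config=executor_config to any operator (or to the @task decorator).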

Example DAGs for Kubernetes

When running with KubernetesExecutor, you can specify Kubernetes-specific parameters for your tasks using the executor_config argument. This allows you to define resource requests, node selectors, tolerations, and more for individual task pods.

from __future__ import annotations

import pendulum

from airflow.decorators import dag, task
from airflow.operators.bash import BashOperator
from kubernetes.client import models as k8s

@dag(
    dag_id="kubernetes_example",
    schedule="@daily",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
    tags=["kubernetes", "example"],
)
def kubernetes_example_dag():
    @task
    def run_on_kubernetes():
        return "Task executed on Kubernetes!"

    # Example of a BashOperator with executor_config: pod_override takes a
    # kubernetes.client V1Pod whose values are merged into the task's pod.
    task_with_k8s_config = BashOperator(
        task_id="bash_task_with_k8s_config",
        bash_command="echo 'Hello from a custom Kubernetes pod!'",
        executor_config={
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[
                        k8s.V1Container(
                            name="base",  # must match the container name in the base pod
                            resources=k8s.V1ResourceRequirements(
                                requests={"cpu": "100m", "memory": "128Mi"},
                                limits={"cpu": "200m", "memory": "256Mi"},
                            ),
                            env=[k8s.V1EnvVar(name="MY_CUSTOM_ENV", value="my_value")],
                        )
                    ]
                )
            )
        },
    )

    run_on_kubernetes() >> task_with_k8s_config

kubernetes_example_dag()

Troubleshooting

Common issues include:

  • Pod Creation Failures: Check Kubernetes events and scheduler logs for reasons why pods might not be created (e.g., insufficient resources, incorrect image name, RBAC permissions).
  • Task Failures: Inspect the logs of the task Pod for error messages.
  • Connection Issues: Ensure network connectivity between Airflow components and the metadata database.
  • RBAC Permissions: The scheduler's service account needs permission to create, watch, and delete pods in its namespace; the official Helm chart sets up this RBAC configuration for you.
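
When a pod never starts, recent events in the namespace usually reveal the cause (image pull errors, unschedulable pods, missing permissions). A minimal sketch using the Kubernetes Python client, assuming the airflow namespace from the examples above:

# Print recent events and any non-running pods in the Airflow namespace.
# Assumes the "airflow" namespace; requires: pip install kubernetes
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for event in v1.list_namespaced_event(namespace="airflow").items:
    print(f"{event.reason}: {event.message}")

for pod in v1.list_namespaced_pod(namespace="airflow").items:
    if pod.status.phase not in ("Running", "Succeeded"):
        print(f"Problem pod {pod.metadata.name}: {pod.status.phase}")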

Always refer to the official Airflow documentation and the Kubernetes documentation for the most up-to-date information and advanced configurations.