Community Forums

Integrating Airflow Monitoring with Prometheus & Grafana

By TechGuru_Alex · Oct 26, 2023 · 1542 Views · 18 Replies

Hey everyone,

I'm looking to set up robust monitoring for our Airflow instances and decided to go with the popular combination of Prometheus and Grafana. I've done some initial research, but I'm facing a few hurdles in getting the Airflow metrics to be scraped by Prometheus effectively and visualized in Grafana.

Specifically, I'm interested in tracking:

  • DAG run statuses (success, failure, running)
  • Task instance statuses
  • Scheduler health and queue lengths
  • Worker resource utilization (if possible)

I've installed Prometheus and Grafana, and I'm familiar with their basic setup. The main challenge is configuring Airflow to expose metrics and ensuring Prometheus can find and scrape them. I've looked at the apache-airflow-providers-cncf-kubernetes provider and at various StatsD/Prometheus exporter options, but I'm not sure which approach fits my setup best (local Docker Compose now, Kubernetes soon).

Has anyone successfully integrated Airflow monitoring with Prometheus and Grafana? I'd love to hear about:

  • Your recommended Airflow configuration (e.g., airflow.cfg settings, enabling the metrics exporter).
  • Prometheus configuration (e.g., prometheus.yml, service discovery for Airflow).
  • Grafana dashboard examples or recommendations for Airflow.
  • Any common pitfalls or best practices to be aware of.

Any guidance or shared experiences would be greatly appreciated!

Thanks in advance!

By DataOps_Ninja

Hi Alex,

Great initiative! Integrating Airflow with Prometheus and Grafana is a very common and powerful setup. One thing to know up front: Airflow doesn't expose Prometheus metrics natively; it emits metrics over StatsD (and, in newer releases, OpenTelemetry). The usual pattern is to enable StatsD in Airflow and run the prom/statsd-exporter as a bridge that turns those metrics into a Prometheus-scrapable /metrics endpoint. You don't need a separate Airflow provider package for this, even before you move to Kubernetes.

Here's a common approach:

1. Airflow Configuration

In your airflow.cfg (or via the equivalent AIRFLOW__METRICS__* environment variables), enable StatsD metric emission so the statsd_exporter can pick the metrics up. You'll likely want to set:

[metrics]
statsd_on = True
# Hostname and port of the statsd_exporter that relays metrics to Prometheus
statsd_host = statsd-exporter
statsd_port = 9125
statsd_prefix = airflow

Restart your Airflow scheduler and webserver for these changes to take effect.
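Since you're on Docker Compose, here's a minimal sketch of the same settings as environment variables, plus the statsd-exporter service that Prometheus will scrape. The service names (airflow-scheduler, statsd-exporter) are illustrative; adjust them to match your own compose file:

services:
  airflow-scheduler:
    # ... your existing Airflow scheduler definition ...
    environment:
      # Environment-variable equivalents of the [metrics] settings above
      AIRFLOW__METRICS__STATSD_ON: "True"
      AIRFLOW__METRICS__STATSD_HOST: statsd-exporter
      AIRFLOW__METRICS__STATSD_PORT: "9125"
      AIRFLOW__METRICS__STATSD_PREFIX: airflow

  statsd-exporter:
    image: prom/statsd-exporter:latest
    # Receives StatsD packets from Airflow on 9125/udp and exposes
    # a Prometheus /metrics endpoint on 9102
    ports:
      - "9102:9102"
      - "9125:9125/udp"

The same environment variables also need to be set on the webserver and worker containers if you want their metrics too.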

2. Prometheus Configuration

Your prometheus.yml will need a scrape configuration. With the StatsD bridge, Prometheus scrapes the statsd-exporter rather than the Airflow containers themselves. If you're using Docker Compose, point Prometheus at that service. For example:

scrape_configs:
  - job_name: 'airflow'
    static_configs:
      - targets: ['statsd-exporter:9102']  # statsd_exporter's default metrics port; adjust the hostname to your compose service name

If you move to Kubernetes, you'd use Kubernetes service discovery or annotations.
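For reference, here's a rough sketch of what that could look like with pod-based service discovery and the widely used prometheus.io annotations. The job name and annotation convention are assumptions on my part; if you run the Prometheus Operator you'd define a PodMonitor/ServiceMonitor instead:

scrape_configs:
  - job_name: 'airflow-k8s'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods that opt in via prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Use the port from the prometheus.io/port annotation (e.g. 9102 for a statsd-exporter sidecar)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__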

3. Grafana Dashboards

You can import pre-built dashboards from Grafana's dashboard repository. Search for "Airflow" and you'll find several community-contributed dashboards covering DAG runs, task statuses, and scheduler metrics. My personal favorite is one that visualizes queue depths and worker concurrency.
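If you provision Grafana from files (common in Docker setups), you'll also want Prometheus registered as a data source before those dashboards work. A minimal sketch; the URL assumes a compose service called prometheus on the default port:

# e.g. /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090  # adjust to your Prometheus service name/port
    isDefault: true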

Let me know if you need specific `airflow.cfg` examples or Prometheus config snippets!

Thanks, DataOps_Ninja! That's super helpful. I've made the changes to airflow.cfg and restarted the services. I'll update my prometheus.yml with the scrape config.

One quick question: Are there specific metrics I should be looking for to monitor scheduler health? I'm concerned about it becoming a bottleneck.

Good question! For scheduler health, keep an eye on these (names assume the airflow statsd_prefix and the default statsd_exporter mapping, which turns dots into underscores):

  • airflow_dagbag_size: the number of DAGs the scheduler found on its last scan of the DAGs folder.
  • airflow_scheduler_heartbeat: a counter that should climb steadily; if its rate drops to zero, the scheduler is down or stuck.
  • airflow_dag_processing_total_parse_time: how long a full scan and parse of all DAG files takes. A steadily growing value means DAG parsing is becoming a bottleneck.
  • dagrun.schedule_delay.<dag_id> (a per-DAG timer): the delay between a DAG run's scheduled time and when it actually started. High values mean your scheduler is falling behind; you'll usually want a statsd_exporter mapping rule to turn the dag_id into a label.
  • airflow_executor_queued_tasks and airflow_executor_open_slots: executor queue depth and remaining capacity, useful for spotting worker saturation.

You can find these and more by hitting the statsd-exporter's metrics endpoint directly (e.g., http://localhost:9102/metrics if you've published that port locally). These are excellent metrics to plot in Grafana to spot performance degradation.
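If you want to go a step further, a couple of these translate naturally into Prometheus alerting rules. A minimal sketch, assuming the metric names above (check your exporter's /metrics output for the exact names) and thresholds you'd tune for your environment:

groups:
  - name: airflow-scheduler
    rules:
      # No scheduler heartbeats for 5 minutes: the scheduler is likely down or stuck
      - alert: AirflowSchedulerStalled
        expr: rate(airflow_scheduler_heartbeat[5m]) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Airflow scheduler has stopped heartbeating
      # DAG file parsing taking longer than 60s suggests the scheduler is becoming a bottleneck
      - alert: AirflowSlowDagParsing
        expr: airflow_dag_processing_total_parse_time > 60
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Airflow DAG parsing is slow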