Hey everyone,
I'm looking to set up robust monitoring for our Airflow instances and decided to go with the popular combination of Prometheus and Grafana. I've done some initial research, but I'm hitting a few hurdles getting Airflow metrics scraped by Prometheus and visualized in Grafana.
Specifically, I'm interested in tracking:
- DAG run statuses (success, failure, running)
- Task instance statuses
- Scheduler health and queue lengths
- Worker resource utilization (if possible)
I've installed Prometheus and Grafana, and I'm familiar with their basic setup. The main challenge is configuring Airflow to expose metrics and making sure Prometheus can discover and scrape them. I've looked at the apache-airflow-providers-cncf-kubernetes provider and at Airflow's built-in StatsD metrics (paired with Prometheus's statsd_exporter), but I'm not sure which approach fits best for my current setup (local Docker Compose now, Kubernetes soon).
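For context, here's roughly what I've sketched for the Docker Compose side, assuming the StatsD route. The service names (`statsd-exporter`, `airflow-webserver`) and ports are just my guesses at a sensible layout based on the official compose file, and I believe this also needs the `apache-airflow[statsd]` extra installed in the Airflow image:

```yaml
# docker-compose.override.yml (sketch) - Airflow pushes StatsD metrics to a
# statsd_exporter sidecar, which re-exposes them at /metrics for Prometheus.
services:
  statsd-exporter:
    image: prom/statsd-exporter:latest
    command:
      - "--statsd.listen-udp=:9125"    # where Airflow sends StatsD packets
      - "--web.listen-address=:9102"   # where Prometheus scrapes /metrics
    ports:
      - "9102:9102"

  airflow-webserver:   # same environment block on scheduler/workers too
    environment:
      # Airflow 2.x [metrics] settings via env vars
      AIRFLOW__METRICS__STATSD_ON: "True"
      AIRFLOW__METRICS__STATSD_HOST: statsd-exporter
      AIRFLOW__METRICS__STATSD_PORT: "9125"
      AIRFLOW__METRICS__STATSD_PREFIX: airflow
```

Does that look roughly right, or am I overcomplicating it?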
Has anyone successfully integrated Airflow monitoring with Prometheus and Grafana? I'd love to hear about:
- Your recommended Airflow configuration (e.g., `airflow.cfg` settings, enabling the metrics exporter).
- Prometheus configuration (e.g., `prometheus.yml`, service discovery for Airflow); my current attempt is pasted below.
- Grafana dashboard examples or recommendations for Airflow.
- Any common pitfalls or best practices to be aware of.
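This is the `prometheus.yml` I've been trying against the Docker Compose stack (the target matches the exporter sketch above; the Kubernetes part is just a rough idea of what I expect to need later):

```yaml
# prometheus.yml (sketch) - scrape the statsd_exporter sitting in front of Airflow
scrape_configs:
  - job_name: airflow
    static_configs:
      - targets: ["statsd-exporter:9102"]   # Compose service name + exporter port

  # Rough idea for Kubernetes later: discover the exporter pod(s) by label.
  # - job_name: airflow-k8s
  #   kubernetes_sd_configs:
  #     - role: pod
  #   relabel_configs:
  #     - source_labels: [__meta_kubernetes_pod_label_app]
  #       regex: statsd-exporter
  #       action: keep
```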
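One pitfall I think I've already hit: Airflow embeds dag_id/task_id in the StatsD metric names themselves, so without a mapping file the exporter produces a separate metric name per DAG and task. This is my work-in-progress statsd_exporter mapping (metric names taken from the Airflow metrics docs, loaded with `--statsd.mapping-config`, but please treat it as a sketch rather than something proven):

```yaml
# statsd_mapping.yml (sketch) - turn Airflow's dotted metric names into
# Prometheus metrics with dag_id/task_id/state as labels.
mappings:
  - match: "airflow.dagrun.duration.success.*"
    name: "airflow_dagrun_duration_success"
    labels:
      dag_id: "$1"
  - match: "airflow.dagrun.duration.failed.*"
    name: "airflow_dagrun_duration_failed"
    labels:
      dag_id: "$1"
  - match: "airflow.ti.finish.*.*.*"
    name: "airflow_ti_finish"
    labels:
      dag_id: "$1"
      task_id: "$2"
      state: "$3"
```

My hope is that this gives me enough to build Grafana panels for DAG run and task statuses, plus scheduler heartbeat and executor queue gauges, but I'd really like to see what dashboards others actually use.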
Any guidance or shared experiences would be greatly appreciated!
Thanks in advance!