Logging and Monitoring

Logging

Proper logging is crucial for understanding the behavior of your Airflow deployments, debugging issues, and auditing operations. Airflow produces logs from several components: the scheduler, webserver, and workers each write their own logs, and every task instance writes a log of its own as well.

Airflow's logging system is designed to be flexible, allowing you to configure where and how logs are stored. This section covers the basics of how Airflow handles logging and how you can configure it to suit your needs.

Configuring Logging

Airflow's logging configuration is primarily managed through the airflow.cfg file. You can find the relevant settings under the [logging] section.

Log Formatter

You can define the format of your log messages. Airflow comes with a default formatter, but you can customize it.


[logging]
# Where Airflow writes logs on local disk
base_log_folder = /var/log/airflow

# Literal percent signs must be doubled (%%) in airflow.cfg, because the
# config parser performs %-interpolation
log_format = %%(asctime)s [%%(levelname)s] %%(name)s: %%(message)s

# Leave empty to use Airflow's default logging config, or point it at a
# custom dictConfig (see the sketch below)
logging_config_class =
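
If the built-in options are not enough, logging_config_class can point at your own dictConfig. A minimal sketch, assuming Airflow 2.x and a config/ directory on the PYTHONPATH (the module name log_config and the format string are illustrative, not required names):

# config/log_config.py -- referenced from airflow.cfg as:
#   logging_config_class = log_config.LOGGING_CONFIG
from copy import deepcopy

from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG

# Start from Airflow's default dictConfig and override only what is needed.
LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)

# Example tweak: change the format used by the "airflow" formatter.
LOGGING_CONFIG["formatters"]["airflow"]["format"] = (
    "%(asctime)s [%(levelname)s] %(name)s: %(message)s"
)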

Log Destination

Airflow supports various backends for storing logs:

  • Local File System: Logs are stored on the local disk of the Airflow components. This is the default.
  • Remote Storage: Logs can be sent to remote storage solutions like AWS S3, Google Cloud Storage, Azure Blob Storage, etc. This is highly recommended for production environments.

To configure remote logging, set the relevant options in the [logging] section of airflow.cfg and install the provider package for your storage backend (for S3, apache-airflow-providers-amazon). For example:


[logging]
remote_logging = True
# Airflow connection that holds credentials for the log bucket
remote_log_conn_id = aws_default
remote_base_log_folder = s3://your-airflow-logs-bucket/
# Optional: per-task-instance log path (close to Airflow's default layout)
log_filename_template = {{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log
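
After a task runs with remote logging enabled, its log should appear under remote_base_log_folder in the layout defined by log_filename_template. A quick way to spot-check this, assuming boto3 is installed and AWS credentials are available locally (the bucket name is a placeholder):

# List a handful of task log objects to confirm remote logging is writing to S3.
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket="your-airflow-logs-bucket",  # placeholder: your remote log bucket
    MaxKeys=20,
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["LastModified"])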

Refer to the Airflow Configuration Reference for a complete list of logging options.

Log Viewing and Management

The Airflow UI provides a convenient way to view task instance logs. Navigate to the Grid view (or Browse -> Task Instances), select a task instance, and open its logs.

Tip

When using remote logging, the Airflow UI will fetch logs directly from your configured remote storage, providing a unified view regardless of where the logs are physically stored.

For more advanced log management, consider integrating Airflow with centralized logging systems like Elasticsearch, Splunk, or Datadog. These systems offer powerful search, analysis, and visualization capabilities.

Monitoring Airflow

Monitoring your Airflow environment is essential for ensuring its health, performance, and reliability. This involves keeping an eye on the Airflow components (scheduler, webserver, workers) and the tasks being executed.

Key Metrics

Here are some key metrics to monitor:

  • Scheduler Heartbeat: Ensure the scheduler is running and emitting heartbeats; the webserver's /health endpoint reports the latest heartbeat (see the sketch after this list).
  • Webserver Availability: Check that the Airflow UI is reachable and responding.
  • Worker Health: Monitor worker processes to ensure they are alive and capable of executing tasks.
  • Task Execution Status: Track the number of successful, failed, and running tasks.
  • Task Duration: Identify tasks that are taking longer than expected.
  • Queue Sizes: If using Celery or other distributed executors, monitor the size of task queues.
  • Database Performance: Keep an eye on the Airflow metadata database, as it's critical for Airflow's operation.
  • Resource Utilization: Monitor CPU, memory, and network usage of Airflow components.
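
A simple check that covers the first two items above: the Airflow webserver exposes a /health endpoint that reports the status of the metadata database and the scheduler's most recent heartbeat. A minimal sketch, assuming the webserver is reachable at the placeholder URL below:

# Poll the webserver's /health endpoint and fail loudly if the scheduler is unhealthy.
import requests

AIRFLOW_BASE_URL = "http://localhost:8080"  # placeholder: your webserver URL

resp = requests.get(f"{AIRFLOW_BASE_URL}/health", timeout=10)
resp.raise_for_status()
health = resp.json()

scheduler = health.get("scheduler", {})
print("metadatabase:", health.get("metadatabase", {}).get("status"))
print("scheduler:", scheduler.get("status"),
      "| last heartbeat:", scheduler.get("latest_scheduler_heartbeat"))

if scheduler.get("status") != "healthy":
    raise SystemExit("Scheduler is unhealthy - trigger an alert here")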

Airflow emits metrics over StatsD (newer releases also add OpenTelemetry support), and monitoring systems like Prometheus can consume them via a StatsD exporter. You can enable StatsD reporting in your airflow.cfg:


[metrics]
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow

Airflow pushes these metrics to the configured StatsD endpoint; to scrape them with Prometheus, the usual approach is to run a StatsD exporter (such as prometheus/statsd_exporter) that translates them into a /metrics endpoint Prometheus can poll. The StatsD client library must also be installed, for example via the apache-airflow[statsd] extra.
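
Before wiring up a full exporter pipeline, it can be useful to confirm that metrics are actually being emitted. A throwaway sketch that just prints the raw StatsD packets, assuming statsd_host and statsd_port point at the machine running it:

# Minimal StatsD "sink" for debugging: print raw metric lines Airflow sends over UDP.
# A real setup would use a proper StatsD server or statsd_exporter instead.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 8125))  # must match statsd_port in airflow.cfg
print("Listening for StatsD packets on UDP 8125 ...")

while True:
    data, _addr = sock.recvfrom(65535)
    # StatsD lines look like: airflow.scheduler_heartbeat:1|c
    print(data.decode("utf-8", errors="replace").strip())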

Alerting

Setting up alerts for critical events is a crucial part of a robust monitoring strategy. Airflow provides several ways to implement alerting:

  • Task Failure Callbacks: Airflow tasks can be configured with on_failure_callback, allowing you to trigger custom actions (like sending an email or calling a webhook) when a task fails; see the sketch after this list.
  • Alerting DAGs: You can create dedicated DAGs that periodically check the status of other DAGs or tasks and trigger alerts.
  • External Monitoring Systems: Integrate with tools like PagerDuty, Slack, or VictorOps to receive notifications based on the metrics collected by your monitoring system.
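
As an illustration of the first option, here is a minimal sketch of a DAG whose tasks call a webhook on failure, assuming Airflow 2.4+ (the notify_failure helper and the webhook URL are placeholders, not part of Airflow's API):

# Tasks in this DAG post a small payload to a webhook whenever they fail.
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.bash import BashOperator

ALERT_WEBHOOK_URL = "https://example.com/alert-webhook"  # placeholder endpoint


def notify_failure(context):
    """on_failure_callback: Airflow passes the task instance context to this function."""
    ti = context["task_instance"]
    payload = {
        "dag_id": ti.dag_id,
        "task_id": ti.task_id,
        "log_url": ti.log_url,
    }
    requests.post(ALERT_WEBHOOK_URL, json=payload, timeout=10)


with DAG(
    dag_id="example_failure_alerts",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
    default_args={"on_failure_callback": notify_failure},
) as dag:
    BashOperator(task_id="always_fails", bash_command="exit 1")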

Important

Ensure your alerting mechanisms are reliable and that you have clear procedures for responding to alerts. Avoid alert fatigue by alerting only on actionable conditions and tuning thresholds over time.

Effective logging and monitoring are key to maintaining a healthy and performant Apache Airflow deployment. By carefully configuring these aspects and utilizing appropriate tools, you can gain deep insights into your workflow execution and proactively address any potential issues.