Airflow Database Configuration

This document details how to configure Airflow's metadata database. The metadata database is crucial for Airflow's operation, as it stores information about DAGs, tasks, task instances, connections, and more.

Choosing a Database Backend

Airflow supports several database backends: PostgreSQL, MySQL, and SQLite. SQLite is the default out of the box, which makes first-time setup easy, but for production environments PostgreSQL or MySQL are strongly recommended due to their robustness and performance. PostgreSQL is the most common choice.

PostgreSQL

PostgreSQL is a powerful, open-source object-relational database system. It is the recommended database for most Airflow deployments.

To use PostgreSQL, ensure you have a PostgreSQL server running and accessible. You'll need the connection details: host, port, database name, username, and password.
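
Before editing any Airflow configuration, it can be worth confirming those credentials work at all. A minimal check with the psql client, using hypothetical host, database, and user names:

# Runs a trivial query; a result of "1" confirms the server is reachable
# and the credentials are valid (psql will prompt for the password).
psql -h localhost -p 5432 -U airflow_user -d airflow_db -c 'SELECT 1;'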

MySQL

MySQL is another popular open-source relational database management system. It can also be used as an Airflow backend.

Similar to PostgreSQL, you'll need the connection details for your MySQL server.
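
If the database does not exist yet, it can be created up front. A sketch using the mysql client with placeholder names; utf8mb4 is the character set the official Airflow documentation recommends for MySQL backends:

# Run as a MySQL admin user; database, user, and password are placeholders.
mysql -u root -p -e "CREATE DATABASE airflow_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;"
mysql -u root -p -e "CREATE USER 'airflow_user'@'%' IDENTIFIED BY 'airflow_password'; GRANT ALL PRIVILEGES ON airflow_db.* TO 'airflow_user'@'%';"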

SQLite

SQLite is a self-contained, serverless, zero-configuration, transactional SQL database engine. It's suitable for testing and development purposes but is **not recommended for production environments** due to performance and concurrency limitations.

Warning: SQLite should only be used for local development or testing. Because SQLite does not support concurrent writes, Airflow can only run with the SequentialExecutor against a SQLite backend, and its performance is not suitable for production workloads.

Configuring the Database Connection

Airflow's configuration is primarily managed through the airflow.cfg file or environment variables. The connection details for the metadata database are specified under the [core] section. (In Airflow 2.3 and later, this setting moved to the [database] section; the examples below use the older [core] location.)

Using airflow.cfg

Locate your airflow.cfg file (often found in $AIRFLOW_HOME). Edit the [core] section as follows:

[core]
sql_alchemy_conn = postgresql+psycopg2://user:password@host:port/database
# For MySQL:
# sql_alchemy_conn = mysql+mysqlconnector://user:password@host:port/database
# For SQLite:
# sql_alchemy_conn = sqlite:////path/to/airflow.db

Replace the placeholder values with your actual database credentials and connection information.

Using Environment Variables

Alternatively, you can set the connection string using an environment variable:

export AIRFLOW__CORE__SQL_ALCHEMY_CONN="postgresql+psycopg2://user:password@host:port/database"

This method is often preferred in containerized or cloud environments for easier management.
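
For instance, with Docker the connection string can be injected when the container starts. A sketch assuming the official apache/airflow image; the tag and credentials are placeholders:

# Image tag and credentials are placeholders; adjust to your deployment.
docker run --rm \
  -e AIRFLOW__CORE__SQL_ALCHEMY_CONN="postgresql+psycopg2://user:password@host:5432/database" \
  apache/airflow:2.2.5 airflow db upgrade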

Database Driver

The prefix of the connection string selects the SQLAlchemy dialect and driver. Common drivers include psycopg2 for PostgreSQL (postgresql+psycopg2), mysql-connector-python (mysql+mysqlconnector) or mysqlclient (mysql+mysqldb) for MySQL, and Python's built-in sqlite3 module for SQLite, which needs no extra installation.

Ensure you have the necessary database driver installed in your Airflow environment:

pip install 'apache-airflow[postgres]' # For PostgreSQL
pip install 'apache-airflow[mysql]' # For MySQL

If you installed Airflow without a specific backend, you may need to install the driver separately:

pip install psycopg2-binary # For PostgreSQL
pip install mysql-connector-python # For MySQL
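
A quick sanity check that the driver is importable from the same Python environment Airflow uses:

# Should print a version string rather than raising ImportError.
python -c "import psycopg2; print(psycopg2.__version__)"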

Initializing the Database

After configuring the connection string, you need to initialize the Airflow metadata database. This creates the necessary tables and schema.

Run the following command in your Airflow environment:

airflow db upgrade

This command will connect to the database specified in your configuration and apply all pending database migrations.

Note: The airflow db upgrade command should be run whenever you upgrade Airflow to a new version to ensure your database schema is up-to-date. (In Airflow 2.7 and later the command is named airflow db migrate, with db upgrade kept as a deprecated alias.)
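
You can verify at any point that Airflow can actually reach the configured database using the built-in connectivity check:

# Exits successfully if a connection to the metadata database can be opened.
airflow db check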

Connection Pooling

For better performance, especially in high-load environments, you can tune SQLAlchemy's connection pooling. Pool settings are not query parameters on the sql_alchemy_conn string; Airflow exposes them as separate options in the same configuration section. (Driver-level parameters such as PostgreSQL's application_name, by contrast, do belong in the URL's query string.)

Example with pooling parameters in airflow.cfg:

[core]
sql_alchemy_conn = postgresql+psycopg2://user:password@host:port/database?application_name=airflow
sql_alchemy_pool_enabled = True
sql_alchemy_pool_size = 5
sql_alchemy_max_overflow = 10
sql_alchemy_pool_recycle = 1800

Key parameters: sql_alchemy_pool_size is the number of connections kept open in the pool; sql_alchemy_max_overflow allows up to that many additional connections beyond the pool size under load; sql_alchemy_pool_recycle closes and replaces any connection older than the given number of seconds, guarding against stale connections the server may have dropped.
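
As with the connection string, these settings can also be supplied through environment variables, following Airflow's AIRFLOW__SECTION__KEY naming convention:

export AIRFLOW__CORE__SQL_ALCHEMY_POOL_SIZE=5
export AIRFLOW__CORE__SQL_ALCHEMY_MAX_OVERFLOW=10
export AIRFLOW__CORE__SQL_ALCHEMY_POOL_RECYCLE=1800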

Common Issues and Troubleshooting

Connection refused: Ensure your database server is running and accessible from where Airflow is running. Check firewall rules and network connectivity.

Authentication failed: Verify your username, password, and database name are correct.

Missing driver: Make sure the necessary Python database driver is installed (e.g., psycopg2-binary, mysql-connector-python).

Database schema outdated: Run airflow db upgrade to apply pending migrations.
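
When diagnosing connection problems, a couple of quick checks from the machine running Airflow can narrow things down (host and port below are placeholders):

# Is the port reachable at all? Rules out firewall and network issues.
nc -zv db-host 5432

# For PostgreSQL: is the server up and accepting connections?
pg_isready -h db-host -p 5432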

Example: PostgreSQL Setup

Assuming you have a PostgreSQL server running with a database named airflow_db, a user airflow_user with password airflow_password, on localhost:5432, your configuration would be:

[core]
sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_password@localhost:5432/airflow_db
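
If the role and database from this example do not exist yet, one way to create them is with psql as a superuser:

# Create the example role and database (run as a PostgreSQL superuser).
psql -U postgres -c "CREATE USER airflow_user WITH PASSWORD 'airflow_password';"
psql -U postgres -c "CREATE DATABASE airflow_db OWNER airflow_user;"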

Then, initialize the database:

airflow db upgrade

By correctly configuring and initializing your metadata database, you ensure Airflow can reliably track and manage your workflows.