Airflow Database Configuration
This document details how to configure Airflow's metadata database. The metadata database is crucial for Airflow's operation, as it stores information about DAGs, tasks, task instances, connections, and more.
Choosing a Database Backend
Airflow supports several database backends. Out of the box, Airflow defaults to SQLite for quick experimentation, while PostgreSQL is the most commonly used backend in practice; MySQL is also supported. For production environments, PostgreSQL or MySQL are strongly recommended due to their robustness and performance.
PostgreSQL
PostgreSQL is a powerful, open-source object-relational database system. It is the recommended database for most Airflow deployments.
To use PostgreSQL, ensure you have a PostgreSQL server running and accessible. You'll need the connection details:
- Hostname or IP address
- Port
- Database name
- Username
- Password
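If you want to sanity-check these details before handing them to Airflow, a minimal script using psycopg2 (every value below is a placeholder for your own) might look like:

import psycopg2

# All connection values are placeholders -- substitute your own.
conn = psycopg2.connect(
    host="localhost",             # hostname or IP address
    port=5432,                    # port (5432 is the PostgreSQL default)
    dbname="airflow_db",          # database name
    user="airflow_user",          # username
    password="airflow_password",  # password
)
print(conn.server_version)        # an integer, e.g. 150004 for 15.4
conn.close()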
MySQL
MySQL is another popular open-source relational database management system. It can also be used as an Airflow backend.
Similar to PostgreSQL, you'll need the connection details for your MySQL server.
SQLite
SQLite is a self-contained, serverless, zero-configuration, transactional SQL database engine. It's suitable for testing and development purposes but is **not recommended for production environments** due to performance and concurrency limitations.
Configuring the Database Connection
Airflow's configuration is primarily managed through the airflow.cfg file or environment variables. The connection details for the metadata database are specified by the sql_alchemy_conn option under the [core] section (in Airflow 2.3 and later, this option lives in the [database] section instead).
Using airflow.cfg
Locate your airflow.cfg file (often found in $AIRFLOW_HOME). Edit the [core] section as follows:
[core]
sql_alchemy_conn = postgresql+psycopg2://user:password@host:port/database
# For MySQL:
# sql_alchemy_conn = mysql+mysqlconnector://user:password@host:port/database
# For SQLite:
# sql_alchemy_conn = sqlite:////path/to/airflow.db
Replace the placeholder values with your actual database credentials and connection information.
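After editing, you can confirm which value Airflow actually resolved using the config CLI (on Airflow 2.3+, substitute database for core as the section name):

airflow config get-value core sql_alchemy_conn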
Using Environment Variables
Alternatively, you can set the connection string using an environment variable:
export AIRFLOW__CORE__SQL_ALCHEMY_CONN="postgresql+psycopg2://user:password@host:port/database"
This method is often preferred in containerized or cloud environments for easier management; note that environment variables take precedence over values set in airflow.cfg.
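For example, with the official Docker image (the tag is omitted and the connection string is illustrative; check the image documentation for your version):

docker run -e AIRFLOW__CORE__SQL_ALCHEMY_CONN="postgresql+psycopg2://user:password@host:5432/database" apache/airflow airflow db check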
Database Driver
The connection string specifies the database driver. Common drivers include:
- PostgreSQL: postgresql+psycopg2 (requires the psycopg2-binary Python package)
- MySQL: mysql+mysqlconnector (requires the mysql-connector-python Python package)
- SQLite: sqlite (handled by Python's built-in sqlite3 module; no extra package needed)
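The driver prefix is part of the SQLAlchemy URL itself. As a small illustration of how SQLAlchemy parses it (the URL below is a placeholder):

from sqlalchemy.engine.url import make_url

# The part before "://" selects both the backend and the driver.
url = make_url("postgresql+psycopg2://user:password@host:5432/database")
print(url.get_backend_name())  # "postgresql"
print(url.get_driver_name())   # "psycopg2"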
Ensure you have the necessary database driver installed in your Airflow environment:
pip install "apache-airflow[postgres]"  # For PostgreSQL
pip install "apache-airflow[mysql]"     # For MySQL
If you installed Airflow without a specific backend, you may need to install the driver separately:
pip install psycopg2-binary # For PostgreSQL
pip install mysql-connector-python # For MySQL
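A quick way to confirm a driver is importable from the environment Airflow runs in:

python -c "import psycopg2; print(psycopg2.__version__)"                 # For PostgreSQL
python -c "import mysql.connector; print(mysql.connector.__version__)"  # For MySQL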
Initializing the Database
After configuring the connection string, you need to initialize the Airflow metadata database. This creates the necessary tables and schema.
Run the following command in your Airflow environment:
airflow db upgrade
This command will connect to the database specified in your configuration and apply all pending database migrations.
The airflow db upgrade command should be run whenever you upgrade Airflow to a new version, so that your database schema stays up to date. (In Airflow 2.7 and later, airflow db migrate supersedes both db init and db upgrade.)
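Separately from migrations, you can verify that Airflow can reach the configured database at all:

airflow db check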
Connection Pooling
For better performance, especially in high-load environments, you can tune SQLAlchemy's connection pooling. Pooling is configured with dedicated Airflow options that sit alongside sql_alchemy_conn (or their environment-variable equivalents), not inside the connection string itself.
Example with pooling options in airflow.cfg:
[core]
sql_alchemy_conn = postgresql+psycopg2://user:password@host:port/database
sql_alchemy_pool_enabled = True
sql_alchemy_pool_size = 5
sql_alchemy_max_overflow = 10
sql_alchemy_pool_recycle = 1800
Key parameters:
- pool_size: the number of connections to keep open in the pool.
- max_overflow: the maximum number of connections that can be opened beyond pool_size.
- pool_timeout: the number of seconds to wait for a connection from the pool before giving up.
- pool_recycle: the number of seconds after which a connection is recycled (closed and replaced).
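Under the hood these are SQLAlchemy pool arguments. A minimal sketch of the equivalent create_engine() call, roughly what Airflow builds internally from such settings (the connection string is a placeholder):

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://user:password@host:5432/database",
    pool_size=5,        # connections kept open in the pool
    max_overflow=10,    # extra connections allowed beyond pool_size
    pool_timeout=30,    # seconds to wait for a free connection
    pool_recycle=1800,  # close and replace connections older than 30 minutes
)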
Common Issues and Troubleshooting
Connection refused: Ensure your database server is running and accessible from where Airflow is running. Check firewall rules and network connectivity.
Authentication failed: Verify your username, password, and database name are correct.
Missing driver: Make sure the necessary Python database driver is installed (e.g., psycopg2-binary, mysql-connector-python).
Database schema outdated: Run airflow db upgrade to apply pending migrations.
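For connection problems in particular, it often helps to test reachability from the Airflow host before involving Airflow itself; with PostgreSQL, for example:

pg_isready -h host -p 5432    # reports whether the server is accepting connections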
Example: PostgreSQL Setup
Assume a PostgreSQL server running on localhost:5432, with a database named airflow_db and a user airflow_user whose password is airflow_password.
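If the database and user don't exist yet, one common way to create them (assuming a local installation where the postgres superuser is available; adjust to your setup) is:

sudo -u postgres psql -c "CREATE USER airflow_user WITH PASSWORD 'airflow_password';"
sudo -u postgres psql -c "CREATE DATABASE airflow_db OWNER airflow_user;"

With the database and user in place, your configuration would be: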
[core]
sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_password@localhost:5432/airflow_db
Then, initialize the database:
airflow db upgrade
By correctly configuring and initializing your metadata database, you ensure Airflow can reliably track and manage your workflows.