Installation Guide
This guide provides detailed instructions on how to install Apache Airflow. We recommend reviewing the Prerequisites section before proceeding.
Prerequisites
Before installing Airflow, ensure you have the following:
- Python: Airflow requires Python 3.8 or higher. You can check your Python version by running:

  ```bash
  python --version
  ```

- Pip: Ensure you have pip installed and up to date:

  ```bash
  pip install --upgrade pip
  ```
- Virtual Environment (Recommended): Install Airflow in a Python virtual environment to avoid conflicts with system-wide packages:

  ```bash
  python -m venv airflow-venv
  source airflow-venv/bin/activate  # On Windows use `airflow-venv\Scripts\activate`
  ```
Installing Airflow
The simplest way to install Airflow is using pip. For a basic local setup, you can install the core package:
```bash
pip install apache-airflow
```
This command installs the core Airflow package and its essential dependencies. For production or more advanced use cases, you may need additional provider packages, depending on your requirements (e.g., for interacting with cloud services or databases).
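Once the install completes, a quick sanity check confirms the CLI is on your PATH:

```bash
airflow version   # prints the installed Airflow version
```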
Installing with Providers
To install Airflow with specific providers (e.g., for Amazon AWS, Google Cloud, or Snowflake), specify the corresponding extras in the installation command. In Airflow 2.x the extras follow the provider package names (`amazon` and `google` rather than the older `aws` and `gcp` aliases), and quoting the argument keeps your shell from interpreting the square brackets:

```bash
pip install "apache-airflow[amazon,google,snowflake]"
```
Refer to the Providers documentation for a full list of available integrations.
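Whether or not you use extras, Airflow's official installation notes recommend pinning against a constraint file, so that pip resolves the exact dependency set tested for a given Airflow and Python version. A sketch of that pattern follows; the Airflow version and extras here are placeholders, so substitute your own:

```bash
# Reproducible install pinned by the official constraint file for your
# Airflow version and Python version (both values below are examples).
AIRFLOW_VERSION=2.7.3
PYTHON_VERSION="$(python -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"
pip install "apache-airflow[amazon,google,snowflake]==${AIRFLOW_VERSION}" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
```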
Initial Configuration
Once Airflow is installed, you need to initialize its database and create an admin user.
- Initialize the database: This command creates the necessary tables in your metadata database. By default, Airflow uses a SQLite database, which is suitable for local development and testing.

  ```bash
  airflow db init
  ```
- Create an admin user: You'll need this user to log in to the Airflow UI. You will be prompted to set a password for this user (a non-interactive variant is sketched after this list).

  ```bash
  airflow users create \
      --username admin \
      --firstname YourFirstName \
      --lastname YourLastName \
      --role Admin \
      --email admin@example.com
  ```
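If you prefer a scripted setup, a minimal non-interactive sketch follows. It relies on two things worth noting: Airflow keeps `airflow.cfg`, logs, and the default SQLite database under the directory named by the `AIRFLOW_HOME` environment variable (defaulting to `~/airflow`), and the Airflow 2.x CLI accepts a `--password` flag on `airflow users create`. The password value below is a placeholder:

```bash
# Optional: choose where Airflow stores airflow.cfg, logs, and the SQLite DB
# (~/airflow is the default if unset).
export AIRFLOW_HOME=~/airflow

# Same initialization step as above.
airflow db init

# Non-interactive variant of the user-creation step; the password is a
# placeholder, so choose your own.
airflow users create \
    --username admin \
    --firstname YourFirstName \
    --lastname YourLastName \
    --role Admin \
    --email admin@example.com \
    --password change-me

airflow users list   # confirm the user was created
```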
Starting the Webserver and Scheduler
To use Airflow, you typically need to run two components:
- Webserver: This provides the Airflow user interface (UI), where you can monitor and manage your workflows.

  ```bash
  airflow webserver --port 8080
  ```
- Scheduler: This component is responsible for scheduling and running your tasks. It should be run in a separate terminal session.

  ```bash
  airflow scheduler
  ```
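Both commands run in the foreground. If you would rather keep a single terminal, the Airflow 2.x CLI accepts a `-D`/`--daemon` flag on both subcommands; treat this as a convenience sketch and check `airflow webserver --help` on your version:

```bash
# Daemonize both components instead of keeping two terminals open.
airflow webserver --port 8080 -D
airflow scheduler -D
```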
Open your web browser and navigate to http://localhost:8080 to access the Airflow UI. You can log in with the admin credentials you created.
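For quick local experimentation, Airflow 2.2 and later also ship an all-in-one `airflow standalone` command, which initializes the database, creates an admin user (printing its generated password), and runs the webserver and scheduler in a single process:

```bash
airflow standalone   # all-in-one local development mode; not for production
```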
Note: For production environments, you will likely want to configure a more robust backend database (like PostgreSQL or MySQL) and consider running the scheduler and webserver as system services.
Production Considerations
For production deployments, several aspects need careful consideration:
- Database Backend: Switch from SQLite to a production-grade database like PostgreSQL or MySQL.
- Executor: Choose an executor suited to your workload (e.g., CeleryExecutor, KubernetesExecutor). The default `SequentialExecutor` is not suitable for production. A configuration sketch covering this and the database backend follows this list.
- Security: Configure authentication and authorization, SSL/TLS for the webserver, and secure secrets management.
- High Availability: Set up multiple webserver and scheduler instances for redundancy.
- Monitoring and Logging: Implement robust monitoring and centralized logging solutions.
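As a concrete sketch of the first two points, any Airflow setting can be overridden with an environment variable of the form `AIRFLOW__<SECTION>__<KEY>`. Note that the `sql_alchemy_conn` option lives in the `[database]` section in Airflow 2.3+ (older releases use `AIRFLOW__CORE__SQL_ALCHEMY_CONN`), and the connection string and executor below are placeholders, not recommendations:

```bash
# Illustrative overrides only; substitute your own host, credentials, and executor.
# Airflow reads AIRFLOW__<SECTION>__<KEY> variables on top of airflow.cfg.
export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow:airflow@localhost:5432/airflow"
export AIRFLOW__CORE__EXECUTOR=LocalExecutor   # or CeleryExecutor / KubernetesExecutor
```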
Refer to the Configuration documentation for details on customizing Airflow's behavior and the Executors documentation for choosing the right executor.