Installing Apache Airflow on a Google Cloud VM (No Docker)
Apache Airflow is a popular platform for programmatically authoring, scheduling, and monitoring workflows. Installing it directly on a Google Cloud VM, without Docker, keeps the setup transparent and integrates cleanly with other cloud services. This guide walks you through the installation process step by step.
Watch on YouTube
Prerequisites
Before starting, ensure you have the following:
- A Google Cloud Platform (GCP) account.
- A Compute Engine VM instance (Ubuntu recommended).
- SSH access to the VM.
- Python 3.8 or later installed.
Step 1: Update and Install Dependencies
Start by updating the system and installing essential packages:
sudo apt update && sudo apt upgrade -y
sudo apt install -y python3-pip python3-venv
Step 2: Create a Virtual Environment
Using a virtual environment helps isolate dependencies:
python3 -m venv airflow-venv
source airflow-venv/bin/activate
Step 3: Install Apache Airflow
Set the Airflow home directory and install Airflow:
export AIRFLOW_HOME=~/airflow
pip install "apache-airflow==2.7.2" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.7.2/constraints-3.8.txt"
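The constraints file must match your Python minor version (the example above assumes Python 3.8). A small sketch, assuming the standard layout of Airflow's constraints URLs, that derives the right URL from the active interpreter:

```shell
# Pin the Airflow version and derive the matching constraints URL
# from the running Python interpreter (e.g. 3.8 -> constraints-3.8.txt).
AIRFLOW_VERSION=2.7.2
PYTHON_VERSION="$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
echo "${CONSTRAINT_URL}"

# Then install with:
# pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
```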
Step 4: Initialize the Airflow Database
Airflow requires a database to store metadata:
airflow db init
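By default this creates a SQLite database at $AIRFLOW_HOME/airflow.db and uses the SequentialExecutor, which is fine for trying things out but not for heavier workloads. You can confirm what was configured with:

```shell
# Show which metadata database and executor Airflow is using.
airflow config get-value database sql_alchemy_conn
airflow config get-value core executor
```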
Step 5: Create an Admin User
Create an admin user to access the Airflow UI (replace the example credentials with your own):
airflow users create \
--username admin \
--password admin \
--firstname Admin \
--lastname User \
--role Admin \
--email admin@example.com
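To verify that the account was created, list the registered users:

```shell
# The new admin account should appear in this table.
airflow users list
```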
Step 6: Start Airflow Services
Run the scheduler and webserver in separate terminals:
airflow scheduler
Open another terminal and run:
airflow webserver --port 8080
Step 7: Access Airflow UI
Once the webserver starts, access the UI via:
http://<VM_EXTERNAL_IP>:8080
To find the external IP, run:
gcloud compute instances list
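Note that GCP blocks inbound traffic on port 8080 unless a firewall rule allows it. A sketch of such a rule (the rule name is arbitrary and <YOUR_IP> is a placeholder for your own public address; restricting the source range is safer than opening the port to 0.0.0.0/0):

```shell
# Allow inbound TCP traffic on port 8080 from a single address.
# Replace <YOUR_IP> with your public IP; the rule name is arbitrary.
gcloud compute firewall-rules create allow-airflow-ui \
  --allow=tcp:8080 \
  --source-ranges=<YOUR_IP>/32 \
  --description="Airflow webserver UI"
```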
Step 8: Set Up Airflow to Start on Boot (Optional)
To ensure Airflow starts automatically after reboot, create systemd service files for the scheduler and webserver.
Scheduler Service
sudo nano /etc/systemd/system/airflow-scheduler.service
Add the following:
[Unit]
Description=Apache Airflow Scheduler
After=network.target
[Service]
User=<your-user>
Group=<your-group>
Environment="AIRFLOW_HOME=/home/<your-user>/airflow"
ExecStart=/home/<your-user>/airflow-venv/bin/airflow scheduler
Restart=always
[Install]
WantedBy=multi-user.target
Webserver Service
sudo nano /etc/systemd/system/airflow-webserver.service
Add the following:
[Unit]
Description=Apache Airflow Webserver
After=network.target
[Service]
User=<your-user>
Group=<your-group>
Environment="AIRFLOW_HOME=/home/<your-user>/airflow"
ExecStart=/home/<your-user>/airflow-venv/bin/airflow webserver --port 8080
Restart=always
[Install]
WantedBy=multi-user.target
Reload systemd so it picks up the new unit files, then enable the services:
sudo systemctl daemon-reload
sudo systemctl enable airflow-scheduler
sudo systemctl enable airflow-webserver
Start the services:
sudo systemctl start airflow-scheduler
sudo systemctl start airflow-webserver
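To confirm that both services came up, and to inspect logs if one of them failed:

```shell
# Check that both units are active.
sudo systemctl status airflow-scheduler --no-pager
sudo systemctl status airflow-webserver --no-pager

# Tail recent scheduler logs if something looks wrong.
sudo journalctl -u airflow-scheduler -n 50 --no-pager
```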
Conclusion
You have now installed Apache Airflow on a Google Cloud VM without using Docker. This setup is flexible and easy to manage; for production workloads you would typically replace the default SQLite metadata database with PostgreSQL or MySQL and switch to a more capable executor. You can now begin creating and scheduling workflows.