A production-ready Apache Airflow deployment system for personal data engineering projects, optimized for Raspberry Pi and self-hosted environments with automated CI/CD pipeline.
This project provides a complete self-hosted Airflow infrastructure that can be deployed on Raspberry Pi or any Linux server. It serves as the backbone for multiple data engineering projects, providing reliable data pipelines, scheduling, and orchestration capabilities.
- Cost-Effective: Run on Raspberry Pi (~$100) instead of cloud services ($100+/month)
- Full Control: Complete control over your data pipeline infrastructure
- Privacy: Keep your data and workflows on your own hardware
- Learning: Perfect for personal projects and learning data engineering
- Scalable: Start small, scale as needed with additional workers
- Always On: 24/7 data pipeline execution without cloud bills
- Production-Ready: Battle-tested deployment with error handling and monitoring
- Automated CI/CD: GitHub Actions pipeline for testing and deployment
- Docker-Based: Containerized deployment with Docker Compose
- ARM-Optimized: Performance tuned for Raspberry Pi 5
- Secure Remote Access: Cloudflare Tunnel integration for secure SSH
- Multi-Project Support: Serves multiple downstream data projects
- Snowflake Integration: Native support for Snowflake data warehouse
- Extensible: Easy to add new DAGs and integrations
This Airflow infrastructure serves various personal data engineering projects:
- Trading Agent Data Pipeline (Primary; a fetch sketch follows this project list)
- Fetches Bitcoin hourly OHLCV data from CryptoCompare API
- Historical backfill from 2010 with intelligent batching
- Calculates 110+ technical indicators using TA-Lib
- Stores data in Snowflake for ML model training
- Provides fresh data for algorithmic trading strategies
- Snowflake Documentation Scraper (Knowledge Base)
- Scrapes Snowflake documentation for SnowPro Core exam prep
- Delta-mode scraping (only new/updated pages)
- Parallel web scraping with rate limiting
- Vector embeddings with OpenAI for semantic search
- Pinecone vector database for RAG applications
- AWS SAA Documentation Scraper (Knowledge Base)
- Scrapes AWS documentation for AWS Solutions Architect Associate (SAA) exam prep
- Delta-mode scraping (only new/updated pages)
- Parallel web scraping with rate limiting
- Vector embeddings with OpenAI for semantic search
- Pinecone vector database for RAG applications
- Future Projects (Examples)
- ETL pipelines for personal analytics
- Social media data aggregation
- IoT sensor data processing
- Automated reporting and dashboards
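To give a flavor of the primary pipeline, here is a minimal sketch of fetching hourly Bitcoin OHLCV data from CryptoCompare. It uses CryptoCompare's public min-api histohour endpoint; the exact fields and batching the repo's DAG uses may differ, and an API key may be needed for higher rate limits.

```python
# Minimal sketch: fetch recent hourly BTC/USD OHLCV candles from CryptoCompare.
import requests

def fetch_btc_hourly(limit: int = 24) -> list[dict]:
    """Return the most recent `limit` hourly OHLCV candles for BTC/USD."""
    url = "https://min-api.cryptocompare.com/data/v2/histohour"
    params = {"fsym": "BTC", "tsym": "USD", "limit": limit}
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    payload = response.json()
    # Each entry carries time, open, high, low, close, and volume fields.
    return payload["Data"]["Data"]

if __name__ == "__main__":
    candles = fetch_btc_hourly()
    print(f"Fetched {len(candles)} candles, latest close: {candles[-1]['close']}")
```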
The repository includes production-ready example DAGs:
- bitcoin_ohlcv_dataset.py: Bitcoin price data collection with historical backfill
- technical_indicators_dag.py: Technical analysis calculations
- snowflake_docs_db_dag.py: Documentation scraper with vector embeddings
- aws_saa_docs_scraper.py: AWS documentation scraper for SAA certification with vector embeddings
These DAGs demonstrate best practices and can be used as templates for your own projects.
airflow-self-hosted/
├── dags/                             # Airflow DAGs directory
│   ├── bitcoin_ohlcv_dataset.py      # Bitcoin price data with backfill
│   ├── technical_indicators_dag.py   # Technical analysis calculations
│   ├── snowflake_docs_db_dag.py      # Documentation scraper + RAG
│   ├── aws_saa_docs_scraper.py       # AWS SAA documentation scraper + RAG
│   └── your_custom_dag.py            # Add your own DAGs here
├── config/
│   └── airflow.cfg                   # Airflow configuration
├── docker/
│   └── data/                         # Docker persistent volumes
│       ├── airflow/                  # Airflow logs and metadata
│       └── postgres/                 # PostgreSQL database
├── plugins/                          # Airflow custom plugins
├── .github/workflows/
│   └── deploy_airflow.yml            # CI/CD pipeline
├── docker-compose.yml                # Docker Compose orchestration
├── Dockerfile                        # Custom Airflow image
├── requirements.txt                  # Python dependencies
└── README.md                         # This file
- Docker 20.10+ and Docker Compose 2.0+
- Git
- GitHub account with repository access
- 8GB+ RAM recommended for local testing
Hardware:
- Raspberry Pi 4 or newer (8GB+ RAM recommended) or any Linux server
- 32GB+ SD card or SSD (SSD highly recommended for performance)
- Stable internet connection
- Power supply and cooling (fan or heatsink)
Software:
- Ubuntu 20.04+ or Raspberry Pi OS (64-bit)
- Docker and Docker Compose installed
- SSH access enabled
- Snowflake Account: For data warehousing (free trial available)
- Cloudflare Tunnel: For secure remote access (free)
- Telegram Bot: For notifications (free)
- GitHub: For CI/CD automation (free)
- OpenAI API: For embeddings generation (pay-as-you-go)
- Pinecone: For vector database storage (free tier available)
git clone https://github.com/yourusername/airflow-self-hosted.git
cd airflow-self-hosted

Create a .env file in the project root:
# Airflow Configuration
AIRFLOW_UID=1000
_AIRFLOW_WWW_USER_USERNAME=admin
_AIRFLOW_WWW_USER_PASSWORD=your_secure_password_here
# Snowflake Configuration (if using Snowflake)
SNOWFLAKE_ACCOUNT=your_account.region
SNOWFLAKE_USER=your_username
SNOWFLAKE_PASSWORD=your_password
SNOWFLAKE_WAREHOUSE=COMPUTE_WH
SNOWFLAKE_DATABASE=your_database
SNOWFLAKE_SCHEMA=your_schema
SNOWFLAKE_ROLE=your_role
# Telegram Notifications
TELEGRAM_BOT_TOKEN=your_bot_token
TELEGRAM_CHAT_ID=your_chat_id
# OpenAI & Pinecone (for documentation scraper)
OPENAI_API_KEY=your_openai_api_key
PINECONE_API_KEY=your_pinecone_api_key
PINECONE_SF_INDEX_NAME=your_pinecone_sf_index_name
PINECONE_AWS_INDEX_NAME=your_pinecone_aws_index_name
# Build and start containers
docker compose up -d --build
# Wait for initialization (2-3 minutes)
docker compose logs -f airflow-init
# Access Airflow UI
# http://localhost:8080
# Username: admin
# Password: (from .env file)
# Check status
docker compose ps
# View logs
docker compose logs -f airflow-scheduler
# Stop containers (note: the -v flag also removes volumes, wiping the Airflow metadata DB)
docker compose down -v

Create a new DAG in the dags/ directory:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def my_task():
    print("Hello from my custom DAG!")
    # Your data engineering logic here


default_args = {
    'owner': 'dataops',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    'my_custom_dag',
    default_args=default_args,
    description='My custom data pipeline',
    schedule='@daily',  # or '0 * * * *' for hourly
    catchup=False,
    tags=['custom', 'my-project'],
) as dag:
    task = PythonOperator(
        task_id='my_task',
        python_callable=my_task,
    )

The DAG will automatically appear in the Airflow UI within seconds!
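For quick local debugging, Airflow 2.5+ also lets you run a DAG in-process without the scheduler via dag.test(); a minimal sketch, assuming you append this to the DAG file above:

```python
# Run the DAG once, in-process, without the scheduler (Airflow 2.5+).
# Handy for a quick sanity check before pushing to the Pi.
if __name__ == "__main__":
    dag.test()
```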
Set up GitHub Actions for automated deployment to your Raspberry Pi.
Go to your repository → Settings → Secrets and variables → Actions, and add:
Required Secrets:
AIRFLOW_UID # 1000
_AIRFLOW_WWW_USER_USERNAME # admin
_AIRFLOW_WWW_USER_PASSWORD # Strong password
RASPBERRY_PI_USER # SSH username
RASPBERRY_PI_PASSWORD # SSH password
CF_ACCESS_CLIENT_ID # Cloudflare Access Client ID
CF_ACCESS_CLIENT_SECRET # Cloudflare Access Client Secret
AIRFLOW_PATH # /home/user/airflow-self-hosted
TOKEN # GitHub PAT with repo access
Optional Secrets (for example DAGs):
SNOWFLAKE_ACCOUNT # If using Snowflake
SNOWFLAKE_USER
SNOWFLAKE_PASSWORD
SNOWFLAKE_WAREHOUSE
SNOWFLAKE_DATABASE
SNOWFLAKE_SCHEMA
SNOWFLAKE_ROLE
TELEGRAM_BOT_TOKEN # For notifications
TELEGRAM_CHAT_ID
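The Telegram secrets power failure alerts. A minimal sketch of an on_failure_callback posting to the Telegram Bot API; the helper name and message format here are illustrative, not this repo's actual code:

```python
# Sketch: send a Telegram message when a task fails.
# Assumes TELEGRAM_BOT_TOKEN and TELEGRAM_CHAT_ID are set in the environment.
import os
import requests

def notify_telegram_failure(context):
    """on_failure_callback: post a short alert via the Telegram Bot API."""
    ti = context["task_instance"]
    text = f"Task failed: {ti.dag_id}.{ti.task_id} (try {ti.try_number})"
    requests.post(
        f"https://api.telegram.org/bot{os.environ['TELEGRAM_BOT_TOKEN']}/sendMessage",
        json={"chat_id": os.environ["TELEGRAM_CHAT_ID"], "text": text},
        timeout=10,
    )

# Wire it into default_args so every task in a DAG reports failures:
# default_args = {..., 'on_failure_callback': notify_telegram_failure}
```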
The CI/CD pipeline automatically:
- Test Job (on every push to main):
- Builds Docker image
- Validates DAG syntax
- Runs health checks
- Cleans up resources
- Deploy Job (after successful test):
- Connects via Cloudflare Tunnel
- Transfers updated files
- Updates environment variables
- Restarts Docker containers
- Validates deployment
Simply push to main branch, and your Raspberry Pi will automatically update!
On your Raspberry Pi:
# Initial setup
cd ~
git clone https://github.com/yourusername/airflow-self-hosted.git
cd airflow-self-hosted
# Create .env file with your credentials
nano .env
# Start Airflow
docker compose up -d --build
# Check status
docker compose ps
docker compose logs -f airflow-webserver
# Access Airflow UI
# http://raspberry-pi-ip:8080

Updates:
cd ~/airflow-self-hosted
git pull origin main
docker compose down
docker compose up -d --build

┌──────────────────────────────────────────────────────────────┐
│                      Airflow Self-Hosted                     │
│                    (Raspberry Pi / Linux)                    │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐      │
│  │   Airflow    │   │   Airflow    │   │  PostgreSQL  │      │
│  │  Webserver   │   │  Scheduler   │   │   Database   │      │
│  │ (Port 8080)  │   │ (DAG Runner) │   │  (Metadata)  │      │
│  └──────────────┘   └──────────────┘   └──────────────┘      │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │            DAGs Directory (Your Pipelines)             │  │
│  │   • bitcoin_ohlcv_dataset.py                           │  │
│  │   • technical_indicators_dag.py                        │  │
│  │   • your_custom_dag.py                                 │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                              │
└──────────────────────────────┬───────────────────────────────┘
                               │
                               │ Data Flows
                               ▼
               ┌───────────────────────────────────┐
               │         External Services         │
               │  • Snowflake (Data Warehouse)     │
               │  • APIs (CryptoCompare, etc.)     │
               │  • Databases (MySQL, Postgres)    │
               │  • Cloud Storage (S3, GCS)        │
               └───────────────────────────────────┘
# docker-compose.yml settings
AIRFLOW__CORE__EXECUTOR: LocalExecutor # Lightweight
AIRFLOW__CORE__PARALLELISM: 8 # Max parallel tasks
AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG: 4 # Tasks per DAG
AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG: 1 # One run at a time
AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_SIZE: 3 # DB connections
AIRFLOW__API__WORKERS: 2                    # API workers

These settings ensure Airflow runs smoothly on a Raspberry Pi 4 with 4GB RAM.
# dags/my_project_dag.py
from datetime import datetime, timedelta

import requests

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook


def fetch_data(**context):
    """Fetch data from your API"""
    response = requests.get('https://api.example.com/data')
    data = response.json()
    context['ti'].xcom_push(key='raw_data', value=data)
    print(f"✅ Fetched {len(data)} records")


def transform_data(**context):
    """Transform your data"""
    raw_data = context['ti'].xcom_pull(key='raw_data')
    # Your transformation logic (process() is a placeholder for your own function)
    transformed = [process(item) for item in raw_data]
    context['ti'].xcom_push(key='transformed_data', value=transformed)
    print(f"✅ Transformed {len(transformed)} records")


def load_to_database(**context):
    """Load data to Snowflake or other DB"""
    data = context['ti'].xcom_pull(key='transformed_data')
    hook = SnowflakeHook(snowflake_conn_id='snowflake_default')
    # Your load logic, e.g. hook.insert_rows(...)
    print(f"✅ Loaded {len(data)} records to database")


default_args = {
    'owner': 'dataops',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'email_on_failure': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    'my_project_etl',
    default_args=default_args,
    description='ETL pipeline for my project',
    schedule='0 */6 * * *',  # Every 6 hours
    catchup=False,
    tags=['my-project', 'etl'],
) as dag:
    fetch = PythonOperator(
        task_id='fetch_data',
        python_callable=fetch_data,
    )
    transform = PythonOperator(
        task_id='transform_data',
        python_callable=transform_data,
    )
    load = PythonOperator(
        task_id='load_to_database',
        python_callable=load_to_database,
    )

    fetch >> transform >> load

# Add to requirements.txt
your-library==1.0.0
another-package>=2.0.0

# Commit and push
git add dags/my_project_dag.py requirements.txt
git commit -m "Add my project DAG"
git push origin main
# CI/CD will automatically deploy to your Raspberry Pi!

- Access Airflow UI: http://your-raspberry-pi:8080
- Enable your DAG
- Trigger a test run
- Monitor logs and task status
- ✅ Use environment variables for all credentials
- ✅ Store secrets in GitHub Secrets for CI/CD
- ✅ Never commit the .env file to the repository
- ✅ Rotate passwords every 90 days
- ✅ Use strong passwords (16+ characters, mixed case, numbers, symbols)
- ✅ Use Cloudflare Tunnel for SSH (no exposed ports)
- ✅ Configure UFW firewall on Raspberry Pi
- ✅ Use HTTPS for Airflow web UI (via reverse proxy)
- ✅ Restrict database access by IP when possible
- ✅ Keep Raspberry Pi OS and Docker updated
- ✅ Change default admin password immediately
- ✅ Enable RBAC (Role-Based Access Control)
- ✅ Use Fernet key encryption for connections (see the snippet after this list)
- ✅ Regular security updates via Docker image rebuilds
- ✅ Review DAG code before deployment
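Generating a Fernet key is a one-liner; this follows the approach in the Airflow docs (set the result as AIRFLOW__CORE__FERNET_KEY in your .env):

```python
# Generate a Fernet key for encrypting Airflow connection passwords.
# Put the printed value in your .env as AIRFLOW__CORE__FERNET_KEY.
from cryptography.fernet import Fernet

print(Fernet.generate_key().decode())
```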
# On Raspberry Pi
curl -L https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-arm64.deb -o cloudflared.deb
sudo dpkg -i cloudflared.deb
# Authenticate
cloudflared tunnel login
# Create tunnel
cloudflared tunnel create airflow-pi
# Configure tunnel
nano ~/.cloudflared/config.yml
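A minimal config.yml for SSH over the tunnel might look like this; the tunnel ID, hostname, and paths are placeholders for your own values:

```yaml
tunnel: <your-tunnel-id>
credentials-file: /home/pi/.cloudflared/<your-tunnel-id>.json

ingress:
  - hostname: ssh.yourdomain.com
    service: ssh://localhost:22
  # Catch-all rule, required at the end of the ingress list
  - service: http_status:404
```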
Airflow UI Dashboard:
- Check DAG run status
- Review task duration trends
- Monitor failure rates
- Check scheduler heartbeat
System Health:
# Check Docker containers
docker compose ps
# Check resource usage
docker stats
# Check disk space
df -h
# Check memory
free -h

# Review logs
docker compose logs --tail=100 airflow-scheduler
# Check for updates
docker compose pull
# Backup Airflow metadata
docker compose exec postgres pg_dump -U airflow > backup.sql
# Clean old logs (optional)
find docker/data/airflow/logs -mtime +30 -delete

- Review and optimize DAG performance
- Update Python dependencies
- Security audit and password rotation
- Review resource usage trends
- Plan capacity upgrades if needed
Key Indicators:
- DAG run duration: < 10 minutes (typical)
- Task success rate: > 99%
- Scheduler lag: < 1 second
- Memory usage: < 80% of available
- CPU usage: < 70% average
1. DAG Not Appearing in UI
# Check for parsing errors
docker compose exec airflow-webserver airflow dags list-import-errors
# Validate DAG syntax
docker compose exec airflow-webserver python /opt/airflow/dags/your_dag.py
# Restart scheduler
docker compose restart airflow-scheduler

2. Out of Memory on Raspberry Pi
# Check memory usage
free -h
# Reduce parallelism in docker-compose.yml
AIRFLOW__CORE__PARALLELISM: 4
AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG: 2
# Restart services
docker compose down
docker compose up -d

3. Connection to External Service Failed
# Test connection
docker compose exec airflow-webserver airflow connections test your_conn_id
# Check environment variables
docker compose exec airflow-webserver env | grep YOUR_VAR
# Update connection in Airflow UI
# Admin → Connections → Edit

4. Disk Space Full
# Check disk usage
df -h
# Clean Docker system
docker system prune -a
# Clean old logs
find docker/data/airflow/logs -mtime +30 -delete
# Clean old DAG runs (in Airflow UI)
# Browse → DAG Runs → Delete old runs

5. Scheduler Not Running
# Check scheduler status
docker compose ps airflow-scheduler
# View scheduler logs
docker compose logs airflow-scheduler
# Restart scheduler
docker compose restart airflow-scheduler

# Airflow logs
docker/data/airflow/logs/
# Scheduler logs
docker compose logs airflow-scheduler
# Webserver logs
docker compose logs airflow-webserver
# PostgreSQL logs
docker compose logs postgres
# Specific DAG run logs
docker/data/airflow/logs/dag_id=your_dag/run_id=*/task_id=*/

- Apache Airflow Documentation
- Docker Compose Documentation
- Raspberry Pi Documentation
- Cloudflare Tunnel Documentation
Purpose: Provide fresh market data for algorithmic trading
DAGs:
- bitcoin_ohlcv_dataset.py: Fetches Bitcoin OHLCV data with historical backfill
- technical_indicators_dag.py: Calculates 110+ technical indicators
Features:
- Intelligent historical data initialization from 2010
- Branching logic to skip backfill if data exists
- Batch processing with rate limiting
- Delta updates for recent data
Schedule: Daily at 00:05 UTC
Data Destination: Snowflake data warehouse
Downstream Use: ML model training, backtesting, live trading
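The backfill-skipping behavior relies on branching. A minimal sketch of the pattern with Airflow's BranchPythonOperator; the task IDs and the existence check are illustrative, not the repo's actual code:

```python
# Sketch: branch to a full backfill only when no data exists yet.
# table_has_data() is a hypothetical check (e.g. a COUNT(*) against Snowflake).
from airflow.operators.python import BranchPythonOperator

def choose_path(**context):
    if table_has_data():
        return 'fetch_recent_data'    # delta update only
    return 'historical_backfill'      # one-time backfill from 2010

branch = BranchPythonOperator(
    task_id='check_existing_data',
    python_callable=choose_path,
)
# branch >> [historical_backfill, fetch_recent_data]
```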
Purpose: Create searchable vector database of Snowflake documentation
DAG:
snowflake_docs_db_dag.py: Scrapes and vectorizes documentation
Features:
- Delta-mode scraping (only new/changed URLs)
- Parallel web scraping with ThreadPoolExecutor
- OpenAI embeddings for semantic search
- Pinecone vector storage for RAG applications
- Airflow Variable tracking of scraped URLs
Schedule: Weekly on Sunday at 2 AM
Data Destination: Pinecone vector database
Downstream Use: SnowPro Core exam preparation, RAG chatbot
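The embed-and-upsert step can be sketched with the OpenAI and Pinecone Python clients. The embedding model and metadata fields below are assumptions, and the real DAG additionally tracks scraped URLs in an Airflow Variable for delta mode:

```python
# Sketch: embed scraped page text with OpenAI and upsert into Pinecone.
import os

from openai import OpenAI
from pinecone import Pinecone

def embed_and_upsert(pages: dict[str, str]) -> None:
    """pages maps a URL to its scraped text."""
    openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
    index = pc.Index(os.environ["PINECONE_SF_INDEX_NAME"])

    urls = list(pages)
    result = openai_client.embeddings.create(
        model="text-embedding-3-small",  # assumed model choice
        input=[pages[u] for u in urls],
    )
    # Use the URL as the vector ID so re-scrapes overwrite stale entries.
    index.upsert(vectors=[
        {"id": url, "values": item.embedding, "metadata": {"url": url}}
        for url, item in zip(urls, result.data)
    ])
```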
Examples:
- Web scraping for price monitoring
- Social media sentiment analysis
- IoT sensor data aggregation
- Personal finance tracking
- Weather data collection
- News aggregation and analysis
- Email report automation
- Database backup automation
- Clone repository
- Create .env file with credentials
- Start Airflow locally: docker compose up -d
- Access UI at http://localhost:8080
- Review example DAGs
- Create your first custom DAG
- Test DAG locally
- Configure GitHub Secrets
- Deploy to Raspberry Pi
- Set up monitoring and alerts
- Add more projects as needed!
This project is licensed under the MIT License - feel free to use it for your personal data engineering projects!
- Apache Airflow community for excellent orchestration platform
- Raspberry Pi Foundation for affordable computing
- Docker team for containerization technology
- Cloudflare for secure tunnel solution
- Open source community for inspiration and support
Project Type: Self-Hosted Data Engineering Infrastructure
Primary Use: Personal data pipeline orchestration
Example Projects: Trading agent, web scraping, IoT, analytics
Status: Production Ready ✅
Last Updated: February 2, 2026
Version: 1.1.0