Global news analytics with GDELT + AI + modern data stack
A full-stack data engineering project that ingests, processes, and visualizes 100,000+ daily global news events from the GDELT Project. Includes AI chat for natural language queries and a live analytics dashboard.
| Metric | Value |
|---|---|
| Cumulative Events | 16M+ processed |
| Daily Ingestion | 100K+ events/day |
| Data History | 3.5+ months live data |
| Languages | 100+ monitored |
| Countries | 200+ covered |
| Query Speed | <1 second |
| Monthly Cost | $0 |
The GDELT Project monitors the world's news media from nearly every country in 100+ languages, identifying people, locations, themes, and emotions driving global society.
| Feature | Description |
|---|---|
| 📊 Real-Time Dashboard | Live metrics, trending news, sentiment analysis, geographic distribution |
| 🧠 Emotion Analytics | GKG-powered emotion tracking: Fear, Joy, Positive/Negative, Global Mood Index |
| 🤖 AI Chat Interface | Ask questions in plain English → Get SQL-powered answers |
| ⚡ 15-Min Updates | GitHub Actions (15-min cron) + Dagster job runner (CLI subprocess) |
| 🔍 Data Quality Gates | Custom validation checks stop bad data before it is loaded |
| 🌍 Global Coverage | Events from 200+ countries with country code mapping |
| 📈 Trend Analysis | 30-day time series, intensity tracking, actor monitoring |
| 🔥 Trending Topics | AI-extracted themes from global news (GKG) |
| 🎨 Dark Mode UI | Custom dark theme, responsive Plotly charts |
┌─────────────────────────────────────────────────────────────────────────┐
│ PRODUCTION PIPELINE ARCHITECTURE │
└─────────────────────────────────────────────────────────────────────────┘
┌──────────────┐ ┌──────────────┐
│ GDELT Events │ │ GDELT GKG │
└──────┬───────┘ └──────┬───────┘
│ │
└────────────┬────────────┘
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ INGESTION (Every 15 min) │
│ GitHub Actions → Dagster → Polars (10x faster) → custom validation │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ TRANSFORMATION │
│ dbt Core: staging (stg_events) → marts (fct_daily, dim_actors, etc.) │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ STORAGE & AI │
│ MotherDuck (DWH) ← Voyage AI (Embeddings) → Cerebras LLM (RAG/SQL) │
│ └── gkg_emotions: Fear, Joy, Tone, Topics │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ PRESENTATION │
│ Streamlit: HOME | FEED | EMOTIONS | AI Chat | ABOUT │
└─────────────────────────────────────────────────────────────────────────┘
- Extract: GDELT Events API + GKG Feed → Polars (10x faster than Pandas)
- Validate: Custom schema + threshold data quality checks
- Load: Deduplicated data into MotherDuck (serverless DuckDB)
- Transform: dbt models create staging views and mart tables
- Emotions: GKG data → Extract tone, fear, joy, topics (rolling 24h)
- Embed: Voyage AI generates vectors every 12 hours
- Serve: Streamlit dashboard with AI chat (SQL + RAG modes)
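The Validate and Load steps above can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the code in `etl/pipeline_polars.py`: the function names, the `ValidationReport` class, and the threshold values are hypothetical. The column names come from the public GDELT event schema, but the columns and rules the project actually checks are not stated here.

```python
from dataclasses import dataclass, field

# Hypothetical schema: column name -> expected Python type.
# The project's real validator may check different columns and rules.
REQUIRED_COLUMNS = {
    "GLOBALEVENTID": int,
    "SQLDATE": int,
    "GoldsteinScale": float,
}


@dataclass
class ValidationReport:
    passed: bool
    errors: list = field(default_factory=list)


def validate_batch(rows, max_null_ratio=0.05, min_rows=1):
    """Run a schema check and two threshold checks on a batch of event dicts."""
    errors = []
    # Threshold check 1: the batch must not be (nearly) empty.
    if len(rows) < min_rows:
        errors.append(f"batch has {len(rows)} rows, expected at least {min_rows}")
        return ValidationReport(False, errors)
    for column, expected_type in REQUIRED_COLUMNS.items():
        values = [row.get(column) for row in rows]
        # Threshold check 2: too many missing values in a required column.
        null_ratio = sum(v is None for v in values) / len(rows)
        if null_ratio > max_null_ratio:
            errors.append(f"{column}: null ratio {null_ratio:.2f} exceeds {max_null_ratio}")
        # Schema check: every non-null value must have the expected type.
        if any(v is not None and not isinstance(v, expected_type) for v in values):
            errors.append(f"{column}: expected {expected_type.__name__}")
    return ValidationReport(not errors, errors)


def deduplicate(rows, key="GLOBALEVENTID"):
    """Keep the first row seen for each event ID."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


# Made-up sample batch containing one duplicate event.
batch = [
    {"GLOBALEVENTID": 1, "SQLDATE": 20240101, "GoldsteinScale": 2.5},
    {"GLOBALEVENTID": 1, "SQLDATE": 20240101, "GoldsteinScale": 2.5},
    {"GLOBALEVENTID": 2, "SQLDATE": 20240101, "GoldsteinScale": -4.0},
]
report = validate_batch(batch)
clean = deduplicate(batch) if report.passed else []
```

Validating before deduplicating means a malformed batch is rejected as a whole instead of being partially loaded. In the real pipeline the same checks would presumably be expressed as Polars expressions rather than Python loops.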
| Tool | Purpose | Replaces |
|---|---|---|
| Polars | High-performance DataFrame processing (10x faster) | Pandas |
| dbt Core | SQL transformations with staging/marts pattern | Raw SQL |
| DataQualityValidator | Custom schema + threshold validation & testing | Manual checks |
| Dagster | Pipeline orchestration with asset-based design | Apache Airflow |
| DuckDB/MotherDuck | Serverless cloud OLAP warehouse | Snowflake/Redshift |
| GitHub Actions | CI/CD with 15-min + 12-hr scheduled jobs | AWS Lambda |
| Tool | Purpose | Replaces |
|---|---|---|
| Cerebras | LLM inference (Llama 3.1 8B) | OpenAI GPT-4 |
| LlamaIndex | Text-to-SQL query engine | Custom NLP |
| Voyage AI | Vector embeddings for RAG | OpenAI Embeddings |
| MotherDuck Vectors | Native vector similarity search | Pinecone / Weaviate |
| Tool | Purpose | Replaces |
|---|---|---|
| Streamlit | Interactive dashboard framework | Tableau / Power BI |
| Plotly | Dynamic charts and visualizations | D3.js / Chart.js |
- Python (Polars, Pandas, RegEx, API integration)
- SQL (Complex queries, window functions, dbt models)
- Data Quality (custom validation, schema testing)
- ELT Pipelines (Extract, Load, Transform with dbt)
- CI/CD (GitHub Actions, cron scheduling)
- Vector Search (Embeddings, cosine similarity, RAG)
- Python 3.10+
- MotherDuck Account (free tier)
- Cerebras API Key (free tier)
# Clone the repository
git clone https://github.com/Mohith-akash/Global-News-Intel-Platform.git
cd Global-News-Intel-Platform
# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/Mac
# or
.\venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
Create a .env file in the project root:
MOTHERDUCK_TOKEN=your_motherduck_token
CEREBRAS_API_KEY=your_cerebras_api_key
VOYAGE_API_KEY=your_voyage_api_key # Optional: enables RAG mode
streamlit run app.py
# Polars-powered ingestion (15-min schedule)
python -m dagster job execute -f etl/pipeline_polars.py -j gdelt_ingestion_job
# Embedding generation (12-hour schedule)
python -m dagster job execute -f etl/embedding_job.py -j gdelt_embedding_job
# Run dbt models
cd dbt && dbt run
This project demonstrates how to achieve enterprise-grade capabilities at zero cost:
| Enterprise Tool | Monthly Cost | My Alternative | My Cost |
|---|---|---|---|
| Databricks/Spark | ~$500 | DuckDB | $0 |
| Snowflake/BigQuery | ~$300 | MotherDuck | $0 |
| Managed Airflow | ~$300 | Dagster + GitHub Actions | $0 |
| dbt Cloud | ~$100 | dbt Core (self-hosted) | $0 |
| Pinecone/Weaviate | ~$70 | MotherDuck Vectors | $0 |
| OpenAI Embeddings | ~$50 | Voyage AI | $0 |
| OpenAI GPT-4 | ~$100 | Cerebras | $0 |
| Tableau/Power BI | ~$70 | Streamlit | $0 |
| TOTAL | ~$1,490+ | | $0 |
Key Insight: MotherDuck's native vector search eliminates the need for a separate vector database like Pinecone.
This project evolved through multiple iterations to optimize for cost and performance:
❄️ Snowflake (trial) → 🦆 MotherDuck (free tier)
- Started with Snowflake trial for learning enterprise DWH
- Migrated to MotherDuck to eliminate costs while keeping SQL compatibility
✨ Gemini 2.0/2.5 Flash → ⚡ Groq (Llama 3.3 70B) → 🧠 Cerebras (Llama 3.1 8B)
- Tested Gemini models for natural language queries
- Tried Groq's fast inference with larger Llama models
- Settled on Cerebras for reliable free tier and good performance
🚀 Voyage AI (embeddings) + 🦆 MotherDuck (vector search)
- Voyage AI creates 1024-dim embeddings for semantic search
- MotherDuck's native `array_cosine_similarity()` replaces Pinecone
- Dual-mode AI: SQL for precise queries, RAG for semantic exploration
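To make the vector search step concrete, here is a small Python sketch of what a cosine-similarity ranking computes. It is an illustration only: in this project the comparison runs inside MotherDuck via `array_cosine_similarity()`, not in Python. The helper names `cosine_similarity` and `top_k` are hypothetical, and the sample uses made-up 3-dimensional vectors in place of the 1024-dimensional Voyage AI embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_k(query_vec, documents, k=3):
    """Rank (headline, embedding) pairs by similarity to the query vector."""
    scored = [(cosine_similarity(query_vec, emb), headline) for headline, emb in documents]
    # Sort on the score only, highest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data: made-up headlines with 3-dim stand-in embeddings.
documents = [
    ("headline A", [1.0, 0.0, 0.0]),
    ("headline B", [0.0, 1.0, 0.0]),
    ("headline C", [0.9, 0.1, 0.0]),
]
results = top_k([1.0, 0.0, 0.0], documents, k=2)
```

Because cosine similarity ignores vector length and compares direction only, it suits embeddings, where direction carries the semantic signal. Pushing the computation into the warehouse avoids copying vectors out to a separate service.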
Key Learning: The best tool isn't always the most expensive—it's the one that solves your problem within constraints.
gdelt_project/
├── app.py # Streamlit dashboard entry point
├── src/ # Core modules
│ ├── config.py # Configuration constants
│ ├── database.py # Database connection
│ ├── queries.py # SQL query functions
│ ├── ai_engine.py # LLM/AI setup (Cerebras + LlamaIndex)
│ ├── rag_engine.py # RAG engine (Voyage AI + vector search)
│ ├── data_processing.py # Headline extraction
│ ├── utils.py # Utility functions
│ └── styles.py # CSS styling
├── etl/ # Data pipeline
│ ├── pipeline_polars.py # 🆕 Polars ingestion + custom validation
│ └── embedding_job.py # 🆕 12-hour embedding generation
├── dbt/ # 🆕 dbt transformation layer
│ ├── dbt_project.yml # dbt configuration
│ ├── profiles.yml # MotherDuck connection
│ └── models/
│     ├── staging/ # stg_events (cleaned data)
│     └── marts/ # fct_daily_events, dim_actors, dim_countries
├── components/ # UI components
│ ├── render.py # Dashboard rendering
│ ├── ai_chat.py # AI chat interface
│ └── about.py # About page
├── requirements.txt # Python dependencies
├── .env # Environment variables (not in repo)
└── .github/workflows/
    ├── gdelt_ingest_15min.yml # 🆕 15-min Polars ingestion
    └── gdelt_embeddings_12hr.yml # 🆕 12-hour embedding job
- Add dbt transformations for advanced modeling ✅ Done!
- Upgrade to Polars for faster processing ✅ Done!
- Add data quality validation ✅ Done!
- Implement event clustering with ML
- Add email/Slack alerts for crisis events
- Expand AI chat with multi-turn conversations
- Add export functionality (CSV, PDF reports)
Contributions are welcome! Please feel free to submit a Pull Request.
Mohith Akash
This project is licensed under the MIT License - see the LICENSE file for details.
Built with ☕ and curiosity • Data sourced from GDELT Project





