
Mohith-akash/Global-News-Intel-Platform



🌐 Global News Intelligence Platform

Global news analytics with GDELT + AI + modern data stack

Live Demo · Features · Architecture · Tech Stack · Quick Start · Cost Efficiency


🎯 Overview

A full-stack data engineering project that ingests, processes, and visualizes 100,000+ daily global news events from the GDELT Project. Includes AI chat for natural language queries and a live analytics dashboard.

📊 By the Numbers

| Metric | Value |
| --- | --- |
| Cumulative Events | 16M+ processed |
| Daily Ingestion | 100K+ events/day |
| Data History | 3.5+ months live data |
| Languages | 100+ monitored |
| Countries | 200+ covered |
| Query Speed | <1 second |
| Monthly Cost | $0 |

What is GDELT?

The GDELT Project monitors the world's news media from nearly every country in 100+ languages, identifying people, locations, themes, and emotions driving global society.


📸 Dashboard Preview

[Screenshot] Home - KPIs & Trending News

[Screenshot] Emotions - GKG Mood Analysis (NEW!)

[Screenshot] Analytics - Actors & Countries

[Screenshot] AI Chat - Natural Language Queries

[Screenshot] RAG Chat - AI Analysis of World Events

[Screenshot] Feed - Event Stream


✨ Features

| Feature | Description |
| --- | --- |
| 📊 Real-Time Dashboard | Live metrics, trending news, sentiment analysis, geographic distribution |
| 🧠 Emotion Analytics | GKG-powered emotion tracking: Fear, Joy, Positive/Negative, Global Mood Index |
| 🤖 AI Chat Interface | Ask questions in plain English → Get SQL-powered answers |
| ⚡ 15-Min Updates | GitHub Actions (15-min cron) + Dagster job runner (CLI subprocess) |
| 🔍 Data Quality Gates | Custom data validation prevents bad data |
| 🌍 Global Coverage | Events from 200+ countries with country code mapping |
| 📈 Trend Analysis | 30-day time series, intensity tracking, actor monitoring |
| 🔥 Trending Topics | AI-extracted themes from global news (GKG) |
| 🎨 Dark Mode UI | Custom dark theme, responsive Plotly charts |
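
The 15-minute update cadence lines up with GDELT 2.0's 15-minute publication slots. The sketch below shows one way a scheduled run could work out which slot it belongs to. The helper names and the `YYYYMMDDHHMMSS.export.CSV.zip` file-name pattern are assumptions for illustration, not taken from this repository, so check them against the GDELT documentation before relying on them.

```python
from datetime import datetime, timezone

SLOT_MINUTES = 15  # cadence of the scheduled ingestion described above


def floor_to_slot(ts: datetime) -> datetime:
    """Round a timezone-aware timestamp down to the start of its 15-minute slot (UTC)."""
    ts = ts.astimezone(timezone.utc)
    floored_minute = ts.minute - ts.minute % SLOT_MINUTES
    return ts.replace(minute=floored_minute, second=0, microsecond=0)


def export_filename(ts: datetime) -> str:
    """Hypothetical helper: file name for the slot containing ts.

    Assumes the YYYYMMDDHHMMSS.export.CSV.zip naming pattern.
    """
    return floor_to_slot(ts).strftime("%Y%m%d%H%M%S") + ".export.CSV.zip"


if __name__ == "__main__":
    print(export_filename(datetime.now(timezone.utc)))
```

Flooring to the slot boundary keeps a run that starts a few minutes late pointed at the same file as an on-time run.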

🏗️ Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                         PRODUCTION PIPELINE ARCHITECTURE                 │
└─────────────────────────────────────────────────────────────────────────┘

              ┌──────────────┐          ┌──────────────┐
              │ GDELT Events │          │  GDELT GKG   │
              └──────┬───────┘          └──────┬───────┘
                     │                         │
                     └────────────┬────────────┘
                                  ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  INGESTION (Every 15 min)                                                │
│  GitHub Actions → Dagster → Polars (10x faster) → custom validation      │
└─────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  TRANSFORMATION                                                          │
│  dbt Core: staging (stg_events) → marts (fct_daily, dim_actors, etc.)   │
└─────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  STORAGE & AI                                                            │
│  MotherDuck (DWH) ← Voyage AI (Embeddings) → Cerebras LLM (RAG/SQL)     │
│  └── gkg_emotions: Fear, Joy, Tone, Topics                              │
└─────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  PRESENTATION                                                            │
│  Streamlit: HOME | FEED | EMOTIONS | AI Chat | ABOUT                    │
└─────────────────────────────────────────────────────────────────────────┘

Data Flow (ELT Pipeline)

  1. Extract: GDELT Events API + GKG Feed → Polars (10x faster than Pandas)
  2. Validate: Custom schema + threshold data quality checks
  3. Load: Deduplicated data into MotherDuck (serverless DuckDB)
  4. Transform: dbt models create staging views and mart tables
  5. Emotions: GKG data → Extract tone, fear, joy, topics (rolling 24h)
  6. Embed: Voyage AI generates vectors every 12 hours
  7. Serve: Streamlit dashboard with AI chat (SQL + RAG modes)
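
Steps 2 and 3 (validate, then load deduplicated data) can be sketched in plain Python. This is a minimal illustration, not the project's `DataQualityValidator`: the real pipeline uses Polars, and the required columns, the 5% threshold, and the function names here are assumptions. The column names follow the GDELT event schema.

```python
from dataclasses import dataclass, field

# Assumed subset of GDELT event columns that every row must carry.
REQUIRED_COLUMNS = ("GLOBALEVENTID", "SQLDATE", "AvgTone")
# Hypothetical threshold: fail the batch if more than 5% of a required column is missing.
MAX_NULL_RATE = 0.05


@dataclass
class ValidationReport:
    passed: bool
    errors: list = field(default_factory=list)


def validate_batch(rows: list) -> ValidationReport:
    """Check required columns against a missing-value threshold."""
    if not rows:
        return ValidationReport(False, ["batch is empty"])
    errors = []
    for col in REQUIRED_COLUMNS:
        missing = sum(1 for row in rows if row.get(col) in (None, ""))
        rate = missing / len(rows)
        if rate > MAX_NULL_RATE:
            errors.append(f"{col}: {rate:.0%} missing exceeds {MAX_NULL_RATE:.0%}")
    return ValidationReport(not errors, errors)


def deduplicate(rows: list, key: str = "GLOBALEVENTID") -> list:
    """Keep the first row seen for each event ID, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row.get(key) not in seen:
            seen.add(row.get(key))
            unique.append(row)
    return unique
```

Running the validation before the load step means a bad batch can be rejected as a whole, so partial data never reaches the warehouse.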

🛠️ Tech Stack

Data Engineering

| Tool | Purpose | Replaces |
| --- | --- | --- |
| Polars | High-performance DataFrame processing (10x faster) | Pandas |
| dbt Core | SQL transformations with staging/marts pattern | Raw SQL |
| DataQualityValidator | Custom schema + threshold validation & testing | Manual checks |
| Dagster | Pipeline orchestration with asset-based design | Apache Airflow |
| DuckDB/MotherDuck | Serverless cloud OLAP warehouse | Snowflake/Redshift |
| GitHub Actions | CI/CD with 15-min + 12-hr scheduled jobs | AWS Lambda |

AI/ML

| Tool | Purpose | Replaces |
| --- | --- | --- |
| Cerebras | LLM inference (Llama 3.1 8B) | OpenAI GPT-4 |
| LlamaIndex | Text-to-SQL query engine | Custom NLP |
| Voyage AI | Vector embeddings for RAG | OpenAI Embeddings |
| MotherDuck Vectors | Native vector similarity search | Pinecone / Weaviate |

Frontend

| Tool | Purpose | Replaces |
| --- | --- | --- |
| Streamlit | Interactive dashboard framework | Tableau / Power BI |
| Plotly | Dynamic charts and visualizations | D3.js / Chart.js |

Skills Demonstrated

  • Python (Polars, Pandas, RegEx, API integration)
  • SQL (Complex queries, window functions, dbt models)
  • Data Quality (custom data quality validation, schema testing)
  • ELT Pipelines (Extract, Load, Transform with dbt)
  • CI/CD (GitHub Actions, cron scheduling)
  • Vector Search (Embeddings, cosine similarity, RAG)

🚀 Quick Start

Prerequisites

Installation

# Clone the repository
git clone https://github.com/Mohith-akash/Global-News-Intel-Platform.git
cd Global-News-Intel-Platform

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or
.\venv\Scripts\activate   # Windows

# Install dependencies
pip install -r requirements.txt

Configuration

Create a .env file in the project root:

MOTHERDUCK_TOKEN=your_motherduck_token
CEREBRAS_API_KEY=your_cerebras_api_key
VOYAGE_API_KEY=your_voyage_api_key  # Optional: enables RAG mode
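
A startup check for these settings might look like the sketch below. It only inspects environment variables, so it assumes something else (for example a dotenv loader) has already read the `.env` file into the environment. The `check_config` name and the returned dictionary are hypothetical and not part of the repository.

```python
import os
from typing import Mapping

REQUIRED_KEYS = ("MOTHERDUCK_TOKEN", "CEREBRAS_API_KEY")
OPTIONAL_RAG_KEY = "VOYAGE_API_KEY"  # optional: enables RAG mode when present


def check_config(env: Mapping[str, str] = os.environ) -> dict:
    """Fail fast on missing required keys and report whether RAG mode is available."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    return {"rag_enabled": bool(env.get(OPTIONAL_RAG_KEY))}
```

Passing the environment as a parameter keeps the check easy to test: `check_config({"MOTHERDUCK_TOKEN": "x", "CEREBRAS_API_KEY": "y"})` reports RAG as disabled without touching the real environment.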

Run the Dashboard

streamlit run app.py

Run the Pipeline Manually

# Polars-powered ingestion (15-min schedule)
python -m dagster job execute -f etl/pipeline_polars.py -j gdelt_ingestion_job

# Embedding generation (12-hour schedule)
python -m dagster job execute -f etl/embedding_job.py -j gdelt_embedding_job

# Run dbt models
cd dbt && dbt run
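
The same Dagster commands can be launched from Python as a CLI subprocess, which is roughly what a scheduled runner would do. This is a sketch under assumptions: the `JOBS` mapping, `build_command`, and `run_job` are illustrative names, while the file paths and job names are the ones listed above.

```python
import subprocess
import sys

# Short name -> (job file, job name), copied from the commands above.
JOBS = {
    "ingest": ("etl/pipeline_polars.py", "gdelt_ingestion_job"),
    "embed": ("etl/embedding_job.py", "gdelt_embedding_job"),
}


def build_command(name: str) -> list:
    """Build the `python -m dagster job execute` argument list for a job."""
    path, job = JOBS[name]
    return [sys.executable, "-m", "dagster", "job", "execute", "-f", path, "-j", job]


def run_job(name: str) -> int:
    """Run the job as a subprocess and return its exit code."""
    completed = subprocess.run(build_command(name), check=False)
    return completed.returncode


if __name__ == "__main__":
    sys.exit(run_job("ingest"))
```

Using `sys.executable` runs Dagster with the same interpreter as the caller, so it resolves inside the active virtual environment.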

💰 Enterprise Tools vs My Stack

This project demonstrates how to achieve enterprise-grade capabilities at zero cost:

| Enterprise Tool | Monthly Cost | My Alternative | My Cost |
| --- | --- | --- | --- |
| Databricks/Spark | ~$500 | DuckDB | $0 |
| Snowflake/BigQuery | ~$300 | MotherDuck | $0 |
| Managed Airflow | ~$300 | Dagster + GitHub Actions | $0 |
| dbt Cloud | ~$100 | dbt Core (self-hosted) | $0 |
| Pinecone/Weaviate | ~$70 | MotherDuck Vectors | $0 |
| OpenAI Embeddings | ~$50 | Voyage AI | $0 |
| OpenAI GPT-4 | ~$100 | Cerebras | $0 |
| Tableau/Power BI | ~$70 | Streamlit | $0 |
| TOTAL | $1,490+ | | $0 |

Key Insight: MotherDuck's native vector search eliminates the need for a separate vector database like Pinecone.


🔄 Technology Evolution

This project evolved through multiple iterations to optimize for cost and performance:

Data Warehouse

❄️ Snowflake (trial) → 🦆 MotherDuck (free tier)
  • Started with Snowflake trial for learning enterprise DWH
  • Migrated to MotherDuck to eliminate costs while keeping SQL compatibility

AI/LLM Provider

✨ Gemini 2.0/2.5 Flash → ⚡ Groq (Llama 3.3 70B) → 🧠 Cerebras (Llama 3.1 8B)
  • Tested Gemini models for natural language queries
  • Tried Groq's fast inference with larger Llama models
  • Settled on Cerebras for reliable free tier and good performance

RAG Embeddings

🚀 Voyage AI (embeddings) + 🦆 MotherDuck (vector search)
  • Voyage AI creates 1024-dim embeddings for semantic search
  • MotherDuck's native array_cosine_similarity() replaces Pinecone
  • Dual-mode AI: SQL for precise queries, RAG for semantic exploration
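
To make the retrieval step concrete, here is a plain-Python sketch of the cosine similarity that `array_cosine_similarity()` computes inside the database, plus a simple top-k ranking. In the project the ranking runs as SQL in MotherDuck over 1024-dimensional vectors. The function names and the tiny 2-dimensional vectors below are illustrative only.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def top_k(query, documents: dict, k: int = 3) -> list:
    """Return the ids of the k documents most similar to the query vector."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in documents.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


if __name__ == "__main__":
    docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(top_k([1.0, 0.0], docs, k=2))
```

Cosine similarity compares direction rather than magnitude, so two headlines with similar meaning score close to 1 regardless of the scale of their vectors.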

Key Learning: The best tool isn't always the most expensive; it's the one that solves your problem within constraints.


📁 Project Structure

gdelt_project/
├── app.py                    # Streamlit dashboard entry point
├── src/                      # Core modules
│   ├── config.py             # Configuration constants
│   ├── database.py           # Database connection
│   ├── queries.py            # SQL query functions
│   ├── ai_engine.py          # LLM/AI setup (Cerebras + LlamaIndex)
│   ├── rag_engine.py         # RAG engine (Voyage AI + vector search)
│   ├── data_processing.py    # Headline extraction
│   ├── utils.py              # Utility functions
│   └── styles.py             # CSS styling
├── etl/                      # Data pipeline
│   ├── pipeline_polars.py    # 🆕 Polars ingestion + custom validation
│   └── embedding_job.py      # 🆕 12-hour embedding generation
├── dbt/                      # 🆕 dbt transformation layer
│   ├── dbt_project.yml       # dbt configuration
│   ├── profiles.yml          # MotherDuck connection
│   └── models/
│       ├── staging/          # stg_events (cleaned data)
│       └── marts/            # fct_daily_events, dim_actors, dim_countries
├── components/               # UI components
│   ├── render.py             # Dashboard rendering
│   ├── ai_chat.py            # AI chat interface
│   └── about.py              # About page
├── requirements.txt          # Python dependencies
├── .env                      # Environment variables (not in repo)
└── .github/workflows/
    ├── gdelt_ingest_15min.yml    # 🆕 15-min Polars ingestion
    └── gdelt_embeddings_12hr.yml # 🆕 12-hour embedding job

🔮 Future Enhancements

  • Add dbt transformations for advanced modeling ✅ Done!
  • Upgrade to Polars for faster processing ✅ Done!
  • Add data quality validation ✅ Done!
  • Implement event clustering with ML
  • Add email/Slack alerts for crisis events
  • Expand AI chat with multi-turn conversations
  • Add export functionality (CSV, PDF reports)

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.


📬 Contact

Mohith Akash

GitHub LinkedIn


📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Built with ☕ and curiosity • Data sourced from GDELT Project

About

AI-powered geopolitical news intelligence platform. Ingests 100K+ daily events from GDELT, stores in MotherDuck (DuckDB), orchestrates with Dagster, and features an AI chat interface with Text-to-SQL. Full data engineering stack at $0/month.
