🌐 Global News Intelligence Platform

Global news analytics with GDELT + AI + modern data stack

Live Demo • Features • Architecture • Tech Stack • Quick Start • Cost Efficiency

🎯 Overview

A full-stack data engineering project that ingests, processes, and visualizes 100,000+ daily global news events from the GDELT Project. Includes AI chat for natural language queries and a live analytics dashboard.

📊 By the Numbers

Metric	Value
Cumulative Events	16M+ processed
Daily Ingestion	100K+ events/day
Data History	3.5+ months live data
Languages	100+ monitored
Countries	200+ covered
Query Speed	<1 second
Monthly Cost	$0

What is GDELT?

The GDELT Project monitors the world's news media from nearly every country in 100+ languages, identifying people, locations, themes, and emotions driving global society.

📸 Dashboard Preview

Home - KPIs & Trending News

Emotions - GKG Mood Analysis (NEW!)

Analytics - Actors & Countries

AI Chat - Natural Language Queries

RAG Chat - AI Analysis of World Events

Feed - Event Stream

✨ Features

Feature	Description
📊 Real-Time Dashboard	Live metrics, trending news, sentiment analysis, geographic distribution
🧠 Emotion Analytics	GKG-powered emotion tracking: Fear, Joy, Positive/Negative, Global Mood Index
🤖 AI Chat Interface	Ask questions in plain English → Get SQL-powered answers
⚡ 15-Min Updates	GitHub Actions (15-min cron) + Dagster job runner (CLI subprocess)
🔍 Data Quality Gates	Custom schema and threshold checks reject bad batches before they are loaded
🌍 Global Coverage	Events from 200+ countries with country code mapping
📈 Trend Analysis	30-day time series, intensity tracking, actor monitoring
🔥 Trending Topics	AI-extracted themes from global news (GKG)
🎨 Dark Mode UI	Custom dark theme, responsive Plotly charts
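
The 15-minute update cadence means each run targets one 15-minute batch. Below is a minimal sketch of how a run might compute the latest batch boundary and the matching file names. The URL pattern follows GDELT 2.0's public file naming; whether this project downloads those exact files is an assumption, and `latest_slot` and `batch_urls` are hypothetical helper names.

```python
from datetime import datetime, timezone

# Assumption: GDELT 2.0 publishes one Events file and one GKG file per
# 15-minute slot, named by the slot's UTC timestamp.
GDELT_BASE = "http://data.gdeltproject.org/gdeltv2"


def latest_slot(now: datetime) -> datetime:
    """Round a UTC timestamp down to the nearest 15-minute boundary."""
    return now.replace(minute=now.minute - now.minute % 15, second=0, microsecond=0)


def batch_urls(slot: datetime) -> dict[str, str]:
    """Build the Events and GKG file URLs for one 15-minute slot."""
    stamp = slot.strftime("%Y%m%d%H%M%S")
    return {
        "events": f"{GDELT_BASE}/{stamp}.export.CSV.zip",
        "gkg": f"{GDELT_BASE}/{stamp}.gkg.csv.zip",
    }


slot = latest_slot(datetime(2024, 5, 1, 12, 37, 42, tzinfo=timezone.utc))
urls = batch_urls(slot)
```

The newest slot may not be published yet when a job fires, so a real run would likely fall back to the previous slot or retry.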

🏗️ Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                         PRODUCTION PIPELINE ARCHITECTURE                 │
└─────────────────────────────────────────────────────────────────────────┘

              ┌──────────────┐          ┌──────────────┐
              │ GDELT Events │          │  GDELT GKG   │
              └──────┬───────┘          └──────┬───────┘
                     │                         │
                     └────────────┬────────────┘
                                  ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  INGESTION (Every 15 min)                                                │
│  GitHub Actions → Dagster → Polars (10x faster) → custom validation      │
└─────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  TRANSFORMATION                                                          │
│  dbt Core: staging (stg_events) → marts (fct_daily, dim_actors, etc.)   │
└─────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  STORAGE & AI                                                            │
│  MotherDuck (DWH) ← Voyage AI (Embeddings) → Cerebras LLM (RAG/SQL)     │
│  └── gkg_emotions: Fear, Joy, Tone, Topics                              │
└─────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  PRESENTATION                                                            │
│  Streamlit: HOME | FEED | EMOTIONS | AI Chat | ABOUT                    │
└─────────────────────────────────────────────────────────────────────────┘

Data Flow (ELT Pipeline)

Extract: GDELT Events API + GKG Feed → Polars (10x faster than Pandas)
Validate: Custom schema + threshold data quality checks
Load: Deduplicated data into MotherDuck (serverless DuckDB)
Transform: dbt models create staging views and mart tables
Emotions: GKG data → Extract tone, fear, joy, topics (rolling 24h)
Embed: Voyage AI generates vectors every 12 hours
Serve: Streamlit dashboard with AI chat (SQL + RAG modes)
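
The Validate and Load steps above can be sketched roughly as follows. This is not the project's `DataQualityValidator`; it is a dependency-free illustration in plain Python (standing in for Polars) of a schema check, a threshold check, and deduplication before load. The column names, the 10% null threshold, and the helpers `validate_batch` and `deduplicate` are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical schema: column name -> expected Python type.
REQUIRED_COLUMNS = {"event_id": str, "event_date": str, "avg_tone": float}


@dataclass
class ValidationReport:
    passed: bool
    errors: list[str] = field(default_factory=list)


def validate_batch(rows: list[dict], max_null_ratio: float = 0.1) -> ValidationReport:
    """Schema check plus a threshold check on the share of missing tone values."""
    if not rows:
        return ValidationReport(False, ["batch is empty"])
    errors = []
    for name, expected in REQUIRED_COLUMNS.items():
        for row in rows:
            if name not in row:
                errors.append(f"missing column: {name}")
                break
            value = row[name]
            if value is not None and not isinstance(value, expected):
                errors.append(f"bad type in column: {name}")
                break
    null_ratio = sum(r.get("avg_tone") is None for r in rows) / len(rows)
    if null_ratio > max_null_ratio:
        errors.append(f"avg_tone null ratio {null_ratio:.2f} exceeds {max_null_ratio}")
    return ValidationReport(not errors, errors)


def deduplicate(rows: list[dict], seen_ids: set[str]) -> list[dict]:
    """Keep only rows whose event_id has not been loaded before."""
    fresh = []
    for row in rows:
        if row["event_id"] not in seen_ids:
            seen_ids.add(row["event_id"])
            fresh.append(row)
    return fresh


batch = [
    {"event_id": "a1", "event_date": "2024-05-01", "avg_tone": -2.5},
    {"event_id": "a1", "event_date": "2024-05-01", "avg_tone": -2.5},
    {"event_id": "b2", "event_date": "2024-05-01", "avg_tone": 1.0},
]
report = validate_batch(batch)
to_load = deduplicate(batch, seen_ids=set()) if report.passed else []
```

Running validation before deduplication means a failing batch never reaches the warehouse, which is the sense in which the checks act as a gate.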

🛠️ Tech Stack

Data Engineering

Tool	Purpose	Replaces
Polars	High-performance DataFrame processing (10x faster)	Pandas
dbt Core	SQL transformations with staging/marts pattern	Raw SQL
DataQualityValidator	Custom schema + threshold validation & testing	Manual checks
Dagster	Pipeline orchestration with asset-based design	Apache Airflow
DuckDB/MotherDuck	Serverless cloud OLAP warehouse	Snowflake/Redshift
GitHub Actions	CI/CD with 15-min + 12-hr scheduled jobs	AWS Lambda

AI/ML

Tool	Purpose	Replaces
Cerebras	LLM inference (Llama 3.1 8B)	OpenAI GPT-4
LlamaIndex	Text-to-SQL query engine	Custom NLP
Voyage AI	Vector embeddings for RAG	OpenAI Embeddings
MotherDuck Vectors	Native vector similarity search	Pinecone / Weaviate

Frontend

Tool	Purpose	Replaces
Streamlit	Interactive dashboard framework	Tableau / Power BI
Plotly	Dynamic charts and visualizations	D3.js / Chart.js

Skills Demonstrated

Python (Polars, Pandas, RegEx, API integration)
SQL (Complex queries, window functions, dbt models)
Data Quality (custom validation, schema testing)
ELT Pipelines (Extract, Load, Transform with dbt)
CI/CD (GitHub Actions, cron scheduling)
Vector Search (Embeddings, cosine similarity, RAG)

🚀 Quick Start

Prerequisites

Python 3.10+
MotherDuck Account (free tier)
Cerebras API Key (free tier)

Installation

# Clone the repository
git clone https://github.com/Mohith-akash/Global-News-Intel-Platform.git
cd Global-News-Intel-Platform

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or
.\venv\Scripts\activate   # Windows

# Install dependencies
pip install -r requirements.txt

Configuration

Create a .env file in the project root:

MOTHERDUCK_TOKEN=your_motherduck_token
CEREBRAS_API_KEY=your_cerebras_api_key
VOYAGE_API_KEY=your_voyage_api_key  # Optional: enables RAG mode
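
As a rough illustration of how these settings might be consumed, the sketch below checks the two required keys and treats the Voyage key as an optional switch for RAG mode. It assumes the variables are already present in the process environment (for example, loaded from `.env` by a helper library); `load_settings` and the `rag_enabled` flag are hypothetical names, not taken from `src/config.py`.

```python
import os
from typing import Mapping

REQUIRED_KEYS = ("MOTHERDUCK_TOKEN", "CEREBRAS_API_KEY")
OPTIONAL_RAG_KEY = "VOYAGE_API_KEY"


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect API keys, failing fast when a required one is missing."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    settings = {key: env[key] for key in REQUIRED_KEYS}
    # RAG mode is only enabled when the optional Voyage key is present.
    settings["rag_enabled"] = bool(env.get(OPTIONAL_RAG_KEY))
    return settings


settings = load_settings({"MOTHERDUCK_TOKEN": "md_token", "CEREBRAS_API_KEY": "cb_key"})
```

Passing the environment as an argument keeps the function easy to test without touching real credentials.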

Run the Dashboard

streamlit run app.py

Run the Pipeline Manually

# Polars-powered ingestion (15-min schedule)
python -m dagster job execute -f etl/pipeline_polars.py -j gdelt_ingestion_job

# Embedding generation (12-hour schedule)
python -m dagster job execute -f etl/embedding_job.py -j gdelt_embedding_job

# Run dbt models
cd dbt && dbt run
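
For scheduled runs, the same commands can be launched from Python as a child process. The sketch below shows one way to do that with the standard library, reusing the file and job names shown above; `dagster_command` and `run_job` are hypothetical helpers, not part of the repository.

```python
import subprocess
import sys


def dagster_command(pipeline_file: str, job_name: str) -> list[str]:
    """Build the `dagster job execute` call shown above as an argv list."""
    return [sys.executable, "-m", "dagster", "job", "execute",
            "-f", pipeline_file, "-j", job_name]


def run_job(pipeline_file: str, job_name: str) -> int:
    """Run the job in a child process and return its exit code."""
    completed = subprocess.run(dagster_command(pipeline_file, job_name), check=False)
    return completed.returncode


cmd = dagster_command("etl/pipeline_polars.py", "gdelt_ingestion_job")
```

Returning the exit code lets the caller (for example, a CI step) decide whether to fail the run.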

💰 Enterprise Tools vs My Stack

This project demonstrates how to achieve enterprise-grade capabilities at zero cost:

Enterprise Tool	Monthly Cost	My Alternative	My Cost
Databricks/Spark	~$500	DuckDB	$0
Snowflake/BigQuery	~$300	MotherDuck	$0
Managed Airflow	~$300	Dagster + GitHub Actions	$0
dbt Cloud	~$100	dbt Core (self-hosted)	$0
Pinecone/Weaviate	~$70	MotherDuck Vectors	$0
OpenAI Embeddings	~$50	Voyage AI	$0
OpenAI GPT-4	~$100	Cerebras	$0
Tableau/Power BI	~$70	Streamlit	$0
TOTAL	$1,490+		$0

Key Insight: MotherDuck's native vector search eliminates the need for a separate vector database like Pinecone.

🔄 Technology Evolution

This project evolved through multiple iterations to optimize for cost and performance:

Data Warehouse

❄️ Snowflake (trial) → 🦆 MotherDuck (free tier)

Started with a Snowflake trial to learn enterprise data warehousing
Migrated to MotherDuck to eliminate costs while keeping SQL compatibility

AI/LLM Provider

✨ Gemini 2.0/2.5 Flash → ⚡ Groq (Llama 3.3 70B) → 🧠 Cerebras (Llama 3.1 8B)

Tested Gemini models for natural language queries
Tried Groq's fast inference with larger Llama models
Settled on Cerebras for its reliable free tier and good performance

RAG Embeddings

🚀 Voyage AI (embeddings) + 🦆 MotherDuck (vector search)

Voyage AI creates 1024-dim embeddings for semantic search
MotherDuck's native array_cosine_similarity() replaces Pinecone
Dual-mode AI: SQL for precise queries, RAG for semantic exploration
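
To make the vector search step concrete, here is a small pure-Python sketch of cosine-similarity ranking, the same measure `array_cosine_similarity()` computes inside the warehouse. The 3-dimensional vectors are toy stand-ins for the 1024-dim embeddings, and the document ids and `top_k` helper are made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], docs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(docs, key=lambda doc_id: cosine_similarity(query, docs[doc_id]), reverse=True)
    return ranked[:k]


# Toy 3-dim vectors stand in for the 1024-dim Voyage AI embeddings.
docs = {
    "flood_report": [0.9, 0.1, 0.0],
    "election_news": [0.0, 1.0, 0.2],
    "storm_warning": [0.8, 0.0, 0.3],
}
nearest = top_k([1.0, 0.0, 0.1], docs)
```

In the warehouse, the equivalent ranking would be an `ORDER BY array_cosine_similarity(...) DESC LIMIT k` over the stored embedding column, so the application never needs to pull the vectors into Python.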

Key Learning: The best tool isn't always the most expensive; it's the one that solves your problem within your constraints.

📁 Project Structure

gdelt_project/
├── app.py                    # Streamlit dashboard entry point
├── src/                      # Core modules
│   ├── config.py             # Configuration constants
│   ├── database.py           # Database connection
│   ├── queries.py            # SQL query functions
│   ├── ai_engine.py          # LLM/AI setup (Cerebras + LlamaIndex)
│   ├── rag_engine.py         # RAG engine (Voyage AI + vector search)
│   ├── data_processing.py    # Headline extraction
│   ├── utils.py              # Utility functions
│   └── styles.py             # CSS styling
├── etl/                      # Data pipeline
│   ├── pipeline_polars.py    # 🆕 Polars ingestion + custom validation
│   └── embedding_job.py      # 🆕 12-hour embedding generation
├── dbt/                      # 🆕 dbt transformation layer
│   ├── dbt_project.yml       # dbt configuration
│   ├── profiles.yml          # MotherDuck connection
│   └── models/
│       ├── staging/          # stg_events (cleaned data)
│       └── marts/            # fct_daily_events, dim_actors, dim_countries
├── components/               # UI components
│   ├── render.py             # Dashboard rendering
│   ├── ai_chat.py            # AI chat interface
│   └── about.py              # About page
├── requirements.txt          # Python dependencies
├── .env                      # Environment variables (not in repo)
└── .github/workflows/
    ├── gdelt_ingest_15min.yml    # 🆕 15-min Polars ingestion
    └── gdelt_embeddings_12hr.yml # 🆕 12-hour embedding job

🔮 Future Enhancements

~~Add dbt transformations for advanced modeling~~ ✅ Done!
~~Upgrade to Polars for faster processing~~ ✅ Done!
~~Add data quality validation~~ ✅ Done!
Implement event clustering with ML
Add email/Slack alerts for crisis events
Expand AI chat with multi-turn conversations
Add export functionality (CSV, PDF reports)

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📬 Contact

Mohith Akash

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Built with ☕ and curiosity • Data sourced from GDELT Project
