Wikipedia RAG System

A fully functional, locally hosted RAG (Retrieval-Augmented Generation) system that uses Wikipedia as its knowledge base, Ollama for LLM serving, and optional MCP for agentic capabilities.

Overview

This project demonstrates a complete end-to-end RAG pipeline that:

  • Processes Wikipedia dumps into searchable chunks
  • Generates semantic embeddings for intelligent retrieval
  • Uses FAISS for fast vector similarity search
  • Connects to local Ollama LLMs for generating contextual answers
  • Provides both CLI and web interfaces for interaction

Features

  • Wikipedia Processing: Parse and clean 262K+ articles from Simple English Wikipedia
  • Semantic Search: FAISS-powered vector search over 569K text chunks
  • Local LLM: Ollama integration with Mistral/Llama2 models
  • Web Interface: Chat UI via OpenWebUI or Flask
  • Source Attribution: Every answer includes relevant Wikipedia citations
  • Offline Operation: Complete system runs locally
  • Memory Efficient: Optimized for 8GB RAM with batch processing

Architecture

Wikipedia XML → Parser → Chunker → Embeddings → FAISS Index
                                                      ↓
                                              RAG Retrieval ← User Query
                                                      ↓
                                            Context Formation
                                                      ↓
                                            Ollama LLM (Mistral)
                                                      ↓
                                              Generated Answer + Sources

What's Done

Data Processing:

  • Wikipedia parsing: 262,105 articles extracted
  • Text chunking: 569,456 searchable segments
  • Embeddings: 384-dim vectors using sentence-transformers
  • FAISS index: 834MB, cosine similarity search
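
The retrieval core is compact; a minimal sketch of the embed-and-index step is shown below (not the project's exact code, and the sample chunks are made up). Normalized vectors plus an inner-product index give cosine similarity in FAISS.

# Sketch: embed chunks with all-MiniLM-L6-v2 and index them for cosine search
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dim

chunks = ["Python is a programming language.", "Mitochondria produce ATP."]
vectors = model.encode(chunks, batch_size=100, normalize_embeddings=True)
vectors = np.asarray(vectors, dtype="float32")

index = faiss.IndexFlatIP(vectors.shape[1])   # inner product == cosine on unit vectors
index.add(vectors)

query = model.encode(["What is Python?"], normalize_embeddings=True).astype("float32")
scores, ids = index.search(query, 2)          # top-2 chunks for the query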

LLM Integration:

  • Ollama setup with Mistral 7B model
  • RAG pipeline connecting search to generation
  • Context-aware response generation
  • Source citation system
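
In outline, the generation step is prompt construction plus one call to Ollama's REST API. The function below is an illustrative sketch; the chunk fields and prompt wording are assumptions, not the project's exact code.

# Sketch: turn retrieved chunks into a context block and ask Ollama (Mistral)
# to answer from that context only, citing article titles
import requests

def generate_answer(question, retrieved_chunks):
    context = "\n\n".join(f"[{c['title']}] {c['text']}" for c in retrieved_chunks)
    prompt = (
        "Answer the question using only the context below and cite the "
        "article titles you used.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "mistral:latest", "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]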

Interfaces:

  • Interactive CLI chat
  • Flask web interface
  • OpenWebUI Docker integration

Conversation Memory:

  • Multi-turn dialogue support
  • Session-based conversation tracking
  • Context-aware follow-up questions
  • Persistent conversation storage
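
The idea behind the memory layer can be sketched in a few lines; the real logic lives in conversation_memory.py, and the file layout and field names below are assumptions.

# Sketch: per-session memory persisted as JSON, with the last few turns
# prepended to the prompt so follow-up questions resolve correctly
import json
from pathlib import Path

class ConversationMemory:
    def __init__(self, session_id, store_dir="logs/sessions"):
        self.path = Path(store_dir) / f"{session_id}.json"
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.turns = json.loads(self.path.read_text()) if self.path.exists() else []

    def add_turn(self, question, answer):
        self.turns.append({"question": question, "answer": answer})
        self.path.write_text(json.dumps(self.turns, indent=2))

    def recent_context(self, n=3):
        return "\n".join(
            f"User: {t['question']}\nAssistant: {t['answer']}" for t in self.turns[-n:]
        )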

Hybrid Search:

  • BM25 keyword search integration
  • Reciprocal Rank Fusion (RRF)
  • Improved retrieval for exact terms and formulas
  • 10-20% better retrieval quality
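
Reciprocal Rank Fusion itself is only a few lines; here is a sketch (k=60 is the conventional RRF constant, not necessarily what hybrid_search.py uses).

# Sketch: fuse FAISS (semantic) and BM25 (keyword) rankings with RRF
def reciprocal_rank_fusion(semantic_ids, bm25_ids, k=60):
    scores = {}
    for ranked in (semantic_ids, bm25_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)   # best fused score first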

MCP Agents:

  • Multi-step planning and execution
  • 4 Wikipedia tools (search, compare, multi-search, summarize)
  • Automatic mode selection (simple vs agentic)
  • Tool call tracing and synthesis
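
At a high level the agentic loop is: plan with the LLM, run each tool call against the retriever, then synthesize the results. The sketch below is heavily simplified and its tool signatures are hypothetical.

# Sketch: dispatch planned tool calls and collect a trace for synthesis
def run_plan(plan, tools, max_tool_calls=5):
    trace = []
    for step in plan[:max_tool_calls]:          # respect the tool-call budget
        result = tools[step["tool"]](**step["args"])
        trace.append({"tool": step["tool"], "args": step["args"], "result": result})
    return trace                                 # handed back to the LLM to synthesize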

Quick Start

What I'm using

  • MacBook Air M1 (8GB RAM)
  • Python 3.9+, Docker (for OpenWebUI)
  • ~5GB of disk space for Simple English Wikipedia, ~150GB for the full English Wikipedia

Installation

# Clone the repository
git clone https://github.com/bhavyashah10/wikipedia-rag-project.git
cd wikipedia-rag-project

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Test setup
python test_setup.py

Data Processing Pipeline

# 1. Download Simple English Wikipedia
curl -O https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2
mv simplewiki-latest-pages-articles.xml.bz2 data/raw/

# 2. Parse articles from XML
python test_parser.py

# 3. Create text chunks
python src/data_processing/text_chunker.py

# 4. Generate embeddings
python src/embeddings/embedding_generator.py

# 5. Build FAISS index 
python src/retrieval/faiss_indexer.py

# 6. Build BM25 index for hybrid search 
python src/retrieval/hybrid_search.py
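
A quick sanity check after steps 4-5 is to load the index and confirm the vector count (the index path below is illustrative; use whatever path faiss_indexer.py reports).

import faiss
index = faiss.read_index("data/embeddings/wikipedia.index")   # illustrative path
print(index.ntotal)   # should be on the order of 569K chunks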

LLM Setup

# Install Ollama
brew install ollama

# Start Ollama service
brew services start ollama

# Pull Mistral model (~4GB, takes 5-10 minutes)
ollama pull mistral

# Or use Llama2
# ollama pull llama2:7b-chat

Running the System

Option 1: CLI Chat Interface

# Start interactive chat (without memory)
python src/llm_integration/rag_pipeline.py

# Start interactive chat (with conversation memory)
python src/llm_integration/rag_with_memory.py

# Start interactive chat (with hybrid search + memory)
python src/llm_integration/rag_with_hybrid_search.py

# Start interactive chat (with MCP agents - full agentic mode)
python src/mcp_agents/rag_with_mcp.py

# Example queries:
# - "What is artificial intelligence?" (simple)
# - "Compare quantum computing and classical computing" (agentic)
# - "What are the differences between mitochondria and chloroplasts?" (multi-step)

Option 2: OpenWebUI (Recommended)

# Install and run OpenWebUI with Docker
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

# Open browser to http://localhost:3000
# Login and start chatting

Option 3: Flask Web Interface (optional, if not using Option 2)

# Start web server
cd src/llm_integration
python web_interface.py

# Open browser to http://localhost:5000

Project Structure

wikipedia-rag-project/
├── data/
│   ├── raw/                    # Wikipedia XML dumps
│   ├── processed/              # 569K clean chunks (JSON)
│   └── embeddings/             # FAISS index + vectors (1.6GB)
├── src/
│   ├── data_processing/
│   │   ├── wikipedia_parser.py    # XML parsing & cleaning
│   │   └── text_chunker.py        # Semantic text chunking
│   ├── embeddings/
│   │   └── embedding_generator.py # Vector embeddings (sentence-transformers)
│   ├── retrieval/
│   │   ├── faiss_indexer.py       # FAISS index & similarity search
│   │   └── hybrid_search.py       # BM25 + FAISS hybrid search
│   ├── llm_integration/
│   │   ├── rag_pipeline.py        # Complete RAG pipeline
│   │   ├── rag_with_memory.py     # RAG with conversation memory
│   │   ├── rag_with_hybrid_search.py  # RAG with hybrid search + memory
│   │   ├── conversation_memory.py # Conversation management
│   │   ├── web_interface.py       # Flask web server
│   │   └── templates/
│   │       └── chat.html          # Web UI
│   └── mcp_agents/                # MCP agentic layer
│       ├── tools.py               # Wikipedia MCP tools
│       ├── agent.py               # Planning & execution
│       └── rag_with_mcp.py        # RAG with MCP integration
├── config/
│   └── config.yaml             # System configuration
├── logs/                       # Application logs
├── requirements.txt            # Python dependencies
├── test_setup.py              # Setup verification
├── test_parser.py             # Parser testing
└── check_setup.py             # System status check

Configuration

Edit config/config.yaml to customize:

Processing Settings:

processing:
  chunk_size: 1000              # Characters per chunk
  chunk_overlap: 200            # Overlap between chunks
  min_article_length: 100       # Filter short articles

Embedding Settings:

embeddings:
  model_name: "sentence-transformers/all-MiniLM-L6-v2"
  dimension: 384
  normalize: true

RAG Settings:

rag:
  top_k: 5                      # Number of chunks to retrieve
  score_threshold: 0.7          # Minimum similarity score
  max_context_length: 4000      # Max characters in context

LLM Settings:

llm:
  model: "mistral:latest"       # Ollama model to use
  temperature: 0.7              # Response creativity
  max_tokens: 2048              # Max response length

MCP Settings:

mcp:
  enabled: true                 # Enable MCP agents
  max_tool_calls: 5             # Maximum tool iterations
  planning_temperature: 0.3     # Temperature for planning
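
All of the scripts read these values from the same file; a minimal PyYAML load looks like this (how each script actually consumes the config may differ).

import yaml

with open("config/config.yaml") as f:
    config = yaml.safe_load(f)

print(config["rag"]["top_k"])                 # 5
print(config["embeddings"]["model_name"])     # sentence-transformers/all-MiniLM-L6-v2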

Troubleshooting

Issue: Ollama connection failed

# Check if Ollama is running
brew services list | grep ollama

# Restart if needed
brew services restart ollama

# Test connection
curl http://localhost:11434/api/tags

Issue: Out of memory during embedding generation

# Edit config.yaml, reduce batch size
processing:
  batch_size: 50  # Reduce from 100

Issue: FAISS index not found

# Rebuild the index
python src/retrieval/faiss_indexer.py

Issue: Slow query responses

# Check if using GPU acceleration
python -c "import torch; print(torch.backends.mps.is_available())"

# Reduce top_k in config to retrieve fewer chunks
rag:
  top_k: 3  # Reduce from 5

Contact

For questions or feedback, please open an issue on GitHub.
