A fully functional, locally hosted RAG (Retrieval-Augmented Generation) system that uses Wikipedia as the knowledge base, Ollama for LLM serving, and optional MCP for agentic capabilities.
This project demonstrates a complete end-to-end RAG pipeline that:
- Processes Wikipedia dumps into searchable chunks
- Generates semantic embeddings for intelligent retrieval
- Uses FAISS for fast vector similarity search
- Connects to local Ollama LLMs for generating contextual answers
- Provides both CLI and web interfaces for interaction
- Wikipedia Processing: Parse and clean 262K+ articles from Simple English Wikipedia
- Semantic Search: FAISS-powered vector search over 569K text chunks
- Local LLM: Ollama integration with Mistral/Llama2 models
- Web Interface: Chat UI via OpenWebUI or Flask
- Source Attribution: Every answer includes relevant Wikipedia citations
- Offline Operation: Complete system runs locally
- Memory Efficient: Optimized for 8GB RAM with batch processing
Wikipedia XML → Parser → Chunker → Embeddings → FAISS Index
                                                     ↓
                              RAG Retrieval ← User Query
                                    ↓
                            Context Formation
                                    ↓
                           Ollama LLM (Mistral)
                                    ↓
                        Generated Answer + Sources
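In code, the query path above boils down to embed, search, assemble context, generate. A minimal sketch under assumed names (`index`, the `chunks` field layout, and the prompt wording are illustrative, not the project's actual API):

```python
# Minimal sketch of the query path; `index` (FAISS), `chunks` (list of
# dicts with "text" and "title"), and the prompt wording are assumptions.
import requests
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def answer(query, index, chunks, top_k=5):
    # Embed the query and search the FAISS index (cosine similarity,
    # since the stored vectors are normalized).
    q = model.encode([query], normalize_embeddings=True)
    scores, ids = index.search(q, top_k)
    context = "\n\n".join(chunks[i]["text"] for i in ids[0])

    # Ask the local Ollama server (default port 11434) for an answer.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "mistral",
            "prompt": f"Use only this context to answer.\n\n{context}\n\nQuestion: {query}",
            "stream": False,
        },
    )
    return resp.json()["response"], [chunks[i]["title"] for i in ids[0]]
```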
Data Processing:
- Wikipedia parsing: 262,105 articles extracted
- Text chunking: 569,456 searchable segments
- Embeddings: 384-dim vectors using sentence-transformers
- FAISS index: 834MB, cosine similarity search
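The build step behind those numbers roughly follows this shape: overlapping character chunks, batched embeddings, and an inner-product index over unit vectors so that inner product equals cosine similarity. A hedged sketch; the chunking helper and batch size here are illustrative, not the project's exact code:

```python
# Sketch of the index build: overlapping chunks, batched embeddings,
# and a flat inner-product FAISS index over normalized vectors.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk_text(text, size=1000, overlap=200):
    # Slide a window of `size` characters forward by (size - overlap).
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def build_index(texts, batch_size=100):
    vecs = model.encode(texts, batch_size=batch_size,
                        normalize_embeddings=True, show_progress_bar=True)
    index = faiss.IndexFlatIP(vecs.shape[1])  # 384 dims for MiniLM-L6-v2
    index.add(np.asarray(vecs, dtype="float32"))
    return index
```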
LLM Integration:
- Ollama setup with Mistral 7B model
- RAG pipeline connecting search to generation
- Context-aware response generation
- Source citation system
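Source citation typically works by numbering the retrieved chunks inside the prompt so the model can cite them inline. A sketch of that context formation step (the prompt wording and `hit` field names are assumptions):

```python
# Sketch of context formation with numbered sources so the model can
# cite [1], [2], ... in its answer; the real prompt template may differ.
def build_prompt(question, hits, max_context=4000):
    context, sources = [], []
    for n, hit in enumerate(hits, 1):
        context.append(f"[{n}] {hit['text']}")
        sources.append(f"[{n}] {hit['title']}")
    ctx = "\n\n".join(context)[:max_context]  # respect max_context_length
    prompt = (
        "Answer the question using only the numbered context below, "
        "and cite sources like [1].\n\n"
        f"{ctx}\n\nQuestion: {question}\nAnswer:"
    )
    return prompt, sources
```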
Interfaces:
- Interactive CLI chat
- Flask web interface
- OpenWebUI Docker integration
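The Flask interface reduces to a single chat endpoint. A minimal sketch; the route name, JSON shape, and the `rag_answer` stub are illustrative assumptions, not the project's actual code:

```python
# Minimal Flask chat endpoint in the spirit of web_interface.py;
# route, payload shape, and the stub below are assumptions.
from flask import Flask, jsonify, request

app = Flask(__name__)

def rag_answer(question):
    # Stand-in for the real pipeline call (see rag_pipeline.py).
    return f"(answer to: {question})", []

@app.route("/api/chat", methods=["POST"])
def chat():
    question = request.json["message"]
    answer, sources = rag_answer(question)
    return jsonify({"answer": answer, "sources": sources})

if __name__ == "__main__":
    app.run(port=5000)
```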
Conversation Memory:
- Multi-turn dialogue support
- Session-based conversation tracking
- Context-aware follow-up questions
- Persistent conversation storage
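Session-scoped memory of this shape would cover the points above. A minimal sketch; `conversation_memory.py` may be organized differently:

```python
# Minimal session-based conversation memory: keep the most recent turns
# per session and prepend them to the prompt so follow-ups resolve.
import json
from collections import defaultdict

class ConversationMemory:
    def __init__(self, max_turns=5):
        self.sessions = defaultdict(list)
        self.max_turns = max_turns

    def add(self, session_id, question, answer):
        self.sessions[session_id].append({"q": question, "a": answer})
        # Truncate to the last max_turns exchanges.
        self.sessions[session_id] = self.sessions[session_id][-self.max_turns:]

    def history(self, session_id):
        return "\n".join(f"User: {t['q']}\nAssistant: {t['a']}"
                         for t in self.sessions[session_id])

    def save(self, path):
        # Persistent storage as plain JSON.
        with open(path, "w") as f:
            json.dump(dict(self.sessions), f)
```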
Hybrid Search:
- BM25 keyword search integration
- Reciprocal Rank Fusion (RRF)
- Improved retrieval for exact terms and formulas
- 10-20% better retrieval quality
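Reciprocal Rank Fusion merges the dense (FAISS) and sparse (BM25) result lists by summing 1/(k + rank) for each document across rankings. A minimal sketch; k = 60 is the common default, and the project's constant may differ:

```python
# Reciprocal Rank Fusion: a document's fused score is the sum of
# 1 / (k + rank) over every ranking it appears in.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:  # each is a list of doc ids, best first
        for rank, doc_id in enumerate(ranking, 1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse dense (FAISS) and sparse (BM25) candidate lists.
fused = rrf([["d3", "d1", "d7"], ["d1", "d9", "d3"]])
```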
MCP Agents:
- Multi-step planning and execution
- 4 Wikipedia tools (search, compare, multi-search, summarize)
- Automatic mode selection (simple vs agentic)
- Tool call tracing and synthesis
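Automatic mode selection can be as simple as a keyword heuristic that routes multi-entity questions to the agent. A hedged sketch; the hint list below is an assumption, not the project's actual logic:

```python
# Sketch of simple-vs-agentic mode selection: single-topic questions go
# through the plain RAG path, multi-entity ones to the MCP agent.
AGENTIC_HINTS = ("compare", "difference", "versus", " vs ", "both")

def pick_mode(question: str) -> str:
    q = question.lower()
    return "agentic" if any(h in q for h in AGENTIC_HINTS) else "simple"

print(pick_mode("What is artificial intelligence?"))            # simple
print(pick_mode("Compare quantum and classical computing"))     # agentic
```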
- Hardware: MacBook Air M1 (8GB RAM)
- Software: Python 3.9+, Docker (for OpenWebUI)
- Storage: ~5GB for Simple Wikipedia, ~150GB for full Wikipedia
# Clone the repository
git clone https://github.com/bhavyashah10/wikipedia-rag-project.git
cd wikipedia-rag-project
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Test setup
python test_setup.py

# 1. Download Simple English Wikipedia
curl -O https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2
mv simplewiki-latest-pages-articles.xml.bz2 data/raw/
# 2. Parse articles from XML
python test_parser.py
# 3. Create text chunks
python src/data_processing/text_chunker.py
# 4. Generate embeddings
python src/embeddings/embedding_generator.py
# 5. Build FAISS index
python src/retrieval/faiss_indexer.py
# 6. Build BM25 index for hybrid search
python src/retrieval/hybrid_search.py

# Install Ollama
brew install ollama
# Start Ollama service
brew services start ollama
# Pull Mistral model (~4GB, takes 5-10 minutes)
ollama pull mistral
# Or use Llama2
# ollama pull llama2:7b-chat

# Start interactive chat (without memory)
python src/llm_integration/rag_pipeline.py
# Start interactive chat (with conversation memory)
python src/llm_integration/rag_with_memory.py
# Start interactive chat (with hybrid search + memory)
python src/llm_integration/rag_with_hybrid_search.py
# Start interactive chat (with MCP agents - full agentic mode)
python src/mcp_agents/rag_with_mcp.py
# Example queries:
# - "What is artificial intelligence?" (simple)
# - "Compare quantum computing and classical computing" (agentic)
# - "What are the differences between mitochondria and chloroplasts?" (multi-step)# Install and run OpenWebUI with Docker
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
# Open browser to http://localhost:3000
# Login and start chatting

# Start web server
cd src/llm_integration
python web_interface.py
# Open browser to http://localhost:5000

wikipedia-rag-project/
├── data/
│ ├── raw/ # Wikipedia XML dumps
│ ├── processed/ # 569K clean chunks (JSON)
│ └── embeddings/ # FAISS index + vectors (1.6GB)
├── src/
│ ├── data_processing/
│ │ ├── wikipedia_parser.py # XML parsing & cleaning
│ │ └── text_chunker.py # Semantic text chunking
│ ├── embeddings/
│ │ └── embedding_generator.py # Vector embeddings (sentence-transformers)
│ ├── retrieval/
│ │ ├── faiss_indexer.py # FAISS index & similarity search
│ │ └── hybrid_search.py # BM25 + FAISS hybrid search
│ ├── llm_integration/
│ │ ├── rag_pipeline.py # Complete RAG pipeline
│ │ ├── rag_with_memory.py # RAG with conversation memory
│ │ ├── rag_with_hybrid_search.py # RAG with hybrid search + memory
│ │ ├── conversation_memory.py # Conversation management
│ │ ├── web_interface.py # Flask web server
│ │ └── templates/
│ │ └── chat.html # Web UI
│ └── mcp_agents/ # MCP agentic layer
│ ├── tools.py # Wikipedia MCP tools
│ ├── agent.py # Planning & execution
│ └── rag_with_mcp.py # RAG with MCP integration
├── config/
│ └── config.yaml # System configuration
├── logs/ # Application logs
├── requirements.txt # Python dependencies
├── test_setup.py # Setup verification
├── test_parser.py # Parser testing
└── check_setup.py # System status check
Edit config/config.yaml to customize:
Processing Settings:
processing:
  chunk_size: 1000          # Characters per chunk
  chunk_overlap: 200        # Overlap between chunks
  min_article_length: 100   # Filter short articles

Embedding Settings:
embeddings:
  model_name: "sentence-transformers/all-MiniLM-L6-v2"
  dimension: 384
  normalize: true

RAG Settings:
rag:
  top_k: 5                  # Number of chunks to retrieve
  score_threshold: 0.7      # Minimum similarity score
  max_context_length: 4000  # Max characters in context

LLM Settings:
llm:
  model: "mistral:latest"   # Ollama model to use
  temperature: 0.7          # Response creativity
  max_tokens: 2048          # Max response length

MCP Settings:
mcp:
  enabled: true             # Enable MCP agents
  max_tool_calls: 5         # Maximum tool iterations
  planning_temperature: 0.3 # Temperature for planning

Issue: Ollama connection failed
# Check if Ollama is running
brew services list | grep ollama
# Restart if needed
brew services restart ollama
# Test connection
curl http://localhost:11434/api/tags

Issue: Out of memory during embedding generation
# Edit config.yaml, reduce batch size
processing:
  batch_size: 50  # Reduce from 100

Issue: FAISS index not found
# Rebuild the index
python src/retrieval/faiss_indexer.py

Issue: Slow query responses
# Check if using GPU acceleration
python -c "import torch; print(torch.backends.mps.is_available())"
# Reduce top_k in config to retrieve fewer chunks
rag:
  top_k: 3  # Reduce from 5

For questions or feedback, please open an issue on GitHub.