This project implements a lightweight semantic search system built on the 20 Newsgroups dataset. The system combines vector embeddings, fuzzy clustering, and a custom semantic cache to provide efficient query retrieval through a FastAPI service.
Unlike traditional keyword search systems, this implementation performs semantic retrieval: queries that are phrased differently but carry similar meaning can still retrieve relevant results.
The system also includes a semantic caching mechanism that avoids recomputation for similar queries without relying on external caching tools such as Redis or Memcached.
The system is composed of four primary components:
Documents are converted into vector representations using:
- TF-IDF vectorization
- Truncated SVD (Latent Semantic Analysis)
This approach reduces high-dimensional sparse vectors into a compact semantic representation.
Pipeline:
Dataset → TF-IDF Vectorizer → Truncated SVD → Dense Embeddings
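A minimal sketch of this pipeline with scikit-learn, using a toy corpus; the component parameters here are illustrative, not the project's actual configuration in `embeddings.py`:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

docs = [
    "the space shuttle launch was delayed",
    "nasa postponed the rocket launch",
    "the hockey team won the playoff game",
    "graphics cards render 3d scenes",
]

# TF-IDF -> Truncated SVD (LSA): sparse term weights are compressed
# into dense low-dimensional semantic vectors.
lsa = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2, random_state=42),
)
embeddings = lsa.fit_transform(docs)
print(embeddings.shape)  # (4, 2)
```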
Instead of assigning each document to a single cluster, the system uses Gaussian Mixture Models (GMM) to compute soft cluster memberships.
This reflects the real-world nature of documents, where content can belong to multiple topics.
For each document:
Cluster Membership = Probability Distribution across clusters
Example:
Document → Cluster 3: 0.62
           Cluster 7: 0.28
           Cluster 12: 0.10
The dominant cluster is the cluster with the highest probability.
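A sketch of how soft memberships can be obtained with scikit-learn's GaussianMixture; the embeddings below are random stand-ins for the LSA vectors, and the component count is illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-in for document embeddings (real ones come from the LSA pipeline)
embeddings = rng.normal(size=(200, 10))

gmm = GaussianMixture(n_components=5, random_state=0).fit(embeddings)

# Soft memberships: each row is a probability distribution over clusters
memberships = gmm.predict_proba(embeddings)
dominant = memberships.argmax(axis=1)  # cluster with highest probability

print(memberships[0].round(2))
```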
Document embeddings are stored in an in-memory vector store.
Query processing works as follows:
- Query is converted to embedding
- Cosine similarity is computed against document embeddings
- Top-k most similar documents are retrieved
This provides semantic similarity rather than keyword matching.
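A minimal sketch of the retrieval step, computing cosine similarity as a dot product of unit-normalised vectors; the `top_k` helper and toy vectors are illustrative, not the project's `vector_store.py` API:

```python
import numpy as np

def top_k(query_vec, doc_matrix, k=3):
    """Return indices and scores of the k documents most similar to the query."""
    # Cosine similarity = dot product of L2-normalised vectors
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = d @ q
    order = np.argsort(sims)[::-1][:k]
    return order, sims[order]

docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
idx, scores = top_k(np.array([1.0, 0.05]), docs, k=2)
print(idx)  # most similar documents first
```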
Traditional caches return a hit only when a query matches a previous one exactly.
This project implements a semantic cache, which detects when a query is similar to a previously asked query.
Steps:
- Query embedding is generated
- Similarity is computed against cached query embeddings
- If similarity exceeds a threshold → cache hit
- Otherwise → cache miss and computation
Similarity metric: Cosine similarity
Default cache threshold:
0.90
This value controls how strict the cache matching is.
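The lookup steps above can be sketched as a small class; the `SemanticCache` name and its methods are illustrative, not the exact API of the project's `cache.py`:

```python
import numpy as np

class SemanticCache:
    """Minimal semantic cache: query embeddings + cosine similarity + threshold."""

    def __init__(self, threshold=0.90):
        self.threshold = threshold
        self.embeddings = []   # cached query embeddings (unit-normalised)
        self.results = []      # cached results, aligned with embeddings

    def _normalise(self, v):
        return v / np.linalg.norm(v)

    def get(self, query_emb):
        """Return (result, similarity) on a hit, else None (miss)."""
        if not self.embeddings:
            return None
        q = self._normalise(query_emb)
        sims = np.stack(self.embeddings) @ q  # cosine similarities
        best = int(np.argmax(sims))
        if sims[best] >= self.threshold:
            return self.results[best], float(sims[best])
        return None

    def put(self, query_emb, result):
        self.embeddings.append(self._normalise(query_emb))
        self.results.append(result)

cache = SemanticCache(threshold=0.90)
cache.put(np.array([1.0, 0.0]), "doc about space")
print(cache.get(np.array([0.99, 0.05])))  # near-duplicate query -> hit
print(cache.get(np.array([0.0, 1.0])))    # unrelated query -> None (miss)
```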
semantic-search-cache
│
├── src
│ ├── api.py # FastAPI application
│ ├── search.py # Main search system logic
│ ├── cache.py # Semantic cache implementation
│ ├── embeddings.py # TF-IDF + SVD embedding pipeline
│ ├── clustering.py # Gaussian Mixture fuzzy clustering
│ └── vector_store.py # Vector similarity search
│
├── run.sh # Startup script
├── requirements.txt # Python dependencies
└── README.md
This project uses the 20 Newsgroups dataset available through scikit-learn.
It contains roughly 18,000 documents (about 1,000 per category) across 20 categories, including:
- politics
- religion
- computer graphics
- space
- sports
- electronics
The dataset is loaded directly via:
sklearn.datasets.fetch_20newsgroups
Noise such as headers, footers, and quotes is removed during loading to improve semantic quality.
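Loading with noise removal, as described, looks like the following (the dataset is downloaded once and cached locally by scikit-learn):

```python
from sklearn.datasets import fetch_20newsgroups

# Strip headers, footers, and quoted replies so embeddings reflect
# the body text rather than boilerplate.
newsgroups = fetch_20newsgroups(
    subset="all",
    remove=("headers", "footers", "quotes"),
)
print(len(newsgroups.data), len(newsgroups.target_names))
```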
POST /query
Request:
{
"query": "space shuttle launch failure"
}
Response:
{
"query": "...",
"cache_hit": true,
"matched_query": "...",
"similarity_score": 0.91,
"result": "...",
"dominant_cluster": 3
}
Fields:
| Field | Description |
|---|---|
| query | user query |
| cache_hit | indicates whether result came from cache |
| matched_query | cached query matched |
| similarity_score | cosine similarity score |
| result | retrieved document snippet |
| dominant_cluster | cluster ID with highest probability |
GET /cache/stats
Example response:
{
"total_entries": 42,
"hit_count": 17,
"miss_count": 25,
"hit_rate": 0.405
}
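For reference, `hit_rate` is simply hits divided by total lookups:

```python
hit_count, miss_count = 17, 25
hit_rate = hit_count / (hit_count + miss_count)
print(round(hit_rate, 3))  # 0.405
```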
DELETE /cache
Clears all cached entries and resets statistics.
GET /health
Response:
{
"status": "ok"
}
The project can be run on Linux, macOS, or Windows.
Linux / macOS:
git clone <repository-url>
cd semantic-search-cache
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
chmod +x run.sh
./run.sh
The API will start at:
http://localhost:8000
Interactive API documentation:
http://localhost:8000/docs
Windows:
git clone <repository-url>
cd semantic-search-cache
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
Navigate to the src folder and run:
python -m uvicorn api:app --host 0.0.0.0 --port 8000
Open in browser:
http://localhost:8000/docs
space shuttle launch
gun control debate
computer graphics rendering
religion vs atheism
hockey playoffs
TF-IDF + Truncated SVD was chosen because:
- Lightweight
- No external model downloads
- Good semantic compression for news articles
Gaussian Mixture Models provide soft clustering, allowing documents to belong to multiple topics.
This is important because many news articles contain mixed themes.
The cache is implemented from scratch using:
- Query embeddings
- Cosine similarity
- Threshold-based lookup
No external caching systems were used.
The threshold controls how strict the semantic cache is.
| Threshold | Behaviour |
|---|---|
| 0.70 | aggressive caching |
| 0.85 | balanced |
| 0.95 | very strict |
Lower threshold → more cache hits but risk of incorrect matches.
Higher threshold → more accurate but fewer hits.
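The trade-off can be seen with toy embeddings (values chosen purely for illustration): the same pair of queries is a hit under an aggressive threshold and a miss under a strict one.

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cached = np.array([1.0, 0.2])   # embedding of a cached query
new = np.array([1.0, 0.7])      # embedding of a related new query
sim = cos(cached, new)

print(round(sim, 2))            # similarity between the two queries
print(sim >= 0.70)  # aggressive threshold -> hit
print(sim >= 0.95)  # strict threshold -> miss
```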
Possible improvements include:
- Cluster-aware cache lookup
- FAISS vector database
- transformer embeddings (Sentence Transformers)
- persistent cache storage
- distributed deployment
Start the service and send queries through the API documentation interface:
http://localhost:8000/docs
Repeat a similar query to observe semantic cache hits.
Example:
Query 1:
space shuttle launch
Query 2:
nasa rocket launch failure
The second query may be served from the cache if similarity exceeds the threshold.