A hybrid movie recommendation engine combining Collaborative Filtering and Content-Based Filtering to deliver personalized, diverse, and explainable recommendations. Built to address real-world challenges including cold-start, popularity bias, and lack of transparency.
- Features
- Architecture
- Installation
- Quick Start
- Usage Guide
- Evaluation Metrics
- Project Structure
- Datasets
- Testing
- Contributing
- License
| Feature | Description |
|---|---|
| Hybrid Architecture | Combines SVD-based Collaborative Filtering + TF-IDF Content-Based Filtering |
| Cold-Start Handling | Adaptive blending shifts weight to content features for new users |
| Diversity Promotion | MMR-based re-ranking reduces popularity bias and promotes long-tail items |
| Explainability | Human-readable explanations for every recommendation |
| Comprehensive Metrics | RMSE, Precision@K, NDCG, Coverage, Diversity, Novelty |
| Multiple Datasets | Supports MovieLens 100K, 1M, 10M, and 20M variants |
This system tackles key challenges in modern recommender systems:
- Cold-Start Problem: New users/items lack interaction history → Content-based fallback + preference elicitation
- Data Sparsity: Most user-item ratings are missing → Matrix factorization handles sparse data efficiently
- Popularity Bias: Popular items dominate → MMR re-ranking + long-tail promotion
- Lack of Diversity: Recommendations too similar → Intra-list diversity optimization
- Explainability: Black-box models lack trust → Genre and similarity-based explanations
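The cold-start fallback can be pictured as a CF weight that shrinks when a user has little rating history. The sketch below is only an illustration of that idea: the function names and the linear ramp below the threshold are assumptions, not the project's actual schedule.

```python
def adaptive_alpha(n_ratings, base_alpha=0.7, cold_start_threshold=5):
    """Return the CF weight for a user with `n_ratings` ratings.

    Assumption: below the threshold the CF weight grows linearly with the
    amount of history, so a brand-new user is scored by content features only.
    """
    if n_ratings >= cold_start_threshold:
        return base_alpha
    return base_alpha * n_ratings / cold_start_threshold


def blend(cf_score, cbf_score, alpha):
    """Weighted average of the two scores, where alpha is the CF weight."""
    return alpha * cf_score + (1.0 - alpha) * cbf_score


# Hypothetical scores for one item: CF predicts 4.5, CBF predicts 3.5.
for n in (0, 2, 5, 40):
    a = adaptive_alpha(n)
    print(f"{n:>2} ratings -> alpha={a:.2f}, blended={blend(4.5, 3.5, a):.2f}")
```

A user with no ratings gets the pure content-based score, and the blend moves toward the configured CF weight as history accumulates.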
Hybrid Recommender System (data flow):
- Collaborative Filtering (SVD Matrix Factorization) and Content-Based Filtering (TF-IDF Genres + User Profiles) feed into
- Adaptive Blending (α weight), which feeds into
- Diversity Reranker (MMR), which feeds into
- Explanation Generator
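The content-based branch pairs TF-IDF genre vectors with user profiles. A dependency-free sketch of that idea follows. The project itself relies on scikit-learn for TF-IDF; the function names, the `ln(N / df) + 1` weighting, the rating-weighted profile, and the toy catalogue below are all illustrative assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(item_genres):
    """Map item_id -> L2-normalised {genre: tf-idf weight}."""
    n_items = len(item_genres)
    # Document frequency: number of items carrying each genre.
    doc_freq = Counter(g for genres in item_genres.values() for g in set(genres))
    vectors = {}
    for item_id, genres in item_genres.items():
        counts = Counter(genres)
        # Assumed weighting: tf * (ln(N / df) + 1).
        vec = {g: tf * (math.log(n_items / doc_freq[g]) + 1.0) for g, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors[item_id] = {g: w / norm for g, w in vec.items()} if norm else {}
    return vectors


def user_profile(vectors, user_ratings):
    """Rating-weighted sum of the vectors of items the user has rated."""
    profile = Counter()
    for item_id, rating in user_ratings.items():
        for genre, weight in vectors[item_id].items():
            profile[genre] += rating * weight
    return dict(profile)


def cosine(a, b):
    """Cosine similarity between two sparse {genre: weight} vectors."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy catalogue (made-up genre assignments, for illustration only).
item_genres = {
    1: ["Action", "Sci-Fi"],
    2: ["Action", "Adventure"],
    3: ["Drama", "Romance"],
    4: ["Adventure", "Sci-Fi"],
}
vectors = tfidf_vectors(item_genres)
profile = user_profile(vectors, {1: 5.0})  # the user rated item 1 with 5 stars
scores = {i: cosine(profile, v) for i, v in vectors.items() if i != 1}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

Items that share a genre with the user's history score above zero, while the Drama/Romance item scores zero because it has no genre overlap with the profile.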
| Component | Technology | Purpose |
|---|---|---|
| Collaborative Filter | SVD (scikit-surprise) | Learn latent user/item factors from ratings |
| Content-Based Filter | TF-IDF (scikit-learn) | Build item profiles from genres, user profiles from history |
| Hybrid Combiner | Weighted average | Blend CF and CBF scores with adaptive α |
| Diversity Reranker | MMR algorithm | Balance relevance vs. diversity in final list |
| Explanation Generator | Rule-based NLG | Generate human-readable recommendation reasons |
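Maximal Marginal Relevance (MMR) builds the final list greedily: at each step it picks the candidate with the best trade-off between its own relevance and its similarity to items already chosen. The sketch below is a minimal version of that greedy loop, not the project's `DiversityReranker`; the function names, the Jaccard genre similarity, and the toy data are assumptions, and relevance is assumed to be scaled to [0, 1] so it is comparable with similarity.

```python
def jaccard(a, b):
    """Jaccard similarity between two genre sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mmr_rerank(candidates, similarity, n=10, lambda_param=0.5):
    """Greedy MMR over (item_id, relevance) pairs.

    lambda_param = 1.0 ranks purely by relevance; 0.0 ranks purely by
    dissimilarity to the items already selected.
    """
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < n:
        def mmr_score(item):
            # Penalise the closest match among the items picked so far.
            max_sim = max((similarity(item, chosen) for chosen, _ in selected), default=0.0)
            return lambda_param * remaining[item] - (1.0 - lambda_param) * max_sim

        best = max(remaining, key=mmr_score)
        selected.append((best, remaining.pop(best)))
    return selected


# Toy data: A and B share the same genres, C is different.
genres = {"A": {"Sci-Fi", "Action"}, "B": {"Sci-Fi", "Action"}, "C": {"Romance"}}
candidates = [("A", 0.95), ("B", 0.90), ("C", 0.70)]


def genre_similarity(i, j):
    return jaccard(genres[i], genres[j])


print(mmr_rerank(candidates, genre_similarity, n=2, lambda_param=0.5))
print(mmr_rerank(candidates, genre_similarity, n=2, lambda_param=1.0))
```

With these toy inputs, a balanced setting selects A and then C, because B duplicates A's genres, while a relevance-only setting keeps A and B.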
- Python 3.9 or higher
- pip package manager
# Clone the repository
git clone https://github.com/pronzzz/movie-recommendation-system.git
cd movie-recommendation-system
# Install dependencies
pip install numpy pandas scipy scikit-learn scikit-surprise requests
# Optional: Install development dependencies
pip install pytest pytest-cov black flake8

# Run tests to verify everything works
PYTHONPATH=src pytest tests/ -v

from movie_recommender import MovieLensLoader, HybridRecommender
# Load MovieLens dataset (auto-downloads if needed)
loader = MovieLensLoader(variant="100k")
ratings, movies = loader.load()
# Train hybrid model
recommender = HybridRecommender(alpha=0.7) # 70% CF, 30% CBF
recommender.fit(ratings, movies)
# Get recommendations with explanations
recs = recommender.recommend(user_id=1, n=10, explain=True)
for item_id, score, explanation in recs:
    movie = movies[movies["item_id"] == item_id].iloc[0]
    print(f"🎬 {movie['title']}")
    print(f"   Score: {score:.2f} | {explanation}\n")

Sample Output:
🎬 Star Wars (1977)
   Score: 4.72 | Recommended because you enjoy Sci-Fi and Action movies.
🎬 Raiders of the Lost Ark (1981)
   Score: 4.65 | Similar to 'Indiana Jones' which you rated highly.
🎬 The Matrix (1999)
   Score: 4.58 | Users with similar taste have rated this highly.
from movie_recommender.data import MovieLensLoader, train_test_split
# Load different dataset sizes
loader = MovieLensLoader(variant="100k") # Options: 100k, 1m, 10m, 20m
ratings, movies = loader.load()
# Get dataset statistics
stats = loader.get_statistics()
print(f"Users: {stats['n_users']}, Movies: {stats['n_movies']}")
print(f"Ratings: {stats['n_ratings']}, Sparsity: {stats['sparsity']:.2%}")
# Split for training/testing
train_data, test_data = train_test_split(ratings, test_size=0.2)

from movie_recommender.models import CollaborativeFilter, ContentBasedFilter
# Collaborative Filtering only
cf = CollaborativeFilter(n_factors=50, n_epochs=20)
cf.fit(ratings)
cf_recs = cf.recommend(user_id=1, n=10)
# Content-Based Filtering only
cbf = ContentBasedFilter(use_genres=True)
cbf.fit(ratings, movies)
cbf_recs = cbf.recommend(user_id=1, n=10)
# Find similar items
similar = cf.get_similar_items(item_id=50, n=5)

from movie_recommender.models import HybridRecommender
# Configure hybrid model
hybrid = HybridRecommender(
    alpha=0.7,  # CF weight (0.0 = pure CBF, 1.0 = pure CF)
    cold_start_threshold=5,  # Users with fewer ratings get more CBF
    cf_params={"n_factors": 100, "n_epochs": 25},
)
hybrid.fit(train_data, movies)
# Recommendations with explanations
recs = hybrid.recommend(user_id=42, n=10, explain=True)

from movie_recommender.models import DiversityReranker
# Get base recommendations
base_recs = recommender.recommend(user_id=1, n=50)
# Re-rank for diversity
reranker = DiversityReranker(lambda_param=0.5) # 0=diversity, 1=relevance
diverse_recs = reranker.rerank(base_recs, n=10)

from movie_recommender.evaluation import evaluate_recommender
metrics = evaluate_recommender(
    recommender=hybrid,
    test_data=test_data,
    train_data=train_data,
    k=10,
    relevance_threshold=4.0,
)
print(f"RMSE: {metrics['rmse']:.3f}")
print(f"Precision@10: {metrics['precision@k']:.3f}")
print(f"NDCG@10: {metrics['ndcg@k']:.3f}")
print(f"Coverage: {metrics['coverage']:.2%}")
print(f"Novelty: {metrics['novelty']:.2f}")

| Metric | Description | Optimal |
|---|---|---|
| RMSE | Root Mean Square Error for rating prediction | Lower ↓ |
| MAE | Mean Absolute Error | Lower ↓ |
| Precision@K | Fraction of relevant items in top-K | Higher ↑ |
| Recall@K | Fraction of relevant items retrieved | Higher ↑ |
| NDCG@K | Normalized Discounted Cumulative Gain | Higher ↑ |
| Hit Rate | Users with at least one relevant recommendation | Higher ↑ |
| MRR | Mean Reciprocal Rank of first relevant item | Higher ↑ |
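To make two of the ranking metrics concrete, here is a small sketch of Precision@K and NDCG@K with binary relevance, where an item counts as relevant if the user's held-out rating meets the relevance threshold. The function names and the toy lists are hypothetical, and the project's own metrics module may differ in details such as graded relevance.

```python
import math


def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    return sum(1 for item in recommended[:k] if item in relevant) / k


def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: DCG of the list divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 -> log2(2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


# Toy example: items 10 and 30 are hits; relevant item 99 was never recommended.
recommended = [10, 20, 30, 40, 50]
relevant = {10, 30, 99}
print(f"Precision@5: {precision_at_k(recommended, relevant, 5):.3f}")
print(f"NDCG@5: {ndcg_at_k(recommended, relevant, 5):.3f}")
```

Precision@K ignores where the hits sit inside the top K, whereas NDCG@K rewards placing them near the top.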
| Metric | Description | Purpose |
|---|---|---|
| Coverage | % of catalog ever recommended | Reduce filter bubbles |
| Diversity | Intra-list pairwise dissimilarity | Avoid repetitive lists |
| Novelty | Average inverse popularity | Promote long-tail discovery |
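The beyond-accuracy metrics can be sketched in a few lines. The definitions below are common textbook forms, not necessarily the ones this project implements. In particular, novelty is computed here as mean self-information, `-log2(popularity)`, which is one assumed reading of "average inverse popularity". All function names are hypothetical.

```python
import math
from itertools import combinations


def catalog_coverage(rec_lists, n_catalog_items):
    """Share of the catalog that appears in at least one recommendation list."""
    recommended = {item for recs in rec_lists for item in recs}
    return len(recommended) / n_catalog_items


def novelty(recs, popularity):
    """Mean self-information of the list; `popularity` maps each item to the
    fraction of users who rated it (assumed to lie in (0, 1])."""
    return sum(-math.log2(popularity[item]) for item in recs) / len(recs)


def intra_list_diversity(recs, similarity):
    """Average pairwise dissimilarity (1 - similarity) within one list."""
    pairs = list(combinations(recs, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - similarity(a, b) for a, b in pairs) / len(pairs)


# Toy inputs for illustration only.
print(catalog_coverage([[1, 2], [2, 3]], n_catalog_items=10))
print(novelty([1, 2], {1: 0.5, 2: 0.25}))
print(intra_list_diversity([1, 2, 3], lambda a, b: 1.0 if a == b else 0.0))
```

Rarely rated items contribute more to novelty, so a list made only of blockbusters scores low even when its accuracy is high.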
movie-recommendation-system/
├── pyproject.toml              # Package metadata & dependencies
├── README.md                   # This file
├── LICENSE                     # MIT License
├── CONTRIBUTING.md             # Contribution guidelines
├── GUIDE.md                    # Detailed implementation guide
│
├── src/movie_recommender/      # Main package
│   ├── data/                   # Data loading & preprocessing
│   │   ├── loader.py           # MovieLens dataset loader
│   │   └── preprocessing.py    # Train/test split, normalization
│   │
│   ├── models/                 # Recommendation algorithms
│   │   ├── base.py             # Abstract base recommender
│   │   ├── cf.py               # Collaborative filtering (SVD)
│   │   ├── cbf.py              # Content-based filtering (TF-IDF)
│   │   ├── hybrid.py           # Hybrid combiner
│   │   └── reranker.py         # Diversity re-ranking (MMR)
│   │
│   ├── evaluation/             # Metrics & evaluation
│   │   └── metrics.py          # RMSE, Precision@K, NDCG, etc.
│   │
│   ├── explainability/         # Explanation generation
│   │   └── explanations.py     # Human-readable reasons
│   │
│   └── utils.py                # Logging, formatting helpers
│
├── tests/                      # Unit tests (44 tests)
│   ├── test_data.py            # Data layer tests
│   ├── test_models.py          # Model tests
│   └── test_evaluation.py      # Metrics tests
│
├── examples/                   # Usage examples
│   └── demo_usage.py           # End-to-end demo script
│
└── data/                       # Downloaded datasets (gitignored)
This project uses the MovieLens datasets from GroupLens Research.
| Dataset | Users | Movies | Ratings | Size |
|---|---|---|---|---|
| MovieLens 100K | 943 | 1,682 | 100,000 | 5 MB |
| MovieLens 1M | 6,040 | 3,706 | 1,000,209 | 25 MB |
| MovieLens 10M | 69,878 | 10,677 | 10,000,054 | 265 MB |
| MovieLens 20M | 138,493 | 27,278 | 20,000,263 | 500 MB |
Datasets are automatically downloaded on first use and cached in the data/ directory.
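The figures in the table also show how sparse each variant is. As a rough check, the snippet below computes sparsity as one minus the ratio of observed ratings to all possible user-movie pairs. This definition is an assumption about what the loader's statistics report, and the `sparsity` helper is hypothetical.

```python
# (users, movies, ratings) taken from the dataset table.
DATASETS = {
    "100k": (943, 1_682, 100_000),
    "1m": (6_040, 3_706, 1_000_209),
    "10m": (69_878, 10_677, 10_000_054),
    "20m": (138_493, 27_278, 20_000_263),
}


def sparsity(n_users, n_movies, n_ratings):
    """Fraction of the user-movie matrix that has no rating."""
    return 1.0 - n_ratings / (n_users * n_movies)


for name, (n_users, n_movies, n_ratings) in DATASETS.items():
    print(f"{name}: {sparsity(n_users, n_movies, n_ratings):.2%} of cells are empty")
```

Every variant leaves well over 90% of the matrix empty, which is why the collaborative branch uses matrix factorization instead of working on the raw ratings matrix.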
# Run all tests
PYTHONPATH=src pytest tests/ -v
# Run with coverage report
PYTHONPATH=src pytest tests/ --cov=src/movie_recommender --cov-report=html
# Run specific test file
PYTHONPATH=src pytest tests/test_models.py -v

Test Coverage:
- test_data.py - 12 tests for data loading and preprocessing
- test_models.py - 12 tests for CF, CBF, Hybrid, and Reranker
- test_evaluation.py - 20 tests for all evaluation metrics
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
# Clone and install dev dependencies
git clone https://github.com/pronzzz/movie-recommendation-system.git
cd movie-recommendation-system
pip install -e ".[dev]"
# Run linting
black src/ tests/
flake8 src/ tests/
# Run tests
pytest tests/ -v

This project is licensed under the MIT License - see the LICENSE file for details.
- GroupLens Research for the MovieLens datasets
- Surprise Library for SVD implementation
- Research papers on hybrid recommender systems and fairness in ML
Made with ❤️ by pronzzz