🏠 House Price Prediction using Machine Learning
📌 Project Overview
This project focuses on predicting house prices using multiple regression techniques and selecting the best-performing model through systematic evaluation and hyperparameter tuning. The goal is to build a robust, interpretable, and well-generalized regression pipeline suitable for real-world price prediction tasks.
🎯 Problem Statement
Predict house prices accurately from numerical and categorical features such as location, size, and other property characteristics.
🧠 Models Implemented
Linear Regression
Ridge Regression
Lasso Regression
Decision Tree Regressor
Random Forest Regressor
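A minimal sketch of how these five models might be instantiated side by side for comparison (the regularization strengths and random seed here are illustrative defaults, not the tuned values):

```python
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Candidate regressors evaluated in this project.
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=0.1),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
}
```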
🛠️ Tech Stack & Tools
Python
Pandas, NumPy
Scikit-learn
Matplotlib
Jupyter Notebook
🔄 ML Pipeline
Data Cleaning & Missing Value Imputation
One-Hot Encoding for Categorical Features
Train–Test Split
Feature Scaling (StandardScaler)
Model Training & Evaluation
Overfitting Analysis
Ensemble Learning
Hyperparameter Tuning
Feature Importance Analysis
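The steps above can be wired into a single scikit-learn `Pipeline`. A condensed sketch follows; the file path, target column, and feature names are placeholders, so adjust them to the actual dataset:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("housing.csv")      # placeholder path
X = df.drop(columns=["price"])       # "price" is a placeholder target name
y = df["price"]

# Hypothetical column groups; adjust to the real dataset.
numeric_cols = ["sqft", "bedrooms", "bathrooms"]
categorical_cols = ["location", "property_type"]

preprocessor = ColumnTransformer([
    # Numeric features: median imputation, then standard scaling.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    # Categorical features: mode imputation, then one-hot encoding.
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]),
     categorical_cols),
])

pipeline = Pipeline([("prep", preprocessor),
                     ("model", RandomForestRegressor(random_state=42))])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
pipeline.fit(X_train, y_train)
```

Bundling preprocessing into the pipeline ensures the imputer, encoder, and scaler are fit only on the training split, avoiding leakage into the test set.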
📊 Evaluation Metrics
R² Score
Mean Absolute Error (MAE)
Root Mean Squared Error (RMSE)
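All three metrics are available in scikit-learn. A short sketch, assuming the fitted `pipeline` and held-out split from the sketch above (the `np.sqrt` form of RMSE is used for compatibility across scikit-learn versions):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

y_pred = pipeline.predict(X_test)

r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE = sqrt(MSE)

print(f"R²: {r2:.3f}  MAE: {mae:,.0f}  RMSE: {rmse:,.0f}")
```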
📈 Model Performance Summary

| Model | R² Score | RMSE |
|---|---|---|
| Linear / Ridge / Lasso | ~0.336 | ~48,665 |
| Decision Tree | -0.39 | ~70,474 |
| Random Forest (Base) | 0.352 | ~48,092 |
| Random Forest (Tuned) | 0.369 ⭐ | 47,447 ⭐ |

⚙️ Hyperparameter Tuning
Hyperparameter optimization was performed using RandomizedSearchCV with 5-fold cross-validation to improve generalization and reduce RMSE while keeping computational cost reasonable.
Best Parameters:
n_estimators: 300
max_depth: 10
min_samples_split: 2
min_samples_leaf: 2
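A sketch of the search described above, reusing `pipeline`, `X_train`, and `y_train` from the earlier snippet. The search space is illustrative (it contains the best values reported above), and the `model__` prefixes route the parameters to the Random Forest step inside the pipeline:

```python
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

# Illustrative distributions, not the exact space used in the project.
param_distributions = {
    "model__n_estimators": randint(100, 500),
    "model__max_depth": [5, 10, 20, None],
    "model__min_samples_split": randint(2, 11),
    "model__min_samples_leaf": randint(1, 5),
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=50,                             # sampled configurations
    cv=5,                                  # 5-fold cross-validation
    scoring="neg_root_mean_squared_error",
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_)
```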
🔍 Feature Importance
Feature importance analysis was conducted using the tuned Random Forest model to identify the most influential features affecting house prices, improving model interpretability and validating domain relevance.
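One way to extract and plot the importances from the tuned model inside the fitted search, assuming a recent scikit-learn where `ColumnTransformer.get_feature_names_out()` resolves the one-hot-encoded column names:

```python
import pandas as pd
import matplotlib.pyplot as plt

best = search.best_estimator_  # tuned pipeline from the search
feature_names = best.named_steps["prep"].get_feature_names_out()
importances = best.named_steps["model"].feature_importances_

# Rank features by importance and keep the top 10 for readability.
top = (
    pd.Series(importances, index=feature_names)
      .sort_values(ascending=False)
      .head(10)
)
top.plot.barh()
plt.title("Top 10 feature importances (tuned Random Forest)")
plt.tight_layout()
plt.show()
```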
🏆 Final Model Selection
The tuned Random Forest Regressor was selected as the final model due to:
Highest R² score (0.369)
Lowest RMSE (47,447)
Better generalization than the linear baselines and the single Decision Tree
🚀 Key Learnings
Linear models may underperform on complex, non-linear datasets
Decision Trees tend to overfit without proper constraints
Ensemble models with tuning provide a better bias–variance tradeoff
Hyperparameter tuning is critical for production-ready ML models