This project analyzes and forecasts retail store sales using historical sales data and store-level metadata. The dataset combines daily sales records with store attributes such as promotions, assortment type, and competition. The project demonstrates the full ML lifecycle: data cleaning, feature engineering, exploratory data analysis, modeling, evaluation, and deployment via a Streamlit dashboard.
- Loaded train (daily sales) and store datasets using Python (pandas)
- Merged datasets on `StoreID`
- Converted the `Date` column to datetime and inspected data types
- Checked for missing values, duplicates, and basic statistics
- Saved merged and cleaned dataset for further processing
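A minimal sketch of the load-and-merge step. The inline frames stand in for the project's actual train and store CSVs, and the sample column values are illustrative only:

```python
import pandas as pd

# Toy stand-ins for the train (daily sales) and store (metadata) datasets.
train = pd.DataFrame({
    "StoreID": [1, 1, 2],
    "Date": ["2015-07-31", "2015-07-30", "2015-07-31"],
    "Sales": [5263, 5020, 6064],
})
store = pd.DataFrame({"StoreID": [1, 2], "StoreType": ["c", "a"]})

# Merge daily sales with store metadata on the shared key.
merged = train.merge(store, on="StoreID", how="left")

# Convert Date to datetime, then run basic sanity checks.
merged["Date"] = pd.to_datetime(merged["Date"])
print(merged.dtypes)
print(merged.isna().sum())
print(merged.duplicated().sum())

# Persist the merged dataset for the next step.
merged.to_csv("merged_dataset.csv", index=False)
```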
- Handled missing values for: `CompetitionDistance`, `CompetitionOpenSinceMonth`, `CompetitionOpenSinceYear`, `Promo2SinceWeek`, `Promo2SinceYear`, `PromoInterval`
- Standardized `StateHoliday` (categorical → numeric)
- Extracted time-based features: `Year`, `Month`, `Day`, `WeekOfYear`, `DayOfYear`
- Created competition features: `CompetitionOpenDate`, `DaysSinceCompetitionOpen`
- Created promotion feature: `IsPromoMonth`
- One-hot encoded `StoreType` and `Assortment`
- Final dataset: 29 ML-ready features
- Exported as → `cleaned_dataset.csv`
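A condensed sketch of these transformations on a toy frame (the notebook works on the full merged dataset; the median fill strategy is an assumption, and the competition/`IsPromoMonth` features are omitted for brevity):

```python
import pandas as pd

# Toy frame standing in for the merged dataset; column names follow the README.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2015-07-31", "2015-12-15"]),
    "StateHoliday": ["0", "a"],
    "StoreType": ["c", "a"],
    "Assortment": ["a", "c"],
    "CompetitionDistance": [1270.0, None],
})

# Fill missing competition distance (median is one reasonable choice).
df["CompetitionDistance"] = df["CompetitionDistance"].fillna(
    df["CompetitionDistance"].median()
)

# Standardize StateHoliday: categorical codes -> numeric.
df["StateHoliday"] = df["StateHoliday"].map({"0": 0, "a": 1, "b": 2, "c": 3})

# Time-based features extracted from Date.
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day
df["WeekOfYear"] = df["Date"].dt.isocalendar().week.astype(int)
df["DayOfYear"] = df["Date"].dt.dayofyear

# One-hot encode StoreType and Assortment.
df = pd.get_dummies(df, columns=["StoreType", "Assortment"])
```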
- Focused on open stores (`Open == 1`)
- Generated 7 visualizations highlighting:
  - Promotions increase sales by ~39%
  - December is the peak month
  - Store type "b" outperforms others
  - Weekday trends: Monday strongest, Sunday weakest
  - Sales–Customers correlation: 0.82
- Saved visualizations in `/images`
- Added → `day3_explanations.txt`
- Notebook → `3_eda.ipynb`
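The ~39% promo uplift is the kind of figure that falls out of a simple comparison of mean sales with and without promotions, restricted to open stores; a minimal sketch with made-up numbers:

```python
import pandas as pd

# Toy sales frame; uplift = mean(sales with promo) / mean(sales without) - 1.
df = pd.DataFrame({
    "Open":  [1, 1, 1, 1, 0],
    "Promo": [1, 1, 0, 0, 0],
    "Sales": [8000, 7500, 5500, 5600, 0],
})

# Closed days would bias the no-promo average toward 0, so drop them first.
open_stores = df[df["Open"] == 1]
uplift = (
    open_stores.loc[open_stores["Promo"] == 1, "Sales"].mean()
    / open_stores.loc[open_stores["Promo"] == 0, "Sales"].mean()
    - 1
)
print(f"Promo uplift: {uplift:.0%}")
```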
- Removed closed-store rows and fixed `PromoInterval` logic
- Dropped irrelevant columns (`Date`, `CompetitionOpenDate`)
- Training dataset: 844,392 rows × 26 features
- Models trained:
  - Linear Regression → RMSE: 1295, MAE: 940, Score: 82.6%
  - HistGradientBoostingRegressor → RMSE: 929, Score: 91.0%
  - XGBoost (tuned) → RMSE: 870, Score: 93.6%
- Visualizations produced: feature importance, residuals, learning curves, actual vs predicted
- Notebook → `4_modeling.ipynb`
- Performed error analysis and residual checks
- Fixed feature mismatches between train and test sets
- Visualized actual vs predicted sales for key stores
- Configured `.gitignore` and serialized the XGBoost model using `joblib`
- Notebook → `5_evaluation.ipynb`
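A common cause of train/test feature mismatch is one-hot columns present in one split but not the other; `reindex` is a simple fix (column names here are hypothetical), after which the tuned model can be serialized with a single `joblib.dump` call:

```python
import pandas as pd

# Columns the model was trained on (hypothetical names).
train_cols = ["Promo", "StoreType_a", "StoreType_b", "StoreType_c"]

# Test split with a level training never saw (StoreType_d) and two
# expected columns missing (StoreType_b, StoreType_c).
test = pd.DataFrame({"Promo": [1, 0], "StoreType_a": [1, 0], "StoreType_d": [0, 1]})

# Drop unseen columns, add missing ones as 0, and fix the column order.
test_aligned = test.reindex(columns=train_cols, fill_value=0)

# The aligned frame can now be scored, and the model persisted, e.g.:
# joblib.dump(model, "xgb_model.joblib")
```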
- Built a robust prediction pipeline:
  - `preprocess()` → prepares input data
  - `predict_sales()` → runs XGBoost predictions
- Generated output artifacts:
  - `results_predictions.csv` → actual vs predicted sales
  - `metrics.csv` → RMSE, MAPE (~12%), mean error
  - `top_best_predictions.csv` / `top_worst_predictions.csv`
- Pipeline is modular, so new data can be scored on demand with a single call
- Pipeline code → `day6_pipeline.py`
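A plausible shape for the two pipeline functions (the feature list and date handling are simplified assumptions, and the dummy model stands in for the serialized XGBoost model that the real `day6_pipeline.py` loads):

```python
import pandas as pd

FEATURES = ["Promo", "Year", "Month"]  # illustrative subset of the 26 features

def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    """Prepare raw input rows: derive date parts, keep model features."""
    df = raw.copy()
    df["Date"] = pd.to_datetime(df["Date"])
    df["Year"] = df["Date"].dt.year
    df["Month"] = df["Date"].dt.month
    return df[FEATURES]

def predict_sales(raw: pd.DataFrame, model) -> pd.Series:
    """Preprocess the input, then run the model's predictions."""
    X = preprocess(raw)
    return pd.Series(model.predict(X), index=raw.index, name="PredictedSales")

# Stub model for demonstration; the real pipeline loads the joblib artifact.
class _DummyModel:
    def predict(self, X):
        return [1000.0] * len(X)

demo = pd.DataFrame({"Date": ["2015-07-31"], "Promo": [1]})
preds = predict_sales(demo, _DummyModel())
```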
- Created a fully interactive dashboard for retail sales forecasting:
  - Upload your own CSV or use the sample dataset (100 rows)
  - Run predictions with one click
  - Visualize actual vs predicted sales, error distributions, best/worst predictions
  - Download predictions CSV for further analysis
- Dashboard uses Plotly and Matplotlib/Seaborn for interactive and static visualizations
- App code → `7_app.py`
- Demo & repo → GitHub Link
- Python: pandas, numpy, matplotlib, seaborn, scikit-learn, XGBoost, joblib
- Jupyter Notebook
- Streamlit for interactive dashboard
- Git & GitHub
- Rossmann Store Sales (Kaggle)
- Used for learning and analysis purposes only.
- Feature engineering v2 (lags, rolling windows)
- Multi-step forecasting
- Advanced models: LSTM, Prophet, CatBoost
- Deployment improvements and cloud integration
- Clone the repository: `git clone https://github.com/y-india/retail-sales-analysis-project.git`
- Navigate to the project folder: `cd retail-sales-analysis-project`
- Install dependencies: `pip install -r requirements.txt`
- Run the Streamlit app: `streamlit run 7_app.py`
- Use the dashboard to upload your CSV or try the included sample dataset (`test_dataset_100.csv`)