
# My Machine Learn



Machine Learning pipeline with GPU acceleration (NVIDIA CUDA) for diabetes classification, developed by @Ashu11-A. It trains multiple classifiers with Early Stopping and renders an interactive 3D visualization on the GPU via OpenGL (vispy).


*Visual graphs of the trained models*






## 🧠 About the Project

This project implements a complete Machine Learning pipeline for diabetes classification, focusing on:

- Full GPU acceleration: preprocessing, training, and rendering are executed in the video card's VRAM via NVIDIA CUDA and OpenGL.
- Simultaneous model comparison: three classification algorithms are trained and compared side by side.
- Automated hyperparameter search: Early Stopping with a minimum improvement tolerance (`min_delta`) prevents search overfitting and premature stops due to noise.
- Interactive 3D visualization: plots display decision boundaries in 3D space (Insulin × Glucose × BMI), with a synchronized camera across all panels.

## 📊 Dataset

| Field | Detail |
| --- | --- |
| Source | Kaggle – Diabetes Dataset (John Da Silva) |
| Samples | 2,000 patients |
| Features used | Insulin, Glucose, BMI |
| Target | Outcome: 0 (non-diabetic) · 1 (diabetic) |
| Split | 70% train · 30% test (no shuffling; temporal order preserved) |
| Scaling | MinMaxScaler → [-1, 1] range in float32 |
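The split-and-scale recipe in the table can be sketched without any GPU library. Below is a minimal NumPy stand-in (the toy array and variable names are illustrative assumptions; the real pipeline reads Insulin, Glucose, and BMI from the Kaggle CSV):

```python
import numpy as np

# Stand-in data: 10 samples × 2 features (the real pipeline uses 3 features).
X = np.arange(20, dtype=np.float64).reshape(10, 2)

# 70/30 split with NO shuffling, preserving the original row order.
split = int(len(X) * 0.7)
X_train, X_test = X[:split], X[split:]

# Min-max scaling to [-1, 1] in float32, fitted on the training rows only.
lo, hi = X_train.min(axis=0), X_train.max(axis=0)

def scale(a):
    return (2 * (a - lo) / (hi - lo) - 1).astype(np.float32)

X_train_s, X_test_s = scale(X_train), scale(X_test)
```

Because the scaler is fitted on the training portion only, test samples outside the training range map outside [-1, 1], which matches how MinMaxScaler behaves when fitted this way.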

๐Ÿ—๏ธ Architecture

The project follows a layered architecture with a clear separation of responsibilities. Each layer is an independent Python subpackage:


```
My-Machine-Learn/
│
├── main.py                      ← single entry point
│
└── diabetes_ml/                 ← project namespace
    ├── config.py                ← PipelineConfig (frozen dataclass)
    ├── pipeline.py              ← DiabetesMLPipeline (orchestrator)
    │
    ├── data/
    │   ├── dataset.py           ← ProcessedDataset (CPU + GPU arrays)
    │   └── pipeline.py          ← DataPipeline (load → split → scale)
    │
    ├── training/
    │   ├── early_stopping.py    ← EarlyStopping + EarlyStoppingState
    │   ├── wrappers.py          ← Strategy: GPUModelWrapper + 3 models
    │   └── tuner.py             ← HyperparameterTuner (search loop)
    │
    └── visualization/
        ├── gpu_canvas.py        ← GPUScatterRow (vispy OpenGL)
        ├── grid.py              ← DecisionBoundaryGrid (3D mesh on GPU)
        ├── subplots.py          ← ModelSubplotBuilder (Markers via OpenGL)
        ├── tuning_plot.py       ← TuningPlotBuilder (matplotlib)
        └── interaction.py       ← DiabetesMLWindow (Qt window)
```

Applied design patterns:

| Pattern | Where |
| --- | --- |
| Strategy | `GPUModelWrapper`: adding a new model requires only a new subclass |
| Dataclass (frozen) | `PipelineConfig`: immutable and hashable configuration |
| Dependency Injection | shared camera injected into both `GPUScatterRow` components |
| Single Responsibility | each file contains exactly one responsibility |

## 🤖 Models Used

### K-Nearest Neighbors (KNN) – via cuML

KNN classifies a point based on the K nearest neighbors in the feature space. For each new sample, the algorithm calculates the Euclidean distance to all training points and assigns the majority class among the closest K.

- Searched hyperparameter: K (number of neighbors), values from 20 to 1,024
- Library: `cuml.neighbors.KNeighborsClassifier` (100% GPU execution)
- Pros: simple, no explicit training phase, interpretable
- Cons: slow inference for large datasets; sensitive to features on different scales (hence MinMaxScaler is essential)

Best K found: **26** → Test accuracy: **77.83%**
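The classification rule described above (Euclidean distance to every training point, majority vote among the K closest) fits in a few lines. A NumPy sketch with toy data, not the cuML implementation:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)     # distance to ALL points
    nearest = np.argsort(dists)[:k]                     # indices of the k closest
    return int(np.bincount(y_train[nearest]).argmax())  # majority class

X_train = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.0]), k=3))  # → 1
```

`cuml.neighbors.KNeighborsClassifier` mirrors the scikit-learn fit/predict API, so moving between CPU and GPU backends is mostly an import change.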


### Random Forest (RF) – via cuML

Random Forest is an ensemble of decision trees trained on random subsets of the data (bagging) and with random subsets of features at each split. The final prediction is made by majority voting among all trees.

- Searched hyperparameter: n_estimators (number of trees)
- Fixed configuration: `max_depth=5`, `random_state=42`
- Library: `cuml.ensemble.RandomForestClassifier` (parallel training on GPU)
- Pros: robust to overfitting, performs well without extensive tuning, naturally parallel
- Cons: less interpretable than a single tree; can be slow with many deep trees

Best N found: **254** → Test accuracy: **80.33%**
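The bagging-plus-voting core of the ensemble can be illustrated with depth-1 "stumps" standing in for full decision trees (everything below is a simplified toy, not the cuML implementation):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy target depending on two features

def fit_stump(X, y):
    """Depth-1 'tree': the single feature/threshold pair with best accuracy."""
    best = (0.0, 0, 0.0)  # (accuracy, feature, threshold)
    for f in range(X.shape[1]):
        for t in X[::5, f]:  # candidate thresholds, subsampled for speed
            acc = ((X[:, f] > t).astype(int) == y).mean()
            best = max(best, (acc, f, t))
    return best[1], best[2]

# Bagging: each stump is trained on a bootstrap resample of the data.
stumps = []
for _ in range(15):
    idx = rng.integers(0, len(X), size=len(X))  # sample WITH replacement
    stumps.append(fit_stump(X[idx], y[idx]))

def forest_predict(stumps, X):
    """Majority vote across all stumps, as in the RF voting step above."""
    votes = np.stack([(X[:, f] > t).astype(int) for f, t in stumps])
    return (votes.mean(axis=0) >= 0.5).astype(int)

acc = (forest_predict(stumps, X) == y).mean()
```

A real Random Forest also samples a random feature subset at every split, which decorrelates the trees further than plain bagging does.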


### Gradient Boosting (GB) – via XGBoost + CUDA

Gradient Boosting builds trees sequentially: each new tree is trained to correct the residual errors of the ensemble built so far, minimizing a loss function via gradient descent.

- Searched hyperparameter: n_estimators (number of estimators/boosting rounds)
- Fixed configuration: `max_depth=3`, `tree_method='hist'`, `device='cuda'`, `random_state=42`
- Library: `xgboost.XGBClassifier` with native CUDA backend
- Pros: generally the most accurate of the three; efficient with `tree_method='hist'`; natively accepts CuPy arrays without CPU↔GPU transfer overhead
- Cons: more sensitive to hyperparameters; sequential training limits parallelism compared to RF

Best N found: **684** → Test accuracy: **98.00%**
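For squared loss, "each tree corrects the previous residuals" reduces to repeatedly fitting a weak learner to `y - current_prediction`. A toy 1-D sketch with regression stumps (a simplification, not the XGBoost algorithm, which adds regularization and second-order gradient information):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = np.sin(3 * x)  # non-linear target the stumps must approximate

def fit_stump(x, r):
    """Regression stump: threshold + left/right means minimizing squared error."""
    best = (np.inf, None)
    for t in np.linspace(-0.9, 0.9, 19):
        lm, rm = r[x <= t].mean(), r[x > t].mean()
        err = ((r - np.where(x <= t, lm, rm)) ** 2).sum()
        if err < best[0]:
            best = (err, (t, lm, rm))
    return best[1]

pred, lr = np.zeros_like(y), 0.3  # lr damps each correction (shrinkage)
for _ in range(50):
    t, lm, rm = fit_stump(x, y - pred)     # fit a stump to the CURRENT residuals
    pred += lr * np.where(x <= t, lm, rm)  # add the damped correction

mse = float(((y - pred) ** 2).mean())
```

The shrinkage factor is why boosting trains sequentially: round N's residuals depend on every earlier round, unlike the independent trees of a Random Forest.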


โฑ๏ธ Early Stopping

The hyperparameter search uses Early Stopping with a minimum improvement tolerance, avoiding two common issues:

1. Premature stopping due to noise: small negative oscillations do not interrupt the search.
2. Unnecessarily long searches: if no model improves significantly for `patience_limit` consecutive steps, the search ends.
```python
# diabetes_ml/config.py
patience_limit: int = 150    # steps without improvement before stopping
min_delta: float = 0.01      # minimum improvement considered significant
initial_param: int = 20      # initial value of the searched hyperparameter
```
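These fields live in a frozen dataclass, which gives the configuration immutability and hashability for free. A reduced sketch (the real `PipelineConfig` may hold more fields than these three):

```python
from dataclasses import FrozenInstanceError, dataclass

@dataclass(frozen=True)
class PipelineConfig:
    """Reduced stand-in for diabetes_ml/config.py."""
    patience_limit: int = 150
    min_delta: float = 0.01
    initial_param: int = 20

cfg = PipelineConfig(min_delta=0.005)

try:
    cfg.min_delta = 0.02  # frozen=True: any assignment raises
    mutated = True
except FrozenInstanceError:
    mutated = False

# frozen + eq (both enabled here) also make instances hashable,
# so a config can key a dict of results:
cache = {cfg: "best params"}
```

Immutability means a config passed into the pipeline cannot be silently changed mid-run, and hashability lets equal configs share cached results.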

Decision logic at each step:

```
new_accuracy - best_accuracy > min_delta?
    ├── YES → updates best, resets patience
    └── NO  → increments patience
               └── patience >= patience_limit → stops for this model
```

The three models are searched in parallel step by step. The global search ends when all models hit their patience limit.
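The bookkeeping in the diagram fits in a small class. A hypothetical minimal version (the real `EarlyStopping` / `EarlyStoppingState` split in `training/early_stopping.py` may differ in detail):

```python
class EarlyStopping:
    """Stop a search after `patience_limit` steps without a significant gain."""

    def __init__(self, patience_limit=150, min_delta=0.01):
        self.patience_limit = patience_limit
        self.min_delta = min_delta
        self.best = float("-inf")
        self.patience = 0

    def step(self, accuracy):
        """Record one result; return True while the search should continue."""
        if accuracy - self.best > self.min_delta:
            self.best, self.patience = accuracy, 0  # real improvement: reset
        else:
            self.patience += 1                      # noise or plateau
        return self.patience < self.patience_limit

es = EarlyStopping(patience_limit=2, min_delta=0.01)
print([es.step(a) for a in (0.70, 0.705, 0.704, 0.703)])
# → [True, True, False, False]  (gains under min_delta never reset patience)
```

Note that the 0.705 result does not reset patience even though it is the best value seen: only improvements larger than `min_delta` count, which is exactly what filters out noise.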


## 📈 Training Results

| Model | Best Parameter | Best Accuracy (Test) |
| --- | --- | --- |
| KNN | K = 26 | 77.83% |
| Random Forest | N = 254 | 80.33% |
| Gradient Boosting | N = 684 | 98.00% |

Gradient Boosting via XGBoost achieved the highest accuracy, which is expected: boosting algorithms tend to outperform bagging methods and simple instance-based learners when the data has complex non-linear relationships between features.


## 🎮 3D GPU Visualization

The visualization was reimplemented from matplotlib 3D (CPU) to vispy OpenGL (GPU):

| Aspect | Matplotlib (before) | vispy OpenGL (now) |
| --- | --- | --- |
| Rendering | Software (CPU) | OpenGL (GPU/VRAM) |
| Framerate when rotating | Low (~2–5 fps) | High (60+ fps) |
| Data buffer | Recalculated every frame | Sent once to VRAM |
| Synchronization | Event callbacks | Shared camera object |

Color legend:

| Color | Meaning |
| --- | --- |
| 🔵 Blue | Class 0, correct prediction |
| 🔴 Red | Class 1, correct prediction |
| 🟢 Mint Green | Class 0, incorrect prediction |
| 🟡 Yellow | Class 1, incorrect prediction |

## 📦 Requirements

- UV (package and project manager)
- NVIDIA GPU with CUDA support (>= 11.8)
- CUDA Toolkit installed on the system

Main dependencies:

| Package | Usage |
| --- | --- |
| cuml | GPU-accelerated KNN and Random Forest |
| xgboost | Gradient Boosting with CUDA backend |
| cupy | Arrays in VRAM and GPU↔CPU transfers |
| vispy | 3D rendering via OpenGL |
| PyQt5 / PyQt6 | Window backend for vispy + matplotlib |
| matplotlib | Fine-tuning plot (accuracy vs. parameter) |
| scikit-learn | train_test_split, MinMaxScaler |
| pandas / numpy | Data manipulation |

## 🚀 Installation and Execution

```bash
# 1. Clone the repository
git clone https://github.com/Ashu11-A/My-Machine-Learn.git
cd My-Machine-Learn

# 2. Place the dataset in the project root
#    Download at: https://www.kaggle.com/datasets/johndasilva/diabetes

# 3. Install dependencies
uv sync

# 4. Run
uv run main.py
```

To customize the search parameters without modifying the code:

```python
from diabetes_ml.config import PipelineConfig
from diabetes_ml.pipeline import DiabetesMLPipeline

cfg = PipelineConfig(
    patience_limit=200,
    min_delta=0.005,
    initial_param=10,
)
DiabetesMLPipeline(cfg).run()
```
