A Machine Learning pipeline with GPU acceleration (NVIDIA CUDA) for diabetes classification, developed by @Ashu11-A. It trains multiple classifiers with Early Stopping and renders an interactive 3D visualization on the GPU via OpenGL (vispy).
- About the Project
- Dataset
- Architecture
- Models Used
- Early Stopping
- Training Results
- 3D GPU Visualization
- Requirements
- Installation and Execution
This project implements a complete Machine Learning pipeline for diabetes classification, focusing on:
- Full GPU Acceleration — preprocessing, training, and rendering are executed in the video card's VRAM via NVIDIA CUDA and OpenGL.
- Simultaneous model comparison — three classification algorithms are trained and compared side by side.
- Automated hyperparameter search — Early Stopping with a minimum improvement tolerance (`min_delta`) prevents search overfitting and premature stops due to noise.
- Interactive 3D visualization — plots display decision boundaries in 3D space (Insulin × Glucose × BMI), with a synchronized camera across all panels.
| Field | Detail |
|---|---|
| Source | Kaggle — Diabetes Dataset (John Da Silva) |
| Samples | 2,000 patients |
| Features used | Insulin, Glucose, BMI |
| Target | Outcome — 0 (non-diabetic) · 1 (diabetic) |
| Split | 70% train · 30% test (no shuffling, temporal order preserved) |
| Scaling | MinMaxScaler → [-1, 1] range in float32 |
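The split and scaling rows above can be sketched in plain Python. This is a didactic sketch only: the project itself uses scikit-learn's `train_test_split` and `MinMaxScaler`, and the helper names below are illustrative.

```python
# Sketch of the split + scaling described in the table, in plain Python.
# (The real pipeline uses sklearn's train_test_split and MinMaxScaler.)

def ordered_split(rows, train_frac=0.70):
    """70/30 split with no shuffling: the first 70% of rows become train."""
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

def minmax_scale(values, lo=-1.0, hi=1.0):
    """Rescale values linearly into [lo, hi], like MinMaxScaler((-1, 1))."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

glucose = [80.0, 95.0, 110.0, 140.0, 170.0, 200.0, 85.0, 120.0, 155.0, 100.0]
train, test = ordered_split(glucose)
print(len(train), len(test))    # → 7 3
print(minmax_scale(train)[:2])  # → [-1.0, -0.75]
```

In the real pipeline the scaler's min/max should be fitted on the training portion only and then applied to the test portion, so no test-set statistics leak into training.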
The project follows a layered architecture with a clear separation of responsibilities. Each layer is an independent Python subpackage:
```
My-Machine-Learn/
│
├── main.py                    ← single entry point
│
└── diabetes_ml/               ← project namespace
    ├── config.py              ← PipelineConfig (frozen dataclass)
    ├── pipeline.py            ← DiabetesMLPipeline (orchestrator)
    │
    ├── data/
    │   ├── dataset.py         ← ProcessedDataset (CPU + GPU arrays)
    │   └── pipeline.py        ← DataPipeline (load → split → scale)
    │
    ├── training/
    │   ├── early_stopping.py  ← EarlyStopping + EarlyStoppingState
    │   ├── wrappers.py        ← Strategy: GPUModelWrapper + 3 models
    │   └── tuner.py           ← HyperparameterTuner (search loop)
    │
    └── visualization/
        ├── gpu_canvas.py      ← GPUScatterRow (vispy OpenGL)
        ├── grid.py            ← DecisionBoundaryGrid (3D mesh on GPU)
        ├── subplots.py        ← ModelSubplotBuilder (Markers via OpenGL)
        ├── tuning_plot.py     ← TuningPlotBuilder (matplotlib)
        └── interaction.py     ← DiabetesMLWindow (Qt window)
```
Applied design patterns:
| Pattern | Where |
|---|---|
| Strategy | GPUModelWrapper — adding a new model requires only a new subclass |
| Dataclass (frozen) | PipelineConfig — immutable and hashable configuration |
| Dependency Injection | Shared camera injected into both GPUScatterRow components |
| Single Responsibility | Each file contains exactly one responsibility |
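As a rough illustration of the Strategy row above, a wrapper hierarchy could look like the following. The class and method names here are hypothetical stand-ins, not the project's actual `GPUModelWrapper` API.

```python
from abc import ABC, abstractmethod

# Illustrative Strategy pattern: a common interface lets the tuner drive
# every model identically; a new model only needs a new subclass.

class ModelWrapper(ABC):
    """Strategy interface: one subclass per model family (names assumed)."""

    @abstractmethod
    def build(self, param: int):
        """Construct the underlying model for the current hyperparameter value."""

    @abstractmethod
    def param_name(self) -> str:
        """Name of the searched hyperparameter (for logging/plots)."""

class KNNWrapper(ModelWrapper):
    def build(self, param: int):
        # Stand-in for constructing cuml.neighbors.KNeighborsClassifier
        return {"model": "knn", "n_neighbors": param}

    def param_name(self) -> str:
        return "K"

wrappers = [KNNWrapper()]  # the tuner would iterate over all three wrappers
print(wrappers[0].build(26))  # → {'model': 'knn', 'n_neighbors': 26}
```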
KNN classifies a point based on the K nearest neighbors in the feature space. For each new sample, the algorithm calculates the Euclidean distance to all training points and assigns the majority class among the closest K.
- Searched hyperparameter: `K` (number of neighbors) — values from 20 to 1,024
- Library: `cuml.neighbors.KNeighborsClassifier` (100% GPU execution)
- Pros: simple, no explicit training phase, interpretable
- Cons: slow inference for large datasets; sensitive to features on different scales (hence MinMaxScaler is essential)

Best K found: 26 — Test accuracy: 77.83%
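The description above can be condensed into a from-scratch sketch. The project itself runs `cuml.neighbors.KNeighborsClassifier` on the GPU; this CPU toy version only illustrates the distance-plus-majority-vote idea, with made-up data points.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    dists = sorted(
        (math.dist(point, query), label)        # Euclidean distance to each point
        for point, label in zip(train_X, train_y)
    )
    top_k = [label for _, label in dists[:k]]   # labels of the k closest
    return Counter(top_k).most_common(1)[0][0]  # majority class

# Toy points in the same (Insulin, Glucose, BMI) space, scaled to [-1, 1]
train_X = [(-0.8, -0.7, -0.5), (-0.6, -0.9, -0.4), (0.7, 0.8, 0.6), (0.9, 0.6, 0.5)]
train_y = [0, 0, 1, 1]
print(knn_predict(train_X, train_y, (0.8, 0.7, 0.55), k=3))  # → 1
```

Because the vote is purely distance-based, features on larger scales would dominate the distance; this is why the pipeline's MinMaxScaler step matters.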
Random Forest is an ensemble of decision trees trained on random subsets of the data (bagging) and with random subsets of features at each split. The final prediction is made by majority voting among all trees.
- Searched hyperparameter: `n_estimators` (number of trees)
- Fixed configuration: `max_depth=5`, `random_state=42`
- Library: `cuml.ensemble.RandomForestClassifier` (parallel training on GPU)
- Pros: robust to overfitting, performs well without extensive tuning, naturally parallel
- Cons: less interpretable than a single tree; can be slow with many deep trees

Best N found: 254 — Test accuracy: 80.33%
Gradient Boosting builds trees sequentially: each new tree is trained to correct the residual errors of the previous tree, minimizing a loss function via gradient descent.
- Searched hyperparameter: `n_estimators` (number of estimators/rounds)
- Fixed configuration: `max_depth=3`, `tree_method='hist'`, `device='cuda'`, `random_state=42`
- Library: `xgboost.XGBClassifier` with native CUDA backend
- Pros: generally the highest-accuracy model among the three; efficient with `tree_method='hist'`; natively accepts CuPy arrays without CPU↔GPU transfer overhead
- Cons: more sensitive to hyperparameters; sequential training limits parallelism compared to RF

Best N found: 684 — Test accuracy: 98.00%
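The sequential residual-correction idea can be illustrated with a tiny regression example using decision stumps as weak learners. This is a didactic sketch, not the project's XGBoost setup: with squared-error loss the negative gradient equals the residual, so each new stump is fitted to the residuals of the current ensemble.

```python
# Toy gradient boosting on 1-D data: each stump corrects the residuals
# left by the ensemble built so far.

def fit_stump(x, residuals):
    """Best single-threshold split on x under a squared-error criterion."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gradient_boost(x, y, n_estimators=10, learning_rate=0.5):
    pred = [0.0] * len(x)
    stumps = []
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # negative MSE gradient
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: sum(learning_rate * s(xi) for s in stumps)

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = gradient_boost(x, y)
print(round(model(2), 2), round(model(6), 2))  # → 1.0 5.0
```

Each round shrinks the remaining residual by the learning rate, which is why boosting trains sequentially while Random Forest's trees can be trained in parallel.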
The hyperparameter search uses Early Stopping with a minimum improvement tolerance, avoiding two common issues:
- Premature stopping due to noise — small negative oscillations do not interrupt the search
- Unnecessarily long search — if no model improves significantly for `patience_limit` consecutive steps, the search ends
```python
# diabetes_ml/config.py
patience_limit: int = 150  # steps without improvement before stopping
min_delta: float = 0.01    # minimum improvement considered significant
initial_param: int = 20    # initial value of the searched hyperparameter
```

Decision logic at each step:

```
new_accuracy - best_accuracy > min_delta?
├── YES → updates best, resets patience
└── NO  → increments patience
          └── patience >= patience_limit → stops for this model
```
The three models are searched in parallel step by step. The global search ends when all models hit their patience limit.
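The decision logic above can be sketched as a small class. This is an assumed minimal shape, not necessarily the project's exact `EarlyStopping` implementation; the demo uses a tiny `patience_limit` so the stop is reachable in a few steps.

```python
class EarlyStopping:
    """Stop a search once improvements stay below min_delta for too long."""

    def __init__(self, patience_limit=150, min_delta=0.01):
        self.patience_limit = patience_limit
        self.min_delta = min_delta
        self.best_accuracy = float("-inf")
        self.patience = 0

    def step(self, new_accuracy):
        """Return True while the search should continue for this model."""
        if new_accuracy - self.best_accuracy > self.min_delta:
            self.best_accuracy = new_accuracy  # significant improvement
            self.patience = 0                  # reset patience
        else:
            self.patience += 1                 # noise or plateau
        return self.patience < self.patience_limit

stopper = EarlyStopping(patience_limit=3, min_delta=0.01)
for acc in [0.70, 0.75, 0.751, 0.752, 0.753]:  # gains after 0.75 are < min_delta
    if not stopper.step(acc):
        break
print(stopper.best_accuracy, stopper.patience)  # → 0.75 3
```

Note that 0.751, 0.752, and 0.753 are each a numerical improvement, but none exceeds `min_delta`, so they count as noise and exhaust the patience budget.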
| Model | Best Parameter | Best Accuracy (Test) |
|---|---|---|
| KNN | K = 26 | 77.83% |
| Random Forest | N = 254 | 80.33% |
| Gradient Boosting | N = 684 | 98.00% |
Gradient Boosting via XGBoost achieved the highest accuracy. This is expected: boosting methods tend to outperform bagging ensembles and instance-based learners such as KNN when the data contains complex non-linear relationships between features.
The visualization was reimplemented from matplotlib 3D (CPU) to vispy OpenGL (GPU):
| Aspect | Matplotlib (before) | vispy OpenGL (now) |
|---|---|---|
| Rendering | Software (CPU) | OpenGL (GPU/VRAM) |
| Framerate when rotating | Low (~2–5 fps) | High (60+ fps) |
| Data buffer | Recalculated every frame | Sent once to VRAM |
| Synchronization | Event callbacks | Shared camera object |
Color legend:
| Color | Meaning |
|---|---|
| 🔵 Blue | Class 0 — correct prediction |
| 🔴 Red | Class 1 — correct prediction |
| 🟢 Mint Green | Class 0 — incorrect prediction |
| 🟡 Yellow | Class 1 — incorrect prediction |
- UV (package and project manager)
- NVIDIA GPU with CUDA support >= 11.8
- CUDA Toolkit installed on the system
Main dependencies:
| Package | Usage |
|---|---|
| `cuml` | GPU-accelerated KNN and Random Forest |
| `xgboost` | Gradient Boosting with CUDA backend |
| `cupy` | Arrays in VRAM and GPU↔CPU transfers |
| `vispy` | 3D rendering via OpenGL |
| `PyQt5` / `PyQt6` | Window backend for vispy + matplotlib |
| `matplotlib` | Fine-tuning plot (accuracy vs. parameter) |
| `scikit-learn` | `train_test_split`, `MinMaxScaler` |
| `pandas` / `numpy` | Data manipulation |
```bash
# 1. Clone the repository
git clone https://github.com/Ashu11-A/My-Machine-Learn.git
cd My-Machine-Learn

# 2. Place the dataset in the project root
# Download at: https://www.kaggle.com/datasets/johndasilva/diabetes

# 3. Install dependencies
uv sync

# 4. Run
uv run main.py
```
To customize the search parameters without modifying the code:
```python
from diabetes_ml.config import PipelineConfig
from diabetes_ml.pipeline import DiabetesMLPipeline

cfg = PipelineConfig(
    patience_limit=200,
    min_delta=0.005,
    initial_param=10,
)

DiabetesMLPipeline(cfg).run()
```