🏀 NBA AI Evaluation & Model Selection Framework
📋 Executive Summary
In the shift from deterministic to probabilistic software, "vibes-based" testing is a liability. This project demonstrates a production-grade Evaluation (Evals) Framework used to benchmark multiple Large Language Models (LLMs) against a "Source of Truth" NBA dataset. By building a custom RAG (Retrieval-Augmented Generation) pipeline, I measured the trade-offs between Accuracy, Latency, and Cost across three local models and one cloud API to make a data-driven model selection for a sports-analytics product.
🛠 Skills Demonstrated
AI Evaluation (Evals): Designing a "Golden Dataset" and automated scoring rubric.
LLM-as-a-Judge: Implementing a secondary LLM to automate the QA of unstructured natural language outputs.
Model Selection & Benchmarking: Quantifying the performance of Qwen, Gemma, and Gemini models.
Data Integrity: Handling real-world data issues (leading-zero Game IDs, VRAM constraints).
AI Safety & Guardrails: Implementing instruction-following checks to prevent hallucinations and subjective bias.
🏗 System Architecture
The diagram below illustrates the end-to-end flow from official data retrieval to the final TPM decision dashboard.
```mermaid
graph TD
    subgraph "Data Acquisition"
        A[nba_api] -->|Fetch Box Scores| B[nba_golden_dataset.csv]
        C[Human Experts] -->|Define Questions| D[eval_questions.json]
    end
    subgraph "Inference Pipeline (RAG)"
        B -->|Context Injection| E[Python Test Runner]
        D -->|Queries| E
        E -->|Ollama| F[Local LLMs: Qwen/Gemma]
        E -->|Google AI| G[Gemini 2.5 Flash]
    end
    subgraph "Evaluation Framework"
        F & G -->|Raw Answers| H[multi_model_results.json]
        H --> I[LLM-as-a-Judge Scorer]
        I -->|Grading Pass/Fail| J[final_comparison_report.json]
    end
    subgraph "Product Leadership"
        J --> K[Streamlit Dashboard]
        K --> L[Model Selection Decision]
    end
```
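The heart of the diagram is the Python Test Runner. The sketch below shows one plausible shape for run_evals.py under this architecture; the Ollama model tags, the eval_questions.json fields (question, game_id), and the use of the ollama and google-generativeai Python packages are illustrative assumptions rather than the project's exact code.

```python
# Sketch of the RAG test runner: inject the matching box-score row as context,
# then send the same prompt to each local Ollama model and to Gemini.
import json
import os
import time

import ollama                        # local models served by Ollama
import google.generativeai as genai  # Gemini cloud API
import pandas as pd
from dotenv import load_dotenv

load_dotenv()  # picks up GOOGLE_API_KEY from .env
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

golden = pd.read_csv("nba_golden_dataset.csv", dtype={"GAME_ID": str})
questions = json.load(open("eval_questions.json"))

SYSTEM = ("Answer strictly from the provided box score. If the question is "
          "subjective or not answerable from the data, refuse to answer.")

def build_prompt(question: str, game_id: str) -> str:
    # Context injection: look up the golden box-score row for this game.
    row = golden.loc[golden["GAME_ID"] == game_id].to_dict("records")[0]
    return f"{SYSTEM}\n\nBOX SCORE:\n{row}\n\nQUESTION: {question}"

def ask_local(model: str, prompt: str) -> tuple[str, float]:
    start = time.time()
    resp = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return resp["message"]["content"], time.time() - start

def ask_gemini(prompt: str) -> tuple[str, float]:
    start = time.time()
    resp = genai.GenerativeModel("gemini-2.5-flash").generate_content(prompt)
    return resp.text, time.time() - start

results = []
for q in questions:
    prompt = build_prompt(q["question"], q["game_id"])
    for model in ("qwen2.5:7b", "gemma2:9b"):  # placeholder local model tags
        answer, latency = ask_local(model, prompt)
        results.append({"model": model, "question": q["question"],
                        "answer": answer, "latency_s": latency})
    answer, latency = ask_gemini(prompt)
    results.append({"model": "gemini-2.5-flash", "question": q["question"],
                    "answer": answer, "latency_s": latency})

json.dump(results, open("multi_model_results.json", "w"), indent=2)
```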
📊 Key Performance Indicators (KPIs)
To provide a 360-degree view of model effectiveness, the system tracks five primary metrics:
Accuracy (Pass Rate): Binary metric determined by the "Judge" LLM comparing answers to Ground Truth (see the scorer sketch after this list).
Semantic Similarity: A 1-5 score measuring how closely the model's phrasing aligns with the human-written truth.
Inference Latency: "Time-to-Answer" measured per request to ensure real-time product feasibility.
Cost of Correctness: A simulation of monthly API spend vs. accuracy at scale (e.g., 100k queries/month).
Instruction Following (Guardrails): The model's ability to refuse subjective questions (e.g., "Who is the GOAT?") when told to stay strictly within the provided data.
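The first two metrics are produced automatically by the judge stage. Below is a minimal sketch of how evaluate_results.py could grade each raw answer; the rubric wording, the choice of Gemini as the judge model, and the ground_truth field name are assumptions for illustration, not the project's exact schema.

```python
# Sketch of the LLM-as-a-Judge scorer: a second model grades every answer
# against the ground truth and emits a Pass/Fail plus a 1-5 similarity score.
import json
import os

import google.generativeai as genai
from dotenv import load_dotenv

load_dotenv()
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
judge = genai.GenerativeModel("gemini-2.5-flash")

RUBRIC = """You are a strict grader. Compare the ANSWER to the GROUND TRUTH.
Return JSON with exactly two keys:
  "pass": true only if every fact and stat in the answer matches the ground truth,
  "similarity": an integer 1-5 for how closely the phrasing matches.
GROUND TRUTH: {truth}
ANSWER: {answer}"""

def grade(answer: str, truth: str) -> dict:
    resp = judge.generate_content(
        RUBRIC.format(truth=truth, answer=answer),
        generation_config={"response_mime_type": "application/json"},
    )
    return json.loads(resp.text)  # e.g. {"pass": false, "similarity": 3}

results = json.load(open("multi_model_results.json"))
for r in results:
    r.update(grade(r["answer"], r["ground_truth"]))  # ground_truth: assumed field

json.dump(results, open("final_comparison_report.json", "w"), indent=2)
```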
🚀 Technical Challenges & Solutions
The "Leading Zero" Data Integrity Issue:
Problem: NBA Game IDs (e.g., 0042500101) were being truncated to integers (e.g., 42500101) by Pandas, breaking the RAG lookup.
Solution: Implemented explicit schema enforcement (dtype={'GAME_ID': str}) and string padding (zfill(10)) to ensure 100% retrieval accuracy.
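A minimal illustration of the fix (file and column names follow the pipeline above):

```python
import pandas as pd

# Read GAME_ID as a string so Pandas cannot coerce "0042500101" into 42500101.
golden = pd.read_csv("nba_golden_dataset.csv", dtype={"GAME_ID": str})

# Defensive re-padding in case an upstream source already stripped the zeros.
golden["GAME_ID"] = golden["GAME_ID"].str.zfill(10)

# The RAG lookup now matches the official 10-digit ID exactly.
context = golden.loc[golden["GAME_ID"] == "0042500101"]
```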
Hardware-Constrained Inference:
Problem: Large models (Gemma 27B) caused system hangs and VRAM exhaustion during bulk evaluation.
Solution: Built a Defensive Inference Wrapper with hard client-side timeouts (60s) and "GPU Cool-down" periods between model swaps to ensure pipeline reliability.
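A sketch of that wrapper is shown below. It calls Ollama's HTTP API through requests so the 60-second cap is enforced on the client side; the helper names, the cool-down length between model swaps, and the failure handling are illustrative assumptions.

```python
# Defensive inference wrapper: hard client-side timeout per request plus a
# GPU cool-down pause between model swaps so VRAM can drain before reloading.
import time

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
TIMEOUT_S = 60    # hard client-side cap per request (from the solution above)
COOLDOWN_S = 30   # assumed pause between model swaps

def safe_generate(model: str, prompt: str) -> str | None:
    """Return the model's answer, or None if the request hangs or errors."""
    try:
        resp = requests.post(
            OLLAMA_URL,
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=TIMEOUT_S,
        )
        resp.raise_for_status()
        return resp.json()["response"]
    except requests.RequestException as err:
        # Treat a hung or failed request as a missing answer rather than
        # letting one model crash the whole bulk evaluation.
        print(f"[WARN] {model} failed: {err}")
        return None

def run_models(models: list[str], prompts: list[str]) -> dict:
    answers = {}
    for model in models:
        answers[model] = [safe_generate(model, p) for p in prompts]
        time.sleep(COOLDOWN_S)  # GPU cool-down before loading the next model
    return answers
```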
📈 Dashboard Preview
The final Streamlit Dashboard provides a Model Selection Matrix. It allows stakeholders to toggle between prioritizing Accuracy (for historical record-keeping) or Latency (for live play-by-play updates), providing a clear ROI for each model choice.
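A compact sketch of that toggle, assuming app.py reads final_comparison_report.json and that each record carries pass and latency_s fields (column names are illustrative):

```python
# Sketch of the model-selection toggle in the Streamlit dashboard (app.py).
import json

import pandas as pd
import streamlit as st

report = pd.DataFrame(json.load(open("final_comparison_report.json")))

# One row per model: average pass rate and average latency.
matrix = report.groupby("model").agg(
    pass_rate=("pass", "mean"),
    avg_latency_s=("latency_s", "mean"),
).reset_index()

priority = st.radio("Optimize for", ["Accuracy (historical record-keeping)",
                                     "Latency (live play-by-play)"])

if priority.startswith("Accuracy"):
    matrix = matrix.sort_values("pass_rate", ascending=False)
else:
    matrix = matrix.sort_values("avg_latency_s")

st.dataframe(matrix, use_container_width=True)
```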
💻 Setup
Clone the repository.
Install dependencies: pip install -r requirements.txt.
Set up environment: Add GOOGLE_API_KEY to .env.
Run the pipeline in order:
python nba_data_pull.py (build the golden dataset)
python run_evals.py (query each model)
python evaluate_results.py (LLM-as-a-Judge scoring)
streamlit run app.py (launch the dashboard)