This repository contains a comprehensive framework for generating, analyzing, and evaluating Python code produced by Large Language Models (LLMs) based on user stories.
The core of this project is a robust Credibility Score calculation that moves beyond simple text similarity, evaluating generated code based on its structure, semantic correctness, and actual runtime performance within a secure Docker sandbox.
- **Multi-Model Orchestration**: Uses LangChain to interface with various LLMs (e.g. GPT-4o, Llama, DeepSeek) via OpenRouter (see the sketch after this list)
- **Prompt Engineering Variants**: Supports multiple prompting strategies, including Zero-Shot, Chain-of-Thought (CoT), and clustered contexts
- **Secure Execution Sandbox**: Runs generated code inside an isolated Docker container to measure execution success and runtime performance and to capture exceptions without risking the host system
- **Static & Semantic Analysis**: Syntax validation via AST parsing, style checking with `flake8`, and type checking with `mypy`
- **Structural Metrics**: Calculates Cyclomatic Complexity, AST Depth, Function Size, and Import Redundancy
- **Credibility Scoring**: Aggregates confidence (logprobs), structure, semantics, and execution metrics into a single 0–100 score
- **User Story Clustering**: Uses Word2Vec and Gaussian Mixture Models (GMM) to cluster and analyze input requirements
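As a rough illustration of the orchestration setup, the sketch below points a LangChain chat model at OpenRouter. It assumes the `langchain-openai` package and the environment variables described in the installation section; the model name, temperature, and prompt are placeholders rather than the project's actual configuration.

```python
import os

from langchain_openai import ChatOpenAI

# Minimal sketch: route a LangChain chat model through OpenRouter.
# The model name and temperature are illustrative placeholders.
llm = ChatOpenAI(
    model="openai/gpt-4o",
    api_key=os.environ["OPENROUTER_API_KEY"],
    base_url=os.environ["OPENROUTER_BASE_URL"],
    temperature=0.2,
)

story = "As a user, I want to reset my password via email."
response = llm.invoke(f"Write Python code that implements this user story:\n{story}")
print(response.content)
```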
```
├── config/                # Configuration files for experiments and paths
├── data/                  # Input user stories and clustered data
├── dockerfile             # Definition for the execution sandbox environment
├── pipelines/             # Core workflow scripts
│   ├── stories-to-code/   # Pipeline: Generating code from user stories
│   └── code-to-stories/   # Pipeline: Reverse engineering stories from code
├── results/               # Raw JSON responses and HTML reports
├── utils/                 # Metric calculation modules
│   ├── compute_credibility.py
│   ├── compute_code_execution_metrics.py
│   ├── compute_code_semantic_metrics.py
│   └── compute_code_structure_metrics.py
└── ...
```
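The metric modules under `utils/` implement the structural and semantic checks listed above. The following is only a sketch of the kind of analysis involved, assuming `radon` and the standard `ast` module; the function name and returned fields are illustrative, not the repository's actual API.

```python
import ast
from radon.complexity import cc_visit

def structural_metrics(code: str) -> dict:
    """Illustrative structural checks: syntax validity, cyclomatic complexity, AST depth."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return {"syntax_valid": False}

    def depth(node: ast.AST) -> int:
        # Longest path from this node down to a leaf of the AST.
        children = list(ast.iter_child_nodes(node))
        return 1 + max((depth(c) for c in children), default=0)

    complexities = [block.complexity for block in cc_visit(code)]
    return {
        "syntax_valid": True,
        "max_cyclomatic_complexity": max(complexities, default=0),
        "ast_depth": depth(tree),
        "num_functions": sum(isinstance(n, ast.FunctionDef) for n in ast.walk(tree)),
    }

print(structural_metrics("def add(a, b):\n    return a + b\n"))
```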
- Python 3.11+
- Docker Desktop (running)
Clone the repository:

```bash
git clone https://github.com/your-username/your-repo-name.git
cd your-repo-name
```

Install dependencies:

```bash
pip install -r requirements.txt
# Ensure analysis tools are installed
pip install flake8 mypy radon
```

Build the Docker sandbox. The execution metrics rely on a Docker image named `python-sandbox`:

```bash
docker build -t python-sandbox -f dockerfile .
```

Create environment variables. Add a `.env` file in the root directory:

```
OPENROUTER_API_KEY=your_key_here
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
```

The project is organized into pipelines. The primary workflow lives in `pipelines/stories-to-code`.
Generate code based on the user stories defined in `data/`:

```bash
python pipelines/stories-to-code/01_run_prompts.py
```

Compute structural, semantic, and execution metrics for the generated code. This step spins up Docker containers to safely execute the code:

```bash
python pipelines/stories-to-code/02_analyze_outputs.py
```
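For context, executing a generated snippet inside the `python-sandbox` image can be done roughly as follows. This is a simplified sketch built on `subprocess` and the Docker CLI; the actual logic in `utils/compute_code_execution_metrics.py` may differ.

```python
import subprocess
import tempfile
import time
from pathlib import Path

def run_in_sandbox(code: str, timeout: int = 10) -> dict:
    """Run generated code in the python-sandbox image and collect basic execution metrics."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "snippet.py"
        script.write_text(code)
        start = time.perf_counter()
        try:
            proc = subprocess.run(
                [
                    "docker", "run", "--rm",
                    "--network", "none",           # no network access inside the sandbox
                    "-v", f"{tmp}:/workspace:ro",  # mount the snippet read-only
                    "python-sandbox",
                    "python", "/workspace/snippet.py",
                ],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            elapsed = time.perf_counter() - start
            return {
                "success": proc.returncode == 0,
                "execution_time": elapsed,
                "exception": proc.stderr.strip() if proc.returncode != 0 else None,
            }
        except subprocess.TimeoutExpired:
            return {"success": False, "execution_time": timeout, "exception": "timeout"}
```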
Create detailed HTML reports visualizing the Credibility Score and metric breakdowns:

```bash
python pipelines/stories-to-code/03_generate_reports.py
python pipelines/stories-to-code/05_generate_final_report.py
```

The `compute_credibility` function calculates a weighted score (0–100) based on four pillars:
- **Confidence (30%)**: Based on the LLM's token log probabilities and perplexity
- **Structure (15%)**: Code maintainability metrics such as Cyclomatic Complexity, nesting depth, and function size
- **Semantic (25%)**: Adherence to Python standards, including syntax validity, Flake8 errors, and Mypy typing errors
- **Execution (30%)**: Runtime behavior: successful execution, execution time, and absence of runtime exceptions
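Conceptually, the aggregation is a weighted sum over the four pillars. The sketch below is illustrative only: it assumes each pillar score has already been normalized to the 0–1 range, and the real implementation in `utils/compute_credibility.py` may weight and normalize sub-metrics differently.

```python
# Illustrative weights taken from the pillar breakdown above.
WEIGHTS = {
    "confidence": 0.30,
    "structure": 0.15,
    "semantic": 0.25,
    "execution": 0.30,
}

def compute_credibility(pillars: dict[str, float]) -> float:
    """Combine per-pillar scores (each assumed normalized to 0-1) into a 0-100 score."""
    score = sum(
        WEIGHTS[name] * max(0.0, min(1.0, value))  # clamp each pillar to [0, 1]
        for name, value in pillars.items()
    )
    return round(100 * score, 2)

print(compute_credibility({"confidence": 0.8, "structure": 0.9, "semantic": 0.7, "execution": 1.0}))
# -> 85.0
```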
To analyze the diversity of user stories before generation:

```bash
python data/stories_to_clusters.py
```

This script uses Word2Vec embeddings and Gaussian Mixture Models to determine optimal clusters for the input requirements.
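A condensed sketch of that approach is shown below, assuming `gensim` and `scikit-learn`; the tokenization, vector size, and BIC-based model selection are assumptions about the general technique, not the script's exact parameters.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.mixture import GaussianMixture

stories = [
    "As a user, I want to reset my password via email.",
    "As an admin, I want to export monthly usage reports.",
    "As a user, I want to log in with my Google account.",
]
tokenized = [s.lower().replace(".", "").replace(",", "").split() for s in stories]

# Train a small Word2Vec model and represent each story as the mean of its word vectors.
w2v = Word2Vec(sentences=tokenized, vector_size=50, min_count=1, epochs=50, seed=42)
embeddings = np.array([np.mean([w2v.wv[t] for t in tokens], axis=0) for tokens in tokenized])

# Pick the number of clusters with the lowest BIC over a small candidate range.
best_k, best_bic, best_model = None, np.inf, None
for k in range(1, 3):
    gmm = GaussianMixture(n_components=k, covariance_type="diag", random_state=42).fit(embeddings)
    bic = gmm.bic(embeddings)
    if bic < best_bic:
        best_k, best_bic, best_model = k, bic, gmm

print(best_k, best_model.predict(embeddings))
```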
MIT License