Academic Personal AI Infrastructure
Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
Toward Expert-Level Medical Text Validation with Language Models
Open WebUI Logic Tree Multi-Model LLM Project - Advanced AI reasoning with Sakana AI's AB-MCTS and sophisticated multi-model collaboration
Agentic Extensible Test Automation Framework
University for AI agents. 92 courses, 4400+ scenarios, any model via OpenRouter. Auto-training loops generate per-model SKILL.md documents. Works with Claude Code, OpenClaw, Cursor, Windsurf. No fine-tuning required.
A course on fine-tuning LLMs with Group Relative Policy Optimization (GRPO), a reinforcement learning method that improves model reasoning with minimal data. Covers RFT concepts, reward design, LLM-as-a-judge evaluation, and deploying jobs on the Predibase platform.
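As a rough illustration of the group-relative idea behind GRPO (a sketch under my own assumptions, not material from the course or the Predibase platform): rewards for a group of completions sampled from one prompt are normalized against each other, so a completion is only credited for beating its peers.

```python
# Illustrative sketch (not from the course): the "group relative" step in GRPO.
# For one prompt, sample a group of completions, score each with a reward
# function (which could itself be an LLM judge), and normalize rewards within
# the group so each completion's advantage is relative to its peers.
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Map raw rewards for one prompt's completion group to advantages."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        return [0.0 for _ in rewards]           # no learning signal if all rewards tie
    return [(r - mu) / sigma for r in rewards]  # z-score within the group

# Example: four sampled answers to the same prompt, each judged on a 0-1 scale.
print(group_relative_advantages([0.2, 0.9, 0.5, 0.9]))
```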
🚀 Production-grade, full-stack agentic ecosystem: DB-first design with Turso & Drizzle, generative UI built with Next.js & Server-Sent Events (SSE), and a robust LLM-as-Judge evaluation suite. Fully containerized and orchestrated via Kubernetes & Helm, with modern DevOps through GitHub Actions & GHCR.io for scalable cloud-native deployment.
Rubric-driven AI homework grading system built as a Claude Code Skill. Score student submissions with CoT reasoning, bias mitigation, and PDCA quality cycle.
A benchmark, alignment pipeline, and LLM-as-a-Judge for evaluating the clinical impact of ASR errors.
No-code desktop app for generating high-quality synthetic datasets to fine-tune LLMs — plan-then-execute pipeline, LLM-as-judge, HuggingFace upload.
VEX-HALT — Benchmark suite for AI verification systems. 443+ tests for calibration, robustness, honesty, and proof integrity.
Production-grade Playwright + TypeScript QA framework with AI-powered testing, LLM-as-Judge evaluation, MCP server, 7 CLI agents, security fuzzing, CI/CD pipelines, Jira sync, and Slack reporting — zero-config, plug-and-play.
Production prompt regression testing for agentic flows
Quantifying uncertainty in LLM-as-judge evals with conformal prediction.
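A minimal split-conformal sketch of how such intervals can be attached to judge scores, assuming a calibration set of paired judge and human ratings; the function name and example data are illustrative, not this repository's API.

```python
# Minimal split-conformal sketch (illustrative, not this repo's interface):
# wrap an LLM judge's numeric score in an interval intended to cover the human
# score with probability >= 1 - alpha, assuming exchangeable calibration data.
import math

def conformal_interval(judge_scores, human_scores, new_judge_score, alpha=0.1):
    """Calibrate on (judge, human) score pairs, then bound a new judge score."""
    residuals = sorted(abs(j - h) for j, h in zip(judge_scores, human_scores))
    n = len(residuals)
    # Conformal quantile index ceil((n + 1) * (1 - alpha)), clipped to n here
    # as a practical simplification for small calibration sets.
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    q = residuals[k - 1]
    return new_judge_score - q, new_judge_score + q

# Example: 1-5 rubric scores from a judge vs. human raters.
judge = [4.0, 3.5, 2.0, 5.0, 3.0, 4.5, 1.5, 4.0]
human = [4.0, 3.0, 2.5, 4.5, 3.5, 4.0, 2.0, 3.5]
print(conformal_interval(judge, human, new_judge_score=3.5, alpha=0.2))
```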
Autonomous agent-to-agent marketplace with live Karpathy loop self-improvement. Agents discover, hire, benchmark, and evolve programmatically. MPP/x402/MCP. No humans in the loop.
Evaluation system for computer-use agents that uses LLMs to assess agent performance on web browsing and interaction tasks. This judge system reads screenshots, agent trajectories, and final results to provide detailed scoring and feedback.
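The general pattern such a judge follows can be sketched as below; the rubric, model name, and `judge_episode` helper are illustrative assumptions, not this project's actual interface.

```python
# Hedged sketch of the LLM-as-judge pattern for grading an agent episode; the
# prompt, model choice, and function name are assumptions for illustration.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a computer-use agent on a web task.
Task: {task}
Agent trajectory (actions taken, in order):
{trajectory}
Final result reported by the agent: {result}

Return JSON with keys "score" (0-10) and "feedback" (one paragraph)."""

def judge_episode(task: str, trajectory: list[str], result: str) -> dict:
    prompt = JUDGE_PROMPT.format(
        task=task, trajectory="\n".join(trajectory), result=result
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                      # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for parseable JSON
    )
    return json.loads(resp.choices[0].message.content)

# Screenshots would typically be attached as image inputs alongside the text.
```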
Free, local Langfuse OSS setup with Ollama for LLM evaluation, scoring, and datasets.
A multi-agent public opinion analysis system. It scrapes Reddit & Twitter, analyzes sentiment, and uses a Critic-Synthesis debate loop to uncover what the internet actually thinks.
CRC-Screen: Certified DNA-Synthesis Hazard Screening Under Taxonomic Shift