Academic Personal AI Infrastructure
Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
Toward Expert-Level Medical Text Validation with Language Models
Open WebUI Logic Tree Multi-Model LLM Project - Advanced AI reasoning with Sakana AI's AB-MCTS and sophisticated multi-model collaboration
Agentic Extensible Test Automation Framework
University for AI agents. 92 courses, 4400+ scenarios, any model via OpenRouter. Auto-training loops generate per-model SKILL.md documents. Works with Claude Code, OpenClaw, Cursor, Windsurf. No fine-tuning required.
A course on fine-tuning LLMs with Group Relative Policy Optimization (GRPO), a reinforcement learning method that improves model reasoning with minimal data. Covers RFT concepts, reward design, LLM-as-a-judge evaluation, and deploying jobs on the Predibase platform.
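As a rough illustration of the group-relative idea behind GRPO (a sketch under my own assumptions, not material from the course or the Predibase platform): rewards for a group of completions sampled from one prompt are normalized against each other, so a completion is only credited for beating its peers.

```python
# Illustrative sketch (not from the course): the "group relative" step in GRPO.
# For one prompt, sample a group of completions, score each with a reward
# function (which could itself be an LLM judge), and normalize rewards within
# the group so each completion's advantage is relative to its peers.
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Map raw rewards for one prompt's completion group to advantages."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        return [0.0 for _ in rewards]           # no learning signal if all rewards tie
    return [(r - mu) / sigma for r in rewards]  # z-score within the group

# Example: four sampled answers to the same prompt, each judged on a 0-1 scale.
print(group_relative_advantages([0.2, 0.9, 0.5, 0.9]))
```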
🚀 Production-grade, full-stack agentic ecosystem: DB-first design with Turso & Drizzle, generative UI built with Next.js & Server-Sent Events (SSE), and a robust LLM-as-Judge evaluation suite. Fully containerized and orchestrated via Kubernetes & Helm, with modern DevOps through GitHub Actions & GHCR.io for scalable cloud-native deployment.
Rubric-driven AI homework grading system built as a Claude Code Skill. Score student submissions with CoT reasoning, bias mitigation, and PDCA quality cycle.
A benchmark, alignment pipeline, and LLM-as-a-Judge for evaluating the clinical impact of ASR errors.
No-code desktop app for generating high-quality synthetic datasets to fine-tune LLMs — plan-then-execute pipeline, LLM-as-judge, HuggingFace upload.
VEX-HALT — Benchmark suite for AI verification systems. 443+ tests for calibration, robustness, honesty, and proof integrity.
Production-grade Playwright + TypeScript QA framework with AI-powered testing, LLM-as-Judge evaluation, MCP server, 7 CLI agents, security fuzzing, CI/CD pipelines, Jira sync, and Slack reporting — zero-config, plug-and-play.
Production prompt regression testing for agentic flows
Quantifying uncertainty in LLM-as-judge evals with conformal prediction.
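A minimal split-conformal sketch of how such intervals can be attached to judge scores, assuming a calibration set of paired judge and human ratings; the function name and example data are illustrative, not this repository's API.

```python
# Minimal split-conformal sketch (illustrative, not this repo's interface):
# wrap an LLM judge's numeric score in an interval intended to cover the human
# score with probability >= 1 - alpha, assuming exchangeable calibration data.
import math

def conformal_interval(judge_scores, human_scores, new_judge_score, alpha=0.1):
    """Calibrate on (judge, human) score pairs, then bound a new judge score."""
    residuals = sorted(abs(j - h) for j, h in zip(judge_scores, human_scores))
    n = len(residuals)
    # Conformal quantile index ceil((n + 1) * (1 - alpha)), clipped to n here
    # as a practical simplification for small calibration sets.
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    q = residuals[k - 1]
    return new_judge_score - q, new_judge_score + q

# Example: 1-5 rubric scores from a judge vs. human raters.
judge = [4.0, 3.5, 2.0, 5.0, 3.0, 4.5, 1.5, 4.0]
human = [4.0, 3.0, 2.5, 4.5, 3.5, 4.0, 2.0, 3.5]
print(conformal_interval(judge, human, new_judge_score=3.5, alpha=0.2))
```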
Autonomous agent-to-agent marketplace with live Karpathy loop self-improvement. Agents discover, hire, benchmark, and evolve programmatically. MPP/x402/MCP. No humans in the loop.
Evaluation system for computer-use agents that uses LLMs to assess agent performance on web browsing and interaction tasks. This judge system reads screenshots, agent trajectories, and final results to provide detailed scoring and feedback.
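The general pattern such a judge follows can be sketched as below; the rubric, model name, and `judge_episode` helper are illustrative assumptions, not this project's actual interface.

```python
# Hedged sketch of the LLM-as-judge pattern for grading an agent episode; the
# prompt, model choice, and function name are assumptions for illustration.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a computer-use agent on a web task.
Task: {task}
Agent trajectory (actions taken, in order):
{trajectory}
Final result reported by the agent: {result}

Return JSON with keys "score" (0-10) and "feedback" (one paragraph)."""

def judge_episode(task: str, trajectory: list[str], result: str) -> dict:
    prompt = JUDGE_PROMPT.format(
        task=task, trajectory="\n".join(trajectory), result=result
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                      # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for parseable JSON
    )
    return json.loads(resp.choices[0].message.content)

# Screenshots would typically be attached as image inputs alongside the text.
```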
Free, local Langfuse OSS setup with Ollama for LLM evaluation, scoring, and datasets.
A multi-agent public opinion analysis system. It scrapes Reddit & Twitter, analyzes sentiment, and uses a Critic-Synthesis debate loop to uncover what the internet actually thinks.
CRC-Screen: Certified DNA-Synthesis Hazard Screening Under Taxonomic Shift