LLM evaluation framework with built-in statistical rigor: bootstrap confidence intervals, Krippendorff's α, length debiasing, and saturation curves. Native evaluation of Claude Code skills, RAG pipelines, agents, and prompts.
cli benchmark ai eval evaluation-framework claude knowledge-engineering skill-evaluation llm prompt-engineering prompt-testing rag-evaluation llm-judge claude-code bootstrap-ci agent-eval krippendorff-alpha evaluation-as-code
Updated May 2, 2026 · TypeScript
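The description leads with bootstrap confidence intervals, the standard way to attach uncertainty to an aggregate eval score without distributional assumptions. A minimal sketch of the percentile bootstrap over per-item scores, assuming nothing about this repo's actual API (the function names here are illustrative only):

```typescript
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

/**
 * Percentile bootstrap CI for the mean of per-item eval scores.
 * Resamples the score vector with replacement, recomputes the mean
 * each time, and reads the CI off the sorted resampled means.
 */
function bootstrapCI(
  scores: number[],
  level = 0.95,
  resamples = 2000,
  rng: () => number = Math.random,
): { lo: number; hi: number } {
  const stats: number[] = [];
  for (let i = 0; i < resamples; i++) {
    // Draw a resample of the same size, with replacement.
    const sample = Array.from(
      { length: scores.length },
      () => scores[Math.floor(rng() * scores.length)],
    );
    stats.push(mean(sample));
  }
  stats.sort((a, b) => a - b);
  const alpha = (1 - level) / 2;
  return {
    lo: stats[Math.floor(alpha * resamples)],
    hi: stats[Math.ceil((1 - alpha) * resamples) - 1],
  };
}

// Example: per-item pass/quality scores from one eval run.
const scores = [0.7, 0.9, 0.8, 1.0, 0.6, 0.85, 0.75, 0.95];
const ci = bootstrapCI(scores);
console.log(`mean=${mean(scores).toFixed(3)} 95% CI=[${ci.lo.toFixed(3)}, ${ci.hi.toFixed(3)}]`);
```

The percentile method shown here is the simplest variant; bias-corrected (BCa) intervals are a common refinement when eval sets are small.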