This repository contains a small optimization harness for comparing fixed baseline runs against candidate prompt or instruction variants on the same task. The goal is not to crown a model. The goal is to make prompt changes measurable, repeatable, and tied to generated artifacts instead of anecdotes.
The workflow is built around Spec Kit skill scenarios. Each run receives the same scenario prompt and skill command, produces the artifacts required by that skill, and is scored with the same rubric.
- agent-benchmark/0-orchestration.md: benchmark operating process.
- agent-benchmark/1-recipes/agents/<agent-key>/: agent configuration and run recipes.
- agent-benchmark/1-recipes/skills/: skill installation and setup recipes.
- agent-benchmark/2-evaluation/: rubric, comparison method, and thresholds.
- agent-benchmark/3-scenarios/: benchmark scenario prompts.
- agent-benchmark/4-scripts/run-agent.sh: run materialization script.
- agent-benchmark/5-prompts/0-agents/patches/<agent-key>/: candidate prompt patches.
- agent-benchmark/6-runs/0-experiments/: evaluator-created run manifests.
- agent-benchmark/6-runs/1-baseline/<agent-key>/<run-id>/: baseline run workspaces.
- agent-benchmark/6-runs/2-candidate/<agent-key>/<run-id>/: candidate run workspaces.
- agent-benchmark/7-reports/0-baselines/<agent-key>/: frozen baseline assessments.
- agent-benchmark/7-reports/1-comparisons/<agent-key>/: candidate comparison reports.
Canonical agent keys are lowercase: vibe and claude-code.
The current harness works with Claude Code and Vibe CLI. Either integration can be used in the baseline or candidate role as long as the run produces the expected Spec Kit artifacts and benchmark traces.
Agent setup materials are stored under
agent-benchmark/1-recipes/agents/<agent-key>/. These recipes are Markdown
materials for evaluators and runners; skill installation is documented
separately under agent-benchmark/1-recipes/skills/.
Evaluator agents may instantiate experiments by producing an explicit manifest
under agent-benchmark/6-runs/0-experiments/ and passing it with
--experiment. The manifest can select the agent, role, model, system prompt,
patch, output format, and agent-supported config variables. CLI flags override
manifest values, and manifest values override recipe defaults.
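The override order described above (CLI flags, then manifest values, then recipe defaults) can be sketched as a layered lookup. This is a minimal illustration, not the harness's actual implementation: the function name `resolve_settings`, the key names, and all values below are hypothetical, and it assumes that unset options are omitted from a layer rather than set to `None`.

```python
from collections import ChainMap

# Hypothetical layers; keys loosely mirror the fields a manifest can select.
recipe_defaults = {"agent": "vibe", "role": "candidate", "output_format": "text"}
manifest = {"agent": "claude-code", "patch": "example-patch.md"}
cli_flags = {"role": "baseline"}


def resolve_settings(cli, manifest_values, defaults):
    """Merge the three layers so CLI beats manifest and manifest beats defaults."""
    # ChainMap searches its mappings left to right, so the first one wins.
    return dict(ChainMap(cli, manifest_values, defaults))


settings = resolve_settings(cli_flags, manifest, recipe_defaults)
print(settings)
```

In this sketch the agent comes from the manifest, the role from the CLI, and the output format from the recipe defaults.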
Each output is scored from 0 to 100 across skill-relative dimensions:
- workflow completion
- artifact quality
- scenario and acceptance coverage, when applicable
- scope, assumptions, and edge cases, when applicable
- measurable outcomes or validation criteria, when applicable
- validation discipline
The baseline is scored once and then frozen. Candidate runs are compared against that frozen baseline and against previous candidate runs. This avoids silently moving the target while optimizing.
The comparison step should be driven by a third evaluator agent, not by either
the baseline or candidate agent being judged. In this harness, that evaluator
uses the boost-agent-outcomes skill to identify baseline strengths, candidate
weaknesses, shared defects, prompt-fixable gaps, and stop/continue decisions.
The active benchmark state is fully reset. Previous experiment traces and results have been cleared.
There are no active baseline assessments, candidate comparisons, score-ledger records, latest-comparison snapshots, or selected best candidate patches. The next benchmark cycle should create fresh baseline runs before any candidate comparison.
- Select a scenario from agent-benchmark/3-scenarios/scenarios.md.
- Run and score the fixed baseline.
- Run the candidate with no patch.
- Have a third evaluator agent compare candidate artifacts against the frozen baseline using boost-agent-outcomes.
- If there are actionable instruction-level improvements, create the next candidate patch under agent-benchmark/5-prompts/0-agents/patches/<agent-key>/.
- Replay the candidate with that patch.
- Continue while scores do not regress and material improvements remain.
- Stop when a candidate regresses, reaches parity with only preference differences left, or has no actionable next improvement, or when a configured max iteration count is reached.
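The continue and stop conditions in the loop above can be written as a small decision function. This is a sketch under stated assumptions: the function name `next_action`, its parameters, and the returned strings are hypothetical, and the two boolean inputs stand in for judgments that the evaluator agent would make.

```python
def next_action(previous_score, new_score, only_preference_differences,
                has_actionable_improvement, iteration, max_iterations):
    """Return "continue" or a "stop: ..." reason for one candidate iteration."""
    if new_score < previous_score:
        return "stop: candidate regressed"
    if only_preference_differences:
        return "stop: parity, only preference differences left"
    if not has_actionable_improvement:
        return "stop: no actionable next improvement"
    if iteration >= max_iterations:
        return "stop: max iteration count reached"
    return "continue"


# Hypothetical scores: the candidate improved and a further patch is possible.
print(next_action(72, 78, False, True, iteration=2, max_iterations=5))
```

The order of the checks is a design choice in this sketch; regression is tested first so that a lower score always ends the cycle regardless of the other inputs.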
Candidate progress is evaluated with a log-progress metric:
ln((100 - previous_score) / (100 - new_score))
This gives more credit to improvements near the top of the scale, where each additional point is harder to earn.
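A short worked example makes the effect concrete. The function below implements the formula as written; the name `log_progress` and the input check are assumptions added for illustration. The formula is undefined when a score equals 100, so this sketch rejects that case instead of guessing how the harness handles it.

```python
import math


def log_progress(previous_score, new_score):
    """Compute ln((100 - previous_score) / (100 - new_score))."""
    for score in (previous_score, new_score):
        if not 0 <= score < 100:
            # A score of 100 would put zero inside the ratio.
            raise ValueError("scores must be in the range [0, 100)")
    return math.log((100 - previous_score) / (100 - new_score))


# Halving the remaining gap earns the same credit anywhere on the scale.
print(log_progress(50, 75))  # 25 points, gap 50 -> 25
print(log_progress(90, 95))  # 5 points, gap 10 -> 5
# The same 5-point gain earns far less credit lower on the scale.
print(log_progress(50, 55))
# A regression yields a negative value.
print(log_progress(80, 70))
```

Moving from 50 to 75 and from 90 to 95 both yield ln 2 (about 0.69), while moving from 50 to 55 yields about 0.11, so the metric rewards closing a fraction of the remaining gap rather than a raw point count.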
Reusable harness files should be committed:
- agent-benchmark/0-orchestration.md
- agent-benchmark/1-recipes/
- agent-benchmark/2-evaluation/
- agent-benchmark/3-scenarios/
- agent-benchmark/4-scripts/
- agent-benchmark/5-prompts/
- agent-benchmark/6-runs/0-experiments/README.md
- agent-benchmark/7-reports/
Generated run outputs should not be published. The repo .gitignore excludes
run workspaces, generated experiment manifests, generated candidate comparison
outputs, and generated prompt patches by default.