Add benchmark budget calibration workflow by AbdelStark · Pull Request #168 · AbdelStark/worldforge

AbdelStark · 2026-05-05T14:31:40Z

Problem

Benchmark budgets need a reviewable calibration path from preserved baseline reports before maintainers change release budget files.

Closes #146.

Changes

Added worldforge.benchmark_calibration to read preserved benchmark JSON reports, record source report SHA-256 digests, capture baseline context, and generate loadable candidate budget files.
Added scripts/calibrate_benchmark_budgets.py to write budget-calibration.json, candidate-budgets.json, and a Markdown review report without modifying existing budget files.
Added regression coverage for candidate generation, review diffs, malformed inputs, script execution, and preservation of existing budget failure behavior.
Documented the threshold-loosening review rule in benchmarking, operations, playbooks, the claim evidence map, changelog, and the WF-B6 roadmap tracker.
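
The calibration flow described above (hash each preserved report, derive candidate budgets, flag loosened thresholds for review) could be sketched roughly as follows. This is an illustrative sketch only, not the code in worldforge.benchmark_calibration: the report schema (`results`, `name`, `p95_ms`), the `headroom` multiplier, and the function names are all assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def candidate_budgets(report_paths, headroom=1.25):
    """Derive candidate budgets from preserved reports.

    Assumed (hypothetical) report shape:
        {"results": [{"name": "...", "p95_ms": 12.3}, ...]}
    The candidate for each benchmark is the worst observed p95
    multiplied by `headroom`.
    """
    sources = []
    worst = {}
    for raw_path in report_paths:
        path = Path(raw_path)
        report = json.loads(path.read_text(encoding="utf-8"))
        # Record the digest so reviewers can confirm which file was used.
        sources.append({"path": str(path), "sha256": sha256_of_file(path)})
        for row in report["results"]:
            name = row["name"]
            worst[name] = max(worst.get(name, 0.0), float(row["p95_ms"]))
    budgets = {name: round(value * headroom, 3) for name, value in sorted(worst.items())}
    return {"sources": sources, "budgets": budgets}


def loosened(current, candidate):
    """Return names whose candidate threshold is higher (looser) than the current one."""
    return sorted(
        name
        for name, value in candidate.items()
        if name in current and value > current[name]
    )
```

In a sketch like this, nothing is written back to the existing budget file: the candidate is a separate artifact, and any name returned by `loosened` is what a reviewer would need to justify before the release budgets change.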

Validation

uv lock --check
uv run ruff check src tests examples scripts
uv run ruff format --check src tests examples scripts
uv run python scripts/generate_provider_docs.py --check
uv run pytest tests/test_benchmark.py tests/test_benchmark_budget_calibration.py -q (58 passed)
uv run pytest tests/test_docs_site.py tests/test_benchmark_budget_calibration.py tests/test_benchmark.py -q (73 passed)
uv run mkdocs build --strict
uv run pytest (758 passed)
uv run --extra harness pytest --cov=src/worldforge --cov-report=term-missing --cov-fail-under=90 (758 passed, 90.10% coverage)
bash scripts/test_package.sh (679 passed, 79 skipped)
uv build --out-dir dist --clear --no-build-logs

Commit acbb75f: Add benchmark budget calibration workflow

AbdelStark merged commit 23a1c4a into main May 5, 2026
8 checks passed

AbdelStark deleted the issue-146-budget-calibration branch May 5, 2026 14:33