Commit 7d4dd14

Merge pull request #169 from AbdelStark/issue-147-failure-case-galleries
Add evaluation failure galleries

2 parents: 23a1c4a + 79fffbe

12 files changed: 671 additions, 7 deletions

CHANGELOG.md

4 additions, 0 deletions

@@ -9,6 +9,10 @@ releases may still include breaking changes when the public API needs to tighten

 ### Added

+- Added sanitized evaluation failure galleries. Failed evaluation reports now expose
+  representative fixture-level cases with expected contract notes, observed summaries, metrics
+  previews, and triage steps; `report.artifacts()` also exports `failure_gallery.json` and
+  `failure_gallery.md` for issue attachments.
 - Added benchmark budget calibration artifacts generated from preserved benchmark JSON reports.
   `scripts/calibrate_benchmark_budgets.py` writes a loadable `candidate-budgets.json`, a full
   `budget-calibration.json`, and a human-review Markdown report with source report digests,

docs/src/api/python.md

7 additions, 0 deletions

@@ -336,8 +336,15 @@ suite = EvaluationSuite.from_builtin("reasoning")
 report = suite.run_report(["mock"], forge=forge)
 print(report.results[0].passed)
 print(report.to_markdown())
+
+gallery = report.failure_gallery()
+print(gallery.to_json())
 ```

+Failed reports expose representative, sanitized gallery cases through `failure_gallery()` and
+through `report.artifacts()["failure_gallery.json"]` / `["failure_gallery.md"]`. The gallery is
+for deterministic contract triage; it does not rank providers or claim physical fidelity.
+
 ## Benchmarking

 ```python
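The docs above say the gallery artifacts are intended as issue attachments. A minimal sketch of writing such a mapping to disk, using a plain dict in place of a real `report.artifacts()` result (the string-valued mapping and file contents here are assumptions for illustration, not worldforge's confirmed API):

```python
import tempfile
from pathlib import Path

# Stand-in for report.artifacts(): assumed here to map artifact
# names to already-serialized string content.
artifacts = {
    "failure_gallery.json": '{"cases": []}',
    "failure_gallery.md": "# Failure gallery\n",
}

out_dir = Path(tempfile.mkdtemp())  # destination for issue attachments
for name, content in artifacts.items():
    # Write each gallery artifact under its exported name.
    (out_dir / name).write_text(content)

print(sorted(p.name for p in out_dir.iterdir()))
# → ['failure_gallery.json', 'failure_gallery.md']
```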

docs/src/claim-evidence-map.md

1 addition, 0 deletions

@@ -49,6 +49,7 @@ fidelity claims.
 | Public claim | Evidence class | Evidence | Command or artifact | Boundary |
 | --- | --- | --- | --- | --- |
 | Evaluation reports carry provenance and claim boundaries. | `checkout-tested`; `release-gated` | `tests/test_provenance.py`, `tests/test_evaluation_and_planning.py`, `docs/src/evaluation.md` | `uv run worldforge eval --suite planning --provider mock --format json` | Scores are deterministic contract signals, not physical or media-quality metrics. |
+| Failed evaluation reports include issue-ready failure galleries. | `checkout-tested`; `release-gated` | `tests/test_evaluation_failure_gallery.py`, `docs/src/evaluation.md`, `docs/src/api/python.md` | `uv run worldforge eval --suite planning --provider mock --format json` plus `report.artifacts()["failure_gallery.json"]` | Galleries are sanitized deterministic contract triage, not provider ranking or fidelity evidence. |
 | Benchmark reports carry provenance, budgets, and preset gates. | `checkout-tested`; `release-gated` | `tests/test_benchmark.py`, `tests/test_benchmark_presets.py`, `docs/src/benchmarking.md` | `uv run worldforge benchmark --preset release-evidence --format json --run-workspace .worldforge` | Timings are process-local adapter-path measurements, not machine-independent performance claims. |
 | Benchmark budget changes have a preserved baseline review path. | `checkout-tested`; `release-gated` | `tests/test_benchmark_budget_calibration.py`, `scripts/calibrate_benchmark_budgets.py`, `docs/src/benchmarking.md` | `uv run python scripts/calibrate_benchmark_budgets.py --report .worldforge/reports/benchmark-<timestamp>-<run-id>.json --current-budget src/worldforge/benchmark_presets/_data/budget-release-evidence.json` | Candidate budgets are review artifacts; they do not automatically weaken release gates. |
 | Evaluation evidence bundles package preserved runs. | `checkout-tested`; `release-gated` | `tests/test_evidence_bundle.py`, `scripts/generate_evidence_bundle.py`, `docs/src/evaluation.md` | `uv run python scripts/generate_evidence_bundle.py --workspace-dir .worldforge` | Unsafe, local-only, signed, or binary artifacts are excluded or marked; the bundle does not upload anything. |

docs/src/evaluation.md

20 additions, 2 deletions

@@ -41,6 +41,21 @@ uv run worldforge eval --suite planning --provider mock --run-workspace .worldfo
 The run workspace stores `run_manifest.json`, JSON/Markdown/CSV reports, and a result summary under
 `.worldforge/runs/<run-id>/`.

+When a suite has failures, the JSON and Markdown reports include a compact failure gallery. The
+gallery is also exported through `report.artifacts()` as `failure_gallery.json` and
+`failure_gallery.md` for issue attachments:
+
+```python
+gallery = report.failure_gallery()
+print(gallery.to_markdown())
+```
+
+Each gallery case records a fixture id such as `evaluation:planning:object-relocation`, the
+provider, scenario, score, expected contract note, observed summary, small metrics preview, and
+triage steps. Metric previews are sanitized: secret-shaped values are redacted, signed URL query
+strings are stripped, host-local paths are replaced, and tensor-like arrays are summarized instead
+of copied raw.
+
 To package one or more preserved evaluation or benchmark runs for issue triage or release review,
 generate a checkout-safe evidence bundle:

@@ -66,13 +81,16 @@ silently skipping the suite.

 - Markdown: provenance section, provider summary table, scenario-level detail table
 - JSON: `suite_id`, `suite`, `claim_boundary`, `metric_semantics`, `provider_summaries`,
-  scenario `results`, and a `provenance` envelope
+  scenario `results`, a `provenance` envelope, and `failure_gallery` when failed scenarios exist
 - CSV: one row per provider/scenario pair with serialized metrics payloads (envelope omitted
   to keep the table import-compatible with prior releases)
+- Failure gallery JSON/Markdown: representative failed cases with fixture ids, expected contract
+  notes, sanitized metrics previews, and first triage steps

 Every JSON and Markdown report carries an explicit claim boundary. Built-in suites are
 deterministic adapter contract checks; their scores are not physical-fidelity, media-quality,
-safety, or real robot performance claims.
+safety, or real robot performance claims. Failure galleries follow the same boundary: they are for
+issue triage and provider-review debugging, not provider quality ranking.

 ## Provenance envelope

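The sanitization rules the evaluation docs describe (redact secret-shaped values, strip signed URL query strings, replace host-local paths, summarize tensor-like arrays) can be sketched as a standalone function. Everything here, `sanitize_preview`, the regex, and the thresholds, is a hypothetical illustration of those rules, not worldforge's implementation:

```python
import re
from urllib.parse import urlsplit, urlunsplit

SECRET_KEY_RE = re.compile(r"token|secret|key|password", re.IGNORECASE)

def sanitize_preview(metrics: dict) -> dict:
    """Hypothetical sketch of the sanitization rules described above."""
    clean = {}
    for key, value in metrics.items():
        if SECRET_KEY_RE.search(key):
            # Secret-shaped keys: redact the value entirely.
            clean[key] = "[redacted]"
        elif isinstance(value, str) and value.startswith(("http://", "https://")):
            # Signed URLs: drop the query string and fragment.
            parts = urlsplit(value)
            clean[key] = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        elif isinstance(value, str) and value.startswith(("/home/", "/Users/", "C:\\")):
            # Host-local paths: replace with a neutral placeholder.
            clean[key] = "<host-local-path>"
        elif isinstance(value, list) and len(value) > 4:
            # Tensor-like arrays: summarize instead of copying raw.
            clean[key] = {"len": len(value), "min": min(value), "max": max(value)}
        else:
            clean[key] = value
    return clean

print(sanitize_preview({
    "api_token": "abc123",
    "asset_url": "https://cdn.example/scene.glb?sig=XYZ",
    "score": 0.42,
}))
# → {'api_token': '[redacted]', 'asset_url': 'https://cdn.example/scene.glb', 'score': 0.42}
```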
docs/src/roadmap-continuation.md

4 additions, 4 deletions

@@ -599,11 +599,11 @@ Out of scope:

 Acceptance criteria:

-- [ ] Failed evaluation reports include representative cases with fixture IDs and expected
+- [x] Failed evaluation reports include representative cases with fixture IDs and expected
   contract notes.
-- [ ] Galleries are exported as JSON and Markdown.
-- [ ] Reports avoid raw secrets, signed URLs, raw tensors, and host-local paths.
-- [ ] Docs explain that galleries are deterministic contract triage, not fidelity claims.
+- [x] Galleries are exported as JSON and Markdown.
+- [x] Reports avoid raw secrets, signed URLs, raw tensors, and host-local paths.
+- [x] Docs explain that galleries are deterministic contract triage, not fidelity claims.

 Validation:

src/worldforge/__init__.py

2 additions, 0 deletions

@@ -44,6 +44,8 @@
     "EvalResult": "worldforge.evaluation",
     "EvalScenario": "worldforge.evaluation",
     "EvalSuite": "worldforge.evaluation",
+    "EvaluationFailureCase": "worldforge.evaluation",
+    "EvaluationFailureGallery": "worldforge.evaluation",
     "EvaluationReport": "worldforge.evaluation",
     "EvaluationResult": "worldforge.evaluation",
     "EvaluationScenario": "worldforge.evaluation",

src/worldforge/evaluation/__init__.py

4 additions, 0 deletions

@@ -18,6 +18,8 @@
     EvalResult,
     EvalScenario,
     EvalSuite,
+    EvaluationFailureCase,
+    EvaluationFailureGallery,
     EvaluationReport,
     EvaluationResult,
     EvaluationScenario,

@@ -40,6 +42,8 @@
     "EvalResult",
     "EvalScenario",
     "EvalSuite",
+    "EvaluationFailureCase",
+    "EvaluationFailureGallery",
     "EvaluationReport",
     "EvaluationResult",
     "EvaluationScenario",
