Commit 7d4dd14

Merge pull request #169 from AbdelStark/issue-147-failure-case-galleries
Add evaluation failure galleries

2 parents: 23a1c4a + 79fffbe

12 files changed: 671 additions, 7 deletions

CHANGELOG.md

4 additions, 0 deletions

@@ -9,6 +9,10 @@ releases may still include breaking changes when the public API needs to tighten

 ### Added

+- Added sanitized evaluation failure galleries. Failed evaluation reports now expose
+  representative fixture-level cases with expected contract notes, observed summaries, metrics
+  previews, and triage steps; `report.artifacts()` also exports `failure_gallery.json` and
+  `failure_gallery.md` for issue attachments.
 - Added benchmark budget calibration artifacts generated from preserved benchmark JSON reports.
   `scripts/calibrate_benchmark_budgets.py` writes a loadable `candidate-budgets.json`, a full
   `budget-calibration.json`, and a human-review Markdown report with source report digests,

docs/src/api/python.md

7 additions, 0 deletions

@@ -336,8 +336,15 @@ suite = EvaluationSuite.from_builtin("reasoning")
 report = suite.run_report(["mock"], forge=forge)
 print(report.results[0].passed)
 print(report.to_markdown())
+
+gallery = report.failure_gallery()
+print(gallery.to_json())
 ```

+Failed reports expose representative, sanitized gallery cases through `failure_gallery()` and
+through `report.artifacts()["failure_gallery.json"]` / `["failure_gallery.md"]`. The gallery is
+for deterministic contract triage; it does not rank providers or claim physical fidelity.
+
 ## Benchmarking

 ```python
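The docs above say the gallery artifacts are intended as issue attachments. A minimal sketch of writing such a mapping to disk, using a plain dict in place of a real `report.artifacts()` result (the string-valued mapping and file contents here are assumptions for illustration, not worldforge's confirmed API):

```python
import tempfile
from pathlib import Path

# Stand-in for report.artifacts(): assumed here to map artifact
# names to already-serialized string content.
artifacts = {
    "failure_gallery.json": '{"cases": []}',
    "failure_gallery.md": "# Failure gallery\n",
}

out_dir = Path(tempfile.mkdtemp())  # destination for issue attachments
for name, content in artifacts.items():
    # Write each gallery artifact under its exported name.
    (out_dir / name).write_text(content)

print(sorted(p.name for p in out_dir.iterdir()))
# → ['failure_gallery.json', 'failure_gallery.md']
```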

docs/src/claim-evidence-map.md

1 addition, 0 deletions

@@ -49,6 +49,7 @@ fidelity claims.
 | Public claim | Evidence class | Evidence | Command or artifact | Boundary |
 | --- | --- | --- | --- | --- |
 | Evaluation reports carry provenance and claim boundaries. | `checkout-tested`; `release-gated` | `tests/test_provenance.py`, `tests/test_evaluation_and_planning.py`, `docs/src/evaluation.md` | `uv run worldforge eval --suite planning --provider mock --format json` | Scores are deterministic contract signals, not physical or media-quality metrics. |
+| Failed evaluation reports include issue-ready failure galleries. | `checkout-tested`; `release-gated` | `tests/test_evaluation_failure_gallery.py`, `docs/src/evaluation.md`, `docs/src/api/python.md` | `uv run worldforge eval --suite planning --provider mock --format json` plus `report.artifacts()["failure_gallery.json"]` | Galleries are sanitized deterministic contract triage, not provider ranking or fidelity evidence. |
 | Benchmark reports carry provenance, budgets, and preset gates. | `checkout-tested`; `release-gated` | `tests/test_benchmark.py`, `tests/test_benchmark_presets.py`, `docs/src/benchmarking.md` | `uv run worldforge benchmark --preset release-evidence --format json --run-workspace .worldforge` | Timings are process-local adapter-path measurements, not machine-independent performance claims. |
 | Benchmark budget changes have a preserved baseline review path. | `checkout-tested`; `release-gated` | `tests/test_benchmark_budget_calibration.py`, `scripts/calibrate_benchmark_budgets.py`, `docs/src/benchmarking.md` | `uv run python scripts/calibrate_benchmark_budgets.py --report .worldforge/reports/benchmark-<timestamp>-<run-id>.json --current-budget src/worldforge/benchmark_presets/_data/budget-release-evidence.json` | Candidate budgets are review artifacts; they do not automatically weaken release gates. |
 | Evaluation evidence bundles package preserved runs. | `checkout-tested`; `release-gated` | `tests/test_evidence_bundle.py`, `scripts/generate_evidence_bundle.py`, `docs/src/evaluation.md` | `uv run python scripts/generate_evidence_bundle.py --workspace-dir .worldforge` | Unsafe, local-only, signed, or binary artifacts are excluded or marked; the bundle does not upload anything. |

docs/src/evaluation.md

20 additions, 2 deletions

@@ -41,6 +41,21 @@ uv run worldforge eval --suite planning --provider mock --run-workspace .worldfo
 The run workspace stores `run_manifest.json`, JSON/Markdown/CSV reports, and a result summary under
 `.worldforge/runs/<run-id>/`.

+When a suite has failures, the JSON and Markdown reports include a compact failure gallery. The
+gallery is also exported through `report.artifacts()` as `failure_gallery.json` and
+`failure_gallery.md` for issue attachments:
+
+```python
+gallery = report.failure_gallery()
+print(gallery.to_markdown())
+```
+
+Each gallery case records a fixture id such as `evaluation:planning:object-relocation`, the
+provider, scenario, score, expected contract note, observed summary, small metrics preview, and
+triage steps. Metric previews are sanitized: secret-shaped values are redacted, signed URL query
+strings are stripped, host-local paths are replaced, and tensor-like arrays are summarized instead
+of copied raw.
+
 To package one or more preserved evaluation or benchmark runs for issue triage or release review,
 generate a checkout-safe evidence bundle:

@@ -66,13 +81,16 @@ silently skipping the suite.

 - Markdown: provenance section, provider summary table, scenario-level detail table
 - JSON: `suite_id`, `suite`, `claim_boundary`, `metric_semantics`, `provider_summaries`,
-  scenario `results`, and a `provenance` envelope
+  scenario `results`, a `provenance` envelope, and `failure_gallery` when failed scenarios exist
 - CSV: one row per provider/scenario pair with serialized metrics payloads (envelope omitted
   to keep the table import-compatible with prior releases)
+- Failure gallery JSON/Markdown: representative failed cases with fixture ids, expected contract
+  notes, sanitized metrics previews, and first triage steps

 Every JSON and Markdown report carries an explicit claim boundary. Built-in suites are
 deterministic adapter contract checks; their scores are not physical-fidelity, media-quality,
-safety, or real robot performance claims.
+safety, or real robot performance claims. Failure galleries follow the same boundary: they are for
+issue triage and provider-review debugging, not provider quality ranking.

 ## Provenance envelope

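The sanitization rules the evaluation docs describe (redact secret-shaped values, strip signed URL query strings, replace host-local paths, summarize tensor-like arrays) can be sketched as a standalone function. Everything here, `sanitize_preview`, the regex, and the thresholds, is a hypothetical illustration of those rules, not worldforge's implementation:

```python
import re
from urllib.parse import urlsplit, urlunsplit

SECRET_KEY_RE = re.compile(r"token|secret|key|password", re.IGNORECASE)

def sanitize_preview(metrics: dict) -> dict:
    """Hypothetical sketch of the sanitization rules described above."""
    clean = {}
    for key, value in metrics.items():
        if SECRET_KEY_RE.search(key):
            # Secret-shaped keys: redact the value entirely.
            clean[key] = "[redacted]"
        elif isinstance(value, str) and value.startswith(("http://", "https://")):
            # Signed URLs: drop the query string and fragment.
            parts = urlsplit(value)
            clean[key] = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        elif isinstance(value, str) and value.startswith(("/home/", "/Users/", "C:\\")):
            # Host-local paths: replace with a neutral placeholder.
            clean[key] = "<host-local-path>"
        elif isinstance(value, list) and len(value) > 4:
            # Tensor-like arrays: summarize instead of copying raw.
            clean[key] = {"len": len(value), "min": min(value), "max": max(value)}
        else:
            clean[key] = value
    return clean

print(sanitize_preview({
    "api_token": "abc123",
    "asset_url": "https://cdn.example/scene.glb?sig=XYZ",
    "score": 0.42,
}))
# → {'api_token': '[redacted]', 'asset_url': 'https://cdn.example/scene.glb', 'score': 0.42}
```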
docs/src/roadmap-continuation.md

4 additions, 4 deletions

@@ -599,11 +599,11 @@ Out of scope:

 Acceptance criteria:

-- [ ] Failed evaluation reports include representative cases with fixture IDs and expected
+- [x] Failed evaluation reports include representative cases with fixture IDs and expected
   contract notes.
-- [ ] Galleries are exported as JSON and Markdown.
-- [ ] Reports avoid raw secrets, signed URLs, raw tensors, and host-local paths.
-- [ ] Docs explain that galleries are deterministic contract triage, not fidelity claims.
+- [x] Galleries are exported as JSON and Markdown.
+- [x] Reports avoid raw secrets, signed URLs, raw tensors, and host-local paths.
+- [x] Docs explain that galleries are deterministic contract triage, not fidelity claims.

 Validation:

src/worldforge/__init__.py

2 additions, 0 deletions

@@ -44,6 +44,8 @@
     "EvalResult": "worldforge.evaluation",
     "EvalScenario": "worldforge.evaluation",
     "EvalSuite": "worldforge.evaluation",
+    "EvaluationFailureCase": "worldforge.evaluation",
+    "EvaluationFailureGallery": "worldforge.evaluation",
     "EvaluationReport": "worldforge.evaluation",
     "EvaluationResult": "worldforge.evaluation",
     "EvaluationScenario": "worldforge.evaluation",

src/worldforge/evaluation/__init__.py

4 additions, 0 deletions

@@ -18,6 +18,8 @@
     EvalResult,
     EvalScenario,
     EvalSuite,
+    EvaluationFailureCase,
+    EvaluationFailureGallery,
     EvaluationReport,
     EvaluationResult,
     EvaluationScenario,

@@ -40,6 +42,8 @@
     "EvalResult",
     "EvalScenario",
     "EvalSuite",
+    "EvaluationFailureCase",
+    "EvaluationFailureGallery",
     "EvaluationReport",
     "EvaluationResult",
     "EvaluationScenario",
