Add cross-provider run comparisons #170

Merged
AbdelStark merged 1 commit into main from issue-150-cross-provider-comparison on May 5, 2026
@AbdelStark
Owner

Problem

Preserved eval and benchmark runs could only be compared through a shallow row model. This made cross-provider deltas ambiguous: a delta could reflect a real performance difference or simply a capability, fixture, budget, or suite mismatch, and the report could not tell the two apart.

Solution

  • Add a shared comparison context for preserved run reports covering providers, capabilities, operations, fixture digests, budget refs, suite versions, missing evidence, skip reasons, event counts, and claim boundaries.
  • Refuse incompatible comparisons with explicit mismatch details while allowing compatible cross-provider rows.
  • Extend JSON, Markdown, and CSV comparison exports with the new context and budget status fields.
  • Add direct report-model tests plus public docs, changelog, and roadmap updates.
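
For readers skimming the PR, here is a minimal sketch of the compatibility check described above. It is not the code merged in this PR: `ComparisonContext`, `MUST_MATCH`, `find_mismatches`, `ensure_comparable`, and `IncompatibleRunsError` are hypothetical names, only a subset of the context fields is modeled, and the choice of which fields must match is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComparisonContext:
    """Hypothetical per-run context; field names are illustrative only."""

    provider: str
    capability: str
    operation: str
    fixture_digest: str
    budget_ref: str
    suite_version: str


# Assumption: these fields must agree before two runs are comparable.
# The provider is deliberately excluded so cross-provider rows are allowed.
MUST_MATCH = (
    "capability",
    "operation",
    "fixture_digest",
    "budget_ref",
    "suite_version",
)


class IncompatibleRunsError(ValueError):
    """Raised when two runs differ in a field that must match."""

    def __init__(self, mismatches):
        self.mismatches = mismatches
        detail = "; ".join(
            f"{name}: {left!r} != {right!r}"
            for name, (left, right) in sorted(mismatches.items())
        )
        super().__init__(f"runs are not comparable ({detail})")


def find_mismatches(left, right):
    """Return {field: (left_value, right_value)} for every differing field."""
    return {
        name: (getattr(left, name), getattr(right, name))
        for name in MUST_MATCH
        if getattr(left, name) != getattr(right, name)
    }


def ensure_comparable(left, right):
    """Refuse the comparison with explicit mismatch details, else return None."""
    mismatches = find_mismatches(left, right)
    if mismatches:
        raise IncompatibleRunsError(mismatches)
```

In this sketch, two runs from different providers pass the check as long as the capability, operation, fixture digest, budget ref, and suite version agree. Any difference in those fields raises an error that names each mismatched field along with both values.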

Closes #150

Validation

  • uv lock --check
  • uv run ruff check src tests examples scripts
  • uv run ruff format --check src tests examples scripts
  • uv run python scripts/generate_provider_docs.py --check
  • uv run pytest tests/test_harness_report_compare.py tests/test_benchmark.py tests/test_evaluation_suites.py
  • uv run pytest tests/test_harness_report_compare.py tests/test_harness_workspace.py tests/test_docs_site.py -q
  • uv run mkdocs build --strict
  • uv run pytest
  • uv run --extra harness pytest --cov=src/worldforge --cov-report=term-missing --cov-fail-under=90
  • bash scripts/test_package.sh
  • uv build --out-dir dist --clear --no-build-logs
  • git diff --check

@AbdelStark merged commit da05aac into main on May 5, 2026
8 checks passed
@AbdelStark deleted the issue-150-cross-provider-comparison branch on May 5, 2026 15:06

Development

Successfully merging this pull request may close these issues.

WF-B8: Add cross-provider comparison reports
