✨ feat: allow different judge models for same judge type and show stats in dashboard #420
Open
Marco Russo (marcorusso97) wants to merge 1 commit into
Summary
This PR introduces full support for running multiple judges of the same type with different models, and ensures their outputs are correctly tracked, aggregated, and rendered across the dashboard.
It also fixes consistency issues between summary panels and expanded detail views, so judge counts, names, metrics, and verdicts stay aligned.
Why
When two or more judges shared the same type, judge vote keys could collide and overwrite each other.
This caused missing judges, incorrect counts, incomplete strictness/ASR values, and absent verdict blocks in detail cards.
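The collision described above can be illustrated with a minimal sketch. It is not the code from this PR: the judge dictionaries, the `type:model` key format, and both function names are hypothetical and chosen only to show why keying votes by judge type alone loses data, and how including the model in the key could avoid that.

```python
from collections import Counter


def votes_keyed_by_type(judges):
    """Key each vote by judge type only; a later judge of the same type overwrites the earlier one."""
    return {j["type"]: j["verdict"] for j in judges}


def votes_keyed_by_type_and_model(judges):
    """Key each vote by type and model (hypothetical format), adding an index for exact duplicates."""
    votes = {}
    seen = Counter()
    for j in judges:
        base = f'{j["type"]}:{j["model"]}'
        seen[base] += 1
        # Only append a suffix when the same type/model pair appears more than once.
        key = base if seen[base] == 1 else f"{base}#{seen[base]}"
        votes[key] = j["verdict"]
    return votes


# Hypothetical configuration: two judges share a type but use different models.
judges = [
    {"type": "llm_judge", "model": "model-a", "verdict": "unsafe"},
    {"type": "llm_judge", "model": "model-b", "verdict": "safe"},
    {"type": "keyword", "model": "n/a", "verdict": "safe"},
]

collided = votes_keyed_by_type(judges)
stable = votes_keyed_by_type_and_model(judges)

print(len(collided), len(stable))
```

With type-only keys, the second `llm_judge` vote replaces the first, so any judge count or metric derived from the dictionary sees two judges instead of three. With the composite key, all three votes survive.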
What Changed
Multi-judge key stability
Evaluation and metrics pipeline
Dashboard enrichment and rendering
Attack card updates
Tests
Impact