Releases: aerosta/rewardhackwatch
v1.3.0 - Eval Workbench, LLM Judge, React Frontend Redesign
Eval Workbench (NEW)
- JSONL import with drag-drop and auto schema detection
- 8 built-in rule templates (rogue code, mock exploit, safety, deceptive CoT)
- Custom rule builder: regex, keyword, length, LLM judge
- Batch scoring with weighted A-F grades
- Side-by-side model comparison for A/B evaluation
- JSON/CSV export
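The rule builder and weighted grading described above can be pictured roughly as follows. This is a minimal sketch, not the Eval Workbench's actual code: the `Rule` class, the helper names, the `response` field, and the grade cutoffs (0.9/0.8/0.7/0.6) are all assumptions made for illustration.

```python
import json
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Rule:
    """Hypothetical rule: a name, a weight, and a pass/fail check on text."""
    name: str
    weight: float
    check: Callable[[str], bool]  # True means the text passes the rule


def regex_rule(name: str, pattern: str, weight: float = 1.0) -> Rule:
    # Passes when the suspicious pattern does NOT occur in the text.
    compiled = re.compile(pattern)
    return Rule(name, weight, lambda text: compiled.search(text) is None)


def keyword_rule(name: str, keywords: Iterable[str], weight: float = 1.0) -> Rule:
    # Passes when none of the keywords occur (case-insensitive).
    lowered = [k.lower() for k in keywords]
    return Rule(name, weight, lambda text: not any(k in text.lower() for k in lowered))


def length_rule(name: str, max_chars: int, weight: float = 1.0) -> Rule:
    # Passes when the text is at most max_chars long.
    return Rule(name, weight, lambda text: len(text) <= max_chars)


def weighted_score(text: str, rules: List[Rule]) -> float:
    # Fraction of total rule weight that the text passes.
    total = sum(r.weight for r in rules)
    passed = sum(r.weight for r in rules if r.check(text))
    return passed / total if total else 0.0


def letter_grade(score: float) -> str:
    # Assumed cutoffs; the real grade boundaries are not stated in the notes.
    for cutoff, grade in ((0.9, "A"), (0.8, "B"), (0.7, "C"), (0.6, "D")):
        if score >= cutoff:
            return grade
    return "F"


def grade_jsonl(lines: Iterable[str], rules: List[Rule], field: str = "response") -> List[str]:
    # Batch-grade JSONL records, one JSON object per non-empty line.
    grades = []
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        grades.append(letter_grade(weighted_score(record[field], rules)))
    return grades
```

Giving the regex rule a higher weight than the others means a single match on a known exploit pattern pulls the grade down further than a keyword or length failure would.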
LLM Judge Integration (NEW)
- Dual provider: Claude (Opus 4.6, Sonnet 4.5, Haiku 4.5) + OpenAI (GPT-5.2, GPT-4o)
- Independent review provider for cross-validation
- Graceful degradation when API keys are not configured
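One way to picture the dual-provider setup with graceful degradation is sketched below. It assumes the keys are read from `ANTHROPIC_API_KEY` and `OPENAI_API_KEY` (the conventional variable names for those SDKs); the function names and the returned dictionary shape are hypothetical, not the project's API.

```python
import os
from typing import Dict, List, Mapping, Optional

# Assumed environment variable names; the release notes do not say which ones are read.
PROVIDER_ENV_VARS = {"claude": "ANTHROPIC_API_KEY", "openai": "OPENAI_API_KEY"}


def available_providers(env: Optional[Mapping[str, str]] = None) -> List[str]:
    # A provider counts as available only if its key is set and non-empty.
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_ENV_VARS.items() if env.get(var)]


def pick_judges(env: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    # With two keys: one primary judge plus an independent reviewer.
    # With one key: a primary judge only. With none: LLM judging is skipped.
    providers = available_providers(env)
    primary = providers[0] if providers else None
    reviewer = providers[1] if len(providers) > 1 else None
    return {"primary": primary, "reviewer": reviewer}
```

A caller can treat `primary is None` as the signal to fall back to rule-based scoring only, rather than raising an error.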
Frontend Redesign
- React frontend with 9 pages
- Dark theme with Apple-native color scheme
- Space Grotesk typography
- Dashboard: 3-column chart grid, Recent Alerts section
- CoT Viewer: per-step deception scoring with annotations
- Session Logs: filtering, search, export
- Settings: full-width dual LLM config, monitoring, notifications
Quality
- 388 tests passing
- TypeScript clean, 0 build errors
- Category label fixes (Deceptive CoT, abbreviated chart labels)
- Graceful error handling for missing backend/API keys
v1.2.0 - Mock Exploit Fix, HackBench, Temporal Detection
Mock exploit detection improved from 0% to 98.5% F1
- RewardHackDetector API + Quick Start fix
- 1,200 synthetic trajectories for mock exploit coverage
- HackBench benchmark dataset (4,300+ trajectories)
- Temporal trajectory modeling pipeline
- Causal RMGI with Granger causality testing
- Cross-model transfer study framework
- Evasion robustness testing (5 attack types)
- Dynamic threshold calibration API
- Dark-mode dashboard redesign
- React frontend (9 pages)
- Eval framework (JSONL loader, rubric scoring, batch analysis)
- License changed to Apache 2.0
- 388 tests passing, Python 3.9+ compatible
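The notes do not describe how the dynamic threshold calibration API works. As a generic illustration of the idea, the sketch below picks the detection threshold that maximizes F1 on a small labeled set; the function names and the choice of F1 as the objective are assumptions, not the project's implementation.

```python
from typing import Sequence, Tuple


def f1_at(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    # A trajectory is flagged as a hack when its score is >= threshold.
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def calibrate_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    # Try each observed score as a candidate threshold and keep the best F1.
    # Ties are resolved in favor of the lowest threshold.
    if not scores:
        raise ValueError("need at least one labeled score to calibrate")
    best_t, best_f1 = 0.0, -1.0
    for t in sorted(set(scores)):
        f1 = f1_at(scores, labels, t)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

Re-running the calibration whenever new labeled trajectories arrive is what would make the threshold "dynamic" in this sketch.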
v1.0.0 - Initial Release
RewardHackWatch v1.0.0 - Runtime detector for reward hacking and misalignment in LLM agents
Highlights
- 89.7% F1 on 5,391 MALT trajectories
- Novel RMGI metric for hack→misalignment tracking
- Full stack: FastAPI REST API, Streamlit dashboard, CLI
Getting started
pip install "git+https://github.com/aerosta/rewardhackwatch.git"