Skip to content

Releases: aerosta/rewardhackwatch

v1.3.0 - Eval Workbench, LLM Judge, React Frontend Redesign

01 Mar 15:21

Choose a tag to compare

Eval Workbench (NEW)

  • JSONL import with drag-drop and auto schema detection
  • 8 built-in rule templates (rogue code, mock exploit, safety, deceptive CoT)
  • Custom rule builder: regex, keyword, length, LLM judge
  • Batch scoring with weighted A-F grades
  • Side-by-side model comparison for A/B evaluation
  • JSON/CSV export

LLM Judge Integration (NEW)

  • Dual provider: Claude (Opus 4.6, Sonnet 4.5, Haiku 4.5) + OpenAI (GPT-5.2, GPT-4o)
  • Independent review provider for cross-validation
  • Graceful degradation when API keys not configured

Frontend Redesign

  • React frontend with 9 pages
  • Dark theme with Apple-native color scheme
  • Space Grotesk typography
  • Dashboard: 3-column chart grid, Recent Alerts section
  • CoT Viewer: per-step deception scoring with annotations
  • Session Logs: filtering, search, export
  • Settings: full-width dual LLM config, monitoring, notifications

Quality

  • 388 tests passing
  • TypeScript clean, 0 build errors
  • Category label fixes (Deceptive CoT, abbreviated chart labels)
  • Graceful error handling for missing backend/API keys

v1.2.0 - Mock Exploit Fix, HackBench, Temporal Detection

01 Mar 11:59

Choose a tag to compare

Mock exploit detection improved: 0% to 98.5% F1

  • RewardHackDetector API + Quick Start fix
  • 1,200 synthetic trajectories for mock exploit coverage
  • HackBench benchmark dataset (4,300+ trajectories)
  • Temporal trajectory modeling pipeline
  • Causal RMGI with Granger causality testing
  • Cross-model transfer study framework
  • Evasion robustness testing (5 attack types)
  • Dynamic threshold calibration API
  • Dark-mode dashboard redesign
  • React frontend (9 pages)
  • Eval framework (JSONL loader, rubric scoring, batch analysis)
  • License changed to Apache 2.0
  • 388 tests passing, Python 3.9+ compatible

v1.0.0 - Initial Release

09 Dec 23:32

Choose a tag to compare

RewardHackWatch v1.0.0 - Runtime detector for reward hacking and misalignment in LLM agents

Highlights

  • 89.7% F1 on 5,391 MALT trajectories
  • Novel RMGI metric for hack→misalignment tracking
  • Full stack: FastAPI REST API, Streamlit dashboard, CLI

Getting started
pip install "git+https://github.com/aerosta/rewardhackwatch.git"

Links