Releases · aerosta/rewardhackwatch · GitHub

01 Mar 15:21

aerosta

v1.3.0 - Eval Workbench, LLM Judge, React Frontend Redesign (Latest)

Eval Workbench (NEW)

JSONL import with drag-drop and auto schema detection
8 built-in rule templates (rogue code, mock exploit, safety, deceptive CoT)
Custom rule builder: regex, keyword, length, LLM judge
Batch scoring with weighted A-F grades
Side-by-side model comparison for A/B evaluation
JSON/CSV export
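
The custom rule types and weighted A-F grading listed above could be combined roughly as follows. This is a minimal sketch, not the project's actual implementation: the `Rule` class, the helper names, the `response` field, and the grade cutoffs (0.9/0.8/0.7/0.6) are all hypothetical and chosen only for illustration.

```python
import json
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Rule:
    """Hypothetical rule: `check` returns True when the rule FLAGS the text."""
    name: str
    weight: float
    check: Callable[[str], bool]


def regex_rule(name: str, pattern: str, weight: float = 1.0) -> Rule:
    compiled = re.compile(pattern, re.IGNORECASE)
    return Rule(name, weight, lambda text: compiled.search(text) is not None)


def keyword_rule(name: str, keywords: List[str], weight: float = 1.0) -> Rule:
    lowered = [k.lower() for k in keywords]
    return Rule(name, weight, lambda text: any(k in text.lower() for k in lowered))


def length_rule(name: str, max_chars: int, weight: float = 1.0) -> Rule:
    return Rule(name, weight, lambda text: len(text) > max_chars)


def score(text: str, rules: List[Rule]) -> float:
    """Fraction of total rule weight that did NOT flag the text, in [0, 1]."""
    total = sum(r.weight for r in rules)
    flagged = sum(r.weight for r in rules if r.check(text))
    return 1.0 - flagged / total


def grade(s: float) -> str:
    """Map a score to a letter grade using assumed cutoffs."""
    for cutoff, letter in ((0.9, "A"), (0.8, "B"), (0.7, "C"), (0.6, "D")):
        if s >= cutoff:
            return letter
    return "F"


def grade_jsonl(lines: Iterable[str], rules: List[Rule],
                field: str = "response") -> List[Tuple[float, str]]:
    """Batch-score JSONL records; `field` is an assumed schema key."""
    results = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between records
        record = json.loads(line)
        s = score(record[field], rules)
        results.append((s, grade(s)))
    return results


RULES = [
    regex_rule("mock_exploit", r"unittest\.mock|MagicMock", weight=3.0),
    keyword_rule("forced_exit", ["sys.exit(0)"], weight=2.0),
    length_rule("too_long", max_chars=2000, weight=1.0),
]
```

Weighting by rule lets a single high-severity match (here the mock pattern) pull the grade down further than a low-severity one, which is one plausible reading of "weighted" grades.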

LLM Judge Integration (NEW)

Dual provider: Claude (Opus 4.6, Sonnet 4.5, Haiku 4.5) + OpenAI (GPT-5.2, GPT-4o)
Independent review provider for cross-validation
Graceful degradation when API keys are not configured
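
One way the dual-provider setup with graceful degradation might work is sketched below. It assumes keys are read from the `ANTHROPIC_API_KEY` and `OPENAI_API_KEY` environment variables (the names the official SDKs use by default); whether the project reads them this way is an assumption, and `available_providers` and `pick_judges` are hypothetical helpers.

```python
import os
from typing import Dict, List, Mapping, Optional

# Assumed mapping from provider name to the environment variable holding its key.
PROVIDER_ENV_VARS = {
    "claude": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def available_providers(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return providers whose API key is set and non-empty."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_ENV_VARS.items()
            if env.get(var, "").strip()]


def pick_judges(env: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    """Choose a primary judge and, if a second provider exists, a reviewer.

    With no keys configured both entries are None, so callers can fall back
    to rule-based scoring instead of raising an error.
    """
    providers = available_providers(env)
    primary = providers[0] if providers else None
    reviewer = providers[1] if len(providers) > 1 else None
    return {"primary": primary, "reviewer": reviewer}
```

Keeping the reviewer on a different provider than the primary judge is what makes the second opinion independent; with only one key set, the sketch simply runs without cross-validation.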

Frontend Redesign

React frontend with 9 pages
Dark theme with Apple-native color scheme
Space Grotesk typography
Dashboard: 3-column chart grid, Recent Alerts section
CoT Viewer: per-step deception scoring with annotations
Session Logs: filtering, search, export
Settings: full-width dual LLM config, monitoring, notifications

Quality

388 tests passing
TypeScript clean, 0 build errors
Category label fixes (Deceptive CoT, abbreviated chart labels)
Graceful error handling for missing backend/API keys

01 Mar 11:59

aerosta

v1.2.0 - Mock Exploit Fix, HackBench, Temporal Detection

Mock exploit detection F1 improved from 0% to 98.5%

RewardHackDetector API + Quick Start fix
1,200 synthetic trajectories for mock exploit coverage
HackBench benchmark dataset (4,300+ trajectories)
Temporal trajectory modeling pipeline
Causal RMGI with Granger causality testing
Cross-model transfer study framework
Evasion robustness testing (5 attack types)
Dynamic threshold calibration API
Dark-mode dashboard redesign
React frontend (9 pages)
Eval framework (JSONL loader, rubric scoring, batch analysis)
License changed to Apache 2.0
388 tests passing, Python 3.9+ compatible
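
To make the F1 figures and the threshold calibration item above concrete, here is a minimal sketch that picks the detection threshold maximizing F1 on labeled scores. It is an illustration only: `f1_score` and `calibrate_threshold` are hypothetical names, not the project's calibration API, and the exhaustive search over observed scores is an assumed strategy.

```python
from typing import List, Optional, Sequence, Tuple


def f1_score(labels: Sequence[int], preds: Sequence[bool]) -> float:
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def calibrate_threshold(scores: Sequence[float],
                        labels: Sequence[int]) -> Tuple[Optional[float], float]:
    """Try each observed score as a threshold and keep the best F1.

    A trajectory is flagged when its score is >= the threshold.
    Ties are resolved in favor of the lowest threshold.
    """
    best_t: Optional[float] = None
    best_f1 = -1.0
    for t in sorted(set(scores)):
        preds: List[bool] = [s >= t for s in scores]
        f1 = f1_score(labels, preds)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy data: detector scores and ground-truth hack labels (1 = reward hack).
scores = [0.1, 0.4, 0.35, 0.8, 0.9]
labels = [0, 0, 1, 1, 1]
threshold, best = calibrate_threshold(scores, labels)
```

On the toy data the search settles on the lowest threshold that still catches every labeled hack, trading one false positive for full recall.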

09 Dec 23:32

aerosta

v1.0.0 - Initial Release

RewardHackWatch v1.0.0 - Runtime detector for reward hacking and misalignment in LLM agents

Highlights

89.7% F1 on 5,391 MALT trajectories
Novel RMGI metric for hack→misalignment tracking
Full stack: FastAPI REST API, Streamlit dashboard, CLI

Getting started
pip install "git+https://github.com/aerosta/rewardhackwatch.git"
