Summary
The current runner contract (write outputs/{fixture_id}/run-N.txt) works well for static fixtures but cannot represent agent traces, multi-step workflows, or production logs. This RFC tracks the design work needed before implementation.
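For reference, the existing contract can be sketched as below. This is a minimal illustration rather than fieldtest's actual code: the helper name `write_run_output`, the `root` parameter, and the UTF-8 encoding are assumptions. Only the `outputs/{fixture_id}/run-N.txt` layout comes from this RFC.

```python
from pathlib import Path


def write_run_output(root: Path, fixture_id: str, run: int, text: str) -> Path:
    """Write one run's final output to outputs/{fixture_id}/run-N.txt.

    Hypothetical helper: only the path layout is taken from the RFC.
    """
    out_dir = root / "outputs" / fixture_id
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"run-{run}.txt"
    # A run is a single string, so intermediate steps have nowhere to go.
    path.write_text(text, encoding="utf-8")
    return path
```

Each file holds one final output string, which is the limitation the rest of this RFC is concerned with.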
Motivation
Static fixtures cover the "does it behave correctly on known inputs" question. But teams also need:
- Agent traces — multi-step outputs where intermediate steps matter
- Production logs — evaluating real traffic, not synthetic cases
- Regression testing on real behavior — not just on fixtures you wrote
Design Questions to Resolve
- Runner contract — how does a trace-aware runner write outputs? A directory of JSON blobs? A trace manifest? Must be backwards-compatible with existing txt-file runners.
- Fixture abstraction — does fixture_id still make sense when the input is a logged trace with its own ID?
- Multi-step outputs — evals that need to inspect intermediate steps (tool calls, chain-of-thought) rather than just final output.
- Production sampling — how does fieldtest pull a sample from a log store (file, MLflow, custom) without coupling to infrastructure?
- Determinism — production traces are not re-runnable; N-run distribution semantics don't apply. How does the report handle single-run trace evals?
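One way to explore the runner-contract, multi-step, and determinism questions together is a loader that accepts both layouts. The sketch below is an illustration under stated assumptions, not a proposed spec: the `trace.json` filename, its `steps` and `final_output` keys, and the `RunOutput` and `load_outputs` names are all hypothetical. It treats a directory containing `trace.json` as a single trace that cannot be re-run, and otherwise falls back to the existing `run-N.txt` files.

```python
import json
import re
from dataclasses import dataclass, field
from pathlib import Path

RUN_FILE = re.compile(r"run-(\d+)\.txt")


@dataclass
class RunOutput:
    final_output: str
    # Intermediate steps (tool calls, etc.); empty for legacy txt runs.
    steps: list = field(default_factory=list)
    # False for logged traces, which cannot be re-run N times.
    rerunnable: bool = True


def load_outputs(case_dir: Path) -> list:
    """Load outputs for one fixture or trace directory.

    Hypothetical layout: a trace.json file marks a trace-aware runner;
    otherwise the directory is read as legacy run-N.txt files.
    """
    trace_file = case_dir / "trace.json"
    if trace_file.exists():
        data = json.loads(trace_file.read_text(encoding="utf-8"))
        return [RunOutput(
            final_output=data.get("final_output", ""),
            steps=data.get("steps", []),
            rerunnable=False,
        )]

    runs = []
    for path in case_dir.iterdir():
        match = RUN_FILE.fullmatch(path.name)
        if match:
            runs.append((int(match.group(1)), path))
    # Sort numerically so run-10 comes after run-2.
    runs.sort(key=lambda pair: pair[0])
    return [RunOutput(final_output=p.read_text(encoding="utf-8")) for _, p in runs]
```

Under this sketch, existing txt-file runners keep working unchanged, and a report could check `rerunnable` together with the number of returned runs to decide whether N-run distribution statistics apply or whether to show a single-run result.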
Proposed First Step
Gather concrete use cases before speccing the interface. If you have a trace-based eval need, comment here with:
- What your system looks like (agent, RAG, single-turn)
- What your current log format looks like
- What you'd want to evaluate
Priority
Design discussion now. Implementation after use cases are understood — this is an architectural change to the runner contract and must be backwards-compatible.