RFC: Input abstraction layer for trace-based and production log evaluation

## Summary

The current runner contract (write `outputs/{fixture_id}/run-N.txt`) works well for static fixtures but cannot represent agent traces, multi-step workflows, or production logs. This RFC tracks the design work needed before implementation.
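For readers new to the contract, here is a minimal sketch of what a conforming runner does today, based only on the path pattern above. The helper name `write_run` and the choice to start `N` at 1 are illustrative assumptions, not part of fieldtest.

```python
from pathlib import Path


def write_run(outputs_dir: Path, fixture_id: str, run_number: int, text: str) -> Path:
    """Write one run's output to {outputs_dir}/{fixture_id}/run-{N}.txt.

    Hypothetical helper: the contract only specifies the path layout,
    not this function or how N is numbered.
    """
    run_path = outputs_dir / fixture_id / f"run-{run_number}.txt"
    # Create the per-fixture directory on first use.
    run_path.parent.mkdir(parents=True, exist_ok=True)
    run_path.write_text(text, encoding="utf-8")
    return run_path
```

Each file holds one opaque string, so the layout has no place for intermediate steps or for an externally assigned trace ID.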

## Motivation

Static fixtures cover the "does it behave correctly on known inputs" question. But teams also need:

- **Agent traces** — multi-step outputs where intermediate steps matter
- **Production logs** — evaluating real traffic, not synthetic cases
- **Regression testing on real behavior** — not just on fixtures you wrote

## Design Questions to Resolve

1. **Runner contract** — how does a trace-aware runner write outputs? A directory of JSON blobs? A trace manifest? Must be backwards-compatible with existing txt-file runners.
2. **Fixture abstraction** — does `fixture_id` still make sense when the input is a logged trace with its own ID?
3. **Multi-step outputs** — how do evals inspect intermediate steps (tool calls, chain-of-thought) rather than just the final output?
4. **Production sampling** — how does fieldtest pull a sample from a log store (file, MLflow, custom) without coupling to infrastructure?
5. **Determinism** — production traces are not re-runnable; N-run distribution semantics don't apply. How does the report handle single-run trace evals?
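
To make the questions above concrete, here is one possible shape to react to. It is a discussion aid, not a proposed interface. Every name in it is hypothetical (`TraceStep`, `TraceRecord`, `load_run`, `LogSource`, `JsonlLogSource`, `sample_traces`), and so are the JSON keys (`id`, `final_output`, `steps`, `kind`, `content`, `rerunnable`). None of this exists in fieldtest. The sketch assumes that legacy txt output can be treated as a trace with zero steps, and that log stores sit behind a small protocol.

```python
import json
import random
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterator, Protocol


@dataclass
class TraceStep:
    kind: str      # hypothetical labels, e.g. "tool_call" or "message"
    content: str


@dataclass
class TraceRecord:
    source_id: str                 # a fixture_id, or the logged trace's own ID
    final_output: str
    steps: list[TraceStep] = field(default_factory=list)
    rerunnable: bool = True        # False for production traces (question 5)


def _parse_steps(data: dict) -> list[TraceStep]:
    return [TraceStep(s["kind"], s["content"]) for s in data.get("steps", [])]


def load_run(path: Path, source_id: str) -> TraceRecord:
    """Read a legacy run-N.txt file or a hypothetical JSON trace file."""
    if path.suffix == ".txt":
        # Existing runners: the whole file is the final output, with no steps.
        return TraceRecord(source_id, path.read_text(encoding="utf-8"))
    data = json.loads(path.read_text(encoding="utf-8"))
    return TraceRecord(
        source_id, data["final_output"], _parse_steps(data), data.get("rerunnable", True)
    )


class LogSource(Protocol):
    """Anything that can yield logged traces (file, MLflow, custom store)."""

    def traces(self) -> Iterator[TraceRecord]: ...


class JsonlLogSource:
    """File-backed source, assuming one JSON object per line."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def traces(self) -> Iterator[TraceRecord]:
        with self.path.open(encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue  # skip blank lines
                data = json.loads(line)
                yield TraceRecord(
                    data["id"], data["final_output"], _parse_steps(data), rerunnable=False
                )


def sample_traces(source: LogSource, k: int, seed: int = 0) -> list[TraceRecord]:
    """Draw up to k traces with a fixed seed so the sample is reproducible."""
    records = list(source.traces())
    return random.Random(seed).sample(records, min(k, len(records)))
```

In this sketch, existing txt-file runners keep working unchanged, and a report could use the `rerunnable` flag to skip N-run distribution statistics for single-run traces. Whether `source_id` should replace `fixture_id` or live alongside it is left open, as in question 2.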

## Proposed First Step

Gather concrete use cases before speccing the interface. If you have a trace-based eval need, comment here with:
- What your system looks like (agent, RAG, single-turn)
- What your current log format looks like
- What you'd want to evaluate

## Priority

Design discussion now. Implementation after use cases are understood — this is an architectural change to the runner contract and must be backwards-compatible.
