RFC: Input abstraction layer for trace-based and production log evaluation #5

@gmitt98

Summary

The current runner contract (write outputs/{fixture_id}/run-N.txt) works well for static fixtures but cannot represent agent traces, multi-step workflows, or production logs. This RFC tracks the design work needed before implementation.
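For reference, the existing contract can be sketched in a few lines. This is only an illustration of the path layout named above, not fieldtest's actual runner code: the `write_run` helper, its parameters, and the choice of UTF-8 are assumptions.

```python
from pathlib import Path


def write_run(output_root: Path, fixture_id: str, run_number: int, text: str) -> Path:
    """Write one run's final output to {output_root}/{fixture_id}/run-N.txt.

    Hypothetical helper: only the outputs/{fixture_id}/run-N.txt layout
    comes from the current contract; everything else is illustrative.
    """
    run_dir = output_root / fixture_id
    run_dir.mkdir(parents=True, exist_ok=True)
    run_path = run_dir / f"run-{run_number}.txt"
    run_path.write_text(text, encoding="utf-8")
    return run_path
```

The sketch makes the limitation visible: each run is a single opaque string keyed by `fixture_id`, with no place for intermediate steps or an externally assigned trace ID.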

Motivation

Static fixtures cover the "does it behave correctly on known inputs" question. But teams also need:

  • Agent traces — multi-step outputs where intermediate steps matter
  • Production logs — evaluating real traffic, not synthetic cases
  • Regression testing on real behavior — not just on fixtures you wrote

Design Questions to Resolve

  1. Runner contract — how does a trace-aware runner write outputs? A directory of JSON blobs? A trace manifest? Must be backwards-compatible with existing txt-file runners.
  2. Fixture abstraction — does fixture_id still make sense when the input is a logged trace with its own ID?
  3. Multi-step outputs — evals that need to inspect intermediate steps (tool calls, chain-of-thought) rather than just final output.
  4. Production sampling — how does fieldtest pull a sample from a log store (file, MLflow, custom) without coupling to infrastructure?
  5. Determinism — production traces are not re-runnable; N-run distribution semantics don't apply. How does the report handle single-run trace evals?
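To make the questions above concrete, here is one possible shape for an answer, offered as a discussion aid rather than a proposal. Everything in it is hypothetical: the `run-N.json` file name, the `trace_id` / `final_output` / `steps` / `rerunnable` keys, and the `RunRecord`, `TraceSource`, and `DirectorySource` names are assumptions made for illustration. Only the legacy `run-N.txt` layout comes from the current contract.

```python
import json
import random
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterable, Protocol


@dataclass
class Step:
    kind: str      # e.g. "tool_call" or "message" (hypothetical vocabulary)
    content: str


@dataclass
class RunRecord:
    source_id: str            # fixture_id, or a logged trace's own ID
    final_output: str
    steps: list[Step] = field(default_factory=list)
    rerunnable: bool = True   # False for production traces (single-run evals)


def load_run(run_path: Path) -> RunRecord:
    """Read a legacy run-N.txt or a hypothetical run-N.json into one record."""
    source_id = run_path.parent.name
    if run_path.suffix == ".txt":
        # Legacy runners keep working: the whole file is the final output.
        return RunRecord(source_id, run_path.read_text(encoding="utf-8"))
    data = json.loads(run_path.read_text(encoding="utf-8"))
    steps = [Step(s["kind"], s["content"]) for s in data.get("steps", [])]
    return RunRecord(
        source_id=data.get("trace_id", source_id),
        final_output=data["final_output"],
        steps=steps,
        rerunnable=data.get("rerunnable", False),
    )


class TraceSource(Protocol):
    """Anything that can list run files; keeps sampling storage-agnostic."""

    def iter_paths(self) -> Iterable[Path]: ...


@dataclass
class DirectorySource:
    root: Path

    def iter_paths(self) -> Iterable[Path]:
        return sorted(self.root.glob("*/run-*.json"))


def sample_runs(source: TraceSource, k: int, seed: int = 0) -> list[RunRecord]:
    """Draw a reproducible sample of at most k runs from a source."""
    paths = list(source.iter_paths())
    chosen = random.Random(seed).sample(paths, min(k, len(paths)))
    return [load_run(p) for p in chosen]
```

In this sketch a reporter could branch on `rerunnable` to skip N-run distribution statistics for production traces, and an MLflow or custom backend would only need to satisfy `TraceSource`. Whether a per-run JSON file is preferable to a trace manifest remains an open question.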

Proposed First Step

Gather concrete use cases before speccing the interface. If you have a trace-based eval need, comment here with:

  • What your system looks like (agent, RAG, single-turn)
  • What your current log format looks like
  • What you'd want to evaluate

Priority

Design discussion now. Implementation after use cases are understood — this is an architectural change to the runner contract and must be backwards-compatible.

Labels: enhancement (New feature or request)