add eval capes to sdk #460

Open
luke-e-schaefer wants to merge 7 commits into master from add-eval-capabilities

Conversation

@luke-e-schaefer luke-e-schaefer commented May 12, 2026

resolves https://linear.app/scale-epd/issue/DE-7460

Greptile Summary

  • Adds EvaluationV2 SDK support: NucleusClient.create_evaluation_v2, get_evaluation_v2, and list_evaluations_v2, plus Pydantic DTOs for charts/examples payloads and a new tests/test_evaluation_v2.py with mocked-HTTP coverage.
  • The EvaluationV2Status enum is exported publicly but wait_for_completion compares self.status against raw string literals instead of using the enum values, creating a silent drift risk.
  • list_evaluations_v2 silently returns [] when the API returns an unexpected non-list success body, which could mask malformed responses.

Confidence Score: 5/5

Safe to merge; only P2 style/maintenance observations found.

No P0 or P1 bugs found. Both findings are P2: one is a style inconsistency (raw strings vs enum) and one is a defensive-code gap (silent empty list). Core create/poll/charts/examples flows are correctly implemented and well-tested.

Files needing attention: nucleus/evaluation_v2.py (enum usage) and nucleus/__init__.py (list_evaluations_v2 silent fallback).

Important Files Changed

| Filename | Overview |
| --- | --- |
| nucleus/evaluation_v2.py | New EvaluationV2 dataclass + helpers; status polling uses raw string literals instead of the exported EvaluationV2Status enum. |
| nucleus/__init__.py | Adds create/get/list evaluation_v2 methods; list_evaluations_v2 silently returns [] on unexpected non-list responses. |
| nucleus/data_transfer_object/evaluation_v2.py | New Pydantic DTOs for evaluation V2 REST payloads; all nullable fields on EvaluationV2MatchExample are properly Optional. |
| tests/test_evaluation_v2.py | Unit tests with mocked HTTP layer covering create, get, charts, examples, wait_for_completion, and delete flows. |
| docs/index.rst | Adds Evaluations V2 documentation section with usage example and REST endpoint reference. |

Sequence Diagram

sequenceDiagram
    participant User
    participant NucleusClient
    participant API

    User->>NucleusClient: create_evaluation_v2(model_run_id, ...)
    NucleusClient->>API: POST /modelRun/:id/evaluationsV2
    API-->>NucleusClient: "{ evaluation_id }"
    NucleusClient->>API: GET /evaluationsV2/:id
    API-->>NucleusClient: EvaluationV2 payload
    NucleusClient-->>User: EvaluationV2

    User->>NucleusClient: ev.wait_for_completion()
    loop Poll until terminal
        NucleusClient->>API: GET /evaluationsV2/:id
        API-->>NucleusClient: "{ status }"
    end
    NucleusClient-->>User: EvaluationV2 (succeeded)

    User->>NucleusClient: "ev.charts(iou_threshold=0.5)"
    NucleusClient->>API: "GET /evaluationsV2/:id/charts?iouThreshold=0.5"
    API-->>NucleusClient: EvaluationV2Charts JSON
    NucleusClient-->>User: EvaluationV2Charts

    User->>NucleusClient: "ev.examples(match_type=FP)"
    NucleusClient->>API: POST /evaluationsV2/:id/examples
    API-->>NucleusClient: EvaluationV2ExamplesPage JSON
    NucleusClient-->>User: EvaluationV2ExamplesPage
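
The "poll until terminal" loop in the diagram can be sketched roughly as below. This is an illustration, not the SDK's implementation: `poll_until_terminal`, its parameters, and the `"succeeded"`/`"failed"` status strings are assumptions made for the example, and `fetch_status` stands in for the `GET /evaluationsV2/:id` call.

```python
import time
from typing import Callable, Optional

# Assumed terminal status strings; the real API may use different values.
TERMINAL = {"succeeded", "failed"}


def poll_until_terminal(
    fetch_status: Callable[[], str],
    poll_interval: float = 1.0,
    timeout: Optional[float] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_status until it returns a terminal status or the timeout elapses."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout} seconds")
        sleep(poll_interval)


# Usage with a canned sequence of statuses and a no-op sleep.
statuses = iter(["pending", "running", "succeeded"])
final_status = poll_until_terminal(lambda: next(statuses), sleep=lambda _: None)
```

Injecting `sleep` keeps the loop testable without real delays, which is similar in spirit to the mocked-HTTP tests this PR adds.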

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
nucleus/evaluation_v2.py:131-141
**`EvaluationV2Status` enum defined but unused in polling logic**

`EvaluationV2Status` is exported as part of the public API, but `wait_for_completion` compares `self.status` against raw string literals. If a new terminal state (e.g., `"timed_out"`) is added to the enum later, the polling set won't be updated unless someone remembers to change both places. Using `EvaluationV2Status` values (e.g., `EvaluationV2Status.FAILED.value`) makes the coupling explicit and lets type checkers catch drift.
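
One possible shape for the fix, as a hedged sketch: the member names and string values below are guesses for illustration (only `FAILED` is named above, and "succeeded" appears in the sequence diagram), and `TERMINAL_STATUSES` and `is_terminal` are hypothetical helpers, not code from the PR. Deriving the terminal set from the enum means a new terminal member only has to be registered in one place.

```python
from enum import Enum


class EvaluationV2Status(str, Enum):
    # Hypothetical members; the real enum in nucleus/evaluation_v2.py may differ.
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Single source of truth for the states at which polling should stop.
TERMINAL_STATUSES = frozenset(
    {EvaluationV2Status.SUCCEEDED, EvaluationV2Status.FAILED}
)


def is_terminal(status: str) -> bool:
    """Return True if a raw status string from the API is a terminal state."""
    try:
        return EvaluationV2Status(status) in TERMINAL_STATUSES
    except ValueError:
        # Unknown status strings are treated as non-terminal in this sketch,
        # so a caller would still want a timeout on the polling loop.
        return False
```

Converting the raw string through the enum also surfaces unknown statuses at one well-defined point instead of letting them fall through a literal comparison.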

### Issue 2 of 2
nucleus/__init__.py:940-945
**`list_evaluations_v2` silently returns `[]` on unexpected API shape**

`self.get(...)` raises `NucleusAPIError` for HTTP errors, so the `not isinstance(rows, list)` branch is reachable only when the server returns a non-list success body (e.g., a dict). In that case the method returns `[]` without surfacing any diagnostic, making it impossible to tell whether the run has no evaluations or whether the response was malformed. Raising instead of swallowing the unexpected payload would make the error actionable.
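
A minimal sketch of the "raise instead of swallow" behaviour, under stated assumptions: `UnexpectedResponseError` and `parse_evaluation_rows` are hypothetical names for this example, and the SDK may prefer to reuse one of its existing exception types.

```python
from typing import Any, List


class UnexpectedResponseError(Exception):
    """Hypothetical error type; the SDK may prefer an existing exception."""


def parse_evaluation_rows(rows: Any) -> List[dict]:
    """Validate a list response body instead of silently returning []."""
    if not isinstance(rows, list):
        raise UnexpectedResponseError(
            f"Expected a list of evaluations, got {type(rows).__name__}"
        )
    return rows
```

With this shape, an empty list still means "no evaluations", while a dict or other unexpected success body produces an error that names the offending type.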

Reviews (4): Last reviewed commit: "fix p1"

@luke-e-schaefer luke-e-schaefer self-assigned this May 12, 2026
Comment thread nucleus/data_transfer_object/evaluation_v2.py Outdated
Comment thread nucleus/__init__.py Outdated
luke-e-schaefer and others added 2 commits May 12, 2026 13:49
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Comment thread nucleus/data_transfer_object/evaluation_v2.py Outdated
luke-e-schaefer and others added 3 commits May 12, 2026 14:03
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Comment thread nucleus/data_transfer_object/evaluation_v2.py