add eval capes to sdk by luke-e-schaefer · Pull Request #460 · scaleapi/nucleus-python-client

luke-e-schaefer · 2026-05-12T18:19:19Z

resolves https://linear.app/scale-epd/issue/DE-7460

Greptile Summary

Adds EvaluationV2 SDK support: NucleusClient.create_evaluation_v2, get_evaluation_v2, and list_evaluations_v2, plus Pydantic DTOs for charts/examples payloads and a new tests/test_evaluation_v2.py with mocked-HTTP coverage.
The EvaluationV2Status enum is exported publicly but wait_for_completion compares self.status against raw string literals instead of using the enum values, creating a silent drift risk.
list_evaluations_v2 silently returns [] when the API returns an unexpected non-list success body, which could mask malformed responses.

Confidence Score: 5/5

Safe to merge; only P2 style/maintenance observations found.

No P0 or P1 bugs found. Both findings are P2: one is a style inconsistency (raw strings vs enum) and one is a defensive-code gap (silent empty list). Core create/poll/charts/examples flows are correctly implemented and well-tested.

nucleus/evaluation_v2.py (enum usage) and nucleus/__init__.py (list_evaluations_v2 silent fallback).

Important Files Changed

Filename	Overview
nucleus/evaluation_v2.py	New EvaluationV2 dataclass + helpers; status polling uses raw string literals instead of the exported EvaluationV2Status enum.
nucleus/__init__.py	Adds create/get/list evaluation_v2 methods; list_evaluations_v2 silently returns [] on unexpected non-list responses.
nucleus/data_transfer_object/evaluation_v2.py	New Pydantic DTOs for evaluation V2 REST payloads; all nullable fields on EvaluationV2MatchExample are properly Optional.
tests/test_evaluation_v2.py	Unit tests with mocked HTTP layer covering create, get, charts, examples, wait_for_completion, and delete flows.
docs/index.rst	Adds Evaluations V2 documentation section with usage example and REST endpoint reference.

Sequence Diagram

sequenceDiagram
    participant User
    participant NucleusClient
    participant API

    User->>NucleusClient: create_evaluation_v2(model_run_id, ...)
    NucleusClient->>API: POST /modelRun/:id/evaluationsV2
    API-->>NucleusClient: "{ evaluation_id }"
    NucleusClient->>API: GET /evaluationsV2/:id
    API-->>NucleusClient: EvaluationV2 payload
    NucleusClient-->>User: EvaluationV2

    User->>NucleusClient: ev.wait_for_completion()
    loop Poll until terminal
        NucleusClient->>API: GET /evaluationsV2/:id
        API-->>NucleusClient: "{ status }"
    end
    NucleusClient-->>User: EvaluationV2 (succeeded)

    User->>NucleusClient: "ev.charts(iou_threshold=0.5)"
    NucleusClient->>API: "GET /evaluationsV2/:id/charts?iouThreshold=0.5"
    API-->>NucleusClient: EvaluationV2Charts JSON
    NucleusClient-->>User: EvaluationV2Charts

    User->>NucleusClient: "ev.examples(match_type=FP)"
    NucleusClient->>API: POST /evaluationsV2/:id/examples
    API-->>NucleusClient: EvaluationV2ExamplesPage JSON
    NucleusClient-->>User: EvaluationV2ExamplesPage
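
The create-then-poll portion of the diagram can be traced with a small in-memory stand-in. This is only a sketch of the request order: `FakeAPI`, `create_and_wait`, the `"running"` and `"failed"` status strings, and the ids are hypothetical and are not the SDK's actual code. Plain status strings are used here purely for brevity.

```python
from typing import Dict, List


class FakeAPI:
    """Hypothetical in-memory stand-in for the REST API; records each request."""

    def __init__(self, running_polls: int = 2) -> None:
        self.calls: List[str] = []
        # Number of GETs answered with "running" before reporting "succeeded".
        self._running_polls = running_polls

    def post(self, path: str) -> Dict[str, str]:
        self.calls.append(f"POST {path}")
        return {"evaluation_id": "eval_1"}

    def get(self, path: str) -> Dict[str, str]:
        self.calls.append(f"GET {path}")
        if self._running_polls > 0:
            self._running_polls -= 1
            return {"id": "eval_1", "status": "running"}
        return {"id": "eval_1", "status": "succeeded"}


def create_and_wait(api: FakeAPI, model_run_id: str) -> Dict[str, str]:
    # Step 1: the create call returns only the new evaluation's id.
    created = api.post(f"/modelRun/{model_run_id}/evaluationsV2")
    evaluation_id = created["evaluation_id"]
    # Step 2: fetch the full payload, then re-fetch until a terminal status.
    payload = api.get(f"/evaluationsV2/{evaluation_id}")
    while payload["status"] not in {"succeeded", "failed"}:
        payload = api.get(f"/evaluationsV2/{evaluation_id}")
    return payload


api = FakeAPI(running_polls=2)
result = create_and_wait(api, "run_1")
```

After the call, `api.calls` holds one POST followed by three GETs, matching the order of arrows in the diagram.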

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
nucleus/evaluation_v2.py:131-141
**`EvaluationV2Status` enum defined but unused in polling logic**

`EvaluationV2Status` is exported as part of the public API, but `wait_for_completion` compares `self.status` against raw string literals. If a new terminal state (e.g., `"timed_out"`) is added to the enum later, the polling set won't be updated unless someone remembers to change both places. Using `EvaluationV2Status` values (e.g., `EvaluationV2Status.FAILED.value`) makes the coupling explicit and lets type checkers catch drift.
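
A minimal sketch of the enum-based approach, for illustration only. The enum members other than `FAILED`, the `TERMINAL_STATUSES` name, and the `fetch_status` callable are assumptions; the real `wait_for_completion` is a method on `EvaluationV2` and its signature may differ.

```python
import time
from enum import Enum
from typing import Callable


class EvaluationV2Status(str, Enum):
    # Hypothetical members; the review only names FAILED explicitly.
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Terminal states live next to the enum, so adding a member prompts a review here.
TERMINAL_STATUSES = frozenset(
    {EvaluationV2Status.SUCCEEDED, EvaluationV2Status.FAILED}
)


def wait_for_completion(
    fetch_status: Callable[[], str],
    poll_interval_s: float = 0.0,
    max_polls: int = 100,
) -> EvaluationV2Status:
    """Poll until a terminal status is seen; fetch_status stands in for the GET call."""
    for _ in range(max_polls):
        # Converting through the enum raises ValueError on an unknown status
        # instead of silently polling forever.
        status = EvaluationV2Status(fetch_status())
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_interval_s)
    raise TimeoutError(f"no terminal status after {max_polls} polls")


statuses = iter(["pending", "running", "succeeded"])
final_status = wait_for_completion(lambda: next(statuses))
```

Routing every raw status through `EvaluationV2Status(...)` means a status the client does not know about fails loudly rather than being treated as non-terminal.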

### Issue 2 of 2
nucleus/__init__.py:940-945
**`list_evaluations_v2` silently returns `[]` on unexpected API shape**

`self.get(...)` raises `NucleusAPIError` for HTTP errors, so the `not isinstance(rows, list)` branch is reachable only when the server returns a non-list success body (e.g., a dict). In that case the method returns `[]` without surfacing any diagnostic, making it impossible to tell whether the run has no evaluations or whether the response was malformed. Raising instead of swallowing the unexpected payload would make the error actionable.
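
One way to surface the malformed payload, sketched under assumptions: `UnexpectedResponseError` and `parse_evaluation_rows` are hypothetical names, and the actual fix might reuse an existing SDK exception instead.

```python
from typing import Any, Dict, List


class UnexpectedResponseError(Exception):
    """Hypothetical error for a success response with the wrong JSON shape."""


def parse_evaluation_rows(payload: Any) -> List[Dict[str, Any]]:
    """Return the payload if it is a list; raise instead of defaulting to []."""
    if not isinstance(payload, list):
        raise UnexpectedResponseError(
            "expected a JSON list of evaluations, got "
            f"{type(payload).__name__}"
        )
    return payload


empty_rows = parse_evaluation_rows([])  # a run with no evaluations
```

With this shape, an empty list still means "no evaluations", while a dict or other non-list body raises with the offending type in the message.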

Reviews (4): Last reviewed commit: "fix p1"

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

add eval capes to sdk

4c6083e

luke-e-schaefer requested review from edwinpav and vinay553 May 12, 2026 18:19

luke-e-schaefer self-assigned this May 12, 2026

greptile-apps Bot reviewed May 12, 2026


Comment thread nucleus/data_transfer_object/evaluation_v2.py Outdated

Comment thread nucleus/__init__.py Outdated

luke-e-schaefer and others added 2 commits May 12, 2026 13:49

Apply suggestion from @greptile-apps[bot]

36f6b4a

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

Apply suggestion from @greptile-apps[bot]

3caaf8d

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

greptile-apps Bot reviewed May 12, 2026


Comment thread nucleus/data_transfer_object/evaluation_v2.py Outdated

luke-e-schaefer and others added 3 commits May 12, 2026 14:03

run hooks

13a91b2

merge remote

cce066e

Update nucleus/data_transfer_object/evaluation_v2.py

aced4aa

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

greptile-apps Bot reviewed May 12, 2026


Comment thread nucleus/data_transfer_object/evaluation_v2.py

fix p1

866ac71

luke-e-schaefer wants to merge 7 commits into master from add-eval-capabilities
