harness: align graph runs with structured metrics artifacts #1803

jioffe502 merged 11 commits into NVIDIA:main
Conversation
Replace stdout metric parsing in harness run execution with structured metrics derived from graph pipeline runtime summary artifacts so results are more stable and reviewer-friendly. Preserve existing artifact shape while adding run-mode/auto-tuning config coverage and updating harness tests for batch-first structured sourcing. Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
Delete the legacy harness stream-metric parser and its tests now that harness metrics are sourced from runtime summary artifacts. Keep PTY child-process output streaming for user job-progress visibility while slimming dead parsing code. Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
- harden latest_commit fallback for gitdir and packed-refs environments
- expose resolved auto-heuristic graph flags in runtime and harness artifacts
- trim effective_tuning payload while preserving key runtime context
Greptile Summary

This PR replaces the harness's fragile stdout-scraping (`StreamMetrics`/`parse_stream_text`) with structured metrics read from the graph pipeline's `*.runtime.summary.json` artifacts.
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/harness/run.py | Core change: removes StreamMetrics/stdout parsing and replaces with structured runtime-summary-driven metrics; adds --run-mode to _build_command; introduces _build_structured_metrics_payload; files metric is now hardcoded to None |
| nemo_retriever/src/nemo_retriever/harness/artifacts.py | Adds gitdir-file / packed-refs aware commit resolution; multiple except Exception blocks that silently return None violate the no-bare-except rule |
| nemo_retriever/src/nemo_retriever/harness/config.py | Adds run_mode field with VALID_RUN_MODES constant, validation, and HARNESS_RUN_MODE env override; clean and well-tested |
| nemo_retriever/src/nemo_retriever/harness/parsers.py | Deleted: stdout-scraping StreamMetrics/parse_stream_text replaced entirely by structured runtime-summary extraction |
| nemo_retriever/src/nemo_retriever/utils/ray_resource_hueristics.py | Adds _coerce_count helper to safely convert Ray's potentially-fractional resource counts to int; clean fix with test coverage |
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[HarnessConfig\nrun_mode, tuning flags] --> B[_build_command\nadds --run-mode]
    B --> C[graph_pipeline subprocess\nwrites *.runtime.summary.json]
    C --> D{runtime summary\nexists?}
    D -- yes --> E[_build_structured_metrics_payload\nreads pages / ingest_secs / evaluation_metrics]
    D -- no --> F[all metrics = None]
    E --> G[_evaluate_run_outcome\nrecall_metrics / evaluation_metrics]
    E --> H[_resolve_summary_metrics\npages / PPS / recall_5 / ndcg_10]
    G --> I[results.json\nmetrics + summary_metrics\n+ runtime_summary]
    H --> I
    F --> G
    F --> H
```
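The structured branch of this flow can be sketched as follows. This is a simplified, hedged illustration: the real `_build_structured_metrics_payload` in `run.py` has a different signature, and the summary field names here are assumed from the review discussion.

```python
import json
from pathlib import Path
from typing import Any

# Key precedence for the page count, as described in this review.
_PAGE_KEYS = ("processed_pages", "num_pages", "input_pages")


def build_structured_metrics_payload(summary_path: Path) -> dict[str, Any]:
    """Derive harness metrics from a *.runtime.summary.json artifact.

    Returns all-None metrics when the summary is missing or unreadable,
    mirroring the 'no' branch of the flowchart above."""
    try:
        summary = json.loads(summary_path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return {"pages": None, "ingest_secs": None, "evaluation_metrics": None}
    pages = next(
        (summary[k] for k in _PAGE_KEYS if summary.get(k) is not None), None
    )
    return {
        "pages": pages,
        "ingest_secs": summary.get("ingest_secs"),
        "evaluation_metrics": summary.get("evaluation_metrics"),
    }
```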
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/harness/artifacts.py
Line: 37-44
Comment:
**Overly broad exception handling silently swallows errors**
`except Exception: return None` here (and in `_read_packed_ref` and `_read_head_commit`) catches far more than just I/O failures — including programming errors like `AttributeError` or `TypeError`. The intended guard is filesystem errors; prefer catching `(OSError, UnicodeDecodeError)` so logic bugs surface instead of silently falling back to `"unknown"`.
```suggestion
try:
    raw = dot_git.read_text(encoding="utf-8").strip()
except (OSError, UnicodeDecodeError):
    return None
```
The same tightening applies to the `except Exception` blocks in `_read_packed_ref` and `_read_head_commit`.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/harness/run.py
Line: 178-179
Comment:
**`files` metric is always `None` — consumers expecting a file count will silently regress**
The old `StreamMetrics`-based path populated `files` from the `[done] N files …` stdout line. The new structured payload hardcodes `"files": None` because the runtime-summary JSON has no equivalent field. The PR description says the `results.json` layout stays compatible with existing consumers; any consumer that previously read a non-None `files` value will see None without any warning.
If the `graph_pipeline` runtime summary does not (and will not) emit a file count, consider removing the `files` key from the schema entirely rather than keeping it as a perpetually-None placeholder, or document the intentional omission with a comment.
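If the key is dropped rather than kept as a perpetual-`None` placeholder, a minimal sketch might look like this (function and key names assumed for illustration, not the repo's actual helper):

```python
def build_metrics(pages, ingest_secs, evaluation_metrics):
    """Structured metrics payload without a 'files' key: the runtime summary
    reports no file count, so the key is omitted rather than emitted as None,
    making the absence explicit to downstream consumers."""
    return {
        "pages": pages,
        "ingest_secs": ingest_secs,
        "evaluation_metrics": evaluation_metrics,
        # NOTE: 'files' intentionally omitted -- graph_pipeline's runtime
        # summary has no equivalent field (see this review discussion).
    }
```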
---
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/harness/run.py
Line: 223-233
Comment:
**Redundant pages-fallback lookup is unreachable dead code**
`metrics_payload["pages"]` is already resolved from `runtime_summary` by `_build_structured_metrics_payload` using the same key precedence (`processed_pages` → `num_pages` → `input_pages`). Because both functions receive the same `runtime_summary` dict, `summary_metrics["pages"]` will equal `metrics_payload.get("pages")` and the `if summary_metrics["pages"] is None` branch in `_resolve_summary_metrics` can never find a value that `_build_structured_metrics_payload` missed. The block from line 223 to 233 is dead and can be removed to reduce confusion.
The code under review (`run.py`, lines 522-525):

```python
resolved_graph_flags = runtime_resolved_tuning.get("resolved_graph_flags")
if isinstance(resolved_graph_flags, dict):
    final_graph_flags = dict(resolved_graph_flags)
effective_tuning["graph_pipeline_flags"] = final_graph_flags
```
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/harness/run.py
Line: 522-525
Comment:
**`*_cpus_per_actor` keys silently dropped from `graph_pipeline_flags` on runtime override**
When `runtime_resolved_tuning` provides a `resolved_graph_flags` dict, `final_graph_flags` is completely replaced. `resolved_graph_flags` (built in `_resolve_runtime_tuning_summary`) includes `nemotron_parse_*` keys but does not include `page_elements_cpus_per_actor`, `ocr_cpus_per_actor`, or `embed_cpus_per_actor`. Those three fields are present in `requested_graph_flags` from `_resolve_effective_tuning` and are actually passed to the subprocess, but disappear from the `effective_tuning.graph_pipeline_flags` artifact after the replacement. Merging instead of replacing preserves them:
```suggestion
if isinstance(resolved_graph_flags, dict):
    final_graph_flags = {**final_graph_flags, **dict(resolved_graph_flags)}
effective_tuning["graph_pipeline_flags"] = final_graph_flags
```
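The merge-instead-of-replace semantics can be checked in isolation. The flag values below are made up, and `nemotron_parse_workers` is a placeholder for the `nemotron_parse_*` keys mentioned in the comment; the three `*_cpus_per_actor` names come from the comment itself.

```python
# Flags the harness actually passed to the subprocess (requested values).
final_graph_flags = {
    "page_elements_cpus_per_actor": 2,
    "ocr_cpus_per_actor": 2,
    "embed_cpus_per_actor": 1,
    "nemotron_parse_workers": 4,  # placeholder nemotron_parse_* key
}
# Subset echoed back by the runtime summary (no *_cpus_per_actor keys).
resolved_graph_flags = {"nemotron_parse_workers": 8}

# Replacement loses the CPU keys entirely:
replaced = dict(resolved_graph_flags)
assert "ocr_cpus_per_actor" not in replaced

# Merging keeps requested values while runtime-resolved ones win on overlap:
merged = {**final_graph_flags, **resolved_graph_flags}
assert merged["ocr_cpus_per_actor"] == 2
assert merged["nemotron_parse_workers"] == 8
```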
- add missing trailing newline to CONTRIBUTING.md
- add missing trailing newline to release notes markdown
Resolve conflicts: - run.py: omit tuning CLI flags when use_heuristics; otherwise pass effective_tuning (auto_tuning zeros vs configured values). - graph_pipeline.py: keep runtime tuning summary helpers and _count_input_units from main.
Runtime summaries may only return a subset of tuning keys; preserve requested effective values (e.g. auto-tuning CPU sentinels) when overlaying resolved_graph_flags.
…_tuning Remove harness results.json effective_tuning and runtime summary resolved_tuning so this PR stays focused on runtime-summary metrics (vs stdout). Tuning CLI construction unchanged; follow-up can reintroduce tuning capture with a single clear model.
…list Trim graph-refactor migration, git workflow, and obsolete batch/parser notes. Keep operator map, CLI, artifacts, and a short results.json contract. Add maintainer backlog and harness change checklist for agents.
Resolve harness conflicts: keep runtime-summary metrics path, add _resolve_store_uri and graph store CLI flags from main; restore slim HANDOFF with store note.
jperez999 left a comment
If the behavior in `_resolve_effective_tuning` is as I describe in my comment, I think the real problem is that we need to ensure that when we run the `graph_pipeline.py` module the defaults work correctly, wdyt?
```python
    return str(p)


def _resolve_effective_tuning(cfg: HarnessConfig) -> dict[str, int | float]:
```
Is this supposed to be a catch-all, so that if you don't set it, something still gets set? If so, what we are talking about is a default. Shouldn't that be in the actual graph_pipeline.py defaults so they are always picked up?
Good point. _resolve_effective_tuning() was new in this PR, and I removed that helper plus the harness auto_tuning surface in the latest push to keep this change focused on the structured metric flow. I agree the broader default-ownership question belongs in graph_pipeline.py as follow-up work, especially if we want to preserve selective harness overrides later.
Keep this PR focused on structured runtime summary metrics and run_mode plumbing. Leave graph_pipeline default ownership and selective harness overrides for follow-up work. Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
TL;DR
Aligns the retriever harness with the graph pipeline and makes run artifacts easier to trust: metrics come from structured `*.runtime.summary.json` artifacts, not stdout scraping. The harness invokes `graph_pipeline` with the same controls as before for run mode and CLI wiring.

Description

- Builds `graph_pipeline` subprocess commands from harness config (including `run_mode` and resource flags as determined by preset, YAML, env, and `use_heuristics`).
- Keeps the `results.json` / session layout compatible with existing consumers.
- Hardens `latest_commit` resolution (gitdir / packed-refs–friendly) in harness artifacts helpers.
- `test_config.tuning` continues to record the merged harness configuration used for the run.
- `metrics` and `summary_metrics` contain only performance and evaluation fields from the runtime summary.
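Under this contract, a `results.json` payload might look roughly like the following. The only field names taken from this PR are `metrics`, `summary_metrics`, `runtime_summary`, `pages`, PPS, `recall_5`, and `ndcg_10`; everything else (values, nesting, `run_mode` echo) is an assumption for illustration.

```python
import json

# Hypothetical results.json shape after this PR (values are made up).
results = {
    "metrics": {
        "pages": 120,
        "ingest_secs": 95.4,
        "evaluation_metrics": {"recall_5": 0.91, "ndcg_10": 0.87},
    },
    "summary_metrics": {
        "pages": 120,
        "pps": 120 / 95.4,  # pages per second
        "recall_5": 0.91,
        "ndcg_10": 0.87,
    },
    # Assumed: the raw graph-pipeline runtime summary is carried alongside.
    "runtime_summary": {"run_mode": "batch"},
}
print(json.dumps(results, indent=2))
```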