harness: align graph runs with structured metrics artifacts #1803

jioffe502 merged 11 commits into NVIDIA:main
Conversation
Replace stdout metric parsing in harness run execution with structured metrics derived from graph pipeline runtime summary artifacts so results are more stable and reviewer-friendly. Preserve existing artifact shape while adding run-mode/auto-tuning config coverage and updating harness tests for batch-first structured sourcing. Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
Delete the legacy harness stream-metric parser and its tests now that harness metrics are sourced from runtime summary artifacts. Keep PTY child-process output streaming for user job-progress visibility while slimming dead parsing code. Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
- harden latest_commit fallback for gitdir and packed-refs environments
- expose resolved auto-heuristic graph flags in runtime and harness artifacts
- trim effective_tuning payload while preserving key runtime context
Greptile Summary

This PR replaces the harness's fragile stdout-scraping (`StreamMetrics`/`parse_stream_text`) with structured metrics read from the graph pipeline's `*.runtime.summary.json` artifacts.
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/harness/run.py | Core change: removes StreamMetrics/stdout parsing and replaces with structured runtime-summary-driven metrics; adds --run-mode to _build_command; introduces _build_structured_metrics_payload; files metric is now hardcoded to None |
| nemo_retriever/src/nemo_retriever/harness/artifacts.py | Adds gitdir-file / packed-refs aware commit resolution; multiple except Exception blocks that silently return None violate the no-bare-except rule |
| nemo_retriever/src/nemo_retriever/harness/config.py | Adds run_mode field with VALID_RUN_MODES constant, validation, and HARNESS_RUN_MODE env override; clean and well-tested |
| nemo_retriever/src/nemo_retriever/harness/parsers.py | Deleted: stdout-scraping StreamMetrics/parse_stream_text replaced entirely by structured runtime-summary extraction |
| nemo_retriever/src/nemo_retriever/utils/ray_resource_hueristics.py | Adds _coerce_count helper to safely convert Ray's potentially-fractional resource counts to int; clean fix with test coverage |
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[HarnessConfig\nrun_mode, tuning flags] --> B[_build_command\nadds --run-mode]
    B --> C[graph_pipeline subprocess\nwrites *.runtime.summary.json]
    C --> D{runtime summary\nexists?}
    D -- yes --> E[_build_structured_metrics_payload\nreads pages / ingest_secs / evaluation_metrics]
    D -- no --> F[all metrics = None]
    E --> G[_evaluate_run_outcome\nrecall_metrics / evaluation_metrics]
    E --> H[_resolve_summary_metrics\npages / PPS / recall_5 / ndcg_10]
    G --> I[results.json\nmetrics + summary_metrics\n+ runtime_summary]
    H --> I
    F --> G
    F --> H
```
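The structured branch of this flow can be sketched as follows. This is a simplified, hedged illustration: the real `_build_structured_metrics_payload` in `run.py` has a different signature, and the summary field names here are assumed from the review discussion.

```python
import json
from pathlib import Path
from typing import Any

# Key precedence for the page count, as described in this review.
_PAGE_KEYS = ("processed_pages", "num_pages", "input_pages")


def build_structured_metrics_payload(summary_path: Path) -> dict[str, Any]:
    """Derive harness metrics from a *.runtime.summary.json artifact.

    Returns all-None metrics when the summary is missing or unreadable,
    mirroring the 'no' branch of the flowchart above."""
    try:
        summary = json.loads(summary_path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return {"pages": None, "ingest_secs": None, "evaluation_metrics": None}
    pages = next(
        (summary[k] for k in _PAGE_KEYS if summary.get(k) is not None), None
    )
    return {
        "pages": pages,
        "ingest_secs": summary.get("ingest_secs"),
        "evaluation_metrics": summary.get("evaluation_metrics"),
    }
```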
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/harness/artifacts.py
Line: 37-44
Comment:
**Overly broad exception handling silently swallows errors**
`except Exception: return None` here (and in `_read_packed_ref` and `_read_head_commit`) catches far more than just I/O failures — including programming errors like `AttributeError` or `TypeError`. The intended guard is filesystem errors; prefer catching `(OSError, UnicodeDecodeError)` so logic bugs surface instead of silently falling back to `"unknown"`.
```suggestion
try:
    raw = dot_git.read_text(encoding="utf-8").strip()
except (OSError, UnicodeDecodeError):
    return None
```
The same tightening applies to the `except Exception` blocks in `_read_packed_ref` and `_read_head_commit`.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/harness/run.py
Line: 178-179
Comment:
**`files` metric is always `None` — consumers expecting a file count will silently regress**
The old `StreamMetrics`-based path populated `files` from the `[done] N files …` stdout line. The new structured payload hardcodes `"files": None` because the runtime-summary JSON has no equivalent field. The PR description says the `results.json` layout stays compatible with existing consumers; any consumer that previously read a non-None `files` value will see None without any warning.
If the `graph_pipeline` runtime summary does not (and will not) emit a file count, consider removing the `files` key from the schema entirely rather than keeping it as a perpetually-None placeholder, or document the intentional omission with a comment.
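If the key is dropped rather than kept as a perpetual-`None` placeholder, a minimal sketch might look like this (function and key names assumed for illustration, not the repo's actual helper):

```python
def build_metrics(pages, ingest_secs, evaluation_metrics):
    """Structured metrics payload without a 'files' key: the runtime summary
    reports no file count, so the key is omitted rather than emitted as None,
    making the absence explicit to downstream consumers."""
    return {
        "pages": pages,
        "ingest_secs": ingest_secs,
        "evaluation_metrics": evaluation_metrics,
        # NOTE: 'files' intentionally omitted -- graph_pipeline's runtime
        # summary has no equivalent field (see this review discussion).
    }
```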
---
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/harness/run.py
Line: 223-233
Comment:
**Redundant pages-fallback lookup is unreachable dead code**
`metrics_payload["pages"]` is already resolved from `runtime_summary` by `_build_structured_metrics_payload` using the same key precedence (`processed_pages` → `num_pages` → `input_pages`). Because both functions receive the same `runtime_summary` dict, `summary_metrics["pages"]` will equal `metrics_payload.get("pages")` and the `if summary_metrics["pages"] is None` branch in `_resolve_summary_metrics` can never find a value that `_build_structured_metrics_payload` missed. The block from line 223 to 233 is dead and can be removed to reduce confusion.
The code under review (`run.py`, lines 522-525):

```python
resolved_graph_flags = runtime_resolved_tuning.get("resolved_graph_flags")
if isinstance(resolved_graph_flags, dict):
    final_graph_flags = dict(resolved_graph_flags)
effective_tuning["graph_pipeline_flags"] = final_graph_flags
```
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/harness/run.py
Line: 522-525
Comment:
**`*_cpus_per_actor` keys silently dropped from `graph_pipeline_flags` on runtime override**
When `runtime_resolved_tuning` provides a `resolved_graph_flags` dict, `final_graph_flags` is completely replaced. `resolved_graph_flags` (built in `_resolve_runtime_tuning_summary`) includes `nemotron_parse_*` keys but does not include `page_elements_cpus_per_actor`, `ocr_cpus_per_actor`, or `embed_cpus_per_actor`. Those three fields are present in `requested_graph_flags` from `_resolve_effective_tuning` and are actually passed to the subprocess, but disappear from the `effective_tuning.graph_pipeline_flags` artifact after the replacement. Merging instead of replacing preserves them:
```suggestion
if isinstance(resolved_graph_flags, dict):
    final_graph_flags = {**final_graph_flags, **dict(resolved_graph_flags)}
effective_tuning["graph_pipeline_flags"] = final_graph_flags
```
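The merge-instead-of-replace semantics can be checked in isolation. The flag values below are made up, and `nemotron_parse_workers` is a placeholder for the `nemotron_parse_*` keys mentioned in the comment; the three `*_cpus_per_actor` names come from the comment itself.

```python
# Flags the harness actually passed to the subprocess (requested values).
final_graph_flags = {
    "page_elements_cpus_per_actor": 2,
    "ocr_cpus_per_actor": 2,
    "embed_cpus_per_actor": 1,
    "nemotron_parse_workers": 4,  # placeholder nemotron_parse_* key
}
# Subset echoed back by the runtime summary (no *_cpus_per_actor keys).
resolved_graph_flags = {"nemotron_parse_workers": 8}

# Replacement loses the CPU keys entirely:
replaced = dict(resolved_graph_flags)
assert "ocr_cpus_per_actor" not in replaced

# Merging keeps requested values while runtime-resolved ones win on overlap:
merged = {**final_graph_flags, **resolved_graph_flags}
assert merged["ocr_cpus_per_actor"] == 2
assert merged["nemotron_parse_workers"] == 8
```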
- add missing trailing newline to CONTRIBUTING.md
- add missing trailing newline to release notes markdown
Resolve conflicts: - run.py: omit tuning CLI flags when use_heuristics; otherwise pass effective_tuning (auto_tuning zeros vs configured values). - graph_pipeline.py: keep runtime tuning summary helpers and _count_input_units from main.
Runtime summaries may only return a subset of tuning keys; preserve requested effective values (e.g. auto-tuning CPU sentinels) when overlaying resolved_graph_flags.
…_tuning Remove harness results.json effective_tuning and runtime summary resolved_tuning so this PR stays focused on runtime-summary metrics (vs stdout). Tuning CLI construction unchanged; follow-up can reintroduce tuning capture with a single clear model.
…list Trim graph-refactor migration, git workflow, and obsolete batch/parser notes. Keep operator map, CLI, artifacts, and a short results.json contract. Add maintainer backlog and harness change checklist for agents.
Resolve harness conflicts: keep runtime-summary metrics path, add _resolve_store_uri and graph store CLI flags from main; restore slim HANDOFF with store note.
jperez999 left a comment
If the behavior in `_resolve_effective_tuning` is as I describe in my comment, I think the real problem is that we need to ensure that when we run the `graph_pipeline.py` module the defaults work correctly, wdyt?
```python
    return str(p)


def _resolve_effective_tuning(cfg: HarnessConfig) -> dict[str, int | float]:
```
Is this supposed to be a catch-all, so that if you don't set it, something still gets set? If so, what we are talking about is a default. Shouldn't that be in the actual graph_pipeline.py defaults so they are always picked up?
Good point. _resolve_effective_tuning() was new in this PR, and I removed that helper plus the harness auto_tuning surface in the latest push to keep this change focused on the structured metric flow. I agree the broader default-ownership question belongs in graph_pipeline.py as follow-up work, especially if we want to preserve selective harness overrides later.
Keep this PR focused on structured runtime summary metrics and run_mode plumbing. Leave graph_pipeline default ownership and selective harness overrides for follow-up work. Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
TL;DR
Aligns the retriever harness with the graph pipeline and makes run artifacts easier to trust: metrics come from structured `*.runtime.summary.json` artifacts, not stdout scraping. The harness invokes `graph_pipeline` with the same controls as before for run mode and CLI wiring.

Description

- Builds `graph_pipeline` subprocess commands from harness config (including `run_mode` and resource flags as determined by preset, YAML, env, and `use_heuristics`).
- Keeps the `results.json` / session layout compatible with existing consumers.
- Hardens `latest_commit` resolution (gitdir / packed-refs–friendly) in harness artifacts helpers.
- `test_config.tuning` continues to record the merged harness configuration used for the run.
- `metrics` and `summary_metrics` contain only performance and evaluation fields from the runtime summary.
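Under this contract, a `results.json` payload might look roughly like the following. The only field names taken from this PR are `metrics`, `summary_metrics`, `runtime_summary`, `pages`, PPS, `recall_5`, and `ndcg_10`; everything else (values, nesting, `run_mode` echo) is an assumption for illustration.

```python
import json

# Hypothetical results.json shape after this PR (values are made up).
results = {
    "metrics": {
        "pages": 120,
        "ingest_secs": 95.4,
        "evaluation_metrics": {"recall_5": 0.91, "ndcg_10": 0.87},
    },
    "summary_metrics": {
        "pages": 120,
        "pps": 120 / 95.4,  # pages per second
        "recall_5": 0.91,
        "ndcg_10": 0.87,
    },
    # Assumed: the raw graph-pipeline runtime summary is carried alongside.
    "runtime_summary": {"run_mode": "batch"},
}
print(json.dumps(results, indent=2))
```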