
harness: align graph runs with structured metrics artifacts#1803

Merged
jioffe502 merged 11 commits into NVIDIA:main from jioffe502:jioffe/graph-refactor-integration
Apr 10, 2026

Conversation

@jioffe502 jioffe502 commented Apr 7, 2026

TL;DR

Aligns the retriever harness with the graph pipeline and makes run artifacts easier to trust: metrics now come from structured `*.runtime.summary.json` artifacts rather than stdout scraping. The harness invokes graph_pipeline with the same run-mode and CLI controls as before.

Description

  • Build graph_pipeline subprocess commands from harness config (including run_mode and resource flags as determined by preset, YAML, env, and use_heuristics).
  • Replace stdout metric parsing with runtime-summary-driven metric extraction; remove the obsolete parser module and its tests; keep results.json / session layout compatible with existing consumers.
  • Improve artifact provenance by hardening latest_commit resolution (gitdir / packed-refs–friendly) in harness artifacts helpers.
  • test_config.tuning continues to record the merged harness configuration used for the run. metrics and summary_metrics contain only performance and evaluation fields from the runtime summary.

Replace stdout metric parsing in harness run execution with structured metrics derived from graph pipeline runtime summary artifacts so results are more stable and reviewer-friendly. Preserve existing artifact shape while adding run-mode/auto-tuning config coverage and updating harness tests for batch-first structured sourcing.

Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
Delete the legacy harness stream-metric parser and its tests now that harness metrics are sourced from runtime summary artifacts. Keep PTY child-process output streaming for user job-progress visibility while slimming dead parsing code.

Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
- harden latest_commit fallback for gitdir and packed-refs environments
- expose resolved auto-heuristic graph flags in runtime and harness artifacts
- trim effective_tuning payload while preserving key runtime context
@jioffe502 jioffe502 requested review from a team as code owners April 7, 2026 13:12
@jioffe502 jioffe502 requested a review from nkmcalli April 7, 2026 13:12
@jioffe502 jioffe502 marked this pull request as draft April 7, 2026 13:14
@jioffe502 jioffe502 changed the title from "harness: simplify structured metrics and tuned flag reporting" to "harness: align graph runs with structured metrics artifacts" Apr 7, 2026

greptile-apps bot commented Apr 7, 2026

Greptile Summary

This PR replaces the harness's fragile stdout-scraping (StreamMetrics/parse_stream_text) with structured *.runtime.summary.json-driven metric extraction and wires --run-mode through to the graph_pipeline subprocess. Secondary improvements include a gitdir/packed-refs-aware last_commit() fallback and fractional Ray resource coercion in gather_cluster_resources. Test coverage is thorough and the results.json schema is kept additive.

Confidence Score: 5/5

Safe to merge; all remaining findings are P2 style/cleanup suggestions with no correctness impact.

The refactor is well-structured and test coverage is solid. The three findings are P2: broad exception handling that could be tightened to OSError/UnicodeDecodeError, a files: None placeholder that may surprise downstream consumers, and dead code in the pages-fallback path of _resolve_summary_metrics. None of these cause incorrect runtime behavior.

nemo_retriever/src/nemo_retriever/harness/artifacts.py (exception specificity) and nemo_retriever/src/nemo_retriever/harness/run.py (files: None + dead fallback code)

Important Files Changed

- `nemo_retriever/src/nemo_retriever/harness/run.py` — Core change: removes `StreamMetrics`/stdout parsing and replaces it with structured runtime-summary-driven metrics; adds `--run-mode` to `_build_command`; introduces `_build_structured_metrics_payload`; the `files` metric is now hardcoded to `None`.
- `nemo_retriever/src/nemo_retriever/harness/artifacts.py` — Adds gitdir-file / packed-refs-aware commit resolution; multiple `except Exception` blocks that silently return `None` violate the no-bare-except rule.
- `nemo_retriever/src/nemo_retriever/harness/config.py` — Adds a `run_mode` field with a `VALID_RUN_MODES` constant, validation, and a `HARNESS_RUN_MODE` env override; clean and well-tested.
- `nemo_retriever/src/nemo_retriever/harness/parsers.py` — Deleted: the stdout-scraping `StreamMetrics`/`parse_stream_text` is replaced entirely by structured runtime-summary extraction.
- `nemo_retriever/src/nemo_retriever/utils/ray_resource_hueristics.py` — Adds a `_coerce_count` helper to safely convert Ray's potentially-fractional resource counts to `int`; clean fix with test coverage.

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[HarnessConfig\nrun_mode, tuning flags] --> B[_build_command\nadds --run-mode]
    B --> C[graph_pipeline subprocess\nwrites *.runtime.summary.json]
    C --> D{runtime summary\nexists?}
    D -- yes --> E[_build_structured_metrics_payload\nreads pages / ingest_secs / evaluation_metrics]
    D -- no --> F[all metrics = None]
    E --> G[_evaluate_run_outcome\nrecall_metrics / evaluation_metrics]
    E --> H[_resolve_summary_metrics\npages / PPS / recall_5 / ndcg_10]
    G --> I[results.json\nmetrics + summary_metrics\n+ runtime_summary]
    H --> I
    F --> G
    F --> H
```
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/harness/artifacts.py
Line: 37-44

Comment:
**Overly broad exception handling silently swallows errors**

`except Exception: return None` here (and in `_read_packed_ref` and `_read_head_commit`) catches far more than just I/O failures — including programming errors like `AttributeError` or `TypeError`. The intended guard is filesystem errors; prefer catching `(OSError, UnicodeDecodeError)` so logic bugs surface instead of silently falling back to `"unknown"`.

```suggestion
    try:
        raw = dot_git.read_text(encoding="utf-8").strip()
    except (OSError, UnicodeDecodeError):
        return None
```

The same tightening applies to the `except Exception` blocks in `_read_packed_ref` and `_read_head_commit`.


---

Path: nemo_retriever/src/nemo_retriever/harness/run.py
Line: 178-179

Comment:
**`files` metric is always `None` — consumers expecting a file count will silently regress**

The old `StreamMetrics`-based path populated `files` from the `[done] N files …` stdout line. The new structured payload hardcodes `"files": None` because the runtime-summary JSON has no equivalent field. The PR description says the `results.json` layout stays compatible with existing consumers; any consumer that previously read a non-None `files` value will see None without any warning.

If the `graph_pipeline` runtime summary does not (and will not) emit a file count, consider removing the `files` key from the schema entirely rather than keeping it as a perpetually-None placeholder, or document the intentional omission with a comment.


---

Path: nemo_retriever/src/nemo_retriever/harness/run.py
Line: 223-233

Comment:
**Redundant pages-fallback lookup is unreachable dead code**

`metrics_payload["pages"]` is already resolved from `runtime_summary` by `_build_structured_metrics_payload` using the same key precedence (`processed_pages``num_pages``input_pages`). Because both functions receive the same `runtime_summary` dict, `summary_metrics["pages"]` will equal `metrics_payload.get("pages")` and the `if summary_metrics["pages"] is None` branch in `_resolve_summary_metrics` can never find a value that `_build_structured_metrics_payload` missed. The block from line 223 to 233 is dead and can be removed to reduce confusion.



Comment on lines +522 to +525
```python
    resolved_graph_flags = runtime_resolved_tuning.get("resolved_graph_flags")
    if isinstance(resolved_graph_flags, dict):
        final_graph_flags = dict(resolved_graph_flags)
        effective_tuning["graph_pipeline_flags"] = final_graph_flags
```

P2: `*_cpus_per_actor` keys silently dropped from `graph_pipeline_flags` on runtime override

When `runtime_resolved_tuning` provides a `resolved_graph_flags` dict, `final_graph_flags` is completely replaced. `resolved_graph_flags` (built in `_resolve_runtime_tuning_summary`) includes `nemotron_parse_*` keys but does not include `page_elements_cpus_per_actor`, `ocr_cpus_per_actor`, or `embed_cpus_per_actor`. Those three fields are present in `requested_graph_flags` from `_resolve_effective_tuning` and are actually passed to the subprocess, but disappear from the `effective_tuning.graph_pipeline_flags` artifact after the replacement. Merging instead of replacing preserves them:

```suggestion
    if isinstance(resolved_graph_flags, dict):
        final_graph_flags = {**final_graph_flags, **dict(resolved_graph_flags)}
        effective_tuning["graph_pipeline_flags"] = final_graph_flags
```

- add missing trailing newline to CONTRIBUTING.md
- add missing trailing newline to release notes markdown
@jioffe502 jioffe502 marked this pull request as ready for review April 7, 2026 14:53
@jioffe502 jioffe502 requested review from jperez999 and removed request for nkmcalli April 7, 2026 14:53
Resolve conflicts:
- run.py: omit tuning CLI flags when use_heuristics; otherwise pass
  effective_tuning (auto_tuning zeros vs configured values).
- graph_pipeline.py: keep runtime tuning summary helpers and
  _count_input_units from main.
Runtime summaries may only return a subset of tuning keys; preserve
requested effective values (e.g. auto-tuning CPU sentinels) when
overlaying resolved_graph_flags.
…_tuning

Remove harness results.json effective_tuning and runtime summary
resolved_tuning so this PR stays focused on runtime-summary metrics
(vs stdout). Tuning CLI construction unchanged; follow-up can reintroduce
tuning capture with a single clear model.
jioffe502 and others added 3 commits April 7, 2026 15:23
…list

Trim graph-refactor migration, git workflow, and obsolete batch/parser
notes. Keep operator map, CLI, artifacts, and a short results.json
contract. Add maintainer backlog and harness change checklist for agents.
Resolve harness conflicts: keep runtime-summary metrics path, add
_resolve_store_uri and graph store CLI flags from main; restore slim
HANDOFF with store note.
@jperez999 jperez999 left a comment


If the behavior in `_resolve_effective_tuning` is as I describe in my comment, I think the real problem is that we need to ensure that when we run the graph_pipeline.py module the defaults work correctly, wdyt?

```python
    return str(p)


def _resolve_effective_tuning(cfg: HarnessConfig) -> dict[str, int | float]:
```

Is this supposed to be a catch-all, so that if you don't set it, something still gets set? If so, what we are talking about is a default. Shouldn't that be in the actual graph_pipeline.py defaults so they are always picked up?

@jioffe502 (Collaborator, Author) replied:

Good point. _resolve_effective_tuning() was new in this PR, and I removed that helper plus the harness auto_tuning surface in the latest push to keep this change focused on the structured metric flow. I agree the broader default-ownership question belongs in graph_pipeline.py as follow-up work, especially if we want to preserve selective harness overrides later.

Keep this PR focused on structured runtime summary metrics and run_mode plumbing. Leave graph_pipeline default ownership and selective harness overrides for follow-up work.

Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
@jioffe502 jioffe502 merged commit 2f7b1bb into NVIDIA:main Apr 10, 2026
7 checks passed