Skip to content

feat: support component-based recommended actions#1124

Open
alexscjundev wants to merge 8 commits intoNVIDIA:mainfrom
alexscjundev:ajun/4_6_remediate_with_type
Open

feat: support component-based recommended actions#1124
alexscjundev wants to merge 8 commits intoNVIDIA:mainfrom
alexscjundev:ajun/4_6_remediate_with_type

Conversation

@alexscjundev
Copy link
Copy Markdown

@alexscjundev alexscjundev commented Apr 7, 2026

Signed-off-by: Alex Jun aljun@nvidia.com

Summary

Support ComponentRemediationActions in fault remediation, which maps from component type and recommended action to created CR.

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • New Features

    • Component-class-specific remediation: define per-component remediation actions that override global actions with automatic fallback.
  • Documentation

    • Docs updated to describe component-scoped configuration, resolution precedence, template variables, and new templates for component-specific workflows.
  • Behavior

    • Remediation tracking now records component class to improve deduplication and ensure correct action/template selection.
  • Tests

    • Added/updated tests and e2e coverage for component-scoped resolution and template loading.

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 7, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 7, 2026

📝 Walkthrough

Walkthrough

Adds component-class-scoped remediation actions with fallback to shared actions; updates Helm values/templates, TOML config, resolution helpers, template loading, annotation state, CR-status checks, reconciler flows, and tests to be component-aware.

Changes

Cohort / File(s) Summary
Helm templates & values
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/configmap.yaml, distros/kubernetes/nvsentinel/charts/fault-remediation/values.yaml, distros/kubernetes/nvsentinel/values-full.yaml
Add maintenance.componentActions values and render [componentRemediationActions.<componentClass>.<actionName>] blocks in generated config; document resolution precedence (component override → shared).
Documentation
docs/fault-remediation.md, docs/configuration/fault-remediation.md
Document new componentActions mapping, resolution order, new templates (lpu-remediation-template.yaml, rebootnode-template.yaml), and examples for GPU/LPU overrides.
Config types & validation
fault-remediation/pkg/config/config.go, fault-remediation/pkg/config/config_test.go
Add ComponentRemediationActions and TOML field; implement resolution helpers (ResolveMaintenanceResource, ResolveStoredMaintenanceResource, SharedActionNames, ComponentActionNames, ResolvedActionKey); extend validation to include component-scoped actions and combined equivalence-group checks.
Remediation client & templates
fault-remediation/pkg/remediation/remediation.go, fault-remediation/pkg/remediation/remediation_test.go
Preload templates for component-aware action keys (componentClass/actionName), add loadTemplateForAction, validate namespaced component actions, select actions via ResolveMaintenanceResource, and propagate componentClass into annotation updates and CR creation.
Equivalence group logic & tests
fault-remediation/pkg/common/equivalence_groups.go, .../equivalence_groups_test.go
Change GetGroupConfigForEvent to accept *config.TomlConfig and resolve action via ResolveMaintenanceResource; update tests to cover component overrides and resulting group naming/impacted-entity resolution.
Annotation state & interface
fault-remediation/pkg/annotation/annotation.go, .../annotation_interface.go, .../annotation_test.go
Add ComponentClass to EquivalenceGroupState, update UpdateRemediationState signature to accept componentClass, and update tests/fixtures to persist and assert component class.
CR status checking & interface
fault-remediation/pkg/crstatus/checker.go, .../crstatus_interface.go, .../crstatus_test.go
Checker now holds *config.TomlConfig; update ShouldSkipCRCreation signature to include componentClass and resolve stored actions with ResolveStoredMaintenanceResource; adjust GET error handling and add component-aware tests.
Reconciler & tests
fault-remediation/pkg/reconciler/reconciler.go, .../reconciler_e2e_test.go, .../reconciler_test.go
Pass full TOML config into group resolution, include ComponentClass in skip checks, and update mocks/tests to support component-aware status checker and annotation manager behaviors.

Sequence Diagram

sequenceDiagram
    participant Event as HealthEvent\n(ComponentClass)
    participant Reconciler as Reconciler
    participant Config as TomlConfig\n(ResolveMaintenanceResource)
    participant Checker as CRStatusChecker
    participant Remediator as RemediationClient
    participant Annotator as AnnotationManager

    Event->>Reconciler: receive HealthEvent
    Reconciler->>Config: ResolveMaintenanceResource(ComponentClass, ActionName)
    alt component-specific found
        Config-->>Reconciler: MaintenanceResource (component-scoped)
    else fallback to shared
        Config-->>Reconciler: MaintenanceResource (shared)
    end
    Reconciler->>Checker: ShouldSkipCRCreation(ComponentClass, ActionName, crName)
    Checker->>Config: ResolveStoredMaintenanceResource(ComponentClass, ActionName)
    alt in-progress matches component
        Checker-->>Reconciler: skip creation
    else not in-progress / ambiguous
        Checker-->>Reconciler: create CR
    end
    Reconciler->>Remediator: CreateMaintenanceResource(HealthEvent, ComponentClass)
    Remediator->>Remediator: select template by ResolvedActionKey(ComponentClass/ActionName)
    Remediator->>Annotator: UpdateRemediationState(..., ComponentClass)
    Annotator-->>Remediator: state updated
    Remediator-->>Reconciler: CR created
Loading

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~55 minutes

Poem

🐰 I hopped through TOML, stitched templates with glee,

GPU and LPU each get a key.
Actions now know which class to mend,
Annotations remember every friend,
A carrot-cheer for fixes set free!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the primary change: introducing support for component-based recommended actions in fault remediation. It directly relates to the main objective and changeset.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@alexscjundev alexscjundev force-pushed the ajun/4_6_remediate_with_type branch from ffe4de5 to 93815ff Compare April 7, 2026 17:22
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (7)
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/configmap.yaml (1)

51-73: Extract the repeated resource renderer into a helper.

This block mirrors the maintenance.actions TOML emission almost field-for-field. Now that both paths exist, the next MaintenanceResource change is easy to land in one block and miss in the other.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@distros/kubernetes/nvsentinel/charts/fault-remediation/templates/configmap.yaml`
around lines 51 - 73, The repeated TOML emission for component remediation
resources (the range over .Values.maintenance.componentActions producing the
[componentRemediationActions...] block) should be extracted into a Helm helper
so both this file and the existing maintenance.actions emitter can call the same
renderer; create a named template (e.g. "maintenance.renderResource" or
"maintenance.componentRemediationRenderer") that accepts the componentClass,
actionName and config and emits the identical fields (apiGroup, version, kind,
scope, namespace, completeConditionType, templateFileName, equivalenceGroup,
impactedEntityScope, supersedingEquivalenceGroups) with the same
defaults/quoting/json logic, then replace the inline block in this configmap and
the other maintenance.actions emitter to call that helper with include and
passing the context.
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)

852-853: Add one real component-override integration case.

This test now routes through the component-aware resolver, but the fixture still only builds shared RemediationActions and the health events here do not populate ComponentClass. As written, the suite still doesn't validate maintenance.componentActions or component-aware dedup/annotation behavior end-to-end.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fault-remediation/pkg/reconciler/reconciler_e2e_test.go` around lines 852 -
853, The test routes through the component-aware resolver but the fixture only
builds shared RemediationActions and the HealthEvent used (event1.HealthEvent)
does not set ComponentClass, so the suite never exercises
maintenance.componentActions or component-aware dedup/annotation. Fix by adding
a real component-override integration case: create a HealthEvent with
ComponentClass populated, extend the test fixture to build component-specific
RemediationActions (not just shared ones) and verify behavior after calling
common.GetGroupConfigForEvent(cfg.RemediationClient.GetConfig(),
event1.HealthEvent) that maintenance.componentActions and component-aware
dedup/annotation behave as expected; reference the test in reconciler_e2e_test,
the event variable event1.HealthEvent, the RemediationActions fixtures, and the
maintenance.componentActions checks to locate where to add the case.
fault-remediation/pkg/crstatus/checker.go (1)

49-51: Minor: Dry-run log message could be clearer.

The log message "CR doesn't exist (dry-run mode)" is slightly misleading since we're not actually checking if the CR exists in dry-run mode—we're just skipping the check entirely.

📝 Suggested improvement
 if c.dryRun {
-    slog.Info("DRY-RUN: CR doesn't exist (dry-run mode)", "crName", crName, "action", actionName)
+    slog.Info("DRY-RUN: Skipping CR existence check", "crName", crName, "action", actionName)
     return false
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fault-remediation/pkg/crstatus/checker.go` around lines 49 - 51, The dry-run
log is misleading — update the slog.Info call in the checker where c.dryRun is
checked (the block referencing crName and actionName) to state that the
existence check was skipped due to dry-run mode (e.g., "DRY-RUN: skipping CR
existence check") and include crName and actionName as before so the log clearly
indicates the check was not performed rather than reporting a non-existence.
docs/configuration/fault-remediation.md (1)

105-112: Verify the LPU template metadata access pattern.

The LPU remediation template uses {{ index .HealthEvent.Metadata "device" }} which assumes:

  1. HealthEvent.Metadata is a map
  2. It contains a "device" key

Consider adding a comment or note clarifying that this is an illustrative example and operators should adapt the template based on their actual health event structure and LPU CRD requirements.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/configuration/fault-remediation.md` around lines 105 - 112, The template
assumes .HealthEvent.Metadata is a map and contains a "device" key; update the
LPU remediation template text to include a short comment above the snippet
stating this is illustrative and operators must adapt the field access to their
actual HealthEvent shape and LPU CRD (e.g., replace {{ index
.HealthEvent.Metadata "device" }} with the correct path or guarded lookup in
your environment). Mention the two explicit assumptions (map type and presence
of "device") and recommend using a conditional or appropriate key lookup when
implementing in production so the template doesn’t panic on missing keys.
fault-remediation/pkg/config/config.go (2)

75-76: Consider adding type documentation.

The ComponentRemediationActions type would benefit from a brief doc comment explaining its structure (componentClass → actionName → MaintenanceResource mapping).

📝 Suggested documentation
+// ComponentRemediationActions maps component classes to their action-specific remediation
+// resources. The outer key is the componentClass (e.g., "GPU", "LPU"), and the inner key
+// is the action name (e.g., "COMPONENT_RESET").
 type ComponentRemediationActions map[string]map[string]MaintenanceResource
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fault-remediation/pkg/config/config.go` around lines 75 - 76, Add a brief doc
comment immediately above the ComponentRemediationActions type declaration
describing that it maps component class (string) to a map of action names
(string) to MaintenanceResource values (i.e., componentClass → actionName →
MaintenanceResource), so readers understand the nested map structure and
intended usage; update the comment near the type `ComponentRemediationActions`
to reflect this explanation.

236-251: Minor capacity estimation note.

Line 238's capacity calculation uses len(c.ComponentRemediationActions) which counts component classes, not total component actions. This is a minor inefficiency (extra allocations if there are many actions per component class) but has negligible impact.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fault-remediation/pkg/config/config.go` around lines 236 - 251, The capacity
preallocation in allMaintenanceResources overestimates because it uses
len(c.ComponentRemediationActions) (the number of component keys) instead of the
total number of actions; in the allMaintenanceResources function compute the
true total by summing len(actions) for each entry in
c.ComponentRemediationActions (e.g., iterate the map/slice to accumulate
totalComponentActions) and then create allResources with capacity
len(c.RemediationActions)+totalComponentActions so append won’t cause extra
reallocations.
fault-remediation/pkg/config/config_test.go (1)

22-24: Consider using standard testing package for simple assertions.

The retrieved learning indicates that in this repository, the standard testing package assertions (t.Error, t.Errorf, t.Fatal) should be preferred for simple equality/inequality checks, reserving testify for complex scenarios. While TestTomlConfig_ResolveMaintenanceResource correctly uses standard assertions, TestResolvedActionKeyAndActionNameHelpers and TestTomlConfig_ResolveStoredMaintenanceResource use testify/assert for straightforward equality checks.

Based on learnings: "In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fault-remediation/pkg/config/config_test.go` around lines 22 - 24, Replace
testify usage in TestResolvedActionKeyAndActionNameHelpers and
TestTomlConfig_ResolveStoredMaintenanceResource with standard testing methods:
remove the "github.com/stretchr/testify/assert" import, use the testing.T
methods (t.Fatalf/t.Errorf/t.Helper as appropriate) for equality and error
checks, and update the assertions inside the functions (e.g., where
assert.Equal/assert.NoError/assert.Nil are used) to equivalent t.Fatalf/t.Errorf
calls that include expected vs actual values; keep
TestTomlConfig_ResolveMaintenanceResource unchanged as it already uses the
standard testing package.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In
`@distros/kubernetes/nvsentinel/charts/fault-remediation/templates/configmap.yaml`:
- Around line 51-73: The repeated TOML emission for component remediation
resources (the range over .Values.maintenance.componentActions producing the
[componentRemediationActions...] block) should be extracted into a Helm helper
so both this file and the existing maintenance.actions emitter can call the same
renderer; create a named template (e.g. "maintenance.renderResource" or
"maintenance.componentRemediationRenderer") that accepts the componentClass,
actionName and config and emits the identical fields (apiGroup, version, kind,
scope, namespace, completeConditionType, templateFileName, equivalenceGroup,
impactedEntityScope, supersedingEquivalenceGroups) with the same
defaults/quoting/json logic, then replace the inline block in this configmap and
the other maintenance.actions emitter to call that helper with include and
passing the context.

In `@docs/configuration/fault-remediation.md`:
- Around line 105-112: The template assumes .HealthEvent.Metadata is a map and
contains a "device" key; update the LPU remediation template text to include a
short comment above the snippet stating this is illustrative and operators must
adapt the field access to their actual HealthEvent shape and LPU CRD (e.g.,
replace {{ index .HealthEvent.Metadata "device" }} with the correct path or
guarded lookup in your environment). Mention the two explicit assumptions (map
type and presence of "device") and recommend using a conditional or appropriate
key lookup when implementing in production so the template doesn’t panic on
missing keys.

In `@fault-remediation/pkg/config/config_test.go`:
- Around line 22-24: Replace testify usage in
TestResolvedActionKeyAndActionNameHelpers and
TestTomlConfig_ResolveStoredMaintenanceResource with standard testing methods:
remove the "github.com/stretchr/testify/assert" import, use the testing.T
methods (t.Fatalf/t.Errorf/t.Helper as appropriate) for equality and error
checks, and update the assertions inside the functions (e.g., where
assert.Equal/assert.NoError/assert.Nil are used) to equivalent t.Fatalf/t.Errorf
calls that include expected vs actual values; keep
TestTomlConfig_ResolveMaintenanceResource unchanged as it already uses the
standard testing package.

In `@fault-remediation/pkg/config/config.go`:
- Around line 75-76: Add a brief doc comment immediately above the
ComponentRemediationActions type declaration describing that it maps component
class (string) to a map of action names (string) to MaintenanceResource values
(i.e., componentClass → actionName → MaintenanceResource), so readers understand
the nested map structure and intended usage; update the comment near the type
`ComponentRemediationActions` to reflect this explanation.
- Around line 236-251: The capacity preallocation in allMaintenanceResources
overestimates because it uses len(c.ComponentRemediationActions) (the number of
component keys) instead of the total number of actions; in the
allMaintenanceResources function compute the true total by summing len(actions)
for each entry in c.ComponentRemediationActions (e.g., iterate the map/slice to
accumulate totalComponentActions) and then create allResources with capacity
len(c.RemediationActions)+totalComponentActions so append won’t cause extra
reallocations.

In `@fault-remediation/pkg/crstatus/checker.go`:
- Around line 49-51: The dry-run log is misleading — update the slog.Info call
in the checker where c.dryRun is checked (the block referencing crName and
actionName) to state that the existence check was skipped due to dry-run mode
(e.g., "DRY-RUN: skipping CR existence check") and include crName and actionName
as before so the log clearly indicates the check was not performed rather than
reporting a non-existence.

In `@fault-remediation/pkg/reconciler/reconciler_e2e_test.go`:
- Around line 852-853: The test routes through the component-aware resolver but
the fixture only builds shared RemediationActions and the HealthEvent used
(event1.HealthEvent) does not set ComponentClass, so the suite never exercises
maintenance.componentActions or component-aware dedup/annotation. Fix by adding
a real component-override integration case: create a HealthEvent with
ComponentClass populated, extend the test fixture to build component-specific
RemediationActions (not just shared ones) and verify behavior after calling
common.GetGroupConfigForEvent(cfg.RemediationClient.GetConfig(),
event1.HealthEvent) that maintenance.componentActions and component-aware
dedup/annotation behave as expected; reference the test in reconciler_e2e_test,
the event variable event1.HealthEvent, the RemediationActions fixtures, and the
maintenance.componentActions checks to locate where to add the case.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 26a73413-e505-44ec-bc6f-06a527ddbe84

📥 Commits

Reviewing files that changed from the base of the PR and between 065709f and ffe4de5.

📒 Files selected for processing (20)
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/configmap.yaml
  • distros/kubernetes/nvsentinel/charts/fault-remediation/values.yaml
  • distros/kubernetes/nvsentinel/values-full.yaml
  • docs/configuration/fault-remediation.md
  • docs/fault-remediation.md
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/common/equivalence_groups.go
  • fault-remediation/pkg/common/equivalence_groups_test.go
  • fault-remediation/pkg/config/config.go
  • fault-remediation/pkg/config/config_test.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/remediation_test.go

Signed-off-by: Alex Jun <aljun@nvidia.com>
Signed-off-by: Alex Jun <aljun@nvidia.com>
Signed-off-by: Alex Jun <aljun@nvidia.com>
Signed-off-by: Alex Jun <aljun@nvidia.com>
Signed-off-by: Alex Jun <aljun@nvidia.com>
Signed-off-by: Alex Jun <aljun@nvidia.com>
…onsume it in ShouldSkipCRCreation

Previously, ShouldSkipCRCreation resolved the stored CR type using the current health event’s ComponentClass plus the annotation’s ActionName. That was wrong when the existing CR had been created by a different component-specific override, because the lookup could miss the real in-progress CR and allow a duplicate remediation. The fix records the creating ComponentClass in EquivalenceGroupState and uses that stored value when checking existing CR status.

For backward compatibility, older annotations without ComponentClass still resolve when the stored ActionName remains unambiguous in the current config. If newer component-specific overrides make that mapping ambiguous, we treat the legacy annotation as stale rather than guessing.

Signed-off-by: Alex Jun <aljun@nvidia.com>
Signed-off-by: Alex Jun <aljun@nvidia.com>
@alexscjundev alexscjundev force-pushed the ajun/4_6_remediate_with_type branch from 93815ff to 358fd3f Compare April 7, 2026 17:33
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
fault-remediation/pkg/crstatus/checker.go (1)

54-61: Consider lowering log level for expected legacy annotation scenarios.

When componentClass is empty (legacy annotations) and the action only exists in ComponentRemediationActions, ResolveStoredMaintenanceResource intentionally returns false to treat the mapping as stale. This is documented behavior per config.go:310-333.

However, the slog.Error at line 56 may generate noise for this expected scenario. Consider using slog.Warn or slog.Debug to distinguish between genuine misconfigurations and expected backward-compatibility handling.

♻️ Suggested change to reduce log noise
 	resource, found := c.remediationConfig.ResolveStoredMaintenanceResource(componentClass, actionName)
 	if !found {
-		slog.Error("No remediation configuration found for action",
+		slog.Warn("No remediation configuration found for stored action (may be stale annotation)",
 			"action", actionName,
 			"componentClass", componentClass)
 
 		return false
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fault-remediation/pkg/crstatus/checker.go` around lines 54 - 61, The
slog.Error call after calling
remediationConfig.ResolveStoredMaintenanceResource(componentClass, actionName)
is too noisy for expected legacy cases where componentClass is empty; change the
log level to slog.Warn (or slog.Debug if even less noise is desired) and adjust
the message to indicate this is an expected/backward-compatibility fallback
(include "action" and "componentClass" as currently logged) so
ResolveStoredMaintenanceResource and the surrounding conditional still return
false but with a less severe log level.
docs/fault-remediation.md (1)

94-101: Consider documenting missing metadata handling in templates.

The lpu-remediation-template.yaml uses {{ index .HealthEvent.Metadata "device" }} to access the device from metadata. This is valid Go template syntax, but if the "device" key is missing from Metadata, it will render as an empty string.

Consider adding a note about required metadata fields for each component type, or using a defensive template pattern.

📝 Optional: Defensive template example
# In documentation, note that templates should handle missing keys gracefully:
spec:
  nodeName: {{ .HealthEvent.NodeName }}
  {{- if index .HealthEvent.Metadata "device" }}
  device: {{ index .HealthEvent.Metadata "device" }}
  {{- end }}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/fault-remediation.md` around lines 94 - 101, Update the docs to explain
that lpu-remediation-template.yaml’s use of {{ index .HealthEvent.Metadata
"device" }} will yield an empty string if the "device" key is absent; either (a)
list required metadata keys (e.g., "device") for LPURemediation and HealthEvent
consumers like .HealthEvent.NodeName and .HealthEventID, or (b) show a defensive
Go-template pattern (using an if test around index .HealthEvent.Metadata
"device") so templates omit the device field when missing—include both the
required-fields note and the defensive-template example for clarity.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@docs/fault-remediation.md`:
- Around line 94-101: Update the docs to explain that
lpu-remediation-template.yaml’s use of {{ index .HealthEvent.Metadata "device"
}} will yield an empty string if the "device" key is absent; either (a) list
required metadata keys (e.g., "device") for LPURemediation and HealthEvent
consumers like .HealthEvent.NodeName and .HealthEventID, or (b) show a defensive
Go-template pattern (using an if test around index .HealthEvent.Metadata
"device") so templates omit the device field when missing—include both the
required-fields note and the defensive-template example for clarity.

In `@fault-remediation/pkg/crstatus/checker.go`:
- Around line 54-61: The slog.Error call after calling
remediationConfig.ResolveStoredMaintenanceResource(componentClass, actionName)
is too noisy for expected legacy cases where componentClass is empty; change the
log level to slog.Warn (or slog.Debug if even less noise is desired) and adjust
the message to indicate this is an expected/backward-compatibility fallback
(include "action" and "componentClass" as currently logged) so
ResolveStoredMaintenanceResource and the surrounding conditional still return
false but with a less severe log level.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 63af7997-1405-4c6b-94e0-9de1df22250c

📥 Commits

Reviewing files that changed from the base of the PR and between ffe4de5 and 93815ff.

📒 Files selected for processing (20)
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/configmap.yaml
  • distros/kubernetes/nvsentinel/charts/fault-remediation/values.yaml
  • distros/kubernetes/nvsentinel/values-full.yaml
  • docs/configuration/fault-remediation.md
  • docs/fault-remediation.md
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/common/equivalence_groups.go
  • fault-remediation/pkg/common/equivalence_groups_test.go
  • fault-remediation/pkg/config/config.go
  • fault-remediation/pkg/config/config_test.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/remediation_test.go
🚧 Files skipped from review as they are similar to previous changes (14)
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • distros/kubernetes/nvsentinel/values-full.yaml
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/configmap.yaml
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/common/equivalence_groups.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • docs/configuration/fault-remediation.md
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/common/equivalence_groups_test.go
  • fault-remediation/pkg/config/config.go
  • fault-remediation/pkg/config/config_test.go

@lalitadithya
Copy link
Copy Markdown
Collaborator

/ok to test 358fd3f

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
docs/fault-remediation.md (1)

53-75: Clarify component class key expectations (case/enum) to prevent config mismatches.

Nice addition. Consider adding one line that componentActions keys (e.g., "GPU", "LPU") must exactly match emitted componentClass values (case-sensitive), otherwise lookup falls back to shared actions. This will reduce operator misconfigurations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/fault-remediation.md` around lines 53 - 75, Add a clarifying sentence in
the docs stating that the keys under componentActions (e.g., "GPU", "LPU") must
exactly match the emitted componentClass values (case-sensitive) to avoid
falling back to shared actions; update the section near componentActions to
explicitly call out that componentClass matching is exact and case-sensitive and
that mismatches will cause the lookup to use shared/default actions instead.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/fault-remediation.md`:
- Around line 94-101: In the LPURemediation template under the
"lpu-remediation-template.yaml" snippet, the unquoted device value pulled via
index from HealthEvent.Metadata can produce YAML parsing issues; update the
spec.nodeName/device line to render the device value wrapped in double quotes
(i.e., quote the result of index .HealthEvent.Metadata "device") so the
generated LPURemediation manifest (kind: LPURemediation) safely handles device
identifiers.

---

Nitpick comments:
In `@docs/fault-remediation.md`:
- Around line 53-75: Add a clarifying sentence in the docs stating that the keys
under componentActions (e.g., "GPU", "LPU") must exactly match the emitted
componentClass values (case-sensitive) to avoid falling back to shared actions;
update the section near componentActions to explicitly call out that
componentClass matching is exact and case-sensitive and that mismatches will
cause the lookup to use shared/default actions instead.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fc88528c-e9ce-4d5e-ad59-3fe382e948e1

📥 Commits

Reviewing files that changed from the base of the PR and between 93815ff and 358fd3f.

📒 Files selected for processing (20)
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/configmap.yaml
  • distros/kubernetes/nvsentinel/charts/fault-remediation/values.yaml
  • distros/kubernetes/nvsentinel/values-full.yaml
  • docs/configuration/fault-remediation.md
  • docs/fault-remediation.md
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/common/equivalence_groups.go
  • fault-remediation/pkg/common/equivalence_groups_test.go
  • fault-remediation/pkg/config/config.go
  • fault-remediation/pkg/config/config_test.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/remediation_test.go
✅ Files skipped from review due to trivial changes (2)
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/configmap.yaml
🚧 Files skipped from review as they are similar to previous changes (9)
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/annotation/annotation.go
  • distros/kubernetes/nvsentinel/values-full.yaml
  • fault-remediation/pkg/common/equivalence_groups.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/common/equivalence_groups_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/remediation.go

Comment thread docs/fault-remediation.md
Comment on lines +94 to +101
"lpu-remediation-template.yaml": |
apiVersion: {{.ApiGroup}}/{{.Version}}
kind: LPURemediation
metadata:
name: maintenance-{{ .HealthEvent.NodeName }}-{{ .HealthEventID }}
spec:
nodeName: {{ .HealthEvent.NodeName }}
device: {{ index .HealthEvent.Metadata "device" }}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Quote templated device value in the example for YAML safety.

At Line 101, prefer quoting the rendered value to avoid YAML parsing edge cases for device identifiers:
device: "{{ index .HealthEvent.Metadata "device" }}".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/fault-remediation.md` around lines 94 - 101, In the LPURemediation
template under the "lpu-remediation-template.yaml" snippet, the unquoted device
value pulled via index from HealthEvent.Metadata can produce YAML parsing
issues; update the spec.nodeName/device line to render the device value wrapped
in double quotes (i.e., quote the result of index .HealthEvent.Metadata
"device") so the generated LPURemediation manifest (kind: LPURemediation) safely
handles device identifiers.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 7, 2026

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/fault-remediation/pkg/config 34.29% (+4.49%) 👍
github.com/nvidia/nvsentinel/fault-remediation/pkg/crstatus 35.00% (+4.42%) 👍
github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation 29.29% (+1.45%) 👍

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/fault-remediation/pkg/config/config.go 34.29% (+4.49%) 700 (+361) 240 (+139) 460 (+222) 👍
github.com/nvidia/nvsentinel/fault-remediation/pkg/crstatus/checker.go 35.00% (+4.42%) 220 (+14) 77 (+14) 143 👍
github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation/remediation.go 29.29% (+1.45%) 1444 (+65) 423 (+39) 1021 (+26) 👍

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

statusChecker *crstatus.CRStatusChecker
}

// nolint:cyclop
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we modularize the function so we have to not add nolint for cyclop?

Looks at the diff below:

	for actionName, maintenanceResource := range remediationConfig.RemediationActions {
		if err := loadTemplateForAction(templates, templateMountPath, actionName, maintenanceResource); err != nil {
			return nil, err
		}
	}

is repeated in the line 96-100, perhaps we can create a function for this that validate for map?

equivalenceGroup: "restart"

componentActions:
"GPU":
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing some context here: What forbids an user to add Node here and do node level remediation flowing through this new pathaway?


for _, componentActions := range c.ComponentRemediationActions {
if _, ok := componentActions[actionName]; ok {
return MaintenanceResource{}, false
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This returns false as soon when it finds the action in any component's overrides, even if only one component has it. This means an older annotation (with the previous version of annotation) for COMPONENT_RESET is treated as ambiguous even if only GPU (and no other component) has an override. So when nvsentinel is upgraded in a cluster, this means the existing annotation entry gets treated as stale and cleaned up, potentially causing a new CR to be created even though the in-progress one is still valid. I think a better alternative would be to treat it as unambiguous if exactly one component class defines the action, and resolve to that action. What do you think?

@KaivalyaMDabhadkar
Copy link
Copy Markdown
Contributor

In fault-remediation/pkg/reconciler/reconciler_e2e_test.go, the e2e tests compile with the new fucntion signature but don't actually test the new feature we have added. All health events in the e2e tests have an empty ComponentClass, and the config only has shared RemediationActions, no ComponentRemediationActions. So every test still is still going through the old fallback path.

It would be good to add one e2e test case that uses the component-aware path end to end. Something like: a GPU health event (ComponentClass: "GPU") with a GPU-specific override in the config, verify it creates a GPUReset CR (not the shared RebootNode), verify the annotation records ComponentClass: "GPU", then send a second identical event and verify it correctly deduplicates against the stored GPU-specific CR. This would catch any issues in the e2e flow with this new code path

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants