feat: support component-based recommended actions#1124
feat: support component-based recommended actions#1124alexscjundev wants to merge 8 commits intoNVIDIA:mainfrom
Conversation
📝 WalkthroughWalkthroughAdds component-class-scoped remediation actions with fallback to shared actions; updates Helm values/templates, TOML config, resolution helpers, template loading, annotation state, CR-status checks, reconciler flows, and tests to be component-aware. Changes
Sequence DiagramsequenceDiagram
participant Event as HealthEvent\n(ComponentClass)
participant Reconciler as Reconciler
participant Config as TomlConfig\n(ResolveMaintenanceResource)
participant Checker as CRStatusChecker
participant Remediator as RemediationClient
participant Annotator as AnnotationManager
Event->>Reconciler: receive HealthEvent
Reconciler->>Config: ResolveMaintenanceResource(ComponentClass, ActionName)
alt component-specific found
Config-->>Reconciler: MaintenanceResource (component-scoped)
else fallback to shared
Config-->>Reconciler: MaintenanceResource (shared)
end
Reconciler->>Checker: ShouldSkipCRCreation(ComponentClass, ActionName, crName)
Checker->>Config: ResolveStoredMaintenanceResource(ComponentClass, ActionName)
alt in-progress matches component
Checker-->>Reconciler: skip creation
else not in-progress / ambiguous
Checker-->>Reconciler: create CR
end
Reconciler->>Remediator: CreateMaintenanceResource(HealthEvent, ComponentClass)
Remediator->>Remediator: select template by ResolvedActionKey(ComponentClass/ActionName)
Remediator->>Annotator: UpdateRemediationState(..., ComponentClass)
Annotator-->>Remediator: state updated
Remediator-->>Reconciler: CR created
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~55 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
ffe4de5 to
93815ff
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (7)
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/configmap.yaml (1)
51-73: Extract the repeated resource renderer into a helper.This block mirrors the
maintenance.actionsTOML emission almost field-for-field. Now that both paths exist, the nextMaintenanceResourcechange is easy to land in one block and miss in the other.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@distros/kubernetes/nvsentinel/charts/fault-remediation/templates/configmap.yaml` around lines 51 - 73, The repeated TOML emission for component remediation resources (the range over .Values.maintenance.componentActions producing the [componentRemediationActions...] block) should be extracted into a Helm helper so both this file and the existing maintenance.actions emitter can call the same renderer; create a named template (e.g. "maintenance.renderResource" or "maintenance.componentRemediationRenderer") that accepts the componentClass, actionName and config and emits the identical fields (apiGroup, version, kind, scope, namespace, completeConditionType, templateFileName, equivalenceGroup, impactedEntityScope, supersedingEquivalenceGroups) with the same defaults/quoting/json logic, then replace the inline block in this configmap and the other maintenance.actions emitter to call that helper with include and passing the context.fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)
852-853: Add one real component-override integration case.This test now routes through the component-aware resolver, but the fixture still only builds shared
RemediationActionsand the health events here do not populateComponentClass. As written, the suite still doesn't validatemaintenance.componentActionsor component-aware dedup/annotation behavior end-to-end.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@fault-remediation/pkg/reconciler/reconciler_e2e_test.go` around lines 852 - 853, The test routes through the component-aware resolver but the fixture only builds shared RemediationActions and the HealthEvent used (event1.HealthEvent) does not set ComponentClass, so the suite never exercises maintenance.componentActions or component-aware dedup/annotation. Fix by adding a real component-override integration case: create a HealthEvent with ComponentClass populated, extend the test fixture to build component-specific RemediationActions (not just shared ones) and verify behavior after calling common.GetGroupConfigForEvent(cfg.RemediationClient.GetConfig(), event1.HealthEvent) that maintenance.componentActions and component-aware dedup/annotation behave as expected; reference the test in reconciler_e2e_test, the event variable event1.HealthEvent, the RemediationActions fixtures, and the maintenance.componentActions checks to locate where to add the case.fault-remediation/pkg/crstatus/checker.go (1)
49-51: Minor: Dry-run log message could be clearer.The log message "CR doesn't exist (dry-run mode)" is slightly misleading since we're not actually checking if the CR exists in dry-run mode—we're just skipping the check entirely.
📝 Suggested improvement
if c.dryRun { - slog.Info("DRY-RUN: CR doesn't exist (dry-run mode)", "crName", crName, "action", actionName) + slog.Info("DRY-RUN: Skipping CR existence check", "crName", crName, "action", actionName) return false }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@fault-remediation/pkg/crstatus/checker.go` around lines 49 - 51, The dry-run log is misleading — update the slog.Info call in the checker where c.dryRun is checked (the block referencing crName and actionName) to state that the existence check was skipped due to dry-run mode (e.g., "DRY-RUN: skipping CR existence check") and include crName and actionName as before so the log clearly indicates the check was not performed rather than reporting a non-existence.docs/configuration/fault-remediation.md (1)
105-112: Verify the LPU template metadata access pattern.The LPU remediation template uses
{{ index .HealthEvent.Metadata "device" }}which assumes:
HealthEvent.Metadatais a map- It contains a
"device"keyConsider adding a comment or note clarifying that this is an illustrative example and operators should adapt the template based on their actual health event structure and LPU CRD requirements.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/configuration/fault-remediation.md` around lines 105 - 112, The template assumes .HealthEvent.Metadata is a map and contains a "device" key; update the LPU remediation template text to include a short comment above the snippet stating this is illustrative and operators must adapt the field access to their actual HealthEvent shape and LPU CRD (e.g., replace {{ index .HealthEvent.Metadata "device" }} with the correct path or guarded lookup in your environment). Mention the two explicit assumptions (map type and presence of "device") and recommend using a conditional or appropriate key lookup when implementing in production so the template doesn’t panic on missing keys.fault-remediation/pkg/config/config.go (2)
75-76: Consider adding type documentation.The
ComponentRemediationActionstype would benefit from a brief doc comment explaining its structure (componentClass → actionName → MaintenanceResource mapping).📝 Suggested documentation
+// ComponentRemediationActions maps component classes to their action-specific remediation +// resources. The outer key is the componentClass (e.g., "GPU", "LPU"), and the inner key +// is the action name (e.g., "COMPONENT_RESET"). type ComponentRemediationActions map[string]map[string]MaintenanceResource🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@fault-remediation/pkg/config/config.go` around lines 75 - 76, Add a brief doc comment immediately above the ComponentRemediationActions type declaration describing that it maps component class (string) to a map of action names (string) to MaintenanceResource values (i.e., componentClass → actionName → MaintenanceResource), so readers understand the nested map structure and intended usage; update the comment near the type `ComponentRemediationActions` to reflect this explanation.
236-251: Minor capacity estimation note.Line 238's capacity calculation uses
len(c.ComponentRemediationActions)which counts component classes, not total component actions. This is a minor inefficiency (extra allocations if there are many actions per component class) but has negligible impact.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@fault-remediation/pkg/config/config.go` around lines 236 - 251, The capacity preallocation in allMaintenanceResources overestimates because it uses len(c.ComponentRemediationActions) (the number of component keys) instead of the total number of actions; in the allMaintenanceResources function compute the true total by summing len(actions) for each entry in c.ComponentRemediationActions (e.g., iterate the map/slice to accumulate totalComponentActions) and then create allResources with capacity len(c.RemediationActions)+totalComponentActions so append won’t cause extra reallocations.fault-remediation/pkg/config/config_test.go (1)
22-24: Consider using standard testing package for simple assertions.The retrieved learning indicates that in this repository, the standard testing package assertions (
t.Error,t.Errorf,t.Fatal) should be preferred for simple equality/inequality checks, reserving testify for complex scenarios. WhileTestTomlConfig_ResolveMaintenanceResourcecorrectly uses standard assertions,TestResolvedActionKeyAndActionNameHelpersandTestTomlConfig_ResolveStoredMaintenanceResourceusetestify/assertfor straightforward equality checks.Based on learnings: "In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@fault-remediation/pkg/config/config_test.go` around lines 22 - 24, Replace testify usage in TestResolvedActionKeyAndActionNameHelpers and TestTomlConfig_ResolveStoredMaintenanceResource with standard testing methods: remove the "github.com/stretchr/testify/assert" import, use the testing.T methods (t.Fatalf/t.Errorf/t.Helper as appropriate) for equality and error checks, and update the assertions inside the functions (e.g., where assert.Equal/assert.NoError/assert.Nil are used) to equivalent t.Fatalf/t.Errorf calls that include expected vs actual values; keep TestTomlConfig_ResolveMaintenanceResource unchanged as it already uses the standard testing package.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In
`@distros/kubernetes/nvsentinel/charts/fault-remediation/templates/configmap.yaml`:
- Around line 51-73: The repeated TOML emission for component remediation
resources (the range over .Values.maintenance.componentActions producing the
[componentRemediationActions...] block) should be extracted into a Helm helper
so both this file and the existing maintenance.actions emitter can call the same
renderer; create a named template (e.g. "maintenance.renderResource" or
"maintenance.componentRemediationRenderer") that accepts the componentClass,
actionName and config and emits the identical fields (apiGroup, version, kind,
scope, namespace, completeConditionType, templateFileName, equivalenceGroup,
impactedEntityScope, supersedingEquivalenceGroups) with the same
defaults/quoting/json logic, then replace the inline block in this configmap and
the other maintenance.actions emitter to call that helper with include and
passing the context.
In `@docs/configuration/fault-remediation.md`:
- Around line 105-112: The template assumes .HealthEvent.Metadata is a map and
contains a "device" key; update the LPU remediation template text to include a
short comment above the snippet stating this is illustrative and operators must
adapt the field access to their actual HealthEvent shape and LPU CRD (e.g.,
replace {{ index .HealthEvent.Metadata "device" }} with the correct path or
guarded lookup in your environment). Mention the two explicit assumptions (map
type and presence of "device") and recommend using a conditional or appropriate
key lookup when implementing in production so the template doesn’t panic on
missing keys.
In `@fault-remediation/pkg/config/config_test.go`:
- Around line 22-24: Replace testify usage in
TestResolvedActionKeyAndActionNameHelpers and
TestTomlConfig_ResolveStoredMaintenanceResource with standard testing methods:
remove the "github.com/stretchr/testify/assert" import, use the testing.T
methods (t.Fatalf/t.Errorf/t.Helper as appropriate) for equality and error
checks, and update the assertions inside the functions (e.g., where
assert.Equal/assert.NoError/assert.Nil are used) to equivalent t.Fatalf/t.Errorf
calls that include expected vs actual values; keep
TestTomlConfig_ResolveMaintenanceResource unchanged as it already uses the
standard testing package.
In `@fault-remediation/pkg/config/config.go`:
- Around line 75-76: Add a brief doc comment immediately above the
ComponentRemediationActions type declaration describing that it maps component
class (string) to a map of action names (string) to MaintenanceResource values
(i.e., componentClass → actionName → MaintenanceResource), so readers understand
the nested map structure and intended usage; update the comment near the type
`ComponentRemediationActions` to reflect this explanation.
- Around line 236-251: The capacity preallocation in allMaintenanceResources
overestimates because it uses len(c.ComponentRemediationActions) (the number of
component keys) instead of the total number of actions; in the
allMaintenanceResources function compute the true total by summing len(actions)
for each entry in c.ComponentRemediationActions (e.g., iterate the map/slice to
accumulate totalComponentActions) and then create allResources with capacity
len(c.RemediationActions)+totalComponentActions so append won’t cause extra
reallocations.
In `@fault-remediation/pkg/crstatus/checker.go`:
- Around line 49-51: The dry-run log is misleading — update the slog.Info call
in the checker where c.dryRun is checked (the block referencing crName and
actionName) to state that the existence check was skipped due to dry-run mode
(e.g., "DRY-RUN: skipping CR existence check") and include crName and actionName
as before so the log clearly indicates the check was not performed rather than
reporting a non-existence.
In `@fault-remediation/pkg/reconciler/reconciler_e2e_test.go`:
- Around line 852-853: The test routes through the component-aware resolver but
the fixture only builds shared RemediationActions and the HealthEvent used
(event1.HealthEvent) does not set ComponentClass, so the suite never exercises
maintenance.componentActions or component-aware dedup/annotation. Fix by adding
a real component-override integration case: create a HealthEvent with
ComponentClass populated, extend the test fixture to build component-specific
RemediationActions (not just shared ones) and verify behavior after calling
common.GetGroupConfigForEvent(cfg.RemediationClient.GetConfig(),
event1.HealthEvent) that maintenance.componentActions and component-aware
dedup/annotation behave as expected; reference the test in reconciler_e2e_test,
the event variable event1.HealthEvent, the RemediationActions fixtures, and the
maintenance.componentActions checks to locate where to add the case.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 26a73413-e505-44ec-bc6f-06a527ddbe84
📒 Files selected for processing (20)
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/configmap.yamldistros/kubernetes/nvsentinel/charts/fault-remediation/values.yamldistros/kubernetes/nvsentinel/values-full.yamldocs/configuration/fault-remediation.mddocs/fault-remediation.mdfault-remediation/pkg/annotation/annotation.gofault-remediation/pkg/annotation/annotation_interface.gofault-remediation/pkg/annotation/annotation_test.gofault-remediation/pkg/common/equivalence_groups.gofault-remediation/pkg/common/equivalence_groups_test.gofault-remediation/pkg/config/config.gofault-remediation/pkg/config/config_test.gofault-remediation/pkg/crstatus/checker.gofault-remediation/pkg/crstatus/crstatus_interface.gofault-remediation/pkg/crstatus/crstatus_test.gofault-remediation/pkg/reconciler/reconciler.gofault-remediation/pkg/reconciler/reconciler_e2e_test.gofault-remediation/pkg/reconciler/reconciler_test.gofault-remediation/pkg/remediation/remediation.gofault-remediation/pkg/remediation/remediation_test.go
Signed-off-by: Alex Jun <aljun@nvidia.com>
Signed-off-by: Alex Jun <aljun@nvidia.com>
Signed-off-by: Alex Jun <aljun@nvidia.com>
Signed-off-by: Alex Jun <aljun@nvidia.com>
Signed-off-by: Alex Jun <aljun@nvidia.com>
Signed-off-by: Alex Jun <aljun@nvidia.com>
…onsume it in ShouldSkipCRCreation Previously, ShouldSkipCRCreation resolved the stored CR type using the current health event’s ComponentClass plus the annotation’s ActionName. That was wrong when the existing CR had been created by a different component-specific override, because the lookup could miss the real in-progress CR and allow a duplicate remediation. The fix records the creating ComponentClass in EquivalenceGroupState and uses that stored value when checking existing CR status. For backward compatibility, older annotations without ComponentClass still resolve when the stored ActionName remains unambiguous in the current config. If newer component-specific overrides make that mapping ambiguous, we treat the legacy annotation as stale rather than guessing. Signed-off-by: Alex Jun <aljun@nvidia.com>
Signed-off-by: Alex Jun <aljun@nvidia.com>
93815ff to
358fd3f
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (2)
fault-remediation/pkg/crstatus/checker.go (1)
54-61: Consider lowering log level for expected legacy annotation scenarios.When
componentClassis empty (legacy annotations) and the action only exists inComponentRemediationActions,ResolveStoredMaintenanceResourceintentionally returnsfalseto treat the mapping as stale. This is documented behavior perconfig.go:310-333.However, the
slog.Errorat line 56 may generate noise for this expected scenario. Consider usingslog.Warnorslog.Debugto distinguish between genuine misconfigurations and expected backward-compatibility handling.♻️ Suggested change to reduce log noise
resource, found := c.remediationConfig.ResolveStoredMaintenanceResource(componentClass, actionName) if !found { - slog.Error("No remediation configuration found for action", + slog.Warn("No remediation configuration found for stored action (may be stale annotation)", "action", actionName, "componentClass", componentClass) return false }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@fault-remediation/pkg/crstatus/checker.go` around lines 54 - 61, The slog.Error call after calling remediationConfig.ResolveStoredMaintenanceResource(componentClass, actionName) is too noisy for expected legacy cases where componentClass is empty; change the log level to slog.Warn (or slog.Debug if even less noise is desired) and adjust the message to indicate this is an expected/backward-compatibility fallback (include "action" and "componentClass" as currently logged) so ResolveStoredMaintenanceResource and the surrounding conditional still return false but with a less severe log level.docs/fault-remediation.md (1)
94-101: Consider documenting missing metadata handling in templates.The
lpu-remediation-template.yamluses{{ index .HealthEvent.Metadata "device" }}to access the device from metadata. This is valid Go template syntax, but if the"device"key is missing fromMetadata, it will render as an empty string.Consider adding a note about required metadata fields for each component type, or using a defensive template pattern.
📝 Optional: Defensive template example
# In documentation, note that templates should handle missing keys gracefully: spec: nodeName: {{ .HealthEvent.NodeName }} {{- if index .HealthEvent.Metadata "device" }} device: {{ index .HealthEvent.Metadata "device" }} {{- end }}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/fault-remediation.md` around lines 94 - 101, Update the docs to explain that lpu-remediation-template.yaml’s use of {{ index .HealthEvent.Metadata "device" }} will yield an empty string if the "device" key is absent; either (a) list required metadata keys (e.g., "device") for LPURemediation and HealthEvent consumers like .HealthEvent.NodeName and .HealthEventID, or (b) show a defensive Go-template pattern (using an if test around index .HealthEvent.Metadata "device") so templates omit the device field when missing—include both the required-fields note and the defensive-template example for clarity.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@docs/fault-remediation.md`:
- Around line 94-101: Update the docs to explain that
lpu-remediation-template.yaml’s use of {{ index .HealthEvent.Metadata "device"
}} will yield an empty string if the "device" key is absent; either (a) list
required metadata keys (e.g., "device") for LPURemediation and HealthEvent
consumers like .HealthEvent.NodeName and .HealthEventID, or (b) show a defensive
Go-template pattern (using an if test around index .HealthEvent.Metadata
"device") so templates omit the device field when missing—include both the
required-fields note and the defensive-template example for clarity.
In `@fault-remediation/pkg/crstatus/checker.go`:
- Around line 54-61: The slog.Error call after calling
remediationConfig.ResolveStoredMaintenanceResource(componentClass, actionName)
is too noisy for expected legacy cases where componentClass is empty; change the
log level to slog.Warn (or slog.Debug if even less noise is desired) and adjust
the message to indicate this is an expected/backward-compatibility fallback
(include "action" and "componentClass" as currently logged) so
ResolveStoredMaintenanceResource and the surrounding conditional still return
false but with a less severe log level.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 63af7997-1405-4c6b-94e0-9de1df22250c
📒 Files selected for processing (20)
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/configmap.yamldistros/kubernetes/nvsentinel/charts/fault-remediation/values.yamldistros/kubernetes/nvsentinel/values-full.yamldocs/configuration/fault-remediation.mddocs/fault-remediation.mdfault-remediation/pkg/annotation/annotation.gofault-remediation/pkg/annotation/annotation_interface.gofault-remediation/pkg/annotation/annotation_test.gofault-remediation/pkg/common/equivalence_groups.gofault-remediation/pkg/common/equivalence_groups_test.gofault-remediation/pkg/config/config.gofault-remediation/pkg/config/config_test.gofault-remediation/pkg/crstatus/checker.gofault-remediation/pkg/crstatus/crstatus_interface.gofault-remediation/pkg/crstatus/crstatus_test.gofault-remediation/pkg/reconciler/reconciler.gofault-remediation/pkg/reconciler/reconciler_e2e_test.gofault-remediation/pkg/reconciler/reconciler_test.gofault-remediation/pkg/remediation/remediation.gofault-remediation/pkg/remediation/remediation_test.go
🚧 Files skipped from review as they are similar to previous changes (14)
- fault-remediation/pkg/crstatus/crstatus_interface.go
- fault-remediation/pkg/reconciler/reconciler.go
- distros/kubernetes/nvsentinel/values-full.yaml
- distros/kubernetes/nvsentinel/charts/fault-remediation/templates/configmap.yaml
- fault-remediation/pkg/annotation/annotation.go
- fault-remediation/pkg/common/equivalence_groups.go
- fault-remediation/pkg/annotation/annotation_test.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/remediation/remediation_test.go
- docs/configuration/fault-remediation.md
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/common/equivalence_groups_test.go
- fault-remediation/pkg/config/config.go
- fault-remediation/pkg/config/config_test.go
|
/ok to test 358fd3f |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
docs/fault-remediation.md (1)
53-75: Clarify component class key expectations (case/enum) to prevent config mismatches.Nice addition. Consider adding one line that
componentActionskeys (e.g.,"GPU","LPU") must exactly match emittedcomponentClassvalues (case-sensitive), otherwise lookup falls back to shared actions. This will reduce operator misconfigurations.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/fault-remediation.md` around lines 53 - 75, Add a clarifying sentence in the docs stating that the keys under componentActions (e.g., "GPU", "LPU") must exactly match the emitted componentClass values (case-sensitive) to avoid falling back to shared actions; update the section near componentActions to explicitly call out that componentClass matching is exact and case-sensitive and that mismatches will cause the lookup to use shared/default actions instead.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/fault-remediation.md`:
- Around line 94-101: In the LPURemediation template under the
"lpu-remediation-template.yaml" snippet, the unquoted device value pulled via
index from HealthEvent.Metadata can produce YAML parsing issues; update the
spec.nodeName/device line to render the device value wrapped in double quotes
(i.e., quote the result of index .HealthEvent.Metadata "device") so the
generated LPURemediation manifest (kind: LPURemediation) safely handles device
identifiers.
---
Nitpick comments:
In `@docs/fault-remediation.md`:
- Around line 53-75: Add a clarifying sentence in the docs stating that the keys
under componentActions (e.g., "GPU", "LPU") must exactly match the emitted
componentClass values (case-sensitive) to avoid falling back to shared actions;
update the section near componentActions to explicitly call out that
componentClass matching is exact and case-sensitive and that mismatches will
cause the lookup to use shared/default actions instead.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: fc88528c-e9ce-4d5e-ad59-3fe382e948e1
📒 Files selected for processing (20)
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/configmap.yamldistros/kubernetes/nvsentinel/charts/fault-remediation/values.yamldistros/kubernetes/nvsentinel/values-full.yamldocs/configuration/fault-remediation.mddocs/fault-remediation.mdfault-remediation/pkg/annotation/annotation.gofault-remediation/pkg/annotation/annotation_interface.gofault-remediation/pkg/annotation/annotation_test.gofault-remediation/pkg/common/equivalence_groups.gofault-remediation/pkg/common/equivalence_groups_test.gofault-remediation/pkg/config/config.gofault-remediation/pkg/config/config_test.gofault-remediation/pkg/crstatus/checker.gofault-remediation/pkg/crstatus/crstatus_interface.gofault-remediation/pkg/crstatus/crstatus_test.gofault-remediation/pkg/reconciler/reconciler.gofault-remediation/pkg/reconciler/reconciler_e2e_test.gofault-remediation/pkg/reconciler/reconciler_test.gofault-remediation/pkg/remediation/remediation.gofault-remediation/pkg/remediation/remediation_test.go
✅ Files skipped from review due to trivial changes (2)
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- distros/kubernetes/nvsentinel/charts/fault-remediation/templates/configmap.yaml
🚧 Files skipped from review as they are similar to previous changes (9)
- fault-remediation/pkg/crstatus/crstatus_interface.go
- fault-remediation/pkg/annotation/annotation.go
- distros/kubernetes/nvsentinel/values-full.yaml
- fault-remediation/pkg/common/equivalence_groups.go
- fault-remediation/pkg/reconciler/reconciler.go
- fault-remediation/pkg/annotation/annotation_interface.go
- fault-remediation/pkg/common/equivalence_groups_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/remediation/remediation.go
| "lpu-remediation-template.yaml": | | ||
| apiVersion: {{.ApiGroup}}/{{.Version}} | ||
| kind: LPURemediation | ||
| metadata: | ||
| name: maintenance-{{ .HealthEvent.NodeName }}-{{ .HealthEventID }} | ||
| spec: | ||
| nodeName: {{ .HealthEvent.NodeName }} | ||
| device: {{ index .HealthEvent.Metadata "device" }} |
There was a problem hiding this comment.
Quote templated device value in the example for YAML safety.
At Line 101, prefer quoting the rendered value to avoid YAML parsing edge cases for device identifiers:
device: "{{ index .HealthEvent.Metadata "device" }}".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/fault-remediation.md` around lines 94 - 101, In the LPURemediation
template under the "lpu-remediation-template.yaml" snippet, the unquoted device
value pulled via index from HealthEvent.Metadata can produce YAML parsing
issues; update the spec.nodeName/device line to render the device value wrapped
in double quotes (i.e., quote the result of index .HealthEvent.Metadata
"device") so the generated LPURemediation manifest (kind: LPURemediation) safely
handles device identifiers.
Merging this branch will increase overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. |
| statusChecker *crstatus.CRStatusChecker | ||
| } | ||
|
|
||
| // nolint:cyclop |
There was a problem hiding this comment.
Can we modularize the function so we have to not add nolint for cyclop?
Looks at the diff below:
for actionName, maintenanceResource := range remediationConfig.RemediationActions {
if err := loadTemplateForAction(templates, templateMountPath, actionName, maintenanceResource); err != nil {
return nil, err
}
}is repeated in the line 96-100, perhaps we can create a function for this that validate for map?
| equivalenceGroup: "restart" | ||
|
|
||
| componentActions: | ||
| "GPU": |
There was a problem hiding this comment.
Maybe I'm missing some context here: What forbids an user to add Node here and do node level remediation flowing through this new pathaway?
|
|
||
| for _, componentActions := range c.ComponentRemediationActions { | ||
| if _, ok := componentActions[actionName]; ok { | ||
| return MaintenanceResource{}, false |
There was a problem hiding this comment.
This returns false as soon when it finds the action in any component's overrides, even if only one component has it. This means an older annotation (with the previous version of annotation) for COMPONENT_RESET is treated as ambiguous even if only GPU (and no other component) has an override. So when nvsentinel is upgraded in a cluster, this means the existing annotation entry gets treated as stale and cleaned up, potentially causing a new CR to be created even though the in-progress one is still valid. I think a better alternative would be to treat it as unambiguous if exactly one component class defines the action, and resolve to that action. What do you think?
|
In It would be good to add one e2e test case that uses the component-aware path end to end. Something like: a GPU health event (ComponentClass: "GPU") with a GPU-specific override in the config, verify it creates a GPUReset CR (not the shared RebootNode), verify the annotation records ComponentClass: "GPU", then send a second identical event and verify it correctly deduplicates against the stored GPU-specific CR. This would catch any issues in the e2e flow with this new code path |
Signed-off-by: Alex Jun aljun@nvidia.com
Summary
Support
ComponentRemediationActionsin fault remediation, which maps from component type and recommended action to created CR.Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit
New Features
Documentation
Behavior
Tests