Feature request: passive InfiniBand fabric monitoring (link flap, port errors) #1236

@shaq918

Description

Problem

NVSentinel currently monitors GPU health (DCGM watches) and kernel syslog (XID/SXID errors), but has no passive monitoring for InfiniBand fabric events. This leaves a gap against the SemiAnalysis ClusterMAX rating system, which requires passive detection of:

  • InfiniBand/RoCEv2 link flaps
  • IB port error counters (symbol errors, link downed, link error recovery)
  • ConnectX NIC state changes

These events are critical for GPU cloud providers: a flaky IB link causes NCCL hangs and training job failures. Active health checks (periodic NCCL tests) catch these eventually, but between test runs a link can degrade without raising any alert.

Why NVSentinel is the right place

  • NVIDIA owns both the IB stack (Mellanox/ConnectX) and NVSentinel, so it has full domain expertise across both
  • The DaemonSet infrastructure already exists (syslog-health-monitor watches kernel logs)
  • The event pipeline already exists (platform-connectors → gRPC sink)
  • IB link events appear in dmesg as mlx5_core messages, so they fit the same pattern matching already used for XID/SXID syslog lines

Proposed approach

Two complementary monitors, similar to how GPU monitoring works today:

1. Extend syslog-health-monitor to match IB patterns

Match mlx5_core kernel messages for link state changes and errors, for example:

mlx5_core 0000:86:00.0: Port 1 link down
mlx5_core 0000:86:00.0: Port 1 link up (active)
mlx5_core 0000:86:00.0: mlx5_handle_error_cqe: completion error

These could be emitted as health events through the existing platform-connectors pipeline, with componentClass: "IB" and check names like IbLinkStateWatch, IbErrorWatch.
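
As a rough illustration of the matching step, here is a minimal Python sketch. The regexes are derived only from the three example lines above, and real mlx5_core message formats may differ between driver versions, so they would need checking against actual dmesg output. The helper name `match_ib_line` and the event field names (`checkName`, `isHealthy`, `pci`, `port`) are hypothetical and are not NVSentinel's actual event schema; only `componentClass: "IB"` and the two check names come from this proposal.

```python
import re
from typing import Optional

# ASSUMPTION: patterns are based solely on the example lines in this issue.
# Verify against real mlx5_core output before relying on them.
_PREFIX = (
    r"mlx5_core (?P<pci>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]): "
)

IB_PATTERNS = [
    (
        "IbLinkStateWatch",
        re.compile(_PREFIX + r"Port (?P<port>\d+) link (?P<state>down|up)\b"),
    ),
    (
        "IbErrorWatch",
        re.compile(_PREFIX + r"mlx5_handle_error_cqe: (?P<detail>.+)$"),
    ),
]


def match_ib_line(line: str) -> Optional[dict]:
    """Return a health-event-like dict for an IB log line, or None.

    The dict layout is hypothetical and only illustrates what a monitor
    might hand to the event pipeline.
    """
    for check_name, pattern in IB_PATTERNS:
        m = pattern.search(line)
        if m is None:
            continue
        fields = m.groupdict()
        # Only an explicit "link up" counts as healthy; a link down or an
        # error CQE is reported as unhealthy.
        healthy = fields.get("state") == "up"
        return {
            "componentClass": "IB",
            "checkName": check_name,
            "isHealthy": healthy,
            "pci": fields["pci"],
            "port": fields.get("port"),
            "message": line.strip(),
        }
    return None
```

Because the patterns use `search`, a leading dmesg timestamp or syslog prefix does not prevent a match. Lines that match neither pattern return `None` and would be left to the existing XID/SXID handlers.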

2. IB port counter watcher (new, similar to gpu-health-monitor)

Poll InfiniBand port counters from sysfs every N seconds:

/sys/class/infiniband/mlx5_*/ports/*/counters/symbol_error_count
/sys/class/infiniband/mlx5_*/ports/*/counters/link_downed
/sys/class/infiniband/mlx5_*/ports/*/counters/link_error_recovery
/sys/class/infiniband/mlx5_*/ports/*/counters/port_rcv_errors

Emit a health event when a counter's value increases between two consecutive polls (delta-based detection). This catches degradation that doesn't produce kernel log messages.
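
The delta-based polling could look roughly like the Python sketch below. It reads the four counters listed above and reports only increases between two snapshots. The function names (`read_counters`, `counter_deltas`) and the returned dict layout are hypothetical, and the choice to ignore a counter on first sighting or after a decrease (for example a driver reload that resets it) is an assumption, not a decided design.

```python
import glob
import os
from typing import Dict, List

# Counter file names taken from the sysfs paths listed in this issue.
WATCHED_COUNTERS = (
    "symbol_error_count",
    "link_downed",
    "link_error_recovery",
    "port_rcv_errors",
)


def read_counters(sysfs_root: str = "/sys/class/infiniband") -> Dict[str, int]:
    """Snapshot watched counters as {sysfs path: value}."""
    values: Dict[str, int] = {}
    for name in WATCHED_COUNTERS:
        pattern = os.path.join(
            sysfs_root, "mlx5_*", "ports", "*", "counters", name
        )
        for path in glob.glob(pattern):
            try:
                with open(path) as f:
                    values[path] = int(f.read().strip())
            except (OSError, ValueError):
                # Unreadable or non-numeric counter: skip it this round.
                continue
    return values


def counter_deltas(previous: Dict[str, int], current: Dict[str, int]) -> List[dict]:
    """Return one record per counter whose value increased since `previous`."""
    events = []
    for path, value in sorted(current.items()):
        old = previous.get(path)
        # ASSUMPTION: skip first sightings and decreases (e.g. counter reset).
        if old is None or value <= old:
            continue
        # Path layout: .../<device>/ports/<port>/counters/<counter>
        parts = path.split(os.sep)
        events.append(
            {
                "device": parts[-5],
                "port": parts[-3],
                "counter": parts[-1],
                "delta": value - old,
            }
        )
    return events
```

A watcher would call `read_counters()` every N seconds, pass the previous and current snapshots to `counter_deltas`, and emit one health event per returned record. Whether a single increment should alert, or only a delta above some threshold, is left open here.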

Impact

This would close the last passive monitoring gap against the ClusterMAX rating system for GPU cloud providers using NVSentinel. Combined with NVSentinel's existing GPU/syslog monitors and active health checks (DCGM L3, NCCL, nvbandwidth), this would provide complete Platinum-tier health check coverage.
