Problem
NVSentinel currently monitors GPU health (DCGM watches) and kernel syslog (XID/SXID errors), but has no passive monitoring for InfiniBand fabric events. This is a gap under the SemiAnalysis ClusterMAX rating system, which requires passive detection of:
- InfiniBand/RoCEv2 link flaps
- IB port error counters (symbol errors, link downed, link error recovery)
- ConnectX NIC state changes
These events are critical for GPU cloud providers — a flaky IB link between test runs causes NCCL hangs and training job failures. Active health checks (periodic NCCL tests) catch these eventually, but there's a detection gap between runs where a link can degrade without alerting.
Why NVSentinel is the right place
- NVIDIA owns both the IB stack (Mellanox/ConnectX) and NVSentinel — full domain expertise
- The DaemonSet infrastructure already exists (`syslog-health-monitor` watches kernel logs)
- The event pipeline already exists (platform-connectors → gRPC sink)
- IB link events appear in dmesg as `mlx5_core` messages — same pattern as XID/SXID syslog matching
Proposed approach
Two complementary monitors, similar to how GPU monitoring works today:
1. Extend `syslog-health-monitor` to match IB patterns
Match `mlx5_core` kernel messages for link state changes and errors:
```
mlx5_core 0000:86:00.0: Port 1 link down
mlx5_core 0000:86:00.0: Port 1 link up (active)
mlx5_core 0000:86:00.0: mlx5_handle_error_cqe: completion error
```
These could be emitted as health events through the existing platform-connectors pipeline, with `componentClass: "IB"` and check names like `IbLinkStateWatch` and `IbErrorWatch`.
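To make the matching concrete, here is a minimal Go sketch of what a pattern table could look like, assuming a regex-per-check approach like the existing XID/SXID matching. The `ibPattern` struct, the `matchIbEvent` helper, and the exact regexes are illustrative rather than the actual syslog-health-monitor API; only the check names and the example log lines come from the proposal above.

```go
// Hypothetical pattern table for matching mlx5_core syslog lines; check names
// mirror the proposal (IbLinkStateWatch, IbErrorWatch), everything else is a sketch.
package main

import (
	"fmt"
	"regexp"
)

type ibPattern struct {
	checkName string         // e.g. IbLinkStateWatch
	re        *regexp.Regexp // compiled against the raw kernel log line
}

var ibPatterns = []ibPattern{
	// Link state transitions: "mlx5_core 0000:86:00.0: Port 1 link down"
	{"IbLinkStateWatch", regexp.MustCompile(`mlx5_core (\S+): Port (\d+) link (down|up)`)},
	// Completion errors: "mlx5_core 0000:86:00.0: mlx5_handle_error_cqe: completion error"
	{"IbErrorWatch", regexp.MustCompile(`mlx5_core (\S+): mlx5_handle_error_cqe`)},
}

// matchIbEvent returns the check name and PCI address for a matching line,
// or ok=false when the line is not a recognized IB event.
func matchIbEvent(line string) (check, pciAddr string, ok bool) {
	for _, p := range ibPatterns {
		if m := p.re.FindStringSubmatch(line); m != nil {
			return p.checkName, m[1], true
		}
	}
	return "", "", false
}

func main() {
	line := "mlx5_core 0000:86:00.0: Port 1 link down"
	if check, dev, ok := matchIbEvent(line); ok {
		// In NVSentinel this would become a health event with componentClass "IB".
		fmt.Printf("check=%s device=%s\n", check, dev)
	}
}
```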
2. IB port counter watcher (new, similar to `gpu-health-monitor`)
Poll InfiniBand port counters from sysfs every N seconds:
```
/sys/class/infiniband/mlx5_*/ports/*/counters/symbol_error_count
/sys/class/infiniband/mlx5_*/ports/*/counters/link_downed
/sys/class/infiniband/mlx5_*/ports/*/counters/link_error_recovery
/sys/class/infiniband/mlx5_*/ports/*/counters/port_rcv_errors
```
Emit a health event when counters increment (delta-based detection). This catches degradation that doesn't produce kernel log messages.
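A minimal sketch of the delta-based watcher, again in Go and using only the standard library: the sysfs paths are the ones listed above, while the polling interval, the glob over devices, and the plain stdout output are placeholders for whatever the real monitor would emit through platform-connectors.

```go
// Delta-based IB counter watcher sketch. Paths come from the proposal; the
// 30s interval and stdout output are placeholders for the real event emission.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"time"
)

// Counters to watch, per the proposal.
var watchedCounters = []string{
	"symbol_error_count",
	"link_downed",
	"link_error_recovery",
	"port_rcv_errors",
}

// readCounter parses a single sysfs counter file into an integer.
func readCounter(path string) (uint64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
}

func main() {
	prev := map[string]uint64{}
	for {
		for _, name := range watchedCounters {
			paths, _ := filepath.Glob("/sys/class/infiniband/mlx5_*/ports/*/counters/" + name)
			for _, p := range paths {
				cur, err := readCounter(p)
				if err != nil {
					continue // port went away or counter unreadable; skip this cycle
				}
				if last, seen := prev[p]; seen && cur > last {
					// A counter incremented since the last poll: in the real monitor this
					// would become a health event (componentClass "IB"), not a print.
					fmt.Printf("counter %s incremented: %d -> %d\n", p, last, cur)
				}
				prev[p] = cur
			}
		}
		time.Sleep(30 * time.Second) // "every N seconds" -- interval is illustrative
	}
}
```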
Impact
This would close the last passive monitoring gap in the ClusterMAX rating system for GPU cloud providers using NVSentinel. Combined with NVSentinel's existing GPU/syslog monitors and active health checks (DCGM L3, NCCL, nvbandwidth), this would provide complete Platinum-tier health check coverage.