Feature request: passive InfiniBand fabric monitoring (link flap, port errors) #1236

## Problem

NVSentinel currently monitors GPU health (DCGM watches) and kernel syslog (XID/SXID errors), but has no passive monitoring for InfiniBand fabric events. This leaves a gap against the [SemiAnalysis ClusterMAX rating system](https://www.clustermax.ai/health-checks), which requires passive detection of:

- InfiniBand/RoCEv2 link flaps
- IB port error counters (symbol errors, link downed, link error recovery)
- ConnectX NIC state changes

These events are critical for GPU cloud providers: a flaky IB link can cause NCCL hangs and training job failures. Active health checks (periodic NCCL tests) catch these eventually, but between runs a link can degrade without raising any alert.

## Why NVSentinel is the right place

- NVIDIA owns both the IB stack (Mellanox/ConnectX) and NVSentinel — full domain expertise
- The DaemonSet infrastructure already exists (`syslog-health-monitor` watches kernel logs)
- The event pipeline already exists (platform-connectors → gRPC sink)
- IB link events appear in dmesg as `mlx5_core` messages — same pattern as XID/SXID syslog matching

## Proposed approach

Two complementary monitors, similar to how GPU monitoring works today:

### 1. Extend syslog-health-monitor to match IB patterns

Match `mlx5_core` kernel messages for link state changes and errors, for example:
```
mlx5_core 0000:86:00.0: Port 1 link down
mlx5_core 0000:86:00.0: Port 1 link up (active)
mlx5_core 0000:86:00.0: mlx5_handle_error_cqe: completion error
```

These could be emitted as health events through the existing platform-connectors pipeline, with `componentClass: "IB"` and check names like `IbLinkStateWatch`, `IbErrorWatch`.
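
As a rough illustration of the matching step, the sketch below classifies the sample lines above into health events. It is written in Python for brevity and is not NVSentinel code. The regular expressions are derived only from the three sample messages (real `mlx5_core` formats vary by driver version and would need to be verified), and `classify_ib_line` plus the `entity`, `isHealthy`, and `message` fields are hypothetical names chosen for the example.

```python
import re
from typing import Optional

# Patterns derived only from the sample messages in this issue; they are
# assumptions and must be checked against real mlx5_core output.
_PREFIX = r"mlx5_core (?P<pci>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]): "
LINK_RE = re.compile(_PREFIX + r"Port (?P<port>\d+) link (?P<state>down|up)\b")
CQE_RE = re.compile(_PREFIX + r"mlx5_handle_error_cqe: (?P<detail>.+)$")


def classify_ib_line(line: str) -> Optional[dict]:
    """Return a hypothetical health-event dict for an IB log line, else None."""
    m = LINK_RE.search(line)
    if m:
        return {
            "componentClass": "IB",
            "checkName": "IbLinkStateWatch",
            "entity": f"{m['pci']}/port{m['port']}",  # hypothetical field
            "isHealthy": m["state"] == "up",          # hypothetical field
            "message": line.strip(),
        }
    m = CQE_RE.search(line)
    if m:
        return {
            "componentClass": "IB",
            "checkName": "IbErrorWatch",
            "entity": m["pci"],
            "isHealthy": False,
            "message": line.strip(),
        }
    return None  # not an IB line this sketch recognizes
```

A link flap would then show up as a `down` event followed shortly by an `up` event for the same entity; how those are correlated or debounced is left open here.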

### 2. IB port counter watcher (new, similar to gpu-health-monitor)

Poll InfiniBand port counters from sysfs every N seconds:
```
/sys/class/infiniband/mlx5_*/ports/*/counters/symbol_error
/sys/class/infiniband/mlx5_*/ports/*/counters/link_downed
/sys/class/infiniband/mlx5_*/ports/*/counters/link_error_recovery
/sys/class/infiniband/mlx5_*/ports/*/counters/port_rcv_errors
```

Emit a health event when counters increment (delta-based detection). This catches degradation that doesn't produce kernel log messages.
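
The delta-based detection could look roughly like the following Python sketch. It is an illustration under assumptions, not a proposed implementation: `snapshot`, `increments`, and `watch` are hypothetical helpers, the counter file names are passed in by the caller, and the 30-second default interval is an arbitrary placeholder for "N". A counter seen for the first time, or one whose value went down (for example after a driver reload), is treated as a new baseline rather than as an event.

```python
import glob
import os
import time
from typing import Callable, Dict, Iterable

SYSFS_ROOT = "/sys/class/infiniband"


def snapshot(names: Iterable[str], root: str = SYSFS_ROOT) -> Dict[str, int]:
    """Read the named counters for every mlx5_* device and port under root."""
    values: Dict[str, int] = {}
    for name in names:
        pattern = os.path.join(root, "mlx5_*", "ports", "*", "counters", name)
        for path in glob.glob(pattern):
            try:
                with open(path) as f:
                    values[path] = int(f.read().strip())
            except (OSError, ValueError):
                continue  # unreadable or non-numeric counter: skip it
    return values


def increments(prev: Dict[str, int], curr: Dict[str, int]) -> Dict[str, int]:
    """Return {path: delta} for counters that grew since the last snapshot."""
    deltas: Dict[str, int] = {}
    for path, value in curr.items():
        old = prev.get(path)
        if old is not None and value > old:
            deltas[path] = value - old
    return deltas


def watch(names: Iterable[str], emit: Callable[[str, int], None],
          interval_s: float = 30.0, root: str = SYSFS_ROOT) -> None:
    """Poll forever and call emit(path, delta) for each counter that grew."""
    names = list(names)
    prev = snapshot(names, root)
    while True:
        time.sleep(interval_s)
        curr = snapshot(names, root)
        for path, delta in increments(prev, curr).items():
            emit(path, delta)
        prev = curr
```

In a real monitor, `emit` would build a health event and hand it to the platform-connectors pipeline; thresholds or rate limits per counter may be needed so that a single stray error does not mark a node unhealthy.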

## Impact

This would close the last passive monitoring gap in the ClusterMAX rating system for GPU cloud providers using NVSentinel. Combined with NVSentinel's existing GPU/syslog monitors and active health checks (DCGM L3, NCCL, nvbandwidth), this would provide complete Platinum-tier health check coverage.
