[rocprofiler-sdk] Fix SIGABRT in construct_agent_cache on WSL2 with DXG backend #4935

Open
ChrisLundquist wants to merge 1 commit into ROCm:develop from ChrisLundquist:fix/wsl2-agent-cache-crash

@ChrisLundquist

Summary

  • Fix fatal abort in construct_agent_cache() when running on WSL2 with the DXG GPU passthrough backend (HSA_ENABLE_DXG_DETECTION=1)
  • When sysfs topology is unavailable, read_topology() returns 0 agents but the HSA runtime still discovers agents; the agent count mismatch triggers ROCP_FATAL_IF → SIGABRT
  • Add an early-return path that logs a warning instead of crashing, allowing profiling to gracefully degrade
  • Add a GTest (agent_no_topology_no_crash) that validates the early-return path

Problem

Any program that triggers rocprofiler-sdk agent enumeration crashes on WSL2:

F agent.cpp:1192] Found 0 rocprofiler agents and 2 HSA agents.
HSA agents contained 2 internal node ids not found by rocprofiler: 0, 1
*** SIGABRT (@0x3e80000e17d) received by PID 57725 ***

This affects rocprofv3 --hip-trace, PyTorch with kineto profiling, and any custom tool using rocprofiler-sdk on WSL2. Also affects containers without /sys mounted.

Root Cause

  1. read_topology() reads agents from /sys/class/kfd/kfd/topology/nodes
  2. On WSL2 with DXG (or containers), this sysfs path does not exist → returns empty
  3. construct_agent_cache() enumerates HSA agents via the runtime (works via DXG, finds CPU + GPU)
  4. Agent count mismatch (0 vs 2) hits ROCP_FATAL_IF → abort()
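
Steps 1 and 2 can be illustrated with a minimal, self-contained sketch. It is not the SDK's read_topology(). The helper name count_topology_nodes is hypothetical, and the assumption that each node appears as one subdirectory of the nodes path is ours. The sketch only shows how a missing directory naturally yields zero agents rather than an error:

```cpp
#include <cstddef>
#include <filesystem>
#include <system_error>

// Hypothetical helper (not part of rocprofiler-sdk): counts the node
// subdirectories under a sysfs-style topology path. Assumes one
// subdirectory per node.
std::size_t
count_topology_nodes(const std::filesystem::path& nodes_dir)
{
    std::error_code ec;

    // On WSL2 with DXG, or in a container without /sys, the path is absent,
    // so the function reports zero nodes instead of throwing.
    if(!std::filesystem::is_directory(nodes_dir, ec)) return 0;

    std::size_t count = 0;
    std::filesystem::directory_iterator end{};
    for(std::filesystem::directory_iterator it{nodes_dir, ec}; !ec && it != end;
        it.increment(ec))
    {
        // Only directory entries are counted as nodes.
        if(it->is_directory(ec)) ++count;
    }
    return count;
}
```

Calling count_topology_nodes("/sys/class/kfd/kfd/topology/nodes") on a system without that path returns 0, which corresponds to the "0 rocprofiler agents" side of the mismatch in step 4.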

Fix

Before the fatal check, detect when rocprofiler agents are empty (topology unavailable) and return early with a warning. Uses rocp_agents.empty() rather than re-checking the filesystem, which is both simpler and more general.
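
The shape of the check can be sketched as follows. This is an illustration under assumptions, not the code in agent.cpp: the Agent struct, the CacheResult enum, and the function name check_agent_counts are hypothetical, and the real implementation aborts through ROCP_FATAL_IF where the sketch returns Mismatch.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for an agent record; the real SDK types differ.
struct Agent
{
    int node_id;
};

// Hypothetical result type used only by this sketch.
enum class CacheResult
{
    Ready,                // counts agree, the cache can be built
    TopologyUnavailable,  // no rocprofiler agents: warn and degrade
    Mismatch              // genuine inconsistency (the fatal path in the real code)
};

CacheResult
check_agent_counts(const std::vector<Agent>& rocp_agents,
                   const std::vector<Agent>& hsa_agents)
{
    // Early return placed before the fatal check: an empty rocprofiler agent
    // list with a non-empty HSA agent list means topology was unavailable.
    if(rocp_agents.empty() && !hsa_agents.empty())
    {
        std::fprintf(stderr,
                     "rocprofiler agent discovery unavailable "
                     "(%zu rocprofiler agents vs %zu HSA agents)\n",
                     rocp_agents.size(),
                     hsa_agents.size());
        return CacheResult::TopologyUnavailable;
    }

    // Any other count mismatch is still treated as an error.
    if(rocp_agents.size() != hsa_agents.size()) return CacheResult::Mismatch;

    return CacheResult::Ready;
}
```

Testing the in-memory agent list rather than the filesystem means the same branch covers WSL2, containers without /sys, and any other environment where topology discovery returns nothing, while a partial mismatch such as 1 vs 2 agents still reaches the error path.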

Test Results (WSL2, AMD RX 9070 XT, ROCm 7.2.1)

New GTest:

[ RUN      ] rocprofiler_lib.agent_no_topology_no_crash
W agent.cpp:608] sysfs nodes path '/sys/class/kfd/kfd/topology/nodes' does not exist
W agent.cpp:1199] rocprofiler agent discovery unavailable (0 rocprofiler agents vs 2 HSA agents). Profiling features will be disabled.
[       OK ] rocprofiler_lib.agent_no_topology_no_crash (24 ms)
[  PASSED  ] 1 test.

End-to-end (before → after):

| Test | Before | After |
|---|---|---|
| rocprofv3 --hip-trace | SIGABRT (exit 134) | Warning + clean exit (0) |
| GPU compute kernel under profiler | SIGABRT (exit 134) | Correct result + clean exit (0) |
| HIP without profiler | Works | Works (no change) |
| Native Linux (sysfs exists) | Works | No behavioral change |
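
For readers wondering where "exit 134" comes from: common shells report a process killed by a signal as 128 plus the signal number, and SIGABRT is signal 6 on Linux. A tiny sketch of that arithmetic (the function name shell_exit_status is hypothetical):

```cpp
#include <csignal>

// Hypothetical helper: the exit status a shell such as bash reports for a
// process terminated by the given signal (128 + signal number).
constexpr int
shell_exit_status(int signal_number)
{
    return 128 + signal_number;
}
```

On Linux, shell_exit_status(SIGABRT) evaluates to 134, matching the "Before" column above.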

Related

🤖 Generated with Claude Code

…XG backend

On WSL2 with HSA_ENABLE_DXG_DETECTION=1, the HSA runtime discovers
agents via the Windows DXG kernel service, but /sys/class/kfd/kfd/
topology/nodes does not exist. read_topology() correctly returns 0
agents, but construct_agent_cache() fatally aborts because the HSA
agent count doesn't match the rocprofiler agent count:

  F agent.cpp:1192] Found 0 rocprofiler agents and 2 HSA agents.
  HSA agents contained 2 internal node ids not found by rocprofiler: 0, 1
  *** SIGABRT ***

This affects any program that triggers rocprofiler-sdk agent
enumeration on WSL2, including rocprofv3 and PyTorch with kineto.

Add a check before the ROCP_FATAL_IF: when rocprofiler agents are
empty but HSA agents were found, log a warning and return early
instead of aborting. This also covers containers without /sys mounted.

Add a GTest that validates construct_agent_cache() does not abort
when sysfs topology is unavailable (skips on systems with sysfs).

Tested on WSL2 (Windows 11, kernel 6.6.87.2-microsoft-standard-WSL2)
with AMD Radeon RX 9070 XT (gfx1201) and ROCm 7.2.1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>