[rocprofiler-sdk] Fix SIGABRT in construct_agent_cache on WSL2 with DXG backend#4935
Open
ChrisLundquist wants to merge 1 commit into ROCm:develop from
…XG backend

On WSL2 with HSA_ENABLE_DXG_DETECTION=1, the HSA runtime discovers agents via the Windows DXG kernel service, but /sys/class/kfd/kfd/topology/nodes does not exist. read_topology() correctly returns 0 agents, but construct_agent_cache() fatally aborts because the HSA agent count doesn't match the rocprofiler agent count:

    F agent.cpp:1192] Found 0 rocprofiler agents and 2 HSA agents.
    HSA agents contained 2 internal node ids not found by rocprofiler: 0, 1
    *** SIGABRT ***

This affects any program that triggers rocprofiler-sdk agent enumeration on WSL2, including rocprofv3 and PyTorch with kineto.

Add a check before the ROCP_FATAL_IF: when rocprofiler agents are empty but HSA agents were found, log a warning and return early instead of aborting. This also covers containers without /sys mounted.

Add a GTest that validates construct_agent_cache() does not abort when sysfs topology is unavailable (skips on systems with sysfs).

Tested on WSL2 (Windows 11, kernel 6.6.87.2-microsoft-standard-WSL2) with AMD Radeon RX 9070 XT (gfx1201) and ROCm 7.2.1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
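The commit message says the new test skips on systems where sysfs topology is present. A minimal sketch of how such an availability check could look is shown here; `has_kfd_topology` is a hypothetical helper name, and the actual test in this PR may decide to skip in a different way.

```cpp
#include <filesystem>
#include <system_error>

// Hypothetical helper (not part of rocprofiler-sdk): reports whether the KFD
// topology directory exists and contains at least one entry. The default path
// is the one quoted in the commit message.
bool has_kfd_topology(
    const std::filesystem::path& nodes_dir = "/sys/class/kfd/kfd/topology/nodes")
{
    namespace fs = std::filesystem;
    std::error_code ec;

    // A missing or unreadable directory counts as "no topology".
    if(!fs::is_directory(nodes_dir, ec) || ec) return false;

    // An existing but empty directory also yields zero agents.
    const bool empty = fs::is_empty(nodes_dir, ec);
    return !ec && !empty;
}
```

A test could call this helper first and skip when it returns true, so that the no-topology path is only exercised on systems such as WSL2 or containers without /sys mounted.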
Summary
- Fix SIGABRT in construct_agent_cache() when running on WSL2 with the DXG GPU passthrough backend (HSA_ENABLE_DXG_DETECTION=1)
- read_topology() returns 0 agents but the HSA runtime discovers agents; the agent count mismatch triggers ROCP_FATAL_IF → SIGABRT
- Add a GTest (agent_no_topology_no_crash) that validates the early-return path

Problem
Any program that triggers rocprofiler-sdk agent enumeration crashes on WSL2:
This affects rocprofv3 --hip-trace, PyTorch with kineto profiling, and any custom tool using rocprofiler-sdk on WSL2. It also affects containers without /sys mounted.

Root Cause
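1. read_topology() reads agents from /sys/class/kfd/kfd/topology/nodes, which does not exist on WSL2, so it returns 0 agents
2. construct_agent_cache() enumerates HSA agents via the runtime (works via DXG, finds CPU + GPU)
3. The two agent counts do not match, so ROCP_FATAL_IF → abort()

Fix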
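The early-return guard this PR adds can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in agent.cpp: `Agent`, `CacheResult`, and `construct_agent_cache_sketch` are hypothetical stand-ins, and the real function works with rocprofiler-sdk and HSA types and uses the project's own logging macros.

```cpp
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for an agent record; the real types differ.
struct Agent
{
    int node_id;
};

enum class CacheResult
{
    Built,
    SkippedNoTopology
};

// Sketch of the guard: an empty rocprofiler agent list while HSA reports
// agents is treated as "topology unavailable" and handled with a warning and
// an early return. Any other count mismatch remains fatal.
CacheResult construct_agent_cache_sketch(const std::vector<Agent>& rocp_agents,
                                         const std::vector<Agent>& hsa_agents)
{
    if(rocp_agents.empty() && !hsa_agents.empty())
    {
        std::fprintf(stderr,
                     "warning: no rocprofiler agents but %zu HSA agents; "
                     "skipping agent cache\n",
                     hsa_agents.size());
        return CacheResult::SkippedNoTopology;
    }

    if(rocp_agents.size() != hsa_agents.size())
    {
        std::fprintf(stderr,
                     "fatal: found %zu rocprofiler agents and %zu HSA agents\n",
                     rocp_agents.size(),
                     hsa_agents.size());
        std::abort();
    }

    return CacheResult::Built;
}
```

Checking the in-memory list rather than probing the filesystem means the same branch covers WSL2 and containers without /sys mounted.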
Before the fatal check, detect when rocprofiler agents are empty (topology unavailable) and return early with a warning. Uses rocp_agents.empty() rather than re-checking the filesystem, which is both simpler and more general.

Test Results (WSL2, AMD RX 9070 XT, ROCm 7.2.1)
New GTest:
End-to-end (before → after):
rocprofv3 --hip-trace

Related
🤖 Generated with Claude Code