Summary
`LaunchContextBuilder::argpack_ptrs` stores a raw `const ArgPack *` pointer without any ownership semantics. On the CUDA/LLVM backend, `Program::delete_argpack()` always deletes immediately because `LlvmProgramImpl` does not override `used_in_kernel()` (the base class returns `false`). If Python's garbage collector frees an `ArgPack` wrapper between `set_arg_argpack()` and the actual kernel launch, the CUDA kernel launcher dereferences a dangling pointer, causing wild writes and memory corruption.

This bug causes random SEGV / heap corruption in any workload that uses ArgPack with the CUDA backend, especially under GC pressure (many kernel launches, multi-threaded).
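The sequence in the summary can be modeled in miniature in pure Python. This is only a sketch: `FakeDeviceHeap`, `FakeArgPack`, and `FakeLaunchContext` are hypothetical stand-ins, not Taichi classes, and an integer handle plays the role of the raw `const ArgPack *`. A real dangling pointer corrupts memory silently; the model raises an error instead so the ordering problem is visible.

```python
import gc


class FakeDeviceHeap:
    """Hypothetical stand-in for native memory: maps handles to payloads."""

    def __init__(self):
        self._live = {}
        self._next = 1

    def alloc(self, payload):
        handle = self._next
        self._next += 1
        self._live[handle] = payload
        return handle

    def free(self, handle):
        del self._live[handle]

    def deref(self, handle):
        # Real code would read freed memory here; the model fails loudly.
        if handle not in self._live:
            raise RuntimeError(f"dangling handle {handle}")
        return self._live[handle]


class FakeArgPack:
    """Hypothetical wrapper whose destructor frees immediately (no in-use guard)."""

    def __init__(self, heap, payload):
        self._heap = heap
        self.handle = heap.alloc(payload)

    def __del__(self):
        self._heap.free(self.handle)


class FakeLaunchContext:
    """Stores a non-owning handle, like a raw pointer in argpack_ptrs."""

    def __init__(self, heap):
        self._heap = heap
        self.argpack_ptrs = {}

    def set_arg_argpack(self, arg_id, argpack):
        self.argpack_ptrs[arg_id] = argpack.handle  # no ownership taken

    def launch(self):
        return [self._heap.deref(h) for h in self.argpack_ptrs.values()]


def run(drop_wrapper_before_launch):
    heap = FakeDeviceHeap()
    ctx = FakeLaunchContext(heap)
    pack = FakeArgPack(heap, {"dt": 0.01})
    ctx.set_arg_argpack(0, pack)
    if drop_wrapper_before_launch:
        del pack      # last reference gone, destructor frees the payload
        gc.collect()  # only matters on interpreters without refcounting
    try:
        return ctx.launch()
    except RuntimeError as exc:
        return str(exc)
```

Calling `run(False)` returns the payload, while `run(True)` reports a dangling handle, because nothing ties the lifetime of the wrapper to the pending launch.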
Root Cause
1. Raw pointer storage without ownership
2. Dangling pointer dereference during kernel launch
Same pattern in `cpu/kernel_launcher.cpp:48-63`, `amdgpu/kernel_launcher.cpp:82-94`, and `gfx/runtime.cpp:486-490`.
3. `used_in_kernel()` guard is unimplemented on LLVM backends
Only `GfxProgramImpl` overrides this (`gfx_program.h:47-48`). `LlvmProgramImpl` inherits the base class, so `delete_argpack()` (`program.cpp:428-443`) always deletes immediately on CUDA/CPU/AMDGPU backends, regardless of pending kernel launches.
4. Python GC triggers immediate C++ destruction
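A minimal Python sketch of this step, assuming CPython's reference counting and using a hypothetical `TrackedArgPack` class rather than Taichi's real wrapper. The destructor runs as soon as the last reference disappears, which is the point where this issue says `delete_argpack()` is reached. Holding the object in a list, as the `tmps` list does for numpy arrays, postpones destruction until after the launch.

```python
import gc


class TrackedArgPack:
    """Hypothetical wrapper that records when its destructor runs."""

    def __init__(self, name, log):
        self.name = name
        self._log = log

    def __del__(self):
        # In the scenario described here, delete_argpack() would run at this point.
        self._log.append(f"destroyed {self.name}")


def launch_sequence(keep_alive):
    log = []
    tmps = []  # keep-alive list, modeled on the tmps list in kernel_impl.py
    pack = TrackedArgPack("pack0", log)
    log.append("set_arg_argpack")
    if keep_alive:
        tmps.append(pack)  # extra reference held until the launch is done
    del pack               # the caller's reference goes away
    gc.collect()
    log.append("launch")
    tmps.clear()           # safe to release once the launch has happened
    gc.collect()
    return log
```

Without the keep-alive reference, the log shows the destructor running between `set_arg_argpack` and `launch`; with it, destruction moves to after `launch`.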
The race
1. `set_arg_argpack()` → raw `&argpack` stored in `argpack_ptrs`
2. `__del__` → `delete_argpack()` → `used_in_kernel()` returns `false` → C++ ArgPack freed
3. Kernel launcher dereferences `argpack_ptrs[key]` → wild write / SEGV
Design inconsistency
`set_arg_ndarray()` follows a safe pattern: it copies the data pointer as an integer. `set_arg_argpack()` deviates from this by storing an object pointer (`&argpack`), which is unsafe.
Evidence
Observed in a multi-threaded MARL training workload using Genesis physics simulation with Taichi CUDA backend (2048 parallel envs). 12 independent crashes over weeks of debugging:
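- Random SEGV in Python GC (`tp_traverse` NULL, object type confusion: `range`→`code`, `dict`→`bool`)
- ASan (`LD_PRELOAD`) reported zero heap UAF in 19 hours, because the free happens inside Taichi's device allocator, not via system `free()`
- ASan + `PYTHONMALLOC=malloc`: ASan's own allocator metadata was corrupted by a wild write (CHECK failed: `rz_size=0x0`)
- Crash #12: faulthandler showed "Garbage-collecting" during `taichi.kernel_impl.launch_kernel`, which is direct evidence of GC during kernel launch
- Valgrind (serializes all threads): no crash in 10+ hours, consistent with a threading race

Genesis issue Genesis-Embodied-AI/Genesis#492 appears to be the same bug (segfault during `scene.step()` after 1000+ iterations, closed without root cause).
Workaround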
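Add the ArgPack Python object to the `tmps` GC-prevention list in `kernel_impl.py`, following the existing pattern used for numpy arrays (line 731: `tmps.append(tmp) # Purpose: DO NOT GC |tmp|!`). Inside `recursive_set_args()`, immediately after the call `launch_ctx.set_arg_argpack(indices, v._ArgPack__argpack)`, add `tmps.append(v)` so that the ArgPack cannot be collected while C++ holds the raw pointer.
Suggested Fix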
Option A (minimal): Apply the Python-side workaround above.
Option B (proper): Change `argpack_ptrs` to store `DeviceAllocation` by value instead of a raw object pointer, matching the safe pattern used by `set_arg_ndarray()`. This requires updating all kernel launchers. The nested argpack write-back (`set_arg_nested_argpack_ptr`) would need refactoring.
Option C (defense-in-depth): Implement `used_in_kernel()` in `LlvmProgramImpl` to actually track in-flight allocations, matching the existing GFX backend implementation.
Additional note: `array_ptrs` has the same pattern
`set_arg_ndarray()` stores `(void *)&ndarray_alloc_` (address of a member inside the Ndarray object) in `array_ptrs`. If the Ndarray is GC'd, the member address becomes invalid. The CUDA kernel launcher dereferences this at `cuda/kernel_launcher.cpp:110-112`.
Same vulnerability, though Ndarrays tend to be long-lived so it's less likely to trigger in practice.
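The difference between remembering where a member lives and copying its value at set time can be sketched as follows. This is a hedged model only: `FakeNdarray`, `ByAddressContext`, and `ByValueContext` are hypothetical names, and because a Python reference would keep the object alive, destruction is simulated with an explicit `destroy()` call that invalidates the member. Copying by value removes the host-side dangling read; it does not by itself keep the underlying device memory alive.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeNdarray:
    """Hypothetical stand-in for an Ndarray holding a device pointer member."""

    ndarray_alloc_: Optional[int]

    def destroy(self):
        # Simulates the object being freed: the member is no longer valid.
        self.ndarray_alloc_ = None


class ByAddressContext:
    """Remembers where the member lives and reads it only at launch time."""

    def __init__(self):
        self.array_ptrs = {}

    def set_arg_ndarray(self, arg_id, arr):
        self.array_ptrs[arg_id] = arr  # models storing &ndarray_alloc_

    def resolve(self, arg_id):
        return self.array_ptrs[arg_id].ndarray_alloc_


class ByValueContext:
    """Copies the member's value when the argument is set."""

    def __init__(self):
        self.array_ptrs = {}

    def set_arg_ndarray(self, arg_id, arr):
        self.array_ptrs[arg_id] = arr.ndarray_alloc_  # snapshot of the value

    def resolve(self, arg_id):
        return self.array_ptrs[arg_id]


def compare():
    arr = FakeNdarray(0xBEEF)
    by_addr, by_val = ByAddressContext(), ByValueContext()
    by_addr.set_arg_ndarray(0, arr)
    by_val.set_arg_ndarray(0, arr)
    arr.destroy()  # object goes away between set_arg and launch
    return by_addr.resolve(0), by_val.resolve(0)
```

In this model the by-address context resolves to an invalid value after `destroy()`, while the by-value context still holds the pointer it copied.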
Environment
Affected code (introduced in July 2023)
- `525682fc0`: `argpack_ptrs[arg_id] = &argpack`
- `29cfb5c72`: `delete_argpack()` with `used_in_kernel()` guard
- `cfad91fc8`: `argpacks_in_use_` tracking added to GFX only
- `22a32e3a7`: LLVM kernel launchers dereference `argpack_ptrs`