Add alloca intrinsic for per-workitem stack scratch #859
Conceptually fine, but I didn't realize we needed this.
Hmm, isn't this optimized away?
Yeah, I guess. How do we make sure Julia's
The crux is that `llvmcall` + `alloca` is UB, and I have seen LLVM delete the `alloca` and replace it with poison. I can dig up the old PR I had for that. My overarching goal is to remove the StaticArrays dependency that KernelAbstractions currently forces, and which you have complained about before, since it creates behavior where we have to hope for escape analysis to work. The rest is Claude obsessing over details. We have fixed `alloca` in Julia to be emitted correctly.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #859      +/-   ##
==========================================
- Coverage   80.32%   73.55%   -6.77%
==========================================
  Files          25       25
  Lines        4777     4773       -4
==========================================
- Hits         3837     3511     -326
- Misses        940     1262     +322
Introduce `GPUCompiler.alloca(::Type{T}, ::Val{N})::Ptr{T}`, which hands device
code a fixed-size, per-workitem stack scratch buffer for `N` elements of `T`.
This is meant to replace abstractions like KernelAbstractions' `@private`
`MArray`-backed scratchpad with a direct stack allocation. Emitting the `alloca`
through `llvmcall` directly is unsound/ineffective: the `Ptr` round-trip through
`ptrtoint`/`inttoptr` blocks SROA/mem2reg promotion, the target stack address
space (e.g. addrspace 5 on NVPTX/AMDGPU) isn't known at the front end, and the
LangRef lifetime of the `alloca` is tied to the inlined `llvmcall` wrapper.
Instead, the front end emits a `julia.gpu.alloca.<bytes>.<align>` intrinsic that
`lower_alloca!` (run from `irgen`, before the optimizer) materializes as a real
entry-block `alloca` in the datalayout's alloca address space, cast back to
generic. Running before optimization lets the slot be promoted just like the
mutable stack allocations Julia already emits. `T` must be `isbits`.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
116c36b to 09e210a
Summary

Adds `GPUCompiler.alloca(::Type{T}, ::Val{N})::Ptr{T}`, a primitive that hands device code a fixed-size, per-workitem stack scratch buffer for `N` elements of `T`. The intended use is to replace abstractions like KernelAbstractions' `@private` `MArray`-backed scratchpad with a direct stack allocation (see companion KA PR).

Why an intrinsic instead of `llvmcall` + `alloca`

Emitting the `alloca` directly through `llvmcall` is unsound/ineffective:

- `Ptr` round-trips through `ptrtoint`/`inttoptr` at the `llvmcall` boundary, so SROA/mem2reg can never promote the slot to registers, which is the whole point of small `@private` scratch.
- `alloca` must live in the datalayout's alloca address space (e.g. addrspace 5) and be `addrspacecast` to generic; this can't be written portably across the PTX/GCN/Metal/SPIR-V back-ends from Julia.
- `alloca`'s storage is freed when its function returns; in an `llvmcall` wrapper that's the wrapper, and correctness relies entirely on the inliner relocating a static entry-block alloca.

How it works

- The front end emits a `julia.gpu.alloca.<bytes>.<align>` intrinsic (modeled on `julia.gpu.debug_level`).
- `lower_alloca!`, run from `irgen` before the optimizer, materializes it as a real entry-block `alloca` in the datalayout's alloca address space (so SROA/mem2reg can promote it), cast back to generic for the returned `Ptr`. This produces the same IR shape Julia already emits for mutable stack allocations.
- `T` must be `isbits`; a zero-byte request returns a null pointer.

Testing

Adds a native-target testset asserting an entry-block `alloca [N x i8], align <align>`, full promotion when optimized, no surviving intrinsic, and the zero-byte → null path.

Verified end-to-end (including execution) through the POCL/SPIR-V back-end via the companion KernelAbstractions change: `@private Float32 (4,)` lowers to `alloca [16 x i8], align 4` in addrspace 0 (OpenCL `Function`/private) and runs correctly.

🤖 Generated with Claude Code
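
For readers following the `<bytes>.<align>` naming, here is a minimal sketch of how an intrinsic name could be derived from `T` and `N`. The helper `alloca_intrinsic_name` is hypothetical and not part of GPUCompiler. It assumes `<bytes>` is `sizeof(T) * N` and `<align>` is the natural alignment reported by `Base.datatype_alignment(T)`; the PR may compute either differently.

```julia
# Hypothetical helper (not GPUCompiler API): builds the intrinsic name for a
# scratch request of N elements of T, following the naming described above.
function alloca_intrinsic_name(::Type{T}, ::Val{N}) where {T, N}
    # The PR description requires `T` to be isbits.
    isbitstype(T) || throw(ArgumentError("alloca requires an isbits element type, got $T"))
    bytes = sizeof(T) * N               # assumption: total size in bytes
    align = Base.datatype_alignment(T)  # assumption: natural alignment of T
    return "julia.gpu.alloca.$(bytes).$(align)"
end

# Four Float32 elements: 16 bytes at 4-byte alignment.
println(alloca_intrinsic_name(Float32, Val(4)))
```

Under these assumptions the `Float32`, `Val(4)` request is consistent with the `alloca [16 x i8], align 4` reported for `@private Float32 (4,)` in the Testing notes.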