Add alloca intrinsic for per-workitem stack scratch #859

Open
vchuravy wants to merge 2 commits into main from vc/alloca_intrinsic

@vchuravy

Member

Summary

Adds GPUCompiler.alloca(::Type{T}, ::Val{N})::Ptr{T}, a primitive that hands device code a fixed-size, per-workitem stack scratch buffer for N elements of T. The intended use is to replace abstractions like KernelAbstractions' @private MArray-backed scratchpad with a direct stack allocation (see companion KA PR).

Why an intrinsic instead of llvmcall + alloca

Emitting the alloca directly through llvmcall is unsound/ineffective:

  1. Promotion is lost. A Ptr round-trips through ptrtoint/inttoptr at the llvmcall boundary, so SROA/mem2reg can never promote the slot to registers — which is the whole point of small @private scratch.
  2. Address space isn't known at the front end. On NVPTX/AMDGPU an alloca must live in the datalayout's alloca address space (e.g. addrspace 5) and be addrspacecast to generic; this can't be written portably across the PTX/GCN/Metal/SPIR-V back-ends from Julia.
  3. Lifetime. Per the LangRef, an alloca's storage is freed when its function returns; in an llvmcall wrapper that's the wrapper, and correctness relies entirely on the inliner relocating a static entry-block alloca.

How it works

  • The front end emits a declared-only julia.gpu.alloca.<bytes>.<align> intrinsic (modeled on julia.gpu.debug_level).
  • lower_alloca!, run from irgen before the optimizer, materializes it as a real entry-block alloca in the datalayout's alloca address space (so SROA/mem2reg can promote it), cast back to generic for the returned Ptr. This produces the same IR shape Julia already emits for mutable stack allocations.
  • T must be isbits; a zero-byte request returns a null pointer.
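The naming scheme and constraints above can be sketched in plain Julia. This is only an illustration under stated assumptions: `alloca_intrinsic_name` is a hypothetical helper that does not exist in GPUCompiler, the byte count is assumed to be `sizeof(T) * N`, and the alignment is assumed to be the natural alignment reported by the internal `Base.datatype_alignment`. The PR itself may compute these differently.

```julia
# Hypothetical helper, not part of GPUCompiler: derive the intrinsic name that
# a request for `n` elements of `T` would map to under the scheme above.
function alloca_intrinsic_name(::Type{T}, n::Integer) where {T}
    # The PR requires the element type to be isbits.
    isbitstype(T) || throw(ArgumentError("alloca element type must be isbits, got $T"))
    n >= 0 || throw(ArgumentError("element count must be non-negative, got $n"))

    # Assumption: the buffer size is the element size times the element count.
    bytes = sizeof(T) * n

    # A zero-byte request takes the null-pointer path, so no intrinsic is named.
    bytes == 0 && return nothing

    # Assumption: the slot uses the natural alignment of T.
    align = Base.datatype_alignment(T)
    return "julia.gpu.alloca.$(bytes).$(align)"
end

alloca_intrinsic_name(Float32, 4)  # "julia.gpu.alloca.16.4"
alloca_intrinsic_name(Float32, 0)  # nothing
```

Under these assumptions, four `Float32` elements give 16 bytes at alignment 4, which is consistent with the `alloca [16 x i8], align 4` reported in the testing notes.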

Testing

Adds a native-target testset asserting an entry-block alloca [N x i8], align <align>, full promotion when optimized, no surviving intrinsic, and the zero-byte → null path.

Verified end-to-end (including execution) through the POCL/SPIR-V back-end via the companion KernelAbstractions change: @private Float32 (4,) lowers to alloca [16 x i8], align 4 in addrspace 0 (OpenCL Function/private) and runs correctly.

🤖 Generated with Claude Code

@maleadt

maleadt commented Jun 22, 2026

Member

Conceptually fine, but I didn't realize we needed this.

  • Promotion is lost. A Ptr round-trips through ptrtoint/inttoptr at the llvmcall boundary, so SROA/mem2reg can never promote the slot to registers — which is the whole point of small @private scratch.

Hmm, isn't this optimized away?

  • Address space isn't known at the front end. On NVPTX/AMDGPU an alloca must live in the datalayout's alloca address space (e.g. addrspace 5) and be addrspacecast to generic; this can't be written portably across the PTX/GCN/Metal/SPIR-V back-ends from Julia.

Yeah, I guess. How do we make sure Julia's allocas are in the correct AS? Couldn't/Shouldn't we patch those up afterwards?

  • Lifetime. Per the LangRef, an alloca's storage is freed when its function returns; in an llvmcall wrapper that's the wrapper, and correctness relies entirely on the inliner relocating a static entry-block alloca.

llvmcall wrappers are alwaysinline; doesn't that suffice?

@vchuravy

Member Author

The crux is that llvmcall + alloca is UB and I have seen LLVM delete the alloca and replace it with poison. I can dig up the old PR I had for that.

My overarching goal is to remove the StaticArrays dependency that KernelAbstractions currently forces, and which you have complained about before, since it creates behavior where we have to hope for escape analysis to work.

The rest is Claude obsessing over details. We have fixed alloca in Julia to be emitted correctly.

Comment thread src/irgen.jl
@codecov

codecov Bot commented Jun 22, 2026

Codecov Report

❌ Patch coverage is 98.14815% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 73.55%. Comparing base (68883fe) to head (09e210a).

Files with missing lines Patch % Lines
src/irgen.jl 98.14% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #859      +/-   ##
==========================================
- Coverage   80.32%   73.55%   -6.77%     
==========================================
  Files          25       25              
  Lines        4777     4773       -4     
==========================================
- Hits         3837     3511     -326     
- Misses        940     1262     +322     

vchuravy and others added 2 commits June 23, 2026 09:00
Introduce `GPUCompiler.alloca(::Type{T}, ::Val{N})::Ptr{T}`, which hands device
code a fixed-size, per-workitem stack scratch buffer for `N` elements of `T`.

This is meant to replace abstractions like KernelAbstractions' `@private`
`MArray`-backed scratchpad with a direct stack allocation. Emitting the `alloca`
through `llvmcall` directly is unsound/ineffective: the `Ptr` round-trip through
`ptrtoint`/`inttoptr` blocks SROA/mem2reg promotion, the target stack address
space (e.g. addrspace 5 on NVPTX/AMDGPU) isn't known at the front end, and the
LangRef lifetime of the `alloca` is tied to the inlined `llvmcall` wrapper.

Instead, the front end emits a `julia.gpu.alloca.<bytes>.<align>` intrinsic that
`lower_alloca!` (run from `irgen`, before the optimizer) materializes as a real
entry-block `alloca` in the datalayout's alloca address space, cast back to
generic. Running before optimization lets the slot be promoted just like the
mutable stack allocations Julia already emits. `T` must be `isbits`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@vchuravy vchuravy force-pushed the vc/alloca_intrinsic branch from 116c36b to 09e210a Compare June 23, 2026 07:22