Add `alloca` intrinsic for per-workitem stack scratch by vchuravy · Pull Request #859 · JuliaGPU/GPUCompiler.jl

vchuravy · 2026-06-22T15:37:44Z

Summary

Adds `GPUCompiler.alloca(::Type{T}, ::Val{N})::Ptr{T}`, a primitive that hands device code a fixed-size, per-workitem stack scratch buffer for `N` elements of `T`. The intended use is to replace abstractions like KernelAbstractions' `@private` `MArray`-backed scratchpad with a direct stack allocation (see the companion KernelAbstractions PR).

Why an intrinsic instead of `llvmcall` + `alloca`

Emitting the alloca directly through llvmcall is unsound/ineffective:

Promotion is lost. A Ptr round-trips through ptrtoint/inttoptr at the llvmcall boundary, so SROA/mem2reg can never promote the slot to registers, which is the whole point of small @private scratch.
Address space isn't known at the front end. On NVPTX/AMDGPU an alloca must live in the datalayout's alloca address space (e.g. addrspace 5) and be addrspacecast to generic; this can't be written portably across the PTX/GCN/Metal/SPIR-V back-ends from Julia.
Lifetime. Per the LangRef, an alloca's storage is freed when its function returns; in an llvmcall wrapper that's the wrapper, and correctness relies entirely on the inliner relocating a static entry-block alloca.

How it works

The front end emits a declared-only julia.gpu.alloca.<bytes>.<align> intrinsic (modeled on julia.gpu.debug_level).
lower_alloca!, run from irgen before the optimizer, materializes it as a real entry-block alloca in the datalayout's alloca address space (so SROA/mem2reg can promote it), cast back to generic for the returned Ptr. This produces the same IR shape Julia already emits for mutable stack allocations.
T must be isbits; a zero-byte request returns a null pointer.
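
As a rough illustration of the `<bytes>.<align>` naming described above, the sketch below derives such a name from `T` and `N`. The helper `alloca_intrinsic_name` is hypothetical and not part of GPUCompiler, and using the natural alignment of `T` (via the internal `Base.datatype_alignment`) is an assumption about how the alignment might be chosen, not something the PR text confirms.

```julia
# Hypothetical helper, not part of GPUCompiler: it only illustrates how a
# `julia.gpu.alloca.<bytes>.<align>` name could be derived from T and N.
function alloca_intrinsic_name(::Type{T}, ::Val{N}) where {T, N}
    # The PR requires T to be isbits; isbitstype is the type-level check.
    isbitstype(T) || throw(ArgumentError("alloca requires an isbits element type, got $T"))
    bytes = sizeof(T) * N                # total size of the scratch buffer in bytes
    align = Base.datatype_alignment(T)   # assumption: natural alignment of T
    return "julia.gpu.alloca.$(bytes).$(align)"
end

alloca_intrinsic_name(Float32, Val(4))   # "julia.gpu.alloca.16.4"
```

For four `Float32` elements this gives 16 bytes at alignment 4, which matches the `alloca [16 x i8], align 4` shape reported in the Testing section.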

Testing

Adds a native-target testset asserting an entry-block alloca [N x i8], align <align>, full promotion when optimized, no surviving intrinsic, and the zero-byte → null path.

Verified end-to-end (including execution) through the POCL/SPIR-V back-end via the companion KernelAbstractions change: @private Float32 (4,) lowers to alloca [16 x i8], align 4 in addrspace 0 (OpenCL Function/private) and runs correctly.

🤖 Generated with Claude Code

maleadt · 2026-06-22T16:15:30Z

Conceptually fine, but I didn't realize we needed this.

> Promotion is lost. A Ptr round-trips through ptrtoint/inttoptr at the llvmcall boundary, so SROA/mem2reg can never promote the slot to registers, which is the whole point of small @private scratch.

Hmm, isn't this optimized away?

> Address space isn't known at the front end. On NVPTX/AMDGPU an alloca must live in the datalayout's alloca address space (e.g. addrspace 5) and be addrspacecast to generic; this can't be written portably across the PTX/GCN/Metal/SPIR-V back-ends from Julia.

Yeah, I guess. How do we make sure Julia's allocas are in the correct address space? Couldn't/shouldn't we patch those up afterwards?

> Lifetime. Per the LangRef, an alloca's storage is freed when its function returns; in an llvmcall wrapper that's the wrapper, and correctness relies entirely on the inliner relocating a static entry-block alloca.

llvmcall wrappers are alwaysinline; doesn't that suffice?

vchuravy · 2026-06-22T16:20:58Z

The crux is that llvmcall + alloca is UB, and I have seen LLVM delete the alloca and replace it with poison. I can dig up the old PR I had for that.

My overarching goal is to remove the StaticArrays dependency that KernelAbstractions currently forces, which you have complained about before, since it creates behavior where we have to hope for escape analysis to work.

The rest is Claude obsessing over details. We have fixed Julia's alloca to be emitted correctly.

codecov · 2026-06-22T21:29:56Z

Codecov Report

❌ Patch coverage is 98.14815% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 73.55%. Comparing base (68883fe) to head (09e210a).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/irgen.jl | 98.14% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #859      +/-   ##
==========================================
- Coverage   80.32%   73.55%   -6.77%     
==========================================
  Files          25       25              
  Lines        4777     4773       -4     
==========================================
- Hits         3837     3511     -326     
- Misses        940     1262     +322


Introduce `GPUCompiler.alloca(::Type{T}, ::Val{N})::Ptr{T}`, which hands device code a fixed-size, per-workitem stack scratch buffer for `N` elements of `T`. This is meant to replace abstractions like KernelAbstractions' `@private` `MArray`-backed scratchpad with a direct stack allocation.

Emitting the `alloca` through `llvmcall` directly is unsound/ineffective: the `Ptr` round-trip through `ptrtoint`/`inttoptr` blocks SROA/mem2reg promotion, the target stack address space (e.g. addrspace 5 on NVPTX/AMDGPU) isn't known at the front end, and the LangRef lifetime of the `alloca` is tied to the inlined `llvmcall` wrapper.

Instead, the front end emits a `julia.gpu.alloca.<bytes>.<align>` intrinsic that `lower_alloca!` (run from `irgen`, before the optimizer) materializes as a real entry-block `alloca` in the datalayout's alloca address space, cast back to generic. Running before optimization lets the slot be promoted just like the mutable stack allocations Julia already emits. `T` must be `isbits`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

vchuravy mentioned this pull request Jun 22, 2026

[pocl] Back @private Scratchpad with GPUCompiler.alloca JuliaGPU/KernelAbstractions.jl#714

Draft

vchuravy commented Jun 22, 2026

Comment thread src/irgen.jl

vchuravy and others added 2 commits June 23, 2026 09:00

cleanup

09e210a

vchuravy force-pushed the vc/alloca_intrinsic branch from 116c36b to 09e210a Compare June 23, 2026 07:22

vchuravy wants to merge 2 commits into main from vc/alloca_intrinsic
