Give workgroup barriers their memory-fence flags #587
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #587 +/- ##
==========================================
+ Coverage 80.91% 80.93% +0.02%
==========================================
Files 48 48
Lines 3238 3237 -1
==========================================
Hits 2620 2620
+ Misses 618 617 -1
MWEThe KernelAbstractions The
`barrier(0)` lowers to an `OpControlBarrier` with `SequentiallyConsistent` semantics but no storage-class bit, which the SPIR-V spec treats as ordering no memory. So shared-local (and global) writes are not guaranteed visible to other work-items after the barrier, which can silently drop updates (e.g. a workgroup local-atomic accumulation losing counts). Pass the appropriate fence flags so the barrier actually orders memory: `LOCAL_MEM_FENCE | GLOBAL_MEM_FENCE` for KA `@synchronize` (matching CUDA `__syncthreads`), and `LOCAL_MEM_FENCE` for the mapreduce `reduce_group` shared-memory tree.
b24db59 to bc34dfe
The matmul tiling example called `barrier()` (no method exists; the only signature is `barrier(flags)`) and demonstrated the unfenced pattern that orders no memory. Use `barrier(oneAPI.LOCAL_MEM_FENCE)` so the public example matches the corrected guidance and actually fences the local-memory tile writes.
06d4d30 to f572343
Summary
`barrier(0)` lowers to an `OpControlBarrier` with `SequentiallyConsistent` semantics but no storage-class bit, which the SPIR-V spec treats as ordering no memory. As a result, shared-local (and global) writes are not guaranteed visible to other work-items after the barrier, which can silently drop updates, e.g. a workgroup local-atomic accumulation losing counts.

This passes the appropriate fence flags so the barrier actually orders memory:

- `KA.__synchronize()` → `barrier(LOCAL_MEM_FENCE | GLOBAL_MEM_FENCE)`, matching CUDA `__syncthreads` semantics (`src/oneAPIKernels.jl`).
- `reduce_group` shared-memory reduction tree → `barrier(LOCAL_MEM_FENCE)` (`src/mapreduce.jl`).

Notes

- Latent correctness issue independent of the GPU stack (`barrier(0)` orders no memory on any conforming runtime).
- Verified the `LOCAL_MEM_FENCE`/`GLOBAL_MEM_FENCE` constants exist in SPIRVIntrinsics 1.0.0 (the version `main` pins).
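
For reviewers who want a concrete picture of the pattern that relies on the barrier ordering local memory, here is a minimal sketch of a KernelAbstractions kernel. It is not the MWE from this thread: the kernel name `group_sum!`, the workgroup size of 64, and the input sizes are all hypothetical, and it runs on the KA `CPU()` backend purely so it can be tried without an Intel GPU. On the oneAPI backend, the `@synchronize` below is the call that, per this PR, lowers to `barrier(LOCAL_MEM_FENCE | GLOBAL_MEM_FENCE)`.

```julia
using KernelAbstractions

# Hypothetical kernel: each workgroup sums its 64-element slice of `x`
# into out[group]. The tile size is an assumption for illustration.
@kernel function group_sum!(out, @Const(x))
    li = @index(Local, Linear)   # index of this work-item within its group
    gi = @index(Group, Linear)   # index of the workgroup
    i  = @index(Global, Linear)  # global index into `x`

    tile = @localmem eltype(out) (64,)  # one local-memory slot per work-item
    tile[li] = x[i]

    # Every tile write must be visible to the whole group before anyone
    # reads the tile. This only holds if the barrier fences local memory.
    @synchronize

    if li == 1
        acc = zero(eltype(out))
        for k in 1:64
            acc += tile[k]
        end
        out[gi] = acc
    end
end

backend = CPU()
x   = ones(Float32, 256)
out = zeros(Float32, 4)

# Workgroup size 64, so 256 elements form 4 workgroups.
group_sum!(backend, 64)(out, x; ndrange = length(x))
KernelAbstractions.synchronize(backend)
```

If the barrier did not order local memory, the read loop in work-item 1 could observe stale `tile` slots, which is the "silently dropped updates" failure mode described above.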