Skip to content

Give workgroup barriers their memory-fence flags#587

Merged
michel2323 merged 2 commits into
mainfrom
syncthreads-memory-fence
Jun 22, 2026
Merged

Give workgroup barriers their memory-fence flags#587
michel2323 merged 2 commits into
mainfrom
syncthreads-memory-fence

Conversation

@michel2323

@michel2323 michel2323 commented Jun 18, 2026

Copy link
Copy Markdown
Member

Summary

barrier(0) lowers to an OpControlBarrier with SequentiallyConsistent semantics but no storage-class bit, which the SPIR-V spec treats as ordering no memory. As a result, shared-local (and global) writes are not guaranteed visible to other work-items after the barrier, which can silently drop updates — e.g. a workgroup local-atomic accumulation losing counts.

This passes the appropriate fence flags so the barrier actually orders memory:

  • KA.__synchronize()barrier(LOCAL_MEM_FENCE | GLOBAL_MEM_FENCE), matching CUDA __syncthreads semantics (src/oneAPIKernels.jl).
  • the reduce_group shared-memory reduction tree → barrier(LOCAL_MEM_FENCE) (src/mapreduce.jl).

Notes

Latent correctness issue independent of the GPU stack (barrier(0) orders no memory on any conforming runtime). Verified the LOCAL_MEM_FENCE/GLOBAL_MEM_FENCE constants exist in SPIRVIntrinsics 1.0.0 (the version main pins).

@codecov

codecov Bot commented Jun 18, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.93%. Comparing base (8d49d4a) to head (f572343).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #587      +/-   ##
==========================================
+ Coverage   80.91%   80.93%   +0.02%     
==========================================
  Files          48       48              
  Lines        3238     3237       -1     
==========================================
  Hits         2620     2620              
+ Misses        618      617       -1     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@michel2323

Copy link
Copy Markdown
Member Author

MWE

The KernelAbstractions histogram example, reduced to the single case that fails without this patch: 512 inputs that all fall in the same bin (bin 2), groupsize 256. All 256 lanes hammer @atomic shared_histogram[2] += 1, then the flush reads shared_histogram[lid] after @synchronize(). Because barrier(0) orders no memory, the flush can read stale/partial SLM and counts are silently dropped ([0, 512] → e.g. [0, 496]).

The linear/rand histogram cases (bins spread out, no same-slot contention) always pass, so they're omitted here.

histogram_mwe.jl
using oneAPI, KernelAbstractions
using KernelAbstractions: @atomic

# Canonical KA histogram kernel — the textbook `@synchronize` user.
# `@synchronize()` lowered to `barrier(0)`, which orders no memory, so the SLM
# accumulation in `shared_histogram` was not guaranteed visible to the flush below.
@kernel unsafe_indices = true function histogram_kernel!(histogram_output, input)
    gid = @index(Group, Linear)
    lid = @index(Local, Linear)
    @uniform gs = prod(@groupsize())
    tid = (gid - 1) * gs + lid
    @uniform N = length(histogram_output)
    shared_histogram = @localmem eltype(input) (gs)
    for min_element in 1:gs:N
        @inbounds shared_histogram[lid] = 0
        @synchronize()
        max_element = min_element + gs
        if max_element > N
            max_element = N + 1
        end
        bin = tid <= length(input) ? input[tid] : 0
        if bin >= min_element && bin < max_element
            bin -= min_element - 1
            @atomic shared_histogram[bin] += 1
        end
        @synchronize()
        if ((lid + min_element - 1) <= N)
            @atomic histogram_output[lid + min_element - 1] += shared_histogram[lid]
        end
    end
end

function histogram!(out, input, groupsize = 256)
    backend = get_backend(out)
    histogram_kernel!(backend, (groupsize,))(out, input, ndrange = size(input))
    KernelAbstractions.synchronize(backend)
    return out
end

# Worst case for the missing fence: every lane atomically accumulates into the
# SAME slot (all 512 inputs are bin 2), groupsize 256. Expected histogram: [0, 512].
backend = oneAPIBackend()
fails = 0
for it in 1:100
    h = KernelAbstractions.zeros(backend, Int32, 2)
    histogram!(h, oneArray(fill(Int32(2), 512)), 256)
    got = Array(h)
    got == Int32[0, 512] || (global fails += 1; @info "iter $it dropped counts" got)
end
println("fail_two = $fails / 100   (0 = fixed)")

Run with julia --project histogram_mwe.jl. Without the fence flags this intermittently drops counts (previously observed ~6–19 failures / 100); with this patch it reports fail_two = 0 / 100.

`barrier(0)` lowers to an `OpControlBarrier` with `SequentiallyConsistent`
semantics but no storage-class bit, which the SPIR-V spec treats as
ordering no memory. So shared-local (and global) writes are not
guaranteed visible to other work-items after the barrier, which can
silently drop updates (e.g. a workgroup local-atomic accumulation losing
counts).

Pass the appropriate fence flags so the barrier actually orders memory:
`LOCAL_MEM_FENCE | GLOBAL_MEM_FENCE` for KA `@synchronize` (matching CUDA
`__syncthreads`), and `LOCAL_MEM_FENCE` for the mapreduce reduce_group
shared-memory tree.
@michel2323 michel2323 force-pushed the syncthreads-memory-fence branch from b24db59 to bc34dfe Compare June 22, 2026 14:51
@michel2323 michel2323 marked this pull request as ready for review June 22, 2026 15:03
The matmul tiling example called `barrier()` (no method exists; the only
signature is `barrier(flags)`) and demonstrated the unfenced pattern that
orders no memory. Use `barrier(oneAPI.LOCAL_MEM_FENCE)` so the public
example matches the corrected guidance and actually fences the local-memory
tile writes.
@michel2323 michel2323 force-pushed the syncthreads-memory-fence branch from 06d4d30 to f572343 Compare June 22, 2026 15:04
@michel2323 michel2323 enabled auto-merge (rebase) June 22, 2026 15:05
@michel2323 michel2323 disabled auto-merge June 22, 2026 15:05
@michel2323 michel2323 enabled auto-merge (squash) June 22, 2026 15:06
@michel2323 michel2323 disabled auto-merge June 22, 2026 15:06
@michel2323 michel2323 enabled auto-merge (squash) June 22, 2026 15:06
@michel2323 michel2323 merged commit 642790d into main Jun 22, 2026
5 checks passed
@michel2323 michel2323 deleted the syncthreads-memory-fence branch June 22, 2026 19:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant