Give workgroup barriers their memory-fence flags by michel2323 · Pull Request #587 · JuliaGPU/oneAPI.jl

michel2323 · 2026-06-18T15:46:33Z

Summary

barrier(0) lowers to an OpControlBarrier with SequentiallyConsistent semantics but no storage-class bit, which the SPIR-V spec treats as ordering no memory. As a result, shared-local (and global) writes are not guaranteed visible to other work-items after the barrier, which can silently drop updates — e.g. a workgroup local-atomic accumulation losing counts.

This passes the appropriate fence flags so the barrier actually orders memory:

KA.__synchronize() → barrier(LOCAL_MEM_FENCE | GLOBAL_MEM_FENCE), matching CUDA __syncthreads semantics (src/oneAPIKernels.jl).
the reduce_group shared-memory reduction tree → barrier(LOCAL_MEM_FENCE) (src/mapreduce.jl).

Notes

Latent correctness issue independent of the GPU stack (barrier(0) orders no memory on any conforming runtime). Verified the LOCAL_MEM_FENCE/GLOBAL_MEM_FENCE constants exist in SPIRVIntrinsics 1.0.0 (the version main pins).

codecov · 2026-06-18T18:22:54Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.93%. Comparing base (8d49d4a) to head (f572343).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #587      +/-   ##
==========================================
+ Coverage   80.91%   80.93%   +0.02%     
==========================================
  Files          48       48              
  Lines        3238     3237       -1     
==========================================
  Hits         2620     2620              
+ Misses        618      617       -1

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

michel2323 · 2026-06-22T14:39:26Z

MWE

The KernelAbstractions histogram example, reduced to the single case that fails without this patch: 512 inputs that all fall in the same bin (bin 2), groupsize 256. All 256 lanes hammer @atomic shared_histogram[2] += 1, then the flush reads shared_histogram[lid] after @synchronize(). Because barrier(0) orders no memory, the flush can read stale/partial SLM and counts are silently dropped ([0, 512] → e.g. [0, 496]).

The linear/rand histogram cases (bins spread out, no same-slot contention) always pass, so they're omitted here.

histogram_mwe.jl

using oneAPI, KernelAbstractions
using KernelAbstractions: @atomic

# Canonical KA histogram kernel — the textbook `@synchronize` user.
# `@synchronize()` lowered to `barrier(0)`, which orders no memory, so the SLM
# accumulation in `shared_histogram` was not guaranteed visible to the flush below.
@kernel unsafe_indices = true function histogram_kernel!(histogram_output, input)
    gid = @index(Group, Linear)
    lid = @index(Local, Linear)
    @uniform gs = prod(@groupsize())
    tid = (gid - 1) * gs + lid
    @uniform N = length(histogram_output)
    shared_histogram = @localmem eltype(input) (gs)
    for min_element in 1:gs:N
        @inbounds shared_histogram[lid] = 0
        @synchronize()
        max_element = min_element + gs
        if max_element > N
            max_element = N + 1
        end
        bin = tid <= length(input) ? input[tid] : 0
        if bin >= min_element && bin < max_element
            bin -= min_element - 1
            @atomic shared_histogram[bin] += 1
        end
        @synchronize()
        if ((lid + min_element - 1) <= N)
            @atomic histogram_output[lid + min_element - 1] += shared_histogram[lid]
        end
    end
end

function histogram!(out, input, groupsize = 256)
    backend = get_backend(out)
    histogram_kernel!(backend, (groupsize,))(out, input, ndrange = size(input))
    KernelAbstractions.synchronize(backend)
    return out
end

# Worst case for the missing fence: every lane atomically accumulates into the
# SAME slot (all 512 inputs are bin 2), groupsize 256. Expected histogram: [0, 512].
backend = oneAPIBackend()
fails = 0
for it in 1:100
    h = KernelAbstractions.zeros(backend, Int32, 2)
    histogram!(h, oneArray(fill(Int32(2), 512)), 256)
    got = Array(h)
    got == Int32[0, 512] || (global fails += 1; @info "iter $it dropped counts" got)
end
println("fail_two = $fails / 100   (0 = fixed)")

Run with julia --project histogram_mwe.jl. Without the fence flags this intermittently drops counts (previously observed ~6–19 failures / 100); with this patch it reports fail_two = 0 / 100.

`barrier(0)` lowers to an `OpControlBarrier` with `SequentiallyConsistent` semantics but no storage-class bit, which the SPIR-V spec treats as ordering no memory. So shared-local (and global) writes are not guaranteed visible to other work-items after the barrier, which can silently drop updates (e.g. a workgroup local-atomic accumulation losing counts). Pass the appropriate fence flags so the barrier actually orders memory: `LOCAL_MEM_FENCE | GLOBAL_MEM_FENCE` for KA `@synchronize` (matching CUDA `__syncthreads`), and `LOCAL_MEM_FENCE` for the mapreduce reduce_group shared-memory tree.

The matmul tiling example called `barrier()` (no method exists; the only signature is `barrier(flags)`) and demonstrated the unfenced pattern that orders no memory. Use `barrier(oneAPI.LOCAL_MEM_FENCE)` so the public example matches the corrected guidance and actually fences the local-memory tile writes.

michel2323 force-pushed the syncthreads-memory-fence branch from b24db59 to bc34dfe Compare June 22, 2026 14:51

michel2323 marked this pull request as ready for review June 22, 2026 15:03

michel2323 force-pushed the syncthreads-memory-fence branch from 06d4d30 to f572343 Compare June 22, 2026 15:04

michel2323 enabled auto-merge (rebase) June 22, 2026 15:05

michel2323 disabled auto-merge June 22, 2026 15:05

michel2323 enabled auto-merge (squash) June 22, 2026 15:06

michel2323 disabled auto-merge June 22, 2026 15:06

michel2323 enabled auto-merge (squash) June 22, 2026 15:06

michel2323 merged commit 642790d into main Jun 22, 2026
5 checks passed

michel2323 deleted the syncthreads-memory-fence branch June 22, 2026 19:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Give workgroup barriers their memory-fence flags#587

Give workgroup barriers their memory-fence flags#587
michel2323 merged 2 commits into
mainfrom
syncthreads-memory-fence

michel2323 commented Jun 18, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 18, 2026 •

edited

Loading

Uh oh!

michel2323 commented Jun 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

michel2323 commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Notes

Uh oh!

codecov Bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

michel2323 commented Jun 22, 2026

MWE

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

michel2323 commented Jun 18, 2026 •

edited

Loading

codecov Bot commented Jun 18, 2026 •

edited

Loading