GH-50026: [C++][Parquet] SIMD-accelerate SBBF probe via branchless autovec by dmatth1 · Pull Request #50030 · apache/arrow

dmatth1 · 2026-05-24T15:55:37Z

Rationale for this change

BlockSplitBloomFilter::FindHash ships the scalar reference probe — an 8-iteration short-circuit loop. The short-circuit blocks autovectorization, and on miss-heavy workloads (Parquet row-group skipping) the per-lane branch-mispredict dominates probe latency.

Closes #50026. Dev list discussion: https://lists.apache.org/thread/omof0fq47tndfd80g5hwp2bvjmzvpb40. Sibling change in Rust: apache/arrow-rs#10011.

What changes are included in this PR?

Rewrite FindHash as a branchless OR-accumulator reduction. The new shape autovectorizes to SSE on x86 and NEON on aarch64 at the baseline.
Add bloom_filter_avx2.cc (xsimd kernel built with -mavx2) behind CpuInfo-based DynamicDispatch, mirroring the existing level_comparison_avx2 pattern. xsimd was a requirement from the dev thread; the AVX2 target spells the reduction explicitly because gcc/MSVC don't lower the autovec body to a single vptest.
No on-disk format change, no public API change, and bit-identical to the scalar reference.
Insert path uses the same loop shape and will land in a follow-up PR.
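
The difference between the two loop shapes can be sketched as below. This is a minimal illustration, not the code in this PR: the function names are hypothetical, and the salt constants and the `>> 27` lane-bit selection are assumed from the Parquet split-block Bloom filter specification.

```cpp
#include <cstdint>

// Assumption: 8 x 32-bit words per block, salts as in the Parquet SBBF spec.
constexpr int kLanes = 8;
constexpr uint32_t kSalt[kLanes] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU,
                                    0xa2b7289dU, 0x705495c7U, 0x2df1424bU,
                                    0x9efc4947U, 0x5c6bfb31U};

// One bit per lane, selected by the top 5 bits of key * salt.
inline uint32_t LaneMask(uint32_t key, int lane) {
  return UINT32_C(1) << ((key * kSalt[lane]) >> 27);
}

// Short-circuit shape: returns at the first lane whose bit is missing.
inline bool ProbeShortCircuit(const uint32_t* block, uint32_t key) {
  for (int i = 0; i < kLanes; ++i) {
    if ((block[i] & LaneMask(key, i)) == 0) return false;
  }
  return true;
}

// Branchless shape: OR together the missing bits, test once at the end.
inline bool ProbeBranchless(const uint32_t* block, uint32_t key) {
  uint32_t miss = 0;
  for (int i = 0; i < kLanes; ++i) {
    miss |= ~block[i] & LaneMask(key, i);
  }
  return miss == 0;
}

// Helper for experimenting: sets the key's bit in every lane.
inline void InsertKey(uint32_t* block, uint32_t key) {
  for (int i = 0; i < kLanes; ++i) block[i] |= LaneMask(key, i);
}
```

Both functions return the same answer for every input; only the second has a fixed trip count with no data-dependent exit, which is the property a vectorizer needs.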

Performance

End-to-end FindHash via parquet/benches/bloom_filter_benchmark.cc.

M1 (Apple clang -O3, NEON via autovec, 10 reps, CV ≤ 0.4%):

Bench	upstream/main	this PR	Speedup
`BM_FindExistingHash` (hit-heavy)	3.85 ns/probe	2.41 ns/probe	1.60×
`BM_FindNonExistingHash` (miss-heavy)	9.04 ns/probe	2.41 ns/probe	3.75×

x86-64 (gcc 13.3, -O2 -mavx2 via AVX2 dispatch TU, 5 reps, CV ≤ 0.6%):

Bench	upstream/main	this PR	Speedup
`BM_FindExistingHash` (hit-heavy)	8.62 ns/probe	4.32 ns/probe	2.00×
`BM_FindNonExistingHash` (miss-heavy)	15.29 ns/probe	4.33 ns/probe	3.53×

The scalar miss path stalls on the data-dependent early-exit (slower than its own hit path on both archs); the branchless reduction is constant-time across hit/miss. InsertHash, BatchInsertHash, ComputeHash, BatchComputeHash all unchanged (16 benches within ±0.6%, inside CV).

Cache regime sweep: scalar vs xsimd, post-hash probe latency:

Regime	scalar	xsimd	Speedup
Small in-cache (0.5 MiB)	12.35 ns	2.48 ns	5.0×
Medium out-of-L3 (128 MiB)	18.40 ns	7.41 ns	2.5×
Large deep DRAM (1 GiB)	31.05 ns	22.10 ns	1.4×
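
For context on why latency grows with filter size: each probe touches exactly one 32-byte block, chosen from the upper half of the hash, so a larger filter spreads probes over more cache lines. The sketch below shows the block count per regime and the block selection. The multiply-shift formula and the helper names are assumptions based on the Parquet specification, not text quoted from this PR.

```cpp
#include <cstdint>

constexpr uint64_t kBytesPerBlock = 32;  // 8 lanes x 4 bytes

// Number of 256-bit blocks in a filter of the given size.
constexpr uint64_t NumBlocks(uint64_t filter_bytes) {
  return filter_bytes / kBytesPerBlock;
}

// Upper 32 bits of the hash pick the block via multiply-shift (no modulo).
// Assumes num_blocks fits in 32 bits so the product cannot overflow.
constexpr uint32_t BlockIndex(uint64_t hash, uint64_t num_blocks) {
  return static_cast<uint32_t>(((hash >> 32) * num_blocks) >> 32);
}

// Lower 32 bits of the hash are the key probed inside the block.
constexpr uint32_t BlockKey(uint64_t hash) {
  return static_cast<uint32_t>(hash);
}

// Block counts for the three regimes in the table above.
static_assert(NumBlocks(512ULL * 1024) == 16384, "0.5 MiB");
static_assert(NumBlocks(128ULL * 1024 * 1024) == 4194304, "128 MiB");
static_assert(NumBlocks(1024ULL * 1024 * 1024) == 33554432, "1 GiB");
```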

Branchless body alone (no xsimd kernel) on AVX2:

on clang -mavx2 it's within noise of the hand-written xsimd kernel in every regime
on gcc it matches except ~0.79× of xsimd in the out-of-L3 regime.
That gap is why this PR ships a separate xsimd kernel for the AVX2 TU rather than relying on autovec alone — on clang-only builds the xsimd kernel is essentially a no-op but on gcc/MSVC it pins the vptest lowering.
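
To make "pins the vptest lowering" concrete, here is a sketch of the same reduction written with raw AVX2 intrinsics instead of xsimd. It is not the kernel in this PR (which uses xsimd); the function and array names are hypothetical, and the salt values are assumed from the Parquet specification. The scalar fallback keeps the snippet compilable when the translation unit is not built with AVX2 enabled.

```cpp
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Assumption: salts as in the Parquet SBBF spec.
constexpr uint32_t kSaltSketch[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU,
                                     0xa2b7289dU, 0x705495c7U, 0x2df1424bU,
                                     0x9efc4947U, 0x5c6bfb31U};

inline bool ProbeBlockExplicit(const uint32_t* block, uint32_t key) {
#if defined(__AVX2__)
  const __m256i salt =
      _mm256_loadu_si256(reinterpret_cast<const __m256i*>(kSaltSketch));
  const __m256i words =
      _mm256_loadu_si256(reinterpret_cast<const __m256i*>(block));
  // Per-lane shift amount: top 5 bits of key * salt.
  const __m256i prod =
      _mm256_mullo_epi32(_mm256_set1_epi32(static_cast<int>(key)), salt);
  const __m256i shift = _mm256_srli_epi32(prod, 27);
  const __m256i mask = _mm256_sllv_epi32(_mm256_set1_epi32(1), shift);
  // vptest: testc returns 1 exactly when (~words & mask) is all zero,
  // i.e. every mask bit is present in the block.
  return _mm256_testc_si256(words, mask) != 0;
#else
  // Portable fallback with the same result.
  uint32_t miss = 0;
  for (int i = 0; i < 8; ++i) {
    miss |= ~block[i] & (UINT32_C(1) << ((key * kSaltSketch[i]) >> 27));
  }
  return miss == 0;
#endif
}
```

Spelling the final test as a single `testc` removes the compiler's freedom to pick a longer horizontal reduction, which is the role the xsimd kernel plays in the AVX2 TU.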

Are these changes tested?

Yes. New BloomFilterProbeKernel test calls both dispatch targets directly across 20K random blocks + 200 production-populated blocks per CI run, asserting bit-identical output. DynamicDispatch resolves once at static init, so without this test the un-picked target would never be exercised in CI.
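
The idea of a cross-target agreement check can be sketched as follows. This is an illustration under stated assumptions, not the BloomFilterProbeKernel test itself: the two probe functions stand in for the baseline and AVX2 targets, and all names other than the `(block, salt, key)` signature are hypothetical.

```cpp
#include <cstdint>
#include <random>

using ProbeFn = bool (*)(const uint32_t* block, const uint32_t* salt,
                         uint32_t key);

inline uint32_t LaneBit(const uint32_t* salt, uint32_t key, int lane) {
  return UINT32_C(1) << ((key * salt[lane]) >> 27);
}

// Stand-in for target A: early-exit reference.
inline bool ProbeEarlyExit(const uint32_t* block, const uint32_t* salt,
                           uint32_t key) {
  for (int i = 0; i < 8; ++i) {
    if ((block[i] & LaneBit(salt, key, i)) == 0) return false;
  }
  return true;
}

// Stand-in for target B: OR-accumulator reduction.
inline bool ProbeAccumulate(const uint32_t* block, const uint32_t* salt,
                            uint32_t key) {
  uint32_t miss = 0;
  for (int i = 0; i < 8; ++i) miss |= ~block[i] & LaneBit(salt, key, i);
  return miss == 0;
}

// Calls both targets directly on random blocks; returns how many checks fail.
inline int CountDisagreements(ProbeFn a, ProbeFn b, const uint32_t* salt,
                              int num_blocks, uint32_t seed) {
  std::mt19937 rng(seed);
  int mismatches = 0;
  for (int n = 0; n < num_blocks; ++n) {
    uint32_t block[8];
    for (uint32_t& word : block) word = static_cast<uint32_t>(rng());
    const uint32_t key = static_cast<uint32_t>(rng());
    // Random block: usually a miss, occasionally a hit.
    if (a(block, salt, key) != b(block, salt, key)) ++mismatches;
    // Force a hit by setting the key's bits; both targets must report true.
    for (int i = 0; i < 8; ++i) block[i] |= LaneBit(salt, key, i);
    if (!a(block, salt, key) || !b(block, salt, key)) ++mismatches;
  }
  return mismatches;
}
```

Calling the targets through explicit function pointers, rather than through the dispatcher, is what lets a single host exercise a kernel the dispatcher would not have picked.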

Existing BasicTest, FPPTest, and CompatibilityTest continue to pass on both the scalar baseline and the AVX2 dispatch path.

Are there any user-facing changes?

No. Read-path implementation change only.

GitHub Issue: [C++][Parquet] SIMD-accelerate the SBBF probe in BlockSplitBloomFilter::FindHash #50026

…ess autovec

Rewrite BlockSplitBloomFilter::FindHash from a short-circuit early-exit loop to a branchless OR-accumulator reduction. The early `return false` blocked compilers from collapsing the 8-lane probe to a horizontal block test; the reduction autovectorizes to a single SSE/NEON block test on clang, gcc, and MSVC.

Wire the probe through CpuInfo runtime dispatch, mirroring the existing level_comparison_avx2 pattern. The shared body in bloom_filter_block_inc.h is built once at the baseline (SSE on x86, NEON on aarch64) and once in bloom_filter_avx2.cc compiled with `-mavx2`. The AVX2 TU spells the reduction in xsimd rather than relying on autovec: clang lowers the autovec body to a single vptest, but gcc/MSVC emit a longer horizontal vpor reduction that costs ~20% out-of-L3. xsimd is guaranteed available under ARROW_HAVE_RUNTIME_AVX2.

A new cross-target diff test calls both probe bodies directly across 20K random + 200 production-populated blocks per CI run, so neither path can silently drift. A static_assert ties the 8-lane assumption to BlockSplitBloomFilter::kBitsSetPerBlock.

On-disk format unchanged. SALT, XXH64, bucket index unchanged. Bit-identical to the scalar reference.

End-to-end FindHash perf via parquet/benches/bloom_filter_benchmark.cc.

M1 (Apple clang -O3, NEON via autovec, 10 reps, CV<=0.4%):

| Bench | upstream/main (scalar) | simd-sbbf-autovec | Speedup |
|-------------------------------------|---------------------------|---------------------------|---------|
| BM_FindExistingHash (hit-heavy) | 3.85 ns/probe (259.6 M/s) | 2.41 ns/probe (415.1 M/s) | 1.60x |
| BM_FindNonExistingHash (miss-heavy) | 9.04 ns/probe (110.6 M/s) | 2.41 ns/probe (415.4 M/s) | 3.75x |

x86-64 (gcc 13.3, -O2 -mavx2 via AVX2 dispatch TU, 5 reps, CV<=0.6%):

| Bench | upstream/main (scalar) | simd-sbbf-autovec | Speedup |
|-------------------------------------|---------------------------|---------------------------|---------|
| BM_FindExistingHash (hit-heavy) | 8.62 ns/probe (116.0 M/s) | 4.32 ns/probe (231.6 M/s) | 2.00x |
| BM_FindNonExistingHash (miss-heavy) | 15.29 ns/probe (65.4 M/s) | 4.33 ns/probe (230.8 M/s) | 3.53x |

The scalar miss path stalls on the data-dependent early-exit (slower than its own hit path on both archs); the branchless reduction is constant-time across hit and miss. Miss-heavy is the common case for Parquet row-group skipping.

Insert/ComputeHash/batch paths unchanged (16 benches within +/-0.6%). Cache-regime sweep in the PR description. Insert path uses the same loop shape and follows in a separate PR.

github-actions · 2026-05-24T15:56:03Z

⚠️ GitHub issue #50026 has been automatically assigned in GitHub to PR creator.

Copilot

Pull request overview

This PR accelerates Parquet’s BlockSplitBloomFilter::FindHash probe by reshaping the scalar short-circuit loop into a branchless reduction that autovectorizes, and by adding an AVX2 runtime-dispatched probe kernel for x86 targets.

Changes:

Rework BlockSplitBloomFilter::FindHash to call a dispatchable per-block probe (FindHashBlockImpl) implemented as a branchless OR-accumulator reduction.
Add an AVX2-specific probe implementation in a separate translation unit (bloom_filter_avx2.cc) using xsimd, wired through DynamicDispatch.
Add a kernel agreement test that compares baseline vs AVX2 implementations on AVX2-capable hosts.
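
The runtime-dispatch pattern described above can be sketched roughly as below. This is not Arrow's DynamicDispatch: every name is hypothetical, the AVX2 target is a stand-in that forwards to the baseline (in the PR it lives in its own translation unit built with `-mavx2`), and the function pointer is resolved on first use rather than at static init.

```cpp
#include <cstdint>

using ProbeFn = bool (*)(const uint32_t* block, const uint32_t* salt,
                         uint32_t key);

// Baseline target: branchless reduction, left to the autovectorizer.
inline bool ProbeBaseline(const uint32_t* block, const uint32_t* salt,
                          uint32_t key) {
  uint32_t miss = 0;
  for (int i = 0; i < 8; ++i) {
    miss |= ~block[i] & (UINT32_C(1) << ((key * salt[i]) >> 27));
  }
  return miss == 0;
}

// Stand-in for the AVX2 target (hypothetical; forwards to the baseline here).
inline bool ProbeAvx2StandIn(const uint32_t* block, const uint32_t* salt,
                             uint32_t key) {
  return ProbeBaseline(block, salt, key);
}

// Hypothetical CPU check; Arrow uses CpuInfo for this.
inline bool HostHasAvx2() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
  return __builtin_cpu_supports("avx2") != 0;
#else
  return false;
#endif
}

// Picks the best target available on the running host.
inline ProbeFn ResolveProbe() {
  return HostHasAvx2() ? ProbeAvx2StandIn : ProbeBaseline;
}

// Public entry point: resolves the target once, then reuses it.
inline bool Probe(const uint32_t* block, const uint32_t* salt, uint32_t key) {
  static const ProbeFn fn = ResolveProbe();
  return fn(block, salt, key);
}
```

Because the choice is made once per process, only one target ever runs through `Probe` on a given host, which is why the agreement test calls the targets directly.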

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.


File	Description
cpp/src/parquet/CMakeLists.txt	Adds `bloom_filter_avx2.cc` to Parquet sources under runtime-AVX2 builds and applies AVX2 compile flags.
cpp/src/parquet/bloom_filter.cc	Introduces `DynamicDispatch` plumbing and routes `FindHash` through the new per-block probe kernels.
cpp/src/parquet/bloom_filter_test.cc	Adds an AVX2-only cross-kernel agreement test and includes the baseline/AVX2 probe entrypoints.
cpp/src/parquet/bloom_filter_block_inc.h	New header containing the baseline branchless per-block probe implementation.
cpp/src/parquet/bloom_filter_avx2.cc	New AVX2 probe kernel implementation using xsimd.
cpp/src/parquet/bloom_filter_avx2_internal.h	New internal header declaring the AVX2 probe entrypoint (exported for Windows/MinGW test usage).


Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

mapleFU · 2026-05-25T02:07:55Z

What would the benchmark results be for the branchless code alone, with and without AVX2?
How does the probe speed change as the bloom filter size changes?

I wonder how the AVX2 path ends up faster than the scalar path 🤔

dmatth1 · 2026-05-25T03:57:02Z

Branchless body alone (no xsimd kernel) on AVX2:

on clang -mavx2 it's within noise of the hand-written xsimd kernel in every regime
on gcc it matches except ~0.79× of xsimd in the out-of-L3 regime.
That gap is why this PR ships a separate xsimd kernel for the AVX2 TU rather than relying on autovec alone — on clang-only builds the xsimd kernel is essentially a no-op but on gcc/MSVC it pins the vptest lowering.

Cache regime sweep: scalar vs xsimd, post-hash probe latency:

Regime	scalar	xsimd	Speedup
Small in-cache (0.5 MiB)	12.35 ns	2.48 ns	5.0×
Medium out-of-L3 (128 MiB)	18.40 ns	7.41 ns	2.5×
Large deep DRAM (1 GiB)	31.05 ns	22.10 ns	1.4×

These numbers use the as_batch_bool xsimd form (~1 cycle faster in-cache than the shipped miss != 0 spelling; out-of-cache regimes are unchanged) and are post-hash only (XXH64 excluded), so absolute values don't compare directly to the end-to-end commit-body table. The regime shape (biggest gain in-cache, smallest in DRAM) holds for the shipped form.

Can re-bench in-tree with the commit if you want directly-comparable numbers.

AntoinePrv

If we add an xsimd implementation, I wonder if it is worth using it for Neon/SSE.

On the one hand the current autovec works and is minimal to maintain/test.
On the other hand autovec is a black box.

Though with a bit more work, the xsimd implementation could be generic and also support AVX512, SVE, and future targets too.

I have no intuition how xsimd compares to autovec. Given the compiler also optimizes xsimd's code, I'd say slightly better, but again it's possible (and it has been the case) some things are not properly expressed in xsimd as well.

dmatth1 · 2026-05-29T11:30:30Z

> If we add an xsimd implementation, I wonder if it is worth using it for Neon/SSE.
>
> On the one hand the current autovec works and is minimal to maintain/test.
>
> On the other hand autovec is a black box.
>
> Though with a bit more work, the xsimd implementation could be generic and also support AVX512, SVE, and future targets too.
>
> I have no intuition how xsimd compares to autovec. Given the compiler also optimizes xsimd's code, I'd say slightly better, but again it's possible (and it has been the case) some things are not properly expressed in xsimd as well.

Measured a microbenchmark on my M1 macbook, probe-only (no hash) and in-cache:

clang autovec and xsimd are about the same in performance
gcc 15 xsimd was 3x faster than autovec

So I think there's a real argument to be made for using xsimd on Neon. We could use the dispatch array. Will increase the scope of this change a bit and I might lean towards addressing in a follow-up (I can create an issue) but whatever you guys think is best.

mapleFU · 2026-05-29T13:32:16Z

This is generally OK to me. Can you add the numbers for the x86 code without AVX2 to the PR description? And also the small in-cache / medium / ... numbers?

dmatth1 · 2026-05-29T15:13:32Z

@mapleFU Done!

- Dispatch guard: defensive ARROW_HAVE_AVX2 || ARROW_HAVE_RUNTIME_AVX2
- AVX2 kernel: generic xsimd batch<uint32_t> + bitwise_rshift<27>
- Drop PARQUET_IMPL_NAMESPACE/ODR machinery; rename block header to *_impl_internal so it is not installed
- Collapse namespaces; comment + test cleanups (kBitsSetPerBlock, drop alignas)

pitrou · 2026-06-09T14:41:01Z

+// Mirror of BlockSplitBloomFilter::kBitsSetPerBlock (private to the class).
+constexpr int kBitsSetPerBlock = 8;


Well, why not make it public in the class? This would avoid hard-coding it in multiple places (in cpp/src/parquet/bloom_filter_block_impl_internal.h as well).

pitrou · 2026-06-09T14:43:01Z

@HuaHuaY Since you are working on bloom filters too, would you like to review this PR?

HuaHuaY · 2026-06-09T15:20:10Z

@@ -195,7 +195,11 @@ set(PARQUET_SRCS

 if(ARROW_HAVE_RUNTIME_AVX2)


Should we add ARROW_HAVE_AVX2 here? There is the following code in bloom_filter.cc:

#if defined(ARROW_HAVE_AVX2) || defined(ARROW_HAVE_RUNTIME_AVX2)
#  include "parquet/bloom_filter_avx2_internal.h"
#endif

HuaHuaY · 2026-06-09T15:22:19Z

+
+#include "parquet/bloom_filter_block_impl_internal.h"
+
+#include "arrow/util/dispatch_internal.h"


I suggest adjusting the include order. Can we move line 44 to line 26?

HuaHuaY · 2026-06-09T15:35:01Z

+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.


I recommend adding a newline.

HuaHuaY · 2026-06-09T15:48:42Z

+
+namespace parquet::internal {
+
+inline bool FindHashBlockImpl(const uint32_t* block, const uint32_t* salt, uint32_t key) {


Can we use std::span<const uint32_t, kBitsSetPerBlock> block as the first parameter? Will doing this improve readability?

HuaHuaY · 2026-06-09T16:04:05Z

+// bloom_filter_block_impl_internal.h: only clang lowers that body to a single vptest;
+// gcc and MSVC emit a longer horizontal vpor reduction.
+bool FindHashBlockAvx2(const uint32_t* block, const uint32_t* salt, uint32_t key) {
+  using batch = xsimd::batch<uint32_t>;


May xsimd generate avx512 here? I think we may need to use xsimd::batch<uint32_t, xsimd::avx2>.

HuaHuaY · 2026-06-09T16:04:59Z

 bool BlockSplitBloomFilter::FindHash(uint64_t hash) const {
+  // Probe kernels in bloom_filter_block_impl_internal.h and bloom_filter_avx2.cc both
+  // hard-code an 8-lane (256-bit) block.
+  static_assert(kBitsSetPerBlock == 8,


I suggest moving this static_assert to bloom_filter_avx2.cc.

dmatth1 requested a review from wgtmac as a code owner May 24, 2026 15:55

github-actions Bot added the Component: Parquet, Component: C++, and awaiting review labels May 24, 2026

kou requested a review from Copilot May 24, 2026 21:01

Copilot started reviewing on behalf of kou May 24, 2026 21:01

Copilot AI reviewed May 24, 2026

Comment thread cpp/src/parquet/bloom_filter.cc Outdated

Potential fix for pull request finding

5a99d50

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

wgtmac reviewed May 28, 2026

Comment thread cpp/src/parquet/bloom_filter.cc Outdated

Comment thread cpp/src/parquet/bloom_filter_avx2_internal.h Outdated

Comment thread cpp/src/parquet/bloom_filter_block_inc.h Outdated

AntoinePrv reviewed May 28, 2026

Comment thread cpp/src/parquet/bloom_filter.cc Outdated

Comment thread cpp/src/parquet/bloom_filter_avx2.cc Outdated

Comment thread cpp/src/parquet/bloom_filter_avx2.cc Outdated

pitrou reviewed Jun 1, 2026

dmatth1 force-pushed the simd-sbbf-autovec branch from 1591295 to 764f0e5 June 4, 2026 20:52

github-actions Bot added the awaiting committer review label and removed the awaiting review label Jun 9, 2026

pitrou reviewed Jun 9, 2026

HuaHuaY reviewed Jun 9, 2026

7 participants
