Summary
Implement two GPU optimizations for the QDP IQP encoding path in qdp-kernels:
- IQP kernel fusion – Reduce kernel launch count and global memory traffic by fusing phase, FWT, and/or normalization steps where possible.
- Persistent kernel / grid-stride – Use a fixed grid size with grid-stride loops in the single-sample IQP kernels so that large `state_len` scales without exceeding grid limits and launch overhead stays predictable.
These changes improve GPU utilization and memory bandwidth use, especially for larger qubit counts.
Use Case
- Fusion: The current single-sample IQP path (FWT-based) does 1 phase launch + (1 or n) FWT launch(es) + 1 normalize launch. The full state vector is read/written multiple times to global memory. Fusing steps reduces launch overhead and global memory traffic (e.g. ~2× less state traffic for phase+normalize fusion).
- Grid-stride: Single-sample IQP kernels (`iqp_phase_kernel`, `fwt_butterfly_stage_kernel`, `normalize_state_kernel`) assign one index per thread and use a grid size that grows with `state_len`. For very large n this can approach grid limits and increase launch overhead. The batch path already uses grid-stride and a grid cap; applying the same pattern to the single-sample path keeps behavior consistent and scalable.
Proposed Implementation
1. IQP kernel fusion
- Phase + Normalize (all n)
  - Add a single kernel (or modify the phase kernel) that writes `state[x] = exp(i*θ(x)) * norm_factor`, so that a separate normalize kernel is not needed after the phase step.
  - In `launch_iqp_encode` / the batch launcher: call this fused phase+normalize path and skip the standalone `normalize_state_kernel` / `normalize_batch_kernel` for IQP.
  - Files: `qdp/qdp-kernels/src/iqp.cu` (new or modified kernel, and launch logic in the `extern "C"` section).
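A minimal sketch of the fused phase+normalize kernel, in grid-stride form. `iqp_theta` is a hypothetical device helper standing in for however `iqp_phase_kernel` currently evaluates θ(x); the real signature in `iqp.cu` may differ:

```cuda
#include <cuComplex.h>

// Hypothetical device helper: evaluates the IQP phase angle theta(x) for basis
// index x. Stands in for the existing phase computation in iqp.cu.
__device__ double iqp_theta(long long x, const double* features, int n);

// Fused phase + normalize: writes exp(i*theta(x)) * norm_factor in one pass,
// so the standalone normalize kernel (one extra full read+write of the state
// through global memory) can be skipped.
__global__ void iqp_phase_norm_kernel(cuDoubleComplex* state,
                                      const double* features,
                                      int n, long long state_len,
                                      double norm_factor)
{
    long long stride = (long long)gridDim.x * blockDim.x;
    for (long long x = (long long)blockIdx.x * blockDim.x + threadIdx.x;
         x < state_len; x += stride) {
        double s, c;
        sincos(iqp_theta(x, features, n), &s, &c);  // e^{i*theta(x)}
        state[x] = make_cuDoubleComplex(c * norm_factor, s * norm_factor);
    }
}
```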
- Phase + FWT(shared) + Normalize in one kernel (n ≤ FWT_SHARED_MEM_THRESHOLD, e.g. n ≤ 10)
  - One kernel that: (1) computes the phase into shared memory (no global write for the phase), (2) runs the full FWT in shared memory, (3) normalizes in shared memory, (4) writes the result to global memory once.
  - Replaces the current three-launch sequence phase kernel → shared-memory FWT kernel → normalize kernel with a single launch.
  - Respect the existing `FWT_SHARED_MEM_THRESHOLD` and the shared memory size (e.g. n = 10 ⇒ 16 KB for `cuDoubleComplex`).
  - Files: `qdp/qdp-kernels/src/iqp.cu`.
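Under the same assumption (`iqp_theta` as a placeholder for the existing phase evaluation), the fully fused small-n kernel could be sketched as a single-block launch with `(1 << n) * sizeof(cuDoubleComplex)` dynamic shared memory; the real kernel in `iqp.cu` may partition work differently:

```cuda
#include <cuComplex.h>

// Hypothetical stand-in for the existing phase computation (see iqp.cu).
__device__ double iqp_theta(long long x, const double* features, int n);

// Fused kernel for n <= FWT_SHARED_MEM_THRESHOLD: phase, full FWT, and
// normalization all happen in shared memory; global memory is written once.
// Launch sketch: <<<1, 256, (1 << n) * sizeof(cuDoubleComplex)>>>.
__global__ void iqp_encode_fused_small_kernel(cuDoubleComplex* state,
                                              const double* features,
                                              int n, double norm_factor)
{
    extern __shared__ cuDoubleComplex sh[];   // 2^n entries (16 KB at n = 10)
    const int len = 1 << n;

    // (1) phase directly into shared memory -- no global write for this step
    for (int x = threadIdx.x; x < len; x += blockDim.x) {
        double s, c;
        sincos(iqp_theta(x, features, n), &s, &c);
        sh[x] = make_cuDoubleComplex(c, s);
    }
    __syncthreads();

    // (2) in-place Walsh-Hadamard transform over shared memory
    for (int h = 1; h < len; h <<= 1) {
        for (int p = threadIdx.x; p < len / 2; p += blockDim.x) {
            int i = (p / h) * (2 * h) + (p % h);   // lower index of the pair
            int j = i + h;
            cuDoubleComplex a = sh[i], b = sh[j];
            sh[i] = cuCadd(a, b);
            sh[j] = cuCsub(a, b);
        }
        __syncthreads();
    }

    // (3)+(4) normalize while writing the result out exactly once
    for (int x = threadIdx.x; x < len; x += blockDim.x)
        state[x] = make_cuDoubleComplex(cuCreal(sh[x]) * norm_factor,
                                        cuCimag(sh[x]) * norm_factor);
}
```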
- Last FWT stage + Normalize (n > FWT_SHARED_MEM_THRESHOLD)
  - Fuse the final `fwt_butterfly_stage_kernel` with normalization: when writing the butterfly result for each pair, multiply by `norm_factor` and write once; then skip the separate normalize kernel launch.
  - Files: `qdp/qdp-kernels/src/iqp.cu` (modified last-stage kernel and launch logic).
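A sketch of the fused last stage, assuming the same pair indexing as a standard iterative FWT (the existing `fwt_butterfly_stage_kernel` may index pairs differently):

```cuda
#include <cuComplex.h>

// Final FWT butterfly stage fused with normalization (n too large for shared
// memory): each pair is scaled by norm_factor as it is written, so no separate
// normalize pass over the state is needed afterwards.
__global__ void fwt_last_stage_norm_kernel(cuDoubleComplex* state,
                                           long long h,          // stride of the final stage
                                           long long num_pairs,  // state_len / 2
                                           double norm_factor)
{
    long long stride = (long long)gridDim.x * blockDim.x;
    for (long long p = (long long)blockIdx.x * blockDim.x + threadIdx.x;
         p < num_pairs; p += stride) {
        long long i = (p / h) * (2 * h) + (p % h);  // lower index of the pair
        long long j = i + h;
        cuDoubleComplex a = state[i], b = state[j];
        state[i] = make_cuDoubleComplex((cuCreal(a) + cuCreal(b)) * norm_factor,
                                        (cuCimag(a) + cuCimag(b)) * norm_factor);
        state[j] = make_cuDoubleComplex((cuCreal(a) - cuCreal(b)) * norm_factor,
                                        (cuCimag(a) - cuCimag(b)) * norm_factor);
    }
}
```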
2. Persistent kernel / grid-stride (single-sample IQP)
- Phase kernel: change `iqp_phase_kernel` to use a grid-stride loop over x and cap the grid size (e.g. `min(blocks_needed, MAX_GRID_BLOCKS)`). Files: `qdp/qdp-kernels/src/iqp.cu`.
- Normalize kernel (when still used): change `normalize_state_kernel` to use a grid-stride loop over indices and the same grid cap. Files: `qdp/qdp-kernels/src/iqp.cu`.
- FWT global-memory stage kernel: change `fwt_butterfly_stage_kernel` to use a grid-stride loop over the butterfly pair index and cap the grid (e.g. `min(blocks_needed, MAX_GRID_BLOCKS)`). Files: `qdp/qdp-kernels/src/iqp.cu`.
- Launch logic: in `launch_iqp_encode`, compute the grid size as `min((state_len + blockSize - 1) / blockSize, MAX_GRID_BLOCKS)` for phase and normalize, and as `min((num_pairs + blockSize - 1) / blockSize, MAX_GRID_BLOCKS)` for each FWT stage. Use `MAX_GRID_BLOCKS` from `kernel_config.h` (currently 2048).
- Batch IQP already uses grid-stride and a grid cap; no change is required there unless we want to align constant names or cap values.
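The capped-grid computation is plain host code; a sketch, with the block size assumed (256) and `MAX_GRID_BLOCKS` taken as the 2048 stated for `kernel_config.h`:

```cpp
#include <algorithm>

constexpr long long kBlockSize = 256;        // assumed launch block size
constexpr long long MAX_GRID_BLOCKS = 2048;  // value stated for kernel_config.h

// Grid size capped at MAX_GRID_BLOCKS regardless of the work-item count;
// the grid-stride loops inside the kernels cover the remainder.
static inline int capped_grid(long long work_items) {
    long long blocks = (work_items + kBlockSize - 1) / kBlockSize;
    return (int)std::min(blocks, MAX_GRID_BLOCKS);
}

// Usage in launch_iqp_encode (sketch):
//   iqp_phase_kernel<<<capped_grid(state_len), kBlockSize>>>(...);
//   // per FWT stage:
//   fwt_butterfly_stage_kernel<<<capped_grid(num_pairs), kBlockSize>>>(...);
//   normalize_state_kernel<<<capped_grid(state_len), kBlockSize>>>(...);
```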
Additional Context
- Relevant files: `qdp/qdp-kernels/src/iqp.cu`, `qdp/qdp-kernels/src/kernel_config.h`.
- Algorithm: FWT-based IQP encoding is O(n·2^n) vs the naive O(4^n). The steps are: (1) build the phase array `f[x] = exp(i*θ(x))`, (2) apply the Walsh–Hadamard transform (FWT) to `f`, (3) normalize by `1/2^n`.
- Testing: Ensure existing QDP tests (e.g. in `testing/qdp/`) and any `qdp-kernels` tests still pass; consider a quick benchmark for n in [4, 10, 16, 20] (single and batch) to confirm no regression and ideally improved throughput.