Skip to content

[Feature] [QDP] IQP kernel fusion and persistent kernel / grid-stride optimizations #1015

@400Ping

Description

@400Ping

Summary

Implement two GPU optimizations for the QDP IQP encoding path in qdp-kernels:

  1. IQP kernel fusion – Reduce kernel launch count and global memory traffic by fusing phase, FWT, and/or normalization steps where possible.
  2. Persistent kernel / grid-stride – Use a fixed grid size with grid-stride loops in single-sample IQP kernels so that large state_len scales without exceeding grid limits and launch overhead is predictable.

These changes improve GPU utilization and memory bandwidth use, especially for larger qubit counts.

Use Case

  • Fusion: The current single-sample IQP path (FWT-based) does 1 phase launch + (1 or n) FWT launch(es) + 1 normalize launch. The full state vector is read/written multiple times to global memory. Fusing steps reduces launch overhead and global memory traffic (e.g. ~2× less state traffic for phase+normalize fusion).
  • Grid-stride: Single-sample IQP kernels (iqp_phase_kernel, fwt_butterfly_stage_kernel, normalize_state_kernel) assign one index per thread and use a grid size that grows with state_len. For very large n this can approach grid limits and increase launch overhead. The batch path already uses grid-stride and a grid cap; applying the same pattern to single-sample keeps behavior consistent and scalable.

Proposed Implementation

1. IQP kernel fusion

  • Phase + Normalize (all n)

    • Add a single kernel (or modify the phase kernel) that writes state[x] = exp(i*θ(x)) * norm_factor so that a separate normalize kernel is not needed after phase.
    • In launch_iqp_encode / batch launcher: call this fused phase+normalize path and skip the standalone normalize_state_kernel / normalize_batch_kernel for IQP.
    • Files: qdp/qdp-kernels/src/iqp.cu (new or modified kernel, and launch logic in the extern "C" section).
  • Phase + FWT(shared) + Normalize in one kernel (n ≤ FWT_SHARED_MEM_THRESHOLD, e.g. n ≤ 10)

    • One kernel that: (1) computes phase into shared memory (no global write for phase), (2) runs the full FWT in shared memory, (3) normalizes in shared memory, (4) writes the result to global memory once.
    • Replaces the current sequence: phase kernel → shared-memory FWT kernel → normalize kernel (3 launches → 1).
    • Respect existing FWT_SHARED_MEM_THRESHOLD and shared memory size (e.g. n=10 ⇒ 16KB for cuDoubleComplex).
    • Files: qdp/qdp-kernels/src/iqp.cu.
  • Last FWT stage + Normalize (n > FWT_SHARED_MEM_THRESHOLD)

    • Fuse the final fwt_butterfly_stage_kernel with normalization: when writing the butterfly result for each pair, multiply by norm_factor and write once. Then do not launch a separate normalize kernel.
    • Files: qdp/qdp-kernels/src/iqp.cu (modified last-stage kernel and launch logic).

2. Persistent kernel / grid-stride (single-sample IQP)

  • Phase kernel

    • Change iqp_phase_kernel to use a grid-stride loop over x and cap grid size (e.g. min(blocks_needed, MAX_GRID_BLOCKS)).
    • Ref: current code uses single index per thread (~lines 111–126 in iqp.cu).
  • Normalize kernel (when still used)

    • Change normalize_state_kernel to use a grid-stride loop over indices and the same grid cap.
    • Ref: ~lines 212–224 in iqp.cu.
  • FWT global-memory stage kernel

    • Change fwt_butterfly_stage_kernel to use a grid-stride loop over butterfly pair index and cap grid (e.g. min(blocks_needed, MAX_GRID_BLOCKS)).
    • Ref: ~lines 131–159 in iqp.cu.
  • Launch logic

    • In launch_iqp_encode, compute grid size as min((state_len + blockSize - 1) / blockSize, MAX_GRID_BLOCKS) for phase and normalize; for each FWT stage, min((num_pairs + blockSize - 1) / blockSize, MAX_GRID_BLOCKS).
    • Reuse MAX_GRID_BLOCKS from kernel_config.h (currently 2048).

Batch IQP already uses grid-stride and a grid cap; no change required there unless we want to align constant names or cap values.

Additional Context

  • Relevant files: qdp/qdp-kernels/src/iqp.cu, qdp/qdp-kernels/src/kernel_config.h.
  • Algorithm: FWT-based IQP is O(n·2^n) vs naive O(4^n). Steps are: (1) phase array f[x]=exp(i*θ(x)), (2) Walsh–Hadamard transform (FWT) on f, (3) normalize by 1/2^n.
  • Testing: Ensure existing QDP tests (e.g. in testing/qdp/) and any qdp-kernels tests still pass; consider a quick benchmark for n in [4, 10, 16, 20] (single and batch) to confirm no regression and ideally improved throughput.

Metadata

Metadata

Assignees

No fields configured for Feature.

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions