Summary
Implement two GPU optimizations for the QDP IQP encoding path in qdp-kernels:
- IQP kernel fusion – Reduce kernel launch count and global memory traffic by fusing phase, FWT, and/or normalization steps where possible.
- Persistent kernel / grid-stride – Use a fixed grid size with grid-stride loops in the single-sample IQP kernels so that large `state_len` scales without exceeding grid limits and launch overhead stays predictable.
These changes improve GPU utilization and memory bandwidth use, especially for larger qubit counts.
Use Case
- Fusion: The current single-sample IQP path (FWT-based) does 1 phase launch + (1 or n) FWT launch(es) + 1 normalize launch. The full state vector is read/written multiple times to global memory. Fusing steps reduces launch overhead and global memory traffic (e.g. ~2× less state traffic for phase+normalize fusion).
- Grid-stride: Single-sample IQP kernels (`iqp_phase_kernel`, `fwt_butterfly_stage_kernel`, `normalize_state_kernel`) assign one index per thread and use a grid size that grows with `state_len`. For very large n this can approach grid limits and increase launch overhead. The batch path already uses grid-stride and a grid cap; applying the same pattern to the single-sample path keeps behavior consistent and scalable.
Proposed Implementation
1. IQP kernel fusion
- Phase + Normalize (all n)
  - Add a single kernel (or modify the phase kernel) that writes `state[x] = exp(i*θ(x)) * norm_factor`, so that a separate normalize kernel is not needed after the phase step.
  - In `launch_iqp_encode` / the batch launcher: call this fused phase+normalize path and skip the standalone `normalize_state_kernel` / `normalize_batch_kernel` for IQP.
  - Files: `qdp/qdp-kernels/src/iqp.cu` (new or modified kernel, and launch logic in the `extern "C"` section).
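A minimal sketch of the fused phase+normalize kernel, in grid-stride form. `iqp_theta` is a hypothetical device helper standing in for however `iqp_phase_kernel` currently evaluates θ(x); the real signature in `iqp.cu` may differ:

```cuda
#include <cuComplex.h>

// Hypothetical device helper: evaluates the IQP phase angle theta(x) for basis
// index x. Stands in for the existing phase computation in iqp.cu.
__device__ double iqp_theta(long long x, const double* features, int n);

// Fused phase + normalize: writes exp(i*theta(x)) * norm_factor in one pass,
// so the standalone normalize kernel (one extra full read+write of the state
// through global memory) can be skipped.
__global__ void iqp_phase_norm_kernel(cuDoubleComplex* state,
                                      const double* features,
                                      int n, long long state_len,
                                      double norm_factor)
{
    long long stride = (long long)gridDim.x * blockDim.x;
    for (long long x = (long long)blockIdx.x * blockDim.x + threadIdx.x;
         x < state_len; x += stride) {
        double s, c;
        sincos(iqp_theta(x, features, n), &s, &c);  // e^{i*theta(x)}
        state[x] = make_cuDoubleComplex(c * norm_factor, s * norm_factor);
    }
}
```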
- Phase + FWT(shared) + Normalize in one kernel (n ≤ FWT_SHARED_MEM_THRESHOLD, e.g. n ≤ 10)
  - One kernel that: (1) computes the phase into shared memory (no global write for the phase), (2) runs the full FWT in shared memory, (3) normalizes in shared memory, (4) writes the result to global memory once.
  - Replaces the current three-launch sequence phase kernel → shared-memory FWT kernel → normalize kernel with a single launch.
  - Respect the existing `FWT_SHARED_MEM_THRESHOLD` and the shared memory size (e.g. n = 10 ⇒ 16 KB for `cuDoubleComplex`).
  - Files: `qdp/qdp-kernels/src/iqp.cu`.
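Under the same assumption (`iqp_theta` as a placeholder for the existing phase evaluation), the fully fused small-n kernel could be sketched as a single-block launch with `(1 << n) * sizeof(cuDoubleComplex)` dynamic shared memory; the real kernel in `iqp.cu` may partition work differently:

```cuda
#include <cuComplex.h>

// Hypothetical stand-in for the existing phase computation (see iqp.cu).
__device__ double iqp_theta(long long x, const double* features, int n);

// Fused kernel for n <= FWT_SHARED_MEM_THRESHOLD: phase, full FWT, and
// normalization all happen in shared memory; global memory is written once.
// Launch sketch: <<<1, 256, (1 << n) * sizeof(cuDoubleComplex)>>>.
__global__ void iqp_encode_fused_small_kernel(cuDoubleComplex* state,
                                              const double* features,
                                              int n, double norm_factor)
{
    extern __shared__ cuDoubleComplex sh[];   // 2^n entries (16 KB at n = 10)
    const int len = 1 << n;

    // (1) phase directly into shared memory -- no global write for this step
    for (int x = threadIdx.x; x < len; x += blockDim.x) {
        double s, c;
        sincos(iqp_theta(x, features, n), &s, &c);
        sh[x] = make_cuDoubleComplex(c, s);
    }
    __syncthreads();

    // (2) in-place Walsh-Hadamard transform over shared memory
    for (int h = 1; h < len; h <<= 1) {
        for (int p = threadIdx.x; p < len / 2; p += blockDim.x) {
            int i = (p / h) * (2 * h) + (p % h);   // lower index of the pair
            int j = i + h;
            cuDoubleComplex a = sh[i], b = sh[j];
            sh[i] = cuCadd(a, b);
            sh[j] = cuCsub(a, b);
        }
        __syncthreads();
    }

    // (3)+(4) normalize while writing the result out exactly once
    for (int x = threadIdx.x; x < len; x += blockDim.x)
        state[x] = make_cuDoubleComplex(cuCreal(sh[x]) * norm_factor,
                                        cuCimag(sh[x]) * norm_factor);
}
```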
- Last FWT stage + Normalize (n > FWT_SHARED_MEM_THRESHOLD)
  - Fuse the final `fwt_butterfly_stage_kernel` with normalization: when writing the butterfly result for each pair, multiply by `norm_factor` and write once; then skip the separate normalize kernel launch.
  - Files: `qdp/qdp-kernels/src/iqp.cu` (modified last-stage kernel and launch logic).
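A sketch of the fused last stage, assuming the same pair indexing as a standard iterative FWT (the existing `fwt_butterfly_stage_kernel` may index pairs differently):

```cuda
#include <cuComplex.h>

// Final FWT butterfly stage fused with normalization (n too large for shared
// memory): each pair is scaled by norm_factor as it is written, so no separate
// normalize pass over the state is needed afterwards.
__global__ void fwt_last_stage_norm_kernel(cuDoubleComplex* state,
                                           long long h,          // stride of the final stage
                                           long long num_pairs,  // state_len / 2
                                           double norm_factor)
{
    long long stride = (long long)gridDim.x * blockDim.x;
    for (long long p = (long long)blockIdx.x * blockDim.x + threadIdx.x;
         p < num_pairs; p += stride) {
        long long i = (p / h) * (2 * h) + (p % h);  // lower index of the pair
        long long j = i + h;
        cuDoubleComplex a = state[i], b = state[j];
        state[i] = make_cuDoubleComplex((cuCreal(a) + cuCreal(b)) * norm_factor,
                                        (cuCimag(a) + cuCimag(b)) * norm_factor);
        state[j] = make_cuDoubleComplex((cuCreal(a) - cuCreal(b)) * norm_factor,
                                        (cuCimag(a) - cuCimag(b)) * norm_factor);
    }
}
```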
2. Persistent kernel / grid-stride (single-sample IQP)
- Phase kernel: change `iqp_phase_kernel` to use a grid-stride loop over x and cap the grid size (e.g. `min(blocks_needed, MAX_GRID_BLOCKS)`). Files: `qdp/qdp-kernels/src/iqp.cu`.
- Normalize kernel (when still used): change `normalize_state_kernel` to use a grid-stride loop over indices and the same grid cap. Files: `qdp/qdp-kernels/src/iqp.cu`.
- FWT global-memory stage kernel: change `fwt_butterfly_stage_kernel` to use a grid-stride loop over the butterfly pair index and cap the grid (e.g. `min(blocks_needed, MAX_GRID_BLOCKS)`). Files: `qdp/qdp-kernels/src/iqp.cu`.
- Launch logic: in `launch_iqp_encode`, compute the grid size as `min((state_len + blockSize - 1) / blockSize, MAX_GRID_BLOCKS)` for phase and normalize, and as `min((num_pairs + blockSize - 1) / blockSize, MAX_GRID_BLOCKS)` for each FWT stage. Use `MAX_GRID_BLOCKS` from `kernel_config.h` (currently 2048).
- Batch IQP already uses grid-stride and a grid cap; no change is required there unless we want to align constant names or cap values.
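The capped-grid computation is plain host code; a sketch, with the block size assumed (256) and `MAX_GRID_BLOCKS` taken as the 2048 stated for `kernel_config.h`:

```cpp
#include <algorithm>

constexpr long long kBlockSize = 256;        // assumed launch block size
constexpr long long MAX_GRID_BLOCKS = 2048;  // value stated for kernel_config.h

// Grid size capped at MAX_GRID_BLOCKS regardless of the work-item count;
// the grid-stride loops inside the kernels cover the remainder.
static inline int capped_grid(long long work_items) {
    long long blocks = (work_items + kBlockSize - 1) / kBlockSize;
    return (int)std::min(blocks, MAX_GRID_BLOCKS);
}

// Usage in launch_iqp_encode (sketch):
//   iqp_phase_kernel<<<capped_grid(state_len), kBlockSize>>>(...);
//   // per FWT stage:
//   fwt_butterfly_stage_kernel<<<capped_grid(num_pairs), kBlockSize>>>(...);
//   normalize_state_kernel<<<capped_grid(state_len), kBlockSize>>>(...);
```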
Additional Context
- Relevant files: `qdp/qdp-kernels/src/iqp.cu`, `qdp/qdp-kernels/src/kernel_config.h`.
- Algorithm: FWT-based IQP encoding is O(n·2^n) vs the naive O(4^n). The steps are: (1) build the phase array `f[x] = exp(i*θ(x))`, (2) apply the Walsh–Hadamard transform (FWT) to `f`, (3) normalize by `1/2^n`.
- Testing: Ensure existing QDP tests (e.g. in `testing/qdp/`) and any `qdp-kernels` tests still pass; consider a quick benchmark for n in [4, 10, 16, 20] (single and batch) to confirm no regression and ideally improved throughput.