[cuda.compute]: add benchmarks to measure host side overhead by NaderAlAwar · Pull Request #9432 · NVIDIA/cccl

NaderAlAwar · 2026-06-12T21:20:19Z

Description

closes #9028
closes #9431

Checklist

- [ ] New or existing tests cover these changes.
- [ ] The documentation is up to date with these changes.

coderabbitai · 2026-06-12T21:28:22Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e8bab9c8-a3ff-4463-a5a4-c9a1cad011aa

📥 Commits

Reviewing files that changed from the base of the PR and between 1571d76 and 8172fd6.

📒 Files selected for processing (1)

python/cuda_cccl/benchmarks/compute/host/host_benchmark_cases.py

🚧 Files skipped from review as they are similar to previous changes (1)

python/cuda_cccl/benchmarks/compute/host/host_benchmark_cases.py

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Summary

This PR adds host-side pytest-benchmark benchmarks for cuda.compute to measure host-side overheads and first-time wrapper build/JIT costs, addressing issues #9028 and #9431. It provides benchmark-case infrastructure that exercises many cuda.compute primitives while allowing the native CUDA compute path to be bypassed so measurements focus on wrapper and host-side work.

Key Changes

New benchmark case infrastructure:
- python/cuda_cccl/benchmarks/compute/host/host_benchmark_cases.py
  - HostBenchmarkCase dataclass describing per-case setup, wrapper construction, oneshot/twoshot callables, noop return behavior, and optional skip reasons.
  - NoopBuildResult proxy and patch_wrapper_to_skip_native_compute() to replace cached wrapper internals so native compute is skipped and deterministic no-op results are returned (NoopReturnKind: "none", "temp_storage_bytes", "temp_storage_and_selector").
  - Helpers for tiny temp storage, CUDA stream creation/synchronization, raw C++ and Python op implementations, and a broad CASES list plus STREAM_CASES; CALL_CASES = CASES + STREAM_CASES.
  - Module-level skip reason when numba.cuda is unavailable; Python/Numba-backed cases are skipped accordingly.
New pytest-benchmark test suite:
- python/cuda_cccl/benchmarks/compute/host/test_host_pytest_benchmark.py
  - test_build_time: measures wrapper construction/build time using benchmark.pedantic.
  - test_oneshot_cached_host_overhead: measures host-side overhead for a single cached wrapper invocation after patching to skip native compute.
  - test_twoshot_call_host_overhead: measures host-side overhead for two successive calls (twoshot).
  - Per-case parametrization applies pytest.skip when a case reports a skip_reason. Fixed round/iteration constants are defined for each test group.
Dependency and repo updates:
- python/cuda_cccl/benchmarks/compute/pixi.toml: added pytest-benchmark = "*".
- python/cuda_cccl/pyproject.toml: added pytest-benchmark to bench-* optional dependency extras.
- .gitignore: added .benchmarks/ to ignore pytest-benchmark local save artifacts.
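The case catalog described above can be pictured with a minimal sketch. This is not the code from `host_benchmark_cases.py`: `SketchCase`, its field names, the `needs_numba` flag, and the placeholder "wrapper" are all assumptions made for illustration. Only the ideas of a per-case dataclass, a skip reason when `numba.cuda` is unavailable, and `CALL_CASES = CASES + STREAM_CASES` come from the summary.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class SketchCase:
    """Hypothetical stand-in for HostBenchmarkCase; field names are assumptions."""

    name: str
    setup: Callable[[], dict]            # builds per-case state (inputs, outputs, ops)
    make_wrapper: Callable[[dict], Any]  # constructs the wrapper whose build time is measured
    oneshot: Callable[[dict, Any], Any]  # one call through the already-built wrapper
    skip_reason: Optional[str] = None    # set when a required dependency is missing


def _numba_skip_reason() -> Optional[str]:
    """Return a skip reason when numba.cuda cannot be imported, else None."""
    try:
        import numba.cuda  # noqa: F401
    except Exception as exc:  # broad on purpose: any import failure means "skip"
        return f"numba.cuda is unavailable: {exc!r}"
    return None


NUMBA_SKIP = _numba_skip_reason()


def _make_case(name: str, *, needs_numba: bool = False) -> SketchCase:
    """Build one case; the bodies below are placeholders, not real cuda.compute calls."""

    def setup() -> dict:
        return {"data": [0, 1, 2, 3]}

    def make_wrapper(state: dict) -> Any:
        return sum  # placeholder for a compiled/cached wrapper object

    def oneshot(state: dict, wrapper: Any) -> Any:
        return wrapper(state["data"])

    return SketchCase(
        name,
        setup,
        make_wrapper,
        oneshot,
        skip_reason=NUMBA_SKIP if needs_numba else None,
    )


CASES = [
    _make_case("reduce_raw_op"),
    _make_case("reduce_python_op", needs_numba=True),  # Python/Numba-backed case
]
STREAM_CASES = [_make_case("reduce_raw_op_stream")]
CALL_CASES = CASES + STREAM_CASES  # build tests use CASES; call tests also cover streams


def runnable(cases):
    """Filter out cases that report a skip reason."""
    return [case for case in cases if case.skip_reason is None]
```

In a pytest suite the `skip_reason` would typically be passed to `pytest.skip` inside the parametrized test rather than filtered up front, so skipped cases still show up in the report.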

Other notes for reviewers / TODOs

- The PR description checklist leaves test coverage and documentation updates unchecked; consider whether CI integration, README/docs updates, or an automated benchmark runner/agent (as suggested in #9028) are desired before merging.
- Verify that NoopBuildResult and the noop return kinds accurately mimic real wrapper return shapes (temp-storage sizing, selector semantics) for each case so host-only timings are meaningful.
- The commits show automated pre-commit.ci autofix activity; there are no additional implementation-detail commits beyond formatting mentioned in the PR metadata.
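To make the second note concrete, here is a minimal sketch of how a no-op proxy could stand in for native compute while still returning a value of the expected shape. It does not reproduce `NoopBuildResult` or `patch_wrapper_to_skip_native_compute()`: the `FakeWrapper` and `FakeBuildResult` classes, the `build_result` attribute, the `compute` method, and the concrete return values (`1` and `(1, 0)`) are assumptions. Only the three return-kind names are taken from the summary.

```python
from typing import Literal

NoopReturnKind = Literal["none", "temp_storage_bytes", "temp_storage_and_selector"]


class FakeBuildResult:
    """Hypothetical native build result; calling it would launch GPU work."""

    def compute(self, *args, **kwargs):
        raise RuntimeError("native compute must not run in a host-only benchmark")


class FakeWrapper:
    """Hypothetical cached wrapper: host-side work first, then the native call."""

    def __init__(self):
        self.build_result = FakeBuildResult()
        self.host_calls = 0

    def __call__(self, temp_storage, d_in, d_out, num_items):
        self.host_calls += 1  # stands in for argument validation and marshalling
        return self.build_result.compute(temp_storage, d_in, d_out, num_items)


class NoopProxy:
    """Replaces the build result and returns a fixed value of the requested shape."""

    def __init__(self, return_kind: NoopReturnKind):
        self._return_kind = return_kind

    def compute(self, *args, **kwargs):
        if self._return_kind == "none":
            return None
        if self._return_kind == "temp_storage_bytes":
            return 1  # assumed tiny, nonzero temp-storage size
        if self._return_kind == "temp_storage_and_selector":
            return (1, 0)  # assumed (temp-storage size, selector) pair
        raise ValueError(f"unknown return kind: {self._return_kind!r}")


def patch_to_skip_native(wrapper: FakeWrapper, return_kind: NoopReturnKind) -> FakeWrapper:
    """Swap the wrapper's native build result for the no-op proxy, in place."""
    wrapper.build_result = NoopProxy(return_kind)
    return wrapper


wrapper = patch_to_skip_native(FakeWrapper(), "temp_storage_and_selector")
result = wrapper(None, [], [], 0)  # host-side path runs, native path is skipped
```

The check the note asks for amounts to comparing each case's real return shape against what the proxy hands back, for example asserting that a sort-style case unpacks a two-element result.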

Walkthrough

Adds a host-side benchmark framework for cuda.compute: types and noop-patching, per-primitive harnesses, a CASES catalog, pytest-benchmark tests for build/oneshot/twoshot host overhead, and dependency/config updates to enable pytest-benchmark.

Changes

Host-side benchmark framework for cuda.compute

| Layer / File(s) | Summary |
| --- | --- |
| Benchmark types and no-op patching `python/cuda_cccl/benchmarks/compute/host/host_benchmark_cases.py` | `HostBenchmarkCase`, `NoopReturnKind`, `NoopBuildResult`, `patch_wrapper_to_skip_native_compute()`, and CUDA helpers (temp storage, stream, synchronize). |
| Operation harnesses and raw ops `python/cuda_cccl/benchmarks/compute/host/host_benchmark_cases.py` | Raw i32 op constructors (compiled C++ and Python) and harnesses (setup/make/oneshot/twoshot) for reduce, scan, segmented-reduce, transforms, histogram, search, partition, unique_by_key, and sorting, including stream variants. |
| Benchmark case factories and catalog `python/cuda_cccl/benchmarks/compute/host/host_benchmark_cases.py` | State enrichment helpers, `_make_case()`, and `CASES`, `STREAM_CASES`, `CALL_CASES` populated with many operation variants and conditional skip reasons. |
| Pytest benchmark test definitions `python/cuda_cccl/benchmarks/compute/host/test_host_pytest_benchmark.py` | Three pytest-benchmark tests measure wrapper build time (`test_build_time`), oneshot cached host overhead (`test_oneshot_cached_host_overhead`), and twoshot call host overhead (`test_twoshot_call_host_overhead`). |
| Dependency and configuration updates `.gitignore`, `python/cuda_cccl/benchmarks/compute/pixi.toml`, `python/cuda_cccl/pyproject.toml` | `.gitignore` ignores `.benchmarks/`. `pytest-benchmark` added to `pixi.toml` bench feature and to `bench-*` optional extras in `pyproject.toml`. |
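The fixed round and iteration constants used by the three tests can be illustrated with a standard-library stand-in for a pedantic-style timing loop. This is a sketch, not pytest-benchmark itself: `pedantic_like`, the constant names, and the placeholder targets are hypothetical. The convention that `setup` returns an `(args, kwargs)` pair and is paired with a single iteration mirrors my understanding of `benchmark.pedantic` and should be treated as an assumption.

```python
import time
from statistics import median


def pedantic_like(target, *, setup=None, rounds=1, iterations=1, warmup_rounds=0):
    """Time `target` and return the mean seconds per call for each measured round."""
    if setup is not None and iterations != 1:
        raise ValueError("use iterations=1 when a per-round setup is supplied")
    per_call = []
    for round_index in range(warmup_rounds + rounds):
        # Fresh arguments every round, so build-style targets never hit a warm state.
        args, kwargs = setup() if setup is not None else ((), {})
        start = time.perf_counter()
        for _ in range(iterations):
            target(*args, **kwargs)
        elapsed = time.perf_counter() - start
        if round_index >= warmup_rounds:  # warmup rounds run but are not recorded
            per_call.append(elapsed / iterations)
    return per_call


# Hypothetical constants; the PR defines its own values per test group.
BUILD_ROUNDS = 3
CALL_ROUNDS, CALL_ITERATIONS = 5, 100


def fresh_state():
    """Per-round setup: returns (args, kwargs) for the target."""
    return ([3, 1, 2],), {}


# Build-style measurement: few rounds, one iteration, new state each round.
build_times = pedantic_like(sorted, setup=fresh_state, rounds=BUILD_ROUNDS)

# Cached-call measurement: many iterations per round amortize timer overhead.
call_times = pedantic_like(
    lambda: None, rounds=CALL_ROUNDS, iterations=CALL_ITERATIONS, warmup_rounds=1
)

print(f"median build-like time: {median(build_times):.3e} s")
print(f"median per-call time:   {median(call_times):.3e} s")
```

Splitting the settings this way reflects why build time and cached-call overhead sit in separate tests: the first is expensive and stateful, the second is cheap enough that a single call would be dominated by timer noise.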

Assessment against linked issues

| Objective | Addressed | Explanation |
| --- | --- | --- |
| Add host-side benchmarks to measure host overhead [`#9431`] | ✅ | |
| Use pytest-benchmark for benchmarking infrastructure [`#9028`, `#9431`] | ✅ | |
| Add JIT compilation time benchmarks to c.parallel.v2 [`#9028`] | ❌ | PR implements cuda.compute host-side benchmarks; it does not add JIT compilation-time benchmarks for c.parallel.v2. |

coderabbitai

Actionable comments posted: 1

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 457dee23-afbc-4cdb-a72e-afc227efe60a

📥 Commits

Reviewing files that changed from the base of the PR and between 809e315 and 1786eb0.

📒 Files selected for processing (5)

.gitignore
python/cuda_cccl/benchmarks/compute/host/host_benchmark_cases.py
python/cuda_cccl/benchmarks/compute/host/test_host_pytest_benchmark.py
python/cuda_cccl/benchmarks/compute/pixi.toml
python/cuda_cccl/pyproject.toml

NaderAlAwar · 2026-06-12T21:30:29Z

pre-commit.ci autofix

copy-pr-bot · 2026-06-12T21:34:26Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

NaderAlAwar · 2026-06-12T21:35:25Z

/ok to test a921939

NaderAlAwar · 2026-06-12T21:51:46Z

pre-commit.ci autofix

NaderAlAwar · 2026-06-12T21:54:07Z

/ok to test 8172fd6

github-actions · 2026-06-12T23:25:31Z

🥳 CI Workflow Results

🟩 Finished in 1h 27m: Pass: 100%/51 | Total: 14h 07m | Max: 54m 34s

See results here.

NaderAlAwar added 2 commits June 12, 2026 15:51

Add benchmarks to measure host side overhead

dda2d2a

Use pytest-benchmark instead

1786eb0

NaderAlAwar requested review from a team as code owners June 12, 2026 21:20

NaderAlAwar requested a review from gonidelis June 12, 2026 21:20

github-project-automation Bot added this to CCCL Jun 12, 2026

NaderAlAwar requested a review from shwina June 12, 2026 21:20

github-project-automation Bot moved this to Todo in CCCL Jun 12, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 12, 2026

coderabbitai Bot reviewed Jun 12, 2026

Comment thread python/cuda_cccl/benchmarks/compute/host/test_host_pytest_benchmark.py

[pre-commit.ci] auto code formatting

a921939

Add a case that accepts a stream

1571d76

[pre-commit.ci] auto code formatting

8172fd6

NaderAlAwar mentioned this pull request Jun 15, 2026

[cuda.compute]: make cuda.compute thread safe to enable free threaded wheels #9475

Draft

4 tasks

NaderAlAwar wants to merge 5 commits into NVIDIA:main from NaderAlAwar:cuda-compute-host-benchmarks
