[cuda.compute]: add benchmarks to measure host side overhead #9432

Open
NaderAlAwar wants to merge 5 commits into NVIDIA:main from NaderAlAwar:cuda-compute-host-benchmarks

@NaderAlAwar

Contributor

Description

closes #9028
closes #9431

Checklist

  • [ ] New or existing tests cover these changes.
  • [ ] The documentation is up to date with these changes.

@NaderAlAwar NaderAlAwar requested review from a team as code owners June 12, 2026 21:20
@NaderAlAwar NaderAlAwar requested a review from gonidelis June 12, 2026 21:20
@NaderAlAwar NaderAlAwar requested a review from shwina June 12, 2026 21:20
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 12, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 12, 2026
@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e8bab9c8-a3ff-4463-a5a4-c9a1cad011aa

📥 Commits

Reviewing files that changed from the base of the PR and between 1571d76 and 8172fd6.

📒 Files selected for processing (1)
  • python/cuda_cccl/benchmarks/compute/host/host_benchmark_cases.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • python/cuda_cccl/benchmarks/compute/host/host_benchmark_cases.py

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Summary

This PR adds host-side pytest-benchmark benchmarks for cuda.compute to measure host-side overheads and first-time wrapper build/JIT costs, addressing issues #9028 and #9431. It provides benchmark-case infrastructure that exercises many cuda.compute primitives while allowing the native CUDA compute path to be bypassed so measurements focus on wrapper and host-side work.

Key Changes

  • New benchmark case infrastructure:

    • python/cuda_cccl/benchmarks/compute/host/host_benchmark_cases.py
      • HostBenchmarkCase dataclass describing per-case setup, wrapper construction, oneshot/twoshot callables, noop return behavior, and optional skip reasons.
      • NoopBuildResult proxy and patch_wrapper_to_skip_native_compute() to replace cached wrapper internals so native compute is skipped and deterministic no-op results are returned (NoopReturnKind: "none", "temp_storage_bytes", "temp_storage_and_selector").
      • Helpers for tiny temp storage, CUDA stream creation/synchronization, raw C++ and Python op implementations, and a broad CASES list plus STREAM_CASES; CALL_CASES = CASES + STREAM_CASES.
      • Module-level skip reason when numba.cuda is unavailable; Python/Numba-backed cases are skipped accordingly.
  • New pytest-benchmark test suite:

    • python/cuda_cccl/benchmarks/compute/host/test_host_pytest_benchmark.py
      • test_build_time: measures wrapper construction/build time using benchmark.pedantic.
      • test_oneshot_cached_host_overhead: measures host-side overhead for a single cached wrapper invocation after patching to skip native compute.
      • test_twoshot_call_host_overhead: measures host-side overhead for two successive calls (twoshot).
      • Per-case parametrization applies pytest.skip when a case reports a skip_reason. Fixed round/iteration constants are defined for each test group.
  • Dependency and repo updates:

    • python/cuda_cccl/benchmarks/compute/pixi.toml: added pytest-benchmark = "*".
    • python/cuda_cccl/pyproject.toml: added pytest-benchmark to bench-* optional dependency extras.
    • .gitignore: added .benchmarks/ to ignore pytest-benchmark local save artifacts.
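
As a rough illustration of the no-op patching idea described above, the sketch below uses only the standard library. Every name in it (HostCaseSketch, NoopBuildSketch, FakeReduceWrapper, patch_to_skip_native, and the build_result attribute) is hypothetical and is not taken from the PR's code; the real wrappers and their internals may look quite different.

```python
from dataclasses import dataclass
from typing import Any, Callable, Literal, Optional

# All names here are hypothetical stand-ins, not the PR's actual definitions.
NoopKind = Literal["none", "temp_storage_bytes", "temp_storage_and_selector"]


@dataclass(frozen=True)
class HostCaseSketch:
    name: str
    make_wrapper: Callable[[], Any]        # builds the wrapper under test
    oneshot: Callable[[Any], Any]          # one call on an already built wrapper
    noop_return: NoopKind = "none"         # what the patched build should return
    skip_reason: Optional[str] = None      # set when the case cannot run


class NoopBuildSketch:
    """Replaces a compiled build result; every method returns a fixed value."""

    def __init__(self, kind: NoopKind) -> None:
        self._kind = kind

    def __getattr__(self, _name: str) -> Callable[..., Any]:
        # Only reached for attributes that are not set, i.e. the "native" methods.
        def _noop(*_args: Any, **_kwargs: Any) -> Any:
            if self._kind == "temp_storage_bytes":
                return 1
            if self._kind == "temp_storage_and_selector":
                return 1, 0
            return None

        return _noop


class FakeReduceWrapper:
    """Toy wrapper: some host-side argument handling, then a native call."""

    def __init__(self) -> None:
        self.build_result: Any = None  # assumed home of the compiled artifact

    def __call__(self, d_in: list, d_out: list, num_items: int) -> Any:
        args = (d_in, d_out, int(num_items))  # the host-side work being timed
        return self.build_result.compute(*args)


def patch_to_skip_native(wrapper: Any, kind: NoopKind) -> None:
    # Assumption: swapping one attribute is enough to bypass native compute.
    wrapper.build_result = NoopBuildSketch(kind)


case = HostCaseSketch(
    name="reduce",
    make_wrapper=FakeReduceWrapper,
    oneshot=lambda w: w([1, 2, 3], [0], 3),
    noop_return="temp_storage_bytes",
)
wrapper = case.make_wrapper()
patch_to_skip_native(wrapper, case.noop_return)
result = case.oneshot(wrapper)  # no device work happens; only host code runs
```

With the native call replaced by a fixed return value, whatever time a benchmark records for case.oneshot is spent in Python-level argument handling, which is the quantity the host-overhead tests aim to isolate.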

Other notes for reviewers / TODOs

  • The PR description checklist leaves test coverage and documentation updates unchecked — consider whether CI integration, README/docs updates, or an automated benchmark runner/agent (as suggested in #9028) are desired before merging.
  • Verify that NoopBuildResult and the noop return kinds accurately mimic real wrapper return shapes (temp-storage sizing, selector semantics) for each case so host-only timings are meaningful.
  • The commit history includes automated pre-commit.ci autofix commits; beyond that formatting, the PR metadata mentions no additional implementation-detail commits.

Walkthrough

Adds a host-side benchmark framework for cuda.compute: types and noop-patching, per-primitive harnesses, a CASES catalog, pytest-benchmark tests for build/oneshot/twoshot host overhead, and dependency/config updates to enable pytest-benchmark.

Changes

Host-side benchmark framework for cuda.compute

| Layer / File(s) | Summary |
|---|---|
| Benchmark types and no-op patching (python/cuda_cccl/benchmarks/compute/host/host_benchmark_cases.py) | HostBenchmarkCase, NoopReturnKind, NoopBuildResult, patch_wrapper_to_skip_native_compute(), and CUDA helpers (temp storage, stream, synchronize). |
| Operation harnesses and raw ops (python/cuda_cccl/benchmarks/compute/host/host_benchmark_cases.py) | Raw i32 op constructors (compiled C++ and Python) and harnesses (setup/make/oneshot/twoshot) for reduce, scan, segmented-reduce, transforms, histogram, search, partition, unique_by_key, and sorting, including stream variants. |
| Benchmark case factories and catalog (python/cuda_cccl/benchmarks/compute/host/host_benchmark_cases.py) | State enrichment helpers, _make_case(), and CASES, STREAM_CASES, CALL_CASES populated with many operation variants and conditional skip reasons. |
| Pytest benchmark test definitions (python/cuda_cccl/benchmarks/compute/host/test_host_pytest_benchmark.py) | Three pytest-benchmark tests measure wrapper build time (test_build_time), oneshot cached host overhead (test_oneshot_cached_host_overhead), and twoshot call host overhead (test_twoshot_call_host_overhead). |
| Dependency and configuration updates (.gitignore, python/cuda_cccl/benchmarks/compute/pixi.toml, python/cuda_cccl/pyproject.toml) | .gitignore ignores .benchmarks/. pytest-benchmark added to pixi.toml bench feature and to bench-* optional extras in pyproject.toml. |

Assessment against linked issues

| Objective | Addressed | Explanation |
|---|---|---|
| Add host-side benchmarks to measure host overhead [#9431] | | |
| Use pytest-benchmark for benchmarking infrastructure [#9028, #9431] | | |
| Add JIT compilation time benchmarks to c.parallel.v2 [#9028] | | PR implements cuda.compute host-side benchmarks; it does not add JIT compilation-time benchmarks for c.parallel.v2. |

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 457dee23-afbc-4cdb-a72e-afc227efe60a

📥 Commits

Reviewing files that changed from the base of the PR and between 809e315 and 1786eb0.

📒 Files selected for processing (5)
  • .gitignore
  • python/cuda_cccl/benchmarks/compute/host/host_benchmark_cases.py
  • python/cuda_cccl/benchmarks/compute/host/test_host_pytest_benchmark.py
  • python/cuda_cccl/benchmarks/compute/pixi.toml
  • python/cuda_cccl/pyproject.toml

@NaderAlAwar

Contributor Author

pre-commit.ci autofix

@copy-pr-bot

copy-pr-bot Bot commented Jun 12, 2026

Contributor

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@NaderAlAwar

Contributor Author

/ok to test a921939

@NaderAlAwar

Contributor Author

pre-commit.ci autofix

@NaderAlAwar

Contributor Author

/ok to test 8172fd6

@github-actions

Contributor

🥳 CI Workflow Results

🟩 Finished in 1h 27m: Pass: 100%/51 | Total: 14h 07m | Max: 54m 34s

See results here.

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

  • [cuda.compute]: add host side benchmarks to measure host overheads
  • Add benchmarks for JIT compilation time to c.parallel.v2

1 participant