Productize tuning API for `DeviceSegmentedScan` by gonidelis · Pull Request #9430 · NVIDIA/cccl

gonidelis · 2026-06-12T21:11:42Z

coderabbitai · 2026-06-12T21:18:03Z

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Summary

This PR productizes the tuning API for cub::DeviceSegmentedScan by promoting internal tuning policy types into stable, public API types and updating codepaths, documentation, and tests to use them. The change makes environment-based tuning for segmented scans accessible and production-ready.

Key Changes

Public API types:
- Introduced public SegmentedScanBlockPolicy (block-level tuning) and SegmentedScanPolicy (overall tuning).
- Replaced uses of internal detail::segmented_scan::segmented_scan_policy / block types with the public types across benchmarks, dispatch, kernels, and tests.
API/ABI and behavior fixes:
- Policy selectors (both generic and type-deduced) now return cub::SegmentedScanPolicy.
- dispatch now constructs active_policy as SegmentedScanPolicy from the selector.
- Added runtime assertions in dispatch to validate positive threads_per_block and items_per_thread for selected policies.
Policy struct improvements:
- operator== marked noexcept; operator!= implemented as !(lhs == rhs) and noexcept.
- Host-side operator<< guards switched to _CCCL_HOSTED().
- Policies are aggregate-initializable and validated as semiregular/aggregate in tests.
Documentation:
- DeviceSegmentedScan header extended with an @par Tuning section documenting how to pass a policy selector that returns SegmentedScanPolicy, with examples.
Tests and examples:
- Benchmarks and unit tests updated to construct and use cub::SegmentedScanPolicy / SegmentedScanBlockPolicy.
- New/updated C2H tests:
  - Compile-time checks that SegmentedScanPolicy and SegmentedScanBlockPolicy are semiregular and aggregates.
  - Runtime test demonstrating env-based tuning API using cuda::execution::tune(...) with a selector that varies items_per_thread by compute capability.
- Benchmarks updated to return SegmentedScanPolicy from policy selectors.

Impact

Downstream users can now tune DeviceSegmentedScan via cuda::execution::tune or other env-based APIs using the stable public types SegmentedScanPolicy and SegmentedScanBlockPolicy. The change is backward-compatible for callers using environment tuning (they must now supply/receive the public policy types) and improves robustness via additional runtime assertions and clearer documentation.

Walkthrough

important: Promote internal segmented-scan tuning types to public SegmentedScanBlockPolicy/SegmentedScanPolicy, update selectors and dispatch to use them, add runtime assertions and docs, and update tests and benchmarks to validate and exercise the public tuning API.

Changes

Segmented-scan public policy promotion and tuning API

Layer / File(s)	Summary
Public policy type definitions `cub/cub/device/dispatch/tuning/tuning_segmented_scan.cuh`	`SegmentedScanBlockPolicy` and `SegmentedScanPolicy` are introduced as public aggregate types with noexcept comparison and host-only stream formatting.
Policy selector return type updates `cub/cub/device/dispatch/tuning/tuning_segmented_scan.cuh`	`policy_selector::operator()` and `policy_selector_from_types::operator()` in `detail::segmented_scan` now return the public `SegmentedScanPolicy` type constructed from `SegmentedScanBlockPolicy`.
Dispatch active_policy construction and validation `cub/cub/device/dispatch/dispatch_segmented_scan.cuh`	`active_policy` is constructed as `SegmentedScanPolicy` from `policy_selector(cc)`, and new `_CCCL_ASSERT` checks ensure `threads_per_block` and `items_per_thread` are positive.
Kernel and API documentation `cub/cub/device/dispatch/kernels/kernel_segmented_scan.cuh`, `cub/cub/device/device_segmented_scan.cuh`	Kernel comment updated to reference `SegmentedScanPolicy`. New tuning documentation section in `DeviceSegmentedScan` explains environment-based API tuning with custom policy selectors and examples.
Public policy compile-time validation `cub/test/catch2_test_device_segmented_scan_env.cu`	Test tuning helper returns public `SegmentedScanPolicy`. New compile-time tests verify semiregular/aggregate properties and designated initialization equality for both policy types.
Environment-based API tuning integration test `cub/test/catch2_test_device_segmented_scan_env_api.cu`	Includes `cuda/__execution/tune.h` and adds a `SegmentedScanPolicySelector` functor selecting policies by compute capability, integrated with `cuda::execution::tune()` for end-to-end validation.
Benchmark and existing test updates `cub/benchmarks/bench/segmented_scan/base.cuh`, `cub/test/catch2_test_device_segmented_scan_multi_segment.cu`	Policy selectors updated to construct and return public `SegmentedScanPolicy` from `SegmentedScanBlockPolicy` instead of internal types.

Assessment against linked issues

Objective	Addressed	Explanation
Productize tuning API for `cub::DeviceSegmentedScan` [`#8581`]	✅

Suggested reviewers

shwina
gevtushenko
NaderAlAwar

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

cub/cub/device/dispatch/tuning/tuning_segmented_scan.cuh (1)
148-154: 📐 Maintainability & Code Quality | 💤 Low value

suggestion: Remove the trailing semicolon after the function body closing brace on line 154. While technically allowed, it's unconventional and inconsistent with policy_selector::operator() (line 137).
Proposed fix
     return policy_selector{accum_size, accum_align}(cc);
-  };
+  }
 };

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 44478f62-2646-4c8e-9019-604f3c985d2b

📥 Commits

Reviewing files that changed from the base of the PR and between 809e315 and 516f1c7.

📒 Files selected for processing (8)

cub/benchmarks/bench/segmented_scan/base.cuh
cub/cub/device/device_segmented_scan.cuh
cub/cub/device/dispatch/dispatch_segmented_scan.cuh
cub/cub/device/dispatch/kernels/kernel_segmented_scan.cuh
cub/cub/device/dispatch/tuning/tuning_segmented_scan.cuh
cub/test/catch2_test_device_segmented_scan_env.cu
cub/test/catch2_test_device_segmented_scan_env_api.cu
cub/test/catch2_test_device_segmented_scan_multi_segment.cu

bernhardmgruber

Looks pretty good, just two comments!

bernhardmgruber · 2026-06-12T21:14:57Z

-{
-struct block_segmented_scan_policy
+//! The policy for block-level kernels in @ref DeviceSegmentedScan.
+struct BlockSegmentedScanPolicy


Strong suggestion: Please rename this to SegmentedScanBlockPolicy.

bernhardmgruber · 2026-06-12T21:16:29Z

+  int threads_per_block; //!< Number of threads per block
+  int items_per_thread; //!< Number of items per thread
+  BlockLoadAlgorithm load_algorithm; //!< Algorithm for loading input
+  CacheLoadModifier load_modifier; //!< Cache load modifier
+  BlockStoreAlgorithm store_algorithm; //!< Algorithm for storing output
+  BlockScanAlgorithm scan_algorithm; //!< Algorithm for the block scan


Suggestion: Please try to repeat the documentation comments I used for those members in other places, for example:

struct MergeSortPolicy { int threads_per_block; //!< Number of threads in a CUDA block int items_per_thread; //!< Number of items processed per thread BlockLoadAlgorithm load_algorithm; //!< The @ref BlockLoadAlgorithm used for loading items from global memory CacheLoadModifier load_modifier; //!< The @ref CacheLoadModifier used for loading items from global memory BlockStoreAlgorithm store_algorithm; //!< The @ref BlockStoreAlgorithm used for storing items to global memory

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

cub/cub/device/dispatch/tuning/tuning_segmented_scan.cuh (1)
57-62: 📐 Maintainability & Code Quality | 💤 Low value

suggestion: The debug output label .max_segments_per_block does not match the member name max_segments. Either rename the member to max_segments_per_block for consistency, or update the label to .max_segments.
-        << ", .max_segments_per_block = " << policy.max_segments
+        << ", .max_segments = " << policy.max_segments
cub/benchmarks/bench/segmented_scan/base.cuh (1)

33-33: 📐 Maintainability & Code Quality | 💤 Low value

suggestion: Add noexcept specification.

This constexpr function constructs aggregate types and cannot throw. As per coding guidelines, "All functions that don't throw exceptions must use noexcept specification." The upstream contract also lacks noexcept, so this may be a broader pattern, but adding it improves exception safety guarantees and enables compiler optimizations.

Append noexcept after const: [[nodiscard]] _CCCL_HOST_DEVICE_API constexpr auto operator()(::cuda::compute_capability) const noexcept -> ::cub::SegmentedScanPolicy

Source: Coding guidelines

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a3f71acd-dca8-448a-a2ed-705481cf3e37

📥 Commits

Reviewing files that changed from the base of the PR and between 516f1c7 and c65aecc.

📒 Files selected for processing (4)

cub/benchmarks/bench/segmented_scan/base.cuh
cub/cub/device/dispatch/tuning/tuning_segmented_scan.cuh
cub/test/catch2_test_device_segmented_scan_env.cu
cub/test/catch2_test_device_segmented_scan_multi_segment.cu

🚧 Files skipped from review as they are similar to previous changes (2)

cub/test/catch2_test_device_segmented_scan_multi_segment.cu
cub/test/catch2_test_device_segmented_scan_env.cu

coderabbitai · 2026-06-12T22:10:50Z

 {
-  [[nodiscard]] _CCCL_HOST_DEVICE constexpr auto operator()(cuda::compute_capability) const
-    -> cub::detail::segmented_scan::segmented_scan_policy
+  [[nodiscard]] _CCCL_HOST_DEVICE constexpr auto operator()(cuda::compute_capability) const -> cub::SegmentedScanPolicy


📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

important: Annotation and type-qualification mismatch with upstream contract.

The function uses _CCCL_HOST_DEVICE but the upstream policy selector (tuning_segmented_scan.cuh:106) uses _CCCL_HOST_DEVICE_API. Additionally, the parameter type cuda::compute_capability and return types cub::SegmentedScanPolicy / cub::SegmentedScanBlockPolicy lack the :: prefix required for full qualification starting from the global namespace. As per coding guidelines, type names must be fully qualified except when already declared in the current or enclosing namespace.

Change _CCCL_HOST_DEVICE to _CCCL_HOST_DEVICE_API, use ::cuda::compute_capability for the parameter, and ::cub::SegmentedScanPolicy / ::cub::SegmentedScanBlockPolicy for the types.

Source: Coding guidelines

We could change to _CCCL_HOST_DEVICE_API here, but not required as part of this PR.

suggestion: Understood. Would you like me to open a follow-up issue to track switching _CCCL_HOST_DEVICE → _CCCL_HOST_DEVICE_API and adding fully-qualified type names (::cuda::compute_capability, ::cub::SegmentedScanPolicy, ::cub::SegmentedScanBlockPolicy) in this benchmark file?

🧠 Learnings used

Learnt from: sleeepyjack Repo: NVIDIA/cccl PR: 8857 File: cudax/include/cuda/experimental/__cuco/__hyperloglog/default_policy.cuh:92-95 Timestamp: 2026-05-06T23:47:10.296Z Learning: In this CCCL (CUDA C++ Core Libraries) codebase, do NOT treat `_CCCL_API` as host-only. `_CCCL_API` is defined in `libcudacxx/include/cuda/std/__cccl/visibility.h` and always expands to include `_CCCL_HOST_DEVICE` (possibly alongside `_CCCL_TILE`, `_CCCL_VISIBILITY_HIDDEN`, and `_CCCL_EXCLUDE_FROM_EXPLICIT_INSTANTIATION`). So any function annotated with `_CCCL_API` is already valid for both host and device code; reviewers should not flag it as host-only.

Learnt from: sleeepyjack Repo: NVIDIA/cccl PR: 8857 File: cudax/include/cuda/experimental/__cuco/__hyperloglog/hyperloglog_impl.cuh:426-428 Timestamp: 2026-05-06T23:47:17.597Z Learning: In the CCCL/CUDA C++ Core Libraries codebase (NVIDIA/cccl), treat `_CCCL_API` as host-and-device callable. `_CCCL_API` expands to `_CCCL_HOST_DEVICE` via `libcudacxx/include/cuda/std/__cccl/visibility.h`, so functions marked with `_CCCL_API` may be invoked from both host and device code. When reviewing device-code contexts, do not flag `_CCCL_API`-annotated functions as if they were host-only.

Learnt from: NaderAlAwar Repo: NVIDIA/cccl PR: 9171 File: libcudacxx/include/cuda/std/__cccl/diagnostic.h:126-126 Timestamp: 2026-06-03T13:24:36.356Z Learning: When reviewing code that uses NVIDIA NVCC/NVRTC diagnostic suppression pragmas, treat `nv_diag_suppress` as supporting comma-separated warning numbers in a single directive. For example, `_Pragma("nv_diag_suppress 1444,20199")` (or an equivalent `#pragma nv_diag_suppress 1444,20199`) is valid; do not flag comma-separated warning IDs inside `nv_diag_suppress` pragmas as unsupported.

Learnt from: elstehle Repo: NVIDIA/cccl PR: 9288 File: cub/benchmarks/bench/segmented_topk/variable/common.cuh:4-4 Timestamp: 2026-06-07T16:13:21.020Z Learning: In NVIDIA/cccl, headers under cub/benchmarks/ that are .cuh benchmark files use `#pragma once` consistently. During code reviews, do not flag `#pragma once` in these benchmark `.cuh` files as a style violation; the preferred include guard style for this area is `#pragma once` rather than `#ifndef`/`#define`/`#endif`.

Learnt from: elstehle Repo: NVIDIA/cccl PR: 9288 File: cub/benchmarks/bench/segmented_topk/variable/common.cuh:4-5 Timestamp: 2026-06-07T16:18:35.149Z Learning: In NVIDIA/cccl, benchmark-internal `.cuh` headers under `cub/benchmarks/` should not use CCCL library-header infrastructure. Do not include `#include <cuda/std/detail/__config>`, do not add system-header pragmas (`#pragma GCC system_header`, `#pragma clang system_header`, `#pragma MSVC system_header`), and do not include the CCCL prologue/epilogue headers (`#include <cuda/std/__cccl/prologue.h>` / `#include <cuda/std/__cccl/epilogue.h>`). Instead, these benchmark headers should start with `#pragma once` and then their includes directly.

github-actions · 2026-06-12T23:50:08Z

😬 CI Workflow Results

🟥 Finished in 1h 42m: Pass: 99%/287 | Total: 2d 23h | Max: 1h 10m | Hits: 85%/229638

See results here.

productize tuning API for segscan

516f1c7

gonidelis requested review from a team as code owners June 12, 2026 21:11

gonidelis requested a review from oleksandr-pavlyk June 12, 2026 21:11

github-project-automation Bot added this to CCCL Jun 12, 2026

gonidelis requested a review from pauleonix June 12, 2026 21:11

github-project-automation Bot moved this to Todo in CCCL Jun 12, 2026

gonidelis requested a review from bernhardmgruber June 12, 2026 21:11

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 12, 2026

coderabbitai Bot reviewed Jun 12, 2026

View reviewed changes

bernhardmgruber reviewed Jun 12, 2026

View reviewed changes

reviews

c65aecc

coderabbitai Bot reviewed Jun 12, 2026

View reviewed changes

bernhardmgruber approved these changes Jun 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Productize tuning API for `DeviceSegmentedScan`#9430

Productize tuning API for `DeviceSegmentedScan`#9430
gonidelis wants to merge 2 commits into
NVIDIA:mainfrom
gonidelis:productize_seg_scan

gonidelis commented Jun 12, 2026

Uh oh!

coderabbitai Bot commented Jun 12, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

bernhardmgruber left a comment

Uh oh!

bernhardmgruber Jun 12, 2026

Uh oh!

bernhardmgruber Jun 12, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 12, 2026 •

edited

Loading

Uh oh!

bernhardmgruber Jun 12, 2026

Uh oh!

coderabbitai Bot Jun 12, 2026

Uh oh!

github-actions Bot commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

gonidelis commented Jun 12, 2026

Uh oh!

coderabbitai Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Key Changes

Impact

Walkthrough

Changes

Assessment against linked issues

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

bernhardmgruber left a comment

Choose a reason for hiding this comment

Uh oh!

bernhardmgruber Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

bernhardmgruber Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bernhardmgruber Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 12, 2026

😬 CI Workflow Results

🟥 Finished in 1h 42m: Pass: 99%/287 | Total: 2d 23h | Max: 1h 10m | Hits: 85%/229638

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 12, 2026 •

edited

Loading

coderabbitai Bot Jun 12, 2026 •

edited

Loading