Skip to content

Productize tuning API for DeviceSegmentedScan#9430

Open
gonidelis wants to merge 2 commits into
NVIDIA:mainfrom
gonidelis:productize_seg_scan
Open

Productize tuning API for DeviceSegmentedScan#9430
gonidelis wants to merge 2 commits into
NVIDIA:mainfrom
gonidelis:productize_seg_scan

Conversation

@gonidelis

Copy link
Copy Markdown
Member

fixes #8581

@gonidelis gonidelis requested review from a team as code owners June 12, 2026 21:11
@gonidelis gonidelis requested a review from pauleonix June 12, 2026 21:11
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 12, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 12, 2026
@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Summary

This PR productizes the tuning API for cub::DeviceSegmentedScan by promoting internal tuning policy types into stable, public API types and updating codepaths, documentation, and tests to use them. The change makes environment-based tuning for segmented scans accessible and production-ready.

Key Changes

  • Public API types:

    • Introduced public SegmentedScanBlockPolicy (block-level tuning) and SegmentedScanPolicy (overall tuning).
    • Replaced uses of internal detail::segmented_scan::segmented_scan_policy / block types with the public types across benchmarks, dispatch, kernels, and tests.
  • API/ABI and behavior fixes:

    • Policy selectors (both generic and type-deduced) now return cub::SegmentedScanPolicy.
    • dispatch now constructs active_policy as SegmentedScanPolicy from the selector.
    • Added runtime assertions in dispatch to validate positive threads_per_block and items_per_thread for selected policies.
  • Policy struct improvements:

    • operator== marked noexcept; operator!= implemented as !(lhs == rhs) and noexcept.
    • Host-side operator<< guards switched to _CCCL_HOSTED().
    • Policies are aggregate-initializable and validated as semiregular/aggregate in tests.
  • Documentation:

    • DeviceSegmentedScan header extended with an @par Tuning section documenting how to pass a policy selector that returns SegmentedScanPolicy, with examples.
  • Tests and examples:

    • Benchmarks and unit tests updated to construct and use cub::SegmentedScanPolicy / SegmentedScanBlockPolicy.
    • New/updated C2H tests:
      • Compile-time checks that SegmentedScanPolicy and SegmentedScanBlockPolicy are semiregular and aggregates.
      • Runtime test demonstrating env-based tuning API using cuda::execution::tune(...) with a selector that varies items_per_thread by compute capability.
    • Benchmarks updated to return SegmentedScanPolicy from policy selectors.

Impact

Downstream users can now tune DeviceSegmentedScan via cuda::execution::tune or other env-based APIs using the stable public types SegmentedScanPolicy and SegmentedScanBlockPolicy. The change is backward-compatible for callers using environment tuning (they must now supply/receive the public policy types) and improves robustness via additional runtime assertions and clearer documentation.

Walkthrough

important: Promote internal segmented-scan tuning types to public SegmentedScanBlockPolicy/SegmentedScanPolicy, update selectors and dispatch to use them, add runtime assertions and docs, and update tests and benchmarks to validate and exercise the public tuning API.

Changes

Segmented-scan public policy promotion and tuning API

Layer / File(s) Summary
Public policy type definitions
cub/cub/device/dispatch/tuning/tuning_segmented_scan.cuh
SegmentedScanBlockPolicy and SegmentedScanPolicy are introduced as public aggregate types with noexcept comparison and host-only stream formatting.
Policy selector return type updates
cub/cub/device/dispatch/tuning/tuning_segmented_scan.cuh
policy_selector::operator() and policy_selector_from_types::operator() in detail::segmented_scan now return the public SegmentedScanPolicy type constructed from SegmentedScanBlockPolicy.
Dispatch active_policy construction and validation
cub/cub/device/dispatch/dispatch_segmented_scan.cuh
active_policy is constructed as SegmentedScanPolicy from policy_selector(cc), and new _CCCL_ASSERT checks ensure threads_per_block and items_per_thread are positive.
Kernel and API documentation
cub/cub/device/dispatch/kernels/kernel_segmented_scan.cuh, cub/cub/device/device_segmented_scan.cuh
Kernel comment updated to reference SegmentedScanPolicy. New tuning documentation section in DeviceSegmentedScan explains environment-based API tuning with custom policy selectors and examples.
Public policy compile-time validation
cub/test/catch2_test_device_segmented_scan_env.cu
Test tuning helper returns public SegmentedScanPolicy. New compile-time tests verify semiregular/aggregate properties and designated initialization equality for both policy types.
Environment-based API tuning integration test
cub/test/catch2_test_device_segmented_scan_env_api.cu
Includes cuda/__execution/tune.h and adds a SegmentedScanPolicySelector functor selecting policies by compute capability, integrated with cuda::execution::tune() for end-to-end validation.
Benchmark and existing test updates
cub/benchmarks/bench/segmented_scan/base.cuh, cub/test/catch2_test_device_segmented_scan_multi_segment.cu
Policy selectors updated to construct and return public SegmentedScanPolicy from SegmentedScanBlockPolicy instead of internal types.

Assessment against linked issues

Objective Addressed Explanation
Productize tuning API for cub::DeviceSegmentedScan [#8581]

Suggested reviewers

  • shwina
  • gevtushenko
  • NaderAlAwar

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
cub/cub/device/dispatch/tuning/tuning_segmented_scan.cuh (1)

148-154: 📐 Maintainability & Code Quality | 💤 Low value

suggestion: Remove the trailing semicolon after the function body closing brace on line 154. While technically allowed, it's unconventional and inconsistent with policy_selector::operator() (line 137).

Proposed fix
     return policy_selector{accum_size, accum_align}(cc);
-  };
+  }
 };

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 44478f62-2646-4c8e-9019-604f3c985d2b

📥 Commits

Reviewing files that changed from the base of the PR and between 809e315 and 516f1c7.

📒 Files selected for processing (8)
  • cub/benchmarks/bench/segmented_scan/base.cuh
  • cub/cub/device/device_segmented_scan.cuh
  • cub/cub/device/dispatch/dispatch_segmented_scan.cuh
  • cub/cub/device/dispatch/kernels/kernel_segmented_scan.cuh
  • cub/cub/device/dispatch/tuning/tuning_segmented_scan.cuh
  • cub/test/catch2_test_device_segmented_scan_env.cu
  • cub/test/catch2_test_device_segmented_scan_env_api.cu
  • cub/test/catch2_test_device_segmented_scan_multi_segment.cu

@bernhardmgruber bernhardmgruber left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good, just two comments!

{
struct block_segmented_scan_policy
//! The policy for block-level kernels in @ref DeviceSegmentedScan.
struct BlockSegmentedScanPolicy

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strong suggestion: Please rename this to SegmentedScanBlockPolicy.

Comment on lines +31 to +36
int threads_per_block; //!< Number of threads per block
int items_per_thread; //!< Number of items per thread
BlockLoadAlgorithm load_algorithm; //!< Algorithm for loading input
CacheLoadModifier load_modifier; //!< Cache load modifier
BlockStoreAlgorithm store_algorithm; //!< Algorithm for storing output
BlockScanAlgorithm scan_algorithm; //!< Algorithm for the block scan

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Please try to repeat the documentation comments I used for those members in other places, for example:

struct MergeSortPolicy
{
  int threads_per_block; //!< Number of threads in a CUDA block
  int items_per_thread; //!< Number of items processed per thread
  BlockLoadAlgorithm load_algorithm; //!< The @ref BlockLoadAlgorithm used for loading items from global memory
  CacheLoadModifier load_modifier; //!< The @ref CacheLoadModifier used for loading items from global memory
  BlockStoreAlgorithm store_algorithm; //!< The @ref BlockStoreAlgorithm used for storing items to global memory

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
cub/cub/device/dispatch/tuning/tuning_segmented_scan.cuh (1)

57-62: 📐 Maintainability & Code Quality | 💤 Low value

suggestion: The debug output label .max_segments_per_block does not match the member name max_segments. Either rename the member to max_segments_per_block for consistency, or update the label to .max_segments.

-        << ", .max_segments_per_block = " << policy.max_segments
+        << ", .max_segments = " << policy.max_segments
cub/benchmarks/bench/segmented_scan/base.cuh (1)

33-33: 📐 Maintainability & Code Quality | 💤 Low value

suggestion: Add noexcept specification.

This constexpr function constructs aggregate types and cannot throw. As per coding guidelines, "All functions that don't throw exceptions must use noexcept specification." The upstream contract also lacks noexcept, so this may be a broader pattern, but adding it improves exception safety guarantees and enables compiler optimizations.

Append noexcept after const: [[nodiscard]] _CCCL_HOST_DEVICE_API constexpr auto operator()(::cuda::compute_capability) const noexcept -> ::cub::SegmentedScanPolicy

Source: Coding guidelines


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a3f71acd-dca8-448a-a2ed-705481cf3e37

📥 Commits

Reviewing files that changed from the base of the PR and between 516f1c7 and c65aecc.

📒 Files selected for processing (4)
  • cub/benchmarks/bench/segmented_scan/base.cuh
  • cub/cub/device/dispatch/tuning/tuning_segmented_scan.cuh
  • cub/test/catch2_test_device_segmented_scan_env.cu
  • cub/test/catch2_test_device_segmented_scan_multi_segment.cu
🚧 Files skipped from review as they are similar to previous changes (2)
  • cub/test/catch2_test_device_segmented_scan_multi_segment.cu
  • cub/test/catch2_test_device_segmented_scan_env.cu

{
[[nodiscard]] _CCCL_HOST_DEVICE constexpr auto operator()(cuda::compute_capability) const
-> cub::detail::segmented_scan::segmented_scan_policy
[[nodiscard]] _CCCL_HOST_DEVICE constexpr auto operator()(cuda::compute_capability) const -> cub::SegmentedScanPolicy

@coderabbitai coderabbitai Bot Jun 12, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

important: Annotation and type-qualification mismatch with upstream contract.

The function uses _CCCL_HOST_DEVICE but the upstream policy selector (tuning_segmented_scan.cuh:106) uses _CCCL_HOST_DEVICE_API. Additionally, the parameter type cuda::compute_capability and return types cub::SegmentedScanPolicy / cub::SegmentedScanBlockPolicy lack the :: prefix required for full qualification starting from the global namespace. As per coding guidelines, type names must be fully qualified except when already declared in the current or enclosing namespace.

Change _CCCL_HOST_DEVICE to _CCCL_HOST_DEVICE_API, use ::cuda::compute_capability for the parameter, and ::cub::SegmentedScanPolicy / ::cub::SegmentedScanBlockPolicy for the types.

Source: Coding guidelines

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could change to _CCCL_HOST_DEVICE_API here, but not required as part of this PR.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Understood. Would you like me to open a follow-up issue to track switching _CCCL_HOST_DEVICE_CCCL_HOST_DEVICE_API and adding fully-qualified type names (::cuda::compute_capability, ::cub::SegmentedScanPolicy, ::cub::SegmentedScanBlockPolicy) in this benchmark file?


🧠 Learnings used
Learnt from: sleeepyjack
Repo: NVIDIA/cccl PR: 8857
File: cudax/include/cuda/experimental/__cuco/__hyperloglog/default_policy.cuh:92-95
Timestamp: 2026-05-06T23:47:10.296Z
Learning: In this CCCL (CUDA C++ Core Libraries) codebase, do NOT treat `_CCCL_API` as host-only. `_CCCL_API` is defined in `libcudacxx/include/cuda/std/__cccl/visibility.h` and always expands to include `_CCCL_HOST_DEVICE` (possibly alongside `_CCCL_TILE`, `_CCCL_VISIBILITY_HIDDEN`, and `_CCCL_EXCLUDE_FROM_EXPLICIT_INSTANTIATION`). So any function annotated with `_CCCL_API` is already valid for both host and device code; reviewers should not flag it as host-only.

Learnt from: sleeepyjack
Repo: NVIDIA/cccl PR: 8857
File: cudax/include/cuda/experimental/__cuco/__hyperloglog/hyperloglog_impl.cuh:426-428
Timestamp: 2026-05-06T23:47:17.597Z
Learning: In the CCCL/CUDA C++ Core Libraries codebase (NVIDIA/cccl), treat `_CCCL_API` as host-and-device callable. `_CCCL_API` expands to `_CCCL_HOST_DEVICE` via `libcudacxx/include/cuda/std/__cccl/visibility.h`, so functions marked with `_CCCL_API` may be invoked from both host and device code. When reviewing device-code contexts, do not flag `_CCCL_API`-annotated functions as if they were host-only.

Learnt from: NaderAlAwar
Repo: NVIDIA/cccl PR: 9171
File: libcudacxx/include/cuda/std/__cccl/diagnostic.h:126-126
Timestamp: 2026-06-03T13:24:36.356Z
Learning: When reviewing code that uses NVIDIA NVCC/NVRTC diagnostic suppression pragmas, treat `nv_diag_suppress` as supporting comma-separated warning numbers in a single directive. For example, `_Pragma("nv_diag_suppress 1444,20199")` (or an equivalent `#pragma nv_diag_suppress 1444,20199`) is valid; do not flag comma-separated warning IDs inside `nv_diag_suppress` pragmas as unsupported.

Learnt from: elstehle
Repo: NVIDIA/cccl PR: 9288
File: cub/benchmarks/bench/segmented_topk/variable/common.cuh:4-4
Timestamp: 2026-06-07T16:13:21.020Z
Learning: In NVIDIA/cccl, headers under cub/benchmarks/ that are .cuh benchmark files use `#pragma once` consistently. During code reviews, do not flag `#pragma once` in these benchmark `.cuh` files as a style violation; the preferred include guard style for this area is `#pragma once` rather than `#ifndef`/`#define`/`#endif`.

Learnt from: elstehle
Repo: NVIDIA/cccl PR: 9288
File: cub/benchmarks/bench/segmented_topk/variable/common.cuh:4-5
Timestamp: 2026-06-07T16:18:35.149Z
Learning: In NVIDIA/cccl, benchmark-internal `.cuh` headers under `cub/benchmarks/` should not use CCCL library-header infrastructure. Do not include `#include <cuda/std/detail/__config>`, do not add system-header pragmas (`#pragma GCC system_header`, `#pragma clang system_header`, `#pragma MSVC system_header`), and do not include the CCCL prologue/epilogue headers (`#include <cuda/std/__cccl/prologue.h>` / `#include <cuda/std/__cccl/epilogue.h>`). Instead, these benchmark headers should start with `#pragma once` and then their includes directly.

@github-actions

Copy link
Copy Markdown
Contributor

😬 CI Workflow Results

🟥 Finished in 1h 42m: Pass: 99%/287 | Total: 2d 23h | Max: 1h 10m | Hits: 85%/229638

See results here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

Productize the cub::DeviceSegmentedScan tuning API

2 participants