Productize tuning API for DeviceSegmentedScan#9430
Conversation
SummaryThis PR productizes the tuning API for cub::DeviceSegmentedScan by promoting internal tuning policy types into stable, public API types and updating codepaths, documentation, and tests to use them. The change makes environment-based tuning for segmented scans accessible and production-ready. Key Changes
ImpactDownstream users can now tune DeviceSegmentedScan via cuda::execution::tune or other env-based APIs using the stable public types SegmentedScanPolicy and SegmentedScanBlockPolicy. The change is backward-compatible for callers using environment tuning (they must now supply/receive the public policy types) and improves robustness via additional runtime assertions and clearer documentation. Walkthroughimportant: Promote internal segmented-scan tuning types to public ChangesSegmented-scan public policy promotion and tuning API
Assessment against linked issues
Suggested reviewers
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
cub/cub/device/dispatch/tuning/tuning_segmented_scan.cuh (1)
148-154: 📐 Maintainability & Code Quality | 💤 Low valuesuggestion: Remove the trailing semicolon after the function body closing brace on line 154. While technically allowed, it's unconventional and inconsistent with
policy_selector::operator()(line 137).Proposed fix
return policy_selector{accum_size, accum_align}(cc); - }; + } };
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 44478f62-2646-4c8e-9019-604f3c985d2b
📒 Files selected for processing (8)
cub/benchmarks/bench/segmented_scan/base.cuhcub/cub/device/device_segmented_scan.cuhcub/cub/device/dispatch/dispatch_segmented_scan.cuhcub/cub/device/dispatch/kernels/kernel_segmented_scan.cuhcub/cub/device/dispatch/tuning/tuning_segmented_scan.cuhcub/test/catch2_test_device_segmented_scan_env.cucub/test/catch2_test_device_segmented_scan_env_api.cucub/test/catch2_test_device_segmented_scan_multi_segment.cu
bernhardmgruber
left a comment
There was a problem hiding this comment.
Looks pretty good, just two comments!
| { | ||
| struct block_segmented_scan_policy | ||
| //! The policy for block-level kernels in @ref DeviceSegmentedScan. | ||
| struct BlockSegmentedScanPolicy |
There was a problem hiding this comment.
Strong suggestion: Please rename this to SegmentedScanBlockPolicy.
| int threads_per_block; //!< Number of threads per block | ||
| int items_per_thread; //!< Number of items per thread | ||
| BlockLoadAlgorithm load_algorithm; //!< Algorithm for loading input | ||
| CacheLoadModifier load_modifier; //!< Cache load modifier | ||
| BlockStoreAlgorithm store_algorithm; //!< Algorithm for storing output | ||
| BlockScanAlgorithm scan_algorithm; //!< Algorithm for the block scan |
There was a problem hiding this comment.
Suggestion: Please try to repeat the documentation comments I used for those members in other places, for example:
struct MergeSortPolicy
{
int threads_per_block; //!< Number of threads in a CUDA block
int items_per_thread; //!< Number of items processed per thread
BlockLoadAlgorithm load_algorithm; //!< The @ref BlockLoadAlgorithm used for loading items from global memory
CacheLoadModifier load_modifier; //!< The @ref CacheLoadModifier used for loading items from global memory
BlockStoreAlgorithm store_algorithm; //!< The @ref BlockStoreAlgorithm used for storing items to global memoryThere was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
cub/cub/device/dispatch/tuning/tuning_segmented_scan.cuh (1)
57-62: 📐 Maintainability & Code Quality | 💤 Low valuesuggestion: The debug output label
.max_segments_per_blockdoes not match the member namemax_segments. Either rename the member tomax_segments_per_blockfor consistency, or update the label to.max_segments.- << ", .max_segments_per_block = " << policy.max_segments + << ", .max_segments = " << policy.max_segmentscub/benchmarks/bench/segmented_scan/base.cuh (1)
33-33: 📐 Maintainability & Code Quality | 💤 Low valuesuggestion: Add
noexceptspecification.This constexpr function constructs aggregate types and cannot throw. As per coding guidelines, "All functions that don't throw exceptions must use
noexceptspecification." The upstream contract also lacksnoexcept, so this may be a broader pattern, but adding it improves exception safety guarantees and enables compiler optimizations.Append
noexceptafterconst:[[nodiscard]] _CCCL_HOST_DEVICE_API constexpr auto operator()(::cuda::compute_capability) const noexcept -> ::cub::SegmentedScanPolicySource: Coding guidelines
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a3f71acd-dca8-448a-a2ed-705481cf3e37
📒 Files selected for processing (4)
cub/benchmarks/bench/segmented_scan/base.cuhcub/cub/device/dispatch/tuning/tuning_segmented_scan.cuhcub/test/catch2_test_device_segmented_scan_env.cucub/test/catch2_test_device_segmented_scan_multi_segment.cu
🚧 Files skipped from review as they are similar to previous changes (2)
- cub/test/catch2_test_device_segmented_scan_multi_segment.cu
- cub/test/catch2_test_device_segmented_scan_env.cu
| { | ||
| [[nodiscard]] _CCCL_HOST_DEVICE constexpr auto operator()(cuda::compute_capability) const | ||
| -> cub::detail::segmented_scan::segmented_scan_policy | ||
| [[nodiscard]] _CCCL_HOST_DEVICE constexpr auto operator()(cuda::compute_capability) const -> cub::SegmentedScanPolicy |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win
important: Annotation and type-qualification mismatch with upstream contract.
The function uses _CCCL_HOST_DEVICE but the upstream policy selector (tuning_segmented_scan.cuh:106) uses _CCCL_HOST_DEVICE_API. Additionally, the parameter type cuda::compute_capability and return types cub::SegmentedScanPolicy / cub::SegmentedScanBlockPolicy lack the :: prefix required for full qualification starting from the global namespace. As per coding guidelines, type names must be fully qualified except when already declared in the current or enclosing namespace.
Change _CCCL_HOST_DEVICE to _CCCL_HOST_DEVICE_API, use ::cuda::compute_capability for the parameter, and ::cub::SegmentedScanPolicy / ::cub::SegmentedScanBlockPolicy for the types.
Source: Coding guidelines
There was a problem hiding this comment.
We could change to _CCCL_HOST_DEVICE_API here, but not required as part of this PR.
There was a problem hiding this comment.
suggestion: Understood. Would you like me to open a follow-up issue to track switching _CCCL_HOST_DEVICE → _CCCL_HOST_DEVICE_API and adding fully-qualified type names (::cuda::compute_capability, ::cub::SegmentedScanPolicy, ::cub::SegmentedScanBlockPolicy) in this benchmark file?
🧠 Learnings used
Learnt from: sleeepyjack
Repo: NVIDIA/cccl PR: 8857
File: cudax/include/cuda/experimental/__cuco/__hyperloglog/default_policy.cuh:92-95
Timestamp: 2026-05-06T23:47:10.296Z
Learning: In this CCCL (CUDA C++ Core Libraries) codebase, do NOT treat `_CCCL_API` as host-only. `_CCCL_API` is defined in `libcudacxx/include/cuda/std/__cccl/visibility.h` and always expands to include `_CCCL_HOST_DEVICE` (possibly alongside `_CCCL_TILE`, `_CCCL_VISIBILITY_HIDDEN`, and `_CCCL_EXCLUDE_FROM_EXPLICIT_INSTANTIATION`). So any function annotated with `_CCCL_API` is already valid for both host and device code; reviewers should not flag it as host-only.
Learnt from: sleeepyjack
Repo: NVIDIA/cccl PR: 8857
File: cudax/include/cuda/experimental/__cuco/__hyperloglog/hyperloglog_impl.cuh:426-428
Timestamp: 2026-05-06T23:47:17.597Z
Learning: In the CCCL/CUDA C++ Core Libraries codebase (NVIDIA/cccl), treat `_CCCL_API` as host-and-device callable. `_CCCL_API` expands to `_CCCL_HOST_DEVICE` via `libcudacxx/include/cuda/std/__cccl/visibility.h`, so functions marked with `_CCCL_API` may be invoked from both host and device code. When reviewing device-code contexts, do not flag `_CCCL_API`-annotated functions as if they were host-only.
Learnt from: NaderAlAwar
Repo: NVIDIA/cccl PR: 9171
File: libcudacxx/include/cuda/std/__cccl/diagnostic.h:126-126
Timestamp: 2026-06-03T13:24:36.356Z
Learning: When reviewing code that uses NVIDIA NVCC/NVRTC diagnostic suppression pragmas, treat `nv_diag_suppress` as supporting comma-separated warning numbers in a single directive. For example, `_Pragma("nv_diag_suppress 1444,20199")` (or an equivalent `#pragma nv_diag_suppress 1444,20199`) is valid; do not flag comma-separated warning IDs inside `nv_diag_suppress` pragmas as unsupported.
Learnt from: elstehle
Repo: NVIDIA/cccl PR: 9288
File: cub/benchmarks/bench/segmented_topk/variable/common.cuh:4-4
Timestamp: 2026-06-07T16:13:21.020Z
Learning: In NVIDIA/cccl, headers under cub/benchmarks/ that are .cuh benchmark files use `#pragma once` consistently. During code reviews, do not flag `#pragma once` in these benchmark `.cuh` files as a style violation; the preferred include guard style for this area is `#pragma once` rather than `#ifndef`/`#define`/`#endif`.
Learnt from: elstehle
Repo: NVIDIA/cccl PR: 9288
File: cub/benchmarks/bench/segmented_topk/variable/common.cuh:4-5
Timestamp: 2026-06-07T16:18:35.149Z
Learning: In NVIDIA/cccl, benchmark-internal `.cuh` headers under `cub/benchmarks/` should not use CCCL library-header infrastructure. Do not include `#include <cuda/std/detail/__config>`, do not add system-header pragmas (`#pragma GCC system_header`, `#pragma clang system_header`, `#pragma MSVC system_header`), and do not include the CCCL prologue/epilogue headers (`#include <cuda/std/__cccl/prologue.h>` / `#include <cuda/std/__cccl/epilogue.h>`). Instead, these benchmark headers should start with `#pragma once` and then their includes directly.
😬 CI Workflow Results🟥 Finished in 1h 42m: Pass: 99%/287 | Total: 2d 23h | Max: 1h 10m | Hits: 85%/229638See results here. |
fixes #8581