
Feat/hll dict size optimization clean #18144

Open
deeppatel710 wants to merge 5 commits into apache:master from deeppatel710:feat/hll-dict-size-optimization-clean

Conversation

@deeppatel710
Contributor

Summary

Fixes the performance bottleneck in DISTINCTCOUNTHLL for dictionary-encoded columns with high cardinality (reported in #17336).

DISTINCTCOUNTHLL currently always accumulates dictionary IDs into a RoaringBitmap before converting to HLL at finalization. For high-cardinality columns (14M+ distinct values), the O(n log n) bitmap insertions dominate execution time, causing queries to take 6-10 seconds.

This adds an optional third argument dictSizeThreshold (default: 100,000). When dictionary.length() > dictSizeThreshold, dictionary values are offered
directly to the HyperLogLog, bypassing the bitmap entirely. Since DISTINCTCOUNTHLL already returns an approximate result, exact bitmap deduplication is
unnecessary — HLL handles duplicate offers gracefully.

The same root cause was fixed for DISTINCT_COUNT_SMART_HLL in #17411, which demonstrated 4x-10x CPU reduction for high-cardinality workloads. This PR
applies the equivalent optimization to the more commonly used DISTINCTCOUNTHLL function.

Usage

DISTINCTCOUNTHLL(col)             -- default threshold (100K)
DISTINCTCOUNTHLL(col, 12)         -- custom log2m, default threshold
DISTINCTCOUNTHLL(col, 12, 50000)  -- custom log2m and threshold
DISTINCTCOUNTHLL(col, 12, 0)      -- disable optimization (threshold = Integer.MAX_VALUE)
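As a sketch, the path selection described above can be illustrated in plain Java. The choosePath helper and its String return value are illustrative, not Pinot's API; only the dictionary.length() > threshold condition, the 100,000 default, and the 0-means-disabled convention come from this PR:

```java
// Simplified sketch of the threshold dispatch. Everything except the
// branch condition and the two constants is a stand-in for illustration.
public class Main {
    public static final int DEFAULT_DICT_SIZE_THRESHOLD = 100_000;

    // Which aggregation path would be taken for a given dictionary size.
    // A threshold of 0 disables the optimization (treated as MAX_VALUE).
    public static String choosePath(int dictionaryLength, int dictSizeThreshold) {
        int threshold = dictSizeThreshold == 0 ? Integer.MAX_VALUE : dictSizeThreshold;
        return dictionaryLength > threshold ? "direct-hll" : "bitmap";
    }

    public static void main(String[] args) {
        System.out.println(choosePath(14_000_000, DEFAULT_DICT_SIZE_THRESHOLD)); // direct-hll
        System.out.println(choosePath(50_000, DEFAULT_DICT_SIZE_THRESHOLD));     // bitmap
        System.out.println(choosePath(14_000_000, 0));                           // bitmap (disabled)
    }
}
```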

Changes

- DistinctCountHLLAggregationFunction: added DEFAULT_DICT_SIZE_THRESHOLD = 100_000, _dictSizeThreshold field, optional 3rd constructor arg, and direct-HLL path in all 6 aggregation paths (non-group-by SV/MV, group-by SV/MV with SV/MV group keys)
- Tests: added 4 new test cases covering parameter defaults, custom threshold, disabled optimization, and correctness of the direct-HLL path vs the bitmap path

Test plan

- All existing HLL tests pass
- New tests verify parameter parsing and approximate correctness of the direct-HLL path
- Manual benchmark against a high-cardinality column (expected ~4-10x speedup, matching #17411 results)

Deep Patel and others added 2 commits April 9, 2026 00:26
…high-cardinality columns

For dictionary-encoded columns with high cardinality (e.g., 14M+ distinct values),
DISTINCTCOUNTHLL spent O(n log n) time inserting dictionary IDs into a RoaringBitmap
before converting to HLL at finalization. This mirrors the performance issue originally
reported for DISTINCTCOUNTSMARTHLL (fixed in apache#17411).

This commit introduces an optional third argument `dictSizeThreshold` (default: 100,000).
When the dictionary size exceeds the threshold, dictionary values are offered directly
to the HyperLogLog without going through a RoaringBitmap first. Since DISTINCTCOUNTHLL
already produces an approximate result, bitmap deduplication is not needed for correctness
in high-cardinality scenarios — HLL handles duplicate offers gracefully.

The optimization applies to all aggregation paths:
- Non-group-by SV and MV
- Group-by SV (both SV and MV group keys)
- Group-by MV (both SV and MV group keys)

Usage:
  DISTINCTCOUNTHLL(col)               -- default threshold (100K)
  DISTINCTCOUNTHLL(col, 12)           -- custom log2m, default threshold
  DISTINCTCOUNTHLL(col, 12, 50000)    -- custom log2m and threshold
  DISTINCTCOUNTHLL(col, 12, 0)        -- disable optimization (threshold = MAX_VALUE)

Expected speedup for high-cardinality columns: 4x-10x, consistent with the
benchmark results demonstrated for DISTINCTCOUNTSMARTHLL in apache#17411.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@deeppatel710
Contributor Author

@Jackie-Jiang can you take a look at this optimization PR and leave feedback please? Thanks

Contributor

@xiangfu0 left a comment


Found one high-signal correctness issue; see inline comment.

if (dictionary != null) {
  int[] dictIds = blockValSet.getDictionaryIdsSV();
  getDictIdBitmap(aggregationResultHolder, dictionary).addN(dictIds, 0, length);
  if (dictionary.length() > _dictSizeThreshold) {
Contributor


This dictionary-size branch is not safe because the same aggregation result holder is reused across all scanned segments. If one segment stores a DictIdsWrapper and a later segment crosses the threshold, the next getHyperLogLog() call will cast the existing holder state to the wrong type and fail with ClassCastException. DISTINCTCOUNTHLL needs a stable holder representation here, or an explicit conversion when switching between bitmap-backed and direct-HLL aggregation.
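The failure mode this comment describes can be reproduced with stand-in types: one shared holder slot written by the bitmap path, then blindly cast by the direct-HLL path. java.util.BitSet stands in for Pinot's DictIdsWrapper and java.util.HashSet for the HyperLogLog; neither is the actual Pinot class:

```java
import java.util.BitSet;
import java.util.HashSet;

// Stand-in sketch of the reviewer's hazard: the same result-holder slot
// is reused across segments, so a bitmap left by one segment is later
// cast to the HLL type by the next.
public class Main {
    public static Object holder; // shared aggregation result holder (one slot)

    public static void main(String[] args) {
        // Segment 1: dictionary below threshold -> bitmap path stores a bitmap.
        holder = new BitSet();
        // Segment 2: dictionary above threshold -> direct-HLL path reuses
        // the holder as an HLL and the cast fails at runtime.
        try {
            HashSet<?> hll = (HashSet<?>) holder;
            System.out.println("no exception: " + hll);
        } catch (ClassCastException e) {
            System.out.println("ClassCastException, as the review predicts");
        }
    }
}
```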

Contributor Author


Thanks for the review, @xiangfu0! Great catch on the type-safety concern.

You're right that mixing the bitmap and HLL paths in the same result holder is unsafe. To be precise about when this can bite: within a single consuming (realtime) segment, the dictionary grows as new data is ingested. If one doc-batch sees a dictionary below the threshold (stores DictIdsWrapper) and a subsequent batch on the same scan sees the grown dictionary above the threshold, getHyperLogLog() would cast the existing DictIdsWrapper to HyperLogLog and throw a ClassCastException.
The fix centralizes the defense in both getHyperLogLog() helpers — they now check instanceof DictIdsWrapper and convert to HyperLogLog before returning. This covers all 6 aggregation paths since they all funnel through these two methods. Added a regression test that simulates the dictionary-growth scenario directly.
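A minimal sketch of that convert-on-read defense, with java.util.BitSet standing in for DictIdsWrapper and an exact HashSet standing in for the HyperLogLog (the names, holder shape, and conversion loop are illustrative, not Pinot's actual code; only the instanceof-then-convert pattern mirrors the fix):

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Sketch of the fixed getHyperLogLog(): if an earlier block left a
// bitmap of dict ids in the holder, fold it into the HLL before
// returning, so the holder type stays consistent.
public class Main {
    public static Object holder; // stand-in for the result holder's single slot

    @SuppressWarnings("unchecked")
    public static Set<Long> getHyperLogLog(long[] dictionaryValues) {
        if (holder instanceof BitSet) {
            BitSet dictIds = (BitSet) holder;
            Set<Long> hll = new HashSet<>();
            for (int id = dictIds.nextSetBit(0); id >= 0; id = dictIds.nextSetBit(id + 1)) {
                hll.add(dictionaryValues[id]); // offer the dictionary value, not the id
            }
            holder = hll; // conversion done exactly once
        } else if (holder == null) {
            holder = new HashSet<Long>();
        }
        return (Set<Long>) holder;
    }

    public static void main(String[] args) {
        long[] dictionary = {10L, 20L, 30L, 40L};
        // Block 1: dictionary below threshold -> bitmap path stores dict ids 0 and 2.
        BitSet bitmap = new BitSet();
        bitmap.set(0);
        bitmap.set(2);
        holder = bitmap;
        // Block 2: dictionary grew past the threshold -> direct-HLL path.
        // Without the instanceof check, this would be a ClassCastException.
        Set<Long> hll = getHyperLogLog(dictionary);
        hll.add(dictionary[3]);
        System.out.println(hll.size()); // 3 distinct values: 10, 30, 40
    }
}
```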

@Jackie-Jiang added the enhancement (Improvement to existing functionality) label Apr 11, 2026
Deep Patel and others added 2 commits April 12, 2026 00:15
…n DISTINCTCOUNTHLL

In consuming (realtime) segments, the dictionary grows during ingestion. If a
prior block used the bitmap path (dict size below threshold) and a later block
sees the grown dictionary above the threshold, getHyperLogLog() would cast the
existing DictIdsWrapper result to HyperLogLog and throw ClassCastException.

Fix: check instanceof DictIdsWrapper in both getHyperLogLog() helpers and
convert to HyperLogLog before returning, so the holder type always stays
consistent regardless of when the threshold is crossed.

Also adds a regression test that simulates this exact scenario.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ng test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codecov-commenter

codecov-commenter commented Apr 12, 2026

Codecov Report

❌ Patch coverage is 45.71429% with 38 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.16%. Comparing base (2e80bff) to head (0327211).
⚠️ Report is 14 commits behind head on master.

Files with missing lines Patch % Lines
.../function/DistinctCountHLLAggregationFunction.java 45.71% 31 Missing and 7 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18144      +/-   ##
============================================
+ Coverage     63.04%   63.16%   +0.11%     
+ Complexity     1617     1616       -1     
============================================
  Files          3202     3214      +12     
  Lines        194718   195885    +1167     
  Branches      30047    30272     +225     
============================================
+ Hits         122760   123729     +969     
- Misses        62233    62273      +40     
- Partials       9725     9883     +158     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.12% <45.71%> (+0.11%) ⬆️
java-21 63.11% <45.71%> (+0.09%) ⬆️
temurin 63.16% <45.71%> (+0.11%) ⬆️
unittests 63.16% <45.71%> (+0.11%) ⬆️
unittests1 55.39% <45.71%> (-0.18%) ⬇️
unittests2 34.75% <1.42%> (+1.32%) ⬆️

Flags with carried forward coverage won't be shown.


Contributor

@Jackie-Jiang left a comment


If the bottleneck is from poor performance of RoaringBitmap insertion, should we switch to a BitSet solution? Even with 1M cardinality, BitSet memory overhead is negligible (128KB). I believe the performance should be much better than directly aggregating HLL
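A minimal sketch of the BitSet alternative suggested here (dedupeDictIds is an illustrative name, not Pinot code): each dict-id insert becomes a single bit set with no tree or container restructuring, and the 128 KB figure for 1M cardinality checks out as 2^20 bits / 8 / 1024:

```java
import java.util.BitSet;

// Sketch of BitSet-based dict-id dedup: O(1) per insert, ~1 bit per
// dictionary entry; HLL construction can then happen at finalization.
public class Main {
    public static BitSet dedupeDictIds(int[] dictIds) {
        BitSet seen = new BitSet();
        for (int id : dictIds) {
            seen.set(id); // constant-time flat array write
        }
        return seen;
    }

    public static void main(String[] args) {
        int[] dictIds = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5};
        System.out.println(dedupeDictIds(dictIds).cardinality()); // 7 distinct dict ids
        // Memory for a 1M-entry dictionary: 2^20 bits = 128 KB.
        System.out.println((1 << 20) / 8 / 1024); // 128
    }
}
```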


Labels

enhancement (Improvement to existing functionality)


4 participants