Feat/hll dict size optimization clean #18144
deeppatel710 wants to merge 5 commits into apache:master from
Conversation
…high-cardinality columns

For dictionary-encoded columns with high cardinality (e.g., 14M+ distinct values), DISTINCTCOUNTHLL spent O(n log n) time inserting dictionary IDs into a RoaringBitmap before converting to HLL at finalization. This mirrors the performance issue originally reported for DISTINCTCOUNTSMARTHLL (fixed in apache#17411).

This commit introduces an optional third argument `dictSizeThreshold` (default: 100,000). When the dictionary size exceeds the threshold, dictionary values are offered directly to the HyperLogLog without going through a RoaringBitmap first. Since DISTINCTCOUNTHLL already produces an approximate result, bitmap deduplication is not needed for correctness in high-cardinality scenarios; HLL handles duplicate offers gracefully.

The optimization applies to all aggregation paths:
- Non-group-by SV and MV
- Group-by SV (both SV and MV group keys)
- Group-by MV (both SV and MV group keys)

Usage:
  DISTINCTCOUNTHLL(col)            -- default threshold (100K)
  DISTINCTCOUNTHLL(col, 12)        -- custom log2m, default threshold
  DISTINCTCOUNTHLL(col, 12, 50000) -- custom log2m and threshold
  DISTINCTCOUNTHLL(col, 12, 0)     -- disable optimization (threshold = MAX_VALUE)

Expected speedup for high-cardinality columns: 4x-10x, consistent with the benchmark results demonstrated for DISTINCTCOUNTSMARTHLL in apache#17411.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
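The dispatch the commit message describes can be sketched as follows. This is a hypothetical, self-contained illustration only: a `BitSet` stands in for Pinot's RoaringBitmap of dictionary IDs and a `HashSet` stands in for the HyperLogLog sketch; only the threshold branching mirrors the PR.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

public class ThresholdDispatch {
  static final int DEFAULT_DICT_SIZE_THRESHOLD = 100_000;

  // dictionary[i] is the value for dictionary id i; dictIds are the
  // per-document dictionary ids scanned from the segment.
  static long aggregate(String[] dictionary, int[] dictIds, int threshold) {
    Set<String> hll = new HashSet<>();  // stand-in for HyperLogLog
    if (dictionary.length > threshold) {
      // High cardinality: offer dictionary values directly; duplicate
      // offers are harmless for an approximate sketch.
      for (int id : dictIds) {
        hll.add(dictionary[id]);
      }
    } else {
      // Low cardinality: deduplicate dictionary ids in a bitmap first,
      // then convert once at finalization.
      BitSet bitmap = new BitSet(dictionary.length);
      for (int id : dictIds) {
        bitmap.set(id);
      }
      for (int id = bitmap.nextSetBit(0); id >= 0; id = bitmap.nextSetBit(id + 1)) {
        hll.add(dictionary[id]);
      }
    }
    return hll.size();
  }
}
```

Both branches yield the same answer here because the stand-ins are exact; in the real function the direct path trades exact deduplication for the HLL's approximation, which the result already implies.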
@Jackie-Jiang can you take a look at this optimization PR and leave feedback please? Thanks
xiangfu0
left a comment
Found one high-signal correctness issue; see inline comment.
if (dictionary != null) {
  int[] dictIds = blockValSet.getDictionaryIdsSV();
  getDictIdBitmap(aggregationResultHolder, dictionary).addN(dictIds, 0, length);
  if (dictionary.length() > _dictSizeThreshold) {
This dictionary-size branch is not safe because the same aggregation result holder is reused across all scanned segments. If one segment stores a DictIdsWrapper and a later segment crosses the threshold, the next getHyperLogLog() call will cast the existing holder state to the wrong type and fail with ClassCastException. DISTINCTCOUNTHLL needs a stable holder representation here, or an explicit conversion when switching between bitmap-backed and direct-HLL aggregation.
Thanks for the review, @xiangfu0! Great catch on the type-safety concern.
You're right that mixing the bitmap and HLL paths in the same result holder is unsafe. To be precise about when this can bite: within a single consuming (realtime) segment, the dictionary grows as new data is ingested. If one doc-batch sees a dictionary below the threshold (stores DictIdsWrapper) and a subsequent batch on the same scan sees the grown dictionary above the threshold, getHyperLogLog() would cast the existing DictIdsWrapper to HyperLogLog and throw a ClassCastException.
The fix centralizes the defense in both getHyperLogLog() helpers — they now check instanceof DictIdsWrapper and convert to HyperLogLog before returning. This covers all 6 aggregation paths since they all funnel through these two methods. Added a regression test that simulates the dictionary-growth scenario directly.
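The shape of that defense can be sketched as below. This is a hypothetical, simplified stand-in (a `HashSet` plays the HyperLogLog, `DictIdsWrapper` is stubbed): the point is that `getHyperLogLog()` converts a previously stored bitmap-backed state instead of blindly casting it, so the holder type stays consistent no matter when the dictionary crosses the threshold.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

public class HolderConversion {
  // Stand-in for Pinot's DictIdsWrapper: dictionary plus a bitmap of dict ids.
  static class DictIdsWrapper {
    final String[] dictionary;
    final BitSet dictIdBitmap = new BitSet();
    DictIdsWrapper(String[] dictionary) {
      this.dictionary = dictionary;
    }
  }

  Object result;  // stand-in for the reused aggregation result holder slot

  @SuppressWarnings("unchecked")
  Set<String> getHyperLogLog() {
    if (result instanceof DictIdsWrapper) {
      // A prior block used the bitmap path: convert, don't cast.
      DictIdsWrapper wrapper = (DictIdsWrapper) result;
      Set<String> hll = new HashSet<>();
      BitSet bits = wrapper.dictIdBitmap;
      for (int id = bits.nextSetBit(0); id >= 0; id = bits.nextSetBit(id + 1)) {
        hll.add(wrapper.dictionary[id]);
      }
      result = hll;  // from here on the holder stays HLL-typed
    } else if (result == null) {
      result = new HashSet<String>();
    }
    return (Set<String>) result;
  }
}
```

Because the conversion lives in the single accessor, every aggregation path that funnels through it gets the same protection, which is the centralization the reply describes.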
…n DISTINCTCOUNTHLL

In consuming (realtime) segments, the dictionary grows during ingestion. If a prior block used the bitmap path (dict size below threshold) and a later block sees the grown dictionary above the threshold, getHyperLogLog() would cast the existing DictIdsWrapper result to HyperLogLog and throw ClassCastException.

Fix: check instanceof DictIdsWrapper in both getHyperLogLog() helpers and convert to HyperLogLog before returning, so the holder type always stays consistent regardless of when the threshold is crossed. Also adds a regression test that simulates this exact scenario.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ng test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Codecov Report ❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #18144 +/- ##
============================================
+ Coverage 63.04% 63.16% +0.11%
+ Complexity 1617 1616 -1
============================================
Files 3202 3214 +12
Lines 194718 195885 +1167
Branches 30047 30272 +225
============================================
+ Hits 122760 123729 +969
- Misses 62233 62273 +40
- Partials 9725 9883 +158
Flags with carried forward coverage won't be shown.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Jackie-Jiang
left a comment
If the bottleneck is from poor performance of RoaringBitmap insertion, should we switch to a BitSet solution? Even with 1M cardinality, BitSet memory overhead is negligible (128KB). I believe the performance should be much better than directly aggregating HLL
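The reviewer's suggestion is straightforward to sketch with `java.util.BitSet`, since dictionary IDs are dense ints in `[0, dictSize)`: a BitSet of `dictSize` bits costs `dictSize / 8` bytes (1M cardinality is 128KB, matching the comment), and setting a bit is O(1) with no sorted-container bookkeeping. The class name below is hypothetical, not from the PR.

```java
import java.util.BitSet;

public class BitSetDictIdAccumulator {
  private final BitSet bits;

  public BitSetDictIdAccumulator(int dictSize) {
    // One bit per possible dictionary id: dictSize / 8 bytes.
    bits = new BitSet(dictSize);
  }

  // Accumulate a block of scanned dictionary ids; duplicates are free.
  public void addAll(int[] dictIds, int length) {
    for (int i = 0; i < length; i++) {
      bits.set(dictIds[i]);
    }
  }

  // Exact distinct count over the accumulated ids.
  public int distinctCount() {
    return bits.cardinality();
  }
}
```

Compared with the direct-HLL path, this keeps the result exact for the dictionary-id phase and defers the (approximate) HLL conversion to finalization, at a small fixed memory cost per group.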
Summary

Fixes the performance bottleneck in DISTINCTCOUNTHLL for dictionary-encoded columns with high cardinality (reported in #17336). DISTINCTCOUNTHLL currently always accumulates dictionary IDs into a RoaringBitmap before converting to HLL at finalization. For high-cardinality columns (14M+ distinct values), bitmap insertions dominate execution time at O(n log n), causing queries to take 6-10 seconds.

This adds an optional third argument dictSizeThreshold (default: 100,000). When dictionary.length() > dictSizeThreshold, dictionary values are offered directly to the HyperLogLog, bypassing the bitmap entirely. Since DISTINCTCOUNTHLL already returns an approximate result, exact bitmap deduplication is unnecessary; HLL handles duplicate offers gracefully.

The same root cause was fixed for DISTINCT_COUNT_SMART_HLL in #17411, which demonstrated 4x-10x CPU reduction for high-cardinality workloads. This PR applies the equivalent optimization to the more commonly used DISTINCTCOUNTHLL function.

Usage