Skip to content

Add parameterized forward index compression codecs#18097

Open
xiangfu0 wants to merge 3 commits intoapache:masterfrom
xiangfu0:codex/parameterized-compression-codec
Open

Add parameterized forward index compression codecs#18097
xiangfu0 wants to merge 3 commits intoapache:masterfrom
xiangfu0:codex/parameterized-compression-codec

Conversation

@xiangfu0
Copy link
Copy Markdown
Contributor

@xiangfu0 xiangfu0 commented Apr 3, 2026

Summary

  • add a structured CompressionCodecSpec parser/model for compressionCodec, including CODEC and CODEC(level) syntax plus ZSTD -> ZSTANDARD canonicalization
  • thread optional compression levels through raw/forward index writer creation for codecs that support them, while preserving legacy behavior for plain codec names
  • persist the explicit canonical forward-index compression spec in segment metadata so level-only config changes can trigger rewrite detection during reload
  • add parser, config serde, compression factory, and forward-index handler tests covering aliases, validation errors, and level-aware rewrites

User manual

This change keeps the external config field name as compressionCodec and extends its accepted values.

Accepted syntax:

  • CODEC
  • CODEC(level)

Parsing rules:

  • input is case-insensitive
  • ZSTD is accepted anywhere ZSTANDARD is accepted
  • canonical internal form is uppercase, with ZSTD normalized to ZSTANDARD
  • malformed values are rejected during Pinot config validation instead of being passed downstream

Examples of accepted values:

  • ZSTANDARD
  • ZSTD
  • ZSTANDARD(3)
  • ZSTD(3)
  • GZIP(6)
  • LZ4(12)

Supported level ranges:

  • ZSTANDARD / ZSTD: 1..22
  • GZIP: 0..9
  • LZ4: 1..17

Codecs that do not expose a meaningful level in Pinot continue to reject parameterized syntax. For example:

  • SNAPPY(3)
  • PASS_THROUGH(1)
  • MV_ENTRY_DICT(2)

Malformed values are also rejected. For example:

  • ZSTD()
  • ZSTD(foo)
  • UNKNOWN(3)

Sample configs

Legacy raw-index field config remains valid, now with parameterized syntax:

{
  "fieldConfigList": [
    {
      "name": "message",
      "encodingType": "RAW",
      "compressionCodec": "ZSTD(3)"
    },
    {
      "name": "payload",
      "encodingType": "RAW",
      "compressionCodec": "GZIP(6)"
    }
  ]
}

The newer forward-index config under indexes.forward also accepts the same compressionCodec syntax:

{
  "fieldConfigList": [
    {
      "name": "message",
      "indexes": {
        "forward": {
          "compressionCodec": "ZSTANDARD(3)",
          "deriveNumDocsPerChunk": true,
          "rawIndexWriterVersion": 10
        }
      }
    },
    {
      "name": "blob",
      "indexes": {
        "forward": {
          "compressionCodec": "LZ4(12)"
        }
      }
    }
  ]
}

Alias handling is equivalent, so these are treated the same:

{ "compressionCodec": "ZSTD" }
{ "compressionCodec": "ZSTANDARD" }
{ "compressionCodec": "ZSTD(3)" }
{ "compressionCodec": "ZSTANDARD(3)" }

Backward compatibility

This change is intentionally conservative.

Config compatibility:

  • existing configs using plain codec names continue to work unchanged
  • the config field name stays compressionCodec
  • existing enum-like values such as SNAPPY, PASS_THROUGH, ZSTANDARD, and LZ4 continue to deserialize and behave as before
  • ZSTD is newly accepted as an alias and normalized internally to ZSTANDARD

Write-path compatibility:

  • plain codec strings preserve existing behavior
  • plain LZ4 still uses Pinot's existing fast-compressor path
  • only explicit LZ4(level) switches to HC compression with the requested level
  • if no level is provided, Pinot keeps the prior library/default write behavior instead of introducing a new Pinot-defined default level

Read-path and segment compatibility:

  • no forward-index header/read-path format changes are introduced
  • readers continue to rely on codec identity only; compression level is still a write-time concern
  • existing segments remain readable unchanged
  • mixed segments written with different levels of the same codec are query-safe because decode does not require the level

Reload/rewrite behavior:

  • the canonical explicit spec string is persisted in segment metadata for forward indexes
  • this is used to detect level-only config changes during segment reload/rebuild
  • metadata persistence is for rewrite detection and observability, not for decode correctness

Validation

  • ./mvnw -pl pinot-segment-spi,pinot-segment-local -am -Dcheckstyle.skip=true -Dtest=ForwardIndexConfigTest,ForwardIndexTypeTest,IndexLoadingConfigTest,TestCompression,ForwardIndexHandlerTest -Dsurefire.failIfNoSpecifiedTests=false test
  • ./mvnw -pl pinot-spi,pinot-common -am -Dcheckstyle.skip=true -Dtest=CompressionCodecSpecTest,TableConfigSerDeUtilsTest -Dsurefire.failIfNoSpecifiedTests=false test
  • ./mvnw -pl pinot-spi,pinot-segment-spi,pinot-common,pinot-segment-local spotless:apply -DskipTests
  • ./mvnw -pl pinot-spi,pinot-segment-spi,pinot-common,pinot-segment-local checkstyle:check -DskipTests
  • ./mvnw -pl pinot-spi,pinot-segment-spi,pinot-common,pinot-segment-local license:format -DskipTests
  • ./mvnw -pl pinot-spi,pinot-segment-spi,pinot-common,pinot-segment-local license:check -DskipTests

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 3, 2026

Codecov Report

❌ Patch coverage is 83.12500% with 54 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.17%. Comparing base (85e2083) to head (3433bb5).

Files with missing lines Patch % Lines
...he/pinot/segment/spi/index/ForwardIndexConfig.java 56.81% 13 Missing and 6 partials ⚠️
...t/spi/config/table/CompressionCodecSpecParser.java 83.33% 3 Missing and 3 partials ⚠️
...t/local/io/compression/ChunkCompressorFactory.java 76.47% 2 Missing and 2 partials ⚠️
...e/pinot/spi/config/table/CompressionCodecSpec.java 80.95% 2 Missing and 2 partials ⚠️
...ocal/segment/index/loader/ForwardIndexHandler.java 92.68% 1 Missing and 2 partials ⚠️
.../pinot/segment/spi/utils/SegmentMetadataUtils.java 0.00% 3 Missing ⚠️
...ot/segment/local/io/compression/LZ4Compressor.java 66.66% 2 Missing ⚠️
...al/io/writer/impl/BaseChunkForwardIndexWriter.java 50.00% 2 Missing ⚠️
...r/impl/fwd/MultiValueFixedByteRawIndexCreator.java 83.33% 2 Missing ⚠️
.../impl/fwd/SingleValueFixedByteRawIndexCreator.java 71.42% 2 Missing ⚠️
... and 5 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18097      +/-   ##
============================================
+ Coverage     63.14%   63.17%   +0.03%     
- Complexity     1616     1635      +19     
============================================
  Files          3213     3216       +3     
  Lines        195730   195973     +243     
  Branches      30240    30281      +41     
============================================
+ Hits         123592   123814     +222     
- Misses        62261    62272      +11     
- Partials       9877     9887      +10     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.14% <83.12%> (+0.04%) ⬆️
java-21 63.15% <83.12%> (+0.04%) ⬆️
temurin 63.17% <83.12%> (+0.03%) ⬆️
unittests 63.17% <83.12%> (+0.03%) ⬆️
unittests1 55.40% <63.75%> (+0.02%) ⬆️
unittests2 34.81% <64.37%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@xiangfu0 xiangfu0 force-pushed the codex/parameterized-compression-codec branch from 39e9169 to 938028b Compare April 3, 2026 13:33
@xiangfu0 xiangfu0 requested review from Jackie-Jiang and Copilot April 3, 2026 20:59
@xiangfu0 xiangfu0 added configuration Config changes (addition/deletion/change in behavior) feature New functionality labels Apr 3, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a structured, canonicalized compressionCodec specification (including optional compression level) and threads it through forward-index creation and segment metadata so level-only changes can trigger rewrites on reload.

Changes:

  • Introduce CompressionCodecSpec + parser/capabilities to support CODEC and CODEC(level) syntax with alias canonicalization (e.g., ZSTDZSTANDARD).
  • Plumb optional compression levels into raw/forward-index writers and chunk compressor construction while preserving legacy behavior for non-parameterized codecs.
  • Persist explicit forward-index compression spec in segment metadata and enhance reload logic to detect level-only changes; add/extend unit and integration tests.

Reviewed changes

Copilot reviewed 38 out of 38 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pinot-spi/src/test/java/org/apache/pinot/spi/config/table/CompressionCodecSpecTest.java Adds parser/serde tests for canonicalization, level ranges, and invalid specs.
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/FieldConfig.java Switches compressionCodec to CompressionCodecSpec while retaining enum-based compatibility paths.
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/CompressionCodecSpecParser.java Implements parsing/validation for CODEC and CODEC(level) syntax.
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/CompressionCodecSpec.java Adds canonical spec model with Jackson (de)serialization via config string.
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/CompressionCodecCapabilities.java Centralizes alias canonicalization and per-codec supported level ranges.
pinot-segment-spi/src/test/java/org/apache/pinot/segment/spi/index/ForwardIndexConfigTest.java Adds coverage for parameterized compression codec parsing in forward-index config.
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/V1Constants.java Adds a metadata key for persisting forward-index compression spec.
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/ForwardIndexConfig.java Adds CompressionCodecSpec support, legacy interop, and equality/hash updates.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/utils/TableConfigUtilsTest.java Updates FieldConfig construction to match new compressionCodec type.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/loader/IndexLoadingConfigTest.java Verifies parameterized codec is propagated through index loading config.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/loader/ForwardIndexHandlerTest.java Tests rewrite detection and metadata persistence for level-only codec changes.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/forward/ForwardIndexTypeTest.java Ensures legacy field-config path supports parameterized codec specs.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/io/compression/TestCompression.java Adds compressor roundtrip tests with explicit levels and rejection for unsupported codecs.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/InvertedIndexAndDictionaryBasedForwardIndexCreator.java Persists forward-index compression spec into metadata for raw forward indexes.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/ForwardIndexHandler.java Uses persisted spec to detect level-only changes and updates metadata on rewrite.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/forward/ForwardIndexType.java Switches FieldConfig-to-ForwardIndexConfig mapping to use CompressionCodecSpec.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/forward/ForwardIndexCreatorFactory.java Threads optional compression level into raw index creator selection.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/fwd/SingleValueVarByteRawIndexCreator.java Adds compression-level plumbing into var-byte raw forward index writers.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/fwd/SingleValueFixedByteRawIndexCreator.java Adds compression-level plumbing into fixed-byte raw forward index writers.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/fwd/MultiValueVarByteRawIndexCreator.java Adds compression-level plumbing into MV var-byte raw forward index writers.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/fwd/MultiValueFixedByteRawIndexCreator.java Adds compression-level plumbing into MV fixed-byte raw forward index writers.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/BaseSegmentCreator.java Writes canonical explicit forward-index compression spec to segment metadata.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/writer/impl/VarByteChunkForwardIndexWriterV6.java Adds constructors to pass optional compression level to base writer.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/writer/impl/VarByteChunkForwardIndexWriterV5.java Adds constructors to pass optional compression level to base writer.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/writer/impl/VarByteChunkForwardIndexWriterV4.java Uses level-aware compressor factory when building chunk compressor.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/writer/impl/VarByteChunkForwardIndexWriter.java Adds compression-level support in base var-byte writer constructor path.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/writer/impl/FixedByteChunkForwardIndexWriter.java Adds compression-level support in fixed-byte writer constructor path.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/writer/impl/BaseChunkForwardIndexWriter.java Passes optional level to compressor factory for chunk writers.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/compression/ZstandardCompressor.java Adds configurable Zstd compression level.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/compression/LZ4WithLengthCompressor.java Adds configurable LZ4 HC level for length-prefixed LZ4 compressor.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/compression/LZ4Compressor.java Adds configurable LZ4 HC level and avoids recreating compressor on each call.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/compression/GzipCompressor.java Adds configurable Deflater compression level.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/compression/ChunkCompressorFactory.java Adds factory overload to construct compressors with optional explicit level.
pinot-integration-tests/src/test/java/org/apache/pinot/integration/tests/custom/TextIndicesTest.java Updates FieldConfig construction for new compression codec spec type.
pinot-core/src/test/java/org/apache/pinot/queries/BaseFSTBasedRegexpLikeQueriesTest.java Updates FieldConfig construction for new compression codec spec type.
pinot-core/src/test/java/org/apache/pinot/core/operator/transform/function/BaseTransformFunctionTest.java Updates FieldConfig construction for new compression codec spec type.
pinot-core/src/test/java/org/apache/pinot/core/data/manager/TableIndexingTest.java Updates FieldConfig construction for new compression codec spec type.
pinot-common/src/test/java/org/apache/pinot/common/utils/config/TableConfigSerDeUtilsTest.java Adds serde coverage for parameterized codec spec in table config.

@xiangfu0 xiangfu0 force-pushed the codex/parameterized-compression-codec branch 4 times, most recently from d522a5e to c1aeec4 Compare April 9, 2026 11:53
Copy link
Copy Markdown
Contributor Author

@xiangfu0 xiangfu0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backward Compatibility & Segment Format Analysis

I've reviewed this PR for critical backward compatibility concerns with segment format changes. Here are my findings:

CRITICAL: Segment Rewrite Detection Logic Gap

Existing segments written before this PR will lack the new FORWARD_INDEX_COMPRESSION_CODEC metadata key. If a user later modifies only the compression level (e.g., changing from ZSTD to ZSTD(3)), the ForwardIndexHandler rewrite detection will fail to trigger because:

  1. Old segments have no persisted codec spec in metadata
  2. The rewrite logic compares current config against persisted metadata spec
  3. Missing metadata = no comparison possible = no rewrite triggered
  4. Result: Silent configuration skew - user expects rewrite, segments remain at old level

Recommendation: ForwardIndexHandler should explicitly handle missing codec metadata as "unknown level" state and trigger rewrites when config specifies explicit levels for old segments.

SAFE: Segment Format Read Compatibility

✓ New metadata key is additive (V1Constants.Column.FORWARD_INDEX_COMPRESSION_CODEC)
✓ Old code reading new segments: unaffected (skips unknown metadata)
✓ New code reading old segments: unaffected (spec is optional, defaults apply)
✓ No changes to forward-index header or chunk encoding format

SAFE: Configuration Serialization

✓ CompressionCodecSpec uses Jackson @JsonValue/@JsonCreator for stable string format
✓ Format is human-readable: CODEC or CODEC(level)
✓ Parser is strict (rejects malformed input early in config validation)
✓ Alias canonicalization (ZSTD → ZSTANDARD) handled consistently

SAFE: Legacy Configuration Compatibility

✓ Plain codec names (ZSTANDARD, GZIP, LZ4) preserve prior behavior
✓ Explicit levels default to library defaults when omitted
✓ FieldConfig compatibility constructors maintain config deserialization paths
✓ Comprehensive test coverage for serde paths

@xiangfu0 xiangfu0 force-pushed the codex/parameterized-compression-codec branch from c1aeec4 to 3433bb5 Compare April 10, 2026 09:47
@xiangfu0
Copy link
Copy Markdown
Contributor Author

Addressed the legacy-segment rewrite gap called out in review.

Changes in this push:

  • rebased the branch onto current master
  • made ForwardIndexHandler treat missing forwardIndexCompressionCodec metadata as an unknown legacy state and force a rewrite when the table config now specifies an explicit compression level
  • tightened ForwardIndexHandlerTest.testChangeCompressionLevelForSingleColumn() so it first asserts the old segment has no persisted codec-spec metadata, then verifies the rewrite persists LZ4(12)

Focused validation run:

  • ./mvnw -pl pinot-segment-local -am -Dcheckstyle.skip=true -Dtest=ForwardIndexHandlerTest -Dsurefire.failIfNoSpecifiedTests=false test
  • spotless, checkstyle, and license checks on pinot-segment-local

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

configuration Config changes (addition/deletion/change in behavior) feature New functionality

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants