GH-45946: [C++][Parquet] Variant decoding#50121
Open
qzyu999 wants to merge 1 commit into
Open
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Rationale for this change
This is part of the GH-45937 umbrella (Add variant support to C++ Parquet). The Variant Encoding Spec defines a binary format for semi-structured data in Parquet. This PR adds the decoding (reading) side, which is a prerequisite for the encoder (GH-45947) and shredding support (GH-45948).
The implementation targets feature parity with the existing arrow-go
parquet/variantpackage, adapted to idiomatic C++ patterns. Divergences from Go are deliberate and documented in code comments.What changes are included in this PR?
Full Variant binary decoding per the Variant Encoding Spec. Adds
variant_internal.h/.ccwith:Decoder (visitor/SAX-style traversal):
DecodeMetadata()— parses the string dictionary from raw bytesDecodeVariantValue()— recursive traversal invoking aVariantVisitorfor each valuekMaxNestingDepth = 128) — security hardening for C++ stack semantics (Go doesn't need this due to goroutine stack growth)Random-access utilities (for future Parquet reader integration):
ValueSize()— compute byte size of a value without full decodeFindObjectField()— lookup by field name (binary search for ≥32 fields, linear for small objects)GetArrayElement()— O(1) element access by indexGetObjectFieldAt()— positional field accessFindMetadataKey()— dictionary ID lookup (binary search if sorted)Design choices (deliberate divergences from Go documented in code):
TypeVisitor,ArrayVisitorprecedent)FindObjectFieldbinary search uses signedint32_tforlo/hito avoid an unsigned underflow pattern present in Go'sObjectValue.ValueByKey()(separate bug report TBD)Bug discovered in arrow-go during development:
valueSize()inparquet/variant/utils.goreads the wrong bit for arrayis_large— uses(typeInfo >> 4) & 0x1(object's is_large position) instead of(typeInfo >> 2) & 0x1. Fix submitted as [Go][Parquet] valuesize() uses incorrect bit position for basicarray large flag arrow-go#839.ObjectValue.ValueByKey()binary search usesj = mid - 1wherejisuint32— wraps toMaxUint32whenmid == 0, skipping elements. Reported as [Go][Parquet] ObjectValue.ValueByKey binary search skips elements at mid - 1 arrow-go#842.Are these changes tested?
Yes. 165 tests pass with
BUILD_WARNING_LEVEL=CHECKIN(warnings-as-errors):Are there any user-facing changes?
No breaking changes. This adds new public API (
arrow::extension::variantnamespace) that did not previously exist. The headervariant_internal.his installed — "internal" in the filename refers to "binary encoding internals" not visibility.AI Disclosure: AI coding assistants were used during development for scaffolding, test generation, and review iteration. All code has been reviewed, debugged, and verified by the author who owns and understands the changes.