feat(puffin): add puffin file reader and writer#624
Conversation
zhaoxuan1994
commented
Apr 23, 2026
- PuffinWriter: in-memory writer that builds complete Puffin files
- Add() writes blobs with optional compression
- Finish() serializes footer with JSON metadata
- Tracks BlobMetadata for all written blobs
- PuffinReader: in-memory reader that parses Puffin files
- ReadFileMetadata() parses footer and validates magic bytes
- ReadBlob() reads and decompresses individual blobs
- ReadAll() reads all blobs from metadata
- Expose Compress/Decompress as public API in puffin_format.h
- Register new sources in CMake and Meson build systems
- Add comprehensive tests
There was a problem hiding this comment.
Pull request overview
Adds an in-memory Puffin file writer/reader to the Iceberg C++ Puffin module, exposes compression helpers as public API, wires new sources into the build, and introduces round-trip + Java-compatibility tests.
Changes:
- Introduce
PuffinWriterfor building Puffin files in-memory (blob writing + footer serialization). - Introduce
PuffinReaderfor parsing Puffin files in-memory (footer parsing + blob reading). - Export
Compress/DecompressAPIs and register new sources/tests in CMake + Meson.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/iceberg/puffin/puffin_writer.h | Declares the in-memory Puffin writer API. |
| src/iceberg/puffin/puffin_writer.cc | Implements Puffin file serialization (header, blobs, footer). |
| src/iceberg/puffin/puffin_reader.h | Declares the in-memory Puffin reader API. |
| src/iceberg/puffin/puffin_reader.cc | Implements footer parsing, magic validation, and blob reads. |
| src/iceberg/puffin/puffin_format.h | Exposes Compress/Decompress as public APIs. |
| src/iceberg/puffin/puffin_format.cc | Moves flag helpers and keeps compression stubs in the public namespace. |
| src/iceberg/puffin/meson.build | Installs new public headers for Meson builds. |
| src/iceberg/meson.build | Adds new Puffin reader/writer sources to the core library build. |
| src/iceberg/CMakeLists.txt | Adds new Puffin reader/writer sources to the CMake build. |
| src/iceberg/test/puffin_reader_writer_test.cc | Adds comprehensive writer/reader round-trip and Java-compat tests. |
| src/iceberg/test/meson.build | Adds the new Puffin reader/writer test to Meson test target. |
| src/iceberg/test/CMakeLists.txt | Adds the new Puffin reader/writer test to CMake test target. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| auto payload_size = ReadLittleEndian<int32_t>( | ||
| data_.data() + footer_struct_offset + PuffinFormat::kFooterStructPayloadSizeOffset); | ||
|
|
||
| // Calculate total footer size and validate | ||
| int64_t footer_size = PuffinFormat::kFooterStartMagicLength + | ||
| static_cast<int64_t>(payload_size) + | ||
| PuffinFormat::kFooterStructLength; | ||
| auto footer_offset = file_size - footer_size; | ||
| if (footer_offset < 0) { | ||
| return Invalid("Invalid file: footer size {} exceeds file size {}", footer_size, | ||
| file_size); | ||
| } | ||
|
|
||
| // Validate footer start magic | ||
| ICEBERG_RETURN_UNEXPECTED(CheckMagic(data_, footer_offset)); | ||
|
|
||
| // Check flags for footer compression | ||
| std::array<uint8_t, 4> flags{}; | ||
| std::memcpy( | ||
| flags.data(), | ||
| data_.data() + footer_struct_offset + PuffinFormat::kFooterStructFlagsOffset, 4); | ||
|
|
||
| PuffinCompressionCodec footer_compression = PuffinCompressionCodec::kNone; | ||
| if (IsFlagSet(flags, PuffinFlag::kFooterPayloadCompressed)) { | ||
| footer_compression = PuffinFormat::kDefaultFooterCompressionCodec; | ||
| } | ||
|
|
||
| // Extract footer payload | ||
| auto payload_offset = footer_offset + PuffinFormat::kFooterStartMagicLength; | ||
| std::span<const std::byte> payload_span(data_.data() + payload_offset, payload_size); | ||
| ICEBERG_ASSIGN_OR_RAISE(auto payload_bytes, | ||
| Decompress(footer_compression, payload_span)); |
There was a problem hiding this comment.
payload_size is already validated above:
if (payload_size < 0) {
return Invalid("Invalid file: negative payload size {}", payload_size);
}
| if (blob_metadata.offset < 0 || | ||
| blob_metadata.offset + blob_metadata.length > file_size) { | ||
| return Invalid("Invalid blob: offset {} + length {} exceeds file size {}", | ||
| blob_metadata.offset, blob_metadata.length, file_size); | ||
| } | ||
|
|
||
| std::span<const std::byte> raw_data(data_.data() + blob_metadata.offset, | ||
| blob_metadata.length); |
1dd5373 to
f8c15ee
Compare
| std::unreachable(); | ||
| } | ||
|
|
||
| } // namespace |
There was a problem hiding this comment.
is there any reason to move both Result<std::vector<std::byte>> Compress and Result<std::vector<std::byte>> Decompress out of the namespace?
kamcheungting-db
left a comment
There was a problem hiding this comment.
Some minor comments.
Nice code
| auto payload_size = ReadLittleEndian<int32_t>( | ||
| data_.data() + footer_struct_offset + PuffinFormat::kFooterStructPayloadSizeOffset); | ||
|
|
||
| // Calculate total footer size and validate | ||
| int64_t footer_size = PuffinFormat::kFooterStartMagicLength + | ||
| static_cast<int64_t>(payload_size) + | ||
| PuffinFormat::kFooterStructLength; | ||
| auto footer_offset = file_size - footer_size; | ||
| if (footer_offset < 0) { | ||
| return Invalid("Invalid file: footer size {} exceeds file size {}", footer_size, | ||
| file_size); | ||
| } | ||
|
|
||
| // Validate footer start magic | ||
| ICEBERG_RETURN_UNEXPECTED(CheckMagic(data_, footer_offset)); | ||
|
|
||
| // Check flags for footer compression | ||
| std::array<uint8_t, 4> flags{}; | ||
| std::memcpy( | ||
| flags.data(), | ||
| data_.data() + footer_struct_offset + PuffinFormat::kFooterStructFlagsOffset, 4); | ||
|
|
||
| PuffinCompressionCodec footer_compression = PuffinCompressionCodec::kNone; | ||
| if (IsFlagSet(flags, PuffinFlag::kFooterPayloadCompressed)) { | ||
| footer_compression = PuffinFormat::kDefaultFooterCompressionCodec; | ||
| } | ||
|
|
||
| // Extract footer payload | ||
| auto payload_offset = footer_offset + PuffinFormat::kFooterStartMagicLength; | ||
| std::span<const std::byte> payload_span(data_.data() + payload_offset, payload_size); | ||
| ICEBERG_ASSIGN_OR_RAISE(auto payload_bytes, | ||
| Decompress(footer_compression, payload_span)); |
There was a problem hiding this comment.
payload_size is already validated above:
if (payload_size < 0) {
return Invalid("Invalid file: negative payload size {}", payload_size);
}
|
|
||
| // Extract footer payload | ||
| auto payload_offset = footer_offset + PuffinFormat::kFooterStartMagicLength; | ||
| std::span<const std::byte> payload_span(data_.data() + payload_offset, payload_size); |
There was a problem hiding this comment.
| std::span<const std::byte> payload_span(data_.data() + payload_offset, payload_size); | |
| const std::span<const std::byte> payload_span(data_.data() + payload_offset, payload_size); |
There was a problem hiding this comment.
Yeah I thought about this — span is basically a pointer+size pair like string_view, so const on the member itself doesn't buy us much. The span is set once in the constructor and never reassigned. I'll leave it as-is to stay consistent with how the rest of the project uses span/string_view members.
| : default_codec_(default_codec) {} | ||
|
|
||
| void PuffinWriter::WriteHeader() { | ||
| if (header_written_) return; |
There was a problem hiding this comment.
shouldn't we error out here?
It looks like dangerous to call WriteHeader() multiple time.
There was a problem hiding this comment.
| if (header_written_) return; | |
| if (header_written_) { | |
| return Invalid("Header has already been written."); | |
| } |
| buffer_.insert(buffer_.end(), reinterpret_cast<const std::byte*>(magic.data()), | ||
| reinterpret_cast<const std::byte*>(magic.data() + magic.size())); |
There was a problem hiding this comment.
| buffer_.insert(buffer_.end(), reinterpret_cast<const std::byte*>(magic.data()), | |
| reinterpret_cast<const std::byte*>(magic.data() + magic.size())); | |
| buffer_.insert(buffer_.end(), | |
| reinterpret_cast<const std::byte*>(magic.data()), | |
| reinterpret_cast<const std::byte*>(magic.data() + magic.size())); |
| /// \param properties File-level properties to include in the footer. | ||
| /// \return The complete Puffin file as a byte vector, or an error. | ||
| Result<std::vector<std::byte>> Finish( | ||
| std::unordered_map<std::string, std::string> properties = {}); |
There was a problem hiding this comment.
do we really need to copy the whole std::unordered_map<std::string, std::string>
There was a problem hiding this comment.
Why do not set the properties while creating the writer?
There was a problem hiding this comment.
BTW, it seems that we need to extend our FileIO abstraction to support puffin I/O. It looks inefficient to write all blobs into a buffer like this.
wgtmac
left a comment
There was a problem hiding this comment.
Sorry for the long delay. I have some concerns about its API design which would be inefficient in the real world use case. I believe the main issue arises from the lack of streaming support of FileIO. I will put up a PR to address this.
| /// | ||
| /// Parses a Puffin file from an in-memory buffer. Usage: | ||
| /// PuffinReader reader(file_data); | ||
| /// auto metadata = reader.ReadFileMetadata(); |
There was a problem hiding this comment.
This might not be the recommended approach unless error handling has been added. Otherwise users may silently ignore the error and call metadata.value() unconsciously.
| data_.data() + footer_struct_offset + PuffinFormat::kFooterStructFlagsOffset, 4); | ||
|
|
||
| PuffinCompressionCodec footer_compression = PuffinCompressionCodec::kNone; | ||
| if (IsFlagSet(flags, PuffinFlag::kFooterPayloadCompressed)) { |
There was a problem hiding this comment.
Please validate that all unknown/reserved footer flag bits are unset. Java's PuffinReader rejects unknown flags, and the spec says reserved bits should be 0; silently ignoring them may cause C++ to accept future or invalid files and interpret the footer incorrectly.
|
|
||
| /// \brief Writer for Puffin files. | ||
| /// | ||
| /// Builds a complete Puffin file in memory. Usage: |
There was a problem hiding this comment.
Same as my comment in the reader, I think we should remove comment like this.
|
|
||
| /// \brief Add a blob to be written. | ||
| /// \return The BlobMetadata for the written blob, or an error. | ||
| Result<BlobMetadata> Add(const Blob& blob); |
There was a problem hiding this comment.
Do we really need to return a copy of BlobMetadata here? IMO, users usually do not care about it and it pays to do so.
| /// \param properties File-level properties to include in the footer. | ||
| /// \return The complete Puffin file as a byte vector, or an error. | ||
| Result<std::vector<std::byte>> Finish( | ||
| std::unordered_map<std::string, std::string> properties = {}); |
There was a problem hiding this comment.
Why do not set the properties while creating the writer?
| /// \param properties File-level properties to include in the footer. | ||
| /// \return The complete Puffin file as a byte vector, or an error. | ||
| Result<std::vector<std::byte>> Finish( | ||
| std::unordered_map<std::string, std::string> properties = {}); |
There was a problem hiding this comment.
BTW, it seems that we need to extend our FileIO abstraction to support puffin I/O. It looks inefficient to write all blobs into a buffer like this.
|
I've merged #641 so we can rebase on it to design better API for puffin. |
3c267c2 to
7e0d056
Compare
| Result<FileMetadata> PuffinReader::ReadFileMetadata() { | ||
| // Get file size | ||
| if (IsFileMode()) { | ||
| ICEBERG_ASSIGN_OR_RAISE(file_size_, input_file_->Size()); |
There was a problem hiding this comment.
Please fail on unknown footer flags here, matching Java. Ignoring extra bits can make newer or corrupted Puffin files look valid.
| /// \brief Reader for Puffin files. | ||
| /// | ||
| /// Supports two modes: | ||
| /// - Memory mode: parses from an in-memory buffer. |
There was a problem hiding this comment.
Can we avoid making the public Puffin API in-memory only? Java and the C++ IO layer are file/stream based, which fits object stores and range reads better.
wgtmac
left a comment
There was a problem hiding this comment.
Thanks for incorporating the new FileIO abstraction! My main concern is the API inconsistency compared to Java. Let me know what you think.
| explicit PuffinReader(std::span<const std::byte> data); | ||
|
|
||
| /// \brief Construct a file-mode reader from an InputFile. | ||
| explicit PuffinReader(std::unique_ptr<InputFile> input_file); |
There was a problem hiding this comment.
We cannot throw exception in practice, so we may need a static function like Result<std::unique_ptr<PuffinReader>> Make(std::unique_ptr<InputFile> input_file) to handle any error, like null input. Same for writer.
There was a problem hiding this comment.
BTW, do we want to provide footerSize as the hint? This saves a seek btw.
bf68fb2 to
8f3720c
Compare
| #include "iceberg/result.h" | ||
|
|
||
| namespace iceberg { | ||
| class OutputFile; |
There was a problem hiding this comment.
nit: we should put these forward declarations in type_fwd.h and include it here.
| if (!output_file) { | ||
| return InvalidArgument("output_file must not be null"); | ||
| } | ||
| ICEBERG_ASSIGN_OR_RAISE(auto stream, output_file->CreateOrOverwrite()); |
There was a problem hiding this comment.
Java PuffinWriter opens the output with OutputFile.create(), which fails when the target already exists. This path uses CreateOrOverwrite(), so a caller can silently truncate an existing Puffin file instead of getting the Java-aligned failure.
| if (!input_file) { | ||
| return InvalidArgument("input_file must not be null"); | ||
| } | ||
| ICEBERG_ASSIGN_OR_RAISE(auto file_size, input_file->Size()); |
There was a problem hiding this comment.
The reader accepts a known footer_size, but Make still always calls InputFile::Size() and ReadFileMetadata recomputes the footer size from the file tail. Java's ReadBuilder supports both withFileSize and withFooterSize and uses/validates the hints, so this API is not strictly aligned and loses the intended object-store optimization.
| // Footer end magic | ||
| ICEBERG_RETURN_UNEXPECTED(WriteMagic()); | ||
|
|
||
| ICEBERG_ASSIGN_OR_RAISE(auto end_pos, stream_->Position()); |
There was a problem hiding this comment.
Finish records file_size/footer_size and sets finished_ before Flush/Close returns. Java only marks the writer finished and records storedLength after close, because some streams emit final bytes on close. If Close fails or finalizes extra bytes, this writer reports a finished state and wrong size.
| PuffinCompressionCodec default_codec = PuffinCompressionCodec::kNone, | ||
| bool compress_footer = false); | ||
|
|
||
| ~PuffinWriter(); |
There was a problem hiding this comment.
Java PuffinWriter implements close(), and close() implicitly finishes an unfinished file. This API exposes only Finish() and the destructor is a no-op, so ported code has no Java-equivalent close path and can leave an invalid Puffin file if Finish() is not called explicitly.
| std::unique_ptr<InputFile> input_file, | ||
| std::optional<int64_t> footer_size = std::nullopt); | ||
|
|
||
| ~PuffinReader(); |
There was a problem hiding this comment.
Java PuffinReader.close() closes the input stream. This reader owns a SeekableInputStream whose interface has an explicit Close() contract, but PuffinReader exposes no Close() and its destructor is defaulted, so readers can leak or delay releasing file handles/object-store resources.
- PuffinWriter: in-memory writer that builds complete Puffin files - Add() writes blobs with optional compression - Finish() serializes footer with JSON metadata - Tracks BlobMetadata for all written blobs - PuffinReader: in-memory reader that parses Puffin files - ReadFileMetadata() parses footer and validates magic bytes - ReadBlob() reads and decompresses individual blobs - ReadAll() reads all blobs from metadata - Expose Compress/Decompress as public API in puffin_format.h - Register new sources in CMake and Meson build systems - Add comprehensive tests including Java binary compatibility
|
I'll directly push some changes to resolve my comments from previous round of review to save some time. |
8f3720c to
e936d49
Compare
|
Thank you for working on this, @zhaoxuan1994! |