Skip to content

feat(puffin): add puffin file reader and writer#624

Merged
wgtmac merged 2 commits into
apache:mainfrom
zhaoxuan1994:feat/puffin-format-utils
May 28, 2026
Merged

feat(puffin): add puffin file reader and writer#624
wgtmac merged 2 commits into
apache:mainfrom
zhaoxuan1994:feat/puffin-format-utils

Conversation

@zhaoxuan1994
Copy link
Copy Markdown
Contributor

  • PuffinWriter: in-memory writer that builds complete Puffin files
    • Add() writes blobs with optional compression
    • Finish() serializes footer with JSON metadata
    • Tracks BlobMetadata for all written blobs
  • PuffinReader: in-memory reader that parses Puffin files
    • ReadFileMetadata() parses footer and validates magic bytes
    • ReadBlob() reads and decompresses individual blobs
    • ReadAll() reads all blobs from metadata
  • Expose Compress/Decompress as public API in puffin_format.h
  • Register new sources in CMake and Meson build systems
  • Add comprehensive tests

Copilot AI review requested due to automatic review settings April 23, 2026 13:01
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an in-memory Puffin file writer/reader to the Iceberg C++ Puffin module, exposes compression helpers as public API, wires new sources into the build, and introduces round-trip + Java-compatibility tests.

Changes:

  • Introduce PuffinWriter for building Puffin files in-memory (blob writing + footer serialization).
  • Introduce PuffinReader for parsing Puffin files in-memory (footer parsing + blob reading).
  • Export Compress/Decompress APIs and register new sources/tests in CMake + Meson.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/iceberg/puffin/puffin_writer.h Declares the in-memory Puffin writer API.
src/iceberg/puffin/puffin_writer.cc Implements Puffin file serialization (header, blobs, footer).
src/iceberg/puffin/puffin_reader.h Declares the in-memory Puffin reader API.
src/iceberg/puffin/puffin_reader.cc Implements footer parsing, magic validation, and blob reads.
src/iceberg/puffin/puffin_format.h Exposes Compress/Decompress as public APIs.
src/iceberg/puffin/puffin_format.cc Moves flag helpers and keeps compression stubs in the public namespace.
src/iceberg/puffin/meson.build Installs new public headers for Meson builds.
src/iceberg/meson.build Adds new Puffin reader/writer sources to the core library build.
src/iceberg/CMakeLists.txt Adds new Puffin reader/writer sources to the CMake build.
src/iceberg/test/puffin_reader_writer_test.cc Adds comprehensive writer/reader round-trip and Java-compat tests.
src/iceberg/test/meson.build Adds the new Puffin reader/writer test to Meson test target.
src/iceberg/test/CMakeLists.txt Adds the new Puffin reader/writer test to CMake test target.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/iceberg/puffin/puffin_reader.h Outdated
Comment thread src/iceberg/puffin/puffin_reader.cc Outdated
Comment on lines +69 to +100
auto payload_size = ReadLittleEndian<int32_t>(
data_.data() + footer_struct_offset + PuffinFormat::kFooterStructPayloadSizeOffset);

// Calculate total footer size and validate
int64_t footer_size = PuffinFormat::kFooterStartMagicLength +
static_cast<int64_t>(payload_size) +
PuffinFormat::kFooterStructLength;
auto footer_offset = file_size - footer_size;
if (footer_offset < 0) {
return Invalid("Invalid file: footer size {} exceeds file size {}", footer_size,
file_size);
}

// Validate footer start magic
ICEBERG_RETURN_UNEXPECTED(CheckMagic(data_, footer_offset));

// Check flags for footer compression
std::array<uint8_t, 4> flags{};
std::memcpy(
flags.data(),
data_.data() + footer_struct_offset + PuffinFormat::kFooterStructFlagsOffset, 4);

PuffinCompressionCodec footer_compression = PuffinCompressionCodec::kNone;
if (IsFlagSet(flags, PuffinFlag::kFooterPayloadCompressed)) {
footer_compression = PuffinFormat::kDefaultFooterCompressionCodec;
}

// Extract footer payload
auto payload_offset = footer_offset + PuffinFormat::kFooterStartMagicLength;
std::span<const std::byte> payload_span(data_.data() + payload_offset, payload_size);
ICEBERG_ASSIGN_OR_RAISE(auto payload_bytes,
Decompress(footer_compression, payload_span));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

payload_size is already validated above:

if (payload_size < 0) {
  return Invalid("Invalid file: negative payload size {}", payload_size);
}

Comment thread src/iceberg/puffin/puffin_reader.cc Outdated
Comment on lines +117 to +124
if (blob_metadata.offset < 0 ||
blob_metadata.offset + blob_metadata.length > file_size) {
return Invalid("Invalid blob: offset {} + length {} exceeds file size {}",
blob_metadata.offset, blob_metadata.length, file_size);
}

std::span<const std::byte> raw_data(data_.data() + blob_metadata.offset,
blob_metadata.length);
@zhaoxuan1994 zhaoxuan1994 force-pushed the feat/puffin-format-utils branch from 1dd5373 to f8c15ee Compare April 24, 2026 02:13
std::unreachable();
}

} // namespace
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any reason to move both Result<std::vector<std::byte>> Compress and Result<std::vector<std::byte>> Decompress out of the namespace?

Copy link
Copy Markdown

@kamcheungting-db kamcheungting-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments.
Nice code

Comment thread src/iceberg/puffin/puffin_reader.cc Outdated
Comment thread src/iceberg/puffin/puffin_reader.cc Outdated
Comment thread src/iceberg/puffin/puffin_reader.cc Outdated
Comment thread src/iceberg/puffin/puffin_reader.cc Outdated
Comment on lines +69 to +100
auto payload_size = ReadLittleEndian<int32_t>(
data_.data() + footer_struct_offset + PuffinFormat::kFooterStructPayloadSizeOffset);

// Calculate total footer size and validate
int64_t footer_size = PuffinFormat::kFooterStartMagicLength +
static_cast<int64_t>(payload_size) +
PuffinFormat::kFooterStructLength;
auto footer_offset = file_size - footer_size;
if (footer_offset < 0) {
return Invalid("Invalid file: footer size {} exceeds file size {}", footer_size,
file_size);
}

// Validate footer start magic
ICEBERG_RETURN_UNEXPECTED(CheckMagic(data_, footer_offset));

// Check flags for footer compression
std::array<uint8_t, 4> flags{};
std::memcpy(
flags.data(),
data_.data() + footer_struct_offset + PuffinFormat::kFooterStructFlagsOffset, 4);

PuffinCompressionCodec footer_compression = PuffinCompressionCodec::kNone;
if (IsFlagSet(flags, PuffinFlag::kFooterPayloadCompressed)) {
footer_compression = PuffinFormat::kDefaultFooterCompressionCodec;
}

// Extract footer payload
auto payload_offset = footer_offset + PuffinFormat::kFooterStartMagicLength;
std::span<const std::byte> payload_span(data_.data() + payload_offset, payload_size);
ICEBERG_ASSIGN_OR_RAISE(auto payload_bytes,
Decompress(footer_compression, payload_span));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

payload_size is already validated above:

if (payload_size < 0) {
  return Invalid("Invalid file: negative payload size {}", payload_size);
}

Comment thread src/iceberg/puffin/puffin_reader.cc Outdated

// Extract footer payload
auto payload_offset = footer_offset + PuffinFormat::kFooterStartMagicLength;
std::span<const std::byte> payload_span(data_.data() + payload_offset, payload_size);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::span<const std::byte> payload_span(data_.data() + payload_offset, payload_size);
const std::span<const std::byte> payload_span(data_.data() + payload_offset, payload_size);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I thought about this — span is basically a pointer+size pair like string_view, so const on the member itself doesn't buy us much. The span is set once in the constructor and never reassigned. I'll leave it as-is to stay consistent with how the rest of the project uses span/string_view members.

Comment thread src/iceberg/puffin/puffin_writer.h Outdated
Comment thread src/iceberg/puffin/puffin_writer.cc Outdated
: default_codec_(default_codec) {}

void PuffinWriter::WriteHeader() {
if (header_written_) return;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we error out here?
It looks like dangerous to call WriteHeader() multiple time.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (header_written_) return;
if (header_written_) {
return Invalid("Header has already been written.");
}

Comment thread src/iceberg/puffin/puffin_writer.cc Outdated
Comment thread src/iceberg/puffin/puffin_writer.cc Outdated
Comment on lines +110 to +111
buffer_.insert(buffer_.end(), reinterpret_cast<const std::byte*>(magic.data()),
reinterpret_cast<const std::byte*>(magic.data() + magic.size()));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
buffer_.insert(buffer_.end(), reinterpret_cast<const std::byte*>(magic.data()),
reinterpret_cast<const std::byte*>(magic.data() + magic.size()));
buffer_.insert(buffer_.end(),
reinterpret_cast<const std::byte*>(magic.data()),
reinterpret_cast<const std::byte*>(magic.data() + magic.size()));

Comment thread src/iceberg/puffin/puffin_writer.h Outdated
/// \param properties File-level properties to include in the footer.
/// \return The complete Puffin file as a byte vector, or an error.
Result<std::vector<std::byte>> Finish(
std::unordered_map<std::string, std::string> properties = {});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need to copy the whole std::unordered_map<std::string, std::string>

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do not set the properties while creating the writer?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, it seems that we need to extend our FileIO abstraction to support puffin I/O. It looks inefficient to write all blobs into a buffer like this.

Copy link
Copy Markdown
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the long delay. I have some concerns about its API design which would be inefficient in the real world use case. I believe the main issue arises from the lack of streaming support of FileIO. I will put up a PR to address this.

Comment thread src/iceberg/puffin/puffin_reader.h Outdated
///
/// Parses a Puffin file from an in-memory buffer. Usage:
/// PuffinReader reader(file_data);
/// auto metadata = reader.ReadFileMetadata();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might not be the recommended approach unless error handling has been added. Otherwise users may silently ignore the error and call metadata.value() unconsciously.

Comment thread src/iceberg/puffin/puffin_reader.h Outdated
Comment thread src/iceberg/puffin/puffin_reader.cc Outdated
data_.data() + footer_struct_offset + PuffinFormat::kFooterStructFlagsOffset, 4);

PuffinCompressionCodec footer_compression = PuffinCompressionCodec::kNone;
if (IsFlagSet(flags, PuffinFlag::kFooterPayloadCompressed)) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please validate that all unknown/reserved footer flag bits are unset. Java's PuffinReader rejects unknown flags, and the spec says reserved bits should be 0; silently ignoring them may cause C++ to accept future or invalid files and interpret the footer incorrectly.

Comment thread src/iceberg/puffin/puffin_writer.h Outdated

/// \brief Writer for Puffin files.
///
/// Builds a complete Puffin file in memory. Usage:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as my comment in the reader, I think we should remove comment like this.

Comment thread src/iceberg/puffin/puffin_writer.h Outdated

/// \brief Add a blob to be written.
/// \return The BlobMetadata for the written blob, or an error.
Result<BlobMetadata> Add(const Blob& blob);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to return a copy of BlobMetadata here? IMO, users usually do not care about it and it pays to do so.

Comment thread src/iceberg/puffin/puffin_writer.h Outdated
Comment thread src/iceberg/puffin/puffin_writer.h Outdated
/// \param properties File-level properties to include in the footer.
/// \return The complete Puffin file as a byte vector, or an error.
Result<std::vector<std::byte>> Finish(
std::unordered_map<std::string, std::string> properties = {});
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do not set the properties while creating the writer?

Comment thread src/iceberg/puffin/puffin_writer.h Outdated
/// \param properties File-level properties to include in the footer.
/// \return The complete Puffin file as a byte vector, or an error.
Result<std::vector<std::byte>> Finish(
std::unordered_map<std::string, std::string> properties = {});
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, it seems that we need to extend our FileIO abstraction to support puffin I/O. It looks inefficient to write all blobs into a buffer like this.

@wgtmac
Copy link
Copy Markdown
Member

wgtmac commented May 8, 2026

I've merged #641 so we can rebase on it to design better API for puffin.

Comment thread src/iceberg/puffin/puffin_reader.cc Outdated
Result<FileMetadata> PuffinReader::ReadFileMetadata() {
// Get file size
if (IsFileMode()) {
ICEBERG_ASSIGN_OR_RAISE(file_size_, input_file_->Size());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fail on unknown footer flags here, matching Java. Ignoring extra bits can make newer or corrupted Puffin files look valid.

Comment thread src/iceberg/puffin/puffin_reader.h Outdated
/// \brief Reader for Puffin files.
///
/// Supports two modes:
/// - Memory mode: parses from an in-memory buffer.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid making the public Puffin API in-memory only? Java and the C++ IO layer are file/stream based, which fits object stores and range reads better.

Copy link
Copy Markdown
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for incorporating the new FileIO abstraction! My main concern is the API inconsistency compared to Java. Let me know what you think.

Comment thread src/iceberg/puffin/puffin_reader.h Outdated
Comment thread src/iceberg/puffin/puffin_writer.h Outdated
Comment thread src/iceberg/puffin/puffin_writer.h Outdated
Comment thread src/iceberg/puffin/puffin_writer.h Outdated
Comment thread src/iceberg/puffin/puffin_writer.h Outdated
Comment thread src/iceberg/puffin/puffin_writer.h Outdated
Comment thread src/iceberg/puffin/puffin_reader.h Outdated
explicit PuffinReader(std::span<const std::byte> data);

/// \brief Construct a file-mode reader from an InputFile.
explicit PuffinReader(std::unique_ptr<InputFile> input_file);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot throw exception in practice, so we may need a static function like Result<std::unique_ptr<PuffinReader>> Make(std::unique_ptr<InputFile> input_file) to handle any error, like null input. Same for writer.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, do we want to provide footerSize as the hint? This saves a seek btw.

@zhaoxuan1994 zhaoxuan1994 force-pushed the feat/puffin-format-utils branch 2 times, most recently from bf68fb2 to 8f3720c Compare May 25, 2026 11:19
Comment thread src/iceberg/puffin/puffin_writer.h Outdated
#include "iceberg/result.h"

namespace iceberg {
class OutputFile;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we should put these forward declarations in type_fwd.h and include it here.

Comment thread src/iceberg/puffin/puffin_writer.cc Outdated
if (!output_file) {
return InvalidArgument("output_file must not be null");
}
ICEBERG_ASSIGN_OR_RAISE(auto stream, output_file->CreateOrOverwrite());
Copy link
Copy Markdown
Member

@wgtmac wgtmac May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java PuffinWriter opens the output with OutputFile.create(), which fails when the target already exists. This path uses CreateOrOverwrite(), so a caller can silently truncate an existing Puffin file instead of getting the Java-aligned failure.

Comment thread src/iceberg/puffin/puffin_reader.cc Outdated
if (!input_file) {
return InvalidArgument("input_file must not be null");
}
ICEBERG_ASSIGN_OR_RAISE(auto file_size, input_file->Size());
Copy link
Copy Markdown
Member

@wgtmac wgtmac May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reader accepts a known footer_size, but Make still always calls InputFile::Size() and ReadFileMetadata recomputes the footer size from the file tail. Java's ReadBuilder supports both withFileSize and withFooterSize and uses/validates the hints, so this API is not strictly aligned and loses the intended object-store optimization.

// Footer end magic
ICEBERG_RETURN_UNEXPECTED(WriteMagic());

ICEBERG_ASSIGN_OR_RAISE(auto end_pos, stream_->Position());
Copy link
Copy Markdown
Member

@wgtmac wgtmac May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finish records file_size/footer_size and sets finished_ before Flush/Close returns. Java only marks the writer finished and records storedLength after close, because some streams emit final bytes on close. If Close fails or finalizes extra bytes, this writer reports a finished state and wrong size.

PuffinCompressionCodec default_codec = PuffinCompressionCodec::kNone,
bool compress_footer = false);

~PuffinWriter();
Copy link
Copy Markdown
Member

@wgtmac wgtmac May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java PuffinWriter implements close(), and close() implicitly finishes an unfinished file. This API exposes only Finish() and the destructor is a no-op, so ported code has no Java-equivalent close path and can leave an invalid Puffin file if Finish() is not called explicitly.

std::unique_ptr<InputFile> input_file,
std::optional<int64_t> footer_size = std::nullopt);

~PuffinReader();
Copy link
Copy Markdown
Member

@wgtmac wgtmac May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java PuffinReader.close() closes the input stream. This reader owns a SeekableInputStream whose interface has an explicit Close() contract, but PuffinReader exposes no Close() and its destructor is defaulted, so readers can leak or delay releasing file handles/object-store resources.

- PuffinWriter: in-memory writer that builds complete Puffin files
  - Add() writes blobs with optional compression
  - Finish() serializes footer with JSON metadata
  - Tracks BlobMetadata for all written blobs
- PuffinReader: in-memory reader that parses Puffin files
  - ReadFileMetadata() parses footer and validates magic bytes
  - ReadBlob() reads and decompresses individual blobs
  - ReadAll() reads all blobs from metadata
- Expose Compress/Decompress as public API in puffin_format.h
- Register new sources in CMake and Meson build systems
- Add comprehensive tests including Java binary compatibility
@wgtmac
Copy link
Copy Markdown
Member

wgtmac commented May 27, 2026

I'll directly push some changes to resolve my comments from previous round of review to save some time.

@wgtmac wgtmac force-pushed the feat/puffin-format-utils branch from 8f3720c to e936d49 Compare May 27, 2026 15:54
@wgtmac wgtmac merged commit 100bbe3 into apache:main May 28, 2026
15 checks passed
@wgtmac
Copy link
Copy Markdown
Member

wgtmac commented May 28, 2026

Thank you for working on this, @zhaoxuan1994!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants