Add packfile library design doc#633

Open
tamirms wants to merge 5 commits into feature/full-history from add-packfile-design-doc

Conversation


@tamirms tamirms commented Mar 18, 2026

Summary

  • Adds a design document for a compact, immutable file format that provides O(1) random access to ordinal-indexed data (events, bitmaps, ledgers) with minimal I/O
  • This is a building block for a full history implementation of RPC (v2)
  • The doc covers the file format, read/write paths, integrity model, and API reference
  • The goal is to align on this design before proceeding with implementation — please flag any concerns or blockers so we can address them during the design stage

Note: This PR targets main for now but will be retargeted to the full history RPC / v2 branch once we align on that before merging.

Further reading

  • Reference implementation: packfile, intpack
  • Benchmarks — packfile vs RocksDB sstable across write throughput, sequential/random reads, compression efficiency, and bitmap indexing
  • Format comparison — technical evaluation of packfile vs RocksDB sstable across performance, code complexity, dependencies, and suitability for remote storage

🤖 Generated with Claude Code

Design document describing the packfile and intpack packages for
storing and reading large collections of immutable, ordinal-indexed
items (events, bitmaps, ledgers) with O(1) random access and minimal
I/O overhead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 18, 2026 17:56

Copilot AI left a comment


Pull request overview

Adds a design document proposing a “packfile” immutable, ordinal-indexed file format and Go library API intended as a building block for Stellar RPC v2 full-history storage (compact storage with O(1) random access and minimal I/O).

Changes:

  • Introduces the packfile/intpack concepts, goals/non-goals, and usage examples.
  • Specifies the on-disk layout (records, index, trailer) plus integrity/content-hash model.
  • Documents proposed Go APIs for writer/reader, concurrency behavior, and error surface.


- Clarify trailer flag descriptions: Bit 0 means not zstd-compressed
  (not implying CRC is always present), and document the Bit 0 + Bit 2
  combination as the Raw format
- Fix ErrIndexRange message from "record index" to "item index" to
  match the public API contract

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tamirms tamirms requested a review from a team March 18, 2026 18:08
@tamirms tamirms changed the base branch from main to feature/full-history March 19, 2026 13:12

@graydon graydon left a comment


I got curious so I had claude spit out a simpler program (https://github.com/graydon/xdrpack -- both go and rust though I only tested the rust version) that just does a zstd train on an xdr stream's frames to build a shared dictionary, then compresses each frame with that dictionary and indexes the offsets of each compressed frame in a single table, with offsets sized to the minimum needed to span the file. seems to work, though who knows if it's comparable to the benchmark numbers you're after. I guess I'm just a bit unsure whether the complexity of the structure you've built here buys you much, but perhaps it does -- happy to talk more about it!

Comment thread design-docs/packfile-library.md Outdated

- **O(1) random access by ordinal index.** Every `ReadItem` call maps index to record via arithmetic, then reads and decodes a single record.
- **Minimal I/O.** The full index loads in one disk read on open (~112KB for 68K records). After that, one disk read per record, exact size, no over-read.
- **Compact index.** Index size depends on max record size, not file size. A file with 20KB records uses 15-bit deltas whether the file is 500MB or 50GB.
Contributor


I think this is not true. I think the index size described in this document is the size of an index entry times the number of index entries, which grows without bound. The size of an index entry (as a delta) is small, but the 500MB file mentioned here would have 25,000 index entries whereas the 50GB file would have 2.5 million index entries.

Comment thread design-docs/packfile-library.md Outdated

## Goals

- **O(1) random access by ordinal index.** Every `ReadItem` call maps index to record via arithmetic, then reads and decodes a single record.
Contributor


This is .. depending on how you look at it "not achieved" by the design in this doc. The index is delta-encoded which means the entire index needs decoding on file-open (which is O(n) despite having a fairly healthy divisor on n).

Once open and decoded, it's O(1), but that's not guaranteed to be easy or cheap (eg. if you have lots of files you might be constantly reading/decoding/dropping their indexes from memory). I'd recommend just making your index be seekable itself, so you:

  • open the file
  • readAt a position in the index you calculate from the ordinal, to get the offset of the variable-sized record
  • readAt the variable-sized record


@karthikiyer56 karthikiyer56 Mar 24, 2026


"not achieved" by the design in this doc

I do agree with this in a sense.
As in, the offset index is delta-encoded, so you can't seek to a single entry — you need to decode ALL preceding deltas and cumulative-sum them to reconstruct absolute offsets.
That's O(N) on open, not O(1).
Granted, N here is only ever 10k, at least in the ledgers case, but still...
After open it's an array lookup, sure, but the doc says "O(1) random access" as a generalization without quantification.

Would it be more honest to say "O(1) after O(n) initialization"?

Comment thread design-docs/packfile-library.md Outdated

## How It Works

Items are grouped into fixed-size **records** (default 128 items per record). Small items like events don't compress well individually, but 128 of them together do. Large items like ledgers are stored one per record since they compress well on their own. Each record is compressed and written as a contiguous block on disk.
Contributor


I'm unclear on whether the design of bundling items into groups for compression is ideal. I think it's buying you two things?

  1. Space for zstd to be able to achieve some compression
  2. A secondary FOR base for the item-FOR-offsets-within-the-record, but exploiting that requires yet another FOR_index appended to each record.

I don't know, the whole thing seems a bit elaborate -- the FOR-and-delta coded primary index, the secondary FOR_index on each record, even the use of zstd.

I think you can get to a similarly-good place with less complexity if you:

  • Figure out a way to adequately compress individual items (eg. zstd --train for a while and then reuse that dictionary across all items; or perhaps just cook up a fixed-model compression scheme for XDR, I think we've tried this before in the past?)
  • FOR-encode the offsets in the index if you like, but:
    • Encode an offset for every item in a single index at the end, not a 2-level index-of-indexes
    • Don't also delta-encode the primary index entries
    • Figure out a single width for all the index entries in the file so you can calculate the index entry to readAt
    • Reset the FOR-base at those same 128-offset fixed intervals so you can pick that out with another readAt

Honestly .. even the FOR-encoding of offsets seems like it might well be overkill. I would just do plain leading-null suppression / picking a per-file offset width.

Like unless your files are truly gargantuan the offsets will probably be like 3-5 bytes most of the time (16MiB - 1TiB) and all the FOR wrangling will probably only get your offsets down to 1-2 bytes. So like the index is half as big, but .. is it worth an extra readAt on the hot path? Probably not. I would just figure out what the max offset you need is for all offsets in a file, write that number at the end -- eg. "this file is <= 16MiB so offsets in this file are each 3 bytes long", say -- and then just readAt(ordinal * 3) from your index, and that's your item offset, don't FOR or delta encode anything.
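The per-file fixed-width scheme suggested here is easy to sketch (illustrative helper names, not part of any real API):

```go
package main

// offsetWidth picks the minimal per-file byte width able to represent the
// largest offset in the file — the leading-null-suppression idea above.
func offsetWidth(maxOffset uint64) int {
	w := 1
	for w < 8 && maxOffset >= uint64(1)<<(8*uint(w)) {
		w++
	}
	return w
}

// putOffset writes off as w big-endian bytes into dst, so the index entry
// for ordinal i lives at the fixed position i*w.
func putOffset(dst []byte, off uint64, w int) {
	for i := w - 1; i >= 0; i-- {
		dst[i] = byte(off)
		off >>= 8
	}
}

// getOffset reads a w-byte big-endian offset back out.
func getOffset(src []byte, w int) uint64 {
	var v uint64
	for i := 0; i < w; i++ {
		v = v<<8 | uint64(src[i])
	}
	return v
}
```

A reader then needs only the width (written at the end of the file) to compute `readAt(indexStart + ordinal*w)` with no FOR or delta decoding.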

Contributor Author


@graydon this is a great question and I think this gets at the core design decision in packfile.

Space for zstd to be able to achieve some compression

A secondary FOR base for the item-FOR-offsets-within-the-record, but exploiting that requires yet another FOR_index appended to each record.

That assessment is correct and I will elaborate on the specific benefits.

The constraint driving everything is: the events API returns up to 1,000 events per response, and in the worst case those events are scattered across the entire file. The file of events we've been benchmarking with spans 10,000 ledgers and contains 8.7M events (average 221 bytes each uncompressed) — it's recent and probably representative of current event density.

Compression: early on in our design phase @urvisavla actually tried out per event dictionary compression and she found training the dictionary allowed us to achieve ~2x compression. I also ran your xdrpack tool on the dataset of 8.7M events and confirmed the ~2x compression ratio using that method.

Using zstd on groups of 128 events per record gets us a ~4.4x compression ratio. Individual events just don't give zstd enough context even with a 64 KB trained dictionary. Aside from a better compression ratio there are two more benefits:

  1. faster ingestion times because we don't need to train the zstd dictionary
  2. no need to read the 64 KB trained dictionary as a prerequisite step before querying individual events from the file

Index size: Without grouping, the index has one entry per event: 8.7M × 3 bytes = 26 MB. With grouping at 128, it's one entry per record: 68K entries. Using leading-null-suppressed offsets as you suggested, that's 68K × 3 bytes = 204 KB — fits in a single EBS IOP. FOR encoding shrinks it further to 112 KB, which gives us room to grow if event density increases or we want each packfile to span wider ledger ranges than 10,000.

Why this matters on EBS: On gp3 (3,000 IOPS baseline, 125 MB/s throughput), each random read costs one IOP regardless of size (up to 256 KB):

  • Without grouping: the 26 MB index is too large to preload, so each event lookup needs two IOPS — one index seek, one data read. Worst case is 1,000 scattered events = 2,000 IOPS = 667 ms.
  • With grouping: the 112 KB index loads in one IOP, then each event lookup is one data read. 1,001 IOPS = 334 ms. You might worry that grouped records are larger (~6.4 KB each vs a single compressed event), but 1,000 × 6.4 KB = 6.4 MB total, which takes 51 ms at 125 MB/s. Since IOPS and bandwidth are consumed simultaneously — each IOP transfers its bytes — the bottleneck is whichever takes longer: 334 ms for IOPS vs 51 ms for bandwidth. So it's IOPS-bound either way.
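The latency figures above can be sanity-checked with a few lines (the constants are the gp3 numbers quoted in this comment; purely illustrative):

```go
package main

// gp3 baseline figures quoted above.
const (
	iopsPerSec = 3000.0 // random reads per second
	mbPerSec   = 125.0  // throughput in MB/s
)

// iopsMillis returns how long n random reads take when IOPS-bound.
func iopsMillis(n float64) float64 { return n / iopsPerSec * 1000 }

// bandwidthMillis returns how long transferring mb megabytes takes
// when throughput-bound.
func bandwidthMillis(mb float64) float64 { return mb / mbPerSec * 1000 }
```

2,000 IOPS comes out at ~667 ms, 1,001 IOPS at ~334 ms, and 6.4 MB of transfer at ~51 ms — so the grouped case is IOPS-bound, as stated.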

For the EBS case, I think we should strive to make the index at the end of the file as compact as possible so that we can load it with as few IOPs as possible. IMO, the FOR implementation (125 LoC) is justified for that but I'm definitely open to other suggestions.

However, it is not necessary for the index contained within each record to be super compact since the index size for 128 events is tiny in comparison to the event payload sizes. I agree that FOR is totally overkill there. I only used it in the packfile design because we're already using it for the primary index at the trailer. A simpler scheme like length-prefixed entries would work just as well but I figured we might as well just implement it as another call to the FOR encode / decode function.

In conclusion, I agree the overall design is more elaborate than a flat per-item approach. But, I think it's justified because we're getting more than 2x better compression, a 230x smaller index (112 KB vs 26 MB), and 2x lower worst case query latency on EBS (334 ms vs 667 ms).

Contributor


ok! I didn't realize you were going to be reading off EBS -- I'm more used to thinking in terms of local NVMe IO patterns -- and that does change the calculus a bit.

Comment thread design-docs/packfile-library.md Outdated

**Non-blocking Open.** `Open` returns a `*Reader` immediately. A background goroutine performs all I/O: open, stat, speculative read, trailer parse, CRC verification, index decode, app data read. A `sync.OnceValue` drains the result on the first query call. Errors are deferred to query time — `Open` itself never fails. This enables overlapped initialization: start loading an MPHF or opening other files while the goroutine runs.

**Speculative Read.** On open, one pread of the last `min(256KB, fileSize)` bytes. This usually captures the trailer, app data, and index in one IOP. If the tail exceeds 256KB, a single fallback read fetches the rest.
Contributor


I think this isn't true. The index will be as large as the index is. It's not guaranteed to fit in 256KB or 512KB or anything.

Comment thread design-docs/packfile-library.md Outdated
w.Finish(nil) // flushes partial record, writes index + trailer, fsyncs
```

Items are appended in order. `Finish` flushes any partial record, writes the offset index, optional app data, and a 64-byte trailer, then fsyncs. `Close` after `Finish` is a no-op; `Close` without `Finish` removes the incomplete file.
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read ahead and didn't see any mention of a viable use case or way in which an application can use this, nor can I think of one that relates to either events or ledger usage.

Does it make sense to include it in the packfile format?

Contributor Author


I'm not sure what you're referring to. Are you talking about the use case of calling Close() / Finish()?

Contributor


My bad.
I was talking about the section for app data in the packfile.

Contributor Author


app data is useful for storing any type of data / metadata which is relevant for that specific packfile.

it is used for events to store a table mapping ledgers to cumulative event counts per ledger. The ledger counts are used so we can filter for events matching the getEvents ledger range

Comment thread design-docs/packfile-library.md Outdated

Packfile stores **record sizes** (deltas between consecutive offsets) instead. Deltas depend on the maximum record size, not total file size. A file with 20KB records uses 15-bit deltas whether the file is 500MB or 50GB.

Deltas are encoded using **Frame of Reference (FOR)** compression in groups of 128. FOR subtracts a per-group minimum from every value, then bit-packs the residuals at the minimum bit width needed. Each group is self-contained:
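As a reference point, FOR encoding of one group as described here can be sketched in Go — an illustrative implementation of the subtract-the-minimum-then-bit-pack idea, not the intpack code:

```go
package main

import "math/bits"

// forEncode packs one group of values as (min, bit width, MSB-first
// bit-packed residuals), illustrating Frame of Reference encoding.
func forEncode(vals []uint64) (min uint64, width int, packed []byte) {
	min = vals[0]
	for _, v := range vals {
		if v < min {
			min = v
		}
	}
	var maxRes uint64
	for _, v := range vals {
		if v-min > maxRes {
			maxRes = v - min
		}
	}
	width = bits.Len64(maxRes) // bit width of the largest residual
	packed = make([]byte, (len(vals)*width+7)/8)
	pos := 0
	for _, v := range vals {
		res := v - min
		for b := width - 1; b >= 0; b-- { // emit residual bits MSB-first
			if res>>uint(b)&1 == 1 {
				packed[pos/8] |= 1 << uint(7-pos%8)
			}
			pos++
		}
	}
	return min, width, packed
}

// forDecode reverses forEncode for n values.
func forDecode(min uint64, width int, packed []byte, n int) []uint64 {
	out := make([]uint64, n)
	pos := 0
	for i := 0; i < n; i++ {
		var res uint64
		for b := 0; b < width; b++ {
			res = res<<1 | uint64(packed[pos/8]>>uint(7-pos%8)&1)
			pos++
		}
		out[i] = min + res
	}
	return out
}
```

Each group is self-contained: storing `(min, width)` alongside the packed residuals lets a decoder process the group without any state from preceding groups.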
Contributor


I'm a bit confused by the 128s in this doc — there's RecordSize, which defaults to 128 items per record,
and then there is this line here that says "compression in groups of 128" for the FOR encoding, which is a separate hardcoded constant somewhere, again mentioned at line 280 — "The group size (128) is a library constant, independent of RecordSize. If it changes, the format version is bumped."

What's the rationale for the FOR group size being 128?
The doc doesn't say — it just appears as a bare number. Is it based on some empirical benchmarking you did for events? (I know that's how you landed on the default RecordSize=128.)

Can you make this more explicit?
Reading through the doc, I kept thinking they were the same thing, and it's only by accident that I noticed the mention saying they're independent.
I think it deserves to be called out upfront — maybe give it a name like IndexGroupSize, explain why 128 was chosen, and make it clear early on that it's a separate constant from the default RecordSize.

Also, if the group size is required to decode the index, it should probably be in the trailer?
There are 2 reserved bytes at offset 58 that could hold it.
Right now a reader has to hardcode 128, and if that ever changes the only signal is a version bump — but a reader for version 1 would silently decode garbage (right?) instead of knowing it can't handle the file.

Contributor Author


good point. the FOR group size for the index doesn't really matter so much based on the benchmarking I did. there were diminishing returns after a certain point:

| Group Size | Groups | Index Size | vs Flat (267 KB) |
|-----------:|-------:|-----------:|-----------------:|
| 32         | 2,135  | 118.6 KB   | -55.5%           |
| 64         | 1,068  | 113.6 KB   | -57.4%           |
| 128        | 534    | 111.0 KB   | -58.4%           |
| 256        | 267    | 109.7 KB   | -58.9%           |
| 512        | 134    | 109.1 KB   | -59.1%           |

Including the group size in the trailer makes sense, I can make that change

Items are grouped into fixed-size **records** (default 128 items per record). Small items like events don't compress well individually, but 128 of them together do. Large items like ledgers are stored one per record since they compress well on their own. Each record is compressed and written as a contiguous block on disk.

An **offset index** at the end of the file maps record numbers to byte offsets. On open, the entire index is decoded into a flat `[]int64` array. Looking up item `i` is arithmetic: `offsets[i / RecordSize]` gives the record's byte offset, then a single disk read + decode extracts the item.
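The open-then-lookup path described here amounts to a prefix sum followed by pure arithmetic. A sketch, under the assumption that the decoded offsets carry one trailing end-of-data entry (illustrative names, not the actual packfile API):

```go
package main

// buildOffsets reconstructs absolute record offsets from the stored deltas
// by cumulative sum — the O(n) step performed once on open. A trailing
// entry marks end-of-data so every record's length is a subtraction.
func buildOffsets(deltas []int64, dataStart int64) []int64 {
	offsets := make([]int64, len(deltas)+1)
	offsets[0] = dataStart
	for i, d := range deltas {
		offsets[i+1] = offsets[i] + d
	}
	return offsets
}

// recordFor maps an item position to its record's byte offset and length —
// the O(1) arithmetic performed per lookup.
func recordFor(offsets []int64, itemsPerRecord, item int) (offset, length int64) {
	r := item / itemsPerRecord
	return offsets[r], offsets[r+1] - offsets[r]
}
```

After `buildOffsets`, serving item `i` is one array lookup plus one exact-size disk read of the record.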

Contributor


Can you include 1-2 examples here?

One for events (RecordSize=128) and one for ledgers (RecordSize=1)?
Something like: "here are 5 items (an item being an individual event or LCM) of these sizes, here's what the records look like on disk, here's what the offset index looks like, here's the FOR encoding step by step."

I have read through this a few times and I am still struggling to understand the format here.
There are like 6-7 concepts/terms here — item, record, offset index, FOR group, delta, W, min — all in abstract terms across different sections. I had to read it three times to piece together how they relate and I still have some visualization issues 🙈

Especially useful would be showing the two modes side by side:
events with RecordSize=128, where you get the per-record FOR_index AND the file-level offset index,
VS
ledgers with RecordSize=1, where the per-record FOR_index disappears entirely. That difference is non-obvious from the current text.

@karthikiyer56

@tamirms: I wanted to highlight that the doc is slightly hard to follow the way it is currently structured.
It shows API code before explaining what records, FOR groups, or the offset index even are.
By the time you hit "How It Works", you've already seen ReadItem, ReadItems, MPHF, bitmaps, fingerprints — none of which make sense yet.

Perhaps it would flow better as:
Problem -> Goals -> Concepts/Terminology -> How It Works (with examples) -> File Format spec -> Usage/API -> Implementation Notes?
That way the reader builds understanding before seeing code.

Also, intpack feels like an afterthought at the very end, but its FOR encoding is fundamental to both the offset index and the per-record item index. It deserves more prominence — maybe a dedicated section in "How It Works" explaining FOR with a small example before diving into the file format spec.

And then maybe have a glossary of terms at the end?
I CCed something and it gave me something like this, which helped me big time.
Please do fact-check the table, but it looks accurate to me.


| Term | What it is | Book analogy |
|------|------------|--------------|
| Item | One event or one ledger. The thing you're storing. | One page |
| Record | A container of 1 or more items, stored as one blob on disk. `RecordSize` controls how many items per record. | A chapter (1 page per chapter, or 128 pages per chapter) |
| Payload | The concatenated raw bytes of all items in a record, before compression. | The chapter text before printing |
| FOR_index | A mini-index appended to each multi-item record, storing each item's byte length so you can find individual items inside the decompressed payload. Only exists when `RecordSize` > 1. | A chapter's own table of contents showing where each page starts within that chapter |
| Offset index | The file-level index at the end of the file mapping record numbers to byte positions on disk. | The book's table of contents showing where each chapter starts |
| Delta | The byte size of one record on disk. The offset index stores these instead of absolute byte positions. | "Chapter 3 is 14 pages long" instead of "Chapter 3 starts on page 42" |
| FOR group | A batch of 128 deltas encoded together using FOR compression. The 128 here is a separate hardcoded constant from `RecordSize` (which also defaults to 128 — confusing coincidence). | Grouping every 128 chapters' page-counts together for compact storage |
| min | The smallest delta in a FOR group. Subtracted from every delta to shrink the numbers. | "The shortest chapter in this batch is 12 pages, so record everything relative to 12" |
| W | Bit width needed to store the largest residual (delta - min) in a FOR group. | "After subtracting the minimum, the biggest remaining number fits in 9 bits" |

Comment thread design-docs/packfile-library.md Outdated

### Content Hash

When `ContentHash: true`, the writer computes a chunked SHA-256 over the logical item stream:
Contributor


can we rephrase this perhaps to say - "Each record's contents are hashed, and then all the record digests are hashed together" - instead of introducing K and talking about chunk boundaries?
K is just RecordSize - as in, a "chunk" of K items is just... a record.

IMO, the section

chunkDigest_i = SHA-256([4B len][item_{i*K}] ... [4B len][item_{i*K+K-1}])
finalHash     = SHA-256(chunkDigest_0 || ... || chunkDigest_M)
K = RecordSize

can be replaced with

RecordSize = 128
record_0_digest = SHA-256([len][item_0][len][item_1]...[len][item_127])
record_1_digest = SHA-256([len][item_128]...[len][item_255])
...
finalHash = SHA-256(record_0_digest || record_1_digest || ...)
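The suggested record-digest scheme could be sketched as follows. The 4-byte big-endian length prefix mirrors the `[len]` pseudocode above; the function name and signature are illustrative, not the library API:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
)

// contentHash hashes each record's length-prefixed items, then hashes the
// concatenation of the record digests — the chunked scheme suggested above.
func contentHash(items [][]byte, recordSize int) [32]byte {
	outer := sha256.New()
	for start := 0; start < len(items); start += recordSize {
		end := start + recordSize
		if end > len(items) {
			end = len(items) // final record may be partial
		}
		inner := sha256.New()
		for _, item := range items[start:end] {
			var lenBuf [4]byte
			binary.BigEndian.PutUint32(lenBuf[:], uint32(len(item)))
			inner.Write(lenBuf[:])
			inner.Write(item)
		}
		outer.Write(inner.Sum(nil)) // fold this record's digest in
	}
	var h [32]byte
	copy(h[:], outer.Sum(nil))
	return h
}
```

This makes the RecordSize dependence obvious: the same items hashed with a different `recordSize` group into different records and produce a different final digest.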

Comment thread design-docs/packfile-library.md Outdated
K = RecordSize
```

The hash is independent of compression and format — same items in the same order with the same RecordSize always produce the same hash. Note that changing RecordSize changes the chunk boundaries and therefore the hash.

@karthikiyer56 karthikiyer56 Mar 24, 2026


"independent of compression and format" reads like the hash depends only on the data itself, but the very next sentence says changing RecordSize changes the hash.

Can we rephrase it to something like:
"""
The hash is independent of compression and format, but depends on RecordSize.
This means you can't use the content hash to verify "these two packfiles contain the same data" unless they were written with the same RecordSize
"""

karthikiyer56 added a commit that referenced this pull request Mar 31, 2026
…w fixes

- Add events cold segment as third process_chunk output (PR #635)
- Switch LFS from .data+.index to .pack format (PR #633)
- Add chunk:{C}:events meta store key, atomic 3-flag WriteBatch
- Add events_base to config Optional Sections table
- Add events/ to directory structure
- Add DAG setup pseudocode with explicit BUILD_READY handling
- Replace ASCII dependency diagram with Mermaid flowchart
- Expand LFS, BSB, MPHF acronyms on first use
- Explain 10,000 multiplier in validation rules
- Remove "Future: getEvents" section (events now first-class)
- Remove dead pseudocode branch, hedging language
Major restructuring to improve clarity:
- Reorder doc: Problem → Concepts → Usage → API → File Format → Implementation
- Add Concepts section explaining items, records, offset index, and
  ItemsPerRecord tradeoffs upfront before any code
- Add terminology table at start of File Format section
- Give FOR encoding its own section before index and record descriptions
- Rename RecordSize to ItemsPerRecord for clarity
- Store FOR group size in trailer (offset 58) instead of hardcoding
- Simplify content hash section with concrete record numbering
- Clarify flag descriptions for Uncompressed/Raw format mapping
- Rename ErrIndexRange to ErrPositionOutOfRange to match API contract
- Add concrete ReadItem walkthrough in Implementation Notes
- Clarify speculative read is not guaranteed to capture full index

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tamirms tamirms requested a review from karthikiyer56 April 1, 2026 14:34
@tamirms

tamirms commented Apr 1, 2026

@karthikiyer56 can you take another look at the packfile doc ? I have updated it to address the review feedback

- Add byte-level worked example showing ItemsPerRecord=2 and
  ItemsPerRecord=1 side by side with FOR encoding math, record
  layouts, and file layout (addresses review feedback)
- Minor wording trims in FOR Encoding and Index Encoding sections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


4 participants