Add packfile library design doc #633
Conversation
Design document describing the packfile and intpack packages for storing and reading large collections of immutable, ordinal-indexed items (events, bitmaps, ledgers) with O(1) random access and minimal I/O overhead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a design document proposing a “packfile” immutable, ordinal-indexed file format and Go library API intended as a building block for Stellar RPC v2 full-history storage (compact storage with O(1) random access and minimal I/O).
Changes:
- Introduces the packfile/intpack concepts, goals/non-goals, and usage examples.
- Specifies the on-disk layout (records, index, trailer) plus integrity/content-hash model.
- Documents proposed Go APIs for writer/reader, concurrency behavior, and error surface.
- Clarify trailer flag descriptions: Bit 0 means not zstd-compressed (not implying CRC is always present), and document the Bit 0 + Bit 2 combination as the Raw format
- Fix ErrIndexRange message from "record index" to "item index" to match the public API contract

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
I got curious, so I had Claude spit out a simpler program (https://github.com/graydon/xdrpack — both Go and Rust, though I only tested the Rust version) that just does a zstd train on an XDR stream's frames to build a shared dictionary, then compresses each frame with that dictionary and indexes the offsets of each compressed frame in a single table, with offsets sized to the minimum needed to span the file. Seems to work, though who knows if it's comparable to the benchmark numbers you're after. I guess I'm just a bit unsure whether the complexity of the structure you've built here buys you much, but perhaps it does -- happy to talk more about it!
> - **O(1) random access by ordinal index.** Every `ReadItem` call maps index to record via arithmetic, then reads and decodes a single record.
> - **Minimal I/O.** The full index loads in one disk read on open (~112KB for 68K records). After that, one disk read per record, exact size, no over-read.
> - **Compact index.** Index size depends on max record size, not file size. A file with 20KB records uses 15-bit deltas whether the file is 500MB or 50GB.
I think this is not true. I think the index size described in this document is the size of an index entry times the number of index entries, which grows without bound. The size of an index entry (as a delta) is small, but the 500MB file mentioned here would have 25,000 index entries whereas the 50GB file would have 2.5 million index entries.
> ## Goals
>
> - **O(1) random access by ordinal index.** Every `ReadItem` call maps index to record via arithmetic, then reads and decodes a single record.
This is .. depending on how you look at it "not achieved" by the design in this doc. The index is delta-encoded which means the entire index needs decoding on file-open (which is O(n) despite having a fairly healthy divisor on n).
Once open and decoded, it's O(1), but that's not guaranteed to be easy or cheap (eg. if you have lots of files you might be constantly reading/decoding/dropping their indexes from memory). I'd recommend just making your index be seekable itself, so you:
- open the file
- readAt a position in the index you calculate from the ordinal, to get the offset of the variable-sized record
- readAt the variable-sized record
> "not achieved" by the design in this doc
I do agree with this in a sense.
As in, the offset index is delta-encoded, so you can't seek to a single entry — you need to decode ALL preceding deltas and cumulative-sum them to reconstruct absolute offsets. That's O(N) on open, not O(1). Granted, N here is only ever 10k, at least in the ledgers case, but still. After open it's an array lookup, sure, but the doc says "O(1) random access" as a generalization, without qualification. Would it be more honest to say "O(1) after O(n) initialization"?
> ## How It Works
>
> Items are grouped into fixed-size **records** (default 128 items per record). Small items like events don't compress well individually, but 128 of them together do. Large items like ledgers are stored one per record since they compress well on their own. Each record is compressed and written as a contiguous block on disk.
I'm unclear on whether the design of bundling items into groups for compression is ideal. I think it's buying you two things?
- Space for zstd to be able to achieve some compression
- A secondary FOR base for the item-FOR-offsets-within-the-record, but exploiting that requires yet another FOR_index appended to each record.
I don't know, the whole thing seems a bit elaborate -- the FOR-and-delta coded primary index, the secondary FOR_index on each record, even the use of zstd.
I think you can get to a similarly-good place with less complexity if you:
- Figure out a way to adequately compress individual items (eg. `zstd --train` for a while and then reuse that dictionary across all items; or perhaps just cook up a fixed-model compression scheme for XDR, I think we've tried this before in the past?)
- FOR-encode the offsets in the index if you like, but:
  - Encode an offset for every item in a single index at the end, not a 2-level index-of-indexes
  - Don't also delta-encode the primary index entries
  - Figure out a single width for all the index entries in the file so you can calculate the index entry to readAt
  - Reset the FOR-base at those same 128-offset fixed intervals so you can pick that out with another readAt
Honestly .. even the FOR-encoding of offsets seems like it might well be overkill. I would just do plain leading-null suppression / picking a per-file offset width.
Like unless your files are truly gargantuan the offsets will probably be like 3-5 bytes most of the time (16MiB - 1TiB) and all the FOR wrangling will probably only get your offsets down to 1-2 bytes. So like the index is half as big, but .. is it worth an extra readAt on the hot path? Probably not. I would just figure out what the max offset you need is for all offsets in a file, write that number at the end -- eg. "this file is <= 16MiB so offsets in this file are each 3 bytes long", say -- and then just readAt(ordinal * 3) from your index, and that's your item offset, don't FOR or delta encode anything.
@graydon this is a great question and I think this gets at the core design decision in packfile.
> Space for zstd to be able to achieve some compression
>
> A secondary FOR base for the item-FOR-offsets-within-the-record, but exploiting that requires yet another FOR_index appended to each record.
That assessment is correct and I will elaborate on the specific benefits.
The constraint driving everything is: the events API returns up to 1,000 events per response, and in the worst case those events are scattered across the entire file. The file of events we've been benchmarking with spans 10,000 ledgers and contains 8.7M events (average 221 bytes each uncompressed) — it's recent and probably representative of current event density.
Compression: early on in our design phase @urvisavla actually tried out per event dictionary compression and she found training the dictionary allowed us to achieve ~2x compression. I also ran your xdrpack tool on the dataset of 8.7M events and confirmed the ~2x compression ratio using that method.
Using zstd on groups of 128 events per record gets us a ~4.4x compression ratio. Individual events just don't give zstd enough context, even with a 64 KB trained dictionary. Aside from a better compression ratio, there are two more benefits:
- faster ingestion times because we don't need to train the zstd dictionary
- no need to read the 64 KB trained dictionary as a prerequisite step before querying individual events from the file
Index size: Without grouping, the index has one entry per event: 8.7M × 3 bytes = 26 MB. With grouping at 128, it's one entry per record: 68K entries. Using leading-null-suppressed offsets as you suggested, that's 68K × 3 bytes = 204 KB — fits in a single EBS IOP. FOR encoding shrinks it further to 112 KB, which gives us room to grow if event density increases or we want each packfile to span wider ledger ranges than 10,000.
Why this matters on EBS: On gp3 (3,000 IOPS baseline, 125 MB/s throughput), each random read costs one IOP regardless of size (up to 256 KB):
- Without grouping: the 26 MB index is too large to preload, so each event lookup needs two IOPS — one index seek, one data read. Worst case is 1,000 scattered events = 2,000 IOPS = 667 ms.
- With grouping: the 112 KB index loads in one IOP, then each event lookup is one data read. 1,001 IOPS = 334 ms. You might worry that grouped records are larger (~6.4 KB each vs a single compressed event), but 1,000 × 6.4 KB = 6.4 MB total, which takes 51 ms at 125 MB/s. Since IOPS and bandwidth are consumed simultaneously — each IOP transfers its bytes — the bottleneck is whichever takes longer: 334 ms for IOPS vs 51 ms for bandwidth. So it's IOPS-bound either way.
For the EBS case, I think we should strive to make the index at the end of the file as compact as possible so that we can load it with as few IOPs as possible. IMO, the FOR implementation (125 LoC) is justified for that but I'm definitely open to other suggestions.
However, it is not necessary for the index contained within each record to be super compact since the index size for 128 events is tiny in comparison to the event payload sizes. I agree that FOR is totally overkill there. I only used it in the packfile design because we're already using it for the primary index at the trailer. A simpler scheme like length-prefixed entries would work just as well but I figured we might as well just implement it as another call to the FOR encode / decode function.
In conclusion, I agree the overall design is more elaborate than a flat per-item approach. But, I think it's justified because we're getting more than 2x better compression, a 230x smaller index (112 KB vs 26 MB), and 2x lower worst case query latency on EBS (334 ms vs 667 ms).
ok! I didn't realize you were going to be reading off EBS -- I'm more used to thinking in terms of local NVMe IO patterns -- and that does change the calculus a bit.
> **Non-blocking Open.** `Open` returns a `*Reader` immediately. A background goroutine performs all I/O: open, stat, speculative read, trailer parse, CRC verification, index decode, app data read. A `sync.OnceValue` drains the result on the first query call. Errors are deferred to query time — `Open` itself never fails. This enables overlapped initialization: start loading an MPHF or opening other files while the goroutine runs.
>
> **Speculative Read.** On open, one pread of the last `min(256KB, fileSize)` bytes. This usually captures the trailer, app data, and index in one IOP. If the tail exceeds 256KB, a single fallback read fetches the rest.
I think this isn't true. The index will be as large as the index is. It's not guaranteed to fit in 256KB or 512KB or anything.
> `w.Finish(nil)` // flushes partial record, writes index + trailer, fsyncs
>
> Items are appended in order. `Finish` flushes any partial record, writes the offset index, optional app data, and a 64-byte trailer, then fsyncs. `Close` after `Finish` is a no-op; `Close` without `Finish` removes the incomplete file.
I read ahead and didn't see any mention of a viable use case / way in which an application can use this, nor can I think of one that relates to either events or ledger usage. Does it make sense to include it in the packfile format?
I'm not sure what you're referring to. Are you talking about the use case of calling Close() / Finish()?
My bad.
I was talking about the section for app data in the packfile.
app data is useful for storing any type of data / metadata which is relevant for that specific packfile.
it is used for events to store a table mapping ledgers to cumulative event counts per ledger. The ledger counts are used so we can filter for events matching the getEvents ledger range
> Packfile stores **record sizes** (deltas between consecutive offsets) instead. Deltas depend on the maximum record size, not total file size. A file with 20KB records uses 15-bit deltas whether the file is 500MB or 50GB.
>
> Deltas are encoded using **Frame of Reference (FOR)** compression in groups of 128. FOR subtracts a per-group minimum from every value, then bit-packs the residuals at the minimum bit width needed. Each group is self-contained:
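To make the FOR step concrete, here is a toy Go bit-packer for one group. This is illustrative only; the real intpack encoding and on-disk layout may differ:

```go
package main

import (
	"fmt"
	"math/bits"
)

// forEncode packs one group Frame-of-Reference style: subtract the group
// minimum (the "base"), then bit-pack each residual at the smallest bit
// width that fits the largest residual.
func forEncode(vals []uint64) (base uint64, width uint, packed []byte) {
	base = vals[0]
	for _, v := range vals {
		if v < base {
			base = v
		}
	}
	var maxRes uint64
	for _, v := range vals {
		if r := v - base; r > maxRes {
			maxRes = r
		}
	}
	width = uint(bits.Len64(maxRes)) // bits per residual
	packed = make([]byte, (width*uint(len(vals))+7)/8)
	for i, v := range vals {
		r := v - base
		for b := uint(0); b < width; b++ {
			if r&(1<<b) != 0 {
				pos := uint(i)*width + b
				packed[pos/8] |= 1 << (pos % 8)
			}
		}
	}
	return base, width, packed
}

// forDecode reverses forEncode for a group of n values.
func forDecode(base uint64, width uint, packed []byte, n int) []uint64 {
	out := make([]uint64, n)
	for i := 0; i < n; i++ {
		var r uint64
		for b := uint(0); b < width; b++ {
			pos := uint(i)*width + b
			if packed[pos/8]&(1<<(pos%8)) != 0 {
				r |= 1 << b
			}
		}
		out[i] = base + r
	}
	return out
}

func main() {
	deltas := []uint64{5000, 5120, 5032, 5064} // record-size deltas, say
	base, width, packed := forEncode(deltas)
	fmt.Println(base, width, len(packed)) // 5000 7 4
	fmt.Println(forDecode(base, width, packed, len(deltas)))
}
```

Four 64-bit values pack into 4 bytes here because the residuals (0, 120, 32, 64) all fit in 7 bits.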
im a bit confused by the 128s in this doc — there's RecordSize, which defaults to 128 items per record, and then there is this line here that says "compression in groups of 128" for the FOR encoding, which is a separate hardcoded constant somewhere, again mentioned around line 280: "The group size (128) is a library constant, independent of RecordSize. If it changes, the format version is bumped."

What's the rationale for the FOR group size being 128? The doc doesn't say — it just appears as a bare number. Is it based on some empirical benchmarking you did for events (I know that's how you landed on the default RecordSize=128)? Can you make this more explicit?

Reading through the doc, I kept thinking they were the same thing, and it's only by accident that I noticed the mention saying they're independent. I think it deserves to be called out upfront — maybe give it a name like IndexGroupSize, explain why 128 was chosen, and make it clear early on that it's a separate constant from the default RecordSize.

Also, if the group size is required to decode the index, it should probably be in the trailer? There's 2 reserved bytes at offset 58 that could hold it? Right now a reader has to hardcode 128, and if that ever changes the only signal is a version bump — but a reader for version 1 would silently decode garbage (right?) instead of knowing it can't handle the file.
Good point. The FOR group size for the index doesn't really matter so much, based on the benchmarking I did; there were diminishing returns after a certain point:
| Group Size | Groups | Index Size | vs Flat (267 KB) |
|---|---|---|---|
| 32 | 2,135 | 118.6 KB | -55.5% |
| 64 | 1,068 | 113.6 KB | -57.4% |
| 128 | 534 | 111.0 KB | -58.4% |
| 256 | 267 | 109.7 KB | -58.9% |
| 512 | 134 | 109.1 KB | -59.1% |
Including the group size in the trailer makes sense, I can make that change.
> Items are grouped into fixed-size **records** (default 128 items per record). Small items like events don't compress well individually, but 128 of them together do. Large items like ledgers are stored one per record since they compress well on their own. Each record is compressed and written as a contiguous block on disk.
>
> An **offset index** at the end of the file maps record numbers to byte offsets. On open, the entire index is decoded into a flat `[]int64` array. Looking up item `i` is arithmetic: `offsets[i / RecordSize]` gives the record's byte offset, then a single disk read + decode extracts the item.
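For illustration, the quoted lookup arithmetic as a tiny Go sketch (the `locate` helper is hypothetical, not the packfile API), showing both the events-style and ledgers-style RecordSize:

```go
package main

import "fmt"

// locate maps an item index to its record's byte offset plus the item's
// position within that record, per the quoted arithmetic. offsets is the
// decoded flat index; recordSize is items per record.
func locate(offsets []int64, recordSize, item int) (recordOffset int64, posInRecord int) {
	return offsets[item/recordSize], item % recordSize
}

func main() {
	// Three records starting at bytes 0, 6400, and 12900.
	offsets := []int64{0, 6400, 12900}

	// Events mode: 128 items per record; item 130 is item 2 of record 1.
	off, pos := locate(offsets, 128, 130)
	fmt.Println(off, pos) // 6400 2

	// Ledgers mode: 1 item per record; item 2 is record 2 by itself.
	off, pos = locate(offsets, 1, 2)
	fmt.Println(off, pos) // 12900 0
}
```

With RecordSize=1 the position within the record is always 0, which is why the per-record item index disappears in the ledgers case.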
Can you include 1-2 examples here? One for events (RecordSize=128) and one for ledgers (RecordSize=1)? Something like: "here are 5 items (an item being an individual event or LCM) of these sizes, here's what the records look like on disk, here's what the offset index looks like, here's the FOR encoding step by step."

I have read through this a few times and I am still struggling to understand the format here. There are like 6-7 concepts/terms here — item, record, offset index, FOR group, delta, W, min — all in abstract terms across different sections. I had to read it three times to piece together how they relate, and I still have some visualization issues 🙈

Especially useful would be showing the two modes side by side: events with RecordSize=128, where you get the per-record FOR_index AND the file-level offset index, VS ledgers with RecordSize=1, where the per-record FOR_index disappears entirely. That difference is non-obvious from the current text.
@tamirms: I wanted to highlight that the doc is slightly hard to follow the way it is currently structured. Also, intpack feels like an afterthought at the very end, but its FOR encoding is fundamental to both the offset index and the per-record item index. It deserves more prominence: maybe a dedicated section in "How It Works" explaining FOR with a small example before diving into the file format spec. And then maybe a glossary of terms at the end?
> ### Content Hash
>
> When `ContentHash: true`, the writer computes a chunked SHA-256 over the logical item stream:
Can we rephrase this, perhaps, to say "each record's contents are hashed, and then all the record digests are hashed together" instead of introducing K and talking about chunk boundaries? K is just RecordSize: a "chunk" of K items is just... a record.
IMO, the section

    chunkDigest_i = SHA-256([4B len][item_{i*K}] ... [4B len][item_{i*K+K-1}])
    finalHash = SHA-256(chunkDigest_0 || ... || chunkDigest_M)
    K = RecordSize

can be replaced with

    RecordSize = 128
    record_0_digest = SHA-256([len][item_0][len][item_1]...[len][item_127])
    record_1_digest = SHA-256([len][item_128]...[len][item_255])
    ...
    finalHash = SHA-256(record_0_digest || record_1_digest || ...)
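A toy Go sketch of that per-record scheme; the 4-byte big-endian length prefix and the `contentHash` helper name are assumptions based on the quoted `[4B len]` notation, not the actual packfile code:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// contentHash hashes each record's length-prefixed items, then hashes
// the concatenation of the record digests. Changing recordSize moves
// record boundaries and so changes the final hash even for identical
// items — which is why the hash depends on RecordSize.
func contentHash(items [][]byte, recordSize int) [sha256.Size]byte {
	outer := sha256.New()
	for start := 0; start < len(items); start += recordSize {
		end := start + recordSize
		if end > len(items) {
			end = len(items) // final record may be partial
		}
		inner := sha256.New()
		for _, item := range items[start:end] {
			var lenPrefix [4]byte
			binary.BigEndian.PutUint32(lenPrefix[:], uint32(len(item)))
			inner.Write(lenPrefix[:])
			inner.Write(item)
		}
		outer.Write(inner.Sum(nil))
	}
	var h [sha256.Size]byte
	copy(h[:], outer.Sum(nil))
	return h
}

func main() {
	items := [][]byte{[]byte("a"), []byte("b"), []byte("c")}
	fmt.Println(contentHash(items, 1) == contentHash(items, 1)) // true
	fmt.Println(contentHash(items, 1) == contentHash(items, 3)) // false
}
```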
> `K = RecordSize`
>
> The hash is independent of compression and format — same items in the same order with the same RecordSize always produce the same hash. Note that changing RecordSize changes the chunk boundaries and therefore the hash.
"Independent of compression and format" reads like the hash depends only on the data itself, but the very next sentence says changing RecordSize changes the hash. Can we rephrase it to something like:

"""
The hash is independent of compression and format, but depends on RecordSize. This means you can't use the content hash to verify "these two packfiles contain the same data" unless they were written with the same RecordSize.
"""
…w fixes
- Add events cold segment as third process_chunk output (PR #635)
- Switch LFS from .data+.index to .pack format (PR #633)
- Add chunk:{C}:events meta store key, atomic 3-flag WriteBatch
- Add events_base to config Optional Sections table
- Add events/ to directory structure
- Add DAG setup pseudocode with explicit BUILD_READY handling
- Replace ASCII dependency diagram with Mermaid flowchart
- Expand LFS, BSB, MPHF acronyms on first use
- Explain 10,000 multiplier in validation rules
- Remove "Future: getEvents" section (events now first-class)
- Remove dead pseudocode branch, hedging language

Major restructuring to improve clarity:
- Reorder doc: Problem → Concepts → Usage → API → File Format → Implementation
- Add Concepts section explaining items, records, offset index, and ItemsPerRecord tradeoffs upfront before any code
- Add terminology table at start of File Format section
- Give FOR encoding its own section before index and record descriptions
- Rename RecordSize to ItemsPerRecord for clarity
- Store FOR group size in trailer (offset 58) instead of hardcoding
- Simplify content hash section with concrete record numbering
- Clarify flag descriptions for Uncompressed/Raw format mapping
- Rename ErrIndexRange to ErrPositionOutOfRange to match API contract
- Add concrete ReadItem walkthrough in Implementation Notes
- Clarify speculative read is not guaranteed to capture full index

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@karthikiyer56 can you take another look at the packfile doc? I have updated it to address the review feedback.
- Add byte-level worked example showing ItemsPerRecord=2 and ItemsPerRecord=1 side by side with FOR encoding math, record layouts, and file layout (addresses review feedback) - Minor wording trims in FOR Encoding and Index Encoding sections Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Note: This PR targets `main` for now but will be retargeted to the full history RPC / v2 branch once we align on that before merging.

Further reading: `packfile`, `intpack`

🤖 Generated with Claude Code