## Problem
Several `read_*` entry points produce a random checkpoint hash on every run, making checkpoint reuse impossible. The root cause is `read_records()` — it creates a temp dataset via `session.generate_temp_dataset_name()`, which includes a random UUID. The starting hash is derived from this random dataset name.
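The mechanics can be sketched in a few lines. This is a hypothetical mirror of the real code, not the actual implementation — `generate_temp_dataset_name` and `starting_hash` below are illustrative stand-ins:

```python
import hashlib
import uuid


def generate_temp_dataset_name() -> str:
    # Hypothetical stand-in for session.generate_temp_dataset_name():
    # the name embeds a random UUID, so it differs on every run.
    return f"session_temp_{uuid.uuid4().hex}"


def starting_hash(dataset_name: str) -> str:
    # The starting checkpoint hash is derived from the dataset name,
    # so a random name yields a random hash.
    return hashlib.sha256(dataset_name.encode()).hexdigest()


# Two runs over identical input data still get different hashes,
# so a checkpoint from a previous run can never match:
h1 = starting_hash(generate_temp_dataset_name())
h2 = starting_hash(generate_temp_dataset_name())
assert h1 != h2
```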
Affected entry points:

| Entry point | Root cause |
| --- | --- |
| `read_values()` | Built on `read_records()` |
| `read_pandas()` | Built on `read_values()` |
| `read_hf()` | Built on `read_values()` |
| `read_records()` with concrete data | Same random temp name |
| `read_database()` | Built on `read_records()` with a generator |
| `read_records()` with generators | Same random temp name |
Working entry points (for reference):

| Entry point | Why it works |
| --- | --- |
| `read_dataset()` | Hash based on dataset name + version |
| `read_storage()` | Hash based on listing dataset name |
| `read_csv()`, `read_parquet()`, `read_json()` | Built on `read_storage()` |
## Proposed solution
For entry points where data is concrete (in memory) at construction time, compute a deterministic hash from the input data upfront — before steps are applied. This preserves the current design where hash calculation is lightweight and doesn't require step application.
Concrete data (can fix now):

- `read_values()` — input is always concrete lists (`**fr_map` sequences)
- `read_pandas()` — input is always a DataFrame converted to lists
- `read_hf()` — input is a list of split names
- `read_records()` when called with a list/dict
Approach: compute a streaming SHA256 hash of the serialized input rows at construction time. Python's `hashlib.sha256` supports incremental `.update()` calls with constant memory. Performance for 1M rows of ~100 bytes ≈ 0.2s, negligible compared to DB insert time.
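A minimal sketch of the streaming approach, assuming JSON as a stand-in serializer (the real implementation would reuse the existing serialization pipeline discussed below):

```python
import hashlib
import json
from typing import Iterable


def hash_rows(rows: Iterable[dict]) -> str:
    """Streaming SHA256 over serialized rows with constant memory.

    json.dumps here is illustrative; the point is the incremental
    .update() loop, which never materializes more than one row.
    """
    h = hashlib.sha256()
    for row in rows:
        # sort_keys makes dict key order irrelevant; compact separators
        # avoid whitespace variation between serializer versions
        h.update(json.dumps(row, sort_keys=True, separators=(",", ":")).encode())
        h.update(b"\x00")  # row delimiter so concatenation is unambiguous
    return h.hexdigest()


# Identical concrete input produces an identical hash across runs:
rows = [{"id": i, "name": f"item-{i}"} for i in range(1000)]
assert hash_rows(rows) == hash_rows(list(rows))
```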
Challenges:

- Deterministic serialization of arbitrary Python objects (Pydantic models, datetimes, nested dicts). The `_flatten_record` + `adjust_outputs` pipeline already serializes these for DB insertion — we could reuse or mirror that logic for hashing.
- The hash needs to be available before `read_dataset()` is called on the temp dataset. Could either set it directly as the starting step hash on the `DatasetQuery`, or use a content-addressable dataset name.
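For the serialization challenge, a hedged sketch of a canonical encoding — `_to_jsonable` below is a hypothetical fallback, standing in for what `_flatten_record` + `adjust_outputs` already do:

```python
import json
from datetime import datetime, timezone


def _to_jsonable(obj):
    # Hypothetical fallback serializer; the real pipeline
    # (_flatten_record + adjust_outputs) already covers these cases.
    if isinstance(obj, datetime):
        return obj.isoformat()
    if hasattr(obj, "model_dump"):  # Pydantic v2 models
        return obj.model_dump()
    raise TypeError(f"cannot canonically serialize {type(obj)!r}")


def canonical_bytes(record: dict) -> bytes:
    # sort_keys + fixed separators give one canonical byte string
    # per logical record, regardless of dict insertion order.
    return json.dumps(
        record, sort_keys=True, separators=(",", ":"), default=_to_jsonable
    ).encode()


rec = {"ts": datetime(2024, 1, 1, tzinfo=timezone.utc), "nested": {"b": 2, "a": 1}}
same = {"nested": {"a": 1, "b": 2}, "ts": datetime(2024, 1, 1, tzinfo=timezone.utc)}
# Key order does not affect the canonical form:
assert canonical_bytes(rec) == canonical_bytes(same)
```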
Generators/iterators (deferred — separate discussion):

- `read_database()` passes a generator from SQL result iteration
- Direct `read_records()` calls can pass generators
- These can't be hashed without consuming the iterator. Options include hashing during insertion (breaks the lightweight-hash-before-apply design) or leaving it as a known limitation. To be discussed separately.
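To make the trade-off concrete, here is a sketch of the "hash during insertion" option — a hypothetical wrapper, not an API proposal. It shows exactly why this breaks the lightweight-hash-before-apply design: the digest only exists after the generator has been consumed.

```python
import hashlib
import json
from typing import Iterator


class HashingIterator:
    """Wraps a row generator and hashes rows as they are consumed."""

    def __init__(self, rows: Iterator[dict]):
        self._rows = rows
        self._h = hashlib.sha256()

    def __iter__(self):
        for row in self._rows:
            self._h.update(json.dumps(row, sort_keys=True).encode())
            yield row

    def hexdigest(self) -> str:
        # Only meaningful once the wrapped iterator is exhausted.
        return self._h.hexdigest()


gen = ({"id": i} for i in range(3))
it = HashingIterator(gen)
rows = list(it)          # "insertion" consumes the generator
digest = it.hexdigest()  # hash only available after insertion
```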
Related: #1629