Skip to content

[ENH] Add basic maxscore writer/reader#6825

Open
HammadB wants to merge 10 commits intohammad/sparse_posting_blockfrom
hammad/maxscore_writer_reader
Open

[ENH] Add basic maxscore writer/reader#6825
HammadB wants to merge 10 commits intohammad/sparse_posting_blockfrom
hammad/maxscore_writer_reader

Conversation

@HammadB
Copy link
Copy Markdown
Collaborator

@HammadB HammadB commented Apr 5, 2026

Description of changes

This is PR #2 of a series unbundled from the hammad/sparse-maxscore-prototype branch. Stacked on hammad/sparse_posting_block which adds SparsePostingBlock and its blockstore integration.

This PR adds the writer, reader, cursor, and eager-mode BlockMaxMaxScore query implementation for sparse posting lists. Subsequent PRs will add lazy I/O cursors, a 3-batch pipeline, SIMD budget pruning, and segment-level integration.

  • New functionality
    • BlockSparseWriter: Accumulates per-dimension deltas in a DashMap, merges with an optional previous BlockSparseReader on commit, re-chunks entries into fixed-size blocks (default 1024), and writes both data blocks and a directory block per dimension. Supports incremental add/delete.
    • BlockSparseReader: Reads posting blocks per dimension via the blockfile prefix API. Provides open_cursor() for single-dimension iteration and a full query() method implementing the windowed BlockMaxMaxScore algorithm.
    • PostingCursor: Eager cursor backed by fully decompressed SparsePostingBlocks. Supports advance() with roaring bitmap masks, drain_essential() for Phase 1 window accumulation, score_candidates() for Phase 2 non-essential merge-join, and window_upper_bound() for per-window essential/non-essential repartitioning.
    • query() — BlockMaxMaxScore: Windowed (4096-slot) block-max MaxScore implementation with essential/non-essential term partitioning, bitmap-tracked flat accumulator, budget-based candidate pruning, and min-heap top-k extraction.
    • Two-phase rescoring: SparseRescorer trait and rescore_and_select() for oversampled retrieval with exact-score refinement. Not currently used.
    • Design doc (maxscore.md): Documents the algorithm, data layout, and per-phase query walkthrough. High level for posterity.

Test plan

11 integration test files covering the writer, reader, cursor, and query engine:

  • ms_01_blockfile_roundtrip — Serialize/deserialize SparsePostingBlock through blockfile write→flush→read.

  • ms_02_writer_basic — Write sparse vectors, commit, read back via reader.

  • ms_03_writer_incremental — Incremental updates: add, delete, overwrite across commits.

  • ms_04_writer_edge_cases — Single-entry dimensions, high-cardinality dimensions, empty deltas.

  • ms_05_cursorPostingCursor advance, drain_essential, score_candidates, window_upper_bound.

  • ms_06_correctness — End-to-end query() results match brute-force dot-product.

  • ms_07_masksquery() with Include/Exclude roaring bitmap masks.

  • ms_08_edge_cases — Empty index, k=0, missing dimensions, single-doc queries.

  • ms_09_recall — Recall@k measurement against brute-force on randomized data.

  • ms_10_incremental_query — Query correctness after incremental writer updates.

  • ms_11_vectorized_scoring — Verifies drain_essential and score_candidates accumulator arithmetic.

  • Tests pass locally with cargo test

Migration plan

No migrations needed. This is a new index type with no existing on-disk data. Segment-level wiring is deferred to a later PR.

Observability plan

No new instrumentation in this PR. Tracing spans and metrics (block skip rate, essential/non-essential term counts, per-window candidate counts) will be added alongside segment integration in a follow-up PR.

Documentation Changes

No user-facing API changes. Internal design is documented in rust/index/src/sparse/maxscore.md.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 5, 2026

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

Copy link
Copy Markdown
Collaborator Author

HammadB commented Apr 5, 2026

@HammadB HammadB mentioned this pull request Apr 5, 2026
2 tasks
propel-code-bot[bot]

This comment was marked as outdated.

@HammadB HammadB mentioned this pull request Apr 5, 2026
1 task
@HammadB HammadB changed the base branch from hammad/sparse_posting_block to graphite-base/6825 April 6, 2026 16:51
propel-code-bot[bot]

This comment was marked as outdated.

@HammadB HammadB force-pushed the hammad/maxscore_writer_reader branch from 9d41c2f to f67eb32 Compare April 6, 2026 21:57
@HammadB HammadB force-pushed the graphite-base/6825 branch from 0a24242 to 7e9e2b9 Compare April 6, 2026 21:57
@HammadB HammadB changed the base branch from graphite-base/6825 to hammad/sparse_posting_block April 6, 2026 21:57
propel-code-bot[bot]

This comment was marked as outdated.

@HammadB HammadB changed the base branch from hammad/sparse_posting_block to graphite-base/6825 April 9, 2026 15:42
@HammadB HammadB force-pushed the hammad/maxscore_writer_reader branch from f67eb32 to 2fed177 Compare April 9, 2026 18:03
@HammadB HammadB changed the base branch from graphite-base/6825 to hammad/sparse_posting_block April 9, 2026 18:03
@Sicheng-Pan Sicheng-Pan force-pushed the hammad/maxscore_writer_reader branch from 2fed177 to ffb0104 Compare April 9, 2026 19:13
@Sicheng-Pan Sicheng-Pan force-pushed the hammad/sparse_posting_block branch from 5f52182 to 789dabf Compare April 9, 2026 19:13
@blacksmith-sh

This comment has been minimized.

Comment thread rust/index/src/sparse/maxscore.rs Outdated
@Sicheng-Pan Sicheng-Pan force-pushed the hammad/maxscore_writer_reader branch from e71b3f4 to 6b0e4e6 Compare April 9, 2026 23:09
@Sicheng-Pan Sicheng-Pan marked this pull request as ready for review April 10, 2026 00:04
@propel-code-bot
Copy link
Copy Markdown
Contributor

propel-code-bot Bot commented Apr 10, 2026

Add MaxScore sparse posting writer/reader with eager BlockMaxMaxScore query path

This PR introduces a new sparse indexing/query path centered on rust/index/src/sparse/maxscore.rs, including MaxScoreWriter, MaxScoreReader, PostingCursor, MaxScoreFlusher, and a windowed eager query() implementation of BlockMaxMaxScore. The writer supports incremental set/delete updates, merges against a prior snapshot, rewrites posting blocks with directory metadata, and includes a suffix-rewrite optimization to avoid full-dimension rewrites when only tail ranges are affected. The reader can open per-dimension cursors and execute top-k scoring with essential/non-essential partitioning, bitmap-tracked window accumulation, budget pruning, and heap-based selection.

The PR also extracts reusable scoring primitives into rust/index/src/sparse/types.rs (Score, TopKHeap), wires the new module in rust/index/src/sparse/mod.rs, adds a design document at rust/index/src/sparse/maxscore.md, and adds broad integration coverage under rust/index/tests/maxscore/ (11 test files) for serialization roundtrip, writer behavior, incremental updates, cursor math, mask handling, correctness vs brute force, recall, and edge cases.

This summary was automatically generated by @propel-code-bot

@Sicheng-Pan Sicheng-Pan force-pushed the hammad/maxscore_writer_reader branch from 6b0e4e6 to d57413d Compare April 10, 2026 00:08
@Sicheng-Pan Sicheng-Pan mentioned this pull request Apr 10, 2026
1 task
@Sicheng-Pan Sicheng-Pan force-pushed the hammad/maxscore_writer_reader branch from d57413d to 9941a72 Compare April 10, 2026 03:11
@Sicheng-Pan Sicheng-Pan mentioned this pull request Apr 10, 2026
5 tasks
@Sicheng-Pan Sicheng-Pan force-pushed the hammad/sparse_posting_block branch from 5f15fc9 to c2105e3 Compare April 10, 2026 03:26
@Sicheng-Pan Sicheng-Pan force-pushed the hammad/maxscore_writer_reader branch from 9941a72 to 0dec95c Compare April 10, 2026 03:26
@Sicheng-Pan Sicheng-Pan force-pushed the hammad/maxscore_writer_reader branch from 6ad7e5c to be9d78b Compare April 10, 2026 20:11
@Sicheng-Pan Sicheng-Pan force-pushed the hammad/sparse_posting_block branch from c2105e3 to 36bf255 Compare April 10, 2026 20:11
Copy link
Copy Markdown
Contributor

@propel-code-bot propel-code-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review found no issues; the enhancement appears solid, well-tested, and ready to merge.

Status: No Issues Found | Risk: Low

Review Details

📁 21 files reviewed | 💬 0 comments

@Sicheng-Pan Sicheng-Pan force-pushed the hammad/sparse_posting_block branch from 36bf255 to df84501 Compare April 13, 2026 17:56
@Sicheng-Pan Sicheng-Pan force-pushed the hammad/maxscore_writer_reader branch from 4c6a02c to e75da84 Compare April 13, 2026 17:56
@Sicheng-Pan Sicheng-Pan force-pushed the hammad/sparse_posting_block branch from df84501 to ad84956 Compare April 13, 2026 18:19
@Sicheng-Pan Sicheng-Pan force-pushed the hammad/maxscore_writer_reader branch from e75da84 to 3faeec2 Compare April 13, 2026 18:19
@Sicheng-Pan Sicheng-Pan force-pushed the hammad/maxscore_writer_reader branch from 3faeec2 to f3ea978 Compare April 14, 2026 22:04
@Sicheng-Pan Sicheng-Pan force-pushed the hammad/sparse_posting_block branch from ad84956 to 894e694 Compare April 14, 2026 22:04
Comment thread rust/index/src/sparse/maxscore.rs Outdated
///
/// Used by the suffix-rewrite optimization in `MaxScoreWriter::commit()`
/// to avoid loading blocks before the first affected offset.
pub async fn get_posting_blocks_from(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this can named something like get_posting_blocks_range(start_seq, end_seq). Current name is a bit unclear

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed

Comment thread rust/index/src/sparse/maxscore.rs Outdated
if let Some(ref directory) = old_directory {
let old_block_count = directory.num_blocks() as u32;
let old_dir_part_count = if let Some(ref reader) = self.old_reader {
reader.count_directory_parts(encoded_dim).await? as u32
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure how much perf concern this is - both 1. reader.count_directory_parts() and 2. reader.get_directory() do a get_prefix call which could result in fetching blocks from s3. If there is cache churning after 1 and before 2 then 2 will also fetch from s3 the same blocks that 1 fetched. Just for getting a count. Is it possible to combine these two into one call maybe? Ignore this comment if it won't matter in practice

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simplified

#[derive(Clone)]
pub struct MaxScoreWriter<'me> {
block_size: u32,
delta: Arc<DashMap<u32, DashMap<u32, Option<f32>>>>,
Copy link
Copy Markdown
Contributor

@sanketkedia sanketkedia Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might also not matter much in practice - Arc<DashMap<u32, DashMap<u32, Option<f32>>>> v/s Arc<DashMap<u32, AsyncPartitionedMutex<Vector<(u32, Option<f32>)>>> might be more friendly for cpu caches. We don't need the inner hash map because different threads will be operating on different doc ids so they can simply append to a vector

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably not a concern yet. the slow part is the commit

posting_reader: BlockfileReader<'me, u32, SparsePostingBlock>,
}

impl<'me> MaxScoreReader<'me> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming nit: will it be useful to segregate the getters (get_* ) methods into owned v/s non owned ones by naming them suitably?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are the non-owned ones?


let mut terms: Vec<TermState> = Vec::new();
for (idx, result) in cursor_results.into_iter().enumerate() {
let Some(mut cursor) = result? else {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we ignore the error here

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is propagated up?

}
if mask.contains(doc) {
let idx = (doc - window_start) as usize;
bitmap[idx >> 6] |= 1u64 << (idx & 63);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a comment explaining this bitwise arithmetic would be useful for future readers

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added comment

@Sicheng-Pan Sicheng-Pan force-pushed the hammad/sparse_posting_block branch from 894e694 to 35663b7 Compare April 24, 2026 00:02
@Sicheng-Pan Sicheng-Pan force-pushed the hammad/maxscore_writer_reader branch from f3ea978 to 116fb2a Compare April 24, 2026 00:02
@Sicheng-Pan Sicheng-Pan force-pushed the hammad/maxscore_writer_reader branch from 116fb2a to d65311f Compare April 24, 2026 02:05
@blacksmith-sh

This comment has been minimized.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants