Skip to content

feat(state): segments / tokens iterators (#3)#9

Merged
uqio merged 7 commits into
mainfrom
feat/iterators
May 9, 2026
Merged

feat(state): segments / tokens iterators (#3)#9
uqio merged 7 commits into
mainfrom
feat/iterators

Conversation

@uqio
Copy link
Copy Markdown
Collaborator

@uqio uqio commented May 9, 2026

Closes #3.

Adds two ergonomic iterator entry points to the safe API and the standard adapter trait impls so they compose with flat_map / rev / for ... in &state.

Surface

state.segments_iter()                    // Segments<'_>
seg.tokens_iter()                        // Tokens<'_>

for seg in &state { ... }                // IntoIterator for &State
for tok in seg { ... }                   // IntoIterator for Segment<'a>  (Copy)
for tok in &seg { ... }                  // IntoIterator for &Segment<'a>

state.segments_iter().rev()              // DoubleEndedIterator
seg.tokens_iter().rev()
state.segments_iter().flat_map(|s| s.tokens_iter())  // flat token stream

Both iterators implement Iterator + ExactSizeIterator + DoubleEndedIterator + FusedIterator. Yielded items: Segments::Item = Segment<'a> (borrows from the source &'a State); Tokens::Item = Token (owned, value-typed snapshot, no further borrow).

Lifetime shape

  • Segments<'a> borrows &'a State. Multiple iterators alive simultaneously is sound — the underlying whisper_state is immutable for the borrow's duration (State::full requires &mut self, ruled out by the borrow checker while any &self iterator exists).
  • Tokens<'state> owns a Copy of the Segment (which is itself a thin index + pointer projection, derive-Copy). This makes
    state.segments_iter().flat_map(|seg| seg.tokens_iter())
    compile — the inner iterator does not borrow the closure-local seg value. The 'state lifetime still ties yielded item pointer projections to the parent State.
  • Auto-traits: !Send + !Sync on both, inherited from &State (State is !Sync) / NonNull<whisper_state> (foreign C type with no impl). Matches the existing State threading model.

Performance

Segments::next and Tokens::next inline the construction / pointer projection rather than calling back through State::segment(i) / Segment::token(j). Those accessors do their own n_segments() / n_tokens() FFI bounds-check, but the iterator already captured end from the same call at construction — and the borrow chain pins the underlying state, so the bounds-check is redundant. One FFI call per yielded item saved on Tokens (which dominates: hundreds to low thousands of tokens per state).

Tests

13 new unit tests in whispercpp/src/state.rs's #[cfg(test)] mod tests:

  • segments_iter_empty_state_yields_zero_items
  • segments_iter_count_matches_n_segments
  • segments_iter_exact_size_len_matches_count
  • segments_iter_size_hint_is_exact
  • segments_iter_fused_after_exhaustion
  • multiple_segments_iter_alive_concurrently
  • segments_iter_composes_with_adapters
  • nested_segments_and_tokens_iter_compiles
  • iterator_type_bounds_are_correct
  • tokens_iter_composes_with_flat_map
  • into_iter_for_state_ref_yields_segments
  • into_iter_for_segment_compiles
  • segments_iter_double_ended_compiles_and_empty_yields_none
  • tokens_iter_double_ended_compiles

Two #[cfg(test)] pub(crate) test fixtures make this possible without a model file:

  • Context::dangling_for_test()unsafe fn. Returns a Context with NonNull::dangling(). The unsafe contract: the caller must ensure the value (or any Arc<Context> derived from it) is core::mem::forget'd before its Drop runs (which would otherwise call whisper_free on the dangling pointer).
  • State::poisoned_for_test(ctx: Arc<Context>) — safe. Returns a State with ptr: None (the Drop impl already short-circuits for poisoned states; no FFI runs).

The test fixture (poisoned_state_for_test) wraps the unsafe call in a single unsafe block with a SAFETY comment explaining the Arc-refcount-leak invariant: Arc::new + Arc::clone brings refcount to 2, then mem::forget on the original handle keeps it at ≥1 forever, so Context::drop is unreachable.

The runtime exercise of Tokens requires a real model (the poisoned fixture yields zero segments, so we can't construct a Segment). The compile-time bound assertions (assert_token_iter, assert_dei) cover the type-system contract instead — every adapter the docs claim should compile is type-checked at the call site.

Audit

whispercpp/src/safety_audit.rs gains a segments_iter / tokens_iter row in the per-method coverage matrix, walking all ten safety axes. Highlights:

Verification

  • cargo build -p whispercpp clean (debug + release).
  • cargo test -p whispercpp --lib — 62 passed (was 48, +14 new across the four commits).
  • cargo clippy -p whispercpp -p whispercpp-sys --all-targets -- -D warnings clean.
  • cargo doc -p whispercpp --no-deps clean.

🤖 Generated with Claude Code

@uqio uqio changed the title feat: iterators feat(state): segments / tokens iterators (#3) May 9, 2026
@al8n al8n requested a review from Copilot May 9, 2026 02:41
Adds two iterator entry points to the safe API and the
standard adapter trait impls so they compose with
`flat_map` / `rev` / `for ... in &state`.

Surface
-------

```rust
state.segments_iter()                        // Segments<'_>
seg.tokens_iter()                            // Tokens<'_>

for seg in &state { ... }                    // IntoIterator for &State
for tok in seg { ... }                       // IntoIterator for Segment<'a>
for tok in &seg { ... }                      // IntoIterator for &Segment<'a>

state.segments_iter().rev()                  // DoubleEndedIterator
seg.tokens_iter().rev()
state.segments_iter().flat_map(|s| s.tokens_iter())   // flat token stream
```

Both iterators implement `Iterator + ExactSizeIterator
+ DoubleEndedIterator + FusedIterator`. Yielded items:
`Segments::Item = Segment<'a>` (borrows from the source
`&'a State`); `Tokens::Item = Token` (owned, value-typed
snapshot, no further borrow).

Lifetime shape
--------------

* `Segments<'a>` borrows `&'a State`. Multiple iterators
  alive simultaneously is sound — the underlying
  `whisper_state` is immutable for the borrow's
  duration (`State::full` requires `&mut self`, ruled
  out while any `&self` iterator exists).
* `Tokens<'state>` owns a `Copy` of the `Segment` (which
  is itself a thin index + pointer projection,
  derive-`Copy`). This makes
  `state.segments_iter().flat_map(|seg| seg.tokens_iter())`
  compile — the inner iterator does not borrow the
  closure-local `seg` value. `'state` still ties yielded
  item pointer projections to the parent `State`.
* Auto-traits: `!Send + !Sync` on both, inherited from
  `&State` (State is `!Sync`) / `NonNull<whisper_state>`
  (foreign C type, no impl). Matches the existing State
  threading model.

Performance
-----------

`Segments::next` and `Tokens::next` inline the
construction / pointer projection rather than calling
back through `State::segment(i)` / `Segment::token(j)`.
Those accessors do their own `n_segments()` /
`n_tokens()` FFI bounds-check, but the iterator already
captured `end` from the same call at construction —
and the borrow chain pins the underlying state. One
FFI call per yielded item saved on `Tokens` (which
dominates: hundreds to low thousands of tokens per
state).

Tests
-----

14 new unit tests in `whispercpp/src/state.rs`'s
`#[cfg(test)] mod tests`. Two `#[cfg(test)] pub(crate)`
test fixtures make the iterator tests possible without
a model file:

* `Context::dangling_for_test()` — `unsafe fn` whose
  contract requires the caller to `mem::forget` the
  result (or any `Arc<Context>` derived from it) before
  its `Drop` runs. The unsafe is at the producer; the
  only call site wraps it in `unsafe { ... }` with a
  SAFETY comment explaining the Arc-refcount-leak
  invariant.
* `State::poisoned_for_test(ctx: Arc<Context>)` — safe
  (the Drop impl already short-circuits for poisoned
  states; no FFI runs).

Runtime exercise of `Tokens` requires a real model
(the poisoned fixture yields zero segments, blocking
`Segment` construction). Compile-time bound assertions
cover the type-system contract instead.

Audit
-----

`whispercpp/src/safety_audit.rs` gains a
`segments_iter` / `tokens_iter` row in the per-method
coverage matrix walking all ten safety axes. Inlined
FFI projection (`whisper_full_get_token_data_from_state`)
is documented under axis #1 (throw); the
borrow-chain-pinned bounds-check redundancy is the
soundness argument.

Verification
------------

* `cargo build -p whispercpp` clean (debug + release).
* `cargo test -p whispercpp --lib` — 62 passed (was
  48; +14 new).
* `cargo clippy -p whispercpp -p whispercpp-sys
  --all-targets -- -D warnings` clean.
* `cargo doc -p whispercpp --no-deps` clean.

Closes #3.
Cargo-compatible patch release. Only `whispercpp` bumps —
`whispercpp-sys` stays at 0.2.0 (no native changes since
0.2.0). CHANGELOG documents the iterator additions
(`Segments` / `Tokens` types, `IntoIterator` impls,
`DoubleEndedIterator`), the per-yielded-item FFI savings,
and the new test fixtures.

No breaking changes from 0.2.0.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds ergonomic, adapter-friendly iteration over whisper.cpp decode outputs by introducing segment and token iterator entry points on the safe wrapper types, plus standard IntoIterator impls to enable idiomatic for ... in loops.

Changes:

  • Add State::segments_iter() and Segment::tokens_iter() returning Segments<'_> / Tokens<'_> with ExactSizeIterator + DoubleEndedIterator + FusedIterator.
  • Implement IntoIterator for &State, Segment<'a>, and &Segment<'a> to compose with standard iterator adapters.
  • Add test-only fixtures (Context::dangling_for_test, State::poisoned_for_test) and unit tests validating iterator shape/composability; extend the safety audit.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

File Description
whispercpp/src/state.rs Introduces Segments/Tokens iterators, segments_iter/tokens_iter, IntoIterator impls, plus iterator-focused unit tests and test fixtures.
whispercpp/src/context.rs Adds a #[cfg(test)] unsafe dangling_for_test() constructor used by iterator unit tests without requiring a real model.
whispercpp/src/lib.rs Re-exports the new iterator types (Segments, Tokens) as part of the public surface.
whispercpp/src/safety_audit.rs Documents safety/audit coverage for the new iterator methods and their IntoIterator delegations.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread whispercpp/src/state.rs Outdated
Comment thread whispercpp/src/state.rs Outdated
Comment thread whispercpp/src/context.rs Outdated
uqio added 4 commits May 9, 2026 14:59
Miri's default leak checker correctly flags the
`Arc<Context>` allocation that `poisoned_state_for_test`
deliberately leaks (one ArcInner per fixture call,
forgotten so `Context::drop` can never call
`whisper_free` on the dangling pointer).

Two reasons the existing `context.rs::tests::*` tests
that use the same `mem::forget` pattern don't trip
this: they construct a bare-stack `Context` and forget
the value (no heap allocation), while State's
`ctx: Arc<Context>` field forces the iterator fixture
through `Arc::new` — and the ArcInner is what Miri
reports.

Tag the 12 fixture-using tests with
`#[cfg_attr(miri, ignore = "...")]`. Pure compile-time
tests (`into_iter_for_segment_compiles`,
`tokens_iter_double_ended_compiles`) stay Miri-eligible
since they only touch `PhantomData` + trait-bound
functions and allocate nothing.

Trade-off: Miri loses runtime coverage of the iterator
counters on poisoned state (n_segments == 0; loop body
never runs under Miri or otherwise). The
load-bearing assertions are at compile time
(`Iterator<Item = …>` bounds, lifetime threading for
flat_map composition, ExactSizeIterator /
DoubleEndedIterator impls), and those still get
checked by the regular `cargo test` job.

Doc note added to `poisoned_state_for_test` explaining
the Miri trade-off so future test additions know to
copy the cfg_attr pattern.
Three findings from the Copilot review:

* Replace the helper-pair fixture (`poisoned_state_for_test`
  + `forget_poisoned_state`) with a `PoisonedStateFixture`
  RAII guard. The previous design required every test to
  call `forget_poisoned_state(state)` for cleanup — fragile
  against panic / early-return mid-test. The guard holds a
  `ManuallyDrop<Arc<Context>>` clone in addition to the
  `State`, so the leak invariant
  ("Context refcount stays ≥ 1 forever") survives test
  panics. Tests now read `let fixture =
  PoisonedStateFixture::new(); let state = &*fixture;` with
  no explicit cleanup. Doc on the struct spells out the
  Miri trade-off (still a leak by design; tests tagged
  `#[cfg_attr(miri, ignore)]`).

* Fix the `Segments::next` comment that claimed
  `self.state.ptr` is `pub(super)` — the field is
  actually private and only accessible because
  `Segments` lives in the same module. Comment now
  reflects the real visibility.

* Clarify the `Context::dangling_for_test` doc rationale
  for `unsafe fn` over `ManuallyDrop<Self>`. The previous
  text implied `Arc<ManuallyDrop<…>>` would still drop the
  inner `Context` via Arc's refcount mechanism — that's
  wrong (`ManuallyDrop`'s drop is a no-op, so the UB
  would in fact be prevented). The real reason is
  API-shape: `State::poisoned_for_test` and the
  production code expect `Arc<Context>` not
  `Arc<ManuallyDrop<Context>>`. The
  `PoisonedStateFixture` test guard handles the leak
  invariant via composition (separate `ManuallyDrop`
  field) instead.

62/62 tests pass under standard `cargo test`; clippy +
doc clean.
@al8n al8n requested a review from Copilot May 9, 2026 03:12
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comment thread whispercpp/src/state.rs Outdated
Comment thread CHANGELOG.md
…ANGELOG link

Two findings from the second Copilot review on PR #9.

* `State::poisoned_for_test` was a `#[cfg(test)] pub(crate)
  fn` returning a `State` from a passed-in `Arc<Context>`.
  Since `Context::dangling_for_test` is `unsafe fn` but its
  resulting `Arc<Context>` is a regular safe value, this
  constructor effectively LIFTED the unsafe contract back
  into safe code: a caller could chain
  `Arc::new(unsafe { dangling_for_test() })` →
  `State::poisoned_for_test(arc)` → `drop(state)` and reach
  `Context::drop`'s `whisper_free(NonNull::dangling())`
  through what looked like a safe API.

  Deleted the constructor entirely. `PoisonedStateFixture::new()`
  in the test module now constructs `State { ptr: None, ctx }`
  directly via private-field access (the test submodule is
  a child of `state.rs` and inherits the parent's privacy
  boundary). Collapses the soundness chain into a single
  auditable point; no crate-visible safe API takes a
  potentially-dangling Arc<Context> as input.

* CHANGELOG.md 0.2.1 entry had `[#9]` defined as a
  reference-style link but never used inline. The release
  is about closing issue #3 via PR #9 — both should be
  cited. Added inline references in the section header
  ("Closes [#3] (delivered via PR [#9])") and added the
  matching `[#3]` link def. Also refreshed the Internal
  bullet to mention `PoisonedStateFixture` (replaced the
  stale `State::poisoned_for_test` reference now that the
  function is gone).

62/62 tests pass; clippy + cargo doc clean.
@uqio uqio merged commit 2fe13c5 into main May 9, 2026
15 checks passed
@uqio uqio deleted the feat/iterators branch May 9, 2026 03:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Token-stream iterators: Segment / Token

2 participants