feat: 14/14 pyannote-community-1 parity — fbank port + scipy-LSAP + Neumaier-VBx + bounded-scratch SIMD#7
Merged
Pull request overview
This PR improves pyannote parity for embedding extraction and clustering by replacing the fbank implementation with a torchaudio-compliance port, tightening numerical stability in VBx/cluster initialization, and adding/refreshing parity fixtures and diagnostics to localize drift on longer real-world captures.
Changes:
- Replace the previous fbank backend with a torchaudio.compliance.kaldi.fbank–style implementation (FFT + mel bank) and update dependencies accordingly.
- Restore/strengthen strict parity on long recordings via Neumaier-compensated reductions in VBx, np.unique-equivalent AHC label canonicalization, and a SciPy-compatible rectangular LSAP for constrained assignment tie-breaking.
- Update offline pipeline behavior for pyannote parity (overlap-excluded embedding masks, default smoothing behavior) and expand parity/diagnostic tests + fixtures.
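As a rough illustration of the Neumaier-compensated reductions mentioned in the list above, here is a minimal scalar sketch. The names `neumaier_accumulate`, `neumaier_sum`, and `neumaier_dot` are hypothetical and are not taken from the PR; the use of `f64` throughout is also an assumption. Only the additions are compensated here; the rounding of each product in the dot variant is not.

```rust
/// Neumaier-compensated accumulation over any stream of f64 terms.
/// A second accumulator (`comp`) collects the low-order bits that each
/// floating-point addition would otherwise discard.
fn neumaier_accumulate(terms: impl Iterator<Item = f64>) -> f64 {
    let mut sum = 0.0_f64;
    let mut comp = 0.0_f64;
    for x in terms {
        let t = sum + x;
        if sum.abs() >= x.abs() {
            // Low-order bits of `x` were lost in `t`.
            comp += (sum - t) + x;
        } else {
            // Low-order bits of `sum` were lost in `t`.
            comp += (x - t) + sum;
        }
        sum = t;
    }
    sum + comp
}

/// Compensated sum of a slice (hypothetical helper).
pub fn neumaier_sum(xs: &[f64]) -> f64 {
    neumaier_accumulate(xs.iter().copied())
}

/// Compensated dot product (hypothetical helper). The length check is
/// unconditional so that release builds fail with a descriptive message.
pub fn neumaier_dot(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len(), "neumaier_dot: length mismatch");
    neumaier_accumulate(a.iter().zip(b).map(|(&x, &y)| x * y))
}
```

For the input `[1.0, 1e100, 1.0, -1e100]` a plain left-to-right sum loses both small terms to cancellation, while the compensated version recovers them. That is the kind of drift that matters when a downstream decision compares against a fixed threshold.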
Reviewed changes
Copilot reviewed 36 out of 53 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/parity/hyp.rttm | Adds an RTTM hypothesis artifact for parity/diagnostic workflows. |
| tests/parity/fixtures/10_mrbeast_clean_water/reference.rttm | Adds reference RTTM for the 10_mrbeast_clean_water capture. |
| tests/parity/fixtures/10_mrbeast_clean_water/manifest.json | Adds capture manifest + artifact hashes for 10_mrbeast_clean_water. |
| tests/parity/fixtures/08_luyu_jinjing_freedom/reference.rttm | Adds reference RTTM for the 08_luyu_jinjing_freedom capture. |
| tests/parity/fixtures/08_luyu_jinjing_freedom/manifest.json | Adds capture manifest + artifact hashes for 08_luyu_jinjing_freedom. |
| tests/parity_drift_10.rs | New ignored diagnostic test to measure segmentation/embedding drift vs captured pyannote intermediates. |
| src/reconstruct/rttm_parity_tests.rs | Refines RTTM per-line parity logic (structural equality + bounded duration tolerance) and adds an ignored fixture test. |
| src/reconstruct/parity_tests.rs | Restores strict discrete-grid parity for 06_long_recording (removes prior ignore rationale). |
| src/pipeline/parity_tests.rs | Expands parity/diagnostic coverage for additional captures and adds stage-localization helpers (mostly ignored). |
| src/pipeline/algo.rs | Updates assign_embeddings documentation around deferred speaker-count constraints. |
| src/ops/scalar/mod.rs | Exposes new scalar Neumaier-compensated reduction helpers. |
| src/ops/scalar/kahan.rs | Implements Neumaier-compensated dot/sum with unit tests. |
| src/ops/mod.rs | Re-exports compensated reduction functions via the dispatch layer. |
| src/ops/dispatch/mod.rs | Wires compensated dot/sum into runtime dispatch. |
| src/ops/dispatch/kahan.rs | Adds runtime dispatcher for compensated reductions (NEON fast-path, scalar fallback). |
| src/ops/arch/neon/mod.rs | Exposes NEON compensated dot/sum kernels. |
| src/ops/arch/neon/kahan.rs | Implements NEON Neumaier-compensated dot/sum kernels. |
| src/offline/owned.rs | Changes defaults for smoothing and mirrors pyannote’s overlap-excluded embedding-mask behavior. |
| src/embed/model.rs | Adds a manual Debug impl and an ignored test for AllSilent behavior in weighted embedding. |
| src/embed/fbank.rs | Replaces kaldi-native-fbank usage with a torchaudio-style fbank port (FFT/mel/log + mean-centering) plus SIMD kernels and tests. |
| src/embed/error.rs | Removes the now-obsolete Error::Fbank variant tied to kaldi-native-fbank initialization. |
| src/cluster/vbx/parity_tests.rs | Adds an ignored parity adapter for a longer fixture (10_mrbeast_clean_water). |
| src/cluster/vbx/algo.rs | Replaces key GEMM reductions with Neumaier-compensated dot/sum and introduces packed row-major buffers for stable iteration. |
| src/cluster/spectral.rs | Improves k-means Lloyd iteration by swapping buffers instead of cloning each iteration. |
| src/cluster/mod.rs | Extends compile-time Send/Sync assertions to additional cluster types. |
| src/cluster/hungarian/mod.rs | Adds the new LSAP module to the hungarian cluster submodule. |
| src/cluster/hungarian/lsap.rs | Introduces a SciPy-compatible rectangular LSAP implementation for tie-breaking parity. |
| src/cluster/hungarian/algo.rs | Switches constrained assignment to the new LSAP implementation (behavioral parity on tied costs). |
| src/cluster/ahc/tests.rs | Adjusts unit tests to assert partition-equivalence rather than fixed label values. |
| src/cluster/ahc/parity_tests.rs | Adds ignored parity tests for additional captured fixtures. |
| src/cluster/ahc/algo.rs | Changes label canonicalization to match np.unique(..., return_inverse=True) semantics. |
| scripts/fix_wespeaker_pooling_eps.py | Adds a script to patch WeSpeaker ONNX stats-pooling eps behavior to match PyTorch edge cases. |
| scripts/download-embed-model.sh | Updates the pinned embed-model revision and expected SHA-256. |
| README.md | Updates the pinned embed-model revision and expected SHA-256 in docs. |
| examples/run_owned_pipeline.rs | Updates example to construct the pipeline with explicit OwnedPipelineOptions (new smoothing default). |
| Cargo.toml | Replaces kaldi-native-fbank dependency with realfft for the new fbank implementation. |
Comment on lines +160 to +166
| # Real-valued FFT for the bit-exact torchaudio.compliance.kaldi.fbank | ||
| # port (see `src/embed/torchaudio_fbank.rs`). PyTorch's `torch.fft.rfft` | ||
| # routes to pocketfft on CPU; `realfft` wraps `rustfft`'s | ||
| # Cooley-Tukey radix-2 path which produces the same spectrum within | ||
| # ~1e-7 relative — small enough that the resnet+pooling output stays | ||
| # within sub-ULP of pyannote on the 14-audio bench. | ||
| realfft = "3" |
Comment on lines +386 to +390
| /// the diagnostic test below — so a mismatch here isolates dia's | ||
| /// Hungarian (`pathfinding::kuhn_munkres`) tie-breaking from scipy's | ||
| /// (`scipy.optimize.linear_sum_assignment` / LAPJV). | ||
| #[test] | ||
| #[ignore = "isolates Hungarian tie-breaking divergence using captured 10_mrbeast_clean_water soft_clusters"] |
Comment on lines +3 to +6
| //! Direct Rust port of scipy's `rectangular_lsap.cpp` (BSD-3, Crouse's | ||
| //! shortest augmenting path; PM Larsen). The implementation is based | ||
| //! on: | ||
| //! |
The `neon-native` job's step name `Run fbank + ops:: tests on arm64 (NEON dispatched)` contained an unquoted `:: ` sequence that YAML treated as a nested mapping value indicator. The whole workflow file failed to parse and every CI job was skipped (`This run likely failed because of a workflow file issue` with 0 jobs in the API response). Quoting the step name resolves the parse.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment on lines +94 to +100
| // Validate after transpose/negate so the rejection mirrors scipy | ||
| // (which also checks the working copy). | ||
| for &v in working.iter() { | ||
| if v.is_nan() || v == f64::NEG_INFINITY { | ||
| return Err(crate::cluster::hungarian::error::NonFiniteError::InfInSoftClusters.into()); | ||
| } | ||
| } |
Comment on lines +10 to +18
| /// Marked `#[non_exhaustive]` so callers must include a `_ =>` arm in | ||
| /// any `match`. Variants in this enum represent low-level numerical / | ||
| /// boundary conditions (NaN/inf inputs, shape drift, ORT failure, …) | ||
| /// and the set evolves as new failure modes are surfaced or as | ||
| /// internal kernels stop being able to produce a given variant. The | ||
| /// attribute lets us add or retire variants without it being a | ||
| /// semver-breaking change for downstream exhaustive matchers. | ||
| #[derive(Debug, Error)] | ||
| #[non_exhaustive] |
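To make the `_ =>` requirement in this doc comment concrete, here is a small sketch. `EmbedError` and `describe` are hypothetical stand-ins, and the `ShapeMismatch` variant is invented for illustration. Note that `#[non_exhaustive]` is only enforced for code in other crates; inside the defining crate the wildcard arm is merely permitted.

```rust
/// Hypothetical stand-in for a `#[non_exhaustive]` error enum.
#[derive(Debug)]
#[non_exhaustive]
pub enum EmbedError {
    NonFiniteOutput,
    ShapeMismatch { expected: usize, got: usize },
    AllSilent,
}

/// Maps an error to a coarse description.
/// A downstream crate could not compile this `match` without the final
/// wildcard arm, because new variants may be added later.
pub fn describe(err: &EmbedError) -> &'static str {
    match err {
        EmbedError::NonFiniteOutput => "non-finite model output",
        EmbedError::ShapeMismatch { .. } => "shape mismatch",
        // Covers `AllSilent` today and any variant added in the future.
        _ => "other embedding error",
    }
}
```

The trade-off is that downstream callers lose compile-time notification when a new variant appears, in exchange for the crate being free to add or retire variants.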
Three fixes converging on the failed CI jobs against fix/deep-review:
1. clippy `needless_return` (12 errors): the cfg-gated SIMD dispatch
inside `apply_window_inplace` / `power_spectrum` /
`fma_dot_f32_to_f64` ends each per-arch block with `return;` so a
non-arch-matched fallback can't execute, but on any single arch
only one block compiles and the trailing `return` looks needless.
Allow the lint at the function level on all three dispatchers.
Also drop the `into_iter()` on `col_ind` in
`cluster::hungarian::algo::assign_one`, replace `&mut u, &mut v`
with `&u, &v` in `lsap::augmenting_path` (function takes `&[f64]`),
convert three `for i in 0..n { … xs[i] … }` loops to
`iter_mut().enumerate().take(n)` form, switch
`EPSILON: f32 = 1.1920928955078125e-07` to `f32::EPSILON` (literal
had excessive precision).
2. miri-tb-i686 / miri-sb-riscv64gc `function … is never used`
errors on `make_test_inputs` and `assert_dot_within_tol`: those
helpers are only consumed by the `target_arch = "aarch64"` /
`target_arch = "x86_64"` direct-backend tests, so on i686 / riscv
every consumer is cfg-excluded and the helpers become dead code
under `-Dwarnings`. Cfg-gate the helpers to match.
Verified: `cargo test --no-default-features` 505 passed, `cargo
clippy --no-default-features --features _bench -- -Dwarnings` clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
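The two ideas in this commit message, an early `return;` out of a per-arch block and cfg-gating a helper to match its only consumers, can be sketched roughly as follows. This is not the PR's code: `apply_window`, `window_scalar`, and `window_chunked` are hypothetical names, and the "fast" path is a plain chunked loop standing in for real intrinsics. In this variant the `return;` sits inside a runtime feature check, so it is never the function's trailing statement.

```rust
/// Scalar fallback: multiplies each sample by its window coefficient.
fn window_scalar(frame: &mut [f32], window: &[f32]) {
    for (x, &w) in frame.iter_mut().zip(window) {
        *x *= w;
    }
}

/// Stand-in for an arch-specific kernel (no real intrinsics here).
/// Gated to the same targets as its callers so that other targets do not
/// compile an unused function, which would fail under `-Dwarnings`.
#[cfg(any(target_arch = "x86_64", target_arch = "aarch64"))]
fn window_chunked(frame: &mut [f32], window: &[f32]) {
    let mut fc = frame.chunks_exact_mut(4);
    let mut wc = window.chunks_exact(4);
    for (xs, ws) in (&mut fc).zip(&mut wc) {
        for (x, &w) in xs.iter_mut().zip(ws) {
            *x *= w;
        }
    }
    // Handle the tail that does not fill a full chunk of four.
    for (x, &w) in fc.into_remainder().iter_mut().zip(wc.remainder()) {
        *x *= w;
    }
}

/// Dispatcher: each per-arch block returns early when its runtime check
/// passes, so the scalar fallback below runs only when no block matched.
pub fn apply_window(frame: &mut [f32], window: &[f32]) {
    assert_eq!(frame.len(), window.len(), "apply_window: length mismatch");
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            window_chunked(frame, window);
            return;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            window_chunked(frame, window);
            return;
        }
    }
    window_scalar(frame, window);
}
```

Because an element-wise multiply has no accumulation order, both paths produce identical results, which makes this kind of kernel easy to cross-check against the scalar fallback.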
Four findings from #7 (review):
1. `ops::scalar::kahan::kahan_dot` doc said "panics in debug only" via `debug_assert_eq!`, but the loop already indexes `b[i]` for `i in 0..a.len()`, so release builds panic on the bounds check anyway, just with a less-descriptive message. Promote to unconditional `assert_eq!` (matching `ops::dispatch::dot`'s public contract) and update the doc accordingly.
2. `cluster::hungarian::lsap::linear_sum_assignment` rejected NaN and `-inf` but let `+inf` through under `maximize=false` (in-tree `constrained_argmax` always passes `maximize=true` so the existing `+inf` boundary check + negation made this safe in-pipeline, but a future direct caller could trip the dual-update arithmetic). Change the validation to `!v.is_finite()` so any non-finite is caught regardless of orientation.
3. `NonFiniteError::InfInSoftClusters` error message claimed "+inf or -inf" but the LSAP layer also rejects NaN. Update the message to "+inf, -inf, or NaN" so the surfaced error matches the actual rejection criteria. Variant name is preserved for backward compatibility.
4. `embed::Error` `#[non_exhaustive]` + `Error::Fbank` removal is a source-breaking API change. The crate is unpublished 0.1.0 with zero downstream consumers (so the break is theoretical), but document both changes explicitly under `BREAKING (pre-1.0)` in `CHANGELOG.md`'s `# UNRELEASED` section so future readers can trace the API delta.
Verified: `cargo clippy --no-default-features --features _bench -- -Dwarnings` clean; `RUSTFLAGS="-Dwarnings" cargo test --features ort,bundled-segmentation` 532 lib tests + 1 integration pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
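The validation change in finding 2 can be sketched as below. `NonFiniteCost`, `validate_costs`, and `old_check_rejects` are hypothetical names used only for illustration; the old predicate is reproduced from the reviewed snippet (`v.is_nan() || v == f64::NEG_INFINITY`) so the gap is visible.

```rust
/// Hypothetical error carrying the flat index of the first bad entry.
#[derive(Debug, PartialEq, Eq)]
pub struct NonFiniteCost {
    pub index: usize,
}

/// Previous predicate: rejects NaN and -inf, but lets +inf through.
pub fn old_check_rejects(v: f64) -> bool {
    v.is_nan() || v == f64::NEG_INFINITY
}

/// Stricter validation: any NaN, +inf, or -inf entry is rejected,
/// independent of whether the caller minimizes or maximizes.
pub fn validate_costs(costs: &[f64]) -> Result<(), NonFiniteCost> {
    match costs.iter().position(|v| !v.is_finite()) {
        Some(index) => Err(NonFiniteCost { index }),
        None => Ok(()),
    }
}
```

Checking `is_finite()` once up front keeps the solver's inner loop free of special cases, at the cost of one extra pass over the cost matrix.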
uqio added a commit that referenced this pull request on May 9, 2026:
…eumaier-VBx + bounded-scratch SIMD (#7)
Codecov Report❌ Patch coverage is
Summary
End-to-end pyannote 4.0.4 community-1 parity across the full 14-audio bench: 6 in-repo fixtures + 8 testaudioset clips (07–14) all match speaker count + segment count exactly, DER = 0.0000 on every audio. The previous spurious 4th cluster on the 23.6-min Mandarin interview `08_luyu_jinjing_freedom` (root-cause: ~2.4e-4 f32 fbank drift amplified through ResNet34 to 0.66 abs embedding error) is fixed by an in-tree port of `torchaudio.compliance.kaldi.fbank` that brings worst-case embedding drift down to 1.018e-5.
Major changes
Fbank: in-tree torchaudio port (replacing `kaldi-native-fbank`)
- `src/embed/fbank.rs` (~1.6k LOC): bit-near-exact port of `torchaudio.compliance.kaldi.fbank`. Pipeline: strided frames → DC offset → preemphasis → Hamming window → zero-pad to 512 → realfft (radix-2 r2c) → power spectrum (`(re²+im²).sqrt()` then square, which bit-for-bit matches torchaudio's `complex.abs().pow(2)`) → mel filterbank (80 triangular bins, 20 Hz → Nyquist) → `log(max(EPSILON, x))`.
- `kaldi-native-fbank` dependency dropped.
- `OnceLock` (mel filterbank, Hamming window).
- `FftScratch` for the FFT plan + scratch Vecs; bounded retention via `SCRATCH_RETAIN_LIMIT = 256K f32` so a one-shot 1-hour clip can't pin hundreds of MB per worker thread.
- […] `compute_fbank` so the public fixed-shape API never feeds more than `FBANK_FRAMES * shift + window` samples to the kernel.
- […] `f32::max` so internally-overflowed FFT inputs flow to the embed model's `Error::NonFiniteOutput` check rather than silently flooring).
SIMD kernels
Four backends for the dominant mel-matmul dot product, runtime-dispatched via `crate::ops::{neon,avx2,avx512}_available`:
f64 accumulator (not f32-BLAS-sgemm-literal): empirically the choice that holds 14/14 parity. f32-literal-contract regressed `09_mrbeast_dollar_date` 8/468 → 8/470 in iteration; documented in the kernel header.
NEON-only window-mul + power-spectrum kernels (smaller hot spots).
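The power-spectrum step named in the pipeline description, magnitude first and then square, can be sketched in scalar form as follows. The function names and the `(re, im)` tuple layout are assumptions for illustration; the PR's kernels operate on the FFT output directly. The direct form `re² + im²` is included only for contrast, since the extra square root and multiply each round once and the two forms can differ in the last bit.

```rust
/// Power of one complex bin as |z| first, then squared.
/// This mirrors the order of operations of `complex.abs().pow(2)`.
pub fn power_abs_then_square(re: f32, im: f32) -> f32 {
    let mag = (re * re + im * im).sqrt();
    mag * mag
}

/// Direct form, shown for comparison only.
pub fn power_direct(re: f32, im: f32) -> f32 {
    re * re + im * im
}

/// Applies the abs-then-square form to a slice of (re, im) bins.
pub fn power_spectrum(bins: &[(f32, f32)]) -> Vec<f32> {
    bins.iter()
        .map(|&(re, im)| power_abs_then_square(re, im))
        .collect()
}
```

When bit-level agreement with a reference implementation is the goal, reproducing its order of operations matters more than picking the mathematically shortest expression.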
All four dot kernels have direct-call tests (`dot_{neon,sse2,avx2,avx512}_agrees_with_scalar_directly`) behind runtime feature guards so backends not selected by the host dispatcher (e.g. SSE2 on an AVX-512 chip) still get exercised.
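As a scalar illustration of the accumulator choice in this section, the sketch below contrasts an f32 accumulator with an f64 accumulator over the same f32 inputs, with an unconditional length guard on the f64 version. `dot_f32_acc` and `dot_f64_acc` are hypothetical names; the PR's kernels are SIMD and are not reproduced here.

```rust
/// Dot product that accumulates in f32 (shown for contrast).
pub fn dot_f32_acc(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "dot_f32_acc: length mismatch");
    let mut acc = 0.0_f32;
    for (&x, &y) in a.iter().zip(b) {
        acc += x * y;
    }
    acc
}

/// Dot product over f32 inputs with an f64 accumulator.
/// The guard is `assert_eq!`, not `debug_assert_eq!`, so a length
/// mismatch also panics in release builds with a descriptive message.
pub fn dot_f64_acc(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "dot_f64_acc: length mismatch");
    let mut acc = 0.0_f64;
    for (&x, &y) in a.iter().zip(b) {
        // Widen before multiplying so the product is exact in f64.
        acc += f64::from(x) * f64::from(y);
    }
    acc
}
```

With inputs `[1e8, 1.0, -1e8]` against all ones, the f32 accumulator absorbs the `1.0` into `1e8` and loses it, while the f64 accumulator keeps it. In an unsafe SIMD body the same guard is what keeps raw-pointer loads in bounds, so making it unconditional is a safety property rather than a style choice.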
Length-mismatch guards are unconditional `assert_eq!` (not `debug_assert_eq!`) because the unsafe SIMD bodies do raw-pointer loads bounded only by `a.len()`. Each guard is cross-tested with `#[should_panic(expected = "fma_dot_f32_to_f64")]`.
Cluster + assignment
- […] (`src/cluster/hungarian/lsap.rs`, ~360 LOC): direct port of SciPy's `rectangular_lsap.cpp` (Crouse / LAPJV; PM Larsen). Replaces `pathfinding::kuhn_munkres`, whose tie-break diverged from scipy on tied optima (`pathfinding` and `ordered-float` deps removed). Tie-break now matches scipy bit-for-bit. BSD-3-Clause attribution added to `NOTICE` and `Cargo.toml` SPDX.
- […] (`src/ops/scalar/kahan.rs`, `src/ops/arch/neon/kahan.rs`, `src/ops/dispatch/kahan.rs`). Critical for long-recording numerical stability where AHC dendrogram cuts at `<= threshold` are sensitive to sub-ulp drift.
- `fcluster` labels are remapped to first-occurrence order, matching pyannote's `np.unique(fcluster - 1, return_inverse=True)` semantics.
- […] flipped to `None` to match community-1 semantics.
CI safety net
- `neon-native` job pinned to `ubuntu-24.04-arm`; runs `ops::` + `embed::fbank::tests` + parity tests with `--cfg diarization_assert_neon` so a runner-image regression that hides NEON fails the build instead of silently routing through scalar.
- `sanitizer.sh` extended to include `embed::fbank::tests`.
- […] (`shrink_*`, panic guards, scalar-dispatch agreement): rustfft's default planners use SIMD intrinsics Miri can't evaluate.
- […] into pure helpers (`shrink_scratch_before_resize`, `shrink_scratch_after_loop`) with 5 Miri-safe direct branch tests.
- `neon-native` wired into `coverage.needs` so it blocks the aggregate gate alongside the AVX SDE / sanitizer / miri lanes.
- `tests/parity_fixtures_endtoend.rs` runs dia end-to-end on every `tests/parity/fixtures/*/clip_16k.wav` and pins `(speakers, segments)` against the captured pyannote 4.0.4 reference. `#[ignore]`-gated (loads the WeSpeaker ONNX model + ~26 min runtime); CI workflow integration is a separate workstream.
- `compares_against_torchaudio_inline_chirp_snapshot` exercises the full kernel pipeline against torchaudio reference values inline (no external fixtures): the in-CI parity gate.
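The bounded scratch retention that the `shrink_*` helpers above are tested for can be sketched as a pure function, which is also the shape that keeps it easy to test without running an FFT. Everything here is an assumption for illustration: `RETAIN_LIMIT`, `shrink_if_oversized`, and `with_scratch` are hypothetical names, and reading "256K f32" as 256 * 1024 elements is a guess.

```rust
/// Assumed retention cap, in f32 elements (hypothetical value).
pub const RETAIN_LIMIT: usize = 256 * 1024;

/// Pure helper: if the buffer grew past `limit`, drop its contents and
/// release the excess capacity. Small buffers are left untouched so the
/// common case keeps reusing its allocation.
pub fn shrink_if_oversized(buf: &mut Vec<f32>, limit: usize) {
    if buf.capacity() > limit {
        buf.clear();
        buf.shrink_to(limit);
    }
}

/// Runs `f` on a zeroed scratch slice of `needed` elements, then bounds
/// what the long-lived buffer retains after the call.
pub fn with_scratch<R>(
    scratch: &mut Vec<f32>,
    needed: usize,
    f: impl FnOnce(&mut [f32]) -> R,
) -> R {
    scratch.clear();
    scratch.resize(needed, 0.0);
    let out = f(scratch.as_mut_slice());
    shrink_if_oversized(scratch, RETAIN_LIMIT);
    out
}
```

A single oversized request still succeeds, but the buffer does not stay pinned at that size afterwards, so a long-lived worker thread's footprint is bounded by the cap rather than by the largest input it has ever seen.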
Parity proof
Full 14-fixture e2e bench (`cargo test --release --test parity_fixtures_endtoend --features ort,bundled-segmentation -- --ignored --test-threads=1`):
DER vs pyannote 4.0.4 reference RTTMs: 0.0000 on all 14 audios.
Breaking (pre-1.0)
- `diarization::embed::Error` is now `#[non_exhaustive]`. Callers with exhaustive `match` arms must add a `_ =>` wildcard.
- `diarization::embed::Error::Fbank(String)` variant removed (was tied to the previous kaldi-native-fbank `Result<_, String>` boundary; no longer constructible).
Crate is unpublished 0.1.0 → no downstream consumers to break. Both items called out in `CHANGELOG.md` under `# UNRELEASED`.
Test plan
- `cargo test --release --features ort,bundled-segmentation`: 532 lib + 1 integration tests pass under `RUSTFLAGS="-Dwarnings"`
- `cargo clippy --no-default-features --features _bench -- -Dwarnings` clean
- […] `RUSTFLAGS="--cfg diarization_force_scalar" cargo test --no-default-features -- ops:: embed::fbank::tests::*`)
- `compares_against_torchaudio_inline_chirp_snapshot` always-on vs torchaudio reference values inline
- `neon-native`, AVX2-SDE, AVX-512-SDE, sanitizer, miri-tb, miri-sb across all targets, gating the merge
🤖 Generated with Claude Code