feat: 14/14 pyannote-community-1 parity — fbank port + scipy-LSAP + Neumaier-VBx + bounded-scratch SIMD by uqio · Pull Request #7 · Findit-AI/diarization

uqio · 2026-05-08T12:28:30Z

Summary

End-to-end pyannote 4.0.4 community-1 parity across the full
14-audio bench: 6 in-repo fixtures + 8 testaudioset clips (07–14)
all match speaker count + segment count exactly, DER = 0.0000
on every audio. The previous spurious 4th cluster on the 23.6-min
Mandarin interview 08_luyu_jinjing_freedom (root-cause: ~2.4e-4
f32 fbank drift amplified through ResNet34 to 0.66 abs embedding
error) is fixed by an in-tree port of torchaudio.compliance.kaldi.fbank
that brings worst-case embedding drift down to 1.018e-5.

Major changes

Fbank: in-tree torchaudio port (replacing `kaldi-native-fbank`)

New src/embed/fbank.rs (~1.6k LOC): a near-bit-exact port of
torchaudio.compliance.kaldi.fbank. Pipeline: strided frames →
DC offset → preemphasis → Hamming window → zero-pad to 512 →
realfft (radix-2 r2c) → power spectrum ((re²+im²).sqrt() then
square — bit-for-bit matches torchaudio's complex.abs().pow(2))
→ mel filterbank (80 triangular bins, 20 Hz → Nyquist) →
log(max(EPSILON, x)).
kaldi-native-fbank dependency dropped.
Cached resources via OnceLock (mel filterbank, Hamming window).
Thread-local FftScratch for the FFT plan + scratch Vecs;
bounded retention via SCRATCH_RETAIN_LIMIT = 256K f32 so a
one-shot 1-hour clip can't pin hundreds of MB per worker thread.
Centered-input cropping in compute_fbank so the public
fixed-shape API never feeds more than FBANK_FRAMES * shift + window samples to the kernel.
NaN-propagating log floor (manual cmp instead of f32::max so
internally-overflowed FFT inputs flow to the embed model's
Error::NonFiniteOutput check rather than silently flooring).
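The NaN-propagating log floor in the last bullet can be illustrated with a minimal sketch. The function names here are hypothetical and not taken from `src/embed/fbank.rs`; the only assumption carried over from the PR is that the floor is `f32::EPSILON`. `f32::max` returns the non-NaN operand when one side is NaN, so a NaN input is silently floored; a manual `<` comparison is false for NaN, so the NaN reaches `ln` and survives.

```rust
/// Assumed floor value (the PR's clippy fix mentions `f32::EPSILON`).
const EPSILON: f32 = f32::EPSILON;

/// Floor via `f32::max`: if `x` is NaN, `max` returns `EPSILON`,
/// so the NaN is hidden and a finite log value comes out.
fn log_floor_max(x: f32) -> f32 {
    x.max(EPSILON).ln()
}

/// Floor via a manual comparison: `NaN < EPSILON` is false,
/// so NaN falls through to `ln` and stays NaN.
fn log_floor_nan_propagating(x: f32) -> f32 {
    if x < EPSILON {
        EPSILON.ln()
    } else {
        x.ln()
    }
}
```

For finite inputs the two functions agree; they differ only on NaN, which is the case a downstream non-finite check is meant to catch.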

SIMD kernels

Four backends for the dominant mel-matmul dot product, runtime-
dispatched via crate::ops::{neon,avx2,avx512}_available:

Arch	Lanes (f32 mul)	Lanes (f64 acc)
NEON	4	2
SSE2	4	2
AVX2 + FMA	8	4
AVX-512F	16	8

The accumulator is f64 rather than a literal f32 BLAS sgemm
contract; empirically this is the choice that holds 14/14 parity.
During iteration, the f32-literal contract regressed
09_mrbeast_dollar_date from 8 speakers / 468 segments to 8 / 470;
this is documented in the kernel header.

NEON-only window-mul + power-spectrum kernels (smaller hot spots).
All four dot kernels have direct-call tests
(dot_{neon,sse2,avx2,avx512}_agrees_with_scalar_directly)
behind runtime feature guards so backends not selected by the host
dispatcher (e.g. SSE2 on an AVX-512 chip) still get exercised.
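As a rough illustration of runtime dispatch over the four backends in the table, here is a sketch built on the standard library's feature-detection macros. The `Backend` enum, `pick_backend`, and `f32_lanes` are hypothetical names; the PR's real dispatcher lives behind `crate::ops::{neon,avx2,avx512}_available` and may be structured differently.

```rust
/// Hypothetical backend tag, one variant per row of the table plus a scalar fallback.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Scalar,
    Neon,
    Sse2,
    Avx2Fma,
    Avx512F,
}

/// Picks the widest backend the running CPU reports, falling back to scalar.
fn pick_backend() -> Backend {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx512f") {
            return Backend::Avx512F;
        }
        if std::arch::is_x86_feature_detected!("avx2")
            && std::arch::is_x86_feature_detected!("fma")
        {
            return Backend::Avx2Fma;
        }
        if std::arch::is_x86_feature_detected!("sse2") {
            return Backend::Sse2;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return Backend::Neon;
        }
    }
    Backend::Scalar
}

/// f32 multiply lanes per backend, as listed in the table above.
fn f32_lanes(backend: Backend) -> usize {
    match backend {
        Backend::Scalar => 1,
        Backend::Neon | Backend::Sse2 => 4,
        Backend::Avx2Fma => 8,
        Backend::Avx512F => 16,
    }
}
```

Because the dispatcher only ever returns one backend per host, the other kernels would go untested without direct-call tests like the ones described above.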

Length-mismatch guards are unconditional assert_eq! (not
debug_assert_eq!) because the unsafe SIMD bodies do raw-pointer
loads bounded only by a.len(). Each guard is cross-tested with
#[should_panic(expected = "fma_dot_f32_to_f64")].
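A scalar sketch of the two properties described in this section: f32 inputs accumulated in f64, and an unconditional length guard. The function names are hypothetical; the PR's kernel is `fma_dot_f32_to_f64` and its SIMD bodies are not reproduced here.

```rust
/// Dot product of f32 slices with an f64 accumulator.
/// The guard is `assert_eq!` rather than `debug_assert_eq!`, so it also
/// fires in release builds; a SIMD body doing raw-pointer loads bounded
/// only by `a.len()` would otherwise read past the end of `b`.
fn dot_f32_to_f64_scalar(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "dot_f32_to_f64_scalar: length mismatch");
    let mut acc = 0.0f64;
    for (&x, &y) in a.iter().zip(b) {
        // Widen before multiplying so each product is exact in f64.
        acc += f64::from(x) * f64::from(y);
    }
    acc
}

/// Same reduction with an f32 accumulator, for comparison only.
fn dot_f32_to_f32_scalar(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "dot_f32_to_f32_scalar: length mismatch");
    let mut acc = 0.0f32;
    for (&x, &y) in a.iter().zip(b) {
        acc += x * y;
    }
    acc
}
```

With `a = [1.0e8, 1.0, -1.0e8]` and `b = [1.0, 1.0, 1.0]`, the f32 accumulator absorbs the `1.0` (the f32 spacing near 1e8 is 8) and returns `0.0`, while the f64 accumulator returns `1.0`.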

Cluster + assignment

scipy-compatible rectangular LSAP (src/cluster/hungarian/lsap.rs,
~360 LOC) — direct port of SciPy's rectangular_lsap.cpp
(Crouse / LAPJV; PM Larsen). Replaces pathfinding::kuhn_munkres,
whose tie-break diverged from scipy on tied optima
(pathfinding and ordered-float deps removed). Tie-break now
matches scipy bit-for-bit. BSD-3-Clause attribution added to
NOTICE and Cargo.toml SPDX.
Neumaier-compensated dot/sum in VBx GEMM hot path
(src/ops/scalar/kahan.rs, src/ops/arch/neon/kahan.rs,
src/ops/dispatch/kahan.rs). Critical for long-recording
numerical stability where AHC dendrogram cuts at
<= threshold are sensitive to sub-ulp drift.
np.unique-equivalent AHC canonicalization — fcluster labels
are remapped to first-occurrence order, matching pyannote's
np.unique(fcluster - 1, return_inverse=True) semantics.
Pyannote overlap-excluded embedding mask + smoothing default
flipped to None to match community-1 semantics.
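For readers unfamiliar with Neumaier compensation, the following is a minimal scalar sketch of the technique. It is not the code from `src/ops/scalar/kahan.rs`; the function names and signatures are hypothetical.

```rust
/// Neumaier-compensated sum: a running compensation term `c` collects the
/// low-order bits lost by each addition and is added back at the end.
fn neumaier_sum<I: IntoIterator<Item = f64>>(xs: I) -> f64 {
    let mut sum = 0.0f64;
    let mut c = 0.0f64;
    for x in xs {
        let t = sum + x;
        if sum.abs() >= x.abs() {
            // Low-order bits of `x` were lost in `t`.
            c += (sum - t) + x;
        } else {
            // Low-order bits of `sum` were lost in `t`.
            c += (x - t) + sum;
        }
        sum = t;
    }
    sum + c
}

/// Compensated dot product: the same update applied to elementwise products.
fn neumaier_dot(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len(), "neumaier_dot: length mismatch");
    neumaier_sum(a.iter().zip(b).map(|(x, y)| x * y))
}
```

On the input `[1.0, 1e100, 1.0, -1e100]` a naive left-to-right sum returns `0.0`, while the compensated sum recovers `2.0`. The compensation covers the additions only; rounding inside each product is not corrected in this sketch.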

CI safety net

neon-native job pinned to ubuntu-24.04-arm; runs ops:: +
embed::fbank::tests + parity tests with
--cfg diarization_assert_neon so a runner-image regression
that hides NEON fails the build instead of silently routing
through scalar.
AVX2 / AVX-512 SDE scripts and sanitizer.sh extended to
include embed::fbank::tests.
Miri scripts use an explicit allowlist of FFT-free fbank tests
(shrink_*, panic guards, scalar-dispatch agreement) — rustfft's
default planners use SIMD intrinsics Miri can't evaluate.
Cap-and-reset logic for the thread-local fbank scratch factored
into pure helpers (shrink_scratch_before_resize,
shrink_scratch_after_loop) with 5 Miri-safe direct branch tests.
neon-native wired into coverage.needs so it blocks the
aggregate gate alongside the AVX SDE / sanitizer / miri lanes.
tests/parity_fixtures_endtoend.rs runs dia end-to-end on every
tests/parity/fixtures/*/clip_16k.wav and pins
(speakers, segments) against the captured pyannote 4.0.4
reference. #[ignore]-gated (loads the WeSpeaker ONNX model +
~26 min runtime); CI workflow integration is a separate
workstream.
Always-on compares_against_torchaudio_inline_chirp_snapshot
exercises the full kernel pipeline against torchaudio reference
values inline (no external fixtures) — the in-CI parity gate.
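The cap-and-reset idea behind the thread-local scratch can be sketched as a pure helper that is easy to test without running an FFT. The helper below is hypothetical and is not `shrink_scratch_before_resize` or `shrink_scratch_after_loop`, whose signatures are not shown in this PR; it also assumes that "256K f32" means 256 * 1024 elements.

```rust
/// Assumed retention cap in f32 elements (256 * 1024 is an interpretation of "256K f32").
const SCRATCH_RETAIN_LIMIT: usize = 256 * 1024;

/// Releases the buffer's allocation if it grew past `limit`; otherwise keeps
/// the allocation for reuse and only clears the contents.
/// Returns `true` when the allocation was dropped.
fn release_if_oversized(buf: &mut Vec<f32>, limit: usize) -> bool {
    if buf.capacity() > limit {
        // Replacing the Vec frees the old allocation; `Vec::new` does not allocate.
        *buf = Vec::new();
        true
    } else {
        buf.clear();
        false
    }
}
```

Keeping the decision in a function that touches only a `Vec` is what makes this kind of logic straightforward to exercise under Miri, since no SIMD or FFT planner is involved.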

Parity proof

Full 14-fixture e2e bench (cargo test --release --test parity_fixtures_endtoend --features ort,bundled-segmentation -- --ignored --test-threads=1):

test parity_01_dialogue ... ok
test parity_02_pyannote_sample ... ok
test parity_03_dual_speaker ... ok
test parity_04_three_speaker ... ok
test parity_05_four_speaker ... ok
test parity_06_long_recording ... ok          (3 spk / 346 segs)
test parity_07_yuhewei_dongbei_english ... ok (2 spk / 7 segs)
test parity_08_luyu_jinjing_freedom ... ok    (3 spk / 448 segs ← was 4 spk / 461 segs)
test parity_09_mrbeast_dollar_date ... ok     (8 spk / 468 segs)
test parity_10_mrbeast_clean_water ... ok     (7 spk / 115 segs)
test parity_11_mrbeast_age_race ... ok        (6 spk / 576 segs)
test parity_12_mrbeast_schools ... ok         (15 spk / 227 segs)
test parity_13_mrbeast_saved_animals ... ok   (11 spk / 296 segs)
test parity_14_mrbeast_strongman_robot ... ok (4 spk / 343 segs)

test result: ok. 14 passed; 0 failed

DER vs pyannote 4.0.4 reference RTTMs: 0.0000 on all 14 audios.

Breaking (pre-1.0)

diarization::embed::Error is now #[non_exhaustive]. Callers
with exhaustive match arms must add a _ => wildcard.
diarization::embed::Error::Fbank(String) variant removed (was
tied to the previous kaldi-native-fbank Result<_, String>
boundary; no longer constructible).

Crate is unpublished 0.1.0 → no downstream consumers to break.
Both items called out in CHANGELOG.md under # UNRELEASED.
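What the `#[non_exhaustive]` change asks of callers can be shown with a small stand-in enum. `EmbedError` and its `InvalidInput` variant are hypothetical; only the `NonFiniteOutput` name appears in this PR. Note that the attribute is only enforced across crate boundaries, so in a single-file sketch the wildcard arm is illustrative rather than compiler-mandated.

```rust
/// Stand-in for `diarization::embed::Error` (variant set is illustrative).
#[allow(dead_code)]
#[derive(Debug)]
#[non_exhaustive]
enum EmbedError {
    NonFiniteOutput,
    /// Hypothetical variant, present only to exercise the wildcard arm.
    InvalidInput(String),
}

/// Downstream-style match: named arms for the variants the caller cares about,
/// plus a `_ =>` arm so variants added later do not break the build.
fn describe(err: &EmbedError) -> &'static str {
    match err {
        EmbedError::NonFiniteOutput => "model produced NaN or inf",
        _ => "other embed error",
    }
}
```

A caller that previously matched on the removed `Fbank(String)` variant would delete that arm and rely on the wildcard instead.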

Test plan

cargo test --release --features ort,bundled-segmentation —
532 lib + 1 integration tests pass under RUSTFLAGS="-Dwarnings"
cargo clippy --no-default-features --features _bench -- -Dwarnings clean
14-fixture e2e parity bench: all green (~26 min)
Force-scalar Miri (RUSTFLAGS="--cfg diarization_force_scalar" cargo test --no-default-features -- ops:: embed::fbank::tests::*)
compares_against_torchaudio_inline_chirp_snapshot always-on
vs torchaudio reference values inline
CI: neon-native, AVX2-SDE, AVX-512-SDE, sanitizer, miri-tb,
miri-sb across all targets — gating the merge

🤖 Generated with Claude Code

Copilot

Pull request overview

This PR improves pyannote parity for embedding extraction and clustering by replacing the fbank implementation with a torchaudio-compliance port, tightening numerical stability in VBx/cluster initialization, and adding/refreshing parity fixtures and diagnostics to localize drift on longer real-world captures.

Changes:

Replace the previous fbank backend with a torchaudio.compliance.kaldi.fbank–style implementation (FFT + mel bank) and update dependencies accordingly.
Restore/strengthen strict parity on long recordings via Neumaier-compensated reductions in VBx, np.unique-equivalent AHC label canonicalization, and a SciPy-compatible rectangular LSAP for constrained assignment tie-breaking.
Update offline pipeline behavior for pyannote parity (overlap-excluded embedding masks, default smoothing behavior) and expand parity/diagnostic tests + fixtures.

Reviewed changes

Copilot reviewed 36 out of 53 changed files in this pull request and generated 5 comments.


File	Description
tests/parity/hyp.rttm	Adds an RTTM hypothesis artifact for parity/diagnostic workflows.
tests/parity/fixtures/10_mrbeast_clean_water/reference.rttm	Adds reference RTTM for the 10_mrbeast_clean_water capture.
tests/parity/fixtures/10_mrbeast_clean_water/manifest.json	Adds capture manifest + artifact hashes for 10_mrbeast_clean_water.
tests/parity/fixtures/08_luyu_jinjing_freedom/reference.rttm	Adds reference RTTM for the 08_luyu_jinjing_freedom capture.
tests/parity/fixtures/08_luyu_jinjing_freedom/manifest.json	Adds capture manifest + artifact hashes for 08_luyu_jinjing_freedom.
tests/parity_drift_10.rs	New ignored diagnostic test to measure segmentation/embedding drift vs captured pyannote intermediates.
src/reconstruct/rttm_parity_tests.rs	Refines RTTM per-line parity logic (structural equality + bounded duration tolerance) and adds an ignored fixture test.
src/reconstruct/parity_tests.rs	Restores strict discrete-grid parity for 06_long_recording (removes prior ignore rationale).
src/pipeline/parity_tests.rs	Expands parity/diagnostic coverage for additional captures and adds stage-localization helpers (mostly ignored).
src/pipeline/algo.rs	Updates assign_embeddings documentation around deferred speaker-count constraints.
src/ops/scalar/mod.rs	Exposes new scalar Neumaier-compensated reduction helpers.
src/ops/scalar/kahan.rs	Implements Neumaier-compensated dot/sum with unit tests.
src/ops/mod.rs	Re-exports compensated reduction functions via the dispatch layer.
src/ops/dispatch/mod.rs	Wires compensated dot/sum into runtime dispatch.
src/ops/dispatch/kahan.rs	Adds runtime dispatcher for compensated reductions (NEON fast-path, scalar fallback).
src/ops/arch/neon/mod.rs	Exposes NEON compensated dot/sum kernels.
src/ops/arch/neon/kahan.rs	Implements NEON Neumaier-compensated dot/sum kernels.
src/offline/owned.rs	Changes defaults for smoothing and mirrors pyannote’s overlap-excluded embedding-mask behavior.
src/embed/model.rs	Adds a manual `Debug` impl and an ignored test for AllSilent behavior in weighted embedding.
src/embed/fbank.rs	Replaces kaldi-native-fbank usage with a torchaudio-style fbank port (FFT/mel/log + mean-centering) plus SIMD kernels and tests.
src/embed/error.rs	Removes the now-obsolete `Error::Fbank` variant tied to kaldi-native-fbank initialization.
src/cluster/vbx/parity_tests.rs	Adds an ignored parity adapter for a longer fixture (10_mrbeast_clean_water).
src/cluster/vbx/algo.rs	Replaces key GEMM reductions with Neumaier-compensated dot/sum and introduces packed row-major buffers for stable iteration.
src/cluster/spectral.rs	Improves k-means Lloyd iteration by swapping buffers instead of cloning each iteration.
src/cluster/mod.rs	Extends compile-time Send/Sync assertions to additional cluster types.
src/cluster/hungarian/mod.rs	Adds the new LSAP module to the hungarian cluster submodule.
src/cluster/hungarian/lsap.rs	Introduces a SciPy-compatible rectangular LSAP implementation for tie-breaking parity.
src/cluster/hungarian/algo.rs	Switches constrained assignment to the new LSAP implementation (behavioral parity on tied costs).
src/cluster/ahc/tests.rs	Adjusts unit tests to assert partition-equivalence rather than fixed label values.
src/cluster/ahc/parity_tests.rs	Adds ignored parity tests for additional captured fixtures.
src/cluster/ahc/algo.rs	Changes label canonicalization to match `np.unique(..., return_inverse=True)` semantics.
scripts/fix_wespeaker_pooling_eps.py	Adds a script to patch WeSpeaker ONNX stats-pooling eps behavior to match PyTorch edge cases.
scripts/download-embed-model.sh	Updates the pinned embed-model revision and expected SHA-256.
README.md	Updates the pinned embed-model revision and expected SHA-256 in docs.
examples/run_owned_pipeline.rs	Updates example to construct the pipeline with explicit OwnedPipelineOptions (new smoothing default).
Cargo.toml	Replaces kaldi-native-fbank dependency with realfft for the new fbank implementation.


+# Real-valued FFT for the bit-exact torchaudio.compliance.kaldi.fbank
+# port (see `src/embed/torchaudio_fbank.rs`). PyTorch's `torch.fft.rfft`
+# routes to pocketfft on CPU; `realfft` wraps `rustfft`'s
+# Cooley-Tukey radix-2 path which produces the same spectrum within
+# ~1e-7 relative — small enough that the resnet+pooling output stays
+# within sub-ULP of pyannote on the 14-audio bench.
+realfft = "3"


+/// the diagnostic test below — so a mismatch here isolates dia's
+/// Hungarian (`pathfinding::kuhn_munkres`) tie-breaking from scipy's
+/// (`scipy.optimize.linear_sum_assignment` / LAPJV).
+#[test]
+#[ignore = "isolates Hungarian tie-breaking divergence using captured 10_mrbeast_clean_water soft_clusters"]


+//! Direct Rust port of scipy's `rectangular_lsap.cpp` (BSD-3, Crouse's
+//! shortest augmenting path; PM Larsen). The implementation is based
+//! on:
+//!


The `neon-native` job's step name `Run fbank + ops:: tests on arm64 (NEON dispatched)` contained an unquoted `:: ` sequence that YAML treated as a nested mapping value indicator. The whole workflow file failed to parse and every CI job was skipped (`This run likely failed because of a workflow file issue` with 0 jobs in the API response). Quoting the step name resolves the parse.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 50 out of 83 changed files in this pull request and generated 4 comments.

+  // Validate after transpose/negate so the rejection mirrors scipy
+  // (which also checks the working copy).
+  for &v in working.iter() {
+    if v.is_nan() || v == f64::NEG_INFINITY {
+      return Err(crate::cluster::hungarian::error::NonFiniteError::InfInSoftClusters.into());
+    }
+  }


+/// Marked `#[non_exhaustive]` so callers must include a `_ =>` arm in
+/// any `match`. Variants in this enum represent low-level numerical /
+/// boundary conditions (NaN/inf inputs, shape drift, ORT failure, …)
+/// and the set evolves as new failure modes are surfaced or as
+/// internal kernels stop being able to produce a given variant. The
+/// attribute lets us add or retire variants without it being a
+/// semver-breaking change for downstream exhaustive matchers.
 #[derive(Debug, Error)]
+#[non_exhaustive]


Three fixes converging on the failed CI jobs against fix/deep-review:

1. clippy `needless_return` (12 errors): the cfg-gated SIMD dispatch inside `apply_window_inplace` / `power_spectrum` / `fma_dot_f32_to_f64` ends each per-arch block with `return;` so a non-arch-matched fallback can't execute, but on any single arch only one block compiles and the trailing `return` looks needless. Allow the lint at the function level on all three dispatchers. Also drop the `into_iter()` on `col_ind` in `cluster::hungarian::algo::assign_one`, replace `&mut u, &mut v` with `&u, &v` in `lsap::augmenting_path` (function takes `&[f64]`), convert three `for i in 0..n { … xs[i] … }` loops to `iter_mut().enumerate().take(n)` form, switch `EPSILON: f32 = 1.1920928955078125e-07` to `f32::EPSILON` (literal had excessive precision).
2. miri-tb-i686 / miri-sb-riscv64gc `function … is never used` errors on `make_test_inputs` and `assert_dot_within_tol`: those helpers are only consumed by the `target_arch = "aarch64"` / `target_arch = "x86_64"` direct-backend tests, so on i686 / riscv every consumer is cfg-excluded and the helpers become dead code under `-Dwarnings`. Cfg-gate the helpers to match.

Verified: `cargo test --no-default-features` 505 passed, `cargo clippy --no-default-features --features _bench -- -Dwarnings` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Four findings from #7 (review):

1. `ops::scalar::kahan::kahan_dot` doc said "panics in debug only" via `debug_assert_eq!`, but the loop already indexes `b[i]` for `i in 0..a.len()`, so release builds panic on the bounds check anyway, just with a less-descriptive message. Promote to unconditional `assert_eq!` (matching `ops::dispatch::dot`'s public contract) and update the doc accordingly.
2. `cluster::hungarian::lsap::linear_sum_assignment` rejected NaN and `-inf` but let `+inf` through under `maximize=false` (in-tree `constrained_argmax` always passes `maximize=true` so the existing `+inf` boundary check + negation made this safe in-pipeline, but a future direct caller could trip the dual-update arithmetic). Change the validation to `!v.is_finite()` so any non-finite is caught regardless of orientation.
3. `NonFiniteError::InfInSoftClusters` error message claimed "+inf or -inf" but the LSAP layer also rejects NaN. Update the message to "+inf, -inf, or NaN" so the surfaced error matches the actual rejection criteria. Variant name is preserved for backward compatibility.
4. `embed::Error` `#[non_exhaustive]` + `Error::Fbank` removal is a source-breaking API change. The crate is unpublished 0.1.0 with zero downstream consumers (so the break is theoretical), but document both changes explicitly under `BREAKING (pre-1.0)` in `CHANGELOG.md`'s `# UNRELEASED` section so future readers can trace the API delta.

Verified: `cargo clippy --no-default-features --features _bench -- -Dwarnings` clean; `RUSTFLAGS="-Dwarnings" cargo test --features ort,bundled-segmentation` 532 lib tests + 1 integration pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
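The difference between the old and new LSAP input checks described in finding 2 can be sketched as follows. Both functions are hypothetical stand-ins operating on a flat cost slice; the real validation sits inside `cluster::hungarian::lsap::linear_sum_assignment` and returns the crate's own error type rather than a `String`.

```rust
/// Old predicate as quoted in the diff: catches NaN and -inf, but not +inf.
fn old_check_rejects(v: f64) -> bool {
    v.is_nan() || v == f64::NEG_INFINITY
}

/// New-style validation: any non-finite entry (NaN, +inf, -inf) is rejected,
/// independent of whether the caller maximizes or minimizes.
fn validate_costs(costs: &[f64]) -> Result<(), String> {
    match costs.iter().position(|v| !v.is_finite()) {
        Some(i) => Err(format!("non-finite cost at index {}", i)),
        None => Ok(()),
    }
}
```

`f64::is_finite` is false for NaN and for both infinities, which is why a single check covers all three cases.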


codecov · 2026-05-09T06:14:54Z

Codecov Report

❌ Patch coverage is 61.34454% with 230 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/embed/fbank.rs	65.07%	117 Missing ⚠️
src/ops/arch/neon/kahan.rs	0.00%	75 Missing ⚠️
src/offline/owned.rs	0.00%	17 Missing ⚠️
src/ops/dispatch/kahan.rs	28.57%	10 Missing ⚠️
src/ops/scalar/kahan.rs	78.57%	6 Missing ⚠️
src/cluster/hungarian/lsap.rs	97.39%	3 Missing ⚠️
src/embed/model.rs	0.00%	2 Missing ⚠️


uqio added 4 commits May 8, 2026 20:04

fix fmt

d6ca066

fix fmt

b1a1a3f

fix fmt

c12e012

fix fmt

a8e8f9e

al8n requested a review from Copilot May 8, 2026 12:28

Copilot started reviewing on behalf of al8n May 8, 2026 12:30

Copilot AI reviewed May 8, 2026


uqio and others added 11 commits May 9, 2026 01:05

fix fmt

cb49f82

fix fmt

fd40f23

fix fmt

5cfbf0c

fix fmt

03f30e6

fix fmt

07e7c37

fix fmt

7a43e4c

fix fmt

39d1710

fix fmt

1a9b8c5

fix fmt

4254609

fix fmt

9c94810

al8n requested a review from Copilot May 9, 2026 04:59

Copilot started reviewing on behalf of al8n May 9, 2026 05:00

Copilot AI reviewed May 9, 2026


uqio and others added 2 commits May 9, 2026 17:07

uqio changed the title ~~feat: fbank~~ feat: 14/14 pyannote-community-1 parity — fbank port + scipy-LSAP + Neumaier-VBx + bounded-scratch SIMD May 9, 2026

uqio merged commit 4d7593b into main May 9, 2026
63 checks passed

uqio deleted the feat/onnx-rust-resnet-tail branch May 9, 2026 05:35

uqio added a commit that referenced this pull request May 9, 2026

feat: 14/14 pyannote-community-1 parity — fbank port + scipy-LSAP + Neumaier-VBx + bounded-scratch SIMD (#7)

d4a5b9d

uqio merged 17 commits into main from feat/onnx-rust-resnet-tail

2 participants
