
feat: 14/14 pyannote-community-1 parity — fbank port + scipy-LSAP + Neumaier-VBx + bounded-scratch SIMD #7

Merged
uqio merged 17 commits into main from feat/onnx-rust-resnet-tail
May 9, 2026
Conversation

@uqio
Collaborator

@uqio uqio commented May 8, 2026

Summary

End-to-end pyannote 4.0.4 community-1 parity across the full
14-audio bench: 6 in-repo fixtures + 8 testaudioset clips (07–14)
all match speaker count + segment count exactly, DER = 0.0000
on every audio. The previous spurious 4th cluster on the 23.6-min
Mandarin interview 08_luyu_jinjing_freedom (root cause: ~2.4e-4
f32 fbank drift amplified through ResNet34 to 0.66 abs embedding
error) is fixed by an in-tree port of torchaudio.compliance.kaldi.fbank
that brings worst-case embedding drift down to 1.018e-5.

Major changes

Fbank: in-tree torchaudio port (replacing kaldi-native-fbank)

  • New src/embed/fbank.rs (~1.6k LOC) — bit-near-exact port of
    torchaudio.compliance.kaldi.fbank. Pipeline: strided frames →
    DC offset → preemphasis → Hamming window → zero-pad to 512 →
    realfft (radix-2 r2c) → power spectrum ((re²+im²).sqrt() then
    square — bit-for-bit matches torchaudio's complex.abs().pow(2))
    → mel filterbank (80 triangular bins, 20 Hz → Nyquist) →
    log(max(EPSILON, x)).
  • kaldi-native-fbank dependency dropped.
  • Cached resources via OnceLock (mel filterbank, Hamming window).
  • Thread-local FftScratch for the FFT plan + scratch Vecs;
    bounded retention via SCRATCH_RETAIN_LIMIT = 256K f32 so a
    one-shot 1-hour clip can't pin hundreds of MB per worker thread.
  • Centered-input cropping in compute_fbank so the public
    fixed-shape API never feeds more than FBANK_FRAMES * shift + window samples to the kernel.
  • NaN-propagating log floor (manual cmp instead of f32::max so
    internally-overflowed FFT inputs flow to the embed model's
    Error::NonFiniteOutput check rather than silently flooring).
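
The NaN-propagating floor in the last bullet can be illustrated with a minimal sketch. The function names below are hypothetical and not taken from src/embed/fbank.rs; the floor constant is assumed to be f32::EPSILON, as the later clippy fix in this PR suggests. The point is that f32::max returns the non-NaN operand, so a NaN input would be replaced by the floor, whereas a manual comparison leaves it intact.

```rust
/// Floor applied before the log; assumed here to be f32::EPSILON.
const EPSILON: f32 = f32::EPSILON;

/// Hypothetical helper: log with a floor that lets NaN through.
/// `x < EPSILON` is false for NaN, so NaN falls into the `else` arm.
pub fn log_floor_nan_propagating(x: f32) -> f32 {
    let floored = if x < EPSILON { EPSILON } else { x };
    floored.ln()
}

/// Hypothetical counter-example: `f32::max` ignores a NaN operand,
/// so a NaN input is silently replaced by the floor.
pub fn log_floor_with_max(x: f32) -> f32 {
    x.max(EPSILON).ln()
}

/// Power of one FFT bin computed as |z|^2 via sqrt-then-square,
/// the operation order the PR describes for matching torchaudio.
pub fn power_bin(re: f32, im: f32) -> f32 {
    let magnitude = (re * re + im * im).sqrt();
    magnitude * magnitude
}
```

With this shape, a NaN power value stays NaN after the log and can be caught by a downstream finiteness check, while ordinary small values are still floored.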

SIMD kernels

Four backends for the dominant mel-matmul dot product, runtime-
dispatched via crate::ops::{neon,avx2,avx512}_available:

| Arch | Lanes (f32 mul) | Lanes (f64 acc) |
|---|---|---|
| NEON | 4 | 2 |
| SSE2 | 4 | 2 |
| AVX2 + FMA | 8 | 4 |
| AVX-512F | 16 | 8 |

The accumulator is f64 (not a literal f32 BLAS sgemm contract), which
is empirically the choice that holds 14/14 parity. During iteration,
the f32-literal contract regressed 09_mrbeast_dollar_date from
8 spk / 468 segs to 8 spk / 470 segs; this is documented in the
kernel header.
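
As a scalar illustration of why the accumulator width matters, here is a minimal sketch. It assumes, from the table's column headings, that products are formed in f32 and then widened to f64 before accumulation; the function names are hypothetical and the real kernels are SIMD.

```rust
/// Dot product with an f32 accumulator (the variant the PR moved away from).
pub fn dot_f32_acc(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "dot_f32_acc: length mismatch");
    let mut acc = 0.0f32;
    for (&x, &y) in a.iter().zip(b) {
        acc += x * y;
    }
    acc
}

/// Dot product with f32 multiplies widened into an f64 accumulator.
pub fn dot_f64_acc(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "dot_f64_acc: length mismatch");
    let mut acc = 0.0f64;
    for (&x, &y) in a.iter().zip(b) {
        // The product is rounded to f32 first, then accumulated in f64.
        acc += f64::from(x * y);
    }
    acc
}
```

For the inputs `[1e8, 1.0, -1e8]` against `[1.0, 1.0, 1.0]`, the f32 accumulator absorbs the `1.0` (the spacing between adjacent f32 values near 1e8 is 8) and returns 0, while the f64 accumulator returns 1.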

NEON-only window-mul + power-spectrum kernels (smaller hot spots).
All four dot kernels have direct-call tests
(dot_{neon,sse2,avx2,avx512}_agrees_with_scalar_directly)
behind runtime feature guards so backends not selected by the host
dispatcher (e.g. SSE2 on an AVX-512 chip) still get exercised.

Length-mismatch guards are unconditional assert_eq! (not
debug_assert_eq!) because the unsafe SIMD bodies do raw-pointer
loads bounded only by a.len(). Each guard is cross-tested with
#[should_panic(expected = "fma_dot_f32_to_f64")].
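
The guard pattern described above might look like the following sketch. The function names and the raw-pointer loop are illustrative stand-ins for the real SIMD bodies; only the panic-message prefix is taken from the PR text.

```rust
/// Safe wrapper: the length check is unconditional because the unsafe
/// body below trusts `n` for both pointers.
pub fn dot_checked(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "fma_dot_f32_to_f64: length mismatch");
    // SAFETY: both slices have exactly `a.len()` elements (checked above).
    unsafe { dot_raw(a.as_ptr(), b.as_ptr(), a.len()) }
}

/// Stand-in for a SIMD kernel: raw-pointer loads bounded only by `n`.
///
/// # Safety
/// `a` and `b` must each be valid for reads of `n` f32 values.
unsafe fn dot_raw(a: *const f32, b: *const f32, n: usize) -> f64 {
    let mut acc = 0.0f64;
    for i in 0..n {
        // SAFETY: `i < n`, and the caller guarantees `n` readable elements.
        let (x, y) = unsafe { (*a.add(i), *b.add(i)) };
        acc += f64::from(x) * f64::from(y);
    }
    acc
}
```

With debug_assert_eq! the check would vanish in release builds and a shorter `b` would be read out of bounds; the unconditional assert turns that into a panic in every profile.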

Cluster + assignment

  • scipy-compatible rectangular LSAP (src/cluster/hungarian/lsap.rs,
    ~360 LOC) — direct port of SciPy's rectangular_lsap.cpp
    (Crouse / LAPJV; PM Larsen). Replaces pathfinding::kuhn_munkres,
    whose tie-break diverged from scipy on tied optima
    (pathfinding and ordered-float deps removed). Tie-break now
    matches scipy bit-for-bit. BSD-3-Clause attribution added to
    NOTICE and Cargo.toml SPDX.
  • Neumaier-compensated dot/sum in VBx GEMM hot path
    (src/ops/scalar/kahan.rs, src/ops/arch/neon/kahan.rs,
    src/ops/dispatch/kahan.rs). Critical for long-recording
    numerical stability where AHC dendrogram cuts at
    <= threshold are sensitive to sub-ulp drift.
  • np.unique-equivalent AHC canonicalization: fcluster labels
    are remapped to first-occurrence order, matching pyannote's
    np.unique(fcluster - 1, return_inverse=True) semantics.
  • Pyannote overlap-excluded embedding mask + smoothing default
    flipped to None to match community-1 semantics.
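
For readers unfamiliar with Neumaier summation, a minimal scalar sketch follows. It is the textbook algorithm, not a copy of src/ops/scalar/kahan.rs, and the function names are hypothetical. Note that in this sketch only the additions are compensated; the rounding of each product in the dot variant is not.

```rust
/// Neumaier-compensated sum: tracks the low-order bits lost by each
/// addition in a separate compensation term.
pub fn neumaier_sum(xs: &[f64]) -> f64 {
    let mut sum = 0.0f64;
    let mut comp = 0.0f64;
    for &x in xs {
        let t = sum + x;
        if sum.abs() >= x.abs() {
            // Low-order bits of `x` were lost in `t`.
            comp += (sum - t) + x;
        } else {
            // Low-order bits of `sum` were lost in `t`.
            comp += (x - t) + sum;
        }
        sum = t;
    }
    sum + comp
}

/// Dot product whose accumulation uses the same compensation.
pub fn neumaier_dot(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len(), "neumaier_dot: length mismatch");
    let products: Vec<f64> = a.iter().zip(b).map(|(&x, &y)| x * y).collect();
    neumaier_sum(&products)
}
```

On the classic input `[1.0, 1e100, 1.0, -1e100]` a plain left-to-right sum returns 0, whereas the compensated sum recovers 2.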

CI safety net

  • neon-native job pinned to ubuntu-24.04-arm; runs ops:: +
    embed::fbank::tests + parity tests with
    --cfg diarization_assert_neon so a runner-image regression
    that hides NEON fails the build instead of silently routing
    through scalar.
  • AVX2 / AVX-512 SDE scripts and sanitizer.sh extended to
    include embed::fbank::tests.
  • Miri scripts use an explicit allowlist of FFT-free fbank tests
    (shrink_*, panic guards, scalar-dispatch agreement) — rustfft's
    default planners use SIMD intrinsics Miri can't evaluate.
  • Cap-and-reset logic for the thread-local fbank scratch factored
    into pure helpers (shrink_scratch_before_resize,
    shrink_scratch_after_loop) with 5 Miri-safe direct branch tests.
  • neon-native wired into coverage.needs so it blocks the
    aggregate gate alongside the AVX SDE / sanitizer / miri lanes.
  • tests/parity_fixtures_endtoend.rs runs dia end-to-end on every
    tests/parity/fixtures/*/clip_16k.wav and pins
    (speakers, segments) against the captured pyannote 4.0.4
    reference. #[ignore]-gated (loads the WeSpeaker ONNX model +
    ~26 min runtime); CI workflow integration is a separate
    workstream.
  • Always-on compares_against_torchaudio_inline_chirp_snapshot
    exercises the full kernel pipeline against torchaudio reference
    values inline (no external fixtures) — the in-CI parity gate.
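
The cap-and-reset idea for the thread-local scratch can be sketched as below. This is an assumption-laden illustration: the helper name, the single-buffer layout, and reading "256K f32" as 256 * 1024 elements are all hypothetical, and the PR's actual helpers (shrink_scratch_before_resize, shrink_scratch_after_loop) may be structured differently.

```rust
use std::cell::RefCell;

/// Assumed interpretation of "256K f32": 256 * 1024 elements.
pub const RETAIN_LIMIT: usize = 256 * 1024;

thread_local! {
    /// Per-thread scratch buffer reused across calls.
    static SCRATCH: RefCell<Vec<f32>> = RefCell::new(Vec::new());
}

/// Pure helper: drop the allocation if it grew past `limit`,
/// otherwise keep the capacity and just clear the contents.
pub fn release_if_oversized(buf: &mut Vec<f32>, limit: usize) {
    if buf.capacity() > limit {
        *buf = Vec::new();
    } else {
        buf.clear();
    }
}

/// Example consumer: uses the scratch, then applies the cap.
pub fn sum_of_squares(samples: &[f32]) -> f32 {
    SCRATCH.with(|cell| {
        let mut buf = cell.borrow_mut();
        buf.clear();
        buf.extend(samples.iter().map(|s| s * s));
        let total: f32 = buf.iter().sum();
        release_if_oversized(&mut buf, RETAIN_LIMIT);
        total
    })
}

/// Capacity currently retained by this thread's scratch.
pub fn retained_capacity() -> usize {
    SCRATCH.with(|cell| cell.borrow().capacity())
}
```

Keeping the cap decision in a pure function over `&mut Vec<f32>` is what makes it testable without running any FFT, which is the property the Miri allowlist relies on.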

Parity proof

Full 14-fixture e2e bench (cargo test --release --test parity_fixtures_endtoend --features ort,bundled-segmentation -- --ignored --test-threads=1):

test parity_01_dialogue ... ok
test parity_02_pyannote_sample ... ok
test parity_03_dual_speaker ... ok
test parity_04_three_speaker ... ok
test parity_05_four_speaker ... ok
test parity_06_long_recording ... ok          (3 spk / 346 segs)
test parity_07_yuhewei_dongbei_english ... ok (2 spk / 7 segs)
test parity_08_luyu_jinjing_freedom ... ok    (3 spk / 448 segs ← was 4 spk / 461 segs)
test parity_09_mrbeast_dollar_date ... ok     (8 spk / 468 segs)
test parity_10_mrbeast_clean_water ... ok     (7 spk / 115 segs)
test parity_11_mrbeast_age_race ... ok        (6 spk / 576 segs)
test parity_12_mrbeast_schools ... ok         (15 spk / 227 segs)
test parity_13_mrbeast_saved_animals ... ok   (11 spk / 296 segs)
test parity_14_mrbeast_strongman_robot ... ok (4 spk / 343 segs)

test result: ok. 14 passed; 0 failed

DER vs pyannote 4.0.4 reference RTTMs: 0.0000 on all 14 audios.

Breaking (pre-1.0)

  • diarization::embed::Error is now #[non_exhaustive]. Callers
    with exhaustive match arms must add a _ => wildcard.
  • diarization::embed::Error::Fbank(String) variant removed (was
    tied to the previous kaldi-native-fbank Result<_, String>
    boundary; no longer constructible).

Crate is unpublished 0.1.0 → no downstream consumers to break.
Both items called out in CHANGELOG.md under # UNRELEASED.
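
A sketch of what the wildcard requirement means for callers follows. The enum here is a hypothetical stand-in for diarization::embed::Error: NonFiniteOutput is named in this PR, while ShapeMismatch and its fields are invented for illustration.

```rust
/// Hypothetical stand-in for `diarization::embed::Error`.
#[derive(Debug)]
#[non_exhaustive]
pub enum EmbedError {
    /// Variant named in the PR description.
    NonFiniteOutput,
    /// Invented variant, present only to give the wildcard something to catch.
    ShapeMismatch { expected: usize, got: usize },
}

/// Caller-side match: handle the variants you care about and route
/// everything else, including variants added later, through `_`.
pub fn describe(err: &EmbedError) -> String {
    match err {
        EmbedError::NonFiniteOutput => "model produced NaN or inf".to_string(),
        _ => format!("unhandled embed error: {err:?}"),
    }
}
```

The compiler enforces the wildcard only for matches written in other crates; inside the defining crate an exhaustive match still compiles.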

Test plan

  • cargo test --release --features ort,bundled-segmentation:
    532 lib + 1 integration tests pass under RUSTFLAGS="-Dwarnings"
  • cargo clippy --no-default-features --features _bench -- -Dwarnings: clean
  • 14-fixture e2e parity bench: all green (~26 min)
  • Force-scalar Miri (RUSTFLAGS="--cfg diarization_force_scalar" cargo test --no-default-features -- ops:: embed::fbank::tests::*)
  • compares_against_torchaudio_inline_chirp_snapshot always-on
    vs torchaudio reference values inline
  • CI: neon-native, AVX2-SDE, AVX-512-SDE, sanitizer, miri-tb,
    miri-sb across all targets — gating the merge

🤖 Generated with Claude Code

Copilot AI left a comment

Pull request overview

This PR improves pyannote parity for embedding extraction and clustering by replacing the fbank implementation with a torchaudio-compliance port, tightening numerical stability in VBx/cluster initialization, and adding/refreshing parity fixtures and diagnostics to localize drift on longer real-world captures.

Changes:

  • Replace the previous fbank backend with a torchaudio.compliance.kaldi.fbank–style implementation (FFT + mel bank) and update dependencies accordingly.
  • Restore/strengthen strict parity on long recordings via Neumaier-compensated reductions in VBx, np.unique-equivalent AHC label canonicalization, and a SciPy-compatible rectangular LSAP for constrained assignment tie-breaking.
  • Update offline pipeline behavior for pyannote parity (overlap-excluded embedding masks, default smoothing behavior) and expand parity/diagnostic tests + fixtures.

Reviewed changes

Copilot reviewed 36 out of 53 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| tests/parity/hyp.rttm | Adds an RTTM hypothesis artifact for parity/diagnostic workflows. |
| tests/parity/fixtures/10_mrbeast_clean_water/reference.rttm | Adds reference RTTM for the 10_mrbeast_clean_water capture. |
| tests/parity/fixtures/10_mrbeast_clean_water/manifest.json | Adds capture manifest + artifact hashes for 10_mrbeast_clean_water. |
| tests/parity/fixtures/08_luyu_jinjing_freedom/reference.rttm | Adds reference RTTM for the 08_luyu_jinjing_freedom capture. |
| tests/parity/fixtures/08_luyu_jinjing_freedom/manifest.json | Adds capture manifest + artifact hashes for 08_luyu_jinjing_freedom. |
| tests/parity_drift_10.rs | New ignored diagnostic test to measure segmentation/embedding drift vs captured pyannote intermediates. |
| src/reconstruct/rttm_parity_tests.rs | Refines RTTM per-line parity logic (structural equality + bounded duration tolerance) and adds an ignored fixture test. |
| src/reconstruct/parity_tests.rs | Restores strict discrete-grid parity for 06_long_recording (removes prior ignore rationale). |
| src/pipeline/parity_tests.rs | Expands parity/diagnostic coverage for additional captures and adds stage-localization helpers (mostly ignored). |
| src/pipeline/algo.rs | Updates assign_embeddings documentation around deferred speaker-count constraints. |
| src/ops/scalar/mod.rs | Exposes new scalar Neumaier-compensated reduction helpers. |
| src/ops/scalar/kahan.rs | Implements Neumaier-compensated dot/sum with unit tests. |
| src/ops/mod.rs | Re-exports compensated reduction functions via the dispatch layer. |
| src/ops/dispatch/mod.rs | Wires compensated dot/sum into runtime dispatch. |
| src/ops/dispatch/kahan.rs | Adds runtime dispatcher for compensated reductions (NEON fast-path, scalar fallback). |
| src/ops/arch/neon/mod.rs | Exposes NEON compensated dot/sum kernels. |
| src/ops/arch/neon/kahan.rs | Implements NEON Neumaier-compensated dot/sum kernels. |
| src/offline/owned.rs | Changes defaults for smoothing and mirrors pyannote’s overlap-excluded embedding-mask behavior. |
| src/embed/model.rs | Adds a manual Debug impl and an ignored test for AllSilent behavior in weighted embedding. |
| src/embed/fbank.rs | Replaces kaldi-native-fbank usage with a torchaudio-style fbank port (FFT/mel/log + mean-centering) plus SIMD kernels and tests. |
| src/embed/error.rs | Removes the now-obsolete Error::Fbank variant tied to kaldi-native-fbank initialization. |
| src/cluster/vbx/parity_tests.rs | Adds an ignored parity adapter for a longer fixture (10_mrbeast_clean_water). |
| src/cluster/vbx/algo.rs | Replaces key GEMM reductions with Neumaier-compensated dot/sum and introduces packed row-major buffers for stable iteration. |
| src/cluster/spectral.rs | Improves k-means Lloyd iteration by swapping buffers instead of cloning each iteration. |
| src/cluster/mod.rs | Extends compile-time Send/Sync assertions to additional cluster types. |
| src/cluster/hungarian/mod.rs | Adds the new LSAP module to the hungarian cluster submodule. |
| src/cluster/hungarian/lsap.rs | Introduces a SciPy-compatible rectangular LSAP implementation for tie-breaking parity. |
| src/cluster/hungarian/algo.rs | Switches constrained assignment to the new LSAP implementation (behavioral parity on tied costs). |
| src/cluster/ahc/tests.rs | Adjusts unit tests to assert partition-equivalence rather than fixed label values. |
| src/cluster/ahc/parity_tests.rs | Adds ignored parity tests for additional captured fixtures. |
| src/cluster/ahc/algo.rs | Changes label canonicalization to match np.unique(..., return_inverse=True) semantics. |
| scripts/fix_wespeaker_pooling_eps.py | Adds a script to patch WeSpeaker ONNX stats-pooling eps behavior to match PyTorch edge cases. |
| scripts/download-embed-model.sh | Updates the pinned embed-model revision and expected SHA-256. |
| README.md | Updates the pinned embed-model revision and expected SHA-256 in docs. |
| examples/run_owned_pipeline.rs | Updates example to construct the pipeline with explicit OwnedPipelineOptions (new smoothing default). |
| Cargo.toml | Replaces kaldi-native-fbank dependency with realfft for the new fbank implementation. |

Comment thread src/cluster/hungarian/algo.rs Outdated
Comment thread Cargo.toml
Comment on lines +160 to +166
# Real-valued FFT for the bit-exact torchaudio.compliance.kaldi.fbank
# port (see `src/embed/torchaudio_fbank.rs`). PyTorch's `torch.fft.rfft`
# routes to pocketfft on CPU; `realfft` wraps `rustfft`'s
# Cooley-Tukey radix-2 path which produces the same spectrum within
# ~1e-7 relative — small enough that the resnet+pooling output stays
# within sub-ULP of pyannote on the 14-audio bench.
realfft = "3"
Comment thread src/embed/fbank.rs Outdated
Comment on lines +386 to +390
/// the diagnostic test below — so a mismatch here isolates dia's
/// Hungarian (`pathfinding::kuhn_munkres`) tie-breaking from scipy's
/// (`scipy.optimize.linear_sum_assignment` / LAPJV).
#[test]
#[ignore = "isolates Hungarian tie-breaking divergence using captured 10_mrbeast_clean_water soft_clusters"]
Comment on lines +3 to +6
//! Direct Rust port of scipy's `rectangular_lsap.cpp` (BSD-3, Crouse's
//! shortest augmenting path; PM Larsen). The implementation is based
//! on:
//!
uqio and others added 11 commits May 9, 2026 01:05
The `neon-native` job's step name `Run fbank + ops:: tests on arm64
(NEON dispatched)` contained an unquoted `:: ` sequence that YAML
treated as a nested mapping value indicator. The whole workflow
file failed to parse and every CI job was skipped (`This run likely
failed because of a workflow file issue` with 0 jobs in the API
response). Quoting the step name resolves the parse.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI left a comment

Pull request overview

Copilot reviewed 50 out of 83 changed files in this pull request and generated 4 comments.

Comment thread src/ops/scalar/kahan.rs Outdated
Comment on lines +94 to +100
// Validate after transpose/negate so the rejection mirrors scipy
// (which also checks the working copy).
for &v in working.iter() {
    if v.is_nan() || v == f64::NEG_INFINITY {
        return Err(crate::cluster::hungarian::error::NonFiniteError::InfInSoftClusters.into());
    }
}
Comment thread src/embed/error.rs
Comment on lines +10 to +18
/// Marked `#[non_exhaustive]` so callers must include a `_ =>` arm in
/// any `match`. Variants in this enum represent low-level numerical /
/// boundary conditions (NaN/inf inputs, shape drift, ORT failure, …)
/// and the set evolves as new failure modes are surfaced or as
/// internal kernels stop being able to produce a given variant. The
/// attribute lets us add or retire variants without it being a
/// semver-breaking change for downstream exhaustive matchers.
#[derive(Debug, Error)]
#[non_exhaustive]
Comment thread src/cluster/hungarian/lsap.rs Outdated
uqio and others added 2 commits May 9, 2026 17:07
Three fixes converging on the failed CI jobs against fix/deep-review:

1. clippy `needless_return` (12 errors): the cfg-gated SIMD dispatch
   inside `apply_window_inplace` / `power_spectrum` /
   `fma_dot_f32_to_f64` ends each per-arch block with `return;` so a
   non-arch-matched fallback can't execute, but on any single arch
   only one block compiles and the trailing `return` looks needless.
   Allow the lint at the function level on all three dispatchers.
   Also drop the `into_iter()` on `col_ind` in
   `cluster::hungarian::algo::assign_one`, replace `&mut u, &mut v`
   with `&u, &v` in `lsap::augmenting_path` (function takes `&[f64]`),
   convert three `for i in 0..n { … xs[i] … }` loops to
   `iter_mut().enumerate().take(n)` form, switch
   `EPSILON: f32 = 1.1920928955078125e-07` to `f32::EPSILON` (literal
   had excessive precision).

2. miri-tb-i686 / miri-sb-riscv64gc `function … is never used`
   errors on `make_test_inputs` and `assert_dot_within_tol`: those
   helpers are only consumed by the `target_arch = "aarch64"` /
   `target_arch = "x86_64"` direct-backend tests, so on i686 / riscv
   every consumer is cfg-excluded and the helpers become dead code
   under `-Dwarnings`. Cfg-gate the helpers to match.

Verified: `cargo test --no-default-features` 505 passed, `cargo
clippy --no-default-features --features _bench -- -Dwarnings` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
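
The clippy situation in item 1 can be reproduced with a small sketch. The dispatcher name matches one mentioned in the commit message, but its signature and the scalar stand-ins for the per-arch kernels are assumptions made for illustration.

```rust
/// Scalar body; stands in for every backend in this sketch.
fn apply_window_scalar(frame: &mut [f32], window: &[f32]) {
    for (x, &w) in frame.iter_mut().zip(window) {
        *x *= w;
    }
}

/// Each cfg block ends with `return;` so that, if two blocks were ever
/// compiled together, the first one to run would win. On any single
/// target only one block survives cfg-stripping, which makes its
/// trailing `return;` look redundant to clippy; hence the allow.
#[allow(clippy::needless_return)]
pub fn apply_window_inplace(frame: &mut [f32], window: &[f32]) {
    assert_eq!(frame.len(), window.len(), "apply_window_inplace: length mismatch");

    #[cfg(target_arch = "aarch64")]
    {
        // A real build would call a NEON kernel here.
        apply_window_scalar(frame, window);
        return;
    }

    #[cfg(target_arch = "x86_64")]
    {
        // A real build would pick an x86 SIMD kernel here.
        apply_window_scalar(frame, window);
        return;
    }

    #[cfg(not(any(target_arch = "aarch64", target_arch = "x86_64")))]
    {
        apply_window_scalar(frame, window);
        return;
    }
}
```

The same reasoning explains item 2: a helper used only inside cfg-gated blocks or tests becomes dead code on targets where every consumer is stripped, so its own cfg has to match.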
Four findings from #7 (review):

1. `ops::scalar::kahan::kahan_dot` doc said "panics in debug only" via
   `debug_assert_eq!`, but the loop already indexes `b[i]` for
   `i in 0..a.len()` — release builds panic on bounds-check anyway,
   just with a less-descriptive message. Promote to unconditional
   `assert_eq!` (matching `ops::dispatch::dot`'s public contract) and
   update the doc accordingly.

2. `cluster::hungarian::lsap::linear_sum_assignment` rejected NaN and
   `-inf` but let `+inf` through under `maximize=false` (in-tree
   `constrained_argmax` always passes `maximize=true` so the existing
   `+inf` boundary check + negation made this safe in-pipeline, but a
   future direct caller could trip the dual-update arithmetic).
   Change the validation to `!v.is_finite()` so any non-finite is
   caught regardless of orientation.

3. `NonFiniteError::InfInSoftClusters` error message claimed
   "+inf or -inf" but the LSAP layer also rejects NaN. Update the
   message to "+inf, -inf, or NaN" so the surfaced error matches the
   actual rejection criteria. Variant name is preserved for
   backward compatibility.

4. `embed::Error` `#[non_exhaustive]` + `Error::Fbank` removal is a
   source-breaking API change. The crate is unpublished 0.1.0 with
   zero downstream consumers (so the break is theoretical), but
   document both changes explicitly under `BREAKING (pre-1.0)` in
   `CHANGELOG.md`'s `# UNRELEASED` section so future readers can
   trace the API delta.

Verified: `cargo clippy --no-default-features --features _bench --
-Dwarnings` clean; `RUSTFLAGS="-Dwarnings" cargo test --features
ort,bundled-segmentation` 532 lib tests + 1 integration pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
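
Finding 2 comes down to one predicate. The sketch below contrasts the two checks; the error type and function names are hypothetical, not the crate's NonFiniteError.

```rust
/// Hypothetical error carrying the index of the first rejected entry.
#[derive(Debug, PartialEq)]
pub struct NonFiniteCost {
    pub index: usize,
}

/// Earlier check as described in the review: rejects NaN and -inf only,
/// so +inf slips through.
pub fn validate_nan_or_neg_inf(costs: &[f64]) -> Result<(), NonFiniteCost> {
    match costs.iter().position(|v| v.is_nan() || *v == f64::NEG_INFINITY) {
        Some(index) => Err(NonFiniteCost { index }),
        None => Ok(()),
    }
}

/// Revised check: any non-finite value is rejected, whatever the
/// orientation (minimize or maximize) of the assignment problem.
pub fn validate_finite(costs: &[f64]) -> Result<(), NonFiniteCost> {
    match costs.iter().position(|v| !v.is_finite()) {
        Some(index) => Err(NonFiniteCost { index }),
        None => Ok(()),
    }
}
```

A cost matrix containing `f64::INFINITY` passes the first check and fails the second, which is the gap the commit closes for direct callers.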
@uqio uqio changed the title from "feat: fbank" to "feat: 14/14 pyannote-community-1 parity — fbank port + scipy-LSAP + Neumaier-VBx + bounded-scratch SIMD" on May 9, 2026
@uqio uqio merged commit 4d7593b into main May 9, 2026
63 checks passed
@uqio uqio deleted the feat/onnx-rust-resnet-tail branch May 9, 2026 05:35
uqio added a commit that referenced this pull request May 9, 2026
@codecov

codecov Bot commented May 9, 2026

Codecov Report

❌ Patch coverage is 61.34454% with 230 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/embed/fbank.rs | 65.07% | 117 Missing ⚠️ |
| src/ops/arch/neon/kahan.rs | 0.00% | 75 Missing ⚠️ |
| src/offline/owned.rs | 0.00% | 17 Missing ⚠️ |
| src/ops/dispatch/kahan.rs | 28.57% | 10 Missing ⚠️ |
| src/ops/scalar/kahan.rs | 78.57% | 6 Missing ⚠️ |
| src/cluster/hungarian/lsap.rs | 97.39% | 3 Missing ⚠️ |
| src/embed/model.rs | 0.00% | 2 Missing ⚠️ |

2 participants