whispery

Sans-I/O word-level forced alignment with WhisperX-equivalent accuracy.


Overview

Sans-I/O cut/batch/whisper/align state machine for speech-to-text indexing pipelines, inspired by WhisperX. Whispery itself owns no threads, channels, or runtime — you drive it from a single thread (or wrap blocking calls in your async runtime), feeding samples + VAD and pulling commands the runner answers via sync compute primitives (AsrSource, run_one_alignment).
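The drive pattern can be sketched with a toy state machine (hypothetical types, not the whispery API): the machine owns no threads or I/O; the caller pulls commands and pushes results back in, deciding for itself whether the compute happens inline, on a thread pool, or behind `spawn_blocking`.

```rust
// Toy sans-I/O state machine illustrating the drive pattern.
// `Cmd`, `Machine`, and `drive` are illustrative names only.
enum Cmd {
    Compute(u32),
}

struct Machine {
    pending: Vec<u32>,
    results: Vec<u32>,
}

impl Machine {
    fn new(work: Vec<u32>) -> Self {
        Machine { pending: work, results: Vec::new() }
    }

    // Caller pulls the next command; `None` means the machine is idle.
    fn poll_command(&mut self) -> Option<Cmd> {
        self.pending.pop().map(Cmd::Compute)
    }

    // Caller answers the command with a result it computed however it
    // likes -- the machine never blocks or spawns anything itself.
    fn handle_result(&mut self, value: u32) {
        self.results.push(value);
    }
}

fn drive(machine: &mut Machine) {
    while let Some(Cmd::Compute(n)) = machine.poll_command() {
        machine.handle_result(n * 2); // stand-in for a sync compute primitive
    }
}
```

The real loop in "Run an end-to-end alignment" below has the same shape, with `Command::Asr` and `Command::Alignment` in place of the toy `Cmd`.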

Quick start

The wav2vec2-base-960h tokenizer ships inside the crate (parsed at build time, no serde_json runtime dep) — only the encoder ONNX and the Whisper ggml checkpoint are BYO. Both files live above crates.io's 10 MB hard limit and cannot be bundled. Fetch them once with the pinned commands below; each one verifies SHA-256 before installing so a republished or truncated upstream surfaces as a hard failure rather than silently altering alignment output.

Whisper ggml model (ggml-large-v3-turbo.bin, ~1.6 GB)

WHISPERY_WHISPER_MODEL_SHA256="1fc70f774d38eb169993ac391eea357ef47c88757ef72ee5943879b7e8e2bc69"
mkdir -p models
TMP="$(mktemp "${TMPDIR:-/tmp}/ggml-large-v3-turbo.XXXXXXXXXX")"
curl --fail --location \
  --output "$TMP" \
  "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin"
ACTUAL="$(shasum -a 256 "$TMP" | awk '{print $1}')"
if [ "$ACTUAL" != "$WHISPERY_WHISPER_MODEL_SHA256" ]; then
  echo "SHA-256 mismatch: expected $WHISPERY_WHISPER_MODEL_SHA256, got $ACTUAL" >&2
  rm -f "$TMP"; exit 1
fi
mv "$TMP" models/ggml-large-v3-turbo.bin

Wav2vec2 alignment encoder (English, ~378 MB)

WHISPERY_W2V_EN_SHA256="00b7cc69516c1ab63c429e63a2b543e4d42bb77441ec5b98ee935de175b00de1"
TMP="$(mktemp "${TMPDIR:-/tmp}/wav2vec2-base-960h.XXXXXXXXXX")"
curl --fail --location \
  --output "$TMP" \
  "https://huggingface.co/onnx-community/wav2vec2-base-960h-ONNX/resolve/main/onnx/model.onnx"
ACTUAL="$(shasum -a 256 "$TMP" | awk '{print $1}')"
if [ "$ACTUAL" != "$WHISPERY_W2V_EN_SHA256" ]; then
  echo "SHA-256 mismatch: expected $WHISPERY_W2V_EN_SHA256, got $ACTUAL" >&2
  rm -f "$TMP"; exit 1
fi
mv "$TMP" models/wav2vec2-base-960h.onnx

(Whispery's build.rs can fetch both fixtures for you when WHISPERY_FETCH_MODEL=1 / WHISPERY_FETCH_W2V=1 are set on a cargo build. The script enforces the same SHA-256 pins. Plain cargo build makes no network requests.)

Run an end-to-end alignment

use std::path::Path;
use std::sync::{Arc, atomic::AtomicBool};
use whispery::{
  AlignWorkItem, Aligner, AlignerKey, AlignmentFallback,
  AlignmentSetBuilder, AsrChunkContext, AsrSource, EnglishNormalizer,
  Lang, WhisperAsrSource, WhisperContext, WhisperContextParameters,
  run_one_alignment,
  core::{Command, Transcriber, TranscriberConfig},
  ort::session::RunOptions,
};

let aligner = Aligner::from_paths(
  Lang::En,
  Path::new("models/wav2vec2-base-960h.onnx"),
  Path::new("models/wav2vec2-base-960h-tokenizer.json"),
  Box::new(EnglishNormalizer::new()),
)?;
let alignment_set = AlignmentSetBuilder::new()
  .with_fallback(AlignmentFallback::SkipChunk)
  .register(AlignerKey::Lang(Lang::En), aligner)
  .build();

let whisper_ctx = Arc::new(WhisperContext::new_with_params(
  Path::new("models/ggml-large-v3-turbo.bin"),
  WhisperContextParameters::default(),
)?);
let mut asr_source = WhisperAsrSource::new(whisper_ctx)?;

let mut transcriber = Transcriber::new(TranscriberConfig::default());
let abort_flag = Arc::new(AtomicBool::new(false));

while let Some(cmd) = transcriber.poll_command() {
  match cmd {
    Command::Asr { chunk_id, samples, params, .. } => {
      let result = asr_source.run_chunk(AsrChunkContext::new(
        &samples, &params, &abort_flag, chunk_id,
      ))?;
      transcriber.handle_asr(chunk_id, result)?;
    }
    Command::Alignment { chunk_id, samples, sub_segments: _, text, language, runs } => {
      // Sans-I/O OOV resolution: per-run detect + decide.
      // Each run gets its own decisions vec sized + ordered
      // by the events `detect_oov` produces for that run's
      // text + language. Whole-chunk fallback (when `runs`
      // is empty) gets one inner vec.
      // `default_oov_decisions` mirrors the historical
      // behaviour (alphanumeric → wildcard, pronounced
      // symbols → fail-closed); swap for
      // `wildcard_all_decisions` (WhisperX 1:1) or write
      // your own per-run / per-language policy.
      let oov_decisions: Vec<Vec<whispery::core::ResolvedOov>> = if runs.is_empty() {
        let events = alignment_set.detect_oov(&text, &language)?;
        vec![whispery::core::default_oov_decisions(&events)]
      } else {
        alignment_set.detect_oov_per_run(&runs)?
          .iter()
          .map(|events| whispery::core::default_oov_decisions(events))
          .collect()
      };

      let job = AlignWorkItem::from_run_alignment(
        &transcriber, chunk_id, samples, text, language,
        runs, abort_flag.clone(),
        oov_decisions,
      ).expect("chunk in flight");
      // Fresh `RunOptions` per chunk so a watchdog's
      // `terminate()` for chunk N does not poison chunk N+1.
      let run_options = RunOptions::new()?;
      let aligned = run_one_alignment(&alignment_set, &job, &run_options)?;
      transcriber.handle_alignment(chunk_id, aligned)?;
    }
  }
}
while let Some(_event) = transcriber.poll_event() {
  /* Transcript.words() carries word-level alignment */
}
# Ok::<(), Box<dyn std::error::Error>>(())

ORT_DYLIB_PATH overrides the default libonnxruntime lookup if you keep the dylib elsewhere. The alignment feature uses ort in load-dynamic mode — cargo build --features alignment succeeds on a clean toolchain (no system libonnxruntime needed at build time), but you must supply one at run time.

Async users (tokio, smol) should wrap WhisperAsrSource::run_chunk and run_one_alignment in spawn_blocking and wire shutdown by having their own cancellation tokens flip abort_flag. Calling the chunk's run_options.terminate() from another thread cancels in-flight ORT inference mid-call; the alignment pipeline also polls abort_flag between coarse stages.
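The abort-flag wiring can be sketched with std threads alone (no tokio/smol; `staged_compute` and `run_with_watchdog` are illustrative, not whispery APIs): a watchdog thread flips the shared flag, and a staged compute loop notices at its next poll.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// A long-running compute loop that polls a shared abort flag between
// stages -- the same cooperative-cancellation shape described above.
fn staged_compute(abort: &AtomicBool) -> Option<u64> {
    let mut acc = 0u64;
    for stage in 0..100 {
        if abort.load(Ordering::Relaxed) {
            return None; // cancelled between coarse stages
        }
        acc += stage;
        thread::sleep(Duration::from_millis(5)); // stand-in for real work
    }
    Some(acc)
}

fn run_with_watchdog() -> Option<u64> {
    let abort = Arc::new(AtomicBool::new(false));
    let watchdog_flag = abort.clone();
    // A watchdog (or an async runtime's cancellation token) flips the
    // flag from another thread; the compute loop sees it at the next poll.
    let watchdog = thread::spawn(move || {
        thread::sleep(Duration::from_millis(20));
        watchdog_flag.store(true, Ordering::Relaxed);
    });
    let result = staged_compute(&abort);
    watchdog.join().unwrap();
    result
}
```

For hard cancellation mid-stage, pair this with run_options.terminate() as noted above; the flag alone only takes effect at stage boundaries.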

Cargo features

| Feature | Default | What it enables |
| --- | --- | --- |
| std | yes | std-backed implementations of crate types. Chains std to mediatime, smol_str, and serde when present. |
| runner | yes | WhisperAsrSource + the in-house whispercpp 0.2.x bindings + the temperature retry ladder + the real-zlib compression-ratio gate (via miniz_oxide). Implies std. |
| alignment | no | wav2vec2 forced alignment via ort (load-dynamic) + tokenizers + ndarray. Lights up Aligner, AlignmentSet, run_one_alignment. Implies runner. |
| serde | no | Derives serde::{Serialize, Deserialize} on public state-machine types (Transcript, Word, AsrParams, …). Implies runner. |
| metal | no | Apple-only: enables whispercpp/metal so the encoder runs on the unified-memory Metal backend. Implies runner. |
| coreml | no | Apple-only: enables whispercpp/coreml so the encoder additionally dispatches to the ANE if the caller has produced a CoreML companion .mlmodelc. Implies runner. |
| bench-internals | no | Re-exports pub(crate) alignment internals (scalar/SIMD normaliser variants, raw ctc_viterbi, LogProbsTV) under whispery::__bench so the SIMD baseline bench can call them directly. Doc-hidden; never enable in shipping builds. Implies alignment. |
| parity-dump-emission | no | Diagnostic-only: writes wy_seg<N>.{emission,trellis}.bin + a wy_seg<N>.tokens.json companion to WHISPERY_PARITY_DUMP_TRELLIS whenever set. Implies alignment. Do NOT enable in shipping builds. |

The CTC parity tests run as part of the regular test suite — the OOV policy is per-test runtime data (no Cargo feature):

cargo test --features alignment,bench-internals --test whisperx_unit_parity
# 8/8 — tests 1-6 + 8 use `default_oov_decisions` (whispery
# default); test 7 (`4,9` digits-comma WhisperX issue #1372)
# uses `wildcard_all_decisions` to opt into WhisperX 1:1
# behaviour for pronounced symbols.

These tests port WhisperX's tests/test_word_timestamp_interpolation.py 1:1 onto whispery's CTC pipeline. The 193 alignment-pipeline lib tests pin the same algorithmic invariants stage-by-stage; median IoU 0.9955–0.9990 across 854 word pairs vs. WhisperX's recorded outputs (measured during initial calibration; see trellis_beam.rs:305-330).
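For reference, the interval-IoU metric those parity numbers use can be written in a few lines (`interval_iou` is an illustrative helper, not a whispery API): intersection over union of two `[start, end)` word timestamps in seconds, where 1.0 means the aligned word exactly matches the reference interval.

```rust
// IoU of two time intervals (start, end), each in seconds.
// Empty or disjoint overlap yields 0.0; identical intervals yield 1.0.
fn interval_iou(a: (f64, f64), b: (f64, f64)) -> f64 {
    let inter = (a.1.min(b.1) - a.0.max(b.0)).max(0.0);
    let union = (a.1 - a.0) + (b.1 - b.0) - inter;
    if union <= 0.0 { 0.0 } else { inter / union }
}
```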

Audio parity fixtures (~283 MB)

The 14 WAV clips + RTTM speaker annotations whispery's end-to-end parity tests reference live out-of-tree at Findit-AI/audio-fixtures so they don't bloat whispery's git history. Populate them locally with:

bash scripts/fetch_fixtures.sh

The script shallow-clones the sibling repo and lays files out under tests/parity/fixtures/<name>/{clip_16k.wav,reference.rttm} (the layout existing tests expect). Idempotent + cleans the clone unless WHISPERY_FIXTURES_KEEP_CLONE=1. CI runs the same script when WHISPERY_FETCH_FIXTURES=1 is set on the workflow run; default cargo test stays network-free.

License

whispery is distributed under the terms of both the MIT license and the Apache License (Version 2.0).

See LICENSE-APACHE, LICENSE-MIT for details.

Copyright (c) 2026 FinDIT studio authors.

Bundled-asset attributions propagate to downstream binaries

whispery embeds one third-party asset into every compiled binary, parsed via include_str! at build time:

| File | License | Source |
| --- | --- | --- |
| assets/wav2vec2_base_960h_tokenizer.json (parsed at build into Rust constants under OUT_DIR; bundled when alignment is on) | Apache-2.0 | facebook/wav2vec2-base-960h (tokenizer.json) |

The full SPDX expression for an alignment-enabled build is therefore (MIT OR Apache-2.0) AND Apache-2.0. When you redistribute a binary that depends on whispery, reproduce the attribution somewhere a recipient can find — for instance, in your application's "About" or third-party-licenses page.

Models you BYO at runtime (Whisper ggml checkpoints, wav2vec2 encoder ONNX, language-specific aligners) carry their own licenses — see the source links above and on each HuggingFace repo. Mirror copies under huggingface.co/FinDIT-Studio re-export upstream weights without modification; the upstream license applies.
