Sans-I/O cut/batch/whisper/align state machine for speech-to-text
indexing pipelines, inspired by WhisperX. Whispery itself owns no
threads, channels, or runtime — you drive it from a single thread
(or wrap the blocking calls in your async runtime), feeding in
samples + VAD decisions and pulling commands that the runner
answers via synchronous compute primitives (`AsrSource`,
`run_one_alignment`).
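The contract in miniature — these are not whispery's real types, just the shape of the drive loop: the machine owns state only, hands out work as commands, and the caller answers with whatever compute or runtime it likes. All names below (`Machine`, `poll_command`, `handle_result`) are illustrative.

```rust
/// Toy sans-I/O machine: it owns only state, never I/O or threads.
enum Command {
    Compute { input: u32 },
}

struct Machine {
    pending: Vec<u32>,
    results: Vec<u32>,
}

impl Machine {
    fn new(inputs: Vec<u32>) -> Self {
        Machine { pending: inputs, results: Vec::new() }
    }

    /// Hand the caller the next unit of work, if any.
    fn poll_command(&mut self) -> Option<Command> {
        self.pending.pop().map(|input| Command::Compute { input })
    }

    /// The caller ran the compute (inline, spawn_blocking, GPU, ...)
    /// and feeds the answer back into the machine.
    fn handle_result(&mut self, out: u32) {
        self.results.push(out);
    }
}

fn main() {
    let mut m = Machine::new(vec![1, 2, 3]);
    // The drive loop: poll a command, run it, feed the result back.
    while let Some(Command::Compute { input }) = m.poll_command() {
        m.handle_result(input * 10); // caller-owned compute
    }
    assert_eq!(m.results, vec![30, 20, 10]);
}
```

Because the machine never blocks or spawns, the same loop works unchanged under a sync CLI, a tokio worker, or a test harness feeding canned results.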
The wav2vec2-base-960h tokenizer ships inside the crate (parsed at
build time, no `serde_json` runtime dep) — only the encoder ONNX
and the Whisper ggml checkpoint are BYO. Both files sit above
crates.io's 10 MB hard limit and cannot be bundled. Fetch them
once with the pinned commands below; each command verifies the
SHA-256 before installing, so a republished or truncated upstream
surfaces as a hard failure rather than silently altering alignment
output.
```bash
WHISPERY_WHISPER_MODEL_SHA256="1fc70f774d38eb169993ac391eea357ef47c88757ef72ee5943879b7e8e2bc69"
mkdir -p models
TMP="$(mktemp "${TMPDIR:-/tmp}/ggml-large-v3-turbo.XXXXXXXXXX")"
curl --fail --location \
  --output "$TMP" \
  "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin"
ACTUAL="$(shasum -a 256 "$TMP" | awk '{print $1}')"
if [ "$ACTUAL" != "$WHISPERY_WHISPER_MODEL_SHA256" ]; then
  echo "SHA-256 mismatch: expected $WHISPERY_WHISPER_MODEL_SHA256, got $ACTUAL" >&2
  rm -f "$TMP"; exit 1
fi
mv "$TMP" models/ggml-large-v3-turbo.bin
```

```bash
WHISPERY_W2V_EN_SHA256="00b7cc69516c1ab63c429e63a2b543e4d42bb77441ec5b98ee935de175b00de1"
mkdir -p models
TMP="$(mktemp "${TMPDIR:-/tmp}/wav2vec2-base-960h.XXXXXXXXXX")"
curl --fail --location \
  --output "$TMP" \
  "https://huggingface.co/onnx-community/wav2vec2-base-960h-ONNX/resolve/main/onnx/model.onnx"
ACTUAL="$(shasum -a 256 "$TMP" | awk '{print $1}')"
if [ "$ACTUAL" != "$WHISPERY_W2V_EN_SHA256" ]; then
  echo "SHA-256 mismatch: expected $WHISPERY_W2V_EN_SHA256, got $ACTUAL" >&2
  rm -f "$TMP"; exit 1
fi
mv "$TMP" models/wav2vec2-base-960h.onnx
```

(Whispery's build.rs can fetch both fixtures for you when
`WHISPERY_FETCH_MODEL=1` / `WHISPERY_FETCH_W2V=1` are set on a
`cargo build` — e.g. `WHISPERY_FETCH_MODEL=1 WHISPERY_FETCH_W2V=1
cargo build --features alignment`. The build script enforces the
same SHA-256 pins. A plain `cargo build` makes no network
requests.)
```rust
use std::path::Path;
use std::sync::{Arc, atomic::AtomicBool};

use whispery::{
    AlignWorkItem, Aligner, AlignerKey, AlignmentFallback,
    AlignmentSetBuilder, AsrChunkContext, AsrSource, EnglishNormalizer,
    Lang, WhisperAsrSource, WhisperContext, WhisperContextParameters,
    run_one_alignment,
    core::{Command, Transcriber, TranscriberConfig},
    ort::session::RunOptions,
};

let aligner = Aligner::from_paths(
    Lang::En,
    Path::new("models/wav2vec2-base-960h.onnx"),
    Path::new("models/wav2vec2-base-960h-tokenizer.json"),
    Box::new(EnglishNormalizer::new()),
)?;
let alignment_set = AlignmentSetBuilder::new()
    .with_fallback(AlignmentFallback::SkipChunk)
    .register(AlignerKey::Lang(Lang::En), aligner)
    .build();

let whisper_ctx = Arc::new(WhisperContext::new_with_params(
    Path::new("models/ggml-large-v3-turbo.bin"),
    WhisperContextParameters::default(),
)?);
let mut asr_source = WhisperAsrSource::new(whisper_ctx)?;

let mut transcriber = Transcriber::new(TranscriberConfig::default());
let abort_flag = Arc::new(AtomicBool::new(false));

while let Some(cmd) = transcriber.poll_command() {
    match cmd {
        Command::Asr { chunk_id, samples, params, .. } => {
            let result = asr_source.run_chunk(AsrChunkContext::new(
                &samples, &params, &abort_flag, chunk_id,
            ))?;
            transcriber.handle_asr(chunk_id, result)?;
        }
        Command::Alignment { chunk_id, samples, sub_segments: _, text, language, runs } => {
            // Sans-I/O OOV resolution: per-run detect + decide.
            // Each run gets its own decisions vec sized + ordered
            // by the events `detect_oov` produces for that run's
            // text + language. Whole-chunk fallback (when `runs`
            // is empty) gets one inner vec.
            // `default_oov_decisions` mirrors the historical
            // behaviour (alphanumeric → wildcard, pronounced
            // symbols → fail-closed); swap for
            // `wildcard_all_decisions` (WhisperX 1:1) or write
            // your own per-run / per-language policy.
            let oov_decisions: Vec<Vec<whispery::core::ResolvedOov>> = if runs.is_empty() {
                let events = alignment_set.detect_oov(&text, &language)?;
                vec![whispery::core::default_oov_decisions(&events)]
            } else {
                alignment_set.detect_oov_per_run(&runs)?
                    .iter()
                    .map(|events| whispery::core::default_oov_decisions(events))
                    .collect()
            };
            let job = AlignWorkItem::from_run_alignment(
                &transcriber, chunk_id, samples, text, language,
                runs, abort_flag.clone(),
                oov_decisions,
            ).expect("chunk in flight");
            // Fresh `RunOptions` per chunk so a watchdog's
            // `terminate()` for chunk N does not poison chunk N+1.
            let run_options = RunOptions::new()?;
            let aligned = run_one_alignment(&alignment_set, &job, &run_options)?;
            transcriber.handle_alignment(chunk_id, aligned)?;
        }
    }
}

while let Some(_event) = transcriber.poll_event() {
    /* Transcript.words() carries word-level alignment */
}
# Ok::<(), Box<dyn std::error::Error>>(())
```

`ORT_DYLIB_PATH` overrides the default `libonnxruntime` lookup if
you keep the dylib elsewhere. The `alignment` feature uses `ort`
in load-dynamic mode — `cargo build --features alignment`
succeeds on a clean toolchain (no system `libonnxruntime` needed
at build time), but you must supply one at run time.
Async users (tokio, smol) wrap `WhisperAsrSource::run_chunk` and
`run_one_alignment` in `spawn_blocking` and wire shutdown via
their own cancellation tokens flipping `abort_flag`. Calling the
chunk's `run_options.terminate()` from another thread cancels
in-flight ORT inference mid-call; the alignment pipeline also
polls `abort_flag` between coarse stages.
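The abort handshake itself is plain `std` atomics. A minimal sketch of the polling pattern under the assumption that the compute loop checks the flag between coarse units of work — the names here (`run_stages`) are illustrative, not whispery API:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

/// Stand-in for a blocking compute pipeline that honours a shared
/// abort flag between coarse stages of work.
fn run_stages(abort: &AtomicBool, total: usize) -> usize {
    let mut done = 0;
    for _ in 0..total {
        if abort.load(Ordering::Relaxed) {
            break; // cooperative cancellation point
        }
        done += 1;
    }
    done
}

fn main() {
    let abort = Arc::new(AtomicBool::new(false));

    // Not cancelled: every stage runs.
    assert_eq!(run_stages(&abort, 4), 4);

    // A watchdog (or ctrl-c handler) on another thread flips the flag...
    let watchdog = {
        let abort = Arc::clone(&abort);
        thread::spawn(move || abort.store(true, Ordering::Relaxed))
    };
    watchdog.join().unwrap();

    // ...and the next poll observes it before doing more work.
    assert_eq!(run_stages(&abort, 4), 0);
}
```

Polling between stages only bounds latency to one stage's duration; for cancelling *inside* a long ORT call, that is what the per-chunk `RunOptions::terminate()` is for.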
| Feature | Default | What it enables |
|---|---|---|
| `std` | yes | std-backed implementations of crate types. Chains `std` to `mediatime`, `smol_str`, and `serde` when present. |
| `runner` | yes | `WhisperAsrSource` + the in-house `whispercpp` 0.2.x bindings + the temperature retry ladder + the real-zlib compression-ratio gate (via `miniz_oxide`). Implies `std`. |
| `alignment` | no | wav2vec2 forced alignment via `ort` (load-dynamic) + `tokenizers` + `ndarray`. Lights up `Aligner`, `AlignmentSet`, `run_one_alignment`. Implies `runner`. |
| `serde` | no | Derives `serde::{Serialize, Deserialize}` on public state-machine types (`Transcript`, `Word`, `AsrParams`, …). Implies `runner`. |
| `metal` | no | Apple-only: enables `whispercpp/metal` so the encoder runs on the unified-memory Metal backend. Implies `runner`. |
| `coreml` | no | Apple-only: enables `whispercpp/coreml` so the encoder additionally dispatches to the ANE when the caller has produced a CoreML companion `.mlmodelc`. Implies `runner`. |
| `bench-internals` | no | Re-exports `pub(crate)` alignment internals (scalar/SIMD normaliser variants, raw `ctc_viterbi`, `LogProbsTV`) under `whispery::__bench` so the SIMD baseline bench can call them directly. Doc-hidden; never enable in shipping builds. Implies `alignment`. |
| `parity-dump-emission` | no | Diagnostic-only: writes `wy_seg<N>.{emission,trellis}.bin` plus a `wy_seg<N>.tokens.json` companion to `WHISPERY_PARITY_DUMP_TRELLIS` whenever that variable is set. Implies `alignment`. Do NOT enable in shipping builds. |
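The `bench-internals` row mentions `ctc_viterbi` and `LogProbsTV` (a T×V log-probability matrix). For orientation, here is the textbook monotonic forced-alignment trellis such code descends from — illustrative only: no CTC blank handling, and not the crate's internal signature:

```rust
/// Textbook monotonic forced-alignment Viterbi (no CTC blank):
/// every frame is assigned to exactly one token, tokens appear in
/// order, and we maximise summed log-probability.
/// `logp` is T frames by V vocab entries; `tokens` indexes into V.
fn align(logp: &[Vec<f64>], tokens: &[usize]) -> (f64, Vec<usize>) {
    let (t_len, j_len) = (logp.len(), tokens.len());
    let neg = f64::NEG_INFINITY;

    // trellis[t][j]: best score with frame t assigned to token j.
    let mut trellis = vec![vec![neg; j_len]; t_len];
    trellis[0][0] = logp[0][tokens[0]];
    for t in 1..t_len {
        for j in 0..j_len {
            let stay = trellis[t - 1][j];                        // repeat token j
            let step = if j > 0 { trellis[t - 1][j - 1] } else { neg }; // advance
            let best = stay.max(step);
            if best > neg {
                trellis[t][j] = best + logp[t][tokens[j]];
            }
        }
    }

    // Backtrack: which token each frame landed on.
    let mut path = vec![0; t_len];
    let mut j = j_len - 1;
    let score = trellis[t_len - 1][j];
    for t in (1..t_len).rev() {
        path[t] = j;
        if j > 0 && trellis[t - 1][j - 1] >= trellis[t - 1][j] {
            j -= 1;
        }
    }
    path[0] = j;
    (score, path)
}

fn main() {
    // 4 frames, 2 vocab entries: token 0 is likely early, token 1
    // late, so the best path splits the frames 2/2.
    let logp = vec![
        vec![0.0, -10.0],
        vec![0.0, -10.0],
        vec![-10.0, 0.0],
        vec![-10.0, 0.0],
    ];
    let (score, path) = align(&logp, &[0, 1]);
    assert_eq!(path, vec![0, 0, 1, 1]);
    assert!(score.abs() < 1e-9);
}
```

Real CTC adds a blank row and per-token transition rules, but the max-and-backtrack shape of the trellis is the same; per-token frame spans then become word timestamps.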
The CTC parity tests run as part of the regular test suite — the OOV policy is per-test runtime data (no Cargo feature):

```bash
cargo test --features alignment,bench-internals --test whisperx_unit_parity
# 8/8 — tests 1-6 + 8 use `default_oov_decisions` (whispery
# default); test 7 (`4,9` digits-comma WhisperX issue #1372)
# uses `wildcard_all_decisions` to opt into WhisperX 1:1
# behaviour for pronounced symbols.
```

These tests port WhisperX's
`tests/test_word_timestamp_interpolation.py` 1:1 onto whispery's
CTC pipeline. The 193 alignment-pipeline lib tests pin the same
algorithmic invariants stage-by-stage; median IoU 0.9955–0.9990
across 854 word pairs vs. WhisperX's recorded outputs (measured
during initial calibration; see `trellis_beam.rs:305-330`).
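The IoU figure is the standard temporal intersection-over-union between a predicted word span and its reference span. A self-contained sketch of the metric (illustrative, not whispery API):

```rust
/// Temporal intersection-over-union between two word spans given
/// as (start, end) in seconds. 1.0 = identical, 0.0 = disjoint.
fn iou(a: (f64, f64), b: (f64, f64)) -> f64 {
    let inter = (a.1.min(b.1) - a.0.max(b.0)).max(0.0);
    let union = (a.1 - a.0) + (b.1 - b.0) - inter;
    if union <= 0.0 { 0.0 } else { inter / union }
}

fn main() {
    // Identical spans overlap perfectly.
    assert_eq!(iou((1.0, 2.0), (1.0, 2.0)), 1.0);
    // Half-second overlap of two one-second spans:
    // inter = 0.5, union = 1.5, IoU = 1/3.
    assert!((iou((0.0, 1.0), (0.5, 1.5)) - 1.0 / 3.0).abs() < 1e-12);
    // Disjoint spans score zero.
    assert_eq!(iou((0.0, 1.0), (2.0, 3.0)), 0.0);
}
```

A median IoU of 0.9955+ therefore means the typical predicted word boundary pair deviates from WhisperX's recorded boundaries by well under a percent of the word's duration.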
The 14 WAV clips + RTTM speaker annotations whispery's
end-to-end parity tests reference live out-of-tree at
Findit-AI/audio-fixtures so they don't bloat whispery's git
history. Populate them locally with:

```bash
bash scripts/fetch_fixtures.sh
```

The script shallow-clones the sibling repo and lays files out
under `tests/parity/fixtures/<name>/{clip_16k.wav,reference.rttm}`
(the layout the existing tests expect). It is idempotent and
cleans up the clone unless `WHISPERY_FIXTURES_KEEP_CLONE=1`. CI
runs the same script when `WHISPERY_FETCH_FIXTURES=1` is set on
the workflow run; a default `cargo test` stays network-free.
whispery is distributed under the terms of both the MIT license
and the Apache License (Version 2.0).
See LICENSE-APACHE and LICENSE-MIT for details.
Copyright (c) 2026 FinDIT studio authors.
whispery compiles one third-party asset into alignment-enabled
binaries at build time:

| File | License | Source |
|---|---|---|
| `assets/wav2vec2_base_960h_tokenizer.json` (parsed at build into Rust constants under `OUT_DIR`; bundled when `alignment` is on) | Apache-2.0 | facebook/wav2vec2-base-960h (`tokenizer.json`) |
The full SPDX expression for an alignment-enabled build is
therefore `(MIT OR Apache-2.0) AND Apache-2.0`. When you
redistribute a binary that depends on whispery, reproduce the
attribution somewhere a recipient can find it — for instance, in
your application's "About" or third-party-licenses page.
Models you BYO at runtime (Whisper ggml checkpoints, wav2vec2
encoder ONNX, language-specific aligners) carry their own
licenses — see the source links above and on each HuggingFace
repo. Mirror copies under
huggingface.co/FinDIT-Studio
re-export upstream weights without modification; the upstream
license applies.