Sans-I/O cut/batch/whisper/align state machine for speech-to-text
indexing pipelines, inspired by WhisperX. Whispery itself owns no
threads, channels, or runtime — you drive it from a single thread
(or wrap the blocking calls in your async runtime), feeding in
samples + VAD decisions and pulling commands that the runner
answers via synchronous compute primitives (`AsrSource`,
`run_one_alignment`).
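The contract in miniature — these are not whispery's real types, just the shape of the drive loop: the machine owns state only, hands out work as commands, and the caller answers with whatever compute or runtime it likes. All names below (`Machine`, `poll_command`, `handle_result`) are illustrative.

```rust
/// Toy sans-I/O machine: it owns only state, never I/O or threads.
enum Command {
    Compute { input: u32 },
}

struct Machine {
    pending: Vec<u32>,
    results: Vec<u32>,
}

impl Machine {
    fn new(inputs: Vec<u32>) -> Self {
        Machine { pending: inputs, results: Vec::new() }
    }

    /// Hand the caller the next unit of work, if any.
    fn poll_command(&mut self) -> Option<Command> {
        self.pending.pop().map(|input| Command::Compute { input })
    }

    /// The caller ran the compute (inline, spawn_blocking, GPU, ...)
    /// and feeds the answer back into the machine.
    fn handle_result(&mut self, out: u32) {
        self.results.push(out);
    }
}

fn main() {
    let mut m = Machine::new(vec![1, 2, 3]);
    // The drive loop: poll a command, run it, feed the result back.
    while let Some(Command::Compute { input }) = m.poll_command() {
        m.handle_result(input * 10); // caller-owned compute
    }
    assert_eq!(m.results, vec![30, 20, 10]);
}
```

Because the machine never blocks or spawns, the same loop works unchanged under a sync CLI, a tokio worker, or a test harness feeding canned results.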
The wav2vec2-base-960h tokenizer ships inside the crate (parsed at
build time, no `serde_json` runtime dep) — only the encoder ONNX
and the Whisper ggml checkpoint are BYO. Both files sit above
crates.io's 10 MB hard limit and cannot be bundled. Fetch them
once with the pinned commands below; each command verifies the
SHA-256 before installing, so a republished or truncated upstream
surfaces as a hard failure rather than silently altering alignment
output.
```bash
WHISPERY_WHISPER_MODEL_SHA256="1fc70f774d38eb169993ac391eea357ef47c88757ef72ee5943879b7e8e2bc69"
mkdir -p models
TMP="$(mktemp "${TMPDIR:-/tmp}/ggml-large-v3-turbo.XXXXXXXXXX")"
curl --fail --location \
  --output "$TMP" \
  "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin"
ACTUAL="$(shasum -a 256 "$TMP" | awk '{print $1}')"
if [ "$ACTUAL" != "$WHISPERY_WHISPER_MODEL_SHA256" ]; then
  echo "SHA-256 mismatch: expected $WHISPERY_WHISPER_MODEL_SHA256, got $ACTUAL" >&2
  rm -f "$TMP"; exit 1
fi
mv "$TMP" models/ggml-large-v3-turbo.bin
```

```bash
WHISPERY_W2V_EN_SHA256="00b7cc69516c1ab63c429e63a2b543e4d42bb77441ec5b98ee935de175b00de1"
mkdir -p models
TMP="$(mktemp "${TMPDIR:-/tmp}/wav2vec2-base-960h.XXXXXXXXXX")"
curl --fail --location \
  --output "$TMP" \
  "https://huggingface.co/onnx-community/wav2vec2-base-960h-ONNX/resolve/main/onnx/model.onnx"
ACTUAL="$(shasum -a 256 "$TMP" | awk '{print $1}')"
if [ "$ACTUAL" != "$WHISPERY_W2V_EN_SHA256" ]; then
  echo "SHA-256 mismatch: expected $WHISPERY_W2V_EN_SHA256, got $ACTUAL" >&2
  rm -f "$TMP"; exit 1
fi
mv "$TMP" models/wav2vec2-base-960h.onnx
```

(Whispery's build.rs can fetch both fixtures for you when
`WHISPERY_FETCH_MODEL=1` / `WHISPERY_FETCH_W2V=1` are set on a
`cargo build` — e.g. `WHISPERY_FETCH_MODEL=1 WHISPERY_FETCH_W2V=1
cargo build --features alignment`. The build script enforces the
same SHA-256 pins. A plain `cargo build` makes no network
requests.)
```rust
use std::path::Path;
use std::sync::{Arc, atomic::AtomicBool};

use whispery::{
    AlignWorkItem, Aligner, AlignerKey, AlignmentFallback,
    AlignmentSetBuilder, AsrChunkContext, AsrSource, EnglishNormalizer,
    Lang, WhisperAsrSource, WhisperContext, WhisperContextParameters,
    run_one_alignment,
    core::{Command, Transcriber, TranscriberConfig},
    ort::session::RunOptions,
};

let aligner = Aligner::from_paths(
    Lang::En,
    Path::new("models/wav2vec2-base-960h.onnx"),
    Path::new("models/wav2vec2-base-960h-tokenizer.json"),
    Box::new(EnglishNormalizer::new()),
)?;
let alignment_set = AlignmentSetBuilder::new()
    .with_fallback(AlignmentFallback::SkipChunk)
    .register(AlignerKey::Lang(Lang::En), aligner)
    .build();

let whisper_ctx = Arc::new(WhisperContext::new_with_params(
    Path::new("models/ggml-large-v3-turbo.bin"),
    WhisperContextParameters::default(),
)?);
let mut asr_source = WhisperAsrSource::new(whisper_ctx)?;

let mut transcriber = Transcriber::new(TranscriberConfig::default());
let abort_flag = Arc::new(AtomicBool::new(false));

while let Some(cmd) = transcriber.poll_command() {
    match cmd {
        Command::Asr { chunk_id, samples, params, .. } => {
            let result = asr_source.run_chunk(AsrChunkContext::new(
                &samples, &params, &abort_flag, chunk_id,
            ))?;
            transcriber.handle_asr(chunk_id, result)?;
        }
        Command::Alignment { chunk_id, samples, sub_segments: _, text, language, runs } => {
            // Sans-I/O OOV resolution: per-run detect + decide.
            // Each run gets its own decisions vec sized + ordered
            // by the events `detect_oov` produces for that run's
            // text + language. Whole-chunk fallback (when `runs`
            // is empty) gets one inner vec.
            // `default_oov_decisions` mirrors the historical
            // behaviour (alphanumeric → wildcard, pronounced
            // symbols → fail-closed); swap for
            // `wildcard_all_decisions` (WhisperX 1:1) or write
            // your own per-run / per-language policy.
            let oov_decisions: Vec<Vec<whispery::core::ResolvedOov>> = if runs.is_empty() {
                let events = alignment_set.detect_oov(&text, &language)?;
                vec![whispery::core::default_oov_decisions(&events)]
            } else {
                alignment_set.detect_oov_per_run(&runs)?
                    .iter()
                    .map(|events| whispery::core::default_oov_decisions(events))
                    .collect()
            };
            let job = AlignWorkItem::from_run_alignment(
                &transcriber, chunk_id, samples, text, language,
                runs, abort_flag.clone(),
                oov_decisions,
            ).expect("chunk in flight");
            // Fresh `RunOptions` per chunk so a watchdog's
            // `terminate()` for chunk N does not poison chunk N+1.
            let run_options = RunOptions::new()?;
            let aligned = run_one_alignment(&alignment_set, &job, &run_options)?;
            transcriber.handle_alignment(chunk_id, aligned)?;
        }
    }
}

while let Some(_event) = transcriber.poll_event() {
    /* Transcript.words() carries word-level alignment */
}
# Ok::<(), Box<dyn std::error::Error>>(())
```

`ORT_DYLIB_PATH` overrides the default `libonnxruntime` lookup if
you keep the dylib elsewhere. The `alignment` feature uses `ort`
in load-dynamic mode — `cargo build --features alignment`
succeeds on a clean toolchain (no system `libonnxruntime` needed
at build time), but you must supply one at run time.
Async users (tokio, smol) wrap `WhisperAsrSource::run_chunk` and
`run_one_alignment` in `spawn_blocking` and wire shutdown via
their own cancellation tokens flipping `abort_flag`. Calling the
chunk's `run_options.terminate()` from another thread cancels
in-flight ORT inference mid-call; the alignment pipeline also
polls `abort_flag` between coarse stages.
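The abort handshake itself is plain `std` atomics. A minimal sketch of the polling pattern under the assumption that the compute loop checks the flag between coarse units of work — the names here (`run_stages`) are illustrative, not whispery API:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

/// Stand-in for a blocking compute pipeline that honours a shared
/// abort flag between coarse stages of work.
fn run_stages(abort: &AtomicBool, total: usize) -> usize {
    let mut done = 0;
    for _ in 0..total {
        if abort.load(Ordering::Relaxed) {
            break; // cooperative cancellation point
        }
        done += 1;
    }
    done
}

fn main() {
    let abort = Arc::new(AtomicBool::new(false));

    // Not cancelled: every stage runs.
    assert_eq!(run_stages(&abort, 4), 4);

    // A watchdog (or ctrl-c handler) on another thread flips the flag...
    let watchdog = {
        let abort = Arc::clone(&abort);
        thread::spawn(move || abort.store(true, Ordering::Relaxed))
    };
    watchdog.join().unwrap();

    // ...and the next poll observes it before doing more work.
    assert_eq!(run_stages(&abort, 4), 0);
}
```

Polling between stages only bounds latency to one stage's duration; for cancelling *inside* a long ORT call, that is what the per-chunk `RunOptions::terminate()` is for.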
| Feature | Default | What it enables |
|---|---|---|
| `std` | yes | std-backed implementations of crate types. Chains `std` to `mediatime`, `smol_str`, and `serde` when present. |
| `runner` | yes | `WhisperAsrSource` + the in-house `whispercpp` 0.2.x bindings + the temperature retry ladder + the real-zlib compression-ratio gate (via `miniz_oxide`). Implies `std`. |
| `alignment` | no | wav2vec2 forced alignment via `ort` (load-dynamic) + `tokenizers` + `ndarray`. Lights up `Aligner`, `AlignmentSet`, `run_one_alignment`. Implies `runner`. |
| `serde` | no | Derives `serde::{Serialize, Deserialize}` on public state-machine types (`Transcript`, `Word`, `AsrParams`, …). Implies `runner`. |
| `metal` | no | Apple-only: enables `whispercpp/metal` so the encoder runs on the unified-memory Metal backend. Implies `runner`. |
| `coreml` | no | Apple-only: enables `whispercpp/coreml` so the encoder additionally dispatches to the ANE when the caller has produced a CoreML companion `.mlmodelc`. Implies `runner`. |
| `bench-internals` | no | Re-exports `pub(crate)` alignment internals (scalar/SIMD normaliser variants, raw `ctc_viterbi`, `LogProbsTV`) under `whispery::__bench` so the SIMD baseline bench can call them directly. Doc-hidden; never enable in shipping builds. Implies `alignment`. |
| `parity-dump-emission` | no | Diagnostic-only: writes `wy_seg<N>.{emission,trellis}.bin` plus a `wy_seg<N>.tokens.json` companion to `WHISPERY_PARITY_DUMP_TRELLIS` whenever that variable is set. Implies `alignment`. Do NOT enable in shipping builds. |
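The `bench-internals` row mentions `ctc_viterbi` and `LogProbsTV` (a T×V log-probability matrix). For orientation, here is the textbook monotonic forced-alignment trellis such code descends from — illustrative only: no CTC blank handling, and not the crate's internal signature:

```rust
/// Textbook monotonic forced-alignment Viterbi (no CTC blank):
/// every frame is assigned to exactly one token, tokens appear in
/// order, and we maximise summed log-probability.
/// `logp` is T frames by V vocab entries; `tokens` indexes into V.
fn align(logp: &[Vec<f64>], tokens: &[usize]) -> (f64, Vec<usize>) {
    let (t_len, j_len) = (logp.len(), tokens.len());
    let neg = f64::NEG_INFINITY;

    // trellis[t][j]: best score with frame t assigned to token j.
    let mut trellis = vec![vec![neg; j_len]; t_len];
    trellis[0][0] = logp[0][tokens[0]];
    for t in 1..t_len {
        for j in 0..j_len {
            let stay = trellis[t - 1][j];                        // repeat token j
            let step = if j > 0 { trellis[t - 1][j - 1] } else { neg }; // advance
            let best = stay.max(step);
            if best > neg {
                trellis[t][j] = best + logp[t][tokens[j]];
            }
        }
    }

    // Backtrack: which token each frame landed on.
    let mut path = vec![0; t_len];
    let mut j = j_len - 1;
    let score = trellis[t_len - 1][j];
    for t in (1..t_len).rev() {
        path[t] = j;
        if j > 0 && trellis[t - 1][j - 1] >= trellis[t - 1][j] {
            j -= 1;
        }
    }
    path[0] = j;
    (score, path)
}

fn main() {
    // 4 frames, 2 vocab entries: token 0 is likely early, token 1
    // late, so the best path splits the frames 2/2.
    let logp = vec![
        vec![0.0, -10.0],
        vec![0.0, -10.0],
        vec![-10.0, 0.0],
        vec![-10.0, 0.0],
    ];
    let (score, path) = align(&logp, &[0, 1]);
    assert_eq!(path, vec![0, 0, 1, 1]);
    assert!(score.abs() < 1e-9);
}
```

Real CTC adds a blank row and per-token transition rules, but the max-and-backtrack shape of the trellis is the same; per-token frame spans then become word timestamps.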
The CTC parity tests run as part of the regular test suite — the OOV policy is per-test runtime data (no Cargo feature):

```bash
cargo test --features alignment,bench-internals --test whisperx_unit_parity
# 8/8 — tests 1-6 + 8 use `default_oov_decisions` (whispery
# default); test 7 (`4,9` digits-comma WhisperX issue #1372)
# uses `wildcard_all_decisions` to opt into WhisperX 1:1
# behaviour for pronounced symbols.
```

These tests port WhisperX's
`tests/test_word_timestamp_interpolation.py` 1:1 onto whispery's
CTC pipeline. The 193 alignment-pipeline lib tests pin the same
algorithmic invariants stage-by-stage; median IoU 0.9955–0.9990
across 854 word pairs vs. WhisperX's recorded outputs (measured
during initial calibration; see `trellis_beam.rs:305-330`).
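The IoU figure is the standard temporal intersection-over-union between a predicted word span and its reference span. A self-contained sketch of the metric (illustrative, not whispery API):

```rust
/// Temporal intersection-over-union between two word spans given
/// as (start, end) in seconds. 1.0 = identical, 0.0 = disjoint.
fn iou(a: (f64, f64), b: (f64, f64)) -> f64 {
    let inter = (a.1.min(b.1) - a.0.max(b.0)).max(0.0);
    let union = (a.1 - a.0) + (b.1 - b.0) - inter;
    if union <= 0.0 { 0.0 } else { inter / union }
}

fn main() {
    // Identical spans overlap perfectly.
    assert_eq!(iou((1.0, 2.0), (1.0, 2.0)), 1.0);
    // Half-second overlap of two one-second spans:
    // inter = 0.5, union = 1.5, IoU = 1/3.
    assert!((iou((0.0, 1.0), (0.5, 1.5)) - 1.0 / 3.0).abs() < 1e-12);
    // Disjoint spans score zero.
    assert_eq!(iou((0.0, 1.0), (2.0, 3.0)), 0.0);
}
```

A median IoU of 0.9955+ therefore means the typical predicted word boundary pair deviates from WhisperX's recorded boundaries by well under a percent of the word's duration.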
The 14 WAV clips + RTTM speaker annotations whispery's
end-to-end parity tests reference live out-of-tree at
Findit-AI/audio-fixtures so they don't bloat whispery's git
history. Populate them locally with:

```bash
bash scripts/fetch_fixtures.sh
```

The script shallow-clones the sibling repo and lays files out
under `tests/parity/fixtures/<name>/{clip_16k.wav,reference.rttm}`
(the layout the existing tests expect). It is idempotent and
cleans up the clone unless `WHISPERY_FIXTURES_KEEP_CLONE=1`. CI
runs the same script when `WHISPERY_FETCH_FIXTURES=1` is set on
the workflow run; a default `cargo test` stays network-free.
whispery is distributed under the terms of both the MIT license
and the Apache License (Version 2.0).
See LICENSE-APACHE and LICENSE-MIT for details.
Copyright (c) 2026 FinDIT studio authors.
whispery compiles one third-party asset into alignment-enabled
binaries at build time:

| File | License | Source |
|---|---|---|
| `assets/wav2vec2_base_960h_tokenizer.json` (parsed at build into Rust constants under `OUT_DIR`; bundled when `alignment` is on) | Apache-2.0 | facebook/wav2vec2-base-960h (`tokenizer.json`) |
The full SPDX expression for an alignment-enabled build is
therefore `(MIT OR Apache-2.0) AND Apache-2.0`. When you
redistribute a binary that depends on whispery, reproduce the
attribution somewhere a recipient can find it — for instance, in
your application's "About" or third-party-licenses page.
Models you BYO at runtime (Whisper ggml checkpoints, wav2vec2
encoder ONNX, language-specific aligners) carry their own
licenses — see the source links above and on each HuggingFace
repo. Mirror copies under
huggingface.co/FinDIT-Studio
re-export upstream weights without modification; the upstream
license applies.