fix(asr): add melChunkContext opt-out flag for Issue #594 #596
Alex-Wengg wants to merge 1 commit into
Conversation
✅ Japanese ASR Benchmark Results (CTC) • Status: Passed
✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly. View benchmark log
Parakeet EOU Benchmark Results ✅ • Status: Benchmark passed • Performance Metrics
Streaming Metrics
Test runtime: 1m25s • 05/12/2026, 09:08 AM EST • RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O
Qwen3-ASR int8 Smoke Test ✅
Performance Metrics
Runtime: 5m8s • Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.
Speaker Diarization Benchmark Results • Speaker Diarization Performance • Evaluating "who spoke when" detection accuracy
Diarization Pipeline Timing Breakdown • Time spent in each stage of speaker diarization
Speaker Diarization Research Comparison • Research baselines typically achieve 18-30% DER on standard datasets
Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:
🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 34.0s diarization time • Test runtime: 2m 30s • 05/12/2026, 09:12 AM EST
Kokoro TTS Smoke Test ✅
Runtime: 0m51s • Note: Kokoro TTS uses CoreML flow matching + Vocos vocoder. CI VM lacks physical ANE — performance may differ from Apple Silicon.
PocketTTS Smoke Test ✅
Runtime: 0m21s • Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality and performance may differ from Apple Silicon.
ASR Benchmark Results ✅ • Status: All benchmarks passed • Parakeet v3 (multilingual)
Parakeet v2 (English-optimized)
Streaming (v3)
Streaming (v2)
Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming • 25 files per dataset • Test runtime: 9m2s • 05/12/2026, 09:15 AM EST • RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time • Expected RTFx Performance on Physical M1 Hardware: M1 Mac: ~28x (clean), ~25x (other) • Testing methodology follows HuggingFace Open ASR Leaderboard
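The RTFx definition above can be made concrete with a small sketch (Swift; the function name is illustrative, and the sample figures are taken from the Speaker Diarization comment earlier on this page):

```swift
// RTFx as defined in the benchmark note above: total audio duration
// divided by total processing time (higher is better).
func rtfx(audioSeconds: Double, processingSeconds: Double) -> Double {
    audioSeconds / processingSeconds
}

// e.g. the diarization run above: 1049.0 s of audio processed in 34.0 s
// gives roughly 30.9x real-time.
let diarizationRTFx = rtfx(audioSeconds: 1049.0, processingSeconds: 34.0)
```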
Offline VBx Pipeline Results • Speaker Diarization Performance (VBx Batch Mode) • Optimal clustering with Hungarian algorithm for maximum accuracy
Offline VBx Pipeline Timing Breakdown • Time spent in each stage of batch diarization
Speaker Diarization Research Comparison • Offline VBx achieves competitive accuracy with batch processing
Pipeline Details:
🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 125.3s processing • Test runtime: 2m 5s • 05/12/2026, 09:03 AM EST
Sortformer High-Latency Benchmark Results • ES2004a Performance (30.4s latency config)
Sortformer High-Latency • ES2004a • Runtime: 4m 21s • 2026-05-12T13:11:56.186Z
VAD Benchmark Results • Performance Comparison
Dataset Details
✅: Average F1-Score above 70%
…Offset In processWithMultilingualContinuity, the audio buffer for chunks N>=1 starts at `contextStart = chunkStart - contextSamples` (2.0s of real audio prepended for encoder left context), but transcribeChunk was being given `chunkStart` as its origin. Inside transcribeChunk, `globalFrameOffset = chunkStart / samplesPerEncoderFrame` then placed every token in chunks N>=1 at +25 frames (+2.0s) past its true position in the original audio timeline. Consequence: chunk N's prefix tokens (covering audio that actually overlaps chunk N-1's tail) landed in timestamp space _after_ chunk N-1's end, beyond the merger's 1.0s halfOverlapWindow tolerance. LCS and contiguous matchers could not anchor across the boundary, so every seam fell through to mergeByMidpoint, which duplicated ~2s of content at every chunk join. Pass `contextStart` instead. Prefix tokens now overlap chunk N-1 correctly, LCS matches anchor properly, and the merger can dedupe as designed. LibriSpeech test-clean (100 files, flag ON): 4.19% → 2.90% WER. Flag-OFF unchanged at 2.6403%. French fix preserved. Credit: Devin AI review on PR #596.
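The frame arithmetic in this commit message can be sketched as follows (a hypothetical sketch: `globalFrameOffset` here is a free function and the sample positions are invented, not the real `transcribeChunk` internals):

```swift
// Offset arithmetic from the commit message: frames are 1280 samples
// (80 ms at 16 kHz) and the prepended left context is 2.0 s.
let samplesPerEncoderFrame = 1_280
let contextSamples = 32_000            // 2.0 s of real audio prepended

func globalFrameOffset(bufferOrigin: Int) -> Int {
    bufferOrigin / samplesPerEncoderFrame
}

let chunkStart = 160_000               // say chunk N >= 1 starts at 10.0 s
let contextStart = chunkStart - contextSamples

// Buggy: using chunkStart as origin shifts every token +25 frames (+2.0 s)
let buggyOffset = globalFrameOffset(bufferOrigin: chunkStart)     // 125
// Fixed: contextStart matches where the audio buffer really begins
let fixedOffset = globalFrameOffset(bufferOrigin: contextStart)   // 100

// The delta is exactly the prepended context: 25 frames
assert(buggyOffset - fixedOffset == contextSamples / samplesPerEncoderFrame)
```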
@claude review
Force-pushed a416faa to dde2eb0 • Compare
## Summary

Adds `.github/workflows/claude.yml` so the repo can respond to `@claude` mentions in issues, issue comments, PR reviews, and PR review comments via [anthropics/claude-code-action@v1](https://github.com/anthropics/claude-code-action). Motivation: PR #596 had a reviewer post `@claude review` and nothing happened because no workflow was wired up. This PR fixes that for future reviews.

## What it does

- Triggers on `issue_comment`, `pull_request_review_comment`, `pull_request_review`, `issues` (opened/assigned)
- Job runs only when the body/title contains `@claude` (cheap filter, prevents wasted runs)
- Uses `ANTHROPIC_API_KEY` repo secret for auth
- Minimal `read` permissions on contents/PRs/issues; `id-token: write` for OIDC

## Required configuration (repo settings)

Before this workflow can run, a maintainer needs to:

1. Install the [Claude GitHub App](https://github.com/apps/claude) on `FluidInference/FluidAudio`
2. Add an `ANTHROPIC_API_KEY` secret in repo Settings -> Secrets and variables -> Actions

Without those, the workflow file is inert (no failed runs, just no-op).

## Test plan

- [ ] Maintainer installs the Claude GitHub App and sets `ANTHROPIC_API_KEY`
- [ ] After merge, post `@claude help` on a throwaway issue and confirm the workflow fires
- [ ] Confirm non-`@claude` comments do not trigger the job
Force-pushed dde2eb0 to 9bb7e2c • Compare
@claude review
Claude encountered an error • View job • "I'll analyze this and get back to you."
@claude review
Claude encountered an error • View job • "I'll analyze this and get back to you."
Force-pushed 9bb7e2c to 9aa1745 • Compare
@claude review
Claude finished @Alex-Wengg's task in 3m 52s • View job

PR Review

Summary: This PR implements an opt-in `multilingualChunkContinuity` path.

Devin Review Finding — Already Fixed ✅
The Devin comment flagged the `chunkStart` origin bug; this is correctly fixed in:
`chunkStart: contextStart, // was chunkStart in the original commit`

Issues Found
1. No unit tests for the new code path. Severity: Medium — Policy violation (CLAUDE.md: "Add unit tests when writing new code")
2. Stale WER figure in `AsrTypes.swift`
✅ Japanese ASR Benchmark Results (CTC)Status: Passed
✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly. View benchmark log |
✅ Japanese ASR Benchmark Results (CTC)Status: Passed
✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly. View benchmark log |
✅ Japanese ASR Benchmark Results (CTC)Status: Passed
✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly. View benchmark log |
✅ Japanese ASR Benchmark Results (CTC)Status: Passed
✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly. View benchmark log |
Three followups from the bot review of PR #596:

1. Update stale WER doc comment in AsrTypes.swift

The `multilingualChunkContinuity` doc said "~4.40% English WER", which referred to an intermediate variant before the Devin globalFrameOffset fix landed. The validated landed number is 2.90% (vs 2.64% with flag off, a +0.26pp cost).

2. Warn when `multilingualChunkContinuity=true` on a non-v3 model

The flag's sequential serialization + 2.0s audio prefix is designed to mitigate parakeet-tdt-0.6b-v3 English-prior drift. On v2 / tdtCtc110m / ctcZhCn / tdtJa it still produces correct output, but only adds latency with no benefit, so log a warning once when the path is entered with a non-v3 model.

3. Unit tests for the new code path (CLAUDE.md policy: "Add unit tests when writing new code")

- ASRConfigTests: multilingualChunkContinuity defaults to false, preserves explicit true/false, and doesn't disturb other fields.
- ChunkProcessorTests (via #if DEBUG accessors):
  - audio prefix is exactly 32000 samples (2.0s @ 16kHz), encoder-frame-aligned (multiple of 1280).
  - multilingual chunk size + prefix ≤ maxModelSamples (240000), and chunk size is frame-aligned.
  - multilingual chunk size is strictly smaller than default chunk size (it has to give up content to make prefix room).
  - chunkSamples(multilingualContinuity:) dispatches correctly to either the default or multilingual sizing path.
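The ChunkProcessorTests assertions listed above can be sketched in XCTest form (the test class name and the computed chunk size are illustrative; the constants come from the commit message):

```swift
import XCTest

// Sketch of the frame-alignment assertions described in the commit message.
// The 32000-sample prefix, 1280-sample encoder frames, and 240000-sample
// model cap are from the PR text; the floor-to-frame chunk computation is
// an assumption about how the real ChunkProcessor sizes chunks.
final class ChunkSizingSketchTests: XCTestCase {
    let samplesPerEncoderFrame = 1_280   // 80 ms at 16 kHz
    let prefixSamples = 32_000           // 2.0 s at 16 kHz
    let maxModelSamples = 240_000        // 15 s encoder window

    func testPrefixIsEncoderFrameAligned() {
        XCTAssertEqual(prefixSamples % samplesPerEncoderFrame, 0)
        XCTAssertEqual(prefixSamples / samplesPerEncoderFrame, 25)
    }

    func testChunkPlusPrefixFitsModelWindow() {
        // Assumed: chunk size is the frame-aligned floor of (cap - prefix)
        let multilingualChunk =
            ((maxModelSamples - prefixSamples) / samplesPerEncoderFrame)
            * samplesPerEncoderFrame
        XCTAssertEqual(multilingualChunk % samplesPerEncoderFrame, 0)
        XCTAssertLessThanOrEqual(multilingualChunk + prefixSamples, maxModelSamples)
        XCTAssertLessThan(multilingualChunk, maxModelSamples)  // strictly smaller
    }
}
```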
```swift
let (windowTokens, windowTimestamps, windowConfidences, windowDurations) =
    try await Self
        .transcribeChunk(
            samples: chunkSamplesArray,
            contextSamples: 0,  // Pass 0 to disable contextFrameAdjustment
            chunkStart: contextStart,
            isLastChunk: isLastChunk,
            using: manager,
            decoderState: &decoderState,
            maxModelSamples: maxModelSamples,
            language: language
        )
```
🟡 timeJump=0 special case causes decoder to skip the entire 2.0s audio prefix in multilingual continuity path
In processWithMultilingualContinuity, the decoder state (including timeJump) is persisted across chunks. After each non-last chunk, decoderState.timeJump is set to currentTimeIndices - effectiveSequenceLength (TdtDecoderV3.swift:574). When this value is exactly 0 (decoder finished at the boundary), the next chunk's calculateInitialTimeIndices (TdtFrameNavigation.swift:42-44) hits a special case that returns ASRConstants.standardOverlapFrames (25 frames = 2.0s) — which is the exact length of the audio prefix. This causes the decoder to skip the entire prefix region.
The comment at line 341 says "let the decoder emit naturally through the prefix region" and the LCS/midpoint merger is expected to discard prefix tokens (line 342-344), but when timeJump=0 no prefix tokens are produced. The behavior is inconsistent: timeJump=1 → decoder starts at frame 1 (processes prefix), timeJump=0 → decoder starts at frame 25 (skips prefix entirely). The standardOverlapFrames special case was designed for streaming overlap, not for the batch multilingual prefix. The encoder still benefits from the prefix audio (it processes the full buffer), and the 2.0s stride overlap still provides merge anchors, but the decoder's frame-level behavior becomes non-deterministically dependent on whether timeJump is exactly 0.
Prompt for agents
The processWithMultilingualContinuity function persists decoderState across chunks, including timeJump. After each chunk, TdtDecoderV3.decodeWithTimings sets decoderState.timeJump via calculateFinalTimeJump. On the next chunk, calculateInitialTimeIndices (TdtFrameNavigation.swift:42-44) returns standardOverlapFrames (25) when timeJump==0 and contextFrameAdjustment==0, which skips the entire 2.0s audio prefix.
The fix should ensure the decoder starts at frame 0 for non-first chunks in the multilingual continuity path, so it processes through the prefix region as intended. Options:
1. Clear decoderState.timeJump to nil after each chunk in processWithMultilingualContinuity (before the next iteration), so calculateInitialTimeIndices returns contextFrameAdjustment=0 via its nil guard path.
2. Or pass a negative contextFrameAdjustment to counteract the standardOverlapFrames special case.
Option 1 is simpler: add decoderState.timeJump = nil after the transcribeChunk call (or at the end of the while loop body, before chunkStart += strideSamples). This preserves LSTM hidden/cell state and lastToken while ensuring the decoder always starts at frame 0 on non-first chunks.
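Option 1 above could look roughly like this (a sketch, not the real FluidAudio code: `DecoderState` stands in for `TdtDecoderState`, and the loop body elides the actual `transcribeChunk` call):

```swift
// Sketch of clearing timeJump after each non-last chunk so the next
// decode starts at frame 0 of a buffer that already begins with the
// 2.0 s prefix, avoiding the timeJump==0 special case.
struct DecoderState {
    var timeJump: Int?   // written by the decoder after each chunk
    // LSTM hidden/cell state, lastToken, etc. are preserved untouched
}

var decoderState = DecoderState(timeJump: 0)
var chunkStart = 0
let strideSamples = 160_000
let totalSamples = 480_000

while chunkStart < totalSamples {
    // ... transcribeChunk(..., decoderState: &decoderState) ...
    // Clear timeJump so calculateInitialTimeIndices takes its nil guard
    // path and the next chunk decodes from frame 0 through the prefix.
    decoderState.timeJump = nil
    chunkStart += strideSamples
}
```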
```swift
// Phase 2: filter punctuation out of LCS candidates to prevent false anchors
let punctuationAwareMatcher: (IndexedToken, IndexedToken) -> Bool = { l, r in
    if ASRConstants.punctuationTokens.contains(l.token.token) {
        return false
    }
    return timeTolerantMatcher(l, r)
}
```
🟡 Shared merge logic changes (punctuation-aware LCS + midpoint) modify default parallel path behavior contrary to documentation
The mergeChunks and mergeByMidpoint functions are shared between both the parallel and multilingual continuity paths. This PR modifies both: the LCS fallback now uses punctuationAwareMatcher (line 599-604) that rejects punctuation tokens as anchors, and mergeByMidpoint (line 687-694) walks backward past trailing punctuation to compute effectiveLeftEnd. These changes affect the default parallel path even when multilingualChunkContinuity = false.
The processWithParallelChunks doc comment at line 148 states "identical to origin/main behavior" and the ASRConfig doc at AsrTypes.swift:30-32 claims "WER 2.64% on test-clean" for the default flag-off path. Both claims are now inaccurate since the merge logic has changed. The merge improvements are likely beneficial (punctuation tokens like . ? ! can cause false LCS anchors at chunk boundaries), but the behavioral change is not gated by the multilingualChunkContinuity flag and could shift the English baseline WER.
Prompt for agents
The punctuation-aware LCS matcher at ChunkProcessor.swift:598-604 and the trailing-punctuation adjustment in mergeByMidpoint at ChunkProcessor.swift:687-694 are applied unconditionally in the shared mergeChunks function, affecting both the parallel (default) and multilingual continuity paths.
If the intent is to preserve bit-for-bit behavior for the default path:
- Gate the punctuationAwareMatcher behind a flag (e.g. pass multilingualContinuity through to mergeChunks, and only use punctuationAwareMatcher when true; use timeTolerantMatcher otherwise).
- Similarly gate the effectiveLeftEnd adjustment in mergeByMidpoint.
If the intent is to improve both paths (likely, given the changes are labeled Phase 2):
- Update the processWithParallelChunks doc comment (line 148) to remove the claim of identical to origin/main behavior.
- Update the ASRConfig doc comment (AsrTypes.swift:30-32) to note that merge improvements may slightly shift the WER figure.
- Re-run the LibriSpeech test-clean benchmark with the flag off to confirm the new baseline WER.
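If the gating route were chosen, the idea can be sketched as follows (all types here are simplified stand-ins for `ChunkProcessor`'s real `IndexedToken`, `timeTolerantMatcher`, and `ASRConstants.punctuationTokens`; the token ids are invented):

```swift
// Sketch: select the LCS matcher based on the continuity flag so the
// default parallel path keeps its original matcher unchanged.
struct Token { let token: Int }
struct IndexedToken { let token: Token }

let punctuationTokens: Set<Int> = [7, 8, 9]   // assumed punctuation ids

// Simplified stand-in for the real time-tolerant matcher
let timeTolerantMatcher: (IndexedToken, IndexedToken) -> Bool = { l, r in
    l.token.token == r.token.token
}

func lcsMatcher(multilingualContinuity: Bool)
    -> (IndexedToken, IndexedToken) -> Bool
{
    // Default path: original behavior, no punctuation filtering
    guard multilingualContinuity else { return timeTolerantMatcher }
    // Continuity path: reject punctuation tokens as LCS anchors
    return { l, r in
        if punctuationTokens.contains(l.token.token) { return false }
        return timeTolerantMatcher(l, r)
    }
}
```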
Devin PR-596 review (2nd pass) flagged two issues:

1. (real bug) `processWithMultilingualContinuity` persists `timeJump` across chunks. `TdtDecoderV3.decodeWithTimings` writes `decoderState.timeJump = currentTimeIndices - effectiveSequenceLength` for non-last chunks. On the next chunk, `TdtFrameNavigation.calculateInitialTimeIndices` either skips into the 2.0s prefix region (prevTimeJump > 0) or, when prevTimeJump == 0, returns the special-case `standardOverlapFrames` (25 = exactly the prefix length), which skips the prefix entirely. Either case breaks the merger's ability to anchor tokens across the chunk seam. Fix: explicitly set `decoderState.timeJump = nil` after each non-last chunk so the next chunk's decoder starts at frame 0 of the buffer (which already begins with the 2.0s prefix). The punctuation-seam `reset()` already nils timeJump as a side effect; this clear handles every other boundary the same way.

2. (doc accuracy) Both `ChunkProcessor.processWithParallelChunks` and `ASRConfig.multilingualChunkContinuity` claimed the default path was bit-for-bit identical to pre-#596 main. The shared `mergeChunks` / `mergeByMidpoint` punctuation-aware LCS matcher and trailing-punctuation midpoint adjustment apply to both paths. Empirically WER-neutral on LibriSpeech test-clean (validated 2.64%), but the merger algorithm is no longer literally identical. Doc updated to reflect this.

Tests: TdtDecoderStateTests gains `testTimeJumpNilingForMultilingualContinuityPath`, documenting that `timeJump` is independently nilable without disturbing LSTM hidden/cell, lastToken, or predictorOutput. `testDecoderStateReset` also now exercises the timeJump field in the populate/reset cycle.
PR #264 (commit 7459740) added an 80ms (1 encoder frame, 1280 samples) mel-context prepend on non-first chunks to fix all-blank predictions at chunk boundaries on long English audio. On `parakeet-tdt-0.6b-v3-coreml` with non-English audio, that prepend shifts the FastConformer encoder's first-frame distribution just enough that the SOS-primed TDT decoder drifts back to its English-biased prior at every chunk seam. Reproduction (4 fixtures, default vs --no-mel-context): - notes_1408 (FR): drift -> clean - wwii (FR): clean -> clean - user_en (EN): clean -> clean - user2 99.9s (FR): clean -> clean Changes: - ASRConfig gains `melChunkContext: Bool = true` (default preserves PR #264 behavior; set to false for non-English long-form batch). - ChunkProcessor reads the flag and zeroes the prepend when disabled, expanding chunkSamples back so chunks aren't 80ms smaller than the encoder's max receptive window. - `transcribe` and `asr-benchmark` CLIs accept `--no-mel-context`. Closes #594
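A rough sketch of the flag-aware sizing this commit describes (the constants come from the commit text, but the function shape is an assumption, not the actual `ChunkProcessor` code):

```swift
// When melChunkContext is false, the 1280-sample (80 ms, 1 encoder frame)
// prepend is zeroed and the chunk expands back to fill the encoder's
// maximum receptive window.
let maxModelSamples = 240_000    // 15 s encoder window at 16 kHz
let melContextSamples = 1_280    // 1 encoder frame, 80 ms

func chunkSizing(melChunkContext: Bool) -> (context: Int, chunk: Int) {
    let context = melChunkContext ? melContextSamples : 0
    // Chunk content shrinks only when the prepend occupies window space
    return (context, maxModelSamples - context)
}

let defaultSizing = chunkSizing(melChunkContext: true)    // (1280, 238720)
let optOutSizing = chunkSizing(melChunkContext: false)    // (0, 240000)
```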
Force-pushed f7f58f5 to bfa14a1 • Compare
Fixes #594.
Summary
French batch transcription with `parakeet-tdt-0.6b-v3-coreml` drifts to English at ~15s chunk boundaries. Root cause is PR #264 (commit `7459740a`): the 80ms (1 encoder frame, 1280 samples) mel-context prepend on non-first chunks shifts the FastConformer encoder's first-frame distribution just enough that the SOS-primed TDT decoder drifts back to its English-biased prior at every chunk seam.

This PR adds an opt-out `melChunkContext: Bool` flag on `ASRConfig` (default `true`, preserves PR #264 behavior on English). When `false`, the prepend is zeroed and `chunkSamples` expands back so chunks aren't 80ms smaller than the encoder's max receptive window. No decoder-state continuity, no 2.0s prefix, no serialized chunk processing — just stop perturbing the encoder's first frame.

Investigation
Bisected by reverting `melContextSamples` to `0` and re-running reporter fixtures from #594.

Fixture validation (5 fixtures, English-token count)
| Fixture | main | `--no-mel-context` (this PR) |
| --- | --- | --- |
| `notes_1408_clean.wav` (FR 45s, original repro) | drift ("le rest of the key what is that") | clean |
| `wwii_belgique_fr.wav` (FR 47s) | clean | clean |
| `user_2026-05-12.wav` (EN 42s) | clean | clean |
| `user2_2026-05-12.wav` (FR 100s) | clean | clean |
| `climate_2026_fr_voice_memo.wav` (FR 110s) | "années the plus extrêmes", "Plusieurs gouvernements and entreprises have revue", "their ambition", "president American"; final sentence truncated | "Plusieurs gouvernements and entreprises" (vs `bb96003` baseline which has the same "and"); everything else clean |

The 5th fixture confirms `--no-mel-context` is strictly better than main on every measurable axis. The single remaining "and" on the climate fixture is also present in the pre-#264 `bb96003` baseline that vdt4534 tested against, so it appears to be an independent failure mode (single-token cosmetic defect, not chunk-loss).

English regression check (LibriSpeech test-clean, 2620 files, parakeet-v3)
Default (`melChunkContext: true`) vs `--no-mel-context`: WER/CER identical at 1 decimal place. The ~10-12% RTFx drop is from the larger effective chunk size when the flag is off (chunks expand by 1280 samples to fill the encoder window, slightly more encoder work per chunk).
Caveat: test-clean is short-form (mostly <16s, single-chunk) and doesn't stress the chunk-seam path much. PR #264 was introduced to fix all-blank predictions at chunk boundaries on long English audio, so the default stays `true`. Only non-English long-form batch callers should opt out.

CLI
API
Files touched
Sources/FluidAudio/ASR/Parakeet/AsrTypes.swift— addsmelChunkContext: Bool = truetoASRConfigSources/FluidAudio/ASR/Parakeet/SlidingWindow/TDT/AsrManager.swift— internal accessor forChunkProcessorSources/FluidAudio/ASR/Parakeet/SlidingWindow/TDT/ChunkProcessor.swift— flag-aware mel-context / chunk-size mathSources/FluidAudioCLI/.../TranscribeCommand.swift—--no-mel-contextflag + usageSources/FluidAudioCLI/.../AsrBenchmark.swift—--no-mel-contextflag + usage5 files, +73 / -12.
Relationship to PR #604
PR #604 (stacked draft from vdt4534) was built on this branch's previous implementation (decoder-state continuity + 2.0s prefix) before this rework. It targets the same root cause but adds 6 stacked mechanisms (560ms real-audio prefix, prefix-token suppression, silence-aligned boundaries, post-punctuation blank-streak decoder guard, punctuation-aware LCS merger, v3-only routing) at +555 / -48 across 10 files, with sequential chunk processing (1.4-1.8x slower on FR).
#604 reports the climate fixture is fully clean under their stacked approach (vs 1 remaining "and" under this PR). The tradeoff is ~7.5x the code, ~1.6x the wall-clock, and no LibriSpeech English regression measurement. Two components from #604 (the punctuation blank-streak guard in `TdtDecoderV3` and silence-aligned chunk starts) are orthogonal to the #594 fix and worth considering as separate follow-up PRs.