
fix(asr): add melChunkContext opt-out flag for Issue #594 #596

Open

Alex-Wengg wants to merge 1 commit into main from fix/asr-594-french-chunk-boundary


Conversation

@Alex-Wengg
Member

@Alex-Wengg Alex-Wengg commented May 10, 2026

Fixes #594.

Summary

French batch transcription with parakeet-tdt-0.6b-v3-coreml drifts to English at ~15s chunk boundaries. Root cause is PR #264 (commit 7459740a): the 80ms (1 encoder frame, 1280 samples) mel-context prepend on non-first chunks shifts the FastConformer encoder's first-frame distribution just enough that the SOS-primed TDT decoder drifts back to its English-biased prior at every chunk seam.

This PR adds an opt-out melChunkContext: Bool flag on ASRConfig (default true, preserves PR #264 behavior on English). When false, the prepend is zeroed and chunkSamples expands back so chunks aren't 80ms smaller than the encoder's max receptive window. No decoder-state-continuity, no 2.0s prefix, no serialized chunk processing — just stop perturbing the encoder's first frame.
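
For orientation, here is a minimal sketch of the chunk-size math the flag controls, using the constants quoted elsewhere in this thread (1,280 samples = 1 encoder frame = 80 ms at 16 kHz; 240,000-sample max window). The chunkLayout helper is illustrative, not the actual ChunkProcessor API:

// Sketch only; the real ChunkProcessor fields and names may differ.
let samplesPerEncoderFrame = 1_280   // 80 ms at 16 kHz = 1 FastConformer encoder frame
let maxModelSamples = 240_000        // max samples the encoder accepts per chunk (15 s)

func chunkLayout(melChunkContext: Bool) -> (melContextSamples: Int, chunkSamples: Int) {
    if melChunkContext {
        // PR #264 behavior: prepend one frame of mel context on non-first chunks,
        // so the chunk itself shrinks by one frame to stay inside the encoder window.
        return (samplesPerEncoderFrame, maxModelSamples - samplesPerEncoderFrame)
    } else {
        // This PR's opt-out: no prepend; the chunk expands back to the full window.
        return (0, maxModelSamples)
    }
}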

Investigation

Bisected by reverting melContextSamples to 0 and re-running reporter fixtures from #594.

Fixture validation (5 fixtures, English-token count)

| Fixture | Default (mel-context ON) | --no-mel-context (this PR) |
| --- | --- | --- |
| notes_1408_clean.wav (FR 45s, original repro) | drifts: le rest of the key what is that | clean |
| wwii_belgique_fr.wav (FR 47s) | clean | clean |
| user_2026-05-12.wav (EN 42s) | clean | clean |
| user2_2026-05-12.wav (FR 100s) | clean | clean |
| climate_2026_fr_voice_memo.wav (FR 110s) | 4 English drifts: années the plus extrêmes, Plusieurs gouvernements and entreprises have revue, their ambition, president American; final sentence truncated | 1 English drift remaining: Plusieurs gouvernements and entreprises (vs bb96003 baseline which has the same and); everything else clean |

The 5th fixture confirms --no-mel-context is strictly better than main on every measurable axis. The single remaining "and" on the climate fixture is also present in the pre-#264 bb96003 baseline that vdt4534 tested against, so it appears to be an independent failure mode (single-token cosmetic defect, not chunk-loss).

English regression check (LibriSpeech test-clean, 2620 files, parakeet-v3)

| Metric | Baseline (melChunkContext: true) | --no-mel-context | Delta |
| --- | --- | --- | --- |
| Average WER | 2.6% | 2.6% | 0.0% |
| Median WER | 0.0% | 0.0% | 0.0% |
| Average CER | 1.0% | 1.0% | 0.0% |
| Median RTFx | 31.1x | 27.9x | -10% |
| Overall RTFx | 33.6x | 29.5x | -12% |

WER/CER are identical to one decimal place. The ~10-12% RTFx drop comes from the larger effective chunk size when the flag is off (chunks expand by 1280 samples to fill the encoder window, so there is slightly more encoder work per chunk).

Caveat: test-clean is short-form (mostly <16s, single-chunk) and doesn't stress the chunk-seam path much. PR #264 was introduced to fix all-blank predictions at chunk boundaries on long English audio, so the default stays true. Only non-English long-form batch callers should opt out.

CLI

fluidaudiocli transcribe audio.wav --no-mel-context
fluidaudiocli asr-benchmark --subset test-clean --no-mel-context

API

let asrConfig = ASRConfig(
    tdtConfig: tdtConfig,
    encoderHiddenSize: modelVersion.encoderHiddenSize,
    melChunkContext: false  // opt out for non-English long-form
)

Files touched

  • Sources/FluidAudio/ASR/Parakeet/AsrTypes.swift — adds melChunkContext: Bool = true to ASRConfig
  • Sources/FluidAudio/ASR/Parakeet/SlidingWindow/TDT/AsrManager.swift — internal accessor for ChunkProcessor
  • Sources/FluidAudio/ASR/Parakeet/SlidingWindow/TDT/ChunkProcessor.swift — flag-aware mel-context / chunk-size math
  • Sources/FluidAudioCLI/.../TranscribeCommand.swift — --no-mel-context flag + usage
  • Sources/FluidAudioCLI/.../AsrBenchmark.swift — --no-mel-context flag + usage

5 files, +73 / -12.

Relationship to PR #604

PR #604 (stacked draft from vdt4534) was built on this branch's previous implementation (decoder-state continuity + 2.0s prefix) before this rework. It targets the same root cause but adds 6 stacked mechanisms (560ms real-audio prefix, prefix-token suppression, silence-aligned boundaries, post-punctuation blank-streak decoder guard, punctuation-aware LCS merger, v3-only routing) at +555 / -48 across 10 files, with sequential chunk processing (1.4-1.8x slower on FR).

#604 reports the climate fixture is fully clean under their stacked approach (vs 1 remaining "and" under this PR). The tradeoff is ~7.5x the code, ~1.6x the wall-clock, and no LibriSpeech English regression measurement. Two components from #604 (the punctuation blank-streak guard in TdtDecoderV3 and silence-aligned chunk starts) are orthogonal to the #594 fix and worth considering as separate follow-up PRs.

@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 3.5x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

@github-actions

github-actions Bot commented May 10, 2026

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

| Metric | Value | Description |
| --- | --- | --- |
| WER (Avg) | 7.03% | Average Word Error Rate |
| WER (Med) | 4.17% | Median Word Error Rate |
| RTFx | 7.85x | Real-time factor (higher = faster) |
| Total Audio | 470.6s | Total audio duration processed |
| Total Time | 62.1s | Total processing time |

Streaming Metrics

| Metric | Value | Description |
| --- | --- | --- |
| Avg Chunk Time | 0.062s | Average chunk processing time |
| Max Chunk Time | 0.124s | Maximum chunk processing time |
| EOU Detections | 0 | Total End-of-Utterance detections |

Test runtime: 1m25s • 05/12/2026, 09:08 AM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O

@github-actions

github-actions Bot commented May 10, 2026

Qwen3-ASR int8 Smoke Test ✅

Check Result
Build
Model download
Model load
Transcription pipeline
Decoder size 571 MB (vs 1.1 GB f32)

Performance Metrics

| Metric | CI Value | Expected on Apple Silicon |
| --- | --- | --- |
| Median RTFx | 0.04x | ~2.5x |
| Overall RTFx | 0.04x | ~2.5x |

Runtime: 5m8s

Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.

@github-actions

github-actions Bot commented May 10, 2026

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

| Metric | Value | Target | Status | Description |
| --- | --- | --- | --- | --- |
| DER | 15.1% | <30% |  | Diarization Error Rate (lower is better) |
| JER | 24.9% | <25% |  | Jaccard Error Rate |
| RTFx | 30.83x | >1.0x |  | Real-Time Factor (higher is faster) |

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

| Stage | Time (s) | % | Description |
| --- | --- | --- | --- |
| Model Download | 8.675 | 25.5 | Fetching diarization models |
| Model Compile | 3.718 | 10.9 | CoreML compilation |
| Audio Load | 0.047 | 0.1 | Loading audio file |
| Segmentation | 10.208 | 30.0 | Detecting speech regions |
| Embedding | 17.014 | 50.0 | Extracting speaker voices |
| Clustering | 6.806 | 20.0 | Grouping same speakers |
| Total | 34.033 | 100 | Full pipeline |

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

| Method | DER | Notes |
| --- | --- | --- |
| FluidAudio | 15.1% | On-device CoreML |
| Research baseline | 18-30% | Standard dataset performance |

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): Runs at 150 RTFx real-time
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 34.0s diarization time • Test runtime: 2m 30s • 05/12/2026, 09:12 AM EST

@github-actions

github-actions Bot commented May 10, 2026

Kokoro TTS Smoke Test ✅

Check Result
Build
Model download
Model load
Synthesis pipeline
Output WAV ✅ (634.8 KB)

Runtime: 0m51s

Note: Kokoro TTS uses CoreML flow matching + Vocos vocoder. CI VM lacks physical ANE — performance may differ from Apple Silicon.

@github-actions

github-actions Bot commented May 10, 2026

PocketTTS Smoke Test ✅

Check Result
Build
Model download
Model load
Synthesis pipeline
Output WAV ✅ (150.0 KB)

Runtime: 0m21s

Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality and performance may differ from Apple Silicon.

@github-actions

github-actions Bot commented May 10, 2026

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

| Dataset | WER Avg | WER Med | RTFx | Status |
| --- | --- | --- | --- | --- |
| test-clean | 0.57% | 0.00% | 3.23x |  |
| test-other | 1.35% | 0.00% | 1.80x |  |

Parakeet v2 (English-optimized)

| Dataset | WER Avg | WER Med | RTFx | Status |
| --- | --- | --- | --- | --- |
| test-clean | 0.80% | 0.00% | 4.75x |  |
| test-other | 1.00% | 0.00% | 2.83x |  |

Streaming (v3)

| Metric | Value | Description |
| --- | --- | --- |
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.39x | Streaming real-time factor |
| Avg Chunk Time | 2.299s | Average time to process each chunk |
| Max Chunk Time | 3.974s | Maximum chunk processing time |
| First Token | 3.336s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming (v2)

| Metric | Value | Description |
| --- | --- | --- |
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.57x | Streaming real-time factor |
| Avg Chunk Time | 1.857s | Average time to process each chunk |
| Max Chunk Time | 3.357s | Maximum chunk processing time |
| First Token | 1.631s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 9m2s • 05/12/2026, 09:15 AM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

@github-actions

github-actions Bot commented May 10, 2026

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

| Metric | Value | Target | Status | Description |
| --- | --- | --- | --- | --- |
| DER | 10.4% | <20% |  | Diarization Error Rate (lower is better) |
| RTFx | 10.61x | >1.0x |  | Real-Time Factor (higher is faster) |

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

| Stage | Time (s) | % | Description |
| --- | --- | --- | --- |
| Model Download | 13.380 | 13.5 | Fetching diarization models |
| Model Compile | 5.734 | 5.8 | CoreML compilation |
| Audio Load | 0.069 | 0.1 | Loading audio file |
| Segmentation | 26.605 | 26.9 | VAD + speech detection |
| Embedding | 98.446 | 99.6 | Speaker embedding extraction |
| Clustering (VBx) | 0.234 | 0.2 | Hungarian algorithm + VBx clustering |
| Total | 98.890 | 100 | Full VBx pipeline |

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

| Method | DER | Mode | Description |
| --- | --- | --- | --- |
| FluidAudio (Offline) | 10.4% | VBx Batch | On-device CoreML with optimal clustering |
| FluidAudio (Streaming) | 17.7% | Chunk-based | First-occurrence speaker mapping |
| Research baseline | 18-30% | Various | Standard dataset performance |

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 125.3s processing • Test runtime: 2m 5s • 05/12/2026, 09:03 AM EST

@github-actions

github-actions Bot commented May 10, 2026

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

| Metric | Value | Target | Status |
| --- | --- | --- | --- |
| DER | 30.3% | <35% |  |
| Miss Rate | 28.2% | - | - |
| False Alarm | 0.9% | - | - |
| Speaker Error | 1.2% | - | - |
| RTFx | 7.2x | >1.0x |  |
| Speakers | 4/4 | - | - |

Sortformer High-Latency • ES2004a • Runtime: 4m 21s • 2026-05-12T13:11:56.186Z

@github-actions

github-actions Bot commented May 10, 2026

VAD Benchmark Results

Performance Comparison

| Dataset | Accuracy | Precision | Recall | F1-Score | RTFx | Files |
| --- | --- | --- | --- | --- | --- | --- |
| MUSAN | 92.0% | 86.2% | 100.0% | 92.6% | 381.3x faster | 50 |
| VOiCES | 92.0% | 86.2% | 100.0% | 92.6% | 547.5x faster | 50 |

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 2.8x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 2.5x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

@Alex-Wengg
Member Author

@claude

beta-devin-ai-integration[bot]

This comment was marked as resolved.

Alex-Wengg added a commit that referenced this pull request May 11, 2026
…Offset

In processWithMultilingualContinuity, the audio buffer for chunks N>=1
starts at `contextStart = chunkStart - contextSamples` (2.0s of real
audio prepended for encoder left context), but transcribeChunk was
being given `chunkStart` as its origin. Inside transcribeChunk,
`globalFrameOffset = chunkStart / samplesPerEncoderFrame` then placed
every token in chunks N>=1 at +25 frames (+2.0s) past its true
position in the original audio timeline.

Consequence: chunk N's prefix tokens (covering audio that actually
overlaps chunk N-1's tail) landed in timestamp space _after_ chunk
N-1's end, beyond the merger's 1.0s halfOverlapWindow tolerance. LCS
and contiguous matchers could not anchor across the boundary, so every
seam fell through to mergeByMidpoint, which duplicated ~2s of content
at every chunk join.

Pass `contextStart` instead. Prefix tokens now overlap chunk N-1
correctly, LCS matches anchor properly, and the merger can dedupe as
designed. LibriSpeech test-clean (100 files, flag ON): 4.19% → 2.90%
WER. Flag-OFF unchanged at 2.6403%. French fix preserved.

Credit: Devin AI review on PR #596.
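
To make the off-by-2.0s concrete, here is a small worked sketch of the arithmetic described above (variable names follow the commit message; the specific chunkStart value is just an example):

// Worked example of the offset bug and fix; numbers are illustrative.
let samplesPerEncoderFrame = 1_280
let contextSamples = 32_000                     // 2.0 s of real audio at 16 kHz
let chunkStart = 224_000                        // example start of some chunk N >= 1
let contextStart = chunkStart - contextSamples  // where the decoded buffer actually begins

// Before the fix: tokens anchored to chunkStart land 25 frames (+2.0 s) too late,
// outside the merger's 1.0 s halfOverlapWindow tolerance.
let wrongOffset = chunkStart / samplesPerEncoderFrame    // 175 frames
// After the fix: anchoring to contextStart puts prefix tokens back over chunk N-1's tail.
let fixedOffset = contextStart / samplesPerEncoderFrame  // 150 frames, 25 fewer (2.0 s)
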
@Alex-Wengg
Member Author

@claude review

@Alex-Wengg
Member Author

@claude

@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 2.3x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

@Alex-Wengg Alex-Wengg force-pushed the fix/asr-594-french-chunk-boundary branch from a416faa to dde2eb0 on May 11, 2026 15:03
Alex-Wengg added a commit that referenced this pull request May 11, 2026
## Summary

Adds `.github/workflows/claude.yml` so the repo can respond to `@claude`
mentions in issues, issue comments, PR reviews, and PR review comments
via
[anthropics/claude-code-action@v1](https://github.com/anthropics/claude-code-action).

Motivation: PR #596 had a reviewer post `@claude review` and nothing
happened because no workflow was wired up. This PR fixes that for future
reviews.

## What it does

- Triggers on `issue_comment`, `pull_request_review_comment`,
`pull_request_review`, `issues` (opened/assigned)
- Job runs only when the body/title contains `@claude` (cheap filter,
prevents wasted runs)
- Uses `ANTHROPIC_API_KEY` repo secret for auth
- Minimal `read` permissions on contents/PRs/issues; `id-token: write`
for OIDC

## Required configuration (repo settings)

Before this workflow can run, a maintainer needs to:

1. Install the [Claude GitHub App](https://github.com/apps/claude) on
`FluidInference/FluidAudio`
2. Add an `ANTHROPIC_API_KEY` secret in repo Settings -> Secrets and
variables -> Actions

Without those, the workflow file is inert (no failed runs, just no-op).

## Test plan

- [ ] Maintainer installs the Claude GitHub App and sets
`ANTHROPIC_API_KEY`
- [ ] After merge, post `@claude help` on a throwaway issue and confirm
the workflow fires
- [ ] Confirm non-`@claude` comments do not trigger the job
Alex-Wengg added a commit that referenced this pull request May 11, 2026
…Offset

@Alex-Wengg Alex-Wengg force-pushed the fix/asr-594-french-chunk-boundary branch from dde2eb0 to 9bb7e2c on May 11, 2026 15:06
@Alex-Wengg
Member Author

@claude review

@claude
Contributor

claude Bot commented May 11, 2026

Claude encountered an error — View job


I'll analyze this and get back to you.

@Alex-Wengg
Member Author

@claude review

@claude
Contributor

claude Bot commented May 11, 2026

Claude encountered an error — View job


I'll analyze this and get back to you.

Alex-Wengg added a commit that referenced this pull request May 11, 2026
…Offset

@Alex-Wengg Alex-Wengg force-pushed the fix/asr-594-french-chunk-boundary branch from 9bb7e2c to 9aa1745 on May 11, 2026 15:16
@Alex-Wengg
Member Author

@claude review

@claude
Contributor

claude Bot commented May 11, 2026

Claude finished @Alex-Wengg's task in 3m 52s — View job


PR Review

  • Gather context and understand request
  • Read changed files
  • Review ChunkProcessor.swift (main changes)
  • Review TdtDecoderV3.swift
  • Review AsrTypes.swift and AsrManager.swift
  • Review CLI changes
  • Post review findings

Summary

This PR implements an opt-in multilingualChunkContinuity flag to fix French (non-English) chunk boundary drift in batch ASR. The core design is sound, and the critical timestamp bug flagged in the Devin review is already fixed in commit 9aa1745.


Devin Review Finding — Already Fixed ✅

The Devin comment flagged that chunkStart was passed instead of contextStart for globalFrameOffset, which would shift all chunk N≥1 timestamps by +2.0s (25 encoder frames) — beyond the 1.0s LCS tolerance — causing the merger to duplicate ~2s of content at every seam.

This is correctly fixed in ChunkProcessor.swift:334:

chunkStart: contextStart,  // was chunkStart in the original commit

Commit 9aa1745 applied this fix. The comment block at lines 315–328 explains the reasoning clearly.


Issues Found

1. No unit tests for the new code path

Severity: Medium — Policy violation (CLAUDE.md: "Add unit tests when writing new code")

processWithMultilingualContinuity and the punctuation seam reset logic have no tests. The existing test files (AsrManagerTests.swift, AsrTranscriptionTests.swift, TokenDeduplicationRegressionTests.swift) have no coverage for this path. At minimum, the following should be tested without real models:

  • ChunkProcessor timestamp offset correctness: verify that prefix tokens for chunk N≥1 have timestamps anchored to contextStart, not chunkStart
  • Punctuation seam reset: verify decoderState.reset() fires when lastToken is a punctuation token
  • multilingualChunkSamples + audioPrefixSamples ≤ maxModelSamples

Fix this →
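
One way to pin down the first bullet without loading real models is a pure-arithmetic test along these lines (the test class and names are hypothetical, not the test API this PR later added):

import XCTest

final class ChunkTimestampOffsetTests: XCTestCase {
    // Hypothetical sketch: verifies that anchoring to contextStart shifts chunk N >= 1
    // token frames back by exactly the 2.0 s prefix (25 encoder frames).
    func testPrefixTokensAnchoredToContextStart() {
        let samplesPerEncoderFrame = 1_280
        let contextSamples = 32_000          // 2.0 s prefix at 16 kHz
        let chunkStart = 224_000
        let contextStart = chunkStart - contextSamples

        let offsetFromContextStart = contextStart / samplesPerEncoderFrame
        let offsetFromChunkStart = chunkStart / samplesPerEncoderFrame
        XCTAssertEqual(offsetFromChunkStart - offsetFromContextStart, 25)
    }
}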

2. Stale WER figure in ASRConfig doc comment

Severity: Low — AsrTypes.swift:43 says ~4.40% English WER regression:

/// Trade-off: enabling this regresses English LibriSpeech WER (~4.40%
/// on `test-clean`) and serializes chunk processing, so it is opt-in.

The PR description table shows 5.264% with the flag ON. This appears to be from an intermediate version of the fix before contextStart correction. The comment should be updated to reflect the validated number.

Fix this →

3. No guard against non-v3 model + multilingualChunkContinuity: true

Severity: Low — The flag only helps on parakeet-tdt-0.6b-v3-coreml. If a caller enables multilingualChunkContinuity: true on a v2 or tdtCtc110m model, it won't fail, but the sequential serialization overhead still applies with no benefit. A logger.warning in processWithMultilingualContinuity (or in AsrManager.transcribeDiskBacked) when multilingualContinuity && version != .v3 would help surface misconfiguration.
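
A possible shape for that warning, sketched here with guessed names for the model-version enum, logger, and call site (none of these are confirmed by this PR):

import OSLog

// Illustrative only; the real model-version type and logger in FluidAudio may differ.
enum ModelVersion { case v2, v3, tdtCtc110m, ctcZhCn, tdtJa }

func warnIfContinuityMisconfigured(continuityEnabled: Bool, version: ModelVersion) {
    let logger = Logger(subsystem: "FluidAudio", category: "ASR")
    if continuityEnabled && version != .v3 {
        // The flag still produces correct output on non-v3 models, but only adds
        // sequential-processing latency, so surface the likely misconfiguration.
        logger.warning("multilingualChunkContinuity targets parakeet-tdt-0.6b-v3; other models gain only latency.")
    }
}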


Design Notes (No Action Required)

Punctuation guard double-accounting: The blankFramesSincePunctuation counter increments in both the outer loop (line 292–294) and the inner blank loop (lines 371–372). This is correct — the outer adds the first blank's duration, and the inner accumulates subsequent consecutive blanks. No double-counting.

contextSamples: 0 tradeoff: Passing 0 disables contextFrameAdjustment, so the decoder emits through the 2.0s prefix region. Those prefix tokens get timestamps [contextStart/spf, chunkStart/spf] which overlap with the previous chunk's tail — the LCS/midpoint merger correctly discards them. The PR description acknowledges the residual known limitation (files where content is still dropped after a sentence-final seam).

TdtDecoderState.reset() on punctuation seam (ChunkProcessor.swift:361-363): This zeroes LSTM h/c + clears lastToken/predictorOutput, causing SOS re-priming for chunk N+1. The 2.0s audio prefix on chunk N+1 compensates — the encoder sees real audio history before the clean-slate LSTM produces its first prediction. This is consistent with the streaming path design.


CI Status

All relevant benchmarks passed (ASR ✅, Japanese CTC ✅, Parakeet EOU ✅, Qwen3 ✅, TTS ✅). Diarization and Sortformer benchmarks show NaN which appears to be a pre-existing CI infrastructure issue unrelated to this PR.


@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 2.4x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 2.2x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 2.5x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 1.8x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

Alex-Wengg added a commit that referenced this pull request May 11, 2026
Three followups from the bot review of PR #596:

1. Update stale WER doc comment in AsrTypes.swift
   The `multilingualChunkContinuity` doc said "~4.40% English WER", which
   referred to an intermediate variant before the Devin globalFrameOffset
   fix landed. The validated landed number is 2.90% (vs 2.64% with flag
   off, a +0.26pp cost).

2. Warn when `multilingualChunkContinuity=true` on a non-v3 model
   The flag's sequential serialization + 2.0s audio prefix is designed
   to mitigate parakeet-tdt-0.6b-v3 English-prior drift. On v2 /
   tdtCtc110m / ctcZhCn / tdtJa it still produces correct output, but
   only adds latency with no benefit, so log a warning once when the
   path is entered with a non-v3 model.

3. Unit tests for the new code path (CLAUDE.md policy: "Add unit tests
   when writing new code")
   - ASRConfigTests: multilingualChunkContinuity defaults to false,
     preserves explicit true/false, and doesn't disturb other fields.
   - ChunkProcessorTests (via #if DEBUG accessors):
     * audio prefix is exactly 32000 samples (2.0s @ 16kHz), encoder-
       frame-aligned (multiple of 1280).
     * multilingual chunk size + prefix ≤ maxModelSamples (240000),
       and chunk size is frame-aligned.
     * multilingual chunk size is strictly smaller than default chunk
       size (it has to give up content to make prefix room).
     * chunkSamples(multilingualContinuity:) dispatches correctly to
       either the default or multilingual sizing path.
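
A compressed sketch of the item-3 tests described above; the bare ASRConfig() initializer and the exact sizing relation are assumptions for illustration, not the shipped test code:

import XCTest

final class MultilingualContinuityConfigTests: XCTestCase {
    // Sketch: the flag is opt-in and defaults to false (assumes a default ASRConfig init).
    func testMultilingualChunkContinuityDefaultsToFalse() {
        XCTAssertFalse(ASRConfig().multilingualChunkContinuity)
    }

    // Sketch of the sizing invariants, using the 32000 / 240000 constants quoted above.
    func testPrefixAndChunkBudget() {
        let audioPrefixSamples = 32_000        // 2.0 s at 16 kHz
        let maxModelSamples = 240_000
        let multilingualChunkSamples = maxModelSamples - audioPrefixSamples  // assumed relation

        XCTAssertEqual(audioPrefixSamples % 1_280, 0)  // encoder-frame-aligned
        XCTAssertLessThanOrEqual(multilingualChunkSamples + audioPrefixSamples, maxModelSamples)
        XCTAssertLessThan(multilingualChunkSamples, maxModelSamples)  // gives up room for the prefix
    }
}
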
@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 2.4x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log


@beta-devin-ai-integration beta-devin-ai-integration Bot left a comment


Devin Review found 2 new potential issues.

View 10 additional findings in Devin Review.


Comment on lines +352 to +363
let (windowTokens, windowTimestamps, windowConfidences, windowDurations) =
    try await Self
    .transcribeChunk(
        samples: chunkSamplesArray,
        contextSamples: 0,  // Pass 0 to disable contextFrameAdjustment
        chunkStart: contextStart,
        isLastChunk: isLastChunk,
        using: manager,
        decoderState: &decoderState,
        maxModelSamples: maxModelSamples,
        language: language
    )


🟡 timeJump=0 special case causes decoder to skip the entire 2.0s audio prefix in multilingual continuity path

In processWithMultilingualContinuity, the decoder state (including timeJump) is persisted across chunks. After each non-last chunk, decoderState.timeJump is set to currentTimeIndices - effectiveSequenceLength (TdtDecoderV3.swift:574). When this value is exactly 0 (decoder finished at the boundary), the next chunk's calculateInitialTimeIndices (TdtFrameNavigation.swift:42-44) hits a special case that returns ASRConstants.standardOverlapFrames (25 frames = 2.0s) — which is the exact length of the audio prefix. This causes the decoder to skip the entire prefix region.

The comment at line 341 says "let the decoder emit naturally through the prefix region" and the LCS/midpoint merger is expected to discard prefix tokens (line 342-344), but when timeJump=0 no prefix tokens are produced. The behavior is inconsistent: timeJump=1 → decoder starts at frame 1 (processes prefix), timeJump=0 → decoder starts at frame 25 (skips prefix entirely). The standardOverlapFrames special case was designed for streaming overlap, not for the batch multilingual prefix. The encoder still benefits from the prefix audio (it processes the full buffer), and the 2.0s stride overlap still provides merge anchors, but the decoder's frame-level behavior becomes non-deterministically dependent on whether timeJump is exactly 0.

Prompt for agents
The processWithMultilingualContinuity function persists decoderState across chunks, including timeJump. After each chunk, TdtDecoderV3.decodeWithTimings sets decoderState.timeJump via calculateFinalTimeJump. On the next chunk, calculateInitialTimeIndices (TdtFrameNavigation.swift:42-44) returns standardOverlapFrames (25) when timeJump==0 and contextFrameAdjustment==0, which skips the entire 2.0s audio prefix.

The fix should ensure the decoder starts at frame 0 for non-first chunks in the multilingual continuity path, so it processes through the prefix region as intended. Options:
1. Clear decoderState.timeJump to nil after each chunk in processWithMultilingualContinuity (before the next iteration), so calculateInitialTimeIndices returns contextFrameAdjustment=0 via its nil guard path.
2. Or pass a negative contextFrameAdjustment to counteract the standardOverlapFrames special case.

Option 1 is simpler: add decoderState.timeJump = nil after the transcribeChunk call (or at the end of the while loop body, before chunkStart += strideSamples). This preserves LSTM hidden/cell state and lastToken while ensuring the decoder always starts at frame 0 on non-first chunks.
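
A sketch of option 1, placed in the loop as the prompt suggests (the surrounding loop body is paraphrased, not the actual processWithMultilingualContinuity source):

// Paraphrased loop skeleton; only the `decoderState.timeJump = nil` line is the proposed change.
func continuityLoopSketch(totalSamples: Int, strideSamples: Int, decoderState: inout TdtDecoderState) {
    var chunkStart = 0
    while chunkStart < totalSamples {
        // ... build the buffer starting at contextStart and call transcribeChunk ...

        // Clear the carried-over timeJump so the next chunk's decoder starts at frame 0
        // of its buffer (which already begins with the 2.0 s prefix), instead of hitting
        // the standardOverlapFrames special case and skipping the prefix entirely.
        decoderState.timeJump = nil

        chunkStart += strideSamples
    }
}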


Comment on lines +598 to +604
// Phase 2: filter punctuation out of LCS candidates to prevent false anchors
let punctuationAwareMatcher: (IndexedToken, IndexedToken) -> Bool = { l, r in
    if ASRConstants.punctuationTokens.contains(l.token.token) {
        return false
    }
    return timeTolerantMatcher(l, r)
}


🟡 Shared merge logic changes (punctuation-aware LCS + midpoint) modify default parallel path behavior contrary to documentation

The mergeChunks and mergeByMidpoint functions are shared between both the parallel and multilingual continuity paths. This PR modifies both: the LCS fallback now uses punctuationAwareMatcher (line 599-604) that rejects punctuation tokens as anchors, and mergeByMidpoint (line 687-694) walks backward past trailing punctuation to compute effectiveLeftEnd. These changes affect the default parallel path even when multilingualChunkContinuity = false.

The processWithParallelChunks doc comment at line 148 states "identical to origin/main behavior" and the ASRConfig doc at AsrTypes.swift:30-32 claims "WER 2.64% on test-clean" for the default flag-off path. Both claims are now inaccurate since the merge logic has changed. The merge improvements are likely beneficial (punctuation tokens like . ? ! can cause false LCS anchors at chunk boundaries), but the behavioral change is not gated by the multilingualChunkContinuity flag and could shift the English baseline WER.

Prompt for agents
The punctuation-aware LCS matcher at ChunkProcessor.swift:598-604 and the trailing-punctuation adjustment in mergeByMidpoint at ChunkProcessor.swift:687-694 are applied unconditionally in the shared mergeChunks function, affecting both the parallel (default) and multilingual continuity paths.

If the intent is to preserve bit-for-bit behavior for the default path:
- Gate the punctuationAwareMatcher behind a flag (e.g. pass multilingualContinuity through to mergeChunks, and only use punctuationAwareMatcher when true; use timeTolerantMatcher otherwise).
- Similarly gate the effectiveLeftEnd adjustment in mergeByMidpoint.

If the intent is to improve both paths (likely, given the changes are labeled Phase 2):
- Update the processWithParallelChunks doc comment (line 148) to remove the claim of identical to origin/main behavior.
- Update the ASRConfig doc comment (AsrTypes.swift:30-32) to note that merge improvements may slightly shift the WER figure.
- Re-run the LibriSpeech test-clean benchmark with the flag off to confirm the new baseline WER.
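
For reference, the first option would look roughly like this; the flag plumbing into mergeChunks is hypothetical, and the follow-up commit below ultimately kept the shared matcher and updated the docs instead:

// Hypothetical gating of the matcher; `multilingualContinuity` would need to be
// passed into mergeChunks, which this PR does not do.
let matcher: (IndexedToken, IndexedToken) -> Bool =
    multilingualContinuity
    ? { l, r in
        if ASRConstants.punctuationTokens.contains(l.token.token) { return false }
        return timeTolerantMatcher(l, r)
    }
    : timeTolerantMatcher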


Alex-Wengg added a commit that referenced this pull request May 11, 2026
Devin PR-596 review (2nd pass) flagged two issues:

1. (real bug) `processWithMultilingualContinuity` persists `timeJump`
   across chunks. `TdtDecoderV3.decodeWithTimings` writes
   `decoderState.timeJump = currentTimeIndices - effectiveSequenceLength`
   for non-last chunks. On the next chunk,
   `TdtFrameNavigation.calculateInitialTimeIndices` either skips into
   the 2.0s prefix region (prevTimeJump > 0) or, when prevTimeJump == 0,
   returns the special-case `standardOverlapFrames` (25 = exactly the
   prefix length) which skips the prefix entirely. Either case breaks
   the merger's ability to anchor tokens across the chunk seam.

   Fix: explicitly `decoderState.timeJump = nil` after each non-last
   chunk so the next chunk's decoder starts at frame 0 of the buffer
   (which already begins with the 2.0s prefix). The punctuation-seam
   `reset()` already nils timeJump as a side effect; this clear handles
   every other boundary the same way.

2. (doc accuracy) Both ChunkProcessor.processWithParallelChunks and
   ASRConfig.multilingualChunkContinuity claimed the default path was
   bit-for-bit identical to pre-#596 main. The shared
   `mergeChunks` / `mergeByMidpoint` punctuation-aware LCS matcher and
   trailing-punctuation midpoint adjustment apply to both paths.
   Empirically WER-neutral on LibriSpeech test-clean (validated 2.64%),
   but the merger algorithm is no longer literally identical. Doc
   updated to reflect this.

Tests: TdtDecoderStateTests gains `testTimeJumpNilingForMultilingualContinuityPath`
documenting that `timeJump` is independently nilable without disturbing
LSTM hidden/cell, lastToken, or predictorOutput. `testDecoderStateReset`
also now exercises the timeJump field in the populate/reset cycle.
@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 1.5x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 2.7x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

PR #264 (commit 7459740) added an 80ms (1 encoder frame, 1280 samples)
mel-context prepend on non-first chunks to fix all-blank predictions at
chunk boundaries on long English audio. On `parakeet-tdt-0.6b-v3-coreml`
with non-English audio, that prepend shifts the FastConformer encoder's
first-frame distribution just enough that the SOS-primed TDT decoder
drifts back to its English-biased prior at every chunk seam.

Reproduction (4 fixtures, default vs --no-mel-context):
  - notes_1408 (FR):  drift -> clean
  - wwii (FR):        clean -> clean
  - user_en (EN):     clean -> clean
  - user2 99.9s (FR): clean -> clean

Changes:
  - ASRConfig gains `melChunkContext: Bool = true` (default preserves
    PR #264 behavior; set to false for non-English long-form batch).
  - ChunkProcessor reads the flag and zeroes the prepend when disabled,
    expanding chunkSamples back so chunks aren't 80ms smaller than the
    encoder's max receptive window.
  - `transcribe` and `asr-benchmark` CLIs accept `--no-mel-context`.

Closes #594
@Alex-Wengg Alex-Wengg force-pushed the fix/asr-594-french-chunk-boundary branch from f7f58f5 to bfa14a1 on May 12, 2026 12:55
@Alex-Wengg Alex-Wengg changed the title from "fix(asr): persist decoder state, extend audio context, guard punctuation termination (#594)" to "fix(asr): add melChunkContext opt-out flag for Issue #594" on May 12, 2026


Development

Successfully merging this pull request may close these issues.

bug: French transcription drifts to English at chunk boundary in AsrManager.transcribe (batch) — regression introduced by #264

1 participant