
fix(asr): add melChunkContext opt-out flag for Issue #594 #596

Open

Alex-Wengg wants to merge 1 commit into main from fix/asr-594-french-chunk-boundary


Conversation

@Alex-Wengg
Member

@Alex-Wengg Alex-Wengg commented May 10, 2026

Fixes #594.

Summary

French batch transcription with parakeet-tdt-0.6b-v3-coreml drifts to English at ~15s chunk boundaries. Root cause is PR #264 (commit 7459740a): the 80ms (1 encoder frame, 1280 samples) mel-context prepend on non-first chunks shifts the FastConformer encoder's first-frame distribution just enough that the SOS-primed TDT decoder drifts back to its English-biased prior at every chunk seam.

This PR adds an opt-out melChunkContext: Bool flag on ASRConfig (default true, preserves PR #264 behavior on English). When false, the prepend is zeroed and chunkSamples expands back so chunks aren't 80ms smaller than the encoder's max receptive window. No decoder-state-continuity, no 2.0s prefix, no serialized chunk processing — just stop perturbing the encoder's first frame.
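
For orientation, here is a minimal sketch of the chunk-size math the flag controls, using the constants quoted elsewhere in this thread (1,280 samples = 1 encoder frame = 80 ms at 16 kHz; 240,000-sample max window). The chunkLayout helper is illustrative, not the actual ChunkProcessor API:

// Sketch only; the real ChunkProcessor fields and names may differ.
let samplesPerEncoderFrame = 1_280   // 80 ms at 16 kHz = 1 FastConformer encoder frame
let maxModelSamples = 240_000        // max samples the encoder accepts per chunk (15 s)

func chunkLayout(melChunkContext: Bool) -> (melContextSamples: Int, chunkSamples: Int) {
    if melChunkContext {
        // PR #264 behavior: prepend one frame of mel context on non-first chunks,
        // so the chunk itself shrinks by one frame to stay inside the encoder window.
        return (samplesPerEncoderFrame, maxModelSamples - samplesPerEncoderFrame)
    } else {
        // This PR's opt-out: no prepend; the chunk expands back to the full window.
        return (0, maxModelSamples)
    }
}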

Investigation

Bisected by reverting melContextSamples to 0 and re-running reporter fixtures from #594.

Fixture validation (5 fixtures, English-token count)

| Fixture | Default (mel-context ON) | --no-mel-context (this PR) |
| --- | --- | --- |
| notes_1408_clean.wav (FR 45s, original repro) | drifts: le rest of the key what is that | clean |
| wwii_belgique_fr.wav (FR 47s) | clean | clean |
| user_2026-05-12.wav (EN 42s) | clean | clean |
| user2_2026-05-12.wav (FR 100s) | clean | clean |
| climate_2026_fr_voice_memo.wav (FR 110s) | 4 English drifts: années the plus extrêmes, Plusieurs gouvernements and entreprises have revue, their ambition, president American; final sentence truncated | 1 English drift remaining: Plusieurs gouvernements and entreprises (vs bb96003 baseline which has the same and); everything else clean |

The 5th fixture confirms --no-mel-context is strictly better than main on every measurable axis. The single remaining "and" on the climate fixture is also present in the pre-#264 bb96003 baseline that vdt4534 tested against, so it appears to be an independent failure mode (single-token cosmetic defect, not chunk-loss).

English regression check (LibriSpeech test-clean, 2620 files, parakeet-v3)

| Metric | Baseline (melChunkContext: true) | --no-mel-context | Delta |
| --- | --- | --- | --- |
| Average WER | 2.6% | 2.6% | 0.0% |
| Median WER | 0.0% | 0.0% | 0.0% |
| Average CER | 1.0% | 1.0% | 0.0% |
| Median RTFx | 31.1x | 27.9x | -10% |
| Overall RTFx | 33.6x | 29.5x | -12% |

WER/CER are identical to one decimal place. The ~10-12% RTFx drop comes from the larger effective chunk size when the flag is off (chunks expand by 1280 samples to fill the encoder window, so there is slightly more encoder work per chunk).

Caveat: test-clean is short-form (mostly <16s, single-chunk) and doesn't stress the chunk-seam path much. PR #264 was introduced to fix all-blank predictions at chunk boundaries on long English audio, so the default stays true. Only non-English long-form batch callers should opt out.

CLI

fluidaudiocli transcribe audio.wav --no-mel-context
fluidaudiocli asr-benchmark --subset test-clean --no-mel-context

API

let asrConfig = ASRConfig(
    tdtConfig: tdtConfig,
    encoderHiddenSize: modelVersion.encoderHiddenSize,
    melChunkContext: false  // opt out for non-English long-form
)

Files touched

  • Sources/FluidAudio/ASR/Parakeet/AsrTypes.swift — adds melChunkContext: Bool = true to ASRConfig
  • Sources/FluidAudio/ASR/Parakeet/SlidingWindow/TDT/AsrManager.swift — internal accessor for ChunkProcessor
  • Sources/FluidAudio/ASR/Parakeet/SlidingWindow/TDT/ChunkProcessor.swift — flag-aware mel-context / chunk-size math
  • Sources/FluidAudioCLI/.../TranscribeCommand.swift — --no-mel-context flag + usage
  • Sources/FluidAudioCLI/.../AsrBenchmark.swift — --no-mel-context flag + usage

5 files, +73 / -12.

Relationship to PR #604

PR #604 (stacked draft from vdt4534) was built on this branch's previous implementation (decoder-state continuity + 2.0s prefix) before this rework. It targets the same root cause but adds 6 stacked mechanisms (560ms real-audio prefix, prefix-token suppression, silence-aligned boundaries, post-punctuation blank-streak decoder guard, punctuation-aware LCS merger, v3-only routing) at +555 / -48 across 10 files, with sequential chunk processing (1.4-1.8x slower on FR).

#604 reports the climate fixture is fully clean under their stacked approach (vs 1 remaining "and" under this PR). The tradeoff is ~7.5x the code, ~1.6x the wall-clock, and no LibriSpeech English regression measurement. Two components from #604 (the punctuation blank-streak guard in TdtDecoderV3 and silence-aligned chunk starts) are orthogonal to the #594 fix and worth considering as separate follow-up PRs.

@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 3.5x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

@github-actions

github-actions Bot commented May 10, 2026

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

| Metric | Value | Description |
| --- | --- | --- |
| WER (Avg) | 7.03% | Average Word Error Rate |
| WER (Med) | 4.17% | Median Word Error Rate |
| RTFx | 7.85x | Real-time factor (higher = faster) |
| Total Audio | 470.6s | Total audio duration processed |
| Total Time | 62.1s | Total processing time |

Streaming Metrics

| Metric | Value | Description |
| --- | --- | --- |
| Avg Chunk Time | 0.062s | Average chunk processing time |
| Max Chunk Time | 0.124s | Maximum chunk processing time |
| EOU Detections | 0 | Total End-of-Utterance detections |

Test runtime: 1m25s • 05/12/2026, 09:08 AM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O

@github-actions

github-actions Bot commented May 10, 2026

Qwen3-ASR int8 Smoke Test ✅

Check Result
Build
Model download
Model load
Transcription pipeline
Decoder size 571 MB (vs 1.1 GB f32)

Performance Metrics

| Metric | CI Value | Expected on Apple Silicon |
| --- | --- | --- |
| Median RTFx | 0.04x | ~2.5x |
| Overall RTFx | 0.04x | ~2.5x |

Runtime: 5m8s

Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.

@github-actions

github-actions Bot commented May 10, 2026

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

| Metric | Value | Target | Status | Description |
| --- | --- | --- | --- | --- |
| DER | 15.1% | <30% |  | Diarization Error Rate (lower is better) |
| JER | 24.9% | <25% |  | Jaccard Error Rate |
| RTFx | 30.83x | >1.0x |  | Real-Time Factor (higher is faster) |

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

| Stage | Time (s) | % | Description |
| --- | --- | --- | --- |
| Model Download | 8.675 | 25.5 | Fetching diarization models |
| Model Compile | 3.718 | 10.9 | CoreML compilation |
| Audio Load | 0.047 | 0.1 | Loading audio file |
| Segmentation | 10.208 | 30.0 | Detecting speech regions |
| Embedding | 17.014 | 50.0 | Extracting speaker voices |
| Clustering | 6.806 | 20.0 | Grouping same speakers |
| Total | 34.033 | 100 | Full pipeline |

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

| Method | DER | Notes |
| --- | --- | --- |
| FluidAudio | 15.1% | On-device CoreML |
| Research baseline | 18-30% | Standard dataset performance |

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): Runs at 150 RTFx real-time
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 34.0s diarization time • Test runtime: 2m 30s • 05/12/2026, 09:12 AM EST

@github-actions

github-actions Bot commented May 10, 2026

Kokoro TTS Smoke Test ✅

Check Result
Build
Model download
Model load
Synthesis pipeline
Output WAV ✅ (634.8 KB)

Runtime: 0m51s

Note: Kokoro TTS uses CoreML flow matching + Vocos vocoder. CI VM lacks physical ANE — performance may differ from Apple Silicon.

@github-actions

github-actions Bot commented May 10, 2026

PocketTTS Smoke Test ✅

Check Result
Build
Model download
Model load
Synthesis pipeline
Output WAV ✅ (150.0 KB)

Runtime: 0m21s

Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality and performance may differ from Apple Silicon.

@github-actions

github-actions Bot commented May 10, 2026

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

| Dataset | WER Avg | WER Med | RTFx | Status |
| --- | --- | --- | --- | --- |
| test-clean | 0.57% | 0.00% | 3.23x |  |
| test-other | 1.35% | 0.00% | 1.80x |  |

Parakeet v2 (English-optimized)

| Dataset | WER Avg | WER Med | RTFx | Status |
| --- | --- | --- | --- | --- |
| test-clean | 0.80% | 0.00% | 4.75x |  |
| test-other | 1.00% | 0.00% | 2.83x |  |

Streaming (v3)

| Metric | Value | Description |
| --- | --- | --- |
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.39x | Streaming real-time factor |
| Avg Chunk Time | 2.299s | Average time to process each chunk |
| Max Chunk Time | 3.974s | Maximum chunk processing time |
| First Token | 3.336s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming (v2)

| Metric | Value | Description |
| --- | --- | --- |
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.57x | Streaming real-time factor |
| Avg Chunk Time | 1.857s | Average time to process each chunk |
| Max Chunk Time | 3.357s | Maximum chunk processing time |
| First Token | 1.631s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 9m2s • 05/12/2026, 09:15 AM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

@github-actions

github-actions Bot commented May 10, 2026

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

| Metric | Value | Target | Status | Description |
| --- | --- | --- | --- | --- |
| DER | 10.4% | <20% |  | Diarization Error Rate (lower is better) |
| RTFx | 10.61x | >1.0x |  | Real-Time Factor (higher is faster) |

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

| Stage | Time (s) | % | Description |
| --- | --- | --- | --- |
| Model Download | 13.380 | 13.5 | Fetching diarization models |
| Model Compile | 5.734 | 5.8 | CoreML compilation |
| Audio Load | 0.069 | 0.1 | Loading audio file |
| Segmentation | 26.605 | 26.9 | VAD + speech detection |
| Embedding | 98.446 | 99.6 | Speaker embedding extraction |
| Clustering (VBx) | 0.234 | 0.2 | Hungarian algorithm + VBx clustering |
| Total | 98.890 | 100 | Full VBx pipeline |

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

| Method | DER | Mode | Description |
| --- | --- | --- | --- |
| FluidAudio (Offline) | 10.4% | VBx Batch | On-device CoreML with optimal clustering |
| FluidAudio (Streaming) | 17.7% | Chunk-based | First-occurrence speaker mapping |
| Research baseline | 18-30% | Various | Standard dataset performance |

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 125.3s processing • Test runtime: 2m 5s • 05/12/2026, 09:03 AM EST

@github-actions

github-actions Bot commented May 10, 2026

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

| Metric | Value | Target | Status |
| --- | --- | --- | --- |
| DER | 30.3% | <35% |  |
| Miss Rate | 28.2% | - | - |
| False Alarm | 0.9% | - | - |
| Speaker Error | 1.2% | - | - |
| RTFx | 7.2x | >1.0x |  |
| Speakers | 4/4 | - | - |

Sortformer High-Latency • ES2004a • Runtime: 4m 21s • 2026-05-12T13:11:56.186Z

@github-actions

github-actions Bot commented May 10, 2026

VAD Benchmark Results

Performance Comparison

| Dataset | Accuracy | Precision | Recall | F1-Score | RTFx | Files |
| --- | --- | --- | --- | --- | --- | --- |
| MUSAN | 92.0% | 86.2% | 100.0% | 92.6% | 381.3x faster | 50 |
| VOiCES | 92.0% | 86.2% | 100.0% | 92.6% | 547.5x faster | 50 |

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 2.8x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 2.5x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

@Alex-Wengg
Member Author

@claude

beta-devin-ai-integration[bot]

This comment was marked as resolved.

Alex-Wengg added a commit that referenced this pull request May 11, 2026
…Offset

In processWithMultilingualContinuity, the audio buffer for chunks N>=1
starts at `contextStart = chunkStart - contextSamples` (2.0s of real
audio prepended for encoder left context), but transcribeChunk was
being given `chunkStart` as its origin. Inside transcribeChunk,
`globalFrameOffset = chunkStart / samplesPerEncoderFrame` then placed
every token in chunks N>=1 at +25 frames (+2.0s) past its true
position in the original audio timeline.

Consequence: chunk N's prefix tokens (covering audio that actually
overlaps chunk N-1's tail) landed in timestamp space _after_ chunk
N-1's end, beyond the merger's 1.0s halfOverlapWindow tolerance. LCS
and contiguous matchers could not anchor across the boundary, so every
seam fell through to mergeByMidpoint, which duplicated ~2s of content
at every chunk join.

Pass `contextStart` instead. Prefix tokens now overlap chunk N-1
correctly, LCS matches anchor properly, and the merger can dedupe as
designed. LibriSpeech test-clean (100 files, flag ON): 4.19% → 2.90%
WER. Flag-OFF unchanged at 2.6403%. French fix preserved.

Credit: Devin AI review on PR #596.
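
To make the off-by-2.0s concrete, here is a small worked sketch of the arithmetic described above (variable names follow the commit message; the specific chunkStart value is just an example):

// Worked example of the offset bug and fix; numbers are illustrative.
let samplesPerEncoderFrame = 1_280
let contextSamples = 32_000                     // 2.0 s of real audio at 16 kHz
let chunkStart = 224_000                        // example start of some chunk N >= 1
let contextStart = chunkStart - contextSamples  // where the decoded buffer actually begins

// Before the fix: tokens anchored to chunkStart land 25 frames (+2.0 s) too late,
// outside the merger's 1.0 s halfOverlapWindow tolerance.
let wrongOffset = chunkStart / samplesPerEncoderFrame    // 175 frames
// After the fix: anchoring to contextStart puts prefix tokens back over chunk N-1's tail.
let fixedOffset = contextStart / samplesPerEncoderFrame  // 150 frames, 25 fewer (2.0 s)
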
@Alex-Wengg
Member Author

@claude review

@Alex-Wengg
Member Author

@claude

@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 2.3x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

@Alex-Wengg Alex-Wengg force-pushed the fix/asr-594-french-chunk-boundary branch from a416faa to dde2eb0 on May 11, 2026 15:03
Alex-Wengg added a commit that referenced this pull request May 11, 2026
## Summary

Adds `.github/workflows/claude.yml` so the repo can respond to `@claude`
mentions in issues, issue comments, PR reviews, and PR review comments
via
[anthropics/claude-code-action@v1](https://github.com/anthropics/claude-code-action).

Motivation: PR #596 had a reviewer post `@claude review` and nothing
happened because no workflow was wired up. This PR fixes that for future
reviews.

## What it does

- Triggers on `issue_comment`, `pull_request_review_comment`,
`pull_request_review`, `issues` (opened/assigned)
- Job runs only when the body/title contains `@claude` (cheap filter,
prevents wasted runs)
- Uses `ANTHROPIC_API_KEY` repo secret for auth
- Minimal `read` permissions on contents/PRs/issues; `id-token: write`
for OIDC

## Required configuration (repo settings)

Before this workflow can run, a maintainer needs to:

1. Install the [Claude GitHub App](https://github.com/apps/claude) on
`FluidInference/FluidAudio`
2. Add an `ANTHROPIC_API_KEY` secret in repo Settings -> Secrets and
variables -> Actions

Without those, the workflow file is inert (no failed runs, just no-op).

## Test plan

- [ ] Maintainer installs the Claude GitHub App and sets
`ANTHROPIC_API_KEY`
- [ ] After merge, post `@claude help` on a throwaway issue and confirm
the workflow fires
- [ ] Confirm non-`@claude` comments do not trigger the job
Alex-Wengg added a commit that referenced this pull request May 11, 2026
…Offset

@Alex-Wengg Alex-Wengg force-pushed the fix/asr-594-french-chunk-boundary branch from dde2eb0 to 9bb7e2c on May 11, 2026 15:06
@Alex-Wengg
Member Author

@claude review

@claude
Contributor

claude Bot commented May 11, 2026

Claude encountered an error — View job


I'll analyze this and get back to you.

@Alex-Wengg
Member Author

@claude review

@claude
Contributor

claude Bot commented May 11, 2026

Claude encountered an error — View job


I'll analyze this and get back to you.

Alex-Wengg added a commit that referenced this pull request May 11, 2026
…Offset

@Alex-Wengg Alex-Wengg force-pushed the fix/asr-594-french-chunk-boundary branch from 9bb7e2c to 9aa1745 on May 11, 2026 15:16
@Alex-Wengg
Member Author

@claude review

@claude
Contributor

claude Bot commented May 11, 2026

Claude finished @Alex-Wengg's task in 3m 52s — View job


PR Review

  • Gather context and understand request
  • Read changed files
  • Review ChunkProcessor.swift (main changes)
  • Review TdtDecoderV3.swift
  • Review AsrTypes.swift and AsrManager.swift
  • Review CLI changes
  • Post review findings

Summary

This PR implements an opt-in multilingualChunkContinuity flag to fix French (non-English) chunk boundary drift in batch ASR. The core design is sound, and the critical timestamp bug flagged in the Devin review is already fixed in commit 9aa1745.


Devin Review Finding — Already Fixed ✅

The Devin comment flagged that chunkStart was passed instead of contextStart for globalFrameOffset, which would shift all chunk N≥1 timestamps by +2.0s (25 encoder frames) — beyond the 1.0s LCS tolerance — causing the merger to duplicate ~2s of content at every seam.

This is correctly fixed in ChunkProcessor.swift:334:

chunkStart: contextStart,  // was chunkStart in the original commit

Commit 9aa1745 applied this fix. The comment block at lines 315–328 explains the reasoning clearly.


Issues Found

1. No unit tests for the new code path

Severity: Medium — Policy violation (CLAUDE.md: "Add unit tests when writing new code")

processWithMultilingualContinuity and the punctuation seam reset logic have no tests. The existing test files (AsrManagerTests.swift, AsrTranscriptionTests.swift, TokenDeduplicationRegressionTests.swift) have no coverage for this path. At minimum, the following should be tested without real models:

  • ChunkProcessor timestamp offset correctness: verify that prefix tokens for chunk N≥1 have timestamps anchored to contextStart, not chunkStart
  • Punctuation seam reset: verify decoderState.reset() fires when lastToken is a punctuation token
  • multilingualChunkSamples + audioPrefixSamples ≤ maxModelSamples

Fix this →
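
One way to pin down the first bullet without loading real models is a pure-arithmetic test along these lines (the test class and names are hypothetical, not the test API this PR later added):

import XCTest

final class ChunkTimestampOffsetTests: XCTestCase {
    // Hypothetical sketch: verifies that anchoring to contextStart shifts chunk N >= 1
    // token frames back by exactly the 2.0 s prefix (25 encoder frames).
    func testPrefixTokensAnchoredToContextStart() {
        let samplesPerEncoderFrame = 1_280
        let contextSamples = 32_000          // 2.0 s prefix at 16 kHz
        let chunkStart = 224_000
        let contextStart = chunkStart - contextSamples

        let offsetFromContextStart = contextStart / samplesPerEncoderFrame
        let offsetFromChunkStart = chunkStart / samplesPerEncoderFrame
        XCTAssertEqual(offsetFromChunkStart - offsetFromContextStart, 25)
    }
}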

2. Stale WER figure in ASRConfig doc comment

Severity: Low — AsrTypes.swift:43 says ~4.40% English WER regression:

/// Trade-off: enabling this regresses English LibriSpeech WER (~4.40%
/// on `test-clean`) and serializes chunk processing, so it is opt-in.

The PR description table shows 5.264% with the flag ON. This appears to be from an intermediate version of the fix before contextStart correction. The comment should be updated to reflect the validated number.

Fix this →

3. No guard against non-v3 model + multilingualChunkContinuity: true

Severity: Low — The flag only helps on parakeet-tdt-0.6b-v3-coreml. If a caller enables multilingualChunkContinuity: true on a v2 or tdtCtc110m model, it won't fail, but the sequential serialization overhead still applies with no benefit. A logger.warning in processWithMultilingualContinuity (or in AsrManager.transcribeDiskBacked) when multilingualContinuity && version != .v3 would help surface misconfiguration.
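
A possible shape for that warning, sketched here with guessed names for the model-version enum, logger, and call site (none of these are confirmed by this PR):

import OSLog

// Illustrative only; the real model-version type and logger in FluidAudio may differ.
enum ModelVersion { case v2, v3, tdtCtc110m, ctcZhCn, tdtJa }

func warnIfContinuityMisconfigured(continuityEnabled: Bool, version: ModelVersion) {
    let logger = Logger(subsystem: "FluidAudio", category: "ASR")
    if continuityEnabled && version != .v3 {
        // The flag still produces correct output on non-v3 models, but only adds
        // sequential-processing latency, so surface the likely misconfiguration.
        logger.warning("multilingualChunkContinuity targets parakeet-tdt-0.6b-v3; other models gain only latency.")
    }
}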


Design Notes (No Action Required)

Punctuation guard double-accounting: The blankFramesSincePunctuation counter increments in both the outer loop (line 292–294) and the inner blank loop (lines 371–372). This is correct — the outer adds the first blank's duration, and the inner accumulates subsequent consecutive blanks. No double-counting.

contextSamples: 0 tradeoff: Passing 0 disables contextFrameAdjustment, so the decoder emits through the 2.0s prefix region. Those prefix tokens get timestamps [contextStart/spf, chunkStart/spf] which overlap with the previous chunk's tail — the LCS/midpoint merger correctly discards them. The PR description acknowledges the residual known limitation (files where content is still dropped after a sentence-final seam).

TdtDecoderState.reset() on punctuation seam (ChunkProcessor.swift:361-363): This zeroes LSTM h/c + clears lastToken/predictorOutput, causing SOS re-priming for chunk N+1. The 2.0s audio prefix on chunk N+1 compensates — the encoder sees real audio history before the clean-slate LSTM produces its first prediction. This is consistent with the streaming path design.


CI Status

All relevant benchmarks passed (ASR ✅, Japanese CTC ✅, Parakeet EOU ✅, Qwen3 ✅, TTS ✅). Diarization and Sortformer benchmarks show NaN which appears to be a pre-existing CI infrastructure issue unrelated to this PR.


@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 2.4x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 2.2x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 2.5x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 1.8x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

Alex-Wengg added a commit that referenced this pull request May 11, 2026
Three followups from the bot review of PR #596:

1. Update stale WER doc comment in AsrTypes.swift
   The `multilingualChunkContinuity` doc said "~4.40% English WER", which
   referred to an intermediate variant before the Devin globalFrameOffset
   fix landed. The validated landed number is 2.90% (vs 2.64% with flag
   off, a +0.26pp cost).

2. Warn when `multilingualChunkContinuity=true` on a non-v3 model
   The flag's sequential serialization + 2.0s audio prefix is designed
   to mitigate parakeet-tdt-0.6b-v3 English-prior drift. On v2 /
   tdtCtc110m / ctcZhCn / tdtJa it still produces correct output, but
   only adds latency with no benefit, so log a warning once when the
   path is entered with a non-v3 model.

3. Unit tests for the new code path (CLAUDE.md policy: "Add unit tests
   when writing new code")
   - ASRConfigTests: multilingualChunkContinuity defaults to false,
     preserves explicit true/false, and doesn't disturb other fields.
   - ChunkProcessorTests (via #if DEBUG accessors):
     * audio prefix is exactly 32000 samples (2.0s @ 16kHz), encoder-
       frame-aligned (multiple of 1280).
     * multilingual chunk size + prefix ≤ maxModelSamples (240000),
       and chunk size is frame-aligned.
     * multilingual chunk size is strictly smaller than default chunk
       size (it has to give up content to make prefix room).
     * chunkSamples(multilingualContinuity:) dispatches correctly to
       either the default or multilingual sizing path.
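
A compressed sketch of the item-3 tests described above; the bare ASRConfig() initializer and the exact sizing relation are assumptions for illustration, not the shipped test code:

import XCTest

final class MultilingualContinuityConfigTests: XCTestCase {
    // Sketch: the flag is opt-in and defaults to false (assumes a default ASRConfig init).
    func testMultilingualChunkContinuityDefaultsToFalse() {
        XCTAssertFalse(ASRConfig().multilingualChunkContinuity)
    }

    // Sketch of the sizing invariants, using the 32000 / 240000 constants quoted above.
    func testPrefixAndChunkBudget() {
        let audioPrefixSamples = 32_000        // 2.0 s at 16 kHz
        let maxModelSamples = 240_000
        let multilingualChunkSamples = maxModelSamples - audioPrefixSamples  // assumed relation

        XCTAssertEqual(audioPrefixSamples % 1_280, 0)  // encoder-frame-aligned
        XCTAssertLessThanOrEqual(multilingualChunkSamples + audioPrefixSamples, maxModelSamples)
        XCTAssertLessThan(multilingualChunkSamples, maxModelSamples)  // gives up room for the prefix
    }
}
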
@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 2.4x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log


@beta-devin-ai-integration beta-devin-ai-integration Bot left a comment


Devin Review found 2 new potential issues.

View 10 additional findings in Devin Review.


Comment on lines +352 to +363
let (windowTokens, windowTimestamps, windowConfidences, windowDurations) =
    try await Self
    .transcribeChunk(
        samples: chunkSamplesArray,
        contextSamples: 0,  // Pass 0 to disable contextFrameAdjustment
        chunkStart: contextStart,
        isLastChunk: isLastChunk,
        using: manager,
        decoderState: &decoderState,
        maxModelSamples: maxModelSamples,
        language: language
    )


🟡 timeJump=0 special case causes decoder to skip the entire 2.0s audio prefix in multilingual continuity path

In processWithMultilingualContinuity, the decoder state (including timeJump) is persisted across chunks. After each non-last chunk, decoderState.timeJump is set to currentTimeIndices - effectiveSequenceLength (TdtDecoderV3.swift:574). When this value is exactly 0 (decoder finished at the boundary), the next chunk's calculateInitialTimeIndices (TdtFrameNavigation.swift:42-44) hits a special case that returns ASRConstants.standardOverlapFrames (25 frames = 2.0s) — which is the exact length of the audio prefix. This causes the decoder to skip the entire prefix region.

The comment at line 341 says "let the decoder emit naturally through the prefix region" and the LCS/midpoint merger is expected to discard prefix tokens (line 342-344), but when timeJump=0 no prefix tokens are produced. The behavior is inconsistent: timeJump=1 → decoder starts at frame 1 (processes prefix), timeJump=0 → decoder starts at frame 25 (skips prefix entirely). The standardOverlapFrames special case was designed for streaming overlap, not for the batch multilingual prefix. The encoder still benefits from the prefix audio (it processes the full buffer), and the 2.0s stride overlap still provides merge anchors, but the decoder's frame-level behavior becomes non-deterministically dependent on whether timeJump is exactly 0.

Prompt for agents
The processWithMultilingualContinuity function persists decoderState across chunks, including timeJump. After each chunk, TdtDecoderV3.decodeWithTimings sets decoderState.timeJump via calculateFinalTimeJump. On the next chunk, calculateInitialTimeIndices (TdtFrameNavigation.swift:42-44) returns standardOverlapFrames (25) when timeJump==0 and contextFrameAdjustment==0, which skips the entire 2.0s audio prefix.

The fix should ensure the decoder starts at frame 0 for non-first chunks in the multilingual continuity path, so it processes through the prefix region as intended. Options:
1. Clear decoderState.timeJump to nil after each chunk in processWithMultilingualContinuity (before the next iteration), so calculateInitialTimeIndices returns contextFrameAdjustment=0 via its nil guard path.
2. Or pass a negative contextFrameAdjustment to counteract the standardOverlapFrames special case.

Option 1 is simpler: add decoderState.timeJump = nil after the transcribeChunk call (or at the end of the while loop body, before chunkStart += strideSamples). This preserves LSTM hidden/cell state and lastToken while ensuring the decoder always starts at frame 0 on non-first chunks.
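
A sketch of option 1, placed in the loop as the prompt suggests (the surrounding loop body is paraphrased, not the actual processWithMultilingualContinuity source):

// Paraphrased loop skeleton; only the `decoderState.timeJump = nil` line is the proposed change.
func continuityLoopSketch(totalSamples: Int, strideSamples: Int, decoderState: inout TdtDecoderState) {
    var chunkStart = 0
    while chunkStart < totalSamples {
        // ... build the buffer starting at contextStart and call transcribeChunk ...

        // Clear the carried-over timeJump so the next chunk's decoder starts at frame 0
        // of its buffer (which already begins with the 2.0 s prefix), instead of hitting
        // the standardOverlapFrames special case and skipping the prefix entirely.
        decoderState.timeJump = nil

        chunkStart += strideSamples
    }
}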


Comment on lines +598 to +604
// Phase 2: filter punctuation out of LCS candidates to prevent false anchors
let punctuationAwareMatcher: (IndexedToken, IndexedToken) -> Bool = { l, r in
    if ASRConstants.punctuationTokens.contains(l.token.token) {
        return false
    }
    return timeTolerantMatcher(l, r)
}


🟡 Shared merge logic changes (punctuation-aware LCS + midpoint) modify default parallel path behavior contrary to documentation

The mergeChunks and mergeByMidpoint functions are shared between both the parallel and multilingual continuity paths. This PR modifies both: the LCS fallback now uses punctuationAwareMatcher (line 599-604) that rejects punctuation tokens as anchors, and mergeByMidpoint (line 687-694) walks backward past trailing punctuation to compute effectiveLeftEnd. These changes affect the default parallel path even when multilingualChunkContinuity = false.

The processWithParallelChunks doc comment at line 148 states "identical to origin/main behavior" and the ASRConfig doc at AsrTypes.swift:30-32 claims "WER 2.64% on test-clean" for the default flag-off path. Both claims are now inaccurate since the merge logic has changed. The merge improvements are likely beneficial (punctuation tokens like . ? ! can cause false LCS anchors at chunk boundaries), but the behavioral change is not gated by the multilingualChunkContinuity flag and could shift the English baseline WER.

Prompt for agents
The punctuation-aware LCS matcher at ChunkProcessor.swift:598-604 and the trailing-punctuation adjustment in mergeByMidpoint at ChunkProcessor.swift:687-694 are applied unconditionally in the shared mergeChunks function, affecting both the parallel (default) and multilingual continuity paths.

If the intent is to preserve bit-for-bit behavior for the default path:
- Gate the punctuationAwareMatcher behind a flag (e.g. pass multilingualContinuity through to mergeChunks, and only use punctuationAwareMatcher when true; use timeTolerantMatcher otherwise).
- Similarly gate the effectiveLeftEnd adjustment in mergeByMidpoint.

If the intent is to improve both paths (likely, given the changes are labeled Phase 2):
- Update the processWithParallelChunks doc comment (line 148) to remove the claim of identical to origin/main behavior.
- Update the ASRConfig doc comment (AsrTypes.swift:30-32) to note that merge improvements may slightly shift the WER figure.
- Re-run the LibriSpeech test-clean benchmark with the flag off to confirm the new baseline WER.
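
For reference, the first option would look roughly like this; the flag plumbing into mergeChunks is hypothetical, and the follow-up commit below ultimately kept the shared matcher and updated the docs instead:

// Hypothetical gating of the matcher; `multilingualContinuity` would need to be
// passed into mergeChunks, which this PR does not do.
let matcher: (IndexedToken, IndexedToken) -> Bool =
    multilingualContinuity
    ? { l, r in
        if ASRConstants.punctuationTokens.contains(l.token.token) { return false }
        return timeTolerantMatcher(l, r)
    }
    : timeTolerantMatcher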


Alex-Wengg added a commit that referenced this pull request May 11, 2026
Devin PR-596 review (2nd pass) flagged two issues:

1. (real bug) `processWithMultilingualContinuity` persists `timeJump`
   across chunks. `TdtDecoderV3.decodeWithTimings` writes
   `decoderState.timeJump = currentTimeIndices - effectiveSequenceLength`
   for non-last chunks. On the next chunk,
   `TdtFrameNavigation.calculateInitialTimeIndices` either skips into
   the 2.0s prefix region (prevTimeJump > 0) or, when prevTimeJump == 0,
   returns the special-case `standardOverlapFrames` (25 = exactly the
   prefix length) which skips the prefix entirely. Either case breaks
   the merger's ability to anchor tokens across the chunk seam.

   Fix: explicitly `decoderState.timeJump = nil` after each non-last
   chunk so the next chunk's decoder starts at frame 0 of the buffer
   (which already begins with the 2.0s prefix). The punctuation-seam
   `reset()` already nils timeJump as a side effect; this clear handles
   every other boundary the same way.

2. (doc accuracy) Both ChunkProcessor.processWithParallelChunks and
   ASRConfig.multilingualChunkContinuity claimed the default path was
   bit-for-bit identical to pre-#596 main. The shared
   `mergeChunks` / `mergeByMidpoint` punctuation-aware LCS matcher and
   trailing-punctuation midpoint adjustment apply to both paths.
   Empirically WER-neutral on LibriSpeech test-clean (validated 2.64%),
   but the merger algorithm is no longer literally identical. Doc
   updated to reflect this.

Tests: TdtDecoderStateTests gains `testTimeJumpNilingForMultilingualContinuityPath`
documenting that `timeJump` is independently nilable without disturbing
LSTM hidden/cell, lastToken, or predictorOutput. `testDecoderStateReset`
also now exercises the timeJump field in the populate/reset cycle.
@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 1.5x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

@github-actions

✅ Japanese ASR Benchmark Results (CTC)

Status: Passed

Metric Value
CER 9.07%
Samples 50
Avg RTFx 2.7x
Decoder CTC

✅ Benchmark completed successfully. The TDT Japanese hybrid model (CTC preprocessor/encoder + TDT decoder/joint) is working correctly.

View benchmark log

PR #264 (commit 7459740) added an 80ms (1 encoder frame, 1280 samples)
mel-context prepend on non-first chunks to fix all-blank predictions at
chunk boundaries on long English audio. On `parakeet-tdt-0.6b-v3-coreml`
with non-English audio, that prepend shifts the FastConformer encoder's
first-frame distribution just enough that the SOS-primed TDT decoder
drifts back to its English-biased prior at every chunk seam.

Reproduction (4 fixtures, default vs --no-mel-context):
  - notes_1408 (FR):  drift -> clean
  - wwii (FR):        clean -> clean
  - user_en (EN):     clean -> clean
  - user2 99.9s (FR): clean -> clean

Changes:
  - ASRConfig gains `melChunkContext: Bool = true` (default preserves
    PR #264 behavior; set to false for non-English long-form batch).
  - ChunkProcessor reads the flag and zeroes the prepend when disabled,
    expanding chunkSamples back so chunks aren't 80ms smaller than the
    encoder's max receptive window.
  - `transcribe` and `asr-benchmark` CLIs accept `--no-mel-context`.

Closes #594
@Alex-Wengg Alex-Wengg force-pushed the fix/asr-594-french-chunk-boundary branch from f7f58f5 to bfa14a1 on May 12, 2026 12:55
@Alex-Wengg Alex-Wengg changed the title from "fix(asr): persist decoder state, extend audio context, guard punctuation termination (#594)" to "fix(asr): add melChunkContext opt-out flag for Issue #594" on May 12, 2026


Development

Successfully merging this pull request may close these issues.

bug: French transcription drifts to English at chunk boundary in AsrManager.transcribe (batch) — regression introduced by #264

1 participant