
fix(asr): harden v3 multilingual chunk seams #604

Draft
vdt4534 wants to merge 1 commit into FluidInference:fix/asr-594-french-chunk-boundary from vdt4534:codex/asr-594-v3-seam-warmup


vdt4534 commented May 12, 2026

Summary

This updates the existing stacked draft after the #596 overhaul. I retested the current #596 head (bfa14a17) with a larger fixture set. The no-mel direction still looks right and keeps the fast path, but a few longer/more varied fixtures still drift into English at chunk seams.

This patch keeps the same basic direction and speed profile, then adds a v3-only acoustic boundary/warmup path for no-mel batch chunks.

It does not use language hints, vocabulary filtering, language-specific token rules, or a second decode.
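To illustrate the warmup idea without a second decode, here is a minimal sketch of a single pass in which tokens from a hidden prefix advance decoder state but are never emitted. `TimedToken`, `ToyDecoderState`, and `emitSuppressingPrefix` are hypothetical names for illustration only; they are not taken from this PR's diff, and the real decoder state is far richer than the toy shown here.

```swift
/// Hypothetical token type for illustration.
struct TimedToken: Equatable {
    let id: Int
    let frame: Int   // encoder frame at which the token was emitted
}

/// Toy stand-in for decoder state: it only remembers the last token seen.
struct ToyDecoderState {
    var lastTokenId: Int? = nil
    mutating func advance(with token: TimedToken) { lastTokenId = token.id }
}

/// Single pass over decoded tokens: every token updates the state,
/// but tokens that fall inside the hidden prefix are not emitted.
func emitSuppressingPrefix(
    _ decoded: [TimedToken],
    prefixFrames: Int,
    state: inout ToyDecoderState
) -> [TimedToken] {
    var output: [TimedToken] = []
    for token in decoded {
        state.advance(with: token)          // prefix tokens still shape context
        if token.frame >= prefixFrames {    // only post-prefix tokens are visible
            output.append(token)
        }
    }
    return output
}

// Usage: a 7-frame prefix hides the first two tokens.
var demoState = ToyDecoderState()
let demoTokens = [
    TimedToken(id: 10, frame: 2),
    TimedToken(id: 11, frame: 6),
    TimedToken(id: 12, frame: 7),
    TimedToken(id: 13, frame: 9),
]
let visible = emitSuppressingPrefix(demoTokens, prefixFrames: 7, state: &demoState)
print(visible.map(\.id))
```

The point of the design is that the chunk's first visible token is decoded with context from real audio rather than from a cold start.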

What changed

  • For Parakeet TDT v3 with melChunkContext == false, choose chunk starts from local acoustic boundaries rather than from the fixed grid alone.
  • Use local energy contrast around the boundary candidate, not hard-coded French/English rules or token vocabulary.
  • On short troughs only, decode a hidden 7-frame prefix of real audio; tokens emitted in the prefix region are suppressed from the output but still update decoder state.
  • Skip warmup across stable quiet/true pause regions.
  • Avoid pulling a chunk boundary earlier when that would force the next boundary into high-energy speech.
  • Harden overlap merging so equal-length/leading contested gaps use token confidence instead of always choosing the older chunk.
  • Keep this gated to v3/no-mel; v2, tdt-ja, and zh-CN CTC are not routed through this path.
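
The boundary and warmup bullets above can be pictured with a simplified sketch. It assumes per-frame energies are already available and picks the lowest-energy frame near the fixed-grid boundary, then classifies the trough as short (warm up) or a stable pause (skip warmup). `BoundaryChoice`, `chooseBoundary`, `quietThreshold`, and `longPauseFrames` are hypothetical names and parameters, not the PR's actual code, and the sketch omits the energy-contrast scoring and the check against pushing the next boundary into high-energy speech.

```swift
/// Hypothetical result type; not taken from the PR's diff.
struct BoundaryChoice: Equatable {
    let frame: Int        // chosen chunk-start frame
    let useWarmup: Bool   // decode a hidden prefix before this frame?
}

/// Pick the lowest-energy frame within `radius` of the fixed-grid boundary,
/// then decide on warmup from the length of the quiet run around it.
func chooseBoundary(
    energies: [Float],
    nominal: Int,
    radius: Int,
    quietThreshold: Float,   // assumed: energy below this counts as quiet
    longPauseFrames: Int     // assumed: quiet runs this long count as true pauses
) -> BoundaryChoice {
    precondition(energies.indices.contains(nominal) && radius >= 0)
    let lo = max(0, nominal - radius)
    let hi = min(energies.count - 1, nominal + radius)

    // Earliest frame with the minimum energy inside the search window.
    var best = lo
    for i in lo...hi where energies[i] < energies[best] { best = i }

    // Length of the contiguous quiet run containing the trough (0 if not quiet).
    var quietRun = 0
    if energies[best] < quietThreshold {
        var start = best
        var end = best
        while start > 0, energies[start - 1] < quietThreshold { start -= 1 }
        while end < energies.count - 1, energies[end + 1] < quietThreshold { end += 1 }
        quietRun = end - start + 1
    }

    // Short trough: warm up. Stable quiet region: skip warmup.
    return BoundaryChoice(frame: best, useWarmup: quietRun < longPauseFrames)
}

// Usage: a one-frame dip gets warmup; a four-frame pause does not.
let shortTrough: [Float] = [5, 5, 5, 1, 5, 5, 5, 5]
let truePause: [Float] = [5, 5, 0.5, 0.5, 0.5, 0.5, 5, 5]
print(chooseBoundary(energies: shortTrough, nominal: 4, radius: 2,
                     quietThreshold: 2, longPauseFrames: 3))
print(chooseBoundary(energies: truePause, nominal: 3, radius: 2,
                     quietThreshold: 2, longPauseFrames: 3))
```

Nothing in the sketch looks at tokens or language; it only reads energies and frame indices, which is the property the patch is aiming for.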

Why

On the current #596 branch, the original notes_1408_clean.wav repro is fixed, and total time across the 20 fixtures drops from 30.748s on the old pinned baseline to 17.153s. The expanded fixtures still showed drift into English at chunk seams:

  • climate voice memo: gouvernements and entreprises...
  • long FR clip: départemental, the camions...
  • Perrault FR 2min: the degree of pénétration...
  • Candide FR 10min: and Montepulciano... Chypre and Samos

The failures do not look French-specific in mechanism. They look like seam/decoder-context issues where v3's multilingual prior falls into English after a bad boundary. The fix only looks at audio energy, frame indices, and model token confidence.
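
As a rough illustration of the token-confidence part, the sketch below resolves a contested overlap gap by comparing mean confidence between the two chunks' candidate tokens, falling back to the older chunk on a tie. `ScoredToken`, `meanConfidence`, and `resolveContestedGap` are hypothetical names, and mean confidence is an assumed scoring rule; the PR may compare confidences differently.

```swift
/// Hypothetical token type carrying the model's confidence for that token.
struct ScoredToken: Equatable {
    let id: Int
    let confidence: Float
}

/// Mean confidence of a token run; an empty run scores 0.
func meanConfidence(_ tokens: [ScoredToken]) -> Float {
    guard !tokens.isEmpty else { return 0 }
    return tokens.reduce(Float(0)) { $0 + $1.confidence } / Float(tokens.count)
}

/// Choose between the older and newer chunk's tokens for a contested gap.
/// The newer chunk wins only when it is strictly more confident.
func resolveContestedGap(older: [ScoredToken], newer: [ScoredToken]) -> [ScoredToken] {
    return meanConfidence(newer) > meanConfidence(older) ? newer : older
}

// Usage: the newer chunk is clearly more confident, so its tokens are kept.
let olderGap = [ScoredToken(id: 1, confidence: 0.4), ScoredToken(id: 2, confidence: 0.5)]
let newerGap = [ScoredToken(id: 3, confidence: 0.9), ScoredToken(id: 4, confidence: 0.8)]
print(resolveContestedGap(older: olderGap, newer: newerGap).map(\.id))
```

The rule needs only token ids and confidences, so it involves no language-specific logic.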

Validation

20 fixtures locally:

  • 8 Amnis/repro fixtures: FR, EN, mixed FR->EN, long FR
  • 12 LibriVox public-domain fixtures: EN/FR/ES/PT at roughly 2, 5, and 10 minutes

Timing:

| Variant | Total (20 fixtures) | Avg per fixture |
| --- | --- | --- |
| pinned bb96003 | 30.748s | 1.537s |
| current #596 bfa14a17 no-mel | 17.153s | 0.858s |
| this PR | 17.032s | 0.852s |
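
For anyone checking the figures, the averages and relative speedups follow directly from the totals. The snippet below is only a convenience calculation over the three totals reported above; `round3` is a helper introduced here for illustration.

```swift
// Totals in seconds over 20 fixtures, copied from the timing figures above.
let fixtureCount = 20.0
let totals: [(variant: String, total: Double)] = [
    ("pinned bb96003", 30.748),
    ("current #596 no-mel", 17.153),
    ("this PR", 17.032),
]

/// Round to three decimals, matching the precision of the reported numbers.
func round3(_ x: Double) -> Double { (x * 1000).rounded() / 1000 }

let averages = totals.map { round3($0.total / fixtureCount) }
let speedupVsPinned = totals[0].total / totals[2].total
let speedupVsCurrent = totals[1].total / totals[2].total

for (entry, avg) in zip(totals, averages) {
    print("\(entry.variant): avg \(avg)s")
}
print("speedup vs pinned:", round3(speedupVsPinned))
print("speedup vs current #596:", round3(speedupVsCurrent))
```

In total time this PR is roughly 1.8x faster than the pinned baseline and within about 1% of current #596, so the seam handling does not cost the fast path.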

Known target snippets with this PR:

  • notes_1408_clean: faire un effort sur le progrès
  • wwii_belgique_fr: En Belgique... a été jour férié
  • user2_2026-05-12: blouses médicales blanches et portant
  • climate voice memo: gouvernements et entreprises ont revu
  • mixed FR/EN: oblasts occupés de Donetsk... A decade on
  • long FR clip: départemental les camions rouges
  • Perrault: degré de pénétration
  • Candide: et les invite... et à boire

Checks:

  • swift test --filter ChunkProcessorTests
  • swift test --filter TdtRefactoredComponentsTests
  • swift test --filter TdtDecoderV2Tests
  • swift build -c release --product fluidaudiocli
  • git diff --check

Fixtures

The already-uploaded issue fixtures are enough to reproduce part of this. I also tested additional longer clips. The LibriVox clips are public-domain and can be attached or regenerated as clipped WAVs if useful. The private Amnis/voice fixtures are not committed here; I can attach specific ones only if we decide they are okay to make public.

@Alex-Wengg
Member

I actually overhauled this change. See #596.

vdt4534 force-pushed the codex/asr-594-v3-seam-warmup branch from 0af1b54 to 01f2b62 on May 13, 2026 04:52
vdt4534 changed the title from "fix(asr): avoid v3 multilingual seam drift" to "fix(asr): harden v3 multilingual chunk seams" on May 13, 2026
Author

vdt4534 commented May 13, 2026

Updated this draft after your #596 overhaul. The previous version of this PR is superseded; the current branch is now based on bfa14a17 and keeps the no-mel direction, with the extra acoustic-boundary/warmup handling described in the updated body.

I also added a fresh note on #594 with the newer fixture results and timing numbers.
