fix(asr): align acoustic and semantic feature lengths to prevent tensor mismatch #309

Open
JasonOA888 wants to merge 1 commit into microsoft:main from JasonOA888:fix/issue-220-tensor-mismatch

Conversation

@JasonOA888

Fixes #220

Root Cause

The acoustic and semantic tokenizers use different encoder architectures with different downsampling ratios. For certain audio lengths, they produce slightly different frame counts (e.g. 228 vs 223 frames).

When the element-wise addition acoustic_features + semantic_features is attempted with mismatched temporal dimensions, PyTorch raises:

RuntimeError: The size of tensor a (228) must match the size of tensor b (223)
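
The failure can be reproduced in isolation with a minimal sketch like the one below. The tensor layout `(batch, channels, frames)` and the variable values are assumptions for illustration only; the real features in this repository may have a different rank and layout, so the dimension index reported in the message can differ.

```python
import torch

# Hypothetical shapes: (batch, channels, frames). Only the frame counts
# (228 vs 223) are taken from the report; everything else is made up.
acoustic_features = torch.zeros(1, 4, 228)
semantic_features = torch.zeros(1, 4, 223)

try:
    # Broadcasting cannot reconcile 228 and 223 because neither is 1.
    combined = acoustic_features + semantic_features
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)

print(error_message)
```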

Fix

Truncate both feature sequences to the minimum common length before the addition, applied consistently to:

  • Short-audio path (direct processing): align before combining
  • Long-audio path (streaming): align after segment concatenation

The streaming path already had alignment logic, but it was missing from the direct processing path.
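
A rough sketch of the truncation described above, not the code from this PR. The helper name `align_to_min_length`, the assumption that the frame axis is the last dimension, and the segment sizes are all hypothetical.

```python
import torch


def align_to_min_length(acoustic, semantic):
    """Truncate both tensors to their shared minimum length.

    Assumes the frame (time) axis is the last dimension of each tensor.
    """
    min_len = min(acoustic.shape[-1], semantic.shape[-1])
    return acoustic[..., :min_len], semantic[..., :min_len]


# Short-audio path: align, then combine.
acoustic = torch.randn(1, 4, 228)
semantic = torch.randn(1, 4, 223)
acoustic_aligned, semantic_aligned = align_to_min_length(acoustic, semantic)
combined = acoustic_aligned + semantic_aligned

# Long-audio path: concatenate per-segment features first, then align once.
acoustic_segments = [torch.randn(1, 4, 100), torch.randn(1, 4, 128)]
semantic_segments = [torch.randn(1, 4, 98), torch.randn(1, 4, 125)]
acoustic_long, semantic_long = align_to_min_length(
    torch.cat(acoustic_segments, dim=-1),
    torch.cat(semantic_segments, dim=-1),
)
combined_long = acoustic_long + semantic_long

print(combined.shape, combined_long.shape)
```

Truncating to the minimum drops a few trailing frames from the longer sequence. That is the trade-off stated in the fix, as opposed to padding the shorter sequence.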

…or mismatch

The acoustic and semantic tokenizers use different encoder architectures
with different downsampling ratios. When processing audio of certain
lengths, they can produce slightly different frame counts (e.g. 228 vs
223 frames), causing a tensor size mismatch when the features are
combined via element-wise addition.

This fix truncates both feature sequences to the minimum length before
combination, applied consistently to both the short-audio (direct) and
long-audio (streaming) code paths.

Fixes microsoft#220

Development

Successfully merging this pull request may close these issues.

The size of tensor a (228) must match the size of tensor b (223) at non-singleton dimension 3...
