Skip to content

fix: place Gemma 4 server <|audio|> block inside the user turn#440

Merged
inureyes merged 3 commits into
mainfrom
fix/issue-437-gemma4-server-audio-placement
Jun 25, 2026
Merged

fix: place Gemma 4 server <|audio|> block inside the user turn#440
inureyes merged 3 commits into
mainfrom
fix/issue-437-gemma4-server-audio-placement

Conversation

@inureyes

Copy link
Copy Markdown
Member

Summary

Server-side counterpart to #436 (PR #438): Gemma 4 audio understanding over POST /v1/chat/completions with an input_audio part emitted 0 tokens. This places the audio block inside the user turn so the model answers instead of emitting an immediate EOS. The server resampling half of #436 is already merged in #438 and is not re-added here; this PR is only the <|audio|> placement.

Root cause

The server renders chat messages text-only (MessageContent::text() in src/server/types/request.rs keeps only ContentPart::Text and drops input_audio), so the Gemma 4 chat template never renders an <|audio|> marker into the user turn. expand_gemma4_audio_tokens_for_server (src/multimodal/vlm_runtime.rs) then found no placeholder and used its "insert before the final token" fallback, which lands the BOA + AUDIO*N + EOA block inside the trailing model turn of the Gemma generation prompt (<start_of_turn>user\n…<end_of_turn>\n<start_of_turn>model\n), forcing an immediate EOS. This is the exact placement defect class that #436 Fix A resolved for the CLI.

What changed

  • src/multimodal/vlm_runtime.rs: expand_gemma4_audio_tokens_for_server gains an end_of_turn_token_id: Option<i32> parameter and a three-tier placement policy. (1) If a rendered <|audio|> placeholder is present, wrap the first occurrence in place (unchanged behavior, used by the CLI-style path). (2) Otherwise, splice the audio block immediately before the last <end_of_turn> token, which in a Gemma generation prompt closes the latest user turn (the pending model turn has no closing marker yet), so the block lands inside the user turn. (3) Only when no <end_of_turn> marker exists (e.g. --no-chat-template) does it keep the legacy "before the final token" insertion as a last resort.
  • src/server/model_worker.rs: new resolve_end_of_turn_token_id helper resolves <end_of_turn> from the tokenizer (HF token_to_id, with a single-token encode fallback for SentencePiece/Tiktoken). The id is resolved once in prepare_request_vlm_embeddings and threaded through both Gemma 4 audio paths, prepare_gemma4_audio_embeddings (Conformer VLM) and prepare_gemma4_unified_audio_embeddings (encoder-free Unified), which both feed it to the expansion. This covers the audio-only and image+audio combined call sites.
  • src/multimodal/vlm_runtime_tests.rs: unit tests for in-place wrap, user-turn insertion (before the closing <end_of_turn>), multi-turn targeting of the last user turn, and both last-resort fallbacks (no <end_of_turn> id, and <end_of_turn> id known but absent from the prompt).

Why images are the reference, and why non-audio is unchanged

Server image markers are placed at the token-stream level (expand_gemma4_image_tokens / apply_image_token_blocks), not by routing image content parts through the chat template. Mirroring that, this fix stays at the token-stream level and does not touch src/server/chat_request.rs, so the prefix-cache-sensitive typed-vs-raw path selection (the preserve_thinking flow) is untouched. Only the two Gemma 4 audio embedding paths change, so text-only, image-only, and non-Gemma server requests keep a byte-identical prompt and prompt-cache prefix.

Test plan

  • cargo test --lib --features metal,accelerate server_audio (5 new tests pass)
  • cargo check --bin mlxcel --features metal,accelerate
  • cargo fmt --all --check
  • cargo clippy --lib --tests clean for the three changed files (remaining useless_vec failures are pre-existing in src/audio/nemotron_h_nano_omni/encoder.rs, untouched here, surfaced by a newer local clippy)
  • Real-model server E2E (run by the orchestrator): start mlxcel-server with gemma-4-e2b-it-4bit, POST a /v1/chat/completions request with an input_audio part, confirm a coherent transcription instead of 0 tokens

Closes #437

Gemma 4 audio understanding over the server (POST /v1/chat/completions with an input_audio part) emitted 0 tokens. The server renders chat messages text-only (MessageContent::text() drops input_audio content parts), so the chat template never produced an <|audio|> marker; expand_gemma4_audio_tokens_for_server then found no placeholder and used its "insert before the final token" fallback, which lands the BOA + AUDIO*N + EOA block inside the trailing model turn of the Gemma generation prompt and forces an immediate EOS. This is the server half of the placement defect that #436 fixed for the CLI.

expand_gemma4_audio_tokens_for_server now places the audio block at the token-stream level, the same layer the image path uses via expand_gemma4_image_tokens / apply_image_token_blocks, so the prefix-cache-sensitive render path in chat_request.rs is untouched. When a rendered <|audio|> placeholder is present it is wrapped in place; otherwise the block is spliced in immediately before the last <end_of_turn> token, which closes the latest user turn in a Gemma generation prompt, so the audio stays inside the user turn. The model worker resolves the <end_of_turn> id from the tokenizer and threads it through both Gemma 4 audio paths: prepare_gemma4_audio_embeddings (Conformer VLM) and prepare_gemma4_unified_audio_embeddings (encoder-free Unified). The legacy before-the-final-token insertion stays only as a last resort for prompts with no <end_of_turn> marker.

Non-audio requests are byte-identical: only the two audio embedding paths change and chat_request.rs is not touched, so text-only, image-only, and non-Gemma server requests keep their exact prompt and prompt-cache prefix behavior. New unit tests cover in-place wrap, user-turn insertion, multi-turn last-user-turn targeting, and both last-resort fallbacks.
@inureyes inureyes added status:review Under review type:bug Bug fixes, error corrections, or issue resolutions priority:medium Medium priority area:inference Generation, sampling, decoding (incl. speculative, DRY) labels Jun 25, 2026
…user turn

The server places the Gemma 4 audio block before the last end-of-turn token (issue #437), but `resolve_end_of_turn_token_id` only looked up "<end_of_turn>". Gemma 4 renamed the marker to "<turn|>" (id 106; start-of-turn is "<|turn>"), so the lookup returned None and the placement silently degraded to the model-turn fallback, still yielding 0 tokens. Try both "<end_of_turn>" (Gemma 2/3) and "<turn|>" (Gemma 4) so the audio block lands inside the last user turn. Verified end to end: a chat-completions request with an input_audio part to gemma-4-e2b now transcribes (0 -> 120 tokens) instead of emitting EOS at prefill.
@inureyes

Copy link
Copy Markdown
Member Author

Security and performance review (PR #440 @ 0ff694c)

Scope: changed files only (src/multimodal/vlm_runtime.rs, src/server/model_worker.rs, src/multimodal/vlm_runtime_tests.rs).

Result: no CRITICAL or HIGH findings. No code changes required.

Request-path panic safety

  • expand_gemma4_audio_tokens_for_server is total over every input reachable on the request path. Tier 2 splice(pos..pos, block) uses a pos from rposition, which is always < len, so the empty insertion range is in bounds and cannot panic. Tier 3 uses Vec::pop (returns None on empty) and re-pushes only on Some, so an empty prompt, a prompt with no end-of-turn token, and a prompt that is entirely end-of-turn tokens all terminate without panicking. None / not-found degrades to tier 3 as intended.
  • resolve_end_of_turn_token_id performs only fallible lookups (token_to_id, then encode(...) guarded by is_ok and len() == 1). No unwrap, index, or slice that can panic.

Resource bounds

  • The Conformer path caps the block at 750 via compute_audio_num_tokens. The WAV parser rejects 0 sample-rate and 0 channels and caps decoded data at 500 MB, so a crafted odd-rate or oversized payload cannot reach an unbounded count or a divide-by-zero. The base64 envelope is bounded upstream. The splice allocates one block of the same size the previous fallback did, so no per-request growth was introduced.

No injection / no prefix-cache regression

  • Placement is purely token-id level (no string templating). resolve_end_of_turn_token_id runs only inside if !audio.is_empty(), and the expansion runs only from the two Gemma 4 audio paths, so text-only, image-only, and non-Gemma requests keep byte-identical prompt_tokens and an unchanged prompt-cache prefix. chat_request.rs and the preserve_thinking path selection are untouched.

Performance

  • The added rposition scan and splice run once per audio request at prefill, both O(n) in prompt length, with no per-decode-token cost and no extra allocation on the text or image hot paths.

Observations (non-blocking, not introduced by this PR)

  • The Unified path sizes the block from audio_input.num_frames (samples / audio_samples_per_token) rather than the 750 cap used by the Conformer path. It stays bounded by the 500 MB WAV limit and the upstream base64 bound, so it is not an unbounded-resource issue, but it is the larger of the two block sizes a long clip can produce. Pre-existing, out of scope here.
  • resolve_end_of_turn_token_id returns the first matching candidate (<end_of_turn> before <turn|>). If a future Gemma 4 vocab carried both literals while the template rendered only <turn|>, first-match could select a stale id. Validated working end-to-end on gemma-4-e2b (0 to 120 tokens), so not a current defect.

Add three unit tests to src/server/model_worker_tests.rs covering the function that had the bug (resolve_end_of_turn_token_id had no tests before this PR):

- resolve_end_of_turn_id_handles_gemma4_turn_marker: a tokenizer with "<turn|>" = id 106 but no "<end_of_turn>" must return Some(106). This is the exact regression guard for the #437 bug, where the old implementation only looked up "<end_of_turn>" and returned None for Gemma 4 tokenizers.
- resolve_end_of_turn_id_handles_gemma23_end_of_turn_marker: a tokenizer with "<end_of_turn>" = id 107 must return Some(107).
- resolve_end_of_turn_id_returns_none_when_no_marker_present: a tokenizer with neither marker must return None so the caller keeps its own fallback.

Each test builds a minimal HuggingFace BPE stub (via two helper functions, stub_eot_tokenizer and stub_tokenizer_no_eot) that mirrors the pattern of MlxcelTokenizer::stub_with_byte_fallback.

Also remove the stale "not yet wired through the chat prompt" note from docs/supported-models.md and replace it with a brief description of what the fix delivers.
@inureyes

Copy link
Copy Markdown
Member Author

PR Finalization Complete

Tests added (3 new, in src/server/model_worker_tests.rs)

resolve_end_of_turn_token_id had zero tests before this PR. Three tests were added, backed by two helper functions (stub_eot_tokenizer, stub_tokenizer_no_eot) that build minimal HuggingFace BPE stubs from inline JSON, following the same pattern as MlxcelTokenizer::stub_with_byte_fallback.

Test What it guards
resolve_end_of_turn_id_handles_gemma4_turn_marker A tokenizer with `"<turn
resolve_end_of_turn_id_handles_gemma23_end_of_turn_marker A tokenizer with "<end_of_turn>" = id 107 returns Some(107). Guards Gemma 2/3 behavior after the two-candidate lookup was introduced.
resolve_end_of_turn_id_returns_none_when_no_marker_present A tokenizer with neither marker returns None, confirming the caller's own fallback is not bypassed for non-Gemma models.

No integration test was added for expand_gemma4_audio_tokens_for_server with Gemma 4 token ids specifically: the five existing server_audio_* tests already use EOT = 106 and cover all placement tiers, so no genuinely new combination was available.

Documentation

docs/supported-models.md: removed the stale "Server-side audio input for this path is not yet wired through the chat prompt." sentence from the Gemma 4 VL audio bullet and replaced it with a brief note that POST /v1/chat/completions with input_audio now works and splices the audio block before <turn|>.

Lint / Format

cargo fmt --all applied and cargo fmt --all --check confirmed clean. cargo clippy --lib --features metal,accelerate -- -D warnings produced no new warnings (pre-existing out-of-scope warnings in the audio encoder are not touched).

Verification commands run

cargo fmt --all --check                                         # clean
cargo test --lib --features metal,accelerate 'resolve_end_of_turn'  # 3/3 pass
cargo test --lib --features metal,accelerate 'server::model_provider::model_worker::tests'  # 20/20 pass
cargo test --lib --features metal,accelerate 'server_audio'    # 5/5 pass
cargo clippy --lib --features metal,accelerate -- -D warnings  # no new warnings

New commit: 097bbc405

@inureyes inureyes added status:done Completed and removed status:review Under review labels Jun 25, 2026
@inureyes inureyes merged commit 6776688 into main Jun 25, 2026
5 checks passed
@inureyes inureyes deleted the fix/issue-437-gemma4-server-audio-placement branch June 25, 2026 16:27
inureyes added a commit that referenced this pull request Jun 25, 2026
The initial v0.3.3 cut (4d5f4fb) was never tagged, and 16 more commits
landed on main afterward. Fold those fixes into the v0.3.3 CHANGELOG and
debian/changelog entries and set the release date to 2026-06-25:

- N-gram loop detection (#433) and Nemotron-H Nano Omni audio input (#443)
- Prefill masks sized from the live window under --max-kv-size trim
  (#418, #420, #422, #431)
- Double-transpose crash on mlx-community conv checkpoints (#429)
- conv1d/conv2d and nemotron audio-encoder convs fallible at the FFI
  boundary (#434, #439)
- Gemma 4 audio placed in the user turn (#438, #440)
- mistral4 MLA backbone routing and 2D MoE token flattening (#423, #425)
- Bump actions/checkout from 6 to 7 (#395)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:inference Generation, sampling, decoding (incl. speculative, DRY) priority:medium Medium priority status:done Completed type:bug Bug fixes, error corrections, or issue resolutions

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix: Gemma 4 server audio (input_audio) places <|audio|> outside the user turn (0-token output)

1 participant