fix: place Gemma 4 server <|audio|> block inside the user turn#440
Conversation
Gemma 4 audio understanding over the server (POST /v1/chat/completions with an input_audio part) emitted 0 tokens. The server renders chat messages text-only (MessageContent::text() drops input_audio content parts), so the chat template never produced an <|audio|> marker; expand_gemma4_audio_tokens_for_server then found no placeholder and used its "insert before the final token" fallback, which lands the BOA + AUDIO*N + EOA block inside the trailing model turn of the Gemma generation prompt and forces an immediate EOS. This is the server half of the placement defect that #436 fixed for the CLI. expand_gemma4_audio_tokens_for_server now places the audio block at the token-stream level, the same layer the image path uses via expand_gemma4_image_tokens / apply_image_token_blocks, so the prefix-cache-sensitive render path in chat_request.rs is untouched. When a rendered <|audio|> placeholder is present it is wrapped in place; otherwise the block is spliced in immediately before the last <end_of_turn> token, which closes the latest user turn in a Gemma generation prompt, so the audio stays inside the user turn. The model worker resolves the <end_of_turn> id from the tokenizer and threads it through both Gemma 4 audio paths: prepare_gemma4_audio_embeddings (Conformer VLM) and prepare_gemma4_unified_audio_embeddings (encoder-free Unified). The legacy before-the-final-token insertion stays only as a last resort for prompts with no <end_of_turn> marker. Non-audio requests are byte-identical: only the two audio embedding paths change and chat_request.rs is not touched, so text-only, image-only, and non-Gemma server requests keep their exact prompt and prompt-cache prefix behavior. New unit tests cover in-place wrap, user-turn insertion, multi-turn last-user-turn targeting, and both last-resort fallbacks.
…user turn The server places the Gemma 4 audio block before the last end-of-turn token (issue #437), but `resolve_end_of_turn_token_id` only looked up "<end_of_turn>". Gemma 4 renamed the marker to "<turn|>" (id 106; start-of-turn is "<|turn>"), so the lookup returned None and the placement silently degraded to the model-turn fallback, still yielding 0 tokens. Try both "<end_of_turn>" (Gemma 2/3) and "<turn|>" (Gemma 4) so the audio block lands inside the last user turn. Verified end to end: a chat-completions request with an input_audio part to gemma-4-e2b now transcribes (0 -> 120 tokens) instead of emitting EOS at prefill.
Security and performance review (PR #440 @ 0ff694c)Scope: changed files only ( Result: no CRITICAL or HIGH findings. No code changes required. Request-path panic safety
Resource bounds
No injection / no prefix-cache regression
Performance
Observations (non-blocking, not introduced by this PR)
|
Add three unit tests to src/server/model_worker_tests.rs covering the function that had the bug (resolve_end_of_turn_token_id had no tests before this PR): - resolve_end_of_turn_id_handles_gemma4_turn_marker: a tokenizer with "<turn|>" = id 106 but no "<end_of_turn>" must return Some(106). This is the exact regression guard for the #437 bug, where the old implementation only looked up "<end_of_turn>" and returned None for Gemma 4 tokenizers. - resolve_end_of_turn_id_handles_gemma23_end_of_turn_marker: a tokenizer with "<end_of_turn>" = id 107 must return Some(107). - resolve_end_of_turn_id_returns_none_when_no_marker_present: a tokenizer with neither marker must return None so the caller keeps its own fallback. Each test builds a minimal HuggingFace BPE stub (via two helper functions, stub_eot_tokenizer and stub_tokenizer_no_eot) that mirrors the pattern of MlxcelTokenizer::stub_with_byte_fallback. Also remove the stale "not yet wired through the chat prompt" note from docs/supported-models.md and replace it with a brief description of what the fix delivers.
PR Finalization CompleteTests added (3 new, in
|
| Test | What it guards |
|---|---|
resolve_end_of_turn_id_handles_gemma4_turn_marker |
A tokenizer with `"<turn |
resolve_end_of_turn_id_handles_gemma23_end_of_turn_marker |
A tokenizer with "<end_of_turn>" = id 107 returns Some(107). Guards Gemma 2/3 behavior after the two-candidate lookup was introduced. |
resolve_end_of_turn_id_returns_none_when_no_marker_present |
A tokenizer with neither marker returns None, confirming the caller's own fallback is not bypassed for non-Gemma models. |
No integration test was added for expand_gemma4_audio_tokens_for_server with Gemma 4 token ids specifically: the five existing server_audio_* tests already use EOT = 106 and cover all placement tiers, so no genuinely new combination was available.
Documentation
docs/supported-models.md: removed the stale "Server-side audio input for this path is not yet wired through the chat prompt." sentence from the Gemma 4 VL audio bullet and replaced it with a brief note that POST /v1/chat/completions with input_audio now works and splices the audio block before <turn|>.
Lint / Format
cargo fmt --all applied and cargo fmt --all --check confirmed clean. cargo clippy --lib --features metal,accelerate -- -D warnings produced no new warnings (pre-existing out-of-scope warnings in the audio encoder are not touched).
Verification commands run
cargo fmt --all --check # clean
cargo test --lib --features metal,accelerate 'resolve_end_of_turn' # 3/3 pass
cargo test --lib --features metal,accelerate 'server::model_provider::model_worker::tests' # 20/20 pass
cargo test --lib --features metal,accelerate 'server_audio' # 5/5 pass
cargo clippy --lib --features metal,accelerate -- -D warnings # no new warnings
New commit: 097bbc405
The initial v0.3.3 cut (4d5f4fb) was never tagged, and 16 more commits landed on main afterward. Fold those fixes into the v0.3.3 CHANGELOG and debian/changelog entries and set the release date to 2026-06-25: - N-gram loop detection (#433) and Nemotron-H Nano Omni audio input (#443) - Prefill masks sized from the live window under --max-kv-size trim (#418, #420, #422, #431) - Double-transpose crash on mlx-community conv checkpoints (#429) - conv1d/conv2d and nemotron audio-encoder convs fallible at the FFI boundary (#434, #439) - Gemma 4 audio placed in the user turn (#438, #440) - mistral4 MLA backbone routing and 2D MoE token flattening (#423, #425) - Bump actions/checkout from 6 to 7 (#395)
Summary
Server-side counterpart to #436 (PR #438): Gemma 4 audio understanding over
POST /v1/chat/completionswith aninput_audiopart emitted 0 tokens. This places the audio block inside the user turn so the model answers instead of emitting an immediate EOS. The server resampling half of #436 is already merged in #438 and is not re-added here; this PR is only the<|audio|>placement.Root cause
The server renders chat messages text-only (
MessageContent::text()insrc/server/types/request.rskeeps onlyContentPart::Textand dropsinput_audio), so the Gemma 4 chat template never renders an<|audio|>marker into the user turn.expand_gemma4_audio_tokens_for_server(src/multimodal/vlm_runtime.rs) then found no placeholder and used its "insert before the final token" fallback, which lands theBOA + AUDIO*N + EOAblock inside the trailing model turn of the Gemma generation prompt (<start_of_turn>user\n…<end_of_turn>\n<start_of_turn>model\n), forcing an immediate EOS. This is the exact placement defect class that #436 Fix A resolved for the CLI.What changed
src/multimodal/vlm_runtime.rs:expand_gemma4_audio_tokens_for_servergains anend_of_turn_token_id: Option<i32>parameter and a three-tier placement policy. (1) If a rendered<|audio|>placeholder is present, wrap the first occurrence in place (unchanged behavior, used by the CLI-style path). (2) Otherwise, splice the audio block immediately before the last<end_of_turn>token, which in a Gemma generation prompt closes the latest user turn (the pending model turn has no closing marker yet), so the block lands inside the user turn. (3) Only when no<end_of_turn>marker exists (e.g.--no-chat-template) does it keep the legacy "before the final token" insertion as a last resort.src/server/model_worker.rs: newresolve_end_of_turn_token_idhelper resolves<end_of_turn>from the tokenizer (HFtoken_to_id, with a single-tokenencodefallback for SentencePiece/Tiktoken). The id is resolved once inprepare_request_vlm_embeddingsand threaded through both Gemma 4 audio paths,prepare_gemma4_audio_embeddings(Conformer VLM) andprepare_gemma4_unified_audio_embeddings(encoder-free Unified), which both feed it to the expansion. This covers the audio-only and image+audio combined call sites.src/multimodal/vlm_runtime_tests.rs: unit tests for in-place wrap, user-turn insertion (before the closing<end_of_turn>), multi-turn targeting of the last user turn, and both last-resort fallbacks (no<end_of_turn>id, and<end_of_turn>id known but absent from the prompt).Why images are the reference, and why non-audio is unchanged
Server image markers are placed at the token-stream level (
expand_gemma4_image_tokens/apply_image_token_blocks), not by routing image content parts through the chat template. Mirroring that, this fix stays at the token-stream level and does not touchsrc/server/chat_request.rs, so the prefix-cache-sensitive typed-vs-raw path selection (thepreserve_thinkingflow) is untouched. Only the two Gemma 4 audio embedding paths change, so text-only, image-only, and non-Gemma server requests keep a byte-identical prompt and prompt-cache prefix.Test plan
cargo test --lib --features metal,accelerate server_audio(5 new tests pass)cargo check --bin mlxcel --features metal,acceleratecargo fmt --all --checkcargo clippy --lib --testsclean for the three changed files (remaininguseless_vecfailures are pre-existing insrc/audio/nemotron_h_nano_omni/encoder.rs, untouched here, surfaced by a newer local clippy)mlxcel-serverwithgemma-4-e2b-it-4bit, POST a/v1/chat/completionsrequest with aninput_audiopart, confirm a coherent transcription instead of 0 tokensCloses #437