fix: place Gemma 4 server <|audio|> block inside the user turn by inureyes · Pull Request #440 · lablup/mlxcel

inureyes · 2026-06-25T15:30:47Z

Summary

Server-side counterpart to #436 (PR #438): Gemma 4 audio understanding over POST /v1/chat/completions with an input_audio part emitted 0 tokens. This places the audio block inside the user turn so the model answers instead of emitting an immediate EOS. The server resampling half of #436 is already merged in #438 and is not re-added here; this PR is only the <|audio|> placement.

Root cause

The server renders chat messages text-only (MessageContent::text() in src/server/types/request.rs keeps only ContentPart::Text and drops input_audio), so the Gemma 4 chat template never renders an <|audio|> marker into the user turn. expand_gemma4_audio_tokens_for_server (src/multimodal/vlm_runtime.rs) then found no placeholder and used its "insert before the final token" fallback, which lands the BOA + AUDIO*N + EOA block inside the trailing model turn of the Gemma generation prompt (<start_of_turn>user\n…<end_of_turn>\n<start_of_turn>model\n), forcing an immediate EOS. This is the exact placement defect class that #436 Fix A resolved for the CLI.

What changed

src/multimodal/vlm_runtime.rs: expand_gemma4_audio_tokens_for_server gains an end_of_turn_token_id: Option<i32> parameter and a three-tier placement policy. (1) If a rendered <|audio|> placeholder is present, wrap the first occurrence in place (unchanged behavior, used by the CLI-style path). (2) Otherwise, splice the audio block immediately before the last <end_of_turn> token, which in a Gemma generation prompt closes the latest user turn (the pending model turn has no closing marker yet), so the block lands inside the user turn. (3) Only when no <end_of_turn> marker exists (e.g. --no-chat-template) does it keep the legacy "before the final token" insertion as a last resort.
src/server/model_worker.rs: new resolve_end_of_turn_token_id helper resolves <end_of_turn> from the tokenizer (HF token_to_id, with a single-token encode fallback for SentencePiece/Tiktoken). The id is resolved once in prepare_request_vlm_embeddings and threaded through both Gemma 4 audio paths, prepare_gemma4_audio_embeddings (Conformer VLM) and prepare_gemma4_unified_audio_embeddings (encoder-free Unified), which both feed it to the expansion. This covers the audio-only and image+audio combined call sites.
src/multimodal/vlm_runtime_tests.rs: unit tests for in-place wrap, user-turn insertion (before the closing <end_of_turn>), multi-turn targeting of the last user turn, and both last-resort fallbacks (no <end_of_turn> id, and <end_of_turn> id known but absent from the prompt).

Why images are the reference, and why non-audio is unchanged

Server image markers are placed at the token-stream level (expand_gemma4_image_tokens / apply_image_token_blocks), not by routing image content parts through the chat template. Mirroring that, this fix stays at the token-stream level and does not touch src/server/chat_request.rs, so the prefix-cache-sensitive typed-vs-raw path selection (the preserve_thinking flow) is untouched. Only the two Gemma 4 audio embedding paths change, so text-only, image-only, and non-Gemma server requests keep a byte-identical prompt and prompt-cache prefix.

Test plan

cargo test --lib --features metal,accelerate server_audio (5 new tests pass)
cargo check --bin mlxcel --features metal,accelerate
cargo fmt --all --check
cargo clippy --lib --tests clean for the three changed files (remaining useless_vec failures are pre-existing in src/audio/nemotron_h_nano_omni/encoder.rs, untouched here, surfaced by a newer local clippy)
Real-model server E2E (run by the orchestrator): start mlxcel-server with gemma-4-e2b-it-4bit, POST a /v1/chat/completions request with an input_audio part, confirm a coherent transcription instead of 0 tokens

Closes #437

Gemma 4 audio understanding over the server (POST /v1/chat/completions with an input_audio part) emitted 0 tokens. The server renders chat messages text-only (MessageContent::text() drops input_audio content parts), so the chat template never produced an <|audio|> marker; expand_gemma4_audio_tokens_for_server then found no placeholder and used its "insert before the final token" fallback, which lands the BOA + AUDIO*N + EOA block inside the trailing model turn of the Gemma generation prompt and forces an immediate EOS. This is the server half of the placement defect that #436 fixed for the CLI. expand_gemma4_audio_tokens_for_server now places the audio block at the token-stream level, the same layer the image path uses via expand_gemma4_image_tokens / apply_image_token_blocks, so the prefix-cache-sensitive render path in chat_request.rs is untouched. When a rendered <|audio|> placeholder is present it is wrapped in place; otherwise the block is spliced in immediately before the last <end_of_turn> token, which closes the latest user turn in a Gemma generation prompt, so the audio stays inside the user turn. The model worker resolves the <end_of_turn> id from the tokenizer and threads it through both Gemma 4 audio paths: prepare_gemma4_audio_embeddings (Conformer VLM) and prepare_gemma4_unified_audio_embeddings (encoder-free Unified). The legacy before-the-final-token insertion stays only as a last resort for prompts with no <end_of_turn> marker. Non-audio requests are byte-identical: only the two audio embedding paths change and chat_request.rs is not touched, so text-only, image-only, and non-Gemma server requests keep their exact prompt and prompt-cache prefix behavior. New unit tests cover in-place wrap, user-turn insertion, multi-turn last-user-turn targeting, and both last-resort fallbacks.

…user turn The server places the Gemma 4 audio block before the last end-of-turn token (issue #437), but `resolve_end_of_turn_token_id` only looked up "<end_of_turn>". Gemma 4 renamed the marker to "<turn|>" (id 106; start-of-turn is "<|turn>"), so the lookup returned None and the placement silently degraded to the model-turn fallback, still yielding 0 tokens. Try both "<end_of_turn>" (Gemma 2/3) and "<turn|>" (Gemma 4) so the audio block lands inside the last user turn. Verified end to end: a chat-completions request with an input_audio part to gemma-4-e2b now transcribes (0 -> 120 tokens) instead of emitting EOS at prefill.

inureyes · 2026-06-25T16:18:25Z

Security and performance review (PR #440 @ `0ff694c`)

Scope: changed files only (src/multimodal/vlm_runtime.rs, src/server/model_worker.rs, src/multimodal/vlm_runtime_tests.rs).

Result: no CRITICAL or HIGH findings. No code changes required.

Request-path panic safety

expand_gemma4_audio_tokens_for_server is total over every input reachable on the request path. Tier 2 splice(pos..pos, block) uses a pos from rposition, which is always < len, so the empty insertion range is in bounds and cannot panic. Tier 3 uses Vec::pop (returns None on empty) and re-pushes only on Some, so an empty prompt, a prompt with no end-of-turn token, and a prompt that is entirely end-of-turn tokens all terminate without panicking. None / not-found degrades to tier 3 as intended.
resolve_end_of_turn_token_id performs only fallible lookups (token_to_id, then encode(...) guarded by is_ok and len() == 1). No unwrap, index, or slice that can panic.

Resource bounds

The Conformer path caps the block at 750 via compute_audio_num_tokens. The WAV parser rejects 0 sample-rate and 0 channels and caps decoded data at 500 MB, so a crafted odd-rate or oversized payload cannot reach an unbounded count or a divide-by-zero. The base64 envelope is bounded upstream. The splice allocates one block of the same size the previous fallback did, so no per-request growth was introduced.

No injection / no prefix-cache regression

Placement is purely token-id level (no string templating). resolve_end_of_turn_token_id runs only inside if !audio.is_empty(), and the expansion runs only from the two Gemma 4 audio paths, so text-only, image-only, and non-Gemma requests keep byte-identical prompt_tokens and an unchanged prompt-cache prefix. chat_request.rs and the preserve_thinking path selection are untouched.

Performance

The added rposition scan and splice run once per audio request at prefill, both O(n) in prompt length, with no per-decode-token cost and no extra allocation on the text or image hot paths.

Observations (non-blocking, not introduced by this PR)

The Unified path sizes the block from audio_input.num_frames (samples / audio_samples_per_token) rather than the 750 cap used by the Conformer path. It stays bounded by the 500 MB WAV limit and the upstream base64 bound, so it is not an unbounded-resource issue, but it is the larger of the two block sizes a long clip can produce. Pre-existing, out of scope here.
resolve_end_of_turn_token_id returns the first matching candidate (<end_of_turn> before <turn|>). If a future Gemma 4 vocab carried both literals while the template rendered only <turn|>, first-match could select a stale id. Validated working end-to-end on gemma-4-e2b (0 to 120 tokens), so not a current defect.

Add three unit tests to src/server/model_worker_tests.rs covering the function that had the bug (resolve_end_of_turn_token_id had no tests before this PR): - resolve_end_of_turn_id_handles_gemma4_turn_marker: a tokenizer with "<turn|>" = id 106 but no "<end_of_turn>" must return Some(106). This is the exact regression guard for the #437 bug, where the old implementation only looked up "<end_of_turn>" and returned None for Gemma 4 tokenizers. - resolve_end_of_turn_id_handles_gemma23_end_of_turn_marker: a tokenizer with "<end_of_turn>" = id 107 must return Some(107). - resolve_end_of_turn_id_returns_none_when_no_marker_present: a tokenizer with neither marker must return None so the caller keeps its own fallback. Each test builds a minimal HuggingFace BPE stub (via two helper functions, stub_eot_tokenizer and stub_tokenizer_no_eot) that mirrors the pattern of MlxcelTokenizer::stub_with_byte_fallback. Also remove the stale "not yet wired through the chat prompt" note from docs/supported-models.md and replace it with a brief description of what the fix delivers.

inureyes · 2026-06-25T16:26:29Z

PR Finalization Complete

Tests added (3 new, in `src/server/model_worker_tests.rs`)

resolve_end_of_turn_token_id had zero tests before this PR. Three tests were added, backed by two helper functions (stub_eot_tokenizer, stub_tokenizer_no_eot) that build minimal HuggingFace BPE stubs from inline JSON, following the same pattern as MlxcelTokenizer::stub_with_byte_fallback.

Test	What it guards
`resolve_end_of_turn_id_handles_gemma4_turn_marker`	A tokenizer with `"<turn
`resolve_end_of_turn_id_handles_gemma23_end_of_turn_marker`	A tokenizer with `"<end_of_turn>"` = id 107 returns `Some(107)`. Guards Gemma 2/3 behavior after the two-candidate lookup was introduced.
`resolve_end_of_turn_id_returns_none_when_no_marker_present`	A tokenizer with neither marker returns `None`, confirming the caller's own fallback is not bypassed for non-Gemma models.

No integration test was added for expand_gemma4_audio_tokens_for_server with Gemma 4 token ids specifically: the five existing server_audio_* tests already use EOT = 106 and cover all placement tiers, so no genuinely new combination was available.

Documentation

docs/supported-models.md: removed the stale "Server-side audio input for this path is not yet wired through the chat prompt." sentence from the Gemma 4 VL audio bullet and replaced it with a brief note that POST /v1/chat/completions with input_audio now works and splices the audio block before <turn|>.

Lint / Format

cargo fmt --all applied and cargo fmt --all --check confirmed clean. cargo clippy --lib --features metal,accelerate -- -D warnings produced no new warnings (pre-existing out-of-scope warnings in the audio encoder are not touched).

Verification commands run

cargo fmt --all --check                                         # clean
cargo test --lib --features metal,accelerate 'resolve_end_of_turn'  # 3/3 pass
cargo test --lib --features metal,accelerate 'server::model_provider::model_worker::tests'  # 20/20 pass
cargo test --lib --features metal,accelerate 'server_audio'    # 5/5 pass
cargo clippy --lib --features metal,accelerate -- -D warnings  # no new warnings

New commit: 097bbc405

The initial v0.3.3 cut (4d5f4fb) was never tagged, and 16 more commits landed on main afterward. Fold those fixes into the v0.3.3 CHANGELOG and debian/changelog entries and set the release date to 2026-06-25: - N-gram loop detection (#433) and Nemotron-H Nano Omni audio input (#443) - Prefill masks sized from the live window under --max-kv-size trim (#418, #420, #422, #431) - Double-transpose crash on mlx-community conv checkpoints (#429) - conv1d/conv2d and nemotron audio-encoder convs fallible at the FFI boundary (#434, #439) - Gemma 4 audio placed in the user turn (#438, #440) - mistral4 MLA backbone routing and 2D MoE token flattening (#423, #425) - Bump actions/checkout from 6 to 7 (#395)

inureyes added status:review Under review type:bug Bug fixes, error corrections, or issue resolutions priority:medium Medium priority area:inference Generation, sampling, decoding (incl. speculative, DRY) labels Jun 25, 2026

inureyes added status:done Completed and removed status:review Under review labels Jun 25, 2026

inureyes merged commit 6776688 into main Jun 25, 2026
5 checks passed

inureyes deleted the fix/issue-437-gemma4-server-audio-placement branch June 25, 2026 16:27

This was referenced Jun 25, 2026

feat(server): wire Nemotron-H Nano Omni input_audio into the chat-completions audio dispatch #442

Closed

feat: wire Nemotron-H Nano Omni input_audio into server chat audio #443

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: place Gemma 4 server <|audio|> block inside the user turn#440

fix: place Gemma 4 server <|audio|> block inside the user turn#440
inureyes merged 3 commits into
mainfrom
fix/issue-437-gemma4-server-audio-placement

inureyes commented Jun 25, 2026

Uh oh!

inureyes commented Jun 25, 2026

Uh oh!

inureyes commented Jun 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

inureyes commented Jun 25, 2026

Summary

Root cause

What changed

Why images are the reference, and why non-audio is unchanged

Test plan

Uh oh!

inureyes commented Jun 25, 2026

Security and performance review (PR #440 @ 0ff694c)

Request-path panic safety

Resource bounds

No injection / no prefix-cache regression

Performance

Observations (non-blocking, not introduced by this PR)

Uh oh!

inureyes commented Jun 25, 2026

PR Finalization Complete

Tests added (3 new, in src/server/model_worker_tests.rs)

Documentation

Lint / Format

Verification commands run

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Security and performance review (PR #440 @ `0ff694c`)

Tests added (3 new, in `src/server/model_worker_tests.rs`)