docs: finalize MLXCEL_FUSED_QK_NORM per-backend decision#444
Merged
Conversation
CUDA (GB10/SM 12.1) was the last open backend question after the M1 Ultra regression (#326, #341). The GB10 benchmark (#357) found the fused path is also slower there (qwen3-0.6b 0.96x, qwen3-8b ~1.0x, qwen3-30b-a3b 0.92x fused/graph), so there is no per-backend win on any measured backend (M1 Ultra, M5 Max, GB10/CUDA). The default stays off on all backends; no code changes. Two stale claims in the MLXCEL_FUSED_QK_NORM table row in docs/environment-variables.md are corrected: 1. The byte-identical overclaim is replaced with the canonical precision from the layers.rs doc comment: the fused path matches within RMS < 5e-3, but greedy temp-0 is not byte-identical over long generation. The determinism note is added: on CUDA the graph path is non-deterministic run-to-run from GPU FP-reduction order, while the fused path is deterministic, so its divergence from the graph stays inside the graph baseline's own envelope. 2. The CUDA-pending rationale is replaced with the settled verdict: GB10/CUDA was measured and is also slower, so the flag stays opt-in on every measured backend. CHANGELOG.md gets a Docs entry under [Unreleased] recording the decision finalization.
Member
Author
Implementation Review SummaryVerdict: APPROVED. Docs-only finalization of issue #355. Every claim in the edited row cross-checks against both canonical sources, the two stale claims are fully removed, markdown is intact, and no code/default changed. Intent
Findings Addressed
Remaining Items
Verification
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Finalizes the per-backend
MLXCEL_FUSED_QK_NORMdefault decision by updatingdocs/environment-variables.mdto reflect the settled result. No code changes; the default remains off on all backends.The GB10/CUDA benchmark from PR #357 found the fused QK-norm path is slower on every wired model (qwen3-0.6b 0.96x, qwen3-8b ~1.0x, qwen3-30b-a3b 0.92x fused/graph), matching the M1 Ultra and M5 Max direction. There is no per-backend win, so the flag stays opt-in everywhere.
What changed
docs/environment-variables.md: fixes two stale claims in theMLXCEL_FUSED_QK_NORMtable row. The byte-identical overclaim is replaced with the canonical RMS < 5e-3 precision statement plus the CUDA determinism nuance (fused is deterministic, graph is not, so fused divergence stays inside the graph baseline's envelope). The CUDA-pending rationale is replaced with the settled verdict (GB10 also slower; referencesdocs/benchmark_results/fused-qk-norm-decode-gb10.md).CHANGELOG.md: adds a### Docsentry under[Unreleased]recording the decision finalization.Test plan
.rsfiles touched.fused_qk_norm_enableddoc comment inlayers.rs(lines 528-552) and the GB10 benchmark doc.Closes #355