Skip to content

docs: finalize MLXCEL_FUSED_QK_NORM per-backend decision#444

Merged
inureyes merged 1 commit into
mainfrom
update/issue-355-finalize-qk-norm-decision
Jun 25, 2026
Merged

docs: finalize MLXCEL_FUSED_QK_NORM per-backend decision#444
inureyes merged 1 commit into
mainfrom
update/issue-355-finalize-qk-norm-decision

Conversation

@inureyes

Copy link
Copy Markdown
Member

Summary

Finalizes the per-backend MLXCEL_FUSED_QK_NORM default decision by updating docs/environment-variables.md to reflect the settled result. No code changes; the default remains off on all backends.

The GB10/CUDA benchmark from PR #357 found the fused QK-norm path is slower on every wired model (qwen3-0.6b 0.96x, qwen3-8b ~1.0x, qwen3-30b-a3b 0.92x fused/graph), matching the M1 Ultra and M5 Max direction. There is no per-backend win, so the flag stays opt-in everywhere.

What changed

  • docs/environment-variables.md: fixes two stale claims in the MLXCEL_FUSED_QK_NORM table row. The byte-identical overclaim is replaced with the canonical RMS < 5e-3 precision statement plus the CUDA determinism nuance (fused is deterministic, graph is not, so fused divergence stays inside the graph baseline's envelope). The CUDA-pending rationale is replaced with the settled verdict (GB10 also slower; references docs/benchmark_results/fused-qk-norm-decode-gb10.md).
  • CHANGELOG.md: adds a ### Docs entry under [Unreleased] recording the decision finalization.

Test plan

  • Table row is a single line (no embedded newlines that would break the markdown table).
  • No .rs files touched.
  • Wording mirrors the canonical fused_qk_norm_enabled doc comment in layers.rs (lines 528-552) and the GB10 benchmark doc.

Closes #355

CUDA (GB10/SM 12.1) was the last open backend question after the M1 Ultra regression (#326, #341). The GB10 benchmark (#357) found the fused path is also slower there (qwen3-0.6b 0.96x, qwen3-8b ~1.0x, qwen3-30b-a3b 0.92x fused/graph), so there is no per-backend win on any measured backend (M1 Ultra, M5 Max, GB10/CUDA). The default stays off on all backends; no code changes.

Two stale claims in the MLXCEL_FUSED_QK_NORM table row in docs/environment-variables.md are corrected:

1. The byte-identical overclaim is replaced with the canonical precision from the layers.rs doc comment: the fused path matches within RMS < 5e-3, but greedy temp-0 is not byte-identical over long generation. The determinism note is added: on CUDA the graph path is non-deterministic run-to-run from GPU FP-reduction order, while the fused path is deterministic, so its divergence from the graph stays inside the graph baseline's own envelope.

2. The CUDA-pending rationale is replaced with the settled verdict: GB10/CUDA was measured and is also slower, so the flag stays opt-in on every measured backend.

CHANGELOG.md gets a Docs entry under [Unreleased] recording the decision finalization.
@inureyes inureyes added status:review Under review type:docs Documentation improvements or additions priority:medium Medium priority area:docs User and developer documentation labels Jun 25, 2026
@inureyes

Copy link
Copy Markdown
Member Author

Implementation Review Summary

Verdict: APPROVED. Docs-only finalization of issue #355. Every claim in the edited row cross-checks against both canonical sources, the two stale claims are fully removed, markdown is intact, and no code/default changed.

Intent

Record the settled per-backend MLXCEL_FUSED_QK_NORM decision in the user-facing docs: GB10/CUDA was measured and is also slower, so the fused QK-norm decode path stays opt-in (default off) on every backend. No code change.

Findings Addressed

  • None. No CRITICAL/HIGH/MEDIUM accuracy, consistency, or markdown defects found.

Remaining Items

  • None.

Verification

  • Doc numbers and claims match layers.rs fused_qk_norm_enabled comment (lines 528-552) and docs/benchmark_results/fused-qk-norm-decode-gb10.md: RMS < 5e-3 parity, not byte-identical over long greedy generation, CUDA graph-path run-to-run non-determinism with the fused path staying inside that envelope, M1 Ultra 1-3.4% slower (275 vs 284 / 82.3 vs 83.2), GB10 0.96x / ~1.0x / 0.92x.
  • Stale claims removed: the "byte-identical to the graph path" overclaim and the "gated off pending a per-backend win (for example CUDA)" rationale are both gone, replaced by the resolved verdict.
  • Markdown table row is a single physical line with 4 columns (5 pipes), delimiters intact.
  • No .rs / code / default change. Acceptance criterion (b) satisfied: CUDA lost, so the default correctly stays off; Metal unchanged.
  • CHANGELOG entry is a single non-wrapped line under [Unreleased] -> ### Docs (an established heading, used 14x), references perf(models): re-bench fused QKV+RMSNorm+RoPE on CUDA and decide the MLXCEL_FUSED_QK_NORM default #355.
  • Conventions: docs: subject 63 chars, no em dashes, no AI attribution, Closes #355 present in the PR body.

@inureyes inureyes added status:done Completed and removed status:review Under review labels Jun 25, 2026
@inureyes inureyes merged commit 28298d6 into main Jun 25, 2026
5 checks passed
@inureyes inureyes deleted the update/issue-355-finalize-qk-norm-decision branch June 25, 2026 21:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:docs User and developer documentation priority:medium Medium priority status:done Completed type:docs Documentation improvements or additions

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf(models): re-bench fused QKV+RMSNorm+RoPE on CUDA and decide the MLXCEL_FUSED_QK_NORM default

1 participant