docs: finalize MLXCEL_FUSED_QK_NORM per-backend decision by inureyes · Pull Request #444 · lablup/mlxcel

inureyes · 2026-06-25T21:40:40Z

Summary

Finalizes the per-backend MLXCEL_FUSED_QK_NORM default decision by updating docs/environment-variables.md to reflect the settled result. No code changes; the default remains off on all backends.

The GB10/CUDA benchmark from PR #357 found the fused QK-norm path is slower on every wired model (qwen3-0.6b 0.96x, qwen3-8b ~1.0x, qwen3-30b-a3b 0.92x fused/graph), matching the M1 Ultra and M5 Max direction. There is no per-backend win, so the flag stays opt-in everywhere.

What changed

docs/environment-variables.md: fixes two stale claims in the MLXCEL_FUSED_QK_NORM table row. The byte-identical overclaim is replaced with the canonical RMS < 5e-3 precision statement plus the CUDA determinism nuance (fused is deterministic, graph is not, so fused divergence stays inside the graph baseline's envelope). The CUDA-pending rationale is replaced with the settled verdict (GB10 also slower; references docs/benchmark_results/fused-qk-norm-decode-gb10.md).
CHANGELOG.md: adds a ### Docs entry under [Unreleased] recording the decision finalization.

Test plan

Table row is a single line (no embedded newlines that would break the markdown table).
No .rs files touched.
Wording mirrors the canonical fused_qk_norm_enabled doc comment in layers.rs (lines 528-552) and the GB10 benchmark doc.

Closes #355

CUDA (GB10/SM 12.1) was the last open backend question after the M1 Ultra regression (#326, #341). The GB10 benchmark (#357) found the fused path is also slower there (qwen3-0.6b 0.96x, qwen3-8b ~1.0x, qwen3-30b-a3b 0.92x fused/graph), so there is no per-backend win on any measured backend (M1 Ultra, M5 Max, GB10/CUDA). The default stays off on all backends; no code changes. Two stale claims in the MLXCEL_FUSED_QK_NORM table row in docs/environment-variables.md are corrected: 1. The byte-identical overclaim is replaced with the canonical precision from the layers.rs doc comment: the fused path matches within RMS < 5e-3, but greedy temp-0 is not byte-identical over long generation. The determinism note is added: on CUDA the graph path is non-deterministic run-to-run from GPU FP-reduction order, while the fused path is deterministic, so its divergence from the graph stays inside the graph baseline's own envelope. 2. The CUDA-pending rationale is replaced with the settled verdict: GB10/CUDA was measured and is also slower, so the flag stays opt-in on every measured backend. CHANGELOG.md gets a Docs entry under [Unreleased] recording the decision finalization.

inureyes · 2026-06-25T21:46:59Z

Implementation Review Summary

Verdict: APPROVED. Docs-only finalization of issue #355. Every claim in the edited row cross-checks against both canonical sources, the two stale claims are fully removed, markdown is intact, and no code/default changed.

Intent

Record the settled per-backend MLXCEL_FUSED_QK_NORM decision in the user-facing docs: GB10/CUDA was measured and is also slower, so the fused QK-norm decode path stays opt-in (default off) on every backend. No code change.

Findings Addressed

None. No CRITICAL/HIGH/MEDIUM accuracy, consistency, or markdown defects found.

Remaining Items

None.

Verification

Doc numbers and claims match layers.rs fused_qk_norm_enabled comment (lines 528-552) and docs/benchmark_results/fused-qk-norm-decode-gb10.md: RMS < 5e-3 parity, not byte-identical over long greedy generation, CUDA graph-path run-to-run non-determinism with the fused path staying inside that envelope, M1 Ultra 1-3.4% slower (275 vs 284 / 82.3 vs 83.2), GB10 0.96x / ~1.0x / 0.92x.
Stale claims removed: the "byte-identical to the graph path" overclaim and the "gated off pending a per-backend win (for example CUDA)" rationale are both gone, replaced by the resolved verdict.
Markdown table row is a single physical line with 4 columns (5 pipes), delimiters intact.
No .rs / code / default change. Acceptance criterion (b) satisfied: CUDA lost, so the default correctly stays off; Metal unchanged.
CHANGELOG entry is a single non-wrapped line under [Unreleased] -> ### Docs (an established heading, used 14x), references perf(models): re-bench fused QKV+RMSNorm+RoPE on CUDA and decide the MLXCEL_FUSED_QK_NORM default #355.
Conventions: docs: subject 63 chars, no em dashes, no AI attribution, Closes #355 present in the PR body.

inureyes added status:review Under review type:docs Documentation improvements or additions priority:medium Medium priority area:docs User and developer documentation labels Jun 25, 2026

inureyes added status:done Completed and removed status:review Under review labels Jun 25, 2026

inureyes merged commit 28298d6 into main Jun 25, 2026
5 checks passed

inureyes deleted the update/issue-355-finalize-qk-norm-decision branch June 25, 2026 21:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

docs: finalize MLXCEL_FUSED_QK_NORM per-backend decision#444

docs: finalize MLXCEL_FUSED_QK_NORM per-backend decision#444
inureyes merged 1 commit into
mainfrom
update/issue-355-finalize-qk-norm-decision

inureyes commented Jun 25, 2026

Uh oh!

inureyes commented Jun 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

inureyes commented Jun 25, 2026

Summary

What changed

Test plan

Uh oh!

inureyes commented Jun 25, 2026

Implementation Review Summary

Intent

Findings Addressed

Remaining Items

Verification

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant