Vendor-neutral reference: NVFP4 vs OCP MXFP4 block-structure parameters (request for confirmation)

Hi TransformerEngine maintainers,

I am compiling a vendor-neutral numeric format catalog (84 formats, 13
families) with bit-exact conformance vectors. The catalog is open and
lives at https://github.com/gHashTag/t27. NVFP4 is on the near-term
roadmap (Track 2) but I would like to ground its row entry in
parameters confirmed by the upstream implementer rather than guessed
from blog posts. This issue is an information request, not a bug
report.

## What I have so far

Based on the public NVIDIA developer blog
(https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/)
and code references in TransformerEngine, I have populated the
following parameter table for NVFP4 alongside its closest OCP MX
counterpart (MXFP4):

| Parameter                    | OCP MX MXFP4         | NVIDIA NVFP4         |
| ---------------------------- | -------------------- | -------------------- |
| Element layout               | S1E2M1 (4 bits)      | S1E2M1 (4 bits)      |
| Block size                   | 32 elements          | 16 elements          |
| Scale format                 | E8M0 (8-bit)         | FP8 E4M3 (8-bit)     |
| Scale exponent bits          | 8 (pure exponent)    | 4                    |
| Scale mantissa bits          | 0                    | 3                    |
| Scale dynamic range          | 2^-127 to 2^127      | 2^-9 to 448          |
| Scale steps per power of two | 1                    | 8 (3-bit mantissa)   |
| Bits/element including scale | 4 + 8/32 = 4.25      | 4 + 8/16 = 4.50      |
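
For reference, here is a minimal Python sketch of how I currently decode the three encodings in the table. The function names are my own (hypothetical), not TransformerEngine APIs, and the exponent biases (1 for E2M1, 7 for E4M3, 127 for E8M0) are my reading of the OCP specifications, so please flag anything that differs for NVFP4.

```python
import math

def decode_e2m1(code: int) -> float:
    """Decode a 4-bit S1E2M1 code (assumed bias 1, no Inf/NaN encodings)."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:
        # Subnormal: magnitude is 0 or 0.5
        return sign * man * 0.5
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)

def decode_e8m0(code: int) -> float:
    """Decode an E8M0 scale (assumed bias 127, 0xFF reserved for NaN)."""
    if code == 0xFF:
        return math.nan
    return 2.0 ** (code - 127)

def decode_e4m3fn(code: int) -> float:
    """Decode an FP8 E4M3 'fn' byte (assumed bias 7, no Inf, NaN at S.1111.111)."""
    sign = -1.0 if code & 0x80 else 1.0
    exp = (code >> 3) & 0xF
    man = code & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan
    if exp == 0:
        # Subnormal: (man / 8) * 2^-6 == man * 2^-9
        return sign * man * 2.0 ** -9
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

# Non-negative E2M1 magnitudes, codes 0..7
e2m1_magnitudes = [decode_e2m1(c) for c in range(8)]
```

Under these assumptions the E4M3 scale tops out at code `0x7E` (448) and bottoms out at code `0x01` (2^-9), which is where the range row in the table comes from.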

The element layout S1E2M1 is bit-identical between the two formats;
they diverge at the block-and-scale level. Three structural
consequences follow:

1. NVFP4 resolves the per-block scale 8x more finely than MXFP4
   within its representable range: the 3-bit mantissa of the FP8 E4M3
   scale gives 8 scale values per power of two, where E8M0 gives 1.
2. NVFP4 cannot represent per-block scales outside FP8 E4M3 range
   (saturates at 448, underflows below 2^-9) without higher-level
   rescaling; MXFP4 spans a much wider scale range via E8M0.
3. Effective bits per element differ: 4.25 (MXFP4) vs 4.50 (NVFP4).
   NVFP4 therefore stores about 5.9% more bits per element
   (4.50 / 4.25 ≈ 1.059), which any compression-ratio comparison
   should account for.
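
The arithmetic behind points 1 and 3 can be checked with a short sketch. The helper names are hypothetical, and the calculation counts only the per-block scale byte; any additional per-tensor scaling factor is not included.

```python
def bits_per_element(elem_bits: int, scale_bits: int, block_size: int) -> float:
    """Element bits plus the per-block scale amortized over the block."""
    return elem_bits + scale_bits / block_size

def scale_steps_per_octave(mantissa_bits: int) -> int:
    """Number of distinct normal scale values in [2^k, 2^(k+1))."""
    return 2 ** mantissa_bits

mxfp4_bpe = bits_per_element(elem_bits=4, scale_bits=8, block_size=32)
nvfp4_bpe = bits_per_element(elem_bits=4, scale_bits=8, block_size=16)

# Relative storage overhead of NVFP4 over MXFP4
overhead = nvfp4_bpe / mxfp4_bpe - 1.0

e8m0_steps = scale_steps_per_octave(0)   # pure exponent
e4m3_steps = scale_steps_per_octave(3)   # 3-bit mantissa

print(f"MXFP4: {mxfp4_bpe} bits/elem, NVFP4: {nvfp4_bpe} bits/elem")
print(f"NVFP4 overhead: {overhead:.1%}")
print(f"Scale steps per octave: E8M0={e8m0_steps}, E4M3={e4m3_steps}")
```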

## Specific requests

If a maintainer could confirm or correct any of the following, that
would close out the row and let me publish a sister conformance pack
to the existing MXFP4 pack:

(a) **Block size confirmation.** Is 16 elements per block the only
    supported block size, or is it a default with alternatives?

(b) **Scale format confirmation.** Is FP8 E4M3 in its `fn` variant
    (finite-only: no infinities, NaN only at the all-ones exponent
    and mantissa pattern, maximum 448) the canonical scale
    encoding? Are there variants that use FP8 E5M2 instead?

(c) **Nibble packing order.** When 16 four-bit elements are packed
    into 64 bits, does the first element of each pair occupy the
    least-significant or the most-significant nibble of its byte?

(d) **Reference vectors.** Does TransformerEngine ship any unit
    tests with documented input/output bit-patterns that I can use
    as ground-truth boundary vectors (NaN, +/-Inf-equivalent
    saturation, smallest normal, smallest subnormal, denormal-block
    behavior)?

(e) **Round-trip behavior on out-of-range scale.** When a tensor's
    natural per-block scale would land outside the FP8 E4M3
    representable range, is the recommended behavior (i) clamp the
    scale and saturate the elements, (ii) error out, or (iii)
    something else?
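
To make (c) and (e) concrete, here is a rough sketch of the two nibble orders I am choosing between and of option (i), clamping the scale. Everything here is an assumption on my side: the function names are hypothetical, the scale rule `amax / 6` (block maximum divided by the largest E2M1 magnitude) is a common convention rather than something I have confirmed in TransformerEngine, and the clamp does not round to the E4M3 grid.

```python
E2M1_MAX = 6.0                # largest S1E2M1 magnitude
E4M3_MAX = 448.0              # largest finite FP8 E4M3 (fn) value
E4M3_MIN = 2.0 ** -9          # smallest positive E4M3 subnormal

def pack_nibbles(codes: list[int], low_first: bool = True) -> bytes:
    """Pack 4-bit codes two per byte; low_first puts the first code in bits 0..3."""
    if len(codes) % 2:
        raise ValueError("need an even number of 4-bit codes")
    out = bytearray()
    for i in range(0, len(codes), 2):
        a, b = codes[i] & 0xF, codes[i + 1] & 0xF
        out.append(a | (b << 4) if low_first else (a << 4) | b)
    return bytes(out)

def clamp_block_scale(amax: float) -> float:
    """Option (i): ideal scale amax / 6, clamped into the E4M3 range (unrounded)."""
    ideal = amax / E2M1_MAX
    return min(max(ideal, E4M3_MIN), E4M3_MAX)

codes = list(range(16))                         # one 16-element block
low_first_hex = pack_nibbles(codes, True).hex()
high_first_hex = pack_nibbles(codes, False).hex()
print(low_first_hex)
print(high_first_hex)
```

A block whose ideal scale exceeds 448 would then have its largest elements saturate at ±6 after division by the clamped scale, which is the behavior I would like confirmed or corrected.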

## What I will do with confirmed answers

Open a small PR (catalog row + conformance pack) on
`gHashTag/t27`, with full attribution to this issue and a
cross-link back to the relevant TransformerEngine references. The
pack will follow the same shared row schema as the existing six
packs (GF16, MXFP4 element, BF16, FP8 E4M3, FP8 E5M2, E8M0 block
scale), with honest `abs_error` reporting (no overflow-to-Inf
masked as a match).

Background and methodology are documented in a 16-page paper
(Trinity S^3 AI, 2026-06-08, file
`paper3-methodology-2026-06-08-v3-trinity.pdf`, SHA-256
`f31f5dd243afc7b2ba4a423859a1e1dc67036c3a93affab30acc8d02f0a15eef`)
that I plan to upload to arXiv this week.

Async only -- no rush, no specific deadline. If the relevant
maintainer is on vacation or sprint-locked, a one-line "ping us back
in N weeks" is a fine answer.

Thank you for the open release of NVFP4 documentation and for the
maintained NVFP4 reference implementation in TransformerEngine.

-- Dmitrii Vasilev
   Trinity S^3 AI
   admin@t27.ai
   GitHub: @gHashTag

