Hi TransformerEngine maintainers,
I am compiling a vendor-neutral numeric format catalog (84 formats, 13
families) with bit-exact conformance vectors. The catalog is open and
lives at https://github.com/gHashTag/t27. NVFP4 is on the near-term
roadmap (Track 2) but I would like to ground its row entry in
parameters confirmed by the upstream implementer rather than guessed
from blog posts. This issue is an information request, not a bug
report.
What I have so far
Based on the public NVIDIA developer blog
(https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/)
and code references in TransformerEngine, I have populated the
following parameter table for NVFP4 alongside its closest OCP MX
counterpart (MXFP4):
| Parameter | OCP MX MXFP4 | NVIDIA NVFP4 |
| --- | --- | --- |
| Element layout | S1E2M1 (4 bits) | S1E2M1 (4 bits) |
| Block size | 32 elements | 16 elements |
| Scale format | E8M0 (8-bit) | FP8 E4M3 (8-bit) |
| Scale exponent bits | 8 (pure exponent) | 4 |
| Scale mantissa bits | 0 | 3 |
| Scale dynamic range | 2^-127 to 2^127 | 2^-9 (smallest subnormal) to 448 |
| Scale steps per binade (power-of-two interval) | 1 (power-of-two only) | 8 (3-bit mantissa) |
| Bits/element including scale | 4 + 8/32 = 4.25 | 4 + 8/16 = 4.50 |
The element layout S1E2M1 is bit-identical between the two formats;
they diverge at the block-and-scale level. Three structural
consequences follow:
- NVFP4's block scale is 8x finer than MXFP4's within its representable
range: the 3-bit mantissa of the FP8 E4M3 scale gives 8 scale values per
power-of-two interval, where E8M0 gives 1.
- NVFP4 cannot represent per-block scales outside FP8 E4M3 range
(saturates at 448, underflows below ~2^-9) without higher-level
rescaling; MXFP4 spans a much wider scale range via E8M0.
- Effective bits per element differ: 4.25 (MXFP4) vs 4.50 (NVFP4),
a 5.9% overhead delta in NVFP4 that any compression-ratio
comparison should account for.
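
For reviewers who want to check the derived figures, here is a minimal Python sketch that recomputes them from the table parameters. The function names are my own and hypothetical, the E4M3 limits assume the finite-only variant (bias 7, mantissa pattern 111 at the top exponent reserved for NaN), and the bits-per-element figure ignores any per-tensor scale.

```python
from fractions import Fraction


def bits_per_element(element_bits, scale_bits, block_size):
    """Storage cost per element when one scale is shared by a block."""
    return Fraction(element_bits) + Fraction(scale_bits, block_size)


def e4m3fn_range():
    """Smallest positive subnormal and largest finite value of FP8 E4M3 (finite-only)."""
    bias = 7
    mantissa_bits = 3
    # Subnormals use exponent 1 - bias = -6; the smallest has mantissa 001.
    min_subnormal = Fraction(1, 2 ** (bias - 1 + mantissa_bits))
    # Largest finite value: exponent field 1111, mantissa 110 (111 is NaN).
    max_finite = Fraction(2 ** (15 - bias)) * (1 + Fraction(6, 8))
    return min_subnormal, max_finite


mxfp4 = bits_per_element(4, 8, 32)   # 17/4 = 4.25
nvfp4 = bits_per_element(4, 8, 16)   # 9/2  = 4.50
overhead = float(nvfp4 / mxfp4 - 1)  # 1/17, about 5.9%

print(float(mxfp4), float(nvfp4), round(overhead * 100, 1))
print(e4m3fn_range())
```

Exact fractions are used so that the 4.25, 4.50, 2^-9 and 448 entries can be compared without floating-point rounding.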
Specific requests
If a maintainer could confirm or correct any of the following, that
would close out the row and let me publish a sister conformance pack
to the existing MXFP4 pack:
(a) Block size confirmation. Is 16 elements per block the only
supported block size, or is it a default with alternatives?
(b) Scale format confirmation. Is FP8 E4M3 in its finite-only "fn"
variant (E4M3FN: no infinities, largest finite value 448) the
canonical scale encoding? Are there variants that use FP8 E5M2 instead?
(c) Nibble packing order. When 16 four-bit elements are packed
into 64 bits, does the first element of each pair occupy the
most-significant or the least-significant nibble of its byte?
(d) Reference vectors. Does TransformerEngine ship any unit
tests with documented input/output bit-patterns that I can use
as ground-truth boundary vectors (NaN, +/-Inf-equivalent
saturation, smallest normal, smallest subnormal, denormal-block
behavior)?
(e) Round-trip behavior on out-of-range scale. When a tensor's
natural per-block scale would land outside the FP8 E4M3
representable range, is the recommended behavior (i) clamp the
scale and saturate the elements, (ii) error out, or (iii)
something else?
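
To make questions (c) and (e) concrete, here is a minimal Python sketch of the alternatives I am asking about. All helper names are hypothetical, nothing here is taken from TransformerEngine, and the clamp shows only option (i) for a positive scale, without rounding the result onto the E4M3 grid.

```python
E4M3_MAX = 448.0                # largest finite FP8 E4M3 (finite-only) value
E4M3_MIN_SUBNORMAL = 2.0 ** -9  # smallest positive FP8 E4M3 value


def pack_nibbles(codes, low_nibble_first):
    """Pack 4-bit codes two per byte, in one of the two possible nibble orders."""
    if len(codes) % 2:
        raise ValueError("need an even number of 4-bit codes")
    out = bytearray()
    for i in range(0, len(codes), 2):
        first, second = codes[i], codes[i + 1]
        if not (0 <= first <= 15 and 0 <= second <= 15):
            raise ValueError("codes must fit in 4 bits")
        if low_nibble_first:
            out.append((second << 4) | first)  # first element in bits 0..3
        else:
            out.append((first << 4) | second)  # first element in bits 4..7
    return bytes(out)


def clamp_scale(scale):
    """Option (i): clamp a positive block scale into the finite E4M3 range."""
    return min(max(scale, E4M3_MIN_SUBNORMAL), E4M3_MAX)


codes = list(range(16))  # one 16-element block with codes 0x0..0xF
lo_first = pack_nibbles(codes, low_nibble_first=True)
hi_first = pack_nibbles(codes, low_nibble_first=False)
print(lo_first.hex())  # 1032547698badcfe
print(hi_first.hex())  # 0123456789abcdef
print(clamp_scale(1000.0), clamp_scale(1e-6))
```

The two hex strings differ in every byte, which is why the conformance pack needs the packing order pinned down before any vectors are published.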
What I will do with confirmed answers
I will open a small PR (catalog row + conformance pack) on
gHashTag/t27, with full attribution to this issue and a
cross-link back to the relevant TransformerEngine references. The
pack will follow the same shared row schema as the existing six
packs (GF16, MXFP4 element, BF16, FP8 E4M3, FP8 E5M2, E8M0 block
scale), with honest abs_error reporting (no overflow-to-Inf
masked as a match).
Background and methodology are documented in a 16-page methodology
paper (Trinity S^3 AI, 2026-06-08, file
paper3-methodology-2026-06-08-v3-trinity.pdf, SHA-256
f31f5dd243afc7b2ba4a423859a1e1dc67036c3a93affab30acc8d02f0a15eef)
that I plan to upload to arXiv this week.
Async only -- no rush, no specific deadline. If the relevant
maintainer is on vacation or sprint-locked, a one-line "ping us back
in N weeks" is a fine answer.
Thank you for the open release of NVFP4 documentation and for the
maintained NVFP4 reference implementation in TransformerEngine.
-- Dmitrii Vasilev
Trinity S^3 AI
admin@t27.ai
GitHub: @gHashTag