Conversation
Reviewed By: JuanBesa Differential Revision: D91299742
JiwaniZakir
left a comment
There was a problem hiding this comment.
The renaming of i_b → _i_b and o_b → _o_b across the Triton kernels (_fused_kernel_quantize_into_fp8, _fused_kernel_dequantize_from_fp8, and _fused_kernel_reduce_fp8) correctly signals that the loop counter itself is never referenced inside the loop body — the actual iteration is driven by col_offsets being updated implicitly. However, it's worth double-checking that col_offsets is actually advancing each iteration; scanning the loop bodies, col_offsets appears to be re-initialized to tl.arange(0, BLOCK_SIZE) before each loop but not incremented within the loop, which means every iteration loads and processes the same block. If this is intentional (e.g., col_offsets advances via a side-effect not visible in this diff), that should be made explicit with a comment. If it's a pre-existing bug, renaming to _i_b makes the loop counter look intentionally unused while the real issue (missing col_offsets += BLOCK_SIZE) goes unaddressed. This warrants a closer look before merging.
Reviewed By: JuanBesa
Differential Revision: D91299742