Hello team,
I recently came across the SemiAnalysis article “Vera Rubin: Extreme Co-Design as an Evolution” (https://newsletter.semianalysis.com/p/vera-rubin-extreme-co-design-an-evolution), which discusses Adaptive Compression for transformer workloads. The article mentions a significant speedup (50 PFLOPS vs. 35 PFLOPS), but I could not find detailed information on how this is implemented in Transformer Engine.
Now that GTC 2026 has concluded, I wanted to ask for clarification on the following:
- Could you provide more details on the implementation of Adaptive Compression in Transformer Engine?
- Specifically, how is sparsity identified and exploited dynamically?
- Are there any public code examples, demos, or documentation illustrating this feature?
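For context on the second question: the closest mechanism I am familiar with is magnitude-based 2:4 structured sparsity, which NVIDIA GPUs accelerate in hardware. The sketch below is only my own illustration of that pattern (plain NumPy, not the Transformer Engine API), to make clear what I mean by "identifying sparsity" in a weight tensor:

```python
import numpy as np

def prune_2_to_4(weights):
    """Apply 2:4 structured sparsity: in every group of 4 consecutive
    values, keep the 2 with the largest magnitude and zero the rest.
    (Illustration only; not the Adaptive Compression implementation.)"""
    w = weights.reshape(-1, 4).copy()
    # indices of the 2 smallest-magnitude entries in each group of 4
    drop = np.argsort(np.abs(w), axis=1)[:, :2]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

# Example: the two smallest-magnitude values per group are zeroed.
w = np.array([[1.0, -3.0, 0.5, 2.0]], dtype=np.float32)
print(prune_2_to_4(w))  # -> [[ 0. -3.  0.  2.]]
```

I would like to understand whether Adaptive Compression works along these lines (a fixed structured pattern selected by magnitude) or whether the compression pattern is chosen dynamically per layer or per step at runtime.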
Any guidance or pointers would be greatly appreciated, as I am interested in evaluating and experimenting with this feature for transformer model acceleration.
Thank you for your time and support.
Best regards,
Guanchen