Graphcore researchers developed MXNorm, a method that reuses the scale factors already computed during MXFP8 quantization to shrink the reduction at the heart of tensor normalization by 32×. Published on arXiv on March 13, 2026, the technique addresses a growing performance bottleneck: as matrix multiplication accelerates through low-precision formats, normalization operations consume an increasing share of total compute time.
Normalization Layers Lag Behind Matrix Multiplication Performance Gains
While specialized accelerators and low-precision formats like FP8, MXFP4, and MXFP8 have dramatically improved matrix multiplication performance, reductions and elementwise computations remain stuck at higher precision. Normalization layers like RMSNorm still perform reductions in FP32 or FP16, creating an efficiency imbalance. As matrix multiplication gets faster, normalization consumes a larger proportion of total compute time—hardware improvements have "far outstripped improvements in performance on reductions."
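For reference, RMSNorm's reduction runs over every element of the hidden dimension in full precision, and that reduction is exactly the cost MXNorm targets. A minimal NumPy sketch (function name and shapes are illustrative, not the paper's code):

```python
import numpy as np

def rmsnorm(x, gamma, eps=1e-6):
    """Standard RMSNorm: the mean-of-squares reduction touches all d
    elements of the hidden dimension and runs in FP32."""
    sq = x.astype(np.float32) ** 2
    rms = np.sqrt(sq.mean(axis=-1, keepdims=True) + eps)
    return (x / rms) * gamma
```

For hidden size d, this reduces over d values per row; it is this operation that stays at high precision while the surrounding matrix multiplications drop to FP8 or FP4.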
MXNorm Extracts RMS Information From Existing Quantization Scales
The key insight: MXFP8 (Microscaling FP8) format already calculates per-block scale factors as part of quantization, and these scales contain magnitude distribution information essentially identical to what's needed for RMS (Root Mean Square) estimation. MXNorm extracts these already-computed block scales, uses them to approximate RMS values, applies the estimated RMS for normalization, and achieves a 32× reduction in the size of the reduction operation—all without additional computation.
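The idea can be sketched as follows. MXFP8 shares one scale per 32-element block, derived from the block's absolute maximum, so an RMS estimate can be reduced over the d/32 block scales instead of all d elements. Note this is a hand-rolled illustration, not the paper's estimator: the exact mapping from block scales to RMS (and any calibration it needs) is an assumption here.

```python
import numpy as np

BLOCK = 32  # MXFP8 block size

def block_scales(x):
    # Per-block absolute maxima, as an MXFP8-style quantizer computes them.
    # (Real MXFP8 rounds these to power-of-two E8M0 scales; omitted here.)
    blocks = x.reshape(*x.shape[:-1], -1, BLOCK)
    return np.abs(blocks).max(axis=-1).astype(np.float32)

def mxnorm_rms_estimate(x, eps=1e-6):
    # Hypothetical estimator: reduce over the d/32 block scales rather than
    # all d elements -- a 32x smaller reduction. A block's absmax
    # overestimates its RMS, so a calibrated correction factor would be
    # applied in practice (an assumption, not the paper's formula).
    s = block_scales(x)
    return np.sqrt((s ** 2).mean(axis=-1, keepdims=True) + eps)
```

For d = 4096, the reduction drops from 4096 to 128 values per row, and the scales themselves cost nothing extra because the quantizer has already computed them.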
Validation on Llama 3 Shows Minimal Accuracy Loss With Measurable Speedups
The method was validated on Llama 3 model pre-training at three scales: 125 million, 1 billion, and 8 billion parameters. Compared to baseline RMSNorm with MXFP8 matrix multiplications, results showed minimal loss of training accuracy.
Performance improvements included:
- Up to 2.4× kernel-level speedup for MXNorm over RMSNorm using only torch.compile
- 1.3% speedup in Llama 3 8B transformer layers when using MXFP8
- 2.6% speedup when using NVFP4 (NVIDIA's 4-bit format)
The "using only torch.compile" qualifier matters: these speedups require no custom CUDA kernels or specialized hardware implementations, so the method can be adopted through PyTorch's built-in compilation alone.
Small Percentage Gains Compound Across Training at Scale
While 1.3-2.6% improvements may appear modest, they are substantial in context. Transformer layers are already heavily optimized, so any gain there is hard-won, and these gains compound across every normalization layer in the network. When models scale and training runs extend to millions of GPU-hours, small percentage improvements translate into major resource savings, and here the gain is essentially free because it comes from reusing values the quantizer already computed.
Shrinking the reduction by 32× provides benefits beyond raw speed:
- Lower memory bandwidth requirements
- Reduced power consumption
- Better scaling to larger batch sizes
- Improved efficiency on memory-bandwidth-limited hardware
The approach functions as a drop-in replacement for RMSNorm: it requires no architectural changes and only minimal modification to the training pipeline, so existing models can potentially adopt it and experimentation is cheap. Testing on Llama 3 at three scales (125M, 1B, 8B) demonstrates that the approach holds up across model sizes on a widely used, extensively benchmarked model family.
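A drop-in stand-in would expose the same call signature as an RMSNorm layer. The hypothetical NumPy class below shows only the interface shape under that assumption; the class name, block size handling, and scale-based estimator are illustrative, not the paper's implementation:

```python
import numpy as np

class MXNormSketch:
    """Hypothetical drop-in stand-in for an RMSNorm layer.

    Normalizes by an RMS estimate derived from per-block scales, so the
    reduction covers d/block values instead of d (all names illustrative).
    """
    def __init__(self, dim, block=32, eps=1e-6):
        assert dim % block == 0
        self.gamma = np.ones(dim, dtype=np.float32)  # learnable in practice
        self.block = block
        self.eps = eps

    def __call__(self, x):
        blocks = x.reshape(*x.shape[:-1], -1, self.block)
        scales = np.abs(blocks).max(axis=-1)  # reused from the quantizer
        rms = np.sqrt((scales ** 2).mean(axis=-1, keepdims=True) + self.eps)
        return (x / rms) * self.gamma
```

Because the signature matches a standard normalization layer, swapping it into an existing model is a one-line change at each call site.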
Key Takeaways
- MXNorm achieves 32× reduction in normalization operation size by reusing block scales already computed during MXFP8 quantization
- Validation on Llama 3 models at 125M, 1B, and 8B parameters showed minimal training accuracy loss compared to standard RMSNorm
- Performance improvements include up to 2.4× kernel-level speedup and 1.3-2.6% speedups in transformer layers using only PyTorch's torch.compile
- The method works as a drop-in replacement for RMSNorm without requiring custom CUDA kernels or architectural changes
- Lower memory-bandwidth and power requirements add further savings that compound across large-scale training runs spanning millions of GPU-hours