Graphcore researchers developed MXNorm, a method that reuses the scale factors already computed during MXFP8 quantization to shrink the reduction at the heart of tensor normalization by 32×. Published on arXiv on March 13, 2026, the technique addresses a growing performance bottleneck: as matrix multiplication accelerates through low-precision formats, normalization operations consume an increasing share of total compute time.
Normalization Layers Lag Behind Matrix Multiplication Performance Gains
While specialized accelerators and low-precision formats like FP8, MXFP4, and MXFP8 have dramatically improved matrix multiplication performance, reductions and elementwise computations remain stuck at higher precision. Normalization layers like RMSNorm still perform reductions in FP32 or FP16, creating an efficiency imbalance. As matrix multiplication gets faster, normalization consumes a larger proportion of total compute time—hardware improvements have "far outstripped improvements in performance on reductions."
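For reference, RMSNorm's reduction runs over every element of the hidden dimension in full precision, and that reduction is exactly the cost MXNorm targets. A minimal NumPy sketch (function name and shapes are illustrative, not the paper's code):

```python
import numpy as np

def rmsnorm(x, gamma, eps=1e-6):
    """Standard RMSNorm: the mean-of-squares reduction touches all d
    elements of the hidden dimension and runs in FP32."""
    sq = x.astype(np.float32) ** 2
    rms = np.sqrt(sq.mean(axis=-1, keepdims=True) + eps)
    return (x / rms) * gamma
```

For hidden size d, this reduces over d values per row; it is this operation that stays at high precision while the surrounding matrix multiplications drop to FP8 or FP4.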
MXNorm Extracts RMS Information From Existing Quantization Scales
The key insight: MXFP8 (Microscaling FP8) format already calculates per-block scale factors as part of quantization, and these scales contain magnitude distribution information essentially identical to what's needed for RMS (Root Mean Square) estimation. MXNorm extracts these already-computed block scales, uses them to approximate RMS values, applies the estimated RMS for normalization, and achieves a 32× reduction in the size of the reduction operation—all without additional computation.
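The idea can be sketched as follows. MXFP8 shares one scale per 32-element block, derived from the block's absolute maximum, so an RMS estimate can be reduced over the d/32 block scales instead of all d elements. Note this is a hand-rolled illustration, not the paper's estimator: the exact mapping from block scales to RMS (and any calibration it needs) is an assumption here.

```python
import numpy as np

BLOCK = 32  # MXFP8 block size

def block_scales(x):
    # Per-block absolute maxima, as an MXFP8-style quantizer computes them.
    # (Real MXFP8 rounds these to power-of-two E8M0 scales; omitted here.)
    blocks = x.reshape(*x.shape[:-1], -1, BLOCK)
    return np.abs(blocks).max(axis=-1).astype(np.float32)

def mxnorm_rms_estimate(x, eps=1e-6):
    # Hypothetical estimator: reduce over the d/32 block scales rather than
    # all d elements -- a 32x smaller reduction. A block's absmax
    # overestimates its RMS, so a calibrated correction factor would be
    # applied in practice (an assumption, not the paper's formula).
    s = block_scales(x)
    return np.sqrt((s ** 2).mean(axis=-1, keepdims=True) + eps)
```

For d = 4096, the reduction drops from 4096 to 128 values per row, and the scales themselves cost nothing extra because the quantizer has already computed them.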
Validation on Llama 3 Shows Minimal Accuracy Loss With Measurable Speedups
The method was validated on Llama 3 model pre-training at three scales: 125 million, 1 billion, and 8 billion parameters. Compared to baseline RMSNorm with MXFP8 matrix multiplications, results showed minimal loss of training accuracy.
Performance improvements included:
- Up to 2.4× kernel-level speedup for MXNorm over RMSNorm using only torch.compile
- 1.3% speedup in Llama 3 8B transformer layers when using MXFP8
- 2.6% speedup when using NVFP4 (NVIDIA's 4-bit format)
The "using only torch.compile" qualifier matters: these speedups require no custom CUDA kernels or specialized hardware implementations, so the method can be adopted through PyTorch's built-in compilation alone.
Small Percentage Gains Compound Across Training at Scale
While 1.3-2.6% improvements may appear modest, they are substantial in context. Transformer layers are already heavily optimized, so any gain there is hard-won, and these gains compound across every normalization layer in the network. When models scale and training runs extend to millions of GPU-hours, small percentage improvements translate into major resource savings, and here the gain is essentially free because it comes from reusing values the quantizer already computed.
Shrinking the reduction by 32× provides benefits beyond raw speed:
- Lower memory bandwidth requirements
- Reduced power consumption
- Better scaling to larger batch sizes
- Improved efficiency on memory-bandwidth-limited hardware
The approach functions as a drop-in replacement for RMSNorm: it requires no architectural changes and only minimal modification to the training pipeline, so existing models can potentially adopt it and experimentation is cheap. Testing on Llama 3 at three scales (125M, 1B, 8B) demonstrates that the approach holds up across model sizes on a widely used, extensively benchmarked model family.
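A drop-in stand-in would expose the same call signature as an RMSNorm layer. The hypothetical NumPy class below shows only the interface shape under that assumption; the class name, block size handling, and scale-based estimator are illustrative, not the paper's implementation:

```python
import numpy as np

class MXNormSketch:
    """Hypothetical drop-in stand-in for an RMSNorm layer.

    Normalizes by an RMS estimate derived from per-block scales, so the
    reduction covers d/block values instead of d (all names illustrative).
    """
    def __init__(self, dim, block=32, eps=1e-6):
        assert dim % block == 0
        self.gamma = np.ones(dim, dtype=np.float32)  # learnable in practice
        self.block = block
        self.eps = eps

    def __call__(self, x):
        blocks = x.reshape(*x.shape[:-1], -1, self.block)
        scales = np.abs(blocks).max(axis=-1)  # reused from the quantizer
        rms = np.sqrt((scales ** 2).mean(axis=-1, keepdims=True) + self.eps)
        return (x / rms) * self.gamma
```

Because the signature matches a standard normalization layer, swapping it into an existing model is a one-line change at each call site.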
Key Takeaways
- MXNorm achieves 32× reduction in normalization operation size by reusing block scales already computed during MXFP8 quantization
- Validation on Llama 3 models at 125M, 1B, and 8B parameters showed minimal training accuracy loss compared to standard RMSNorm
- Performance improvements include up to 2.4× kernel-level speedup and 1.3-2.6% speedups in transformer layers using only PyTorch's torch.compile
- The method works as a drop-in replacement for RMSNorm without requiring custom CUDA kernels or architectural changes
- Lower memory-bandwidth and power requirements add further savings that compound across large-scale training runs spanning millions of GPU-hours