Researchers released TriAttention, a novel KV cache compression method that enables 32B-parameter models to run on a single consumer GPU while matching Full Attention reasoning accuracy. The paper, published on arXiv on April 6, 2026, addresses a critical bottleneck: extended reasoning in large language models creates severe KV cache memory pressure that prevents deployment on consumer hardware.
Technical Innovation in Pre-RoPE Space
TriAttention operates in pre-RoPE space, where query and key vectors are highly concentrated around fixed non-zero centers. The method uses a trigonometric series to estimate key importance from positional-distance preferences, with Q/K norms serving as an additional importance signal.
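The scoring scheme described above can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's implementation: the function names, the cosine-series form, and the coefficients are all assumptions, standing in for whatever series the authors actually fit.

```python
import numpy as np

def trig_distance_weight(distances, coeffs, period=64.0):
    """Truncated cosine series over positional distance:
    w(d) = c_0 + sum_n c_n * cos(2*pi*n*d / period).
    `coeffs` and `period` are illustrative, not the paper's values."""
    n = np.arange(1, len(coeffs))
    series = np.cos(2 * np.pi * np.outer(distances, n) / period) @ coeffs[1:]
    return coeffs[0] + series

def key_importance(keys, query_pos, key_positions, coeffs):
    """Score cached keys: pre-RoPE key norms modulated by the
    trigonometric positional-distance weight (hypothetical combination)."""
    norms = np.linalg.norm(keys, axis=-1)        # Q/K-norm importance signal
    dist = query_pos - key_positions             # distance to current query
    return norms * trig_distance_weight(dist, coeffs)

def prune_cache(keys, values, scores, keep):
    """Keep the `keep` highest-scoring (key, value) rows, in sequence order."""
    idx = np.sort(np.argsort(scores)[-keep:])
    return keys[idx], values[idx]
```

The point of scoring in pre-RoPE space is that importance can be estimated from norms and positions alone, before rotary embeddings mix positional phase into the vectors.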
Performance Benchmarks
On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while delivering either 2.5x higher throughput or a 10.7x reduction in KV memory. Leading baseline methods reach only about half that accuracy at the same efficiency level. The result allows OpenClaw, a 32B model, to run on a single 24GB RTX 4090, where Full Attention hits out-of-memory errors.
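A back-of-envelope calculation shows why the KV cache, not the weights, is the binding constraint at 32K tokens. The layer count, KV-head count, and head dimension below are illustrative placeholders, not OpenClaw's published configuration:

```python
def kv_cache_gib(tokens, layers=64, kv_heads=8, head_dim=128, bytes_per=2):
    """KV cache size in GiB for one sequence: keys and values (factor of 2)
    per token, per layer, per KV head, at fp16 (2 bytes per element).
    All architecture parameters here are assumed, not OpenClaw's real config."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per / 2**30

full = kv_cache_gib(32_768)   # 8.0 GiB with these illustrative parameters
compressed = full / 10.7      # ~0.75 GiB after the headline 10.7x reduction
```

Even under these modest assumptions, a single 32K-token sequence consumes 8 GiB of cache on top of the model weights, which explains the out-of-memory failures on a 24GB card; a 10.7x reduction brings that same cache under 1 GiB.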
Open Source Release and Deployment
The team open-sourced the complete codebase with vLLM-ready integration for one-click deployment. The method is orthogonal to other compression techniques such as TurboQuant, which reduces bit precision: TriAttention prunes along the sequence dimension while TurboQuant compresses the value representation, so the two can be combined for even greater efficiency gains.
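The orthogonality claim can be made concrete with a minimal sketch: pruning drops whole (key, value) rows along the sequence axis, and a simple per-row int8 quantizer then shrinks whatever rows remain. Both steps here are generic stand-ins, not the actual TriAttention or TurboQuant implementations, and `scores` is assumed to come from an importance estimator:

```python
import numpy as np

def prune_rows(keys, values, scores, keep):
    """Sequence-dimension pruning: keep the `keep` highest-scoring rows."""
    idx = np.sort(np.argsort(scores)[-keep:])
    return keys[idx], values[idx]

def quantize_int8(x):
    """Per-row symmetric int8 quantization (generic, not TurboQuant)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    q = np.round(x / np.maximum(scale, 1e-12)).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Composition: prune first, then quantize the surviving values.
# With 4x pruning and fp16 -> int8, the cache shrinks ~8x overall.
```

Because the two operations act on different axes (which tokens to keep vs. how many bits per element), their compression ratios multiply rather than conflict.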
The paper is authored by Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, and Yukang Chen. Community response highlighted the KV cache reduction as addressing the main bottleneck for running larger models locally, beyond raw compute constraints.
Key Takeaways
- TriAttention achieves 2.5x higher throughput or 10.7x KV memory reduction while matching Full Attention accuracy on AIME25
- The method enables 32B parameter models like OpenClaw to run on single consumer GPUs with 24GB VRAM
- Leading baseline methods achieve only about half the accuracy at the same efficiency level
- The technique is orthogonal to quantization methods and can be combined for greater efficiency
- Full code is open-sourced with vLLM-ready integration for one-click deployment