Researchers from UC Berkeley, NVIDIA, MIT, and other institutions have introduced AutoGaze, a lightweight module that reduces visual tokens by 4x-100x while enabling multimodal large language models to process 1,000-frame 4K-resolution videos. Published March 12, 2026 on arXiv, the paper "Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing" addresses a critical bottleneck in video AI: the computational cost of processing every pixel equally despite massive spatiotemporal redundancy.
AutoGaze Removes Redundant Patches Before Processing
AutoGaze operates by eliminating redundant patches before they reach the vision transformer or MLLM, rather than processing all pixels uniformly. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a user-specified error threshold. This "attend before attention" approach treats video understanding as an autoregressive gazing problem, where the model actively decides which spatial regions to attend to across frames.
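The paper does not publish the selection algorithm in this summary, but the core idea of selecting a minimal patch set that reconstructs the video within an error threshold can be illustrated with a greedy, single-scale sketch. Everything below is a hypothetical simplification: the real AutoGaze uses a learned autoregressive policy over multi-scale patches, trained with next-token prediction and reinforcement learning, not the greedy heuristic shown here.

```python
import numpy as np

def greedy_patch_selection(frame, patch=8, max_err=0.01):
    """Toy 'attend before attention' sketch: keep adding the patch that
    removes the most reconstruction error until the mean absolute error
    drops below a user-specified threshold (hypothetical simplification
    of AutoGaze's learned autoregressive selection)."""
    h, w = frame.shape
    recon = np.zeros_like(frame)  # start from an empty reconstruction
    coords = [(i, j) for i in range(0, h, patch) for j in range(0, w, patch)]
    selected = []
    while coords and np.abs(frame - recon).mean() > max_err:
        # choose the patch whose pixels currently contribute the most error
        best = max(coords, key=lambda c: np.abs(
            frame[c[0]:c[0]+patch, c[1]:c[1]+patch]
            - recon[c[0]:c[0]+patch, c[1]:c[1]+patch]).sum())
        i, j = best
        recon[i:i+patch, j:j+patch] = frame[i:i+patch, j:j+patch]
        selected.append(best)
        coords.remove(best)
    return selected, recon

# A mostly flat frame with one bright region: only 1 of 16 patches is needed.
frame = np.zeros((32, 32))
frame[:8, :8] = 1.0
sel, recon = greedy_patch_selection(frame)
print(len(sel))  # 1
```

The intuition carries over: when most of a video is spatiotemporally redundant, a small fraction of patches suffices to reconstruct it within tolerance, and only those patches need to become visual tokens.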
Performance Benchmarks Show 19x Speed Improvement
The system achieves substantial improvements across multiple metrics:
- 4x-100x reduction in visual tokens
- Up to 19x acceleration for vision transformers and MLLMs
- 67.0% accuracy on the VideoMME benchmark
- 4x improvement in ViT latency alongside LLM latency gains, whereas baseline token-reduction methods reduce only LLM latency
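A back-of-envelope calculation shows why the 4x-100x reduction matters at this scale. Assuming, purely for illustration, a plain ViT patchification of 16x16 patches over raw 3840x2160 frames (AutoGaze's actual multi-scale patchification differs):

```python
# Illustrative token counts for a 1,000-frame 4K video under an assumed
# 16x16 ViT patch grid; not the paper's actual tokenization.
frames = 1000
patches_per_frame = (3840 // 16) * (2160 // 16)  # 240 * 135 = 32,400
total = frames * patches_per_frame

print(total)          # 32400000 tokens without reduction
print(total // 4)     # 8100000 tokens at the 4x end of the range
print(total // 100)   # 324000 tokens at the 100x end
```

Tens of millions of visual tokens are far beyond typical MLLM context budgets; reducing them by up to two orders of magnitude is what makes 1,000-frame 4K inputs tractable.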
HLVid Benchmark Tests High-Resolution Long-Form Video Understanding
The research team introduced HLVid, the first high-resolution, long-form video question-answering benchmark featuring 5-minute 4K-resolution videos. An MLLM scaled with AutoGaze improved over the baseline by 10.1% and outperformed the previous best MLLM by 4.5% on this benchmark, demonstrating the system's effectiveness on challenging real-world video content.
Novel Approach Selects Informative Regions Rather Than Processing All Pixels
The key innovation lies in AutoGaze's ability to identify which spatial regions matter for understanding video content, rather than treating all pixels as equally important. By learning to focus attention on informative patches through reinforcement learning, AutoGaze makes previously impractical video understanding tasks feasible on consumer hardware. The project page provides additional implementation details.
Key Takeaways
- AutoGaze reduces visual tokens by 4x-100x while maintaining accuracy, achieving up to 19x speed improvements for video processing
- The system enables MLLMs to process 1,000-frame 4K-resolution videos, previously impractical due to computational constraints
- HLVid benchmark introduces the first high-resolution, long-form video QA test with 5-minute 4K videos
- AutoGaze improves both ViT latency (4x) and LLM latency, unlike baseline methods that only reduce LLM latency
- The approach treats video understanding as autoregressive gazing, actively selecting which spatial regions to process rather than uniformly processing all pixels