Researchers from UC Berkeley, NVIDIA, MIT, and other institutions have introduced AutoGaze, a lightweight module that reduces visual tokens by 4x-100x while enabling multimodal large language models to process 1,000-frame 4K-resolution videos. Published March 12, 2026 on arXiv, the paper "Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing" addresses a critical bottleneck in video AI: the computational cost of processing every pixel equally despite massive spatiotemporal redundancy.
AutoGaze Removes Redundant Patches Before Processing
AutoGaze operates by eliminating redundant patches before they reach the vision transformer or MLLM, rather than processing all pixels uniformly. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a user-specified error threshold. This "attend before attention" approach treats video understanding as an autoregressive gazing problem, where the model actively decides which spatial regions to attend to across frames.
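The paper does not publish the selection algorithm in this summary, but the core idea of selecting a minimal patch set that reconstructs the video within an error threshold can be illustrated with a greedy, single-scale sketch. Everything below is a hypothetical simplification: the real AutoGaze uses a learned autoregressive policy over multi-scale patches, trained with next-token prediction and reinforcement learning, not the greedy heuristic shown here.

```python
import numpy as np

def greedy_patch_selection(frame, patch=8, max_err=0.01):
    """Toy 'attend before attention' sketch: keep adding the patch that
    removes the most reconstruction error until the mean absolute error
    drops below a user-specified threshold (hypothetical simplification
    of AutoGaze's learned autoregressive selection)."""
    h, w = frame.shape
    recon = np.zeros_like(frame)  # start from an empty reconstruction
    coords = [(i, j) for i in range(0, h, patch) for j in range(0, w, patch)]
    selected = []
    while coords and np.abs(frame - recon).mean() > max_err:
        # choose the patch whose pixels currently contribute the most error
        best = max(coords, key=lambda c: np.abs(
            frame[c[0]:c[0]+patch, c[1]:c[1]+patch]
            - recon[c[0]:c[0]+patch, c[1]:c[1]+patch]).sum())
        i, j = best
        recon[i:i+patch, j:j+patch] = frame[i:i+patch, j:j+patch]
        selected.append(best)
        coords.remove(best)
    return selected, recon

# A mostly flat frame with one bright region: only 1 of 16 patches is needed.
frame = np.zeros((32, 32))
frame[:8, :8] = 1.0
sel, recon = greedy_patch_selection(frame)
print(len(sel))  # 1
```

The intuition carries over: when most of a video is spatiotemporally redundant, a small fraction of patches suffices to reconstruct it within tolerance, and only those patches need to become visual tokens.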
Performance Benchmarks Show 19x Speed Improvement
The system achieves substantial improvements across multiple metrics:
- 4x-100x reduction in visual tokens
- Up to 19x acceleration for vision transformers and MLLMs
- 67.0% accuracy on the VideoMME benchmark
- 4x improvement in ViT latency alongside LLM latency gains, whereas baseline token-reduction methods reduce only LLM latency
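A back-of-envelope calculation shows why the 4x-100x reduction matters at this scale. Assuming, purely for illustration, a plain ViT patchification of 16x16 patches over raw 3840x2160 frames (AutoGaze's actual multi-scale patchification differs):

```python
# Illustrative token counts for a 1,000-frame 4K video under an assumed
# 16x16 ViT patch grid; not the paper's actual tokenization.
frames = 1000
patches_per_frame = (3840 // 16) * (2160 // 16)  # 240 * 135 = 32,400
total = frames * patches_per_frame

print(total)          # 32400000 tokens without reduction
print(total // 4)     # 8100000 tokens at the 4x end of the range
print(total // 100)   # 324000 tokens at the 100x end
```

Tens of millions of visual tokens are far beyond typical MLLM context budgets; reducing them by up to two orders of magnitude is what makes 1,000-frame 4K inputs tractable.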
HLVid Benchmark Tests High-Resolution Long-Form Video Understanding
The research team introduced HLVid, the first high-resolution, long-form video question-answering benchmark featuring 5-minute 4K-resolution videos. An MLLM scaled with AutoGaze improved over the baseline by 10.1% and outperformed the previous best MLLM by 4.5% on this benchmark, demonstrating the system's effectiveness on challenging real-world video content.
Novel Approach Selects Informative Regions Rather Than Processing All Pixels
The key innovation lies in AutoGaze's ability to identify which spatial regions matter for understanding video content, rather than treating all pixels as equally important. By learning to focus attention on informative patches through reinforcement learning, AutoGaze makes previously impractical video understanding tasks feasible on consumer hardware. The project page provides additional implementation details.
Key Takeaways
- AutoGaze reduces visual tokens by 4x-100x while maintaining accuracy, achieving up to 19x speed improvements for video processing
- The system enables MLLMs to process 1,000-frame 4K-resolution videos, previously impractical due to computational constraints
- HLVid benchmark introduces the first high-resolution, long-form video QA test with 5-minute 4K videos
- AutoGaze improves both ViT latency (4x) and LLM latency, unlike baseline methods that only reduce LLM latency
- The approach treats video understanding as autoregressive gazing, actively selecting which spatial regions to process rather than uniformly processing all pixels