Researchers have introduced Video Streaming Thinking (VST), a paradigm that enables Video Large Language Models (VideoLLMs) to perform logical reasoning while processing streaming video in real time. Published March 12, 2026, on arXiv, the approach achieves 79.5% on StreamingBench while responding 15.7× faster than existing methods.
Framework Enables Thinking While Watching for Real-Time Video Understanding
Existing online VideoLLMs focus on streaming perception but lack synchronized logical reasoning. Applying test-time scaling methods directly to the streaming setting causes unacceptable response latency, forcing a trade-off between reasoning quality and real-time responsiveness.
VST solves this by activating reasoning over incoming video clips during streaming, amortizing LLM reasoning latency over video playback time. This improves timely comprehension and coherent cognition while preserving real-time responsiveness.
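The amortization idea can be illustrated with a back-of-the-envelope latency model (a minimal sketch with invented numbers, not the paper's measurements or API): in perceive-then-reason, all reasoning cost is paid after the question arrives, while in streaming thinking each clip's reasoning overlaps with playback of subsequent clips, so only reasoning that overruns playback is felt as response latency.

```python
# Illustrative latency model for amortized streaming reasoning.
# All quantities are in milliseconds; the values are assumptions.

def perceive_then_reason_latency(num_clips, reason_ms_per_clip):
    # Baseline: the full chain of reasoning runs after the video ends,
    # so the user waits for the entire reasoning cost at answer time.
    return num_clips * reason_ms_per_clip

def streaming_thinking_latency(num_clips, reason_ms_per_clip, clip_ms):
    # VST-style: each clip's reasoning is hidden behind playback of the
    # next clip; only the overhang beyond playback time adds latency.
    overhang = max(0, reason_ms_per_clip - clip_ms)
    return num_clips * overhang

print(perceive_then_reason_latency(10, 800))      # all 8000 ms paid at answer time
print(streaming_thinking_latency(10, 800, 1000))  # fully hidden by playback: 0 ms
```

Under these toy numbers, per-clip reasoning that fits within a clip's playback time contributes no response latency at all, which is the intuition behind amortizing reasoning over playback.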
Two-Stage Pipeline Uses Robust Visual Features and Modality Fusion
The technical approach employs a two-stage pipeline:
Stage I: Robust Visual Features
- Pretrained DINOv2 ViT-L/14 backbone for visual encoding
- Padding-aware augmentation (PadAug) preprocessing strategy
- Mixture-of-experts (MoE) training head for enhanced classifier diversity
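A mixture-of-experts classifier head of the kind listed above can be sketched as follows. This is a minimal NumPy version under stated assumptions: the expert count, gating scheme, class count, and the 1024-dimensional input (matching DINOv2 ViT-L/14 features) are illustrative choices, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoEHead:
    """Minimal MoE classifier head: several linear experts whose logits
    are combined by a learned softmax gate (dimensions illustrative)."""
    def __init__(self, dim, num_classes, num_experts=4):
        self.experts = [rng.normal(0, 0.02, (dim, num_classes))
                        for _ in range(num_experts)]
        self.gate = rng.normal(0, 0.02, (dim, num_experts))

    def __call__(self, feats):                    # feats: (batch, dim)
        weights = softmax(feats @ self.gate)      # per-sample expert weights
        logits = np.stack([feats @ w for w in self.experts], axis=1)
        return (weights[..., None] * logits).sum(axis=1)  # (batch, classes)

head = MoEHead(dim=1024, num_classes=8)           # 1024 = ViT-L/14 feature dim
out = head(rng.normal(size=(2, 1024)))
print(out.shape)  # (2, 8)
```

Each expert sees the same features but learns a different decision boundary, which is the "classifier diversity" the MoE head is meant to provide.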
Stage II: Modality Fusion and Temporal Consistency
- Visual modality: Multi-scale face re-cropping with averaged features for robust frame-level representation
- Audio modality: Frame-aligned Wav2Vec 2.0 features from short audio windows
- Lightweight gated fusion module integrating dual-modal features
- Inference-time temporal smoothing for consistency
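The fusion and smoothing steps above can be sketched as a lightweight gated fusion over frame-aligned features (a hypothetical NumPy sketch: the projection sizes, the sigmoid gate, the 768-dimensional audio input matching Wav2Vec 2.0 base, and the moving-average smoother are all illustrative assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GatedFusion:
    """Sketch of a lightweight gated fusion module: a sigmoid gate computed
    from the concatenated modalities decides, per dimension, how much of
    the visual vs. audio feature to keep."""
    def __init__(self, vis_dim, aud_dim, out_dim):
        self.vis_proj = rng.normal(0, 0.02, (vis_dim, out_dim))
        self.aud_proj = rng.normal(0, 0.02, (aud_dim, out_dim))
        self.gate_w = rng.normal(0, 0.02, (2 * out_dim, out_dim))

    def __call__(self, vis, aud):                 # (T, vis_dim), (T, aud_dim)
        v = vis @ self.vis_proj                   # frame-level visual features
        a = aud @ self.aud_proj                   # frame-aligned audio features
        g = sigmoid(np.concatenate([v, a], axis=-1) @ self.gate_w)
        return g * v + (1.0 - g) * a              # (T, out_dim)

def temporal_smooth(preds, window=5):
    """Inference-time temporal smoothing: per-channel moving average over a
    sliding window of frame-level outputs, for temporal consistency."""
    kernel = np.ones(window) / window
    return np.stack([np.convolve(preds[:, c], kernel, mode="same")
                     for c in range(preds.shape[1])], axis=1)

fusion = GatedFusion(vis_dim=1024, aud_dim=768, out_dim=256)
fused = fusion(rng.normal(size=(16, 1024)), rng.normal(size=(16, 768)))
print(fused.shape)  # (16, 256)
```

The gate lets the model lean on audio when faces are occluded and on vision when audio is noisy, while the smoother suppresses frame-to-frame flicker in the outputs.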
Training Innovation Uses Video Knowledge Graphs for Data Synthesis
Training proceeds in two phases. VST-SFT structurally adapts offline VideoLLMs to causal streaming reasoning; VST-RL then improves the model end-to-end through self-exploration in multi-turn video-interaction environments.
An automated data synthesis pipeline uses video knowledge graphs to generate high-quality streaming QA pairs with entity-relation grounded streaming Chain-of-Thought, enforcing multi-evidence reasoning.
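A toy version of knowledge-graph-driven QA synthesis might look like the following. Everything here is invented for illustration: the triple schema, the example events, and the question template are assumptions, not the paper's actual pipeline; the key idea shown is chaining time-ordered triples that share an entity so the streaming chain-of-thought must cite multiple pieces of evidence.

```python
# A tiny video knowledge graph: (subject, relation, object, timestamp_s).
# The events are fabricated examples for illustration only.
triples = [
    ("person_A", "picks_up", "red_cup", 12),
    ("person_A", "walks_to", "kitchen", 20),
    ("person_A", "puts_down", "red_cup", 31),
]

def synthesize_multi_evidence_qa(triples):
    """Chain two time-ordered triples sharing a subject into one QA pair
    whose streaming chain-of-thought cites both pieces of evidence."""
    qa_pairs = []
    for i, (s1, r1, o1, t1) in enumerate(triples):
        for s2, r2, o2, t2 in triples[i + 1:]:
            if s1 == s2 and t1 < t2:              # entity-linked, time-ordered
                question = (f"After {s1} {r1.replace('_', ' ')} the {o1}, "
                            f"what do they do next?")
                cot = (f"[t={t1}s] {s1} {r1.replace('_', ' ')} {o1}; "
                       f"[t={t2}s] {s1} {r2.replace('_', ' ')} {o2}.")
                qa_pairs.append({"question": question,
                                 "streaming_cot": cot,
                                 "answer": f"{r2.replace('_', ' ')} {o2}"})
    return qa_pairs

pairs = synthesize_multi_evidence_qa(triples)
print(len(pairs))          # 3 entity-linked, time-ordered pairs
print(pairs[0]["answer"])  # "walks to kitchen"
```

Because every synthesized answer is grounded in at least two timestamped graph edges, a model trained on such pairs cannot shortcut to single-frame perception; it must track entities across the stream.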
VST-7B Achieves Strong Performance With 15.7× Speed Improvement
VST-7B performance results:
- 79.5% on StreamingBench (official benchmark)
- 59.3% on OVO-Bench
- Competitive offline benchmark performance (Video-Holmes: +5.4% vs Video-R1)
- 15.7× faster response time compared to Video-R1
This represents a paradigm shift from perceive-then-reason to simultaneous perception and reasoning in streaming video. Authors Yiran Guan, Liang Yin, Dingkang Liang, and colleagues are releasing code and models at github.com/1ranGuan/VST.
Key Takeaways
- Video Streaming Thinking enables VideoLLMs to perform logical reasoning while processing streaming video in real time by amortizing reasoning latency over playback time
- VST-7B achieves 79.5% on StreamingBench and 59.3% on OVO-Bench while responding 15.7× faster than Video-R1
- Two-stage pipeline combines DINOv2 visual encoding, Wav2Vec 2.0 audio features, and gated fusion with temporal smoothing
- Automated data synthesis uses video knowledge graphs to generate streaming QA pairs with entity-relation grounded Chain-of-Thought reasoning
- Paradigm shift from perceive-then-reason to simultaneous perception and reasoning enables real-time video understanding applications