Researchers have introduced Video Streaming Thinking (VST), a paradigm that enables Video Large Language Models (VideoLLMs) to perform logical reasoning while processing streaming video in real time. Published March 12, 2026, on arXiv, the approach achieves 79.5% on StreamingBench while responding 15.7× faster than existing methods.
Framework Enables Thinking While Watching for Real-Time Video Understanding
Existing online VideoLLMs focus on streaming perception but lack synchronized logical reasoning. Applying test-time scaling methods directly to the streaming setting causes unacceptable response latency, forcing a trade-off between reasoning quality and real-time responsiveness.
VST solves this by activating reasoning over incoming video clips during streaming, amortizing LLM reasoning latency over video playback time. This improves timely comprehension and coherent cognition while preserving real-time responsiveness.
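The amortization idea can be illustrated with a back-of-the-envelope latency model (a minimal sketch with invented numbers, not the paper's measurements or API): in perceive-then-reason, all reasoning cost is paid after the question arrives, while in streaming thinking each clip's reasoning overlaps with playback of subsequent clips, so only reasoning that overruns playback is felt as response latency.

```python
# Illustrative latency model for amortized streaming reasoning.
# All quantities are in milliseconds; the values are assumptions.

def perceive_then_reason_latency(num_clips, reason_ms_per_clip):
    # Baseline: the full chain of reasoning runs after the video ends,
    # so the user waits for the entire reasoning cost at answer time.
    return num_clips * reason_ms_per_clip

def streaming_thinking_latency(num_clips, reason_ms_per_clip, clip_ms):
    # VST-style: each clip's reasoning is hidden behind playback of the
    # next clip; only the overhang beyond playback time adds latency.
    overhang = max(0, reason_ms_per_clip - clip_ms)
    return num_clips * overhang

print(perceive_then_reason_latency(10, 800))      # all 8000 ms paid at answer time
print(streaming_thinking_latency(10, 800, 1000))  # fully hidden by playback: 0 ms
```

Under these toy numbers, per-clip reasoning that fits within a clip's playback time contributes no response latency at all, which is the intuition behind amortizing reasoning over playback.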
Two-Stage Pipeline Uses Robust Visual Features and Modality Fusion
The technical approach employs a two-stage pipeline:
Stage I: Robust Visual Features
- Pretrained DINOv2 ViT-L/14 backbone for visual encoding
- Padding-aware augmentation (PadAug) preprocessing strategy
- Mixture-of-experts (MoE) training head for enhanced classifier diversity
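A mixture-of-experts classifier head of the kind listed above can be sketched as follows. This is a minimal NumPy version under stated assumptions: the expert count, gating scheme, class count, and the 1024-dimensional input (matching DINOv2 ViT-L/14 features) are illustrative choices, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoEHead:
    """Minimal MoE classifier head: several linear experts whose logits
    are combined by a learned softmax gate (dimensions illustrative)."""
    def __init__(self, dim, num_classes, num_experts=4):
        self.experts = [rng.normal(0, 0.02, (dim, num_classes))
                        for _ in range(num_experts)]
        self.gate = rng.normal(0, 0.02, (dim, num_experts))

    def __call__(self, feats):                    # feats: (batch, dim)
        weights = softmax(feats @ self.gate)      # per-sample expert weights
        logits = np.stack([feats @ w for w in self.experts], axis=1)
        return (weights[..., None] * logits).sum(axis=1)  # (batch, classes)

head = MoEHead(dim=1024, num_classes=8)           # 1024 = ViT-L/14 feature dim
out = head(rng.normal(size=(2, 1024)))
print(out.shape)  # (2, 8)
```

Each expert sees the same features but learns a different decision boundary, which is the "classifier diversity" the MoE head is meant to provide.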
Stage II: Modality Fusion and Temporal Consistency
- Visual modality: Multi-scale face re-cropping with averaged features for robust frame-level representation
- Audio modality: Frame-aligned Wav2Vec 2.0 features from short audio windows
- Lightweight gated fusion module integrating dual-modal features
- Inference-time temporal smoothing for consistency
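The fusion and smoothing steps above can be sketched as a lightweight gated fusion over frame-aligned features (a hypothetical NumPy sketch: the projection sizes, the sigmoid gate, the 768-dimensional audio input matching Wav2Vec 2.0 base, and the moving-average smoother are all illustrative assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GatedFusion:
    """Sketch of a lightweight gated fusion module: a sigmoid gate computed
    from the concatenated modalities decides, per dimension, how much of
    the visual vs. audio feature to keep."""
    def __init__(self, vis_dim, aud_dim, out_dim):
        self.vis_proj = rng.normal(0, 0.02, (vis_dim, out_dim))
        self.aud_proj = rng.normal(0, 0.02, (aud_dim, out_dim))
        self.gate_w = rng.normal(0, 0.02, (2 * out_dim, out_dim))

    def __call__(self, vis, aud):                 # (T, vis_dim), (T, aud_dim)
        v = vis @ self.vis_proj                   # frame-level visual features
        a = aud @ self.aud_proj                   # frame-aligned audio features
        g = sigmoid(np.concatenate([v, a], axis=-1) @ self.gate_w)
        return g * v + (1.0 - g) * a              # (T, out_dim)

def temporal_smooth(preds, window=5):
    """Inference-time temporal smoothing: per-channel moving average over a
    sliding window of frame-level outputs, for temporal consistency."""
    kernel = np.ones(window) / window
    return np.stack([np.convolve(preds[:, c], kernel, mode="same")
                     for c in range(preds.shape[1])], axis=1)

fusion = GatedFusion(vis_dim=1024, aud_dim=768, out_dim=256)
fused = fusion(rng.normal(size=(16, 1024)), rng.normal(size=(16, 768)))
print(fused.shape)  # (16, 256)
```

The gate lets the model lean on audio when faces are occluded and on vision when audio is noisy, while the smoother suppresses frame-to-frame flicker in the outputs.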
Training Innovation Uses Video Knowledge Graphs for Data Synthesis
Training proceeds in two phases. VST-SFT structurally adapts offline VideoLLMs to causal streaming reasoning; VST-RL then improves the model end-to-end through self-exploration in multi-turn video-interaction environments.
An automated data synthesis pipeline uses video knowledge graphs to generate high-quality streaming QA pairs with entity-relation grounded streaming Chain-of-Thought, enforcing multi-evidence reasoning.
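A toy version of knowledge-graph-driven QA synthesis might look like the following. Everything here is invented for illustration: the triple schema, the example events, and the question template are assumptions, not the paper's actual pipeline; the key idea shown is chaining time-ordered triples that share an entity so the streaming chain-of-thought must cite multiple pieces of evidence.

```python
# A tiny video knowledge graph: (subject, relation, object, timestamp_s).
# The events are fabricated examples for illustration only.
triples = [
    ("person_A", "picks_up", "red_cup", 12),
    ("person_A", "walks_to", "kitchen", 20),
    ("person_A", "puts_down", "red_cup", 31),
]

def synthesize_multi_evidence_qa(triples):
    """Chain two time-ordered triples sharing a subject into one QA pair
    whose streaming chain-of-thought cites both pieces of evidence."""
    qa_pairs = []
    for i, (s1, r1, o1, t1) in enumerate(triples):
        for s2, r2, o2, t2 in triples[i + 1:]:
            if s1 == s2 and t1 < t2:              # entity-linked, time-ordered
                question = (f"After {s1} {r1.replace('_', ' ')} the {o1}, "
                            f"what do they do next?")
                cot = (f"[t={t1}s] {s1} {r1.replace('_', ' ')} {o1}; "
                       f"[t={t2}s] {s1} {r2.replace('_', ' ')} {o2}.")
                qa_pairs.append({"question": question,
                                 "streaming_cot": cot,
                                 "answer": f"{r2.replace('_', ' ')} {o2}"})
    return qa_pairs

pairs = synthesize_multi_evidence_qa(triples)
print(len(pairs))          # 3 entity-linked, time-ordered pairs
print(pairs[0]["answer"])  # "walks to kitchen"
```

Because every synthesized answer is grounded in at least two timestamped graph edges, a model trained on such pairs cannot shortcut to single-frame perception; it must track entities across the stream.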
VST-7B Achieves Strong Performance With 15.7× Speed Improvement
VST-7B performance results:
- 79.5% on StreamingBench (official benchmark)
- 59.3% on OVO-Bench
- Competitive offline benchmark performance (Video-Holmes: +5.4% vs Video-R1)
- 15.7× faster response time compared to Video-R1
This represents a paradigm shift from perceive-then-reason to simultaneous perception and reasoning in streaming video. Authors Yiran Guan, Liang Yin, Dingkang Liang, and colleagues are releasing code and models at github.com/1ranGuan/VST.
Key Takeaways
- Video Streaming Thinking enables VideoLLMs to perform logical reasoning while processing streaming video in real time by amortizing reasoning latency over playback time
- VST-7B achieves 79.5% on StreamingBench and 59.3% on OVO-Bench while responding 15.7× faster than Video-R1
- Two-stage pipeline combines DINOv2 visual encoding, Wav2Vec 2.0 audio features, and gated fusion with temporal smoothing
- Automated data synthesis uses video knowledge graphs to generate streaming QA pairs with entity-relation grounded Chain-of-Thought reasoning
- Paradigm shift from perceive-then-reason to simultaneous perception and reasoning enables real-time video understanding applications