Video Streaming Thinking

VideoLLMs Can Watch and Think Simultaneously

1 Huazhong University of Science and Technology, VLR Lab

2 MiLM Plus, Xiaomi Inc.

* Equal Contribution    ✉ Corresponding Author

Video Streaming Thinking introduces a new paradigm for streaming video understanding that interleaves active reasoning with continuous video consumption, enabling amortized test-time scaling with real-time responsiveness.
Figure: VST Overview

Introduction

Existing online VideoLLMs focus on efficient streaming perception but lack explicit analytical reasoning. Offline VideoLLMs with Chain-of-Thought (CoT) can reason deeply, but incur high query-answer (QA) latency that violates real-time constraints. VST bridges this gap by shifting the LLM backend from passive waiting to active, intermittent reasoning during video consumption, implementing a thinking-while-watching mechanism inspired by human neural coupling.

VST achieves three core breakthroughs:

1. Streaming Thinking Paradigm: Interleaves autoregressive textual reasoning with real-time video consumption, maintaining a dual-memory system (short-term visual buffer + long-term textual semantic memory). Instead of deferring all reasoning until a user query arrives, VST continuously processes incoming video clips and produces intermediate streaming thoughts in real time.
2. Low QA Latency: By front-loading and amortizing the reasoning cost during video consumption, VST delivers better accuracy than offline CoT methods (e.g., Video-R1) with ~16× lower response latency (0.56s vs. 8.80s), making the final response both deeply grounded and instantly available.
3. Complete Training Pipeline: A two-stage post-training recipe combining VST-SFT and VST-RL, with an automated knowledge-graph-based data synthesis pipeline. Consistent improvements across 3B, 7B, and 32B model scales demonstrate the generalizability of the approach.
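The dual-memory system described above can be sketched as two bounded buffers. This is a minimal illustration, not the paper's implementation: the `DualMemory` class, its capacity values, and method names are all hypothetical; the paper only specifies a short-term visual buffer plus a long-term textual memory with first-in-first-out updates.

```python
from collections import deque

class DualMemory:
    """Hypothetical sketch of VST's dual-memory design.

    Capacities are illustrative assumptions, not values from the paper.
    """

    def __init__(self, visual_capacity=4, text_capacity=32):
        # Short-term visual buffer: only the most recent clip features.
        self.visual_buffer = deque(maxlen=visual_capacity)
        # Long-term textual semantic memory, evicted first-in-first-out.
        self.text_memory = deque(maxlen=text_capacity)

    def observe(self, clip_features, streaming_thought):
        # deque with maxlen silently drops the oldest entry when full,
        # which models the FIFO memory update.
        self.visual_buffer.append(clip_features)
        self.text_memory.append(streaming_thought)

    def context(self):
        # Context for the next reasoning step: retained thoughts plus
        # the latest visual features.
        return list(self.text_memory), list(self.visual_buffer)
```

Bounding both buffers is what keeps the multi-round conversation inside a constrained context window: textual memory grows much more slowly than raw visual tokens, so old clips can be summarized into thoughts and discarded.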

Through a simple yet effective design, VST demonstrates that VideoLLMs can watch and think simultaneously — achieving state-of-the-art results on both online benchmarks (StreamingBench, OVO-Bench) and offline benchmarks (VideoMME, LongVideoBench, VideoHolmes).

Method

VST operates as a multi-round video conversation within a constrained context window:

  1. Video clips arrive sequentially from the stream.
  2. At each interval, the LLM generates a streaming thought conditioned on the current clip and accumulated memory.
  3. A first-in-first-out memory update maintains long-term textual semantic memory.
  4. Upon receiving a user query, the model generates the final answer grounded in both the accumulated memory and the latest visual context.
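The four steps above can be sketched as a single loop. This is a hedged toy illustration: `streaming_thought` and `answer` are hypothetical stand-ins for the actual LLM calls, and the capacity constants are assumed values, not ones reported by the authors.

```python
def streaming_thought(clip, text_memory):
    # Placeholder for the LLM call that produces an intermediate
    # streaming thought from the current clip and accumulated memory.
    return f"thought on {clip} given {len(text_memory)} prior thoughts"

def answer(query, text_memory, visual_buffer):
    # Placeholder for the final answer, grounded in both the
    # accumulated textual memory and the latest visual context.
    return f"answer to '{query}' using {len(text_memory)} thoughts"

def run_stream(clips, query_at, query, memory_capacity=8, buffer_capacity=2):
    text_memory, visual_buffer = [], []
    for t, clip in enumerate(clips):
        # Step 1: a new clip arrives from the stream.
        visual_buffer.append(clip)
        del visual_buffer[:-buffer_capacity]   # keep only recent clips
        # Step 2: generate a streaming thought at this interval.
        text_memory.append(streaming_thought(clip, text_memory))
        # Step 3: first-in-first-out update of the textual memory.
        del text_memory[:-memory_capacity]
        # Step 4: answer immediately once the user query arrives.
        if t == query_at:
            return answer(query, text_memory, visual_buffer)
```

Because the reasoning has already been amortized into `text_memory` during streaming, the query-time call does no fresh chain-of-thought over the whole video, which is the source of the low QA latency.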
Figure: VST Pipeline

Case Study

The following case study demonstrates how VST's streaming thinking paradigm handles complex video understanding tasks:

Figure: VST Case Study

Model Zoo

Performance across multiple benchmarks and model sizes:

Model     OVO-Bench   StreamingBench   VideoMME   LongVideoBench   VideoHolmes
VST-3B    56.2        75.5             59.5       54.1             36.1
VST-7B    59.3        79.5             64.9       58.0             41.9
VST-32B   63.5        80.7             67.2       60.7             45.1

Citation

If you use Video Streaming Thinking in your research, please cite our paper:

@article{guan2026videostreamingthinking,
      title={Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously}, 
      author={Yiran Guan and Liang Yin and Dingkang Liang and Jianzhong Ju and Zhenbo Luo and Jian Luan and Yuliang Liu and Xiang Bai},
      journal={arXiv preprint arXiv:2603.12262},
      year={2026},
}