Video Streaming Thinking

VideoLLMs Can Watch and Think Simultaneously

1 Huazhong University of Science and Technology, VLR Lab

2 MiLM Plus, Xiaomi Inc.

* Equal Contribution    ✉ Corresponding Author

Video Streaming Thinking introduces a new paradigm for streaming video understanding that interleaves active reasoning with continuous video consumption, enabling amortized test-time scaling with real-time responsiveness.
Figure: VST Overview

Introduction

Existing online VideoLLMs focus on efficient streaming perception but lack explicit analytical reasoning. Offline VideoLLMs with Chain-of-Thought (CoT) can reason deeply, but incur high query-answer (QA) latency that violates real-time constraints. VST bridges this gap by shifting the LLM backend from passive waiting to active, intermittent reasoning during video consumption, implementing a thinking-while-watching mechanism inspired by human neural coupling.

VST achieves three core breakthroughs:

1. Streaming Thinking Paradigm: Interleaves autoregressive textual reasoning with real-time video consumption, maintaining a dual-memory system (short-term visual buffer + long-term textual semantic memory). Instead of deferring all reasoning until a user query arrives, VST continuously processes incoming video clips and produces intermediate streaming thoughts in real time.
2. Low QA Latency: By front-loading and amortizing the reasoning cost during video consumption, VST delivers better accuracy than offline CoT methods (e.g., Video-R1) with ~16× lower response latency (0.56s vs. 8.80s), making the final response both deeply grounded and instantly available.
3. Complete Training Pipeline: A two-stage post-training recipe combining VST-SFT and VST-RL, with an automated knowledge-graph-based data synthesis pipeline. Consistent improvements across 3B, 7B, and 32B model scales demonstrate the generalizability of the approach.
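The dual-memory system described above can be sketched as two bounded buffers. This is a minimal illustration, not the paper's implementation: the `DualMemory` class, its capacity values, and method names are all hypothetical; the paper only specifies a short-term visual buffer plus a long-term textual memory with first-in-first-out updates.

```python
from collections import deque

class DualMemory:
    """Hypothetical sketch of VST's dual-memory design.

    Capacities are illustrative assumptions, not values from the paper.
    """

    def __init__(self, visual_capacity=4, text_capacity=32):
        # Short-term visual buffer: only the most recent clip features.
        self.visual_buffer = deque(maxlen=visual_capacity)
        # Long-term textual semantic memory, evicted first-in-first-out.
        self.text_memory = deque(maxlen=text_capacity)

    def observe(self, clip_features, streaming_thought):
        # deque with maxlen silently drops the oldest entry when full,
        # which models the FIFO memory update.
        self.visual_buffer.append(clip_features)
        self.text_memory.append(streaming_thought)

    def context(self):
        # Context for the next reasoning step: retained thoughts plus
        # the latest visual features.
        return list(self.text_memory), list(self.visual_buffer)
```

Bounding both buffers is what keeps the multi-round conversation inside a constrained context window: textual memory grows much more slowly than raw visual tokens, so old clips can be summarized into thoughts and discarded.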

Through a simple yet effective design, VST demonstrates that VideoLLMs can watch and think simultaneously — achieving state-of-the-art results on both online benchmarks (StreamingBench, OVO-Bench) and offline benchmarks (VideoMME, LongVideoBench, VideoHolmes).

Method

VST operates as a multi-round video conversation within a constrained context window:

  1. Video clips arrive sequentially from the stream.
  2. At each interval, the LLM generates a streaming thought conditioned on the current clip and accumulated memory.
  3. A first-in-first-out memory update maintains long-term textual semantic memory.
  4. Upon receiving a user query, the model generates the final answer grounded in both the accumulated memory and the latest visual context.
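The four steps above can be sketched as a single loop. This is a hedged toy illustration: `streaming_thought` and `answer` are hypothetical stand-ins for the actual LLM calls, and the capacity constants are assumed values, not ones reported by the authors.

```python
def streaming_thought(clip, text_memory):
    # Placeholder for the LLM call that produces an intermediate
    # streaming thought from the current clip and accumulated memory.
    return f"thought on {clip} given {len(text_memory)} prior thoughts"

def answer(query, text_memory, visual_buffer):
    # Placeholder for the final answer, grounded in both the
    # accumulated textual memory and the latest visual context.
    return f"answer to '{query}' using {len(text_memory)} thoughts"

def run_stream(clips, query_at, query, memory_capacity=8, buffer_capacity=2):
    text_memory, visual_buffer = [], []
    for t, clip in enumerate(clips):
        # Step 1: a new clip arrives from the stream.
        visual_buffer.append(clip)
        del visual_buffer[:-buffer_capacity]   # keep only recent clips
        # Step 2: generate a streaming thought at this interval.
        text_memory.append(streaming_thought(clip, text_memory))
        # Step 3: first-in-first-out update of the textual memory.
        del text_memory[:-memory_capacity]
        # Step 4: answer immediately once the user query arrives.
        if t == query_at:
            return answer(query, text_memory, visual_buffer)
```

Because the reasoning has already been amortized into `text_memory` during streaming, the query-time call does no fresh chain-of-thought over the whole video, which is the source of the low QA latency.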
Figure: VST Pipeline

Case Study

The following case study demonstrates how VST's streaming thinking paradigm handles complex video understanding tasks:

Figure: VST Case Study

Model Zoo

Performance across multiple benchmarks and model sizes:

Model     OVO-Bench   StreamingBench   VideoMME   LongVideoBench   VideoHolmes
VST-3B    56.2        75.5             59.5       54.1             36.1
VST-7B    59.3        79.5             64.9       58.0             41.9
VST-32B   63.5        80.7             67.2       60.7             45.1

Citation

If you use Video Streaming Thinking in your research, please cite our paper:

@article{guan2026videostreamingthinking,
      title={Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously}, 
      author={Yiran Guan and Liang Yin and Dingkang Liang and Jianzhong Ju and Zhenbo Luo and Jian Luan and Yuliang Liu and Xiang Bai},
      journal={arXiv preprint arXiv:2603.12262},
      year={2026},
}