Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously
1 Huazhong University of Science and Technology, VLR Lab
2 MiLM Plus, Xiaomi Inc.
* Equal Contribution ✉ Corresponding Author
Existing online VideoLLMs focus on efficient streaming perception but lack explicit analytical reasoning. Offline VideoLLMs with Chain-of-Thought (CoT) can reason deeply, but they incur high query-answer (QA) latency that violates real-time constraints. Video Streaming Thinking (VST) bridges this gap by shifting the LLM backend from passive waiting to active, intermittent reasoning during video consumption, implementing a thinking-while-watching mechanism inspired by human neural coupling.
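The latency argument can be illustrated with a toy timing model. All numbers below are illustrative assumptions, not measurements from the paper: the point is only that amortizing reasoning over the watching phase leaves a much smaller step at query time.

```python
# Toy latency model for deferred vs. amortized reasoning.
# All durations are illustrative assumptions (seconds).
watch_time = 60.0   # length of the video being streamed
think_time = 12.0   # total reasoning compute the question needs

# Offline CoT: all reasoning starts only after the query arrives,
# so the user waits for the full reasoning budget.
offline_qa_latency = think_time

# Thinking-while-watching: reasoning is interleaved with perception
# during the 60 s of watching, so only a short final answer step
# remains once the query arrives (hypothetical value).
answer_step = 1.5
streaming_qa_latency = answer_step

print(offline_qa_latency, streaming_qa_latency)
```

Under these (assumed) numbers the streaming paradigm cuts QA latency from 12.0 s to 1.5 s, since the 12 s of thinking overlaps with the 60 s of watching instead of following it.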
Through a simple yet effective design, VST demonstrates that VideoLLMs can watch and think simultaneously, achieving state-of-the-art results on both online benchmarks (StreamingBench, OVO-Bench) and offline benchmarks (VideoMME, LongVideoBench, VideoHolmes).
VST operates as a multi-round video conversation within a constrained context window: the model alternates between ingesting incoming video segments and emitting intermittent reasoning turns, so analysis accumulates during watching rather than being deferred until the query arrives.
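The loop above can be sketched in a few lines. This is a minimal toy sketch, not the paper's implementation: the turn structure, the `think_every` cadence, and the eviction policy (a fixed-size rolling window) are all illustrative assumptions.

```python
from collections import deque

def watch_and_think(segments, think_every=2, context_budget=6):
    """Toy streaming-thinking loop: interleave perception turns with
    intermittent reasoning turns inside a bounded multi-round context.
    Oldest turns are evicted automatically to respect the budget."""
    context = deque(maxlen=context_budget)  # rolling conversation window
    for i, seg in enumerate(segments, 1):
        context.append(("video", seg))      # perception turn: ingest a segment
        if i % think_every == 0:            # intermittent reasoning turn
            seen = [s for kind, s in context if kind == "video"]
            context.append(("thought", f"analysis of {len(seen)} recent segments"))
    return list(context)

# Stream 8 segments; the final context holds at most 6 turns,
# mixing recent video segments with interleaved thoughts.
turns = watch_and_think([f"seg{k}" for k in range(8)])
print(turns)
```

Because reasoning turns live inside the same bounded window as the video turns, the model's "thoughts" survive eviction alongside recent frames, which is what lets analysis accumulate without the context growing with video length.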
A case study (figure) demonstrates how VST's streaming thinking paradigm handles complex video understanding tasks.
Performance across multiple benchmarks and model sizes:
| Model | OVO-Bench | StreamingBench | VideoMME | LongVideoBench | VideoHolmes |
|---|---|---|---|---|---|
| VST-3B | 56.2 | 75.5 | 59.5 | 54.1 | 36.1 |
| VST-7B | 59.3 | 79.5 | 64.9 | 58.0 | 41.9 |
| VST-32B | 63.5 | 80.7 | 67.2 | 60.7 | 45.1 |
If you use Video Streaming Thinking in your research, please cite our paper:
```bibtex
@article{guan2026videostreamingthinking,
  title={Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously},
  author={Yiran Guan and Liang Yin and Dingkang Liang and Jianzhong Ju and Zhenbo Luo and Jian Luan and Yuliang Liu and Xiang Bai},
  journal={arXiv preprint arXiv:2603.12262},
  year={2026},
}
```