Streaming thoughts are generated continuously as the video plays. When a query arrives, the answer is grounded in that accumulated reasoning and returned with low latency.
Online VideoLLMs perceive but don't reason. Offline models reason deeply but break real-time constraints. VST bridges this gap with a thinking-while-watching mechanism inspired by human neural coupling.
Interleaves autoregressive reasoning with real-time video via a dual-memory system: short-term visual buffer + long-term textual semantic memory.
Amortizes reasoning cost during playback. Better accuracy than offline CoT with 0.56s response latency vs. 8.80s (~16x faster).
Two-stage post-training (VST-SFT + VST-RL) with automated knowledge-graph data synthesis. Scales from 3B to 32B.
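The "~16x faster" figure follows directly from the two response latencies quoted above. A quick check, using only those two numbers (the variable names are illustrative):

```python
# Response latencies quoted above, in seconds.
offline_cot_latency_s = 8.80
vst_latency_s = 0.56

# Ratio of offline CoT latency to VST latency.
speedup = offline_cot_latency_s / vst_latency_s
print(f"{speedup:.1f}x")  # 15.7x, which rounds to ~16x
```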
VST operates as a multi-round video conversation within a constrained context window:
As clips arrive, VST progressively builds scene understanding through intermediate streaming thoughts. When queried, it leverages accumulated reasoning for a grounded, low-latency response — without replaying the video.
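The thinking-while-watching loop described above can be sketched roughly as follows. This is a minimal illustration and not the released VST code: the class name `StreamingThinker`, the `think_fn` / `answer_fn` callables, and the buffer capacity are all hypothetical stand-ins for the model's actual reasoning and answering steps.

```python
from collections import deque


class StreamingThinker:
    """Toy dual-memory loop (hypothetical; not the VST implementation)."""

    def __init__(self, think_fn, answer_fn, visual_capacity=4):
        # Short-term visual buffer: keeps only the most recent clips.
        self.visual_buffer = deque(maxlen=visual_capacity)
        # Long-term textual memory: streaming thoughts accumulated so far.
        self.text_memory = []
        self.think_fn = think_fn
        self.answer_fn = answer_fn

    def observe(self, clip):
        """Called once per incoming clip, while the video is still playing."""
        self.visual_buffer.append(clip)
        thought = self.think_fn(list(self.visual_buffer), list(self.text_memory))
        self.text_memory.append(thought)
        return thought

    def answer(self, query):
        """Called when a query arrives; no earlier clip is replayed."""
        return self.answer_fn(query, list(self.visual_buffer), list(self.text_memory))


# Stand-ins for the model calls, for illustration only.
def toy_think(clips, memory):
    return f"thought {len(memory) + 1}: saw {clips[-1]}"


def toy_answer(query, clips, memory):
    return f"{query} -> based on {len(memory)} thoughts and {len(clips)} buffered clips"


thinker = StreamingThinker(toy_think, toy_answer, visual_capacity=2)
for clip in ["clip_0", "clip_1", "clip_2"]:
    thinker.observe(clip)

print(thinker.answer("What happened?"))
```

In this sketch the reasoning cost is paid inside `observe`, one clip at a time, so `answer` only has to read the bounded visual buffer and the stored thoughts. The oldest clip is evicted from the visual buffer once capacity is reached, while its textual thought stays in long-term memory.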
All weights are publicly available. Best per column in bold.
| Model | Weights | OVO-Bench | StreamingBench | VideoMME | LongVideoBench | VideoHolmes |
|---|---|---|---|---|---|---|
| VST-3B | 🤗 Download | 56.2 | 75.5 | 59.5 | 54.1 | 36.1 |
| VST-7B | 🤗 Download | 59.3 | 79.5 | 64.9 | 58.0 | 41.9 |
| VST-32B | 🤗 Download | **63.5** | **80.7** | **67.2** | **60.7** | **45.1** |
If you find VST useful in your research, please consider citing:
@inproceedings{guan2026videostreamingthinking,
title={Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously},
author={Yiran Guan and Liang Yin and Dingkang Liang and Jianzhong Ju and Zhenbo Luo and Jian Luan and Yuliang Liu and Xiang Bai},
booktitle={European Conference on Computer Vision (ECCV)},
year={2026},
}