Video Streaming Thinking

VideoLLMs that interleave active reasoning with continuous video consumption — enabling real-time understanding with deep, amortized thinking.

Yiran Guan^1,2* Liang Yin^1,2* Dingkang Liang^1,2 Jianzhong Ju² Zhenbo Luo² Jian Luan² Yuliang Liu¹ Xiang Bai¹^✉

¹ HUST, VLR Lab   ² MiLM Plus, Xiaomi

* Equal Contribution ✉ Corresponding Author

🎉 Accepted to ECCV 2026

Demo

Watch VST Think in Real Time

Streaming thoughts are generated continuously as the video plays. When a query arrives, the answer is grounded in accumulated reasoning — delivered instantly.

Overview

Thinking While Watching

Online VideoLLMs perceive but don't reason. Offline models reason deeply but break real-time constraints. VST bridges this gap with a thinking-while-watching mechanism inspired by human neural coupling.

Streaming Thinking

Interleaves autoregressive reasoning with real-time video input via a dual-memory system: a short-term visual buffer and a long-term textual semantic memory.

Ultra-Low Latency

Amortizes reasoning cost over playback, so little thinking remains at query time. Achieves better accuracy than offline CoT with a response latency of 0.56 s versus 8.80 s.

~16x faster
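
The speedup figure follows directly from the two reported latencies. A minimal check, where the variable names are illustrative and only the two latency values come from this page:

```python
# Reported response latencies in seconds (values taken from this page).
offline_cot_latency_s = 8.80
vst_latency_s = 0.56

# Ratio of the two latencies; this is the "~16x" figure before rounding.
speedup = offline_cot_latency_s / vst_latency_s
print(f"speedup = {speedup:.1f}x")
```

The ratio comes out slightly under 16, which the page rounds to "~16x".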

Complete Pipeline

Two-stage post-training (VST-SFT followed by VST-RL) with automated knowledge-graph data synthesis. Scales from 3B to 32B parameters.

Method

Architecture

VST operates as a multi-round video conversation within a constrained context window:

1. Video clips arrive sequentially from the stream.
2. The LLM generates a streaming thought conditioned on the current clip and accumulated memory.
3. A FIFO memory update maintains long-term textual semantic memory while managing context length.
4. Upon a user query, the model generates the final answer grounded in accumulated memory and visual context.
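
The four steps above can be sketched as a small control loop. This is a toy illustration under stated assumptions, not the released implementation: `StreamingThinker`, `think`, `answer`, and `max_thoughts` are hypothetical names, the two callables stand in for the VideoLLM, clips are plain strings, and the memory budget is counted in thoughts rather than tokens.

```python
from collections import deque
from typing import Callable, Deque, List


class StreamingThinker:
    """Toy version of the loop: clip in, thought out, FIFO memory update."""

    def __init__(
        self,
        think: Callable[[str, List[str]], str],        # hypothetical stand-in for the thought step
        answer: Callable[[str, str, List[str]], str],  # hypothetical stand-in for the answer step
        max_thoughts: int = 4,                         # assumed budget, counted in thoughts
    ) -> None:
        self.think = think
        self.answer = answer
        # A deque with maxlen drops its oldest entry on overflow (first in, first out).
        self.memory: Deque[str] = deque(maxlen=max_thoughts)
        self.current_clip = ""  # short-term visual buffer, reduced here to the latest clip

    def watch(self, clip: str) -> str:
        """Steps 1-3: take the next clip, think about it, update the memory."""
        self.current_clip = clip
        thought = self.think(clip, list(self.memory))
        self.memory.append(thought)
        return thought

    def query(self, question: str) -> str:
        """Step 4: answer from the accumulated thoughts plus the current clip."""
        return self.answer(question, self.current_clip, list(self.memory))


def toy_think(clip: str, memory: List[str]) -> str:
    return f"thought about {clip} given {len(memory)} earlier thoughts"


def toy_answer(question: str, clip: str, memory: List[str]) -> str:
    return f"{question} -> answered from {len(memory)} thoughts and {clip}"


vst = StreamingThinker(toy_think, toy_answer, max_thoughts=3)
for i in range(5):
    vst.watch(f"clip{i}")

print(list(vst.memory))
print(vst.query("What happened?"))
```

Because the thoughts are produced while clips arrive, `query` only has to run the answer step, which is where the latency savings come from. A real system would presumably evict by token count instead of thought count so that the context window stays bounded.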

Case Study

Qualitative Analysis

As clips arrive, VST progressively builds scene understanding through intermediate streaming thoughts. When queried, it leverages accumulated reasoning for a grounded, low-latency response — without replaying the video.

Benchmarks

Model Zoo

All weights are publicly available. VST-32B scores best in every column.

Model	Weights	OVO-Bench	StreamingBench	VideoMME	LongVideoBench	VideoHolmes
VST-3B	🤗 Download	56.2	75.5	59.5	54.1	36.1
VST-7B	🤗 Download	59.3	79.5	64.9	58.0	41.9
VST-32B	🤗 Download	63.5	80.7	67.2	60.7	45.1

Downloads

Resources

VST-3B

Lightweight model for efficient deployment.

~8 GB · Safetensors

VST-7B

Balanced performance and efficiency.

~16 GB · Safetensors

VST-32B

Best results across all benchmarks.

~67 GB · Safetensors

Training Data

Full SFT + RL data with preparation scripts.

SFT · RL · Scripts

Citation

BibTeX

If you find VST useful in your research, please consider citing:

@inproceedings{guan2026videostreamingthinking,
    title={Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously},
    author={Yiran Guan and Liang Yin and Dingkang Liang and Jianzhong Ju and Zhenbo Luo and Jian Luan and Yuliang Liu and Xiang Bai},
    booktitle={European Conference on Computer Vision (ECCV)},
    year={2026},
}
