Video Streaming Thinking

VideoLLMs that interleave active reasoning with continuous video consumption — enabling real-time understanding with deep, amortized thinking.

HUST, VLR Lab MiLM Plus, Xiaomi
* Equal Contribution  ✉ Corresponding Author
🎉 Accepted to ECCV 2026
VST Overview
Demo

Watch VST Think in Real Time

Streaming thoughts are generated continuously as the video plays. When a query arrives, the answer is grounded in accumulated reasoning — delivered instantly.


Thinking While Watching

Online VideoLLMs perceive but don't reason. Offline models reason deeply but break real-time constraints. VST bridges this gap with a thinking-while-watching mechanism inspired by human neural coupling.

1. Streaming Thinking

Interleaves autoregressive reasoning with real-time video via a dual-memory system: a short-term visual buffer plus a long-term textual semantic memory.

2. Ultra-Low Latency

Amortizes reasoning cost during playback. Better accuracy than offline CoT with 0.56s response latency vs. 8.80s (~16x faster).

3. Complete Pipeline

Two-stage post-training (VST-SFT + VST-RL) with automated knowledge-graph data synthesis. Scales from 3B to 32B.
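The "~16x faster" figure follows directly from the two latencies quoted above. The snippet below is only a quick arithmetic check; the variable names are illustrative and not taken from the VST codebase.

```python
# Latencies as reported on this page (seconds).
offline_cot_latency_s = 8.80  # offline chain-of-thought baseline
vst_latency_s = 0.56          # VST response latency at query time

# Ratio of baseline latency to VST latency.
speedup = offline_cot_latency_s / vst_latency_s
print(f"speedup: {speedup:.1f}x")  # prints "speedup: 15.7x", i.e. roughly 16x
```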

Method

Architecture

VST operates as a multi-round video conversation within a constrained context window:

  • Video clips arrive sequentially from the stream.
  • The LLM generates a streaming thought conditioned on the current clip and accumulated memory.
  • A FIFO memory update maintains long-term textual semantic memory while managing context length.
  • Upon a user query, the model generates the final answer grounded in accumulated memory and visual context.
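The four steps above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `StreamingThinker`, `toy_think`, and `toy_answer` are hypothetical names, the two callables stand in for the LLM, clips are represented as plain strings, and memory is capped by entry count rather than by token length.

```python
from collections import deque
from typing import Callable, Deque, List


class StreamingThinker:
    """Toy dual-memory loop: short-term clip buffer + long-term FIFO of text thoughts."""

    def __init__(
        self,
        think: Callable[[str, List[str]], str],              # hypothetical stand-in for the thought step
        answer: Callable[[str, List[str], List[str]], str],  # hypothetical stand-in for the answer step
        max_thoughts: int = 4,  # assumed cap on long-term textual memory entries
        max_clips: int = 2,     # assumed cap on the short-term visual buffer
    ) -> None:
        self.think = think
        self.answer = answer
        # deque(maxlen=...) drops the oldest entry on overflow, which gives FIFO eviction.
        self.thoughts: Deque[str] = deque(maxlen=max_thoughts)
        self.clips: Deque[str] = deque(maxlen=max_clips)

    def observe(self, clip: str) -> str:
        """Consume one clip and generate a thought from it plus the accumulated memory."""
        self.clips.append(clip)
        thought = self.think(clip, list(self.thoughts))
        self.thoughts.append(thought)
        return thought

    def query(self, question: str) -> str:
        """Answer from accumulated thoughts and recent clips, without replaying the video."""
        return self.answer(question, list(self.thoughts), list(self.clips))


def toy_think(clip: str, memory: List[str]) -> str:
    return f"thought about {clip} (memory size {len(memory)})"


def toy_answer(question: str, memory: List[str], clips: List[str]) -> str:
    return f"{question} -> {len(memory)} thoughts, {len(clips)} clips"


vst = StreamingThinker(toy_think, toy_answer, max_thoughts=3, max_clips=2)
for i in range(5):
    vst.observe(f"clip{i}")

result = vst.query("what happened?")
print(result)
```

Because the thoughts are produced while clips arrive, `query` only pays for the answer step. A real system would presumably evict by token budget and might summarize old thoughts rather than drop them; the count-based FIFO here is just the simplest stand-in.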
VST Pipeline

Qualitative Analysis

As clips arrive, VST progressively builds scene understanding through intermediate streaming thoughts. When queried, it leverages accumulated reasoning for a grounded, low-latency response — without replaying the video.

VST Case Study
Benchmarks

Model Zoo

All weights are publicly available. VST-32B achieves the best score on every benchmark.

Model    Weights      OVO-Bench  StreamingBench  VideoMME  LongVideoBench  VideoHolmes
VST-3B   🤗 Download  56.2       75.5            59.5      54.1            36.1
VST-7B   🤗 Download  59.3       79.5            64.9      58.0            41.9
VST-32B  🤗 Download  63.5       80.7            67.2      60.7            45.1

Resources

Citation

BibTeX

If you find VST useful in your research, please consider citing:

@inproceedings{guan2026videostreamingthinking,
    title={Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously},
    author={Yiran Guan and Liang Yin and Dingkang Liang and Jianzhong Ju and Zhenbo Luo and Jian Luan and Yuliang Liu and Xiang Bai},
    booktitle={European Conference on Computer Vision (ECCV)},
    year={2026},
}