thinkomni

ThinkOmni is a training-free framework that enhances omni-modal LLMs (OLLMs) with the reasoning ability of large reasoning models (LRMs) via guidance decoding.
Instead of additional finetuning, ThinkOmni integrates an off-the-shelf LRM at decoding time and adaptively balances perception vs. reasoning signals for robust multi-modal reasoning.

arXiv: https://arxiv.org/abs/2602.23306

Highlights

Training-free omni-modal reasoning boost: no SFT/RFT required.
LRM-as-a-Guide: uses an off-the-shelf reasoning LLM to guide OLLM decoding.
Stepwise Contrastive Scaling (SCS): automatically adjusts guidance strength step-by-step.

Installation

Set up environment

# clone
git clone https://github.com/1ranGuan/thinkomni.git
cd thinkomni

# create environment
conda create -n thinkomni python=3.10 -y
conda activate thinkomni

# install packages
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

Prepare data

The scripts use HuggingFace datasets (for example: MathVista / MathVerse / MathVision).
Some datasets are expected in local folders. You can download them from https://huggingface.co/datasets/Catalan258/thinkomni_eval
- ./dataset/MMAU
- ./dataset/OmniBench
- ./dataset/Daily-Omni

Quick Start

Inference

bash scripts/inference.sh

Evaluation

Set your API URL / key in scripts/eval.sh, then run:

bash scripts/eval.sh

Citation

If you find this work useful, please cite:

```bibtex @inproceedings{guan2026thinkomni, title={ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding}, author={Guan, Yiran and Tu, Sifan and Liang, Dingkang and Zhu, Linghao and Ju, Jianzhong and Luo, Zhenbo and Luan, Jian and Liu, Yuliang and Bai, Xiang}, booktitle={International Conference on Learning Representations (ICLR)}, year={2026} }

This site is open source. Improve this page.