**Yihe Deng$^\dagger$, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, Kai-Wei Chang**
$\dagger$: Project lead
Date: 03/20/2025
<aside> 🗒️
Summary
The recent DeepSeek-R1 tech report demonstrated how reinforcement learning (RL) effectively elicits complex reasoning behaviors in LLMs, such as self-verification and self-correction, encapsulated by the special tokens `<think>` and `</think>`. These reasoning abilities significantly improved LLM performance on challenging textual QA benchmarks such as MATH and AIME.
Motivated by these results, we investigated: (1) whether similar reasoning capabilities can be introduced into vision-language LLMs, and (2) whether these reasoning enhancements improve model performance on challenging VQA tasks. Specifically:
We use supervised fine-tuning (SFT) to instill the reasoning structure and RL for exploration.
Iterating these two stages, with each RL-improved model serving as the basis for generating a refined SFT dataset for the next iteration, we developed a vision-language LLM with enhanced reasoning capabilities and improved performance on MathVista, MathVerse, and MathVision.
</aside>
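The iterative SFT-then-RL recipe summarized above can be sketched as a simple loop. The three functions below are hypothetical placeholders for the real data-distillation and training steps, not the released training code; they only illustrate how each RL-improved model seeds the next round's SFT data.

```python
# Skeleton of the iterative SFT -> RL recipe. All three functions are
# hypothetical stand-ins that return labels instead of real models/datasets.

def distill_sft_data(model: str, iteration: int) -> str:
    """Generate a refined SFT dataset using the current (RL-improved) model."""
    return f"sft_data_iter{iteration}_from_{model}"

def train_sft(base_model: str, sft_data: str) -> str:
    """SFT stage: teach the model the reasoning structure."""
    return f"{base_model}+sft"

def train_rl(model: str) -> str:
    """RL stage: explore and reinforce verifiably correct reasoning."""
    return f"{model}+rl"

model = "Qwen2.5-VL-7B-Instruct"
for it in range(1, 4):  # three SFT -> RL rounds
    data = distill_sft_data(model, it)
    model = train_rl(train_sft(model, data))
```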
Our OpenVLThinker-7B example output:

Performance at a glance:
For evaluation, we extract the model's final answers and verify them heuristically using both exact matching and the `grade_answer()` function from MathRuler. We use the same inference hyperparameters as suggested by Qwen and successfully reproduced Qwen2.5-VL-7B's reported 68.5% accuracy on MathVista. On MathVerse, we additionally verified answers with GPT-4 due to the more diverse answer formats of its free-form questions.
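A minimal sketch of this heuristic verification, assuming light string normalization plus a numeric fallback; the actual pipeline uses exact matching together with MathRuler's `grade_answer()`, which additionally handles LaTeX and richer mathematical equivalence. `normalize` and `is_correct` here are illustrative helpers, not MathRuler's API.

```python
# Sketch of heuristic answer verification: exact match after light
# normalization, with a numeric-equivalence fallback. This approximates,
# but does not reproduce, MathRuler's grade_answer().

def normalize(ans: str) -> str:
    """Hypothetical helper: trim, lowercase, strip common math wrappers."""
    ans = ans.strip().lower()
    for tok in ("$", "\\%", "%"):
        ans = ans.replace(tok, "")
    return ans.rstrip(".")

def is_correct(prediction: str, ground_truth: str) -> bool:
    """Exact match after normalization, then numeric comparison as fallback."""
    if normalize(prediction) == normalize(ground_truth):
        return True
    try:
        return abs(float(normalize(prediction)) - float(normalize(ground_truth))) < 1e-6
    except ValueError:
        return False

print(is_correct("$0.50$", "0.5"))  # numeric equivalence
print(is_correct("B.", "b"))        # exact match after normalization
```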
**Qwen2.5-VL-7B-Instruct for Complex Reasoning**

We investigated the potential of distilling reasoning from pure-text R1 models into LVLMs like Qwen2.5-VL-7B-Instruct. Ideally, leveraging GPT-4o for comprehensive image descriptions and DeepSeek-R1 for advanced reasoning would yield the best results; however, due to budget constraints, we adopted the following simplified setting:
- Image captioning: Qwen2.5-VL-3B-Instruct
- Reasoning: DeepSeek-R1-Distill-14B
The generated captions were used to sample k=4 responses from DeepSeek-R1-Distill-14B, from which we selected the shortest reasoning path leading to the correct final answer. Answers were verified through heuristic matching (exact match or via `grade_answer()` from MathRuler).