**Yihe Deng$^\dagger$, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, Kai-Wei Chang**
$\dagger$: Project lead
Date: 03/20/2025
<aside> 🗒️
Summary
The recent DeepSeek-R1 tech report demonstrated how reinforcement learning (RL) effectively elicits complex reasoning behaviors in LLMs, such as self-verification and self-correction, encapsulated by the special tokens `<think>` and `</think>`. These reasoning abilities significantly improved LLM performance on challenging textual QA benchmarks such as MATH and AIME.
Motivated by these results, we investigated: (1) whether similar reasoning capabilities can be introduced into vision-language LLMs, and (2) whether these reasoning enhancements improve model performance on challenging VQA tasks. Specifically:
We use supervised fine-tuning (SFT) to instill the reasoning structure and RL for exploration.
Iterating these two stages, with each RL-improved model serving as the basis for generating a refined SFT dataset for the next iteration, we developed a vision-language LLM with enhanced reasoning capabilities and improved performance on MathVista, MathVerse, and MathVision.
</aside>
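The iterative SFT-then-RL recipe summarized above can be sketched as a simple loop. The three functions below are hypothetical placeholders for the real data-distillation and training steps, not the released training code; they only illustrate how each RL-improved model seeds the next round's SFT data.

```python
# Skeleton of the iterative SFT -> RL recipe. All three functions are
# hypothetical stand-ins that return labels instead of real models/datasets.

def distill_sft_data(model: str, iteration: int) -> str:
    """Generate a refined SFT dataset using the current (RL-improved) model."""
    return f"sft_data_iter{iteration}_from_{model}"

def train_sft(base_model: str, sft_data: str) -> str:
    """SFT stage: teach the model the reasoning structure."""
    return f"{base_model}+sft"

def train_rl(model: str) -> str:
    """RL stage: explore and reinforce verifiably correct reasoning."""
    return f"{model}+rl"

model = "Qwen2.5-VL-7B-Instruct"
for it in range(1, 4):  # three SFT -> RL rounds
    data = distill_sft_data(model, it)
    model = train_rl(train_sft(model, data))
```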
Our OpenVLThinker-7B example output:

Performance at a glance:
For evaluation, we extract the model's final answers and verify them heuristically using both exact matching and the `grade_answer()` function from MathRuler. We use the same inference hyperparameters as suggested by Qwen and successfully reproduced Qwen2.5-VL-7B's reported 68.5% accuracy on MathVista. On MathVerse, we additionally verified answers with GPT-4 due to the more diverse answer formats of its free-form questions.
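A minimal sketch of this heuristic verification, assuming light string normalization plus a numeric fallback; the actual pipeline uses exact matching together with MathRuler's `grade_answer()`, which additionally handles LaTeX and richer mathematical equivalence. `normalize` and `is_correct` here are illustrative helpers, not MathRuler's API.

```python
# Sketch of heuristic answer verification: exact match after light
# normalization, with a numeric-equivalence fallback. This approximates,
# but does not reproduce, MathRuler's grade_answer().

def normalize(ans: str) -> str:
    """Hypothetical helper: trim, lowercase, strip common math wrappers."""
    ans = ans.strip().lower()
    for tok in ("$", "\\%", "%"):
        ans = ans.replace(tok, "")
    return ans.rstrip(".")

def is_correct(prediction: str, ground_truth: str) -> bool:
    """Exact match after normalization, then numeric comparison as fallback."""
    if normalize(prediction) == normalize(ground_truth):
        return True
    try:
        return abs(float(normalize(prediction)) - float(normalize(ground_truth))) < 1e-6
    except ValueError:
        return False

print(is_correct("$0.50$", "0.5"))  # numeric equivalence
print(is_correct("B.", "b"))        # exact match after normalization
```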
**Qwen2.5-VL-7B-Instruct for Complex Reasoning**

We investigated the potential of distilling reasoning from pure-text R1 models into LVLMs like Qwen2.5-VL-7B-Instruct. Ideally, leveraging GPT-4o for comprehensive image descriptions and DeepSeek-R1 for advanced reasoning would yield the best results; however, due to budget constraints, we adopted the following simplified setting:
- Image captioning: Qwen2.5-VL-3B-Instruct
- Reasoning: DeepSeek-R1-Distill-14B
The generated captions were used to sample k=4 responses from DeepSeek-R1-Distill-14B, from which we selected the shortest reasoning path leading to the correct final answer. Answers were verified through heuristic matching (exact match or via `grade_answer()` from MathRuler).