**Yihe Deng$^\dagger$, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, Kai-Wei Chang**

$\dagger$: Project lead

Date: 03/20/2025

[Paper] [Model] [GitHub]




Summary

The recent DeepSeek-R1 tech report demonstrated how reinforcement learning (RL) can effectively elicit complex reasoning behaviors in LLMs, such as self-verification and self-correction, encapsulated between the special tokens <think> and </think>. These reasoning abilities significantly improved LLM performance on challenging text-only math benchmarks such as MATH and AIME.

Motivated by these results, we set out to investigate: (1) whether similar reasoning capabilities can be introduced into vision-language LLMs, and (2) whether these reasoning enhancements improve model performance on challenging VQA tasks.

Performance at a glance:


For evaluation, we extract the model’s final answers and perform heuristic verification using both exact matching and the grade_answer() function from MathRuler. We use the same inference hyperparameters as recommended by Qwen and successfully reproduced Qwen2.5-VL-7B’s reported performance of 68.5% on MathVista. On MathVerse, we use GPT-4 for answer verification due to the more diverse answer formats of its free-form questions.
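A minimal sketch of this verification step, assuming MathRuler exposes grade_answer() under mathruler.grader and that final answers are wrapped in \boxed{}; the extraction helper and example below are illustrative, not the exact evaluation code:

```python
# Minimal verification sketch. The mathruler import path and the \boxed{} extraction
# are assumptions for illustration, not the exact evaluation code used here.
import re

from mathruler.grader import grade_answer  # assumed module path for MathRuler


def extract_final_answer(response: str) -> str:
    """Take the last \\boxed{...} span as the final answer, falling back to the raw text."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else response.strip()


def is_correct(response: str, ground_truth: str) -> bool:
    """Exact string match first, then MathRuler's heuristic grading as a fallback."""
    prediction = extract_final_answer(response)
    if prediction == ground_truth.strip():
        return True
    return grade_answer(prediction, ground_truth)


# Example: equivalent forms are left to grade_answer() to reconcile.
print(is_correct(r"... so the probability is \boxed{1/2}", "1/2"))
```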


1. Enabling Qwen2.5-VL-7B-Instruct for Complex Reasoning

We investigated the potential of distilling reasoning from pure-text R1 models into LVLMs like Qwen2.5-VL-7B-Instruct. Ideally, leveraging GPT-4o for comprehensive image descriptions and DeepSeek-R1 for advanced reasoning would yield the best results; however, due to budget constraints, we adopted the following simplified setting:

Image captions were first generated and supplied as textual context to DeepSeek-R1-Distill-14B, from which we sampled k=4 responses and selected the shortest reasoning path that reached the correct final answer. Answers were verified through heuristic matching (exact match or grade_answer() from MathRuler), as in the evaluation sketch above.
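A minimal sketch of this distillation step, assuming an OpenAI-compatible endpoint (e.g. a local vLLM server) serving the distilled R1 model; the endpoint, model name, prompt wording, and helper names are illustrative assumptions rather than the actual pipeline:

```python
# Illustrative sketch of caption-conditioned reasoning distillation. The endpoint, model
# name, prompt wording, and helper names are assumptions, not the actual pipeline code.
import re
from typing import Optional

from openai import OpenAI
from mathruler.grader import grade_answer  # assumed module path, as in the sketch above

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # e.g. a local vLLM server


def final_answer(trace: str) -> str:
    matches = re.findall(r"\\boxed\{([^{}]*)\}", trace)
    return matches[-1].strip() if matches else trace.strip()


def distill_trace(caption: str, question: str, ground_truth: str, k: int = 4) -> Optional[str]:
    """Sample k reasoning traces conditioned on the image caption; keep the shortest correct one."""
    prompt = (
        f"Image description: {caption}\n\n"
        f"Question: {question}\n"
        "Reason step by step and put the final answer in \\boxed{}."
    )
    correct_traces = []
    for _ in range(k):
        response = client.chat.completions.create(
            model="DeepSeek-R1-Distill-Qwen-14B",  # placeholder model id
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
        )
        trace = response.choices[0].message.content
        answer = final_answer(trace)
        if answer == ground_truth.strip() or grade_answer(answer, ground_truth):
            correct_traces.append(trace)
    # Shortest correct reasoning path, or None if no sample verified.
    return min(correct_traces, key=len) if correct_traces else None
```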