Recent research has shown that training large language models (LLMs) through reinforcement learning (RL), even without supervised fine-tuning, can significantly enhance their reasoning abilities. Central to this advancement is the careful design of RL rewards. Two primary reward categories have emerged as essential: accuracy rewards and format rewards. Accuracy rewards assess the correctness of model outputs, such as final answers to mathematical questions or code evaluated against predefined test cases in coding challenges (e.g., LeetCode). Format rewards, conversely, incentivize models to structure their reasoning explicitly by encapsulating intermediate reasoning within predefined tags (e.g., <think>).
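As a concrete illustration, the minimal sketch below shows how these two reward types are commonly realized in practice; the function names, tag convention, and exact-match criterion are illustrative assumptions on our part, not a specific published implementation.

```python
import re


def format_reward(response: str) -> float:
    """Return 1.0 if the response wraps its reasoning in a <think> block
    followed by a non-empty final answer, else 0.0."""
    # Assumed convention: reasoning first, inside <think>...</think>, then the answer.
    pattern = r"^<think>.*?</think>\s*\S+"
    return 1.0 if re.match(pattern, response.strip(), flags=re.DOTALL) else 0.0


def accuracy_reward(response: str, ground_truth: str) -> float:
    """Return 1.0 if the text remaining after the reasoning block exactly
    matches the reference answer (a common choice for math problems)."""
    answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    return 1.0 if answer == ground_truth.strip() else 0.0
```

In a coding task, the accuracy component would instead execute the generated program against the predefined test cases and reward it only if all cases pass.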
This project proposes to extend this successful RL-based training paradigm to Multimodal Large Language Models (MLLMs), i.e., models that integrate both linguistic and visual inputs. While promising, this extension introduces unique challenges due to the inherently diverse and multimodal nature of the data involved. Consequently, we identify two key research questions that guide this project:
First, how can we effectively design multimodal problems accompanied by suitable accuracy reward mechanisms? Unlike purely textual problems, multimodal tasks may involve interpreting visual data, extracting context from images or videos, and providing coherent and correct linguistic responses. Establishing reliable accuracy metrics in this context requires novel methodologies to ensure the precise evaluation of multimodal understanding and reasoning.
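One plausible direction for this first question, sketched below under our own assumptions, is to pair each multimodal task type with a verifiable checker: normalized exact matching of short answers for visual question answering, and an intersection-over-union threshold for visual grounding boxes. The task names and the 0.5 threshold are illustrative choices, not fixed standards.

```python
def iou(box_a, box_b) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def multimodal_accuracy_reward(prediction, ground_truth, task_type: str) -> float:
    """Verifiable accuracy reward for two example multimodal task types."""
    if task_type == "vqa":
        # Short-answer VQA: case-insensitive exact match against the reference answer.
        return 1.0 if str(prediction).strip().lower() == str(ground_truth).strip().lower() else 0.0
    if task_type == "grounding":
        # Visual grounding: reward predicted boxes that overlap the reference by IoU >= 0.5.
        return 1.0 if iou(prediction, ground_truth) >= 0.5 else 0.0
    return 0.0
```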
Second, how can we effectively stimulate and structure chain-of-thought (CoT) reasoning in multimodal contexts? Encouraging models to articulate explicit reasoning processes in multimodal tasks is not straightforward, as reasoning steps may integrate complex interactions between visual and textual modalities. The challenge lies in prompting the model to systematically present its multimodal reasoning within structured tags, ensuring transparency and interpretability of its reasoning process.
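A simple starting point for this second question, shown below with a hypothetical tag scheme of our own choosing, is to prescribe the reasoning structure in the system prompt and reward responses only when the expected blocks appear, non-empty and in order.

```python
import re

# Hypothetical tag scheme for multimodal chain-of-thought: the model first
# describes the relevant visual evidence, then reasons, then answers.
# The tag names below are illustrative assumptions, not an established standard.
SYSTEM_PROMPT = (
    "First describe the relevant visual evidence inside <observe> tags, "
    "then reason step by step inside <think> tags, "
    "and finally give the answer inside <answer> tags."
)


def multimodal_format_reward(response: str) -> float:
    """Return 1.0 only if <observe>, <think>, and <answer> blocks all appear,
    non-empty and in that order."""
    pattern = r"<observe>.+?</observe>\s*<think>.+?</think>\s*<answer>.+?</answer>"
    return 1.0 if re.search(pattern, response, flags=re.DOTALL) else 0.0
```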
By addressing these questions, this project aims to develop robust strategies for training advanced Multimodal Large Language Models. By carefully crafting multimodal accuracy reward schemes and enhancing explicit CoT reasoning processes, we anticipate substantial improvements in both the accuracy and the interpretability of multimodal reasoning. Ultimately, this research seeks to broaden the applicability of reinforcement learning techniques to complex multimodal scenarios, facilitating significant progress in the practical deployment of multimodal AI systems.