Measuring compositionality and consistency in multimodal reasoning
Dnr:

NAISS 2025/22-374

Type:

NAISS Small Compute

Principal Investigator:

Adam Dahlgren Lindström

Affiliation:

Umeå universitet

Start Date:

2025-03-07

End Date:

2026-04-01

Primary Classification:

10210: Artificial Intelligence

Webpage:

Abstract

Recently, the field of large language models has focused on two things: improving reasoning capabilities and building stronger multimodal models. In this project, we highlight weaknesses in how existing benchmarks evaluate those capabilities and validate our claims by creating a new multimodal reasoning benchmark. We leverage our previous research on compositional generalisation, the composition of existing knowledge and skills to handle novel data and situations, to design this benchmark. Our thesis is that the way reasoning is currently measured does not reflect, or even utilize, the underlying structure of the tasks (function graphs, program structures, etc.), and thus reports strengths misleadingly. In particular, our experiments focus on measuring different types of consistency in reasoning chains. As a toy example, a model that solves "3+5 = x?" should also solve "5+3 = x?". This has not been systematically evaluated before, and even less so in the multimodal reasoning context. Our proposed framework and accompanying benchmark dataset are important for understanding how models can be built to be more data-efficient and robust. The benchmark is built from synthetic data in the CLEVR domain, and we evaluate a range of state-of-the-art multimodal language models. The compute resources of Berzelius are required to run some of the larger models and to perform the experiments with statistical rigour.
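To illustrate the kind of consistency check described above, the following is a minimal sketch, not the project's actual evaluation code: it assumes a hypothetical ask_model function standing in for any multimodal language model API, and measures whether the model gives the same answer across semantically equivalent variants of a question (here, the commutative arithmetic example from the abstract).

```python
# Minimal sketch of an answer-consistency check over semantically
# equivalent question variants. `ask_model` is a hypothetical placeholder
# for a multimodal language model query, not part of the project code.

def ask_model(question: str) -> str:
    """Return the model's answer to a question.
    In a real experiment this would query a multimodal language model;
    here a toy arithmetic stand-in is used so the sketch runs as-is."""
    return str(eval(question.split("=")[0]))

def consistency(variants: list[str]) -> float:
    """Fraction of variants whose answer agrees with the first variant."""
    answers = [ask_model(q) for q in variants]
    agree = sum(a == answers[0] for a in answers)
    return agree / len(answers)

# Commutative variants of the same addition problem.
print(consistency(["3 + 5 = x?", "5 + 3 = x?"]))  # 1.0 for a consistent model
```

In the multimodal setting, the same idea would compare answers across structurally equivalent scene-question pairs (e.g. CLEVR questions whose underlying function graphs are permuted), rather than arithmetic strings.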