The goal of this project is to study how humans and vision-and-language large language models refer to things in the world.
Current challenges in natural language processing and artificial intelligence include the automatic modelling of reference, e.g., understanding the meaning of a word in its visual context.
One challenge is the diversity of computational tasks, visual contexts, and generation methods used to produce and study referring expressions.
These tasks range from single-sentence image description to the generation of dialogues and conversations about images, which makes the scope of research on reference very broad and leaves it with little systematisation.
At the same time, multi-modal large language models (both open-source and closed-source) are evolving rapidly and are being evaluated on benchmarks of all kinds, without a proper focus on reference across different visual contexts and tasks.
Our project is thus timely: we aim to systematise this research, with a strong focus on the practical utility of our analysis and on the need of the natural language processing and natural language generation communities for a code toolkit to study reference.
The materials and code for the project will be presented at the International Conference on Natural Language Generation, INLG (https://2025.inlgmeeting.org).
Our goal is to present a tutorial on reference in multi-modal large language models.
The project is also international as it involves researchers from abroad.
The project will include extensive model evaluation.
We will publish our code and results as part of the tutorial in accordance with the VR guidelines on Good Research Practice, i.e., they will be made openly available under permissive licenses.
We do not work with any personal data, and we use only publicly available datasets that are widely used in the computational linguistics community.
Our goal is to run the experiments that will feature in the tutorial at the INLG conference, and we require sufficiently large computing resources for this purpose, as our research group does not otherwise have access to such resources.
After the presentation at INLG (which will be published as a tutorial abstract), we will develop this work into a full conference publication.
We will acknowledge the NAISS infrastructure used for our experiments in all scientific publications that involve the described work.