The goal of this project is to study how multimodal large language models can be guided to generate more socially appropriate and human-preferred narratives about sequences of images. Existing work in this area mostly develops benchmarks and datasets for evaluating vision-language models (VLMs), in particular their limited ability to understand social context when describing image sequences. Our focus is different: we hypothesise that Direct Preference Optimisation (DPO), a recent method for preference-based fine-tuning of language models, can guide VLMs toward descriptions that are more socially grounded and preferred by human annotators. We will use selected linguistic and multimodal properties of generated texts as signals for constructing the preference rankings used in DPO. These properties may include coreference consistency, character tracking, entity coherence, visual grounding, and the preservation of socially meaningful relations across images. The project is timely because it aims to deliver a pipeline that enables researchers and users in the social sciences to apply multimodal LLMs more appropriately in contexts where social interpretation matters.
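As an illustration of how such property signals could be turned into preference data, the following is a minimal sketch and not the project's actual pipeline. It assumes that each candidate description has already been scored on each property in the range [0, 1]; the names `Candidate`, `aggregate`, `build_preference_pairs`, the weights, and the `margin` threshold are all hypothetical. The output uses the prompt/chosen/rejected layout commonly used for DPO training data.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Candidate:
    """One generated description with hypothetical per-property scores in [0, 1]."""
    text: str
    scores: dict


def aggregate(scores, weights):
    """Weighted mean of the property scores; a missing property counts as 0."""
    total = sum(weights.values())
    return sum(w * scores.get(name, 0.0) for name, w in weights.items()) / total


def build_preference_pairs(prompt, candidates, weights, margin=0.1):
    """Rank candidates by aggregate score and emit (chosen, rejected) pairs.

    A pair is kept only if the aggregate scores differ by at least `margin`
    (an assumed threshold), so near-ties do not become training signal.
    """
    ranked = sorted(candidates, key=lambda c: aggregate(c.scores, weights), reverse=True)
    pairs = []
    # combinations() preserves order, so `better` never scores below `worse`.
    for better, worse in combinations(ranked, 2):
        gap = aggregate(better.scores, weights) - aggregate(worse.scores, weights)
        if gap >= margin:
            pairs.append({"prompt": prompt, "chosen": better.text, "rejected": worse.text})
    return pairs
```

In this sketch the weighting of properties is a free design choice; which properties actually predict human preference is precisely what the project sets out to investigate.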
The project aims to produce at least one paper for submission to a top-tier NLP conference, such as ACL, in autumn 2026. The work will be conducted within DSAI 2 at the Faculty of Science and Technology, Chalmers University of Technology. It will involve extensive fine-tuning of multimodal models using DPO, as well as automatic and human evaluation of the generated outputs. We will publish the code, experimental setup, and results associated with the paper in accordance with the Swedish Research Council's guidelines on good research practice. Where possible, these materials will be made openly available under permissive licences. The project will not use sensitive or personal data; it will rely instead on publicly available datasets that are widely used in the artificial intelligence and computational linguistics communities.
We will acknowledge the NAISS infrastructure used for our experiments in all scientific publications and other research outputs resulting from this work.