Beyond Pixels and Words

Dnr: NAISS 2026/4-914
Type: NAISS Small
Principal Investigator: Amelie Robrecht-Hilbig
Affiliation: Göteborgs universitet
Start Date: 2026-06-01
End Date: 2027-06-01
Primary Classification: 60201: Comparative Language Studies and Linguistics
Webpage: https://www.vr.se/english/swecris.html?project=2023-01552_VR#/

Allocation

Arrhenius Disk at NAISS: 250 GiB
Arrhenius GPU at NAISS: 150 GPU-h/month

Abstract

The goal of the project is to study how the spatial reasoning capabilities of artificial situated vision-and-language agents can be improved. Current agents, based on generative AI models, perform well when generating captions for pictures but struggle to adapt to changes in the physical environment, tend to confuse different perspectives and frames of reference, and cannot actively detect and resolve uncertainty. One of the main challenges is that current vision-language models are mainly trained on (artificial) picture-caption pairs, which causes the limitations mentioned above. In this project we will focus on a more dialogical and interactive approach that allows generative systems to improve their spatial reasoning capabilities. We will develop benchmarks extracted from human-human dialogues and test whether existing models improve their performance when using additional structured representations of (a) world knowledge, (b) the physical scene, and (c) the dialogue history.
As part of the project we are supervising multiple master's theses. The first tests the task-completion scores of LLM-based agents in the HABITAT AI virtual situated environment developed by Meta AI and their recent PARTNR model. Observed failures will be classified, analyzed, and prevented using different approaches. The second thesis uses retrieval-augmented generation with knowledge graphs to improve dialogical grounding. Both theses will use prompting and fine-tuning approaches to optimize the LLMs' performance on the given task. The third thesis evaluates active perception and visual reasoning in small-scale multimodal models by benchmarking their ability to explore scenes rebuilt from the CLEVR dataset in Unity. It investigates how agents can uncover hidden information through active camera control. Results as well as code, datasets, questionnaires, and methods will be published at different venues, including EMNLP and SemDial.
We will preregister all studies (human and AI) related to the project on the Open Science Framework (OSF) to allow for a higher degree of comparability and generalizability of the results. The main project is international, as it involves researchers from other European countries (Republic of Ireland). We will acknowledge the NAISS infrastructure used for our experiments in all scientific publications that involve the described work.