NAISS
SUPR
VLM Spatial Reasoning
Dnr:

NAISS 2026/4-147

Type:

NAISS Small

Principal Investigator:

Zhixing Li

Affiliation:

Chalmers tekniska högskola

Start Date:

2026-02-02

End Date:

2027-02-01

Primary Classification:

10207: Computer graphics and computer vision (System engineering aspects at 20208)

Webpage:


Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in semantic understanding and short-term video description. However, these models exhibit a critical deficiency in true "Visual-Spatial Intelligence." Research indicates that while models can identify objects, they struggle significantly with allocentric spatial reasoning, such as converting egocentric images or videos into a holistic understanding of spatial layouts, distances, and relative positions. This limitation severely hampers the deployment of Vision-Language Model (VLM) agents in embodied AI and complex navigation tasks where precise spatial awareness is non-negotiable.

To address these fundamental limitations, this project proposes a dual-thrust approach combining algorithmic innovation with rigorous evaluation. On the algorithmic front, we aim to develop a novel spatial reasoning framework designed to enhance the model's sensitivity to geometric structure. We will introduce a specialized spatial modeling module, potentially integrating advanced spatial position embeddings or a geometry-aware attention mechanism that explicitly encodes 3D spatial relationships alongside semantic features. This architectural enhancement aims to force the model to internalize the topology of an environment, enabling it to track object locations and spatial changes dynamically without relying solely on massive token memory.

Complementing this algorithmic development, we will design and release a comprehensive benchmark suite specifically tailored to evaluate spatial reasoning capabilities. Existing benchmarks often suffer from narrow scope; our proposed benchmark will instead introduce complex tasks that require the model to perform rigorous spatial deductions, such as relative distance estimation and spatial-relation logic.
By penalizing simple retrieval strategies and rewarding genuine spatial inference, this benchmark will provide a standardized metric for the community, guiding the transition of VLMs from passive observers to spatially intelligent agents.
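To make the geometry-aware attention idea concrete, the following is a minimal NumPy sketch of one plausible realization, in which attention logits between visual tokens are biased by their pairwise 3D distances. This is an illustrative assumption, not the project's actual architecture; the function name, the distance-penalty form, and the `temperature` parameter are all hypothetical.

```python
import numpy as np

def geometry_aware_attention(q, k, v, positions_3d, temperature=1.0):
    """Single-head attention with a 3D distance bias (illustrative sketch).

    q, k, v: (n_tokens, d) arrays of semantic features.
    positions_3d: (n_tokens, 3) array of token coordinates in 3D space.
    """
    d = q.shape[-1]
    # Standard scaled dot-product logits over semantic features.
    logits = q @ k.T / np.sqrt(d)
    # Pairwise Euclidean distances between token positions in 3D.
    diff = positions_3d[:, None, :] - positions_3d[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Subtract a distance penalty so spatially nearby tokens attend
    # to each other more strongly than distant ones.
    logits = logits - dist / temperature
    # Numerically stable softmax over keys.
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

The additive distance bias is one of the simplest ways to inject geometric structure into attention; alternatives mentioned in the abstract, such as spatial position embeddings, would instead modify `q` and `k` before the dot product.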
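For the evaluation thrust, one way a relative-distance-estimation task could be scored is with a mean relative accuracy: a prediction earns partial credit at each of several tolerance thresholds depending on how close its ratio to the ground truth is. This sketch is illustrative only; the function name and thresholds are assumptions, not the proposed benchmark's actual metric.

```python
def relative_distance_score(pred, truth, thresholds=(0.5, 0.75, 0.9)):
    """Mean relative accuracy for a distance estimate (illustrative sketch).

    The ratio min(pred/truth, truth/pred) lies in [0, 1]; the score is the
    fraction of tolerance thresholds the ratio meets, so exact answers
    score 1.0 and wild guesses score 0.0.
    """
    if truth <= 0:
        raise ValueError("ground-truth distance must be positive")
    ratio = min(pred / truth, truth / pred) if pred > 0 else 0.0
    return sum(ratio >= t for t in thresholds) / len(thresholds)
```

Graded credit of this kind rewards genuinely calibrated spatial inference over lucky retrieval, since a model that merely recalls a plausible number rarely lands within the tighter tolerance bands.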