Towards Detailed Visual Understanding via Large Multimodal Models
Dnr:

NAISS 2025/22-1317

Type:

NAISS Small Compute

Principal Investigator:

Fahad Khan

Affiliation:

Linköpings universitet

Start Date:

2025-09-29

End Date:

2026-04-01

Primary Classification:

20208: Computer Vision and Learning Systems (Computer Sciences aspects in 10207)

Webpage:


Abstract

Significant advances in computer vision have recently been driven by the development of foundational vision-language models. These models represent a major step towards general-purpose vision models capable of tackling multiple tasks simultaneously. To this end, conversational agents powered by Large Language Models (LLMs) offer a new way to interact with visual data. While there have been initial attempts at image-based conversation models, this project aims to advance image-based foundation models further and to explore the field of video-based multimodal foundation models. The aim is to develop novel approaches that efficiently combine the capabilities of LLMs with a pretrained visual encoder adapted for multimodal spatial or spatiotemporal representations, with a focus on spatio-temporal grounding and chain-of-thought temporal reasoning. The project further aims to explore a semi-automatic annotation framework for generating high-quality instruction data for images, videos, and text, and to develop novel, efficient techniques for integrating grounding and reasoning capabilities. The proposed methods are expected to have a profound impact on many real-world applications, ranging from healthcare to intelligent autonomous systems.
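
For context, a common way such approaches connect a pretrained visual encoder to an LLM is through a lightweight projection (adapter) that maps visual features into the LLM's token-embedding space, so that image or video tokens and text tokens can be processed jointly. The sketch below illustrates this generic pattern only; all module names, dimensions, and shapes are illustrative assumptions and do not represent the project's actual method.

import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    """Projects visual-encoder features into the LLM embedding space (illustrative)."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches_or_frames, vision_dim)
        return self.proj(visual_feats)

# Stand-ins for a frozen pretrained vision encoder output and an LLM embedding table.
vision_dim, llm_dim, vocab = 1024, 4096, 32000
adapter = VisualAdapter(vision_dim, llm_dim)
text_embed = nn.Embedding(vocab, llm_dim)

# Example: one video clip of 8 frames x 256 patches, plus a short text prompt.
visual_feats = torch.randn(1, 8 * 256, vision_dim)   # features from a frozen encoder
prompt_ids = torch.randint(0, vocab, (1, 16))

visual_tokens = adapter(visual_feats)                 # (1, 2048, llm_dim)
text_tokens = text_embed(prompt_ids)                  # (1, 16, llm_dim)

# The concatenated sequence would be fed to the LLM for spatio-temporal
# grounding and reasoning over both modalities.
llm_input = torch.cat([visual_tokens, text_tokens], dim=1)
print(llm_input.shape)  # torch.Size([1, 2064, 4096])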