DG-CLIP
Dnr: NAISS 2024/22-1393
Type: NAISS Small Compute
Principal Investigator: Arsham Gholamzadeh Khoee
Affiliation: Chalmers tekniska högskola
Start Date: 2024-11-05
End Date: 2025-06-01
Primary Classification: 10201: Computer Sciences
Webpage:

Abstract

The potential for adaptability and generalizability in vision-language models is particularly exciting. As AI systems are increasingly deployed in diverse and often unpredictable real-world environments, the ability to adapt to new contexts and generalize across domains becomes crucial. Current models, while powerful, often struggle when faced with scenarios that deviate significantly from their training data. This limitation hinders their practical application and scalability.

Our research aims to address these challenges by developing adaptive and generalizable vision-language models. We seek to enhance the robustness and flexibility of these models, enabling them to perform well in novel, unseen environments. This goal aligns with the broader AI community’s push towards more versatile and reliable systems that can operate effectively across a wide range of tasks and domains. At the core of our approach is the integration of semantic space into the representation learning process. By incorporating rich semantic information, we aim to build robust representations that capture the essence of both visual and linguistic inputs. These enhanced representations enable models to form deeper connections between modalities and better extrapolate to new scenarios. We will also explore the potential of prompt learning techniques to further enhance our models’ generalization, complementing our focus on robust representation learning.

The objective of this research is twofold. First, we seek to advance the theoretical understanding of how semantic information and prompt learning can be effectively integrated into vision-language models to improve generalization. Second, we aim to develop practical techniques and architectures that implement these insights, resulting in models that demonstrate superior robustness and performance across diverse tasks and domains.
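The abstract names prompt learning as one direction for improving generalization. Purely as an illustration of what that typically involves, the Python sketch below shows a CoOp-style setup in PyTorch: a small set of learnable context vectors is prepended to class-name embeddings and optimized, while the vision-language backbone stays frozen. The encoder stub, embedding dimensions, placeholder embeddings, and hyperparameters are assumptions made for a self-contained example; they are not the project's actual models or code.

import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM, CTX_LEN, NUM_CLASSES = 512, 4, 10  # illustrative sizes (assumptions)

class PromptLearner(nn.Module):
    """Learnable context vectors shared across classes (CoOp-style)."""
    def __init__(self, class_embeddings: torch.Tensor):
        super().__init__()
        # class_embeddings: (num_classes, emb_dim) token embeddings of class names
        self.ctx = nn.Parameter(torch.randn(CTX_LEN, EMB_DIM) * 0.02)
        self.register_buffer("cls_emb", class_embeddings)

    def forward(self):
        # Prepend the shared learnable context to every class token embedding,
        # giving prompts of shape (num_classes, ctx_len + 1, emb_dim).
        ctx = self.ctx.unsqueeze(0).expand(self.cls_emb.size(0), -1, -1)
        return torch.cat([ctx, self.cls_emb.unsqueeze(1)], dim=1)

class TextEncoderStub(nn.Module):
    """Stand-in for a frozen CLIP-like text encoder: pools prompt tokens to one vector."""
    def forward(self, prompt_tokens):      # (num_classes, seq, emb_dim)
        return prompt_tokens.mean(dim=1)   # (num_classes, emb_dim)

def classify(image_features, text_features, temperature=0.07):
    """Zero-shot-style logits: cosine similarity between image and class-text features."""
    img = F.normalize(image_features, dim=-1)
    txt = F.normalize(text_features, dim=-1)
    return img @ txt.t() / temperature

# Usage: only the prompt context is trained; the backbones stay frozen.
class_emb = torch.randn(NUM_CLASSES, EMB_DIM)   # placeholder class-name embeddings
prompt_learner = PromptLearner(class_emb)
text_encoder = TextEncoderStub()
optimizer = torch.optim.Adam(prompt_learner.parameters(), lr=2e-3)

image_features = torch.randn(8, EMB_DIM)        # placeholder image embeddings
labels = torch.randint(0, NUM_CLASSES, (8,))

optimizer.zero_grad()
logits = classify(image_features, text_encoder(prompt_learner()))
loss = F.cross_entropy(logits, labels)
loss.backward()
optimizer.step()

The point of the sketch is that adaptation happens entirely in the prompt space: the pretrained encoders are reused unchanged, which is what makes prompt learning attractive for generalizing a vision-language model to new domains with limited data.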