Modern vision-language models map images and text into a shared embedding space, but standard approaches rely on deterministic point embeddings and cannot represent uncertainty or the one-to-many nature of cross-modal correspondence. This project develops GeoFlowVLM, a geometry-aware probabilistic framework that models the joint distribution of image and text embeddings directly on the product hypersphere using Riemannian Flow Matching. The method enables unified joint, conditional, and marginal sampling without retraining the underlying vision-language model. By explicitly modeling cross-modal uncertainty on the manifold, the project aims to improve uncertainty quantification, retrieval robustness, and probabilistic reasoning in vision-language systems.
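To make the core construction concrete: Riemannian flow matching on the unit sphere typically pairs a geodesic interpolant between a noise point and a data point with a log-map target velocity that the learned vector field regresses onto. The following is a minimal NumPy sketch of that idea, not the project's actual implementation; the endpoint names and the specific conditional target `u_t = log_{x_t}(x_1)/(1-t)` are illustrative assumptions.

```python
import numpy as np

def normalize(v):
    """Project a vector onto the unit sphere."""
    return v / np.linalg.norm(v)

def slerp(x0, x1, t):
    """Geodesic (great-circle) interpolation between unit vectors x0 and x1."""
    theta = np.arccos(np.clip(np.dot(x0, x1), -1.0, 1.0))
    if theta < 1e-8:
        return x0.copy()
    return (np.sin((1 - t) * theta) * x0 + np.sin(t * theta) * x1) / np.sin(theta)

def log_map(p, q):
    """Riemannian log map on the sphere: tangent vector at p pointing toward q."""
    d = np.clip(np.dot(p, q), -1.0, 1.0)
    v = q - d * p                      # component of q orthogonal to p
    nv = np.linalg.norm(v)
    if nv < 1e-8:
        return np.zeros_like(p)
    return np.arccos(d) * v / nv       # scale direction by geodesic distance

rng = np.random.default_rng(0)
x0 = normalize(rng.normal(size=64))    # "noise" endpoint on the sphere (assumed prior)
x1 = normalize(rng.normal(size=64))    # "data" embedding on the sphere
t = 0.3
xt = slerp(x0, x1, t)                  # interpolant stays on the manifold
ut = log_map(xt, x1) / (1 - t)         # conditional target velocity for flow matching

# Sanity checks: the interpolant is on the sphere and the velocity is tangent to it.
assert abs(np.linalg.norm(xt) - 1.0) < 1e-6
assert abs(np.dot(xt, ut)) < 1e-6
```

A trained model would regress a time-conditioned vector field onto `ut` at sampled `(x0, x1, t)` triples; for the product hypersphere described above, the same construction applies per factor (one sphere for the image embedding, one for the text embedding), which is what makes joint, conditional, and marginal sampling variants of a single learned field.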