We are a Computer Vision group at Chalmers headed by Professor Fredrik Kahl. Our research focuses on Computer Vision and relies mainly on Deep Learning. Alongside fundamental research, we also work on applications in, e.g., autonomous driving and materials research. At our core, we leverage available compute resources to publish strong Computer Vision research at top-tier conferences. For instance, thanks to our prior NAISS projects, we recently published papers at ICML [10] (spotlight), CVPR [1, 2, 3, 8, 15, 18, 19, 201], NeurIPS [17], and other top-tier venues [13, 16].
Proposed Research. The requested compute (10K hours / month) will be used to realize research projects in (i) Geometric Deep Learning for 3D Vision, (ii) Generative AI for autonomous driving, and (iii) Consistent image and video generation and editing.
(i) Geometric Deep Learning for 3D Vision. We develop scalable 3D representation learning methods that outperform 2D self-supervised models. In our recent CVPR 2026 work (MuM) [1], we showed that simple 3D-based objectives outperform models such as DINOv3. We will extend this work by introducing student–teacher training and latent prediction objectives, which require large-scale training on multi-view datasets. Training MuM required 64×A100 GPUs for 3 days (~4.6K GPUh per run), and the proposed extensions involve multiple such runs and ablations. In parallel, we will continue our successful work on feature matching [2, 4, 6, 13, 15, 16, 18, 19] by combining it with our work on equivariance [7, 10, 14, 17, 20, 21]. We aim to create an equivariant feature matcher that achieves state-of-the-art performance, especially on aerial imagery. Such feature matchers are typically trained on 16×A100 GPUs.
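As a rough sanity check of the MuM figure against the requested allocation, the arithmetic can be sketched as follows. This is an illustrative calculation only: it assumes that the requested 10K hours / month are GPU hours, and the helper name gpu_hours is ours and hypothetical.

```python
# Back-of-the-envelope GPU-hour estimate (illustrative, not a scheduling tool).

def gpu_hours(num_gpus: int, days: float) -> float:
    """GPU hours consumed by a run that occupies num_gpus for the given number of days."""
    return num_gpus * days * 24

# Assumption: the requested 10K hours / month are counted in GPU hours.
MONTHLY_BUDGET_GPUH = 10_000

# Figures stated above: 64 A100 GPUs for 3 days per MuM training run.
mum_run_gpuh = gpu_hours(num_gpus=64, days=3)

# Number of full-scale MuM-sized runs that fit into one month of the allocation.
runs_per_month = MONTHLY_BUDGET_GPUH / mum_run_gpuh

print(f"One MuM run: {mum_run_gpuh:.0f} GPUh")
print(f"MuM-sized runs per month: {runs_per_month:.2f}")
```

Under these assumptions, one run costs 64 × 3 × 24 = 4608 GPUh, so the monthly allocation covers roughly two full-scale runs before accounting for ablations and the other projects.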
(ii) Generative AI for autonomous driving. We work on four topics: image and video diffusion models for correcting artifacts from 3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRFs) in autonomous driving scenes [3]; World Foundation Models that generate novel 4D driving scenes from text or road markers for use in closed-loop simulation; reinforcement learning of autonomous driving models in dynamic 3DGS driving scenes; and regularization methods for improved 3DGS reconstruction quality.
(iii) Consistent image and video generation and editing. We develop methods that address 3D geometric and semantic consistency in image/video generation and editing. We leverage foundation image generation models and aim to improve them without retraining or fine-tuning [5, 8]. Our approaches include guidance of the denoising process, output verification, and input augmentation combined with 3D reconstruction methods. Our recent focus has been on 3D editing, i.e., consistent editing of multiple views, with an emphasis on handling challenging non-rigid edits and large viewpoint changes. The pipelines involve large foundation models (e.g., FLUX, RoMa, DepthAnything, AnySplat) requiring 30–80+ GB of VRAM, and backpropagation through the denoising process adds further memory and compute demands. Evaluating a large number of baseline methods and guidance approaches further increases the computational demand.