ML for Computer Vision
Dnr: NAISS 2024/5-609

Type: NAISS Medium Compute

Principal Investigator: Fredrik Kahl

Affiliation: Chalmers tekniska högskola

Start Date: 2024-12-01

End Date: 2025-12-01

Primary Classification: 10207: Computer Vision and Robotics (Autonomous Systems)


Abstract

Machine learning research conducted by the Computer Vision group at the Department of Electrical Engineering, Chalmers. We are currently working on the following four problems, for which we need access to the GPU cluster and storage for large image data sets such as MegaDepth, MegaScenes, ImageNet and others. Further specifics are given at the end of this abstract.

1. Generalization and invariance. One of the key factors behind the success of deep learning for 2D image interpretation is the convolutional layer, which induces translation equivariance: if the input is translated, then the output is too. This drastically reduces the number of learnable parameters compared to a fully connected layer and improves generalization performance. For 3D data such as point clouds, meshes and graphs, other equivariances have been investigated for classification, but apart from convolutions they are not exploited in encoder-decoder models. We analyze, construct and develop efficient implementations of other equivariances, both for the encoder, (h_ϕ ∘ g)(X) = (g ∘ h_ϕ)(X), and similarly for the decoder, (f_θ ∘ g)(Z) = (g ∘ f_θ)(Z), for all g belonging to a group. We will investigate the groups SO(3) and SE(3) and their subgroups, for instance azimuthal rotations, especially for the application of image matching (a minimal numerical check of the equivariance condition is sketched at the end of this abstract).

2. Dynamics and deformation. We will explore methods to learn latent-space representations of deformation fields and enforce continuity by exploiting the sequential nature of the data. The ambition is to decouple deformations from geometry, aggregate information over multiple scenes, and learn deformation models that can be quickly adapted to novel scenes by relying on meta-learning (see the adaptation sketch at the end of this abstract).

3. Flexible and controllable 3D generative models. Given a trained encoder-decoder model, we need to be able to perform inference on incomplete data. Typically, objects are visible only from a few viewpoints, or even a single one, whereas encoder-decoder training usually relies on complete data. We are currently focusing on controllable outdoor scenes where conditions such as lighting, season and weather can be changed.

4. Medical image analysis. We also use deep learning to develop algorithms that automatically extract and segment structures in medical images such as CT, MR and PET scans.

More specific updates:

- In medical image analysis, the allocation will be used to explore how a mixed-resolution transformer can improve volumetric semantic segmentation in medical CT and MR data.
- Exploring label-constrained computer vision problems, such as few-shot adaptation of generic segmentation models to specific tasks (see the sketch at the end of this abstract), with expected applications in medical image analysis.
- Studying the use of text-to-image diffusion models in various under-determined computer vision problems, ranging from few-shot classification/segmentation to 3D applications such as novel view synthesis and monocular depth estimation.
- Continuing to investigate how diffusion models can be used to generate geometrically consistent images and 3D representations.
- Developing deep learning methods for 3D computer vision and efficient algorithms for geometric deep learning.
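
To make the equivariance condition in problem 1 concrete, the following minimal sketch (assuming PyTorch; the toy encoder is hypothetical, not the group's actual model) numerically checks h_ϕ(g · X) = g · h_ϕ(X) for an azimuthal rotation, i.e. a rotation about the z-axis, a subgroup of SO(3):

```
import math
import torch

def rot_z(theta: float) -> torch.Tensor:
    """3x3 rotation matrix about the z-axis (azimuthal rotation)."""
    c, s = math.cos(theta), math.sin(theta)
    return torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def encoder(X: torch.Tensor) -> torch.Tensor:
    """Toy equivariant encoder: centers the point cloud on its mean.
    The centroid rotates with the cloud, so subtracting it commutes
    with any rotation about the origin."""
    return X - X.mean(dim=0, keepdim=True)

def equivariance_gap(h, X: torch.Tensor, g: torch.Tensor) -> float:
    """|| h(g X) - g h(X) ||; zero iff h is equivariant for this g."""
    return (h(X @ g.T) - h(X) @ g.T).norm().item()

X = torch.randn(1024, 3)                 # random point cloud, N x 3
g = rot_z(0.7)                           # an azimuthal rotation
print(equivariance_gap(encoder, X, g))   # ~0 up to float32 round-off
```

For a learned encoder the gap is generally nonzero, and the same harness can measure how far a given architecture is from exact equivariance.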
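The meta-learning idea in problem 2 can be illustrated with an auto-decoder-style sketch (all names and shapes are hypothetical, assuming PyTorch): a shared deformation decoder maps a per-scene latent code z and a query point x to a displacement, so adapting to a novel scene means a few gradient steps on z alone, with the shared weights frozen.

```
import torch
import torch.nn as nn

class DeformationDecoder(nn.Module):
    """f_theta: (latent code z, point x) -> 3D displacement."""
    def __init__(self, latent_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 3, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 3))

    def forward(self, z: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # broadcast the scene code to every query point
        return self.net(torch.cat([z.expand(x.shape[0], -1), x], dim=-1))

def adapt_to_scene(decoder, points, targets, steps=10, lr=1e-2):
    """Inner loop: optimize only the scene latent code z."""
    z = torch.zeros(1, 64, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (decoder(z, points) - targets).pow(2).mean()
        loss.backward()
        opt.step()
    return z.detach()

decoder = DeformationDecoder()           # meta-trained weights assumed
pts = torch.randn(256, 3)                # query points in a novel scene
disp = torch.randn(256, 3) * 0.05        # observed displacements (dummy)
z_scene = adapt_to_scene(decoder, pts, disp)
```

This separates geometry (the query points) from deformation (the decoder plus z), so information aggregated over many scenes lives in the shared weights while z captures what is specific to one scene.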
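Finally, a hedged sketch of the few-shot adaptation idea from the updates list (architecture and data are stand-ins, assuming PyTorch): freeze a generic pretrained segmentation backbone and fit only a lightweight 1x1-convolution head on a handful of labelled examples for the target task.

```
import torch
import torch.nn as nn

backbone = nn.Sequential(                # stand-in for a pretrained encoder
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
for p in backbone.parameters():
    p.requires_grad_(False)              # generic features stay frozen

head = nn.Conv2d(32, 2, kernel_size=1)   # task-specific head, 2 classes
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# few-shot support set: 5 images with dense labels (dummy data here)
images = torch.randn(5, 1, 64, 64)
labels = torch.randint(0, 2, (5, 64, 64))

for _ in range(100):                     # adapt the head only
    opt.zero_grad()
    logits = head(backbone(images))      # (5, 2, 64, 64)
    loss = loss_fn(logits, labels)
    loss.backward()
    opt.step()
```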