Recently, image representations based on convolutional neural networks (CNNs) and Vision Transformers (ViTs) have substantially advanced the state of the art in many computer vision applications, including image classification, object detection, scene recognition, semantic segmentation, action recognition, and visual tracking. CNNs consist of a series of convolution and pooling operations followed by one or more fully connected (FC) layers. ViTs consist of an encoder-only Transformer and a head suited to the task at hand. Both network families are trained on raw image pixels at a fixed input size, or on sparse point clouds within a finite volume, and require large amounts of labelled training data. The introduction of large datasets (e.g. ImageNet with 14 million images, semantic 3D datasets, and synthetic datasets) and the parallelism enabled by modern GPUs have facilitated the rapid deployment of deep networks for many visual tasks. This development has led to what is widely called the deep learning revolution in computer vision. CVL is currently working on eight research tasks within this project, for which GPU resources are requested.
1. Simulation of Quantum Machine Learning
2. Human motion analysis from videos
3. Deep learning for large scale remote sensing scene analysis
4. Probabilistic 3D computation from time-of-flight measurements
5. Spatio-temporal networks for scene flow estimation
6. Injection of geometry into deep learning
7. WASP NEST _main_ (hybrid machine learning)
8. Large multi-modal models (LMMs) for biodiversity
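The CNN pipeline described above (convolution and pooling followed by an FC layer, operating on fixed-size raw pixel input) can be sketched in a few lines. This is a minimal illustrative forward pass in NumPy; the image size, single 3x3 filter, and 10-class head are assumptions for the sketch, not any specific CVL model.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, w):
    """Valid 2D cross-correlation of a single-channel image x with kernel w."""
    kh, kw = w.shape
    h, wd = x.shape
    out = np.zeros((h - kh + 1, wd - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def max_pool(x, k=2):
    """Non-overlapping k x k max pooling."""
    h, w = x.shape
    return x[:h - h % k, :w - w % k].reshape(h // k, k, w // k, k).max(axis=(1, 3))

def relu(x):
    return np.maximum(x, 0)

# Forward pass: conv -> ReLU -> pool -> flatten -> fully connected head.
image = rng.standard_normal((28, 28))            # fixed-size raw pixel input (assumed 28x28)
kernel = rng.standard_normal((3, 3))             # one learnable 3x3 filter (illustrative)
fc_weights = rng.standard_normal((10, 13 * 13))  # hypothetical 10-class FC head

features = max_pool(relu(conv2d(image, kernel)))  # (13, 13) feature map
logits = fc_weights @ features.ravel()            # one score per class
```

A trained network would learn `kernel` and `fc_weights` by gradient descent on labelled data; here they are random, which is why large labelled datasets and GPU parallelism are needed in practice.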