Multi-Modal Open-Set Perception for Autonomous Driving
Dnr:

NAISS 2025/22-977

Type:

NAISS Small Compute

Principal Investigator:

Jesper Eriksson

Affiliation:

Kungliga Tekniska högskolan

Start Date:

2025-07-10

End Date:

2026-08-01

Primary Classification:

20208: Computer Vision and Learning Systems (Computer Sciences aspects in 10207)

Webpage:


Abstract

Autonomous driving promises to improve road safety and transport efficiency, yet key challenges persist, most notably the "long-tail" problem: rare, complex scenarios that are difficult to handle with conventional, closed-set perception systems. This research project targets these challenges by advancing multi-modal open-set perception for autonomous vehicles through the development and evaluation of sensor-data processing techniques suited to foundation models. These models, characterized by their reasoning and generalization capabilities, hold the potential to overcome the limitations of current systems by enabling zero-shot learning and robust scene understanding. The core objective of the project is to create a unified multi-modal embedding space that fuses data from diverse vehicle sensors such as cameras, LiDARs, and radars. This space must support both the spatial and temporal reasoning tasks critical for safe driving in complex, dynamic environments. Emphasis is placed on embedding richness, computational efficiency, and real-time feasibility within the constraints of onboard vehicle hardware.

Three main research goals guide the project. First, it aims to develop a spatial embedding framework that allows for detailed understanding of scenes containing many diverse objects. This involves aligning heterogeneous sensor modalities into a shared representation while preserving spatial fidelity and ensuring scalability. Second, the project addresses temporal reasoning by extending the embedding space to process sequential sensor data, capturing motion and dynamic changes. This includes exploring self-supervised learning techniques for pre-training on raw sensor data and leveraging modality-specific strengths, such as LiDAR's geometric precision and low-light performance. Third, the project will integrate these embeddings into downstream reasoning tasks such as motion prediction and planning, evaluating their impact on key safety and performance indicators within an autonomous driving software stack.

The approach is grounded in recent advances in foundation models, such as contrastive learning (e.g., CLIP), multi-modal fusion (e.g., IMAGEBIND), and dense open-set segmentation (e.g., ZegCLIP). Inspired by developments in vision-language models and robotic reasoning, the project explores how to interface the multi-modal embedding space directly with language models for scene understanding and decision-making. This includes investigating token-space projections and cross-modal attention mechanisms that enable flexible, explainable, and end-to-end trainable systems.

By pushing the boundaries of how autonomous vehicles perceive and reason about their environments, this project aims to significantly reduce long-tail risk, enabling safer operation in broader and more diverse Operational Design Domains (ODDs) and ultimately paving the way toward higher levels of driving autonomy.
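
To make the contrastive-alignment idea referenced above concrete, the following is a minimal CLIP-style sketch in which paired camera and LiDAR features are projected into a shared space and trained with a symmetric InfoNCE loss. The encoder output widths, projection-head design, and temperature are placeholder assumptions for illustration, not choices made by the project.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Maps a modality-specific feature vector into the shared embedding space."""
    def __init__(self, in_dim: int, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalise so cosine similarity reduces to a dot product.
        return F.normalize(self.net(x), dim=-1)

def clip_style_loss(cam_emb: torch.Tensor, lidar_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired camera/LiDAR embeddings."""
    logits = cam_emb @ lidar_emb.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(cam_emb.size(0), device=cam_emb.device)
    # Matching pairs lie on the diagonal; both directions are penalised.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Hypothetical pooled backbone outputs for a batch of paired clips.
batch = 8
cam_feat = torch.randn(batch, 512)     # placeholder camera feature width
lidar_feat = torch.randn(batch, 384)   # placeholder LiDAR feature width

cam_head, lidar_head = ProjectionHead(512), ProjectionHead(384)
loss = clip_style_loss(cam_head(cam_feat), lidar_head(lidar_feat))
print(f"contrastive alignment loss: {loss.item():.3f}")
```

In an IMAGEBIND-like setup, further modalities such as radar could be aligned to one anchor modality with the same objective, yielding a single shared space without requiring every modality pair to be jointly annotated.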
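The temporal goal can be illustrated with a similarly small self-supervised sketch: per-frame embeddings from the shared space are randomly masked and a transformer encoder is trained to reconstruct the missing frames. The layer sizes, masking ratio, and reconstruction loss below are illustrative assumptions rather than the project's actual design.

```python
import torch
import torch.nn as nn

class TemporalSceneEncoder(nn.Module):
    """Adds temporal reasoning on top of per-frame scene embeddings by masking
    random frames and reconstructing them (masked self-supervised pre-training)."""
    def __init__(self, embed_dim: int = 256, num_layers: int = 4,
                 num_heads: int = 8, max_len: int = 64):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(embed_dim))
        self.pos_emb = nn.Parameter(torch.randn(max_len, embed_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, frames: torch.Tensor, mask_ratio: float = 0.25) -> torch.Tensor:
        # frames: (B, T, embed_dim) per-frame embeddings from the shared space.
        B, T, _ = frames.shape
        mask = torch.rand(B, T, device=frames.device) < mask_ratio   # True = masked
        x = torch.where(mask.unsqueeze(-1), self.mask_token, frames)
        x = x + self.pos_emb[:T]
        pred = self.encoder(x)
        # Reconstruction loss only on the masked frames.
        return ((pred - frames) ** 2).mean(dim=-1)[mask].mean()

encoder = TemporalSceneEncoder()
clip_embeddings = torch.randn(4, 32, 256)   # hypothetical 32-frame clips
print(encoder(clip_embeddings).item())
```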
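Finally, the token-space projection mentioned above can be sketched as a small cross-attention module that compresses fused scene embeddings into a fixed number of "soft tokens" in a language model's embedding space. The token count, hidden sizes, and module name (SceneTokenProjector) are hypothetical and serve only to illustrate the interface.

```python
import torch
import torch.nn as nn

class SceneTokenProjector(nn.Module):
    """Projects fused sensor embeddings into a language model's token-embedding
    space so the scene can be consumed as a short sequence of soft tokens."""
    def __init__(self, scene_dim: int, lm_dim: int,
                 num_tokens: int = 16, num_heads: int = 8):
        super().__init__()
        # Learned queries attend over the scene embeddings (cross-modal attention).
        self.queries = nn.Parameter(torch.randn(num_tokens, lm_dim) * 0.02)
        self.key_value_proj = nn.Linear(scene_dim, lm_dim)
        self.attn = nn.MultiheadAttention(lm_dim, num_heads, batch_first=True)

    def forward(self, scene_emb: torch.Tensor) -> torch.Tensor:
        # scene_emb: (B, N, scene_dim) per-element embeddings from the fused space.
        kv = self.key_value_proj(scene_emb)                       # (B, N, lm_dim)
        q = self.queries.unsqueeze(0).expand(scene_emb.size(0), -1, -1)
        soft_tokens, _ = self.attn(q, kv, kv)                     # (B, num_tokens, lm_dim)
        return soft_tokens

# Hypothetical shapes: 128 fused scene embeddings of width 256, LM hidden size 768.
projector = SceneTokenProjector(scene_dim=256, lm_dim=768)
scene = torch.randn(2, 128, 256)
tokens = projector(scene)
print(tokens.shape)  # torch.Size([2, 16, 768])
```

The resulting soft tokens would be concatenated with ordinary text-token embeddings before being passed to a language model, a common pattern in current vision-language systems for making such an interface end-to-end trainable.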