Built environment characteristics, such as sidewalk availability, vegetation coverage, and traffic object density, are important inputs for transportation analysis. However, extracting these indicators automatically from street-level imagery typically requires training and maintaining a separate computer vision model for each task. This project investigates a unified framework for extracting transportation-relevant built environment features from street-level imagery. The proposed approach combines task-oriented supervision with a shared foundation model architecture based on DINOv3 and Mask2Former. A whitelist strategy focuses training on transportation-relevant classes while preserving the broader prediction space of the original dataset. The framework jointly produces pixel-level semantic segmentation and instance-level object masks from a single model. Experiments will be conducted on the Mapillary Vistas dataset, which contains high-resolution street-scene images with detailed annotations. Because of the dataset's size and image resolution, training and evaluating transformer-based segmentation models will require substantial GPU resources. The project aims to improve the extraction of transportation-relevant built environment indicators and to provide reliable inputs for downstream transportation analysis and planning.
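
The whitelist strategy described above can be sketched as per-pixel loss weighting: whitelisted classes receive full loss weight while all other classes keep a small nonzero weight, so the model retains the dataset's full prediction space. The class IDs, the `ignore_weight` value, and the function name below are illustrative assumptions, not the project's actual configuration.

```python
import numpy as np

# Hypothetical subset of Mapillary Vistas class IDs treated as
# transportation-relevant (the real whitelist is an assumption here).
WHITELIST = {2, 7, 13, 19}  # e.g. sidewalk, road, vegetation, traffic sign

def whitelist_weights(label_map: np.ndarray, ignore_weight: float = 0.1) -> np.ndarray:
    """Per-pixel loss weights: 1.0 for whitelisted classes, a small
    weight elsewhere so non-whitelisted classes are down-weighted
    rather than removed from the prediction space."""
    weights = np.full(label_map.shape, ignore_weight, dtype=np.float32)
    for cls in WHITELIST:
        weights[label_map == cls] = 1.0
    return weights

# Toy 2x2 label map: two whitelisted pixels, two others.
labels = np.array([[2, 0], [7, 40]])
w = whitelist_weights(labels)
```

These weights would multiply the per-pixel cross-entropy term during training; setting `ignore_weight` to zero instead would collapse the task to whitelist-only segmentation.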