Built environment characteristics, such as sidewalk availability, vegetation coverage, and traffic object density, are important inputs for transportation analysis. However, extracting these indicators automatically from street-level imagery typically requires training and maintaining a separate computer vision model for each task. This project investigates a unified framework for extracting transportation-relevant built environment features from street-level imagery. The proposed approach combines task-oriented supervision with a shared foundation model architecture based on DINOv3 and Mask2Former. A whitelist strategy focuses training on transportation-relevant classes while preserving the broader prediction space of the original dataset. The framework jointly produces pixel-level semantic segmentation and instance-level object masks from a single model. Experiments will be conducted on the Mapillary Vistas dataset, which contains high-resolution street-scene images with detailed annotations. Because of the dataset's size and image resolution, training and evaluating transformer-based segmentation models will require substantial GPU resources. The project aims to improve the extraction of transportation-relevant built environment indicators and to provide reliable inputs for downstream transportation analysis and planning.
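
The whitelist strategy described above can be sketched as per-pixel loss weighting: whitelisted classes receive full loss weight while all other classes keep a small nonzero weight, so the model retains the dataset's full prediction space. The class IDs, the `ignore_weight` value, and the function name below are illustrative assumptions, not the project's actual configuration.

```python
import numpy as np

# Hypothetical subset of Mapillary Vistas class IDs treated as
# transportation-relevant (the real whitelist is an assumption here).
WHITELIST = {2, 7, 13, 19}  # e.g. sidewalk, road, vegetation, traffic sign

def whitelist_weights(label_map: np.ndarray, ignore_weight: float = 0.1) -> np.ndarray:
    """Per-pixel loss weights: 1.0 for whitelisted classes, a small
    weight elsewhere so non-whitelisted classes are down-weighted
    rather than removed from the prediction space."""
    weights = np.full(label_map.shape, ignore_weight, dtype=np.float32)
    for cls in WHITELIST:
        weights[label_map == cls] = 1.0
    return weights

# Toy 2x2 label map: two whitelisted pixels, two others.
labels = np.array([[2, 0], [7, 40]])
w = whitelist_weights(labels)
```

These weights would multiply the per-pixel cross-entropy term during training; setting `ignore_weight` to zero instead would collapse the task to whitelist-only segmentation.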