3D scene understanding is crucial for image matching and depth estimation, which in turn support downstream applications in autonomous driving, scene reconstruction, and robotics.
Training a strong 3D scene extractor requires plenty of data, and these datasets are often quite large.
Update 2026-04-01:
This project has been instrumental in creating MuM: Multi-View Masked Image Modeling for 3D Vision (Nordström et al.), which will be presented at CVPR26 (a top-tier ML/CV conference). It has also helped in creating RoMa v2 (which won best paper at SSBA26) as well as LoMa (to be released soon).