3D scene representations are an important component of many applications, including AR/VR/MR (XR), game design, and real estate. For XR applications, traditional representations such as point clouds, generated by Structure from Motion (SfM) or Simultaneous Localization and Mapping (SLAM), remain the standard choice because they enable fast and robust operation of downstream tasks such as camera localization. However, these techniques, as well as recently proposed learning-based representations such as deep implicit fields, ignore an important property of man-made environments: compositionality. Most man-made environments are composed of smaller building blocks, such as objects or components, that are shared across many other environments. Compositional representations are common in other industries such as gaming, but mostly for synthetically generated scenes assembled from synthetic assets. Meanwhile, deep generative models (e.g., DALL-E and Stable Diffusion) have advanced rapidly and have been remarkably successful at both unconditional and conditional generation of realistic images. In the 3D domain, by contrast, comparatively little work exists, and it has focused on generating small-scale objects rather than room-scale scenes.
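To make the notion of compositionality concrete, the following is a minimal sketch of a compositional scene representation: a small library of shared assets (building blocks) and scenes defined as sets of posed instances referencing those assets. The class names (`Asset`, `Instance`, `Scene`) and their fields are illustrative assumptions, not part of any existing system described in this work.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Asset:
    """A reusable building block, e.g. a chair mesh shared across scenes."""
    name: str

@dataclass
class Instance:
    """One placement of an asset: a reference to it plus a pose."""
    asset: Asset
    position: tuple  # (x, y, z) translation; rotation omitted for brevity

@dataclass
class Scene:
    """A scene is just a collection of asset instances."""
    instances: list = field(default_factory=list)

    def add(self, asset, position):
        self.instances.append(Instance(asset, position))

# The same small asset library can compose many different scenes.
chair = Asset("chair")
table = Asset("table")

office = Scene()
office.add(table, (0.0, 0.0, 0.0))
office.add(chair, (1.0, 0.0, 0.0))
office.add(chair, (-1.0, 0.0, 0.0))

# Both chair instances share one underlying asset definition,
# rather than duplicating its geometry per scene.
assert office.instances[1].asset is office.instances[2].asset
```

A generative model with compositionality built in would, in this view, generate the instance layout (which assets appear where) rather than raw, monolithic scene geometry.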
Our objective is to harness the potential of generative modeling to fill these two gaps in research on 3D scene representations. More concretely, we investigate how compositionality can be built implicitly into a generative model of 3D scenes.