Augmented Telepresence (AT) aims to provide immersive remote interaction by reconstructing and rendering realistic environments from limited visual inputs. Traditional pipelines treat Novel View Synthesis (NVS) and depth estimation as independent tasks, which leads to inconsistencies between depth maps across synthesized views, particularly in dynamic environments, and introduces visual artifacts such as flickering. Moreover, such pipelines are computationally expensive and unsuitable for the real-time operation that AT requires.
This continuation project builds upon our prior work on light field reconstruction to propose a unified deep learning framework for joint NVS and depth estimation. By leveraging a shared latent space representation, the proposed method ensures consistency between synthesized views and their corresponding depth maps, significantly mitigating flickering and other artifacts. The unified architecture is designed to minimize inference time, making it well suited for real-time AT applications. The focus will be on training and testing the network on realistic datasets that mimic telepresence scenarios, including human motion and object interactions, while ensuring scalability to dynamic environments.
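To make the shared-latent-space idea concrete, the following minimal sketch shows a network in which a single encoder feeds both a novel-view head and a depth head, so the synthesized view and its depth map are derived from the same latent code. Module names, layer sizes, and the simple convolutional encoder/decoder structure are illustrative assumptions for this proposal, not the finalized architecture.

```python
# Illustrative sketch of a joint NVS + depth network with a shared latent space.
# Layer sizes and module names are assumptions; the actual architecture is part
# of the proposed research.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Maps input views to a shared latent representation used by both heads."""
    def __init__(self, in_ch=3, latent_ch=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, latent_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)

class DecoderHead(nn.Module):
    """Upsampling decoder, used with different output channels by each head."""
    def __init__(self, latent_ch=128, out_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_ch, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, out_ch, 4, stride=2, padding=1),
        )

    def forward(self, z):
        return self.net(z)

class JointNVSDepthNet(nn.Module):
    """Shared encoder feeding a novel-view head (RGB) and a depth head, so the
    synthesized view and its depth map come from the same latent code."""
    def __init__(self):
        super().__init__()
        self.encoder = SharedEncoder()
        self.view_head = DecoderHead(out_ch=3)   # synthesized RGB view
        self.depth_head = DecoderHead(out_ch=1)  # corresponding depth map

    def forward(self, src_views):
        z = self.encoder(src_views)              # shared latent representation
        return self.view_head(z), self.depth_head(z)

# Example: a batch of 2 source views at 256x256 resolution.
if __name__ == "__main__":
    model = JointNVSDepthNet()
    rgb, depth = model(torch.randn(2, 3, 256, 256))
    print(rgb.shape, depth.shape)  # (2, 3, 256, 256), (2, 1, 256, 256)
```

Because both outputs are decoded from the same latent representation, consistency between the view and its depth map is encouraged by construction rather than enforced post hoc across two separate models.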
To support the development and evaluation of this framework, a high-performance computing (HPC) cluster is essential for training the computationally intensive models, particularly as we integrate domain-specific constraints such as temporal coherence, spatial consistency, and efficient rendering for augmented telepresence. The project's outcome will provide a foundation for scalable, real-time, and robust telepresence systems with applications in remote collaboration, education, and healthcare.
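As an illustration of how the temporal-coherence and spatial-consistency constraints mentioned above might enter training, the sketch below combines a photometric term, a depth-supervision term, and a frame-to-frame coherence term into one objective. The specific loss terms and weights (w_photo, w_depth, w_temporal) are assumptions for demonstration, not the finalized formulation.

```python
# Illustrative sketch of a combined training objective; term definitions and
# weights are assumptions for demonstration purposes.
import torch
import torch.nn.functional as F

def joint_loss(pred_rgb, gt_rgb, pred_depth, gt_depth,
               prev_pred_rgb=None, w_photo=1.0, w_depth=0.5, w_temporal=0.1):
    """Photometric + depth-consistency + temporal-coherence loss."""
    # Photometric reconstruction loss on the synthesized view.
    loss = w_photo * F.l1_loss(pred_rgb, gt_rgb)
    # Spatial consistency: the depth decoded from the shared latent is
    # supervised against reference depth so geometry stays aligned with the view.
    loss = loss + w_depth * F.l1_loss(pred_depth, gt_depth)
    # Temporal coherence: penalize frame-to-frame changes in the synthesized
    # view to suppress flickering in dynamic scenes.
    if prev_pred_rgb is not None:
        loss = loss + w_temporal * F.l1_loss(pred_rgb, prev_pred_rgb)
    return loss
```

Training such a multi-term objective on sequences of high-resolution frames is precisely the workload that motivates the HPC resources requested here.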