This project addresses 3D-consistent image and video editing and generation, as well as novel view synthesis. The primary focus is on leveraging existing foundation models for image and video editing and generation. Such models have been trained for a very long time on extremely large datasets and can therefore generate highly realistic, high-resolution images. However, when they produce multiple views of the same scene, geometric (and sometimes even semantic) consistency across those views is generally not guaranteed. We are looking into ways to enforce this consistency without re-training or fine-tuning the large foundation models. The approaches under investigation include guiding the denoising process, verifying outputs, and augmenting inputs, each supported by different 3D reconstruction methods.
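As an illustration of what output verification could look like, the sketch below checks whether generated views agree with a shared 3D reconstruction by measuring reprojection error. It is a minimal example under stated assumptions, not the project's actual method: the pinhole camera model, the camera dictionary layout (`R`, `t`, `f`, `cx`, `cy`), the function names, and the 2-pixel threshold are all hypothetical choices made for the example.

```python
from math import hypot


def project(point, camera):
    """Project a 3D world point into a view with a simple pinhole camera.

    `camera` is a hypothetical dict: 'R' (3x3 rotation as nested lists),
    't' (translation, 3 values), 'f' (focal length in pixels),
    'cx', 'cy' (principal point in pixels).
    Returns (u, v) in pixels, or None if the point is behind the camera.
    """
    R, t = camera["R"], camera["t"]
    # World -> camera coordinates: X_c = R * X + t
    xc = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 0:
        return None
    u = camera["f"] * xc[0] / xc[2] + camera["cx"]
    v = camera["f"] * xc[1] / xc[2] + camera["cy"]
    return (u, v)


def mean_reprojection_error(points3d, observations, camera):
    """Mean pixel distance between projected 3D points and observed 2D points."""
    errors = []
    for point, observed in zip(points3d, observations):
        projected = project(point, camera)
        if projected is None:
            continue  # skip points that are not visible in this view
        errors.append(hypot(projected[0] - observed[0],
                            projected[1] - observed[1]))
    # With no visible points there is no evidence of consistency.
    return sum(errors) / len(errors) if errors else float("inf")


def is_consistent(points3d, observations_per_view, cameras, threshold_px=2.0):
    """Accept a set of views only if every view reprojects within the threshold."""
    return all(
        mean_reprojection_error(points3d, observations, camera) <= threshold_px
        for observations, camera in zip(observations_per_view, cameras)
    )
```

In such a setup, the 3D points would come from a reconstruction of the generated views and the observations from feature matches in each view; a set of views that fails the check could be rejected or regenerated, which leaves the foundation model itself untouched.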