Recent advances in diffusion models make it possible to render realistic videos. Upon closer inspection, however, such videos often contain artifacts: scene content appears and disappears, and parts of the scene that should be static can deform significantly. Precisely controlling the viewpoint and camera motion of the video is also challenging. In this project I will modify the video diffusion sampling process so that the rendered video is grounded in a 3D model of the scene. This is done by iteratively fitting a 3D model to the current sample of the diffusion model and using renders of that 3D model as a guidance signal that steers the sample towards the space of videos consistent with a single 3D scene. The method is training-free and does not require fine-tuning the diffusion model. It will be evaluated on the Tanks and Temples dataset using several baseline video diffusion models.
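To make the intended sampling loop concrete, the sketch below shows one way such 3D grounding could be wired into a DDIM-style sampler. It is only an illustrative sketch of the idea, not the finalized method: the helper callables `fit_scene` and `render_scene`, the `guidance_weight` parameter, and the linear blend of the predicted clean video towards its 3D render are all assumptions introduced here.

```python
# Minimal sketch of 3D-grounded guided sampling (assumptions noted above).
import torch

def grounded_sample(eps_model, fit_scene, render_scene, poses,
                    alphas_cumprod, shape, guidance_weight=1.0, device="cuda"):
    # eps_model(x, t): video diffusion noise predictor
    # fit_scene(frames, poses): fits a 3D scene model to predicted frames (hypothetical)
    # render_scene(scene, poses): renders the fitted scene from the same poses (hypothetical)
    x = torch.randn(shape, device=device)  # (T, C, H, W) video noise
    for t in range(len(alphas_cumprod) - 1, -1, -1):
        a_t = alphas_cumprod[t]
        eps = eps_model(x, t)                                 # predicted noise
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()   # predicted clean video

        # Fit a 3D scene to the current prediction, then render it back.
        scene = fit_scene(x0_pred.detach(), poses)
        x0_render = render_scene(scene, poses)

        # Steer the prediction towards the space of renders from a single 3D scene.
        x0_guided = x0_pred + guidance_weight * (x0_render - x0_pred)

        # Deterministic DDIM-style update using the guided prediction.
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0, device=device)
        x = a_prev.sqrt() * x0_guided + (1 - a_prev).sqrt() * eps
    return x
```

No pretrained model is modified here; the grounding enters only through the per-step blend of the predicted video with the 3D render, which is what makes the approach training-free in spirit.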