PhyGeo-World: A Physics- and Geometry-Consistent World Model for Embodied AI
Diffusion-based video generation models have achieved remarkable breakthroughs in visual fidelity, producing high-resolution, photorealistic content. However, when applied to closed-loop simulation for Embodied AI and Autonomous Driving, these models reveal critical limitations. Most current architectures function primarily as "2D pixel probability predictors" rather than true "world simulators." Consequently, they lack 3D geometric consistency and physical grounding: generated sequences frequently violate kinematic laws, exhibit unrealistic object deformations, or ignore fundamental principles such as volume conservation. Because real-world trial-and-error for autonomous systems is prohibitively expensive and risky, there is an urgent need for a "parallel testing environment" that accurately reflects physical laws and responds in real time to control commands.
This project proposes PhyGeo-World, a unified world model designed to bridge the gap between visual synthesis and physical reality through the following core methodologies:
1. Explicit Spatial Memory via 3DGS: Instead of relying solely on the implicit memory of neural networks, we integrate 3D Gaussian Splatting (3DGS) as an explicit spatial memory module. By decoupling static backgrounds from dynamic foregrounds, we resolve multi-view inconsistency at its source and enable precise scene manipulation.
2. Physics-Informed Latent Constraints: We develop a dedicated Physics Calibrator that translates Newtonian mechanics and simplified Navier-Stokes equations into soft constraints within the latent space. This ensures that generated motion adheres to momentum conservation and volume consistency.
3. Advanced Conditioning & Control: A high-precision Pose Encoder is implemented to map 6-DoF camera extrinsics into the SVD-XT latent space. Furthermore, Reinforcement Learning (RL) Alignment is applied so that the model provides reliable, closed-loop feedback for end-to-end policy training.
4. Curriculum Learning Framework: The model is trained through a five-stage progressive pipeline, scaling from visual foundation building to complex physical-cognitive injection and final RL-based trajectory optimization.
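The static/dynamic decoupling in item 1 can be illustrated with a minimal sketch. The class below is a hypothetical simplification (positions only, a single per-frame translation for all foreground primitives); the actual 3DGS memory would store full covariances, opacities, and per-object scene flow.

```python
from dataclasses import dataclass

@dataclass
class Gaussian:
    """A single 3D Gaussian primitive (position only, for brevity)."""
    position: tuple          # (x, y, z) centre
    dynamic: bool = False    # True if the primitive belongs to a moving object

class ExplicitSpatialMemory:
    """Hypothetical sketch of the static/dynamic split described above.

    Static Gaussians are written once and reused across frames, which is
    what gives multi-view consistency; only dynamic Gaussians are moved
    by the predicted motion each step.
    """
    def __init__(self):
        self.static = []   # background Gaussians, frozen after reconstruction
        self.dynamic = []  # foreground Gaussians, updated every frame

    def insert(self, g: Gaussian):
        (self.dynamic if g.dynamic else self.static).append(g)

    def step(self, flow):
        """Advance every dynamic primitive by the translation `flow`
        predicted for this frame (per-object flow omitted for brevity)."""
        dx, dy, dz = flow
        for g in self.dynamic:
            x, y, z = g.position
            g.position = (x + dx, y + dy, z + dz)

    def all_gaussians(self):
        """Static background first, then foreground, ready for splatting."""
        return self.static + self.dynamic
```

Because the background set is never rewritten between frames, two renders of the same static region from different camera poses are guaranteed to splat identical primitives.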
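To make the soft-constraint idea in item 2 concrete, here is a hypothetical stand-in for the Physics Calibrator: a momentum-consistency penalty over decoded object trajectories. The finite-difference formulation and the closed-system assumption are illustrative choices, not the project's actual loss; in practice the constraint would act on latent features.

```python
def momentum_consistency_loss(positions, masses, dt=0.1):
    """Soft momentum-conservation penalty over a generated rollout.

    positions: list of frames; each frame is a list of (x, y, z) per object.
    masses:    per-object mass in kg.
    Velocities are finite differences; the loss is the squared change of
    total momentum between consecutive intervals, which vanishes for a
    closed system obeying Newton's laws.
    """
    def momentum(f0, f1):
        # Total momentum vector over the interval f0 -> f1.
        return [sum(m * (b[k] - a[k]) / dt for m, a, b in zip(masses, f0, f1))
                for k in range(3)]

    loss = 0.0
    for t in range(len(positions) - 2):
        p0 = momentum(positions[t], positions[t + 1])
        p1 = momentum(positions[t + 1], positions[t + 2])
        loss += sum((u - v) ** 2 for u, v in zip(p0, p1))
    return loss
```

A rollout with uniform motion incurs zero penalty, while an object that spontaneously accelerates (an external force the scene does not explain) is penalized, nudging the generator toward physically plausible motion.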
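For item 3, one common way to lift a 6-DoF pose into a conditioning vector is a Fourier feature embedding, sketched below. The sinusoidal scheme and frequency ladder are assumptions for illustration; the project's Pose Encoder targeting the SVD-XT latent space would additionally include learned projection layers.

```python
import math

def encode_pose(translation, rotation, num_freqs=4):
    """Hypothetical Fourier embedding of a 6-DoF camera pose.

    translation: (x, y, z) in metres.
    rotation:    (roll, pitch, yaw) in radians.
    Returns a fixed-width feature vector (length 6 * num_freqs * 2) that a
    diffusion backbone could consume as conditioning.
    """
    pose = list(translation) + list(rotation)  # 6 raw degrees of freedom
    features = []
    for value in pose:
        for i in range(num_freqs):
            freq = 2.0 ** i                    # geometric frequency ladder
            features.append(math.sin(freq * value))
            features.append(math.cos(freq * value))
    return features
```

The multi-frequency expansion lets the downstream network resolve both coarse camera placement and fine sub-metre, sub-degree adjustments from the same vector.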
The primary objective of this project is to establish a high-fidelity synthetic data engine and safety testbed for robotics and autonomous vehicles, with the following expected outcomes:
1. Technical Milestones: We expect to significantly reduce trajectory prediction errors on the nuScenes validation set and achieve superior geometric consistency compared to baseline diffusion models.
2. Scientific Impact: The project will provide a robust framework for generating "long-tail" safety-critical scenarios that are difficult to capture in the real world.
3. System Validation: The final model will be integrated into simulation platforms like CARLA and Bench2Drive to evaluate the task success rate of embodied agents within the generated environment, proving its utility as a reliable proxy for real-world physics.