The project aims to develop and evaluate a world-model-based reinforcement learning framework for embodied autonomous systems, focusing on uncertainty-aware state representations and real-time decision-making.
To support robust perception and environment encoding, the project will employ a pre-trained DINOv3 vision foundation model for image and video understanding. The model will be used in inference-only mode to extract embeddings, providing stable and semantically rich representations of visual scenes.
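As a rough illustration of this inference-only use, the sketch below extracts pooled DINOv3 embeddings through the Hugging Face transformers API. The checkpoint identifier, device placement, and pooling choice are illustrative assumptions, not fixed project decisions.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

MODEL_ID = "facebook/dinov3-vitb16-pretrain-lvd1689m"  # assumed checkpoint name; adjust to the actual release

processor = AutoImageProcessor.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval().to("cuda")

@torch.no_grad()  # inference-only: weights stay frozen, no gradient buffers
def extract_embeddings(images: list[Image.Image]) -> torch.Tensor:
    """Return one pooled embedding per image, shape (batch, hidden_dim)."""
    inputs = processor(images=images, return_tensors="pt").to("cuda")
    outputs = model(**inputs)
    # Mean-pool the patch tokens into a single scene-level vector;
    # per-patch tokens could be kept instead for spatially structured models.
    return outputs.last_hidden_state.mean(dim=1)
```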
The extracted representations will serve as input to a latent world model for prediction, imagination, and policy learning, enabling the agent to reason about future states without direct environment access.
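The exact world-model architecture is left open here; the following is a minimal sketch of the intended interface, using a simplified deterministic recurrent dynamics model over frozen DINOv3 embeddings. An uncertainty-aware variant (e.g., a stochastic RSSM) would replace the deterministic state with a distributional one while keeping the same input/output contract. All dimensions and the `policy` callable are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """Deterministic recurrent dynamics over frozen DINOv3 embeddings."""

    def __init__(self, embed_dim: int = 768, action_dim: int = 8, latent_dim: int = 256):
        super().__init__()
        self.encode = nn.Linear(embed_dim, latent_dim)   # embedding -> latent input
        self.dynamics = nn.GRUCell(latent_dim + action_dim, latent_dim)
        self.decode = nn.Linear(latent_dim, embed_dim)   # latent state -> predicted next embedding

    def step(self, embedding, action, hidden):
        """One transition: observe an embedding, apply an action, predict the next embedding."""
        hidden = self.dynamics(torch.cat([self.encode(embedding), action], dim=-1), hidden)
        return self.decode(hidden), hidden

    def imagine(self, hidden, policy, horizon: int = 10):
        """Roll forward in latent space only -- no environment access ("imagination")."""
        trajectory = []
        pred = self.decode(hidden)  # embedding implied by the current latent state
        for _ in range(horizon):
            action = policy(hidden)                 # `policy` is an assumed callable on the latent state
            pred, hidden = self.step(pred, action, hidden)  # feed the model's own prediction back in
            trajectory.append((pred, action))
        return trajectory
```

The `imagine` rollout is what allows policy learning without querying the real environment: predicted embeddings are fed back as inputs for the next step.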
GPU resources are required primarily for large-scale DINOv3 inference over image and video datasets; training itself is confined to lightweight downstream components (the world model and policy heads), with the DINOv3 encoder kept frozen throughout.
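This compute split can be made concrete with a short sketch, assuming the `extract_embeddings` helper and `LatentWorldModel` from the sketches above: a one-off, GPU-heavy pass caches embeddings to disk, after which training touches only the small downstream modules. Names and paths are illustrative.

```python
import torch
from torch.utils.data import DataLoader

def precompute_embeddings(dataset, out_path: str, batch_size: int = 256):
    """GPU-heavy one-off pass: run frozen DINOv3 over the full dataset."""
    # collate_fn=list keeps each batch as a plain list of PIL images,
    # matching the `extract_embeddings` helper sketched earlier.
    loader = DataLoader(dataset, batch_size=batch_size, collate_fn=list)
    chunks = [extract_embeddings(images).cpu() for images in loader]
    torch.save(torch.cat(chunks), out_path)

# Downstream training then reads the cache and never loads DINOv3 again:
#   embeddings = torch.load("embeddings.pt")
#   world_model = LatentWorldModel()
#   ...optimize world_model and the policy on `embeddings` alone...
```

Because the world model and policy heads are small relative to the encoder, this second phase fits on modest hardware once the embeddings are cached.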