The idea that world modeling is required for more robust autonomous intelligence has garnered much attention in the last few years. In this context, a world model refers to a predictive model of the world, with which an autonomous agent can predict the future given the past, conditioned on one or more future actions. In this proposal, we aim to develop methods for efficiently learning world models from observations in a self-supervised manner, while also ensuring that the resulting models are efficient to evaluate at inference time. We leave the incorporation of actions to future work and focus on passive learning.
Prior research on world models has focused on generative modeling, whereby the model is trained to predict the next signal in a sequence, e.g. predicting the next frame of a video. However, the world is filled with aleatoric uncertainty, so a generative model must expend much of its capacity on pixel-level details that are inherently unpredictable. An alternative approach is to learn a latent predictive model. In this case, an encoder is trained jointly with a predictor, whose task is to predict the future latent representation rather than the raw signal. Such latent prediction approaches have been shown to greatly improve learning efficiency, acquiring representations whose performance equals or exceeds that of generative modeling while using an order of magnitude less compute (see I-JEPA). Additionally, latent prediction allows us to learn a hierarchical world model, where prediction happens at increasing time-scales and levels of abstraction.
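To make the latent prediction objective concrete, the following is a minimal PyTorch sketch. The specific architecture choices here are illustrative assumptions rather than the proposed design: the encoder and predictor are stubbed out as small MLPs, the target encoder is an exponential-moving-average copy of the online encoder, and collapse is avoided by stopping gradients on the target branch.

```python
# Minimal sketch of a latent predictive (JEPA-style) objective.
# Assumptions: MLP stand-ins for the encoder/predictor, an EMA target encoder,
# and a stop-gradient on targets to prevent representation collapse.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Placeholder encoder: maps a (flattened) frame to a latent vector."""
    def __init__(self, in_dim, latent_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.GELU(),
                                 nn.Linear(256, latent_dim))
    def forward(self, x):
        return self.net(x)

class Predictor(nn.Module):
    """Predicts the latent of the next frame from the latent of the current one."""
    def __init__(self, latent_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 256), nn.GELU(),
                                 nn.Linear(256, latent_dim))
    def forward(self, z):
        return self.net(z)

@torch.no_grad()
def ema_update(target, online, momentum=0.99):
    """Slowly track the online encoder with the target encoder."""
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.mul_(momentum).add_(p_o, alpha=1.0 - momentum)

def latent_prediction_loss(encoder, target_encoder, predictor, frames_t, frames_t1):
    z_t = encoder(frames_t)                  # context latent
    with torch.no_grad():                    # stop-gradient: no pixel reconstruction target
        z_t1 = target_encoder(frames_t1)     # future latent to be predicted
    return F.smooth_l1_loss(predictor(z_t), z_t1)

# Usage with random tensors standing in for flattened video frames.
in_dim, latent_dim = 3 * 64 * 64, 128
encoder, predictor = Encoder(in_dim, latent_dim), Predictor(latent_dim)
target_encoder = copy.deepcopy(encoder)
opt = torch.optim.AdamW(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)

frames_t, frames_t1 = torch.randn(8, in_dim), torch.randn(8, in_dim)
loss = latent_prediction_loss(encoder, target_encoder, predictor, frames_t, frames_t1)
opt.zero_grad(); loss.backward(); opt.step()
ema_update(target_encoder, encoder)
```

A hierarchical variant would stack such predictors, with higher levels predicting further into the future over coarser latents; that extension is omitted here for brevity.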
The second desideratum for the world model is that it be efficient to evaluate at inference time. This is necessary because planning will amount to multi-step optimization and/or discrete search over latent variables and actions. As such, the approach must feature a learnable method of compressing observations into their bare-essential components, for example the objects in a scene. Fortunately, prior research exists on this topic, Slot Attention being a notable example that yields compositional representations.
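The sketch below illustrates the Slot Attention mechanism (after Locatello et al., 2020) that we intend to build on: a small set of slots iteratively competes for input features, yielding a compact, compositional summary of an observation. The hyperparameters (number of slots, dimensions, iterations) are placeholder assumptions.

```python
# Slot Attention sketch: compress N patch features into K object-centric slots.
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    def __init__(self, num_slots=7, dim=64, iters=3, hidden_dim=128):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim, bias=False) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, dim))
        self.norm_inputs, self.norm_slots, self.norm_mlp = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, inputs):                       # inputs: (B, N, dim) encoder features
        B, N, D = inputs.shape
        inputs = self.norm_inputs(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        # Sample initial slots from a learned Gaussian.
        slots = self.slots_mu + self.slots_logsigma.exp() * torch.randn(B, self.num_slots, D, device=inputs.device)
        for _ in range(self.iters):
            slots_prev = slots
            q = self.to_q(self.norm_slots(slots))
            attn = torch.softmax(torch.einsum('bkd,bnd->bkn', q, k) * self.scale, dim=1)  # slots compete per input
            attn = attn / attn.sum(dim=-1, keepdim=True)                                  # weighted mean over inputs
            updates = torch.einsum('bkn,bnd->bkd', attn, v)
            slots = self.gru(updates.reshape(-1, D), slots_prev.reshape(-1, D)).reshape(B, self.num_slots, D)
            slots = slots + self.mlp(self.norm_mlp(slots))
        return slots                                  # (B, num_slots, dim) compositional representation

# Usage: pool a grid of patch features from one video frame into slots.
features = torch.randn(2, 196, 64)                    # e.g. 14x14 patch tokens
slots = SlotAttention()(features)                     # -> (2, 7, 64)
```

The key design choice for our purposes is that planning and latent prediction then operate over a handful of slots rather than hundreds of patch tokens, which is what makes inference-time search tractable.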
Preliminary results from combining hierarchical latent prediction with compositional representations show significant promise. For example, the model has been observed to learn, entirely self-supervised, to track objects, entities, and their parts (e.g. limbs) in videos, biasing its attention towards dynamic components when the latent resolution is low.
To summarize, we aim to develop efficient methods for pretraining world models that are themselves efficient to evaluate. These methods will primarily be trained and evaluated on video, with the potential for multi-modal video-audio modeling. The datasets we will use include long-form video datasets such as Kinetics (500GB), Charades, ActivityNet, and the egocentric Walking Tours. Evaluation will be conducted through a mixture of qualitative and quantitative methods. Our proposed method allows us to explicitly see how the model has factored its observations, giving unique qualitative insight into what it chooses to focus on. Attentional probing will allow us to quantitatively determine how well the model's learned representations encode information, e.g. by classifying the action category of a video. The proposed models will be implemented as several Transformer models in PyTorch, using Flash Attention for higher throughput.
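As a sketch of the quantitative evaluation, the probe below implements attentional probing in one common form: a single learned query cross-attends over the frozen model's tokens and a linear head predicts the action class, with only the probe's parameters trained. The dimensions and the 400-way class count (as in Kinetics-400) are illustrative assumptions.

```python
# Attentive-probe sketch: evaluate frozen representations via cross-attention pooling.
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    def __init__(self, dim=128, num_heads=4, num_classes=400):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)   # learned pooling query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):                       # tokens: (B, N, dim) frozen representations
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)     # cross-attention pooling over tokens
        return self.head(pooled.squeeze(1))          # (B, num_classes) logits

# Usage: the world model is frozen; only the probe is optimized on labeled clips.
with torch.no_grad():
    tokens = torch.randn(4, 256, 128)                # stand-in for frozen model outputs
logits = AttentiveProbe()(tokens)
```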