In the Robust RL setting, there are several definitions of robustness. We start by defining an MDP $M = \langle S, A, P, R, \gamma, s_0 \rangle$, where $S$ is the state space, $A$ is the action space, $P$ is the transition probability function, $R$ is the reward function, $\gamma$ is the discount factor, and $s_0$ is the initial state distribution. In this work, we consider a Model-Based RL scenario. In Model-Based methods, the target domain is approximated by a source (or simulation) domain, and the agent learns an optimal policy $\pi^\ast$ by interacting with the simulated environment. Model-Based algorithms face two main, closely related challenges:
- the gap between the source and target domains: the simulated environment inevitably contains modeling errors that degrade the performance of the learned policy once it is transferred to the target environment;
- the hallucinations caused by compounding prediction errors: when a function approximator such as a neural network is used to learn the transition function of the MDP, the state predicted at time $t$ is fed back as input to predict the state at time $t+1$. Prediction errors therefore accumulate along the trajectory, driving long-horizon predictions far from reality (see the sketch after this list).
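To make the compounding-error issue concrete, the sketch below shows an autoregressive rollout with a learned one-step model. The names `dynamics_model`, `policy`, `s0`, and `horizon` are hypothetical placeholders for the learned transition function, the current policy, the initial state, and the rollout length; they do not refer to any specific implementation.

```python
import numpy as np

def rollout_single_step(dynamics_model, policy, s0, horizon):
    """Autoregressive rollout with a learned one-step dynamics model.

    Each predicted state is fed back as the input for the next prediction,
    so per-step errors accumulate over the horizon.
    """
    states, s = [s0], s0
    for _ in range(horizon):
        a = policy(s)              # action from the current policy
        s = dynamics_model(s, a)   # prediction error enters here ...
        states.append(s)           # ... and is reused at the next step
    return np.stack(states)
```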
Diffusion models have been employed to mitigate the damage caused by compounding errors. When using diffusion to model the environment dynamics, all trajectory steps can be generated concurrently, which prevents single-step prediction errors from accumulating. However, diffusion models are still far from immune to modeling errors: the approximated trajectory distribution will not be identical to the one induced by the real environment, and these modeling errors still hinder the transfer of the learned policy to the target domain. So, how can we improve the policy's robustness to modeling errors? Can we improve its performance when transferred to the unseen target environment?
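The following sketch illustrates, under simplifying assumptions, how a diffusion model generates a whole trajectory jointly rather than autoregressively. The `denoiser` network, the pure-noise initialization, and the clamping of the known initial state are illustrative assumptions; the actual noise schedule and conditioning mechanism are omitted here.

```python
import numpy as np

def sample_trajectory(denoiser, s0, horizon, state_dim, n_diffusion_steps=50):
    """Sketch of diffusion-style trajectory generation.

    `denoiser(traj, k)` is a hypothetical learned network that returns a less
    noisy version of the whole trajectory at diffusion step k. All states are
    refined jointly, so no predicted state is fed back into a one-step model.
    """
    traj = np.random.randn(horizon, state_dim)  # start from pure noise
    for k in reversed(range(n_diffusion_steps)):
        traj = denoiser(traj, k)                # denoise the full trajectory at once
        traj[0] = s0                            # clamp the known initial state
    return traj
```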
In this paper, we aim to make our policy more robust to modeling errors by optimizing a Conditional Value at Risk (CVaR) objective under our diffusion-based model of the environment.
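For a return random variable $Z$ and risk level $\alpha \in (0, 1]$, the lower-tail Conditional Value at Risk is $\mathrm{CVaR}_\alpha(Z) = \mathbb{E}\left[ Z \mid Z \le \mathrm{VaR}_\alpha(Z) \right]$ (for continuous return distributions), i.e., the expected return over the worst $\alpha$-fraction of outcomes. As a rough illustration of how such an objective could be estimated from model rollouts (not our exact training procedure), the sketch below computes the empirical CVaR of returns sampled from a learned model; `returns` and `alpha` are illustrative names.

```python
import numpy as np

def empirical_cvar(returns, alpha=0.1):
    """Average of the worst alpha-fraction of sampled returns.

    `returns` is a 1-D array of returns obtained by rolling out the policy on
    trajectories sampled from the (imperfect) learned model. Optimizing this
    tail average instead of the mean is one way to hedge against modeling
    errors, since it emphasizes the worst modeled outcomes.
    """
    returns = np.asarray(returns, dtype=float)
    var_alpha = np.quantile(returns, alpha)   # Value at Risk at level alpha
    tail = returns[returns <= var_alpha]      # the worst alpha-fraction
    return tail.mean()
```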