Latent-Action Model Predictive Control with Online Failure Recovery and Affordance-Aware Planning

Dnr:

NAISS 2025/22-889

Type:

NAISS Small Compute

Principal Investigator:

Victor Aregbede

Affiliation:

Örebro universitet

Start Date:

2025-06-11

End Date:

2026-07-01

Primary Classification:

10210: Artificial Intelligence

Webpage:

Allocation

Alvis at C3SE: 500 GPU-h/month
Mimer at C3SE: 500 GiB

Abstract

Robotic vision-language-action (VLA) models such as UniVLA convert camera images and natural-language instructions into high-level, discrete “latent-action” tokens that can be reused across different robot embodiments. Yet current deployments execute those tokens greedily, one step at a time. When an object slips, lighting changes, or an unexpected obstacle appears, the robot has no look-ahead and simply fails.

We propose UniVLA-MPC, a control stack that transforms any frozen VLA policy into a real-time model-predictive controller (MPC) that (i) plans several latent moves ahead, (ii) automatically discards physically impossible actions, and (iii) learns online to foresee failure before it happens. The three key ideas are:

Fast latent-space planning. Instead of sampling torques in a simulator, we sample short sequences of latent tokens and roll them forward with UniVLA’s own decoder in DINO-v2 feature space. A single modern GPU evaluates tens of thousands of sequences in under 30 ms, enabling a 25 Hz closed loop with no physics engine.

Self-supervised affordance filter. Because UniVLA’s decoder was trained only on valid motions, its one-step reconstruction error is low for feasible tokens and high for impossible ones. We keep tokens whose error falls below an adaptive threshold, shrinking the search space by 60–80 % with negligible overhead. A subsequent CLIP similarity check keeps only those imagined next frames that are semantically relevant to the language goal.

Online failure anticipation. A small value-prediction head is trained in the background from real rollouts via temporal-difference learning. The planner adds this value “tail” to its cost function, so it re-grips or detours before a slip or collision occurs.
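
To make the planning loop in the abstract more concrete, the following is a minimal sketch of sampling-based planning over discrete latent tokens, with an error-based affordance filter and a terminal value term. It is an illustration under stated assumptions, not the project's implementation: the function names, the quantile-based threshold, and the toy additive dynamics are all hypothetical, and the callables `rollout_fn`, `cost_fn`, and `value_fn` merely stand in for UniVLA's decoder, a goal cost in feature space, and the learned value head. The CLIP relevance check is omitted for brevity.

```python
import numpy as np


def affordance_mask(recon_error, quantile=0.3):
    """Keep tokens whose one-step reconstruction error is at or below an
    adaptive threshold. Using a quantile of the current errors as the
    threshold is an assumption made for this sketch."""
    threshold = np.quantile(recon_error, quantile)
    return recon_error <= threshold


def plan_latent_mpc(state, feasible_tokens, rollout_fn, cost_fn, value_fn,
                    horizon=4, num_samples=1024, rng=None):
    """Sample token sequences, roll them forward, and return the best one.

    rollout_fn(states, tokens) -> next states   (stand-in for the decoder)
    cost_fn(states)            -> per-sample step cost
    value_fn(states)           -> per-sample terminal value (the "tail")
    """
    rng = np.random.default_rng() if rng is None else rng
    # Draw num_samples sequences of length horizon from the feasible tokens.
    seqs = rng.choice(feasible_tokens, size=(num_samples, horizon))
    states = np.repeat(state[None, :], num_samples, axis=0)
    total_cost = np.zeros(num_samples)
    for t in range(horizon):
        states = rollout_fn(states, seqs[:, t])
        total_cost += cost_fn(states)
    # A high predicted value at the end of the horizon lowers the cost.
    total_cost -= value_fn(states)
    best = int(np.argmin(total_cost))
    # Receding horizon: only the first token of the best sequence is executed.
    return int(seqs[best, 0]), seqs[best], float(total_cost[best])


# Toy stand-ins (hypothetical): each token shifts a 2-D feature vector.
codebook = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
goal = np.array([2.0, 0.0])


def toy_rollout(states, tokens):
    return states + codebook[tokens]


def toy_cost(states):
    return np.linalg.norm(states - goal, axis=1)


def toy_value(states):
    return np.zeros(len(states))  # an untrained value head predicts nothing


recon_error = np.array([0.05, 0.90, 0.10, 0.20])  # token 1 is "impossible"
feasible = np.flatnonzero(affordance_mask(recon_error, quantile=0.7))
first_token, best_seq, best_cost = plan_latent_mpc(
    np.zeros(2), feasible, toy_rollout, toy_cost, toy_value,
    horizon=2, num_samples=256, rng=np.random.default_rng(0))
```

In this toy setting the filter removes the high-error token before any sequences are sampled, so the planner only spends rollouts on candidates the decoder reconstructs well. In a real closed loop the function would be called once per control step, with the remaining tokens of the best sequence discarded and re-planned from the next observation.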