Understanding Action Representations in Vision-Language-Action Models

Dnr:

NAISS 2026/4-1120

Type:

NAISS Small

Principal Investigator:

Yufei Duan

Affiliation:

Kungliga Tekniska högskolan

Start Date:

2026-06-16

End Date:

2027-07-01

Primary Classification:

20201: Robotics and automation

Webpage:

Allocation

Arrhenius Disk at NAISS: 500 GiB
Arrhenius GPU at NAISS: 250 GPU-h/month
Arrhenius Flash at NAISS: 250 GiB

Abstract

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for general-purpose robot learning by integrating visual perception, language understanding, and action generation within a unified framework. While recent progress has largely focused on scaling model size, training data, and multimodal representations, comparatively little attention has been given to the action space itself. Since actions constitute the final interface between high-level reasoning and physical interaction, their representation can significantly influence policy performance, robustness, and generalization.

This project aims to revisit VLA models from an action-centric perspective. The primary objective is to investigate how different action representations, action parameterizations, and temporal abstractions affect robotic manipulation performance. The study will systematically evaluate existing VLA architectures and explore alternative action formulations that may improve policy stability, control precision, and cross-task generalization.

Experiments will be conducted using established robotic manipulation benchmarks, including the LIBERO benchmark suite, which provides a diverse set of language-conditioned tasks designed to evaluate generalization across tasks and environments. Multiple VLA models and action representations will be assessed under controlled experimental conditions.

The expected outcome is a deeper understanding of the relationship between action representations and embodied intelligence, together with recommendations for future VLA architectures. The results will contribute to the development of more reliable and efficient robot learning systems capable of generalizing across diverse manipulation tasks.

Main supervisor: Danica Kragic, Kungliga Tekniska högskolan (KTH Royal Institute of Technology).