Robotics foundation models for general-purpose manipulation skills

Dnr: NAISS 2026/3-376
Type: NAISS Medium
Principal Investigator: Sichao Liu
Affiliation: Kungliga Tekniska högskolan
Start Date: 2026-07-01
End Date: 2027-01-01
Primary Classification: 20201: Robotics and automation
Webpage: https://sites.google.com/view/sichao-liu/home

Allocation

Arrhenius Disk at NAISS: 6000 GiB
Arrhenius GPU at NAISS: 5000 GPU-h/month

Abstract

The advent of Generative Pre-trained Transformers (GPTs) marked a pivotal advance in modern AI. By drawing on vast amounts of internet data, GPTs made AI models far more versatile. Foundation models are neural networks “pre-trained” on massive amounts of data without a specific use case in mind, and they now power large language models (LLMs) such as ChatGPT. Robotics is a leading frontier in this evolution, poised to bring efficiency gains to the physical world comparable to those of the digital transformation.

Robotics foundation models, trained on diverse datasets that combine internet-derived data with real-world physical interactions, are a significant step towards AI models that can handle the complexity of real-world environments. Progress on these models is accelerating, driven by access to extensive and varied robotic data from real-world production settings.

This project uses state-of-the-art generative AI approaches, such as language and vision models, to build robotics foundation models for general-purpose manipulation skills in robotic applications. The main tasks of the project are:

• Fine-tuning a multimodal LLM for robotic manipulation on autonomous mobile robotic systems, supported by high-performance GPUs such as NVIDIA GH200 (Grace Hopper) with 96 GB HBM + 128 GB LPDDR per GPU.
• Pre-training foundation models on massive datasets that include text, images and videos, using GPUs.
• Training a robotics foundation model for general-purpose manipulation skills across a wide range of robotic applications, built on the framework of a state-of-the-art open language model.
• Using reinforcement learning and world models for robot policy training on large-scale robotic datasets such as Open X-Embodiment and RT-X. The trained policies are then deployed on real-world robots to evaluate their robustness and reliability and to test generalisation across different tasks and robots.
• Using reinforcement learning and imitation learning for post-training of large-scale vision-language models (VLMs) on large-scale datasets such as RobotVQA. The aim is to strengthen the VLMs' reasoning, thinking and decision-making capabilities, which are then used for high-level task planning, tool use and function calling.

Finally, the vision-language-action models serve as low-level action executors for task execution on real-world robots. The models and algorithms we plan to develop or apply are GPU-based. This project combines robotics and AI research. We will acknowledge NAISS SUPR's support for our research if our proposal is approved.