Agent in the room: Robots handling unforeseen interactions
Dnr:

NAISS 2025/22-128

Type:

NAISS Small Compute

Principal Investigator:

Ronald Cumbal

Affiliation:

Uppsala universitet

Start Date:

2025-02-13

End Date:

2025-11-01

Primary Classification:

10204: Human Computer Interaction (Social aspects at 50804)

Webpage:


Abstract

The integration of robots into shared human environments requires not only ensuring physical safety but also fostering socially acceptable behavior. To operate effectively in dynamic contexts, robots must handle unexpected interactions with humans or other agents while pursuing their objectives. A critical challenge in this domain is enabling robots to interpret the intentions of interacting agents and adapt their task plans accordingly. This study investigates the use of large language models (LLMs) to equip robots with the ability to navigate scenarios involving unforeseen interactions by dynamically updating their plans based on multimodal contextual input.

This work focuses on designing a system that leverages multimodal LLMs to estimate and clarify the intentions of humans and other agents, enabling real-time adaptation to disturbances. The proposed system comprises three core stages: (1) estimating the goals of both the robot and interacting agents, (2) developing and iteratively refining a resolution plan that considers these goals, and (3) translating the plan into low-level robotic controls. By concentrating on optimizing the first stage, the study aims to enhance robots’ ability to interpret multimodal signals, such as visual and linguistic cues, and to reason about contextual information through self-reflection and chain-of-thought techniques.

This work highlights the advantages of integrating vision-language models (VLMs) into the initial stages of task planning. Traditional approaches to intention estimation, such as classifiers, lack the flexibility to perform self-reflection and cross-modal reasoning, limiting their ability to adapt to general contexts. In contrast, VLMs enable richer multimodal reasoning by combining pre-trained vision encoders with LLMs. However, current VLM architectures predominantly rely on early fusion strategies that project vision features into token inputs, which constrains reasoning across modalities. This limitation is particularly evident in video contexts, where strategies such as frame sampling often fail to capture dynamic changes effectively. By incorporating self-reflection capabilities, robots can better analyze video data and prioritize the aspects most relevant for decision-making. This approach lays the groundwork for more adaptive and socially aware robots, capable of seamlessly integrating into everyday human spaces while maintaining robust task performance.
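
To make the three-stage pipeline concrete, the minimal Python sketch below shows how the stages could be wired together, with the self-reflection loop of stage (1) made explicit. It is an illustration under assumed interfaces only: every name (Observation, query_vlm, estimate_intentions, refine_plan, to_low_level_controls) is a hypothetical placeholder introduced here, and the VLM call is stubbed with a canned response rather than a call to an actual multimodal model.

# Illustrative sketch only (hypothetical names, stubbed model call); it mirrors
# the three stages described in the abstract but is not the project's code.

from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    frames: List[str]   # paths to sampled video frames (visual cues)
    utterance: str      # linguistic cue from the interacting agent


@dataclass
class IntentionEstimate:
    agent_goal: str
    confidence: float
    rationale: str      # chain-of-thought summary, kept for self-reflection


def query_vlm(prompt: str, frames: List[str]) -> str:
    # Placeholder for a multimodal LLM/VLM call; returns a canned answer so
    # the sketch runs end to end without any model dependency.
    return "goal: hand over the object | confidence: 0.7 | rationale: reaching gesture"


def estimate_intentions(obs: Observation, reflection_rounds: int = 2) -> IntentionEstimate:
    # Stage 1: estimate the interacting agent's goal, then let the model
    # critique and revise its own answer over the same frames (self-reflection).
    answer = query_vlm(f"The agent said: '{obs.utterance}'. What is their goal?", obs.frames)
    for _ in range(reflection_rounds):
        answer = query_vlm(f"Previous estimate: {answer}. Re-examine the frames and revise if needed.", obs.frames)
    goal, conf, rationale = [part.split(":", 1)[1].strip() for part in answer.split("|")]
    return IntentionEstimate(agent_goal=goal, confidence=float(conf), rationale=rationale)


def refine_plan(robot_goal: str, estimate: IntentionEstimate) -> List[str]:
    # Stage 2: draft a resolution plan that respects both goals. A real system
    # would iteratively re-query the LLM with the draft; here it is a fixed template.
    return [
        f"pause current task: {robot_goal}",
        f"address agent goal: {estimate.agent_goal}",
        f"resume task: {robot_goal}",
    ]


def to_low_level_controls(plan: List[str]) -> List[str]:
    # Stage 3: map each abstract plan step to a (mock) low-level command.
    return [f"EXEC[{step}]" for step in plan]


if __name__ == "__main__":
    obs = Observation(frames=["frame_000.jpg", "frame_015.jpg"], utterance="Could you pass me that?")
    intent = estimate_intentions(obs)
    plan = refine_plan("deliver package to room 2", intent)
    print(to_low_level_controls(plan))

The emphasis on estimate_intentions mirrors the project's focus on the first stage; in a working system the stub would be replaced by an actual multimodal model call, and refine_plan would re-query the model with the draft plan rather than return a fixed template.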