SUPR
BELLA: Building Expressive Language for Likeable Agents
Dnr:

NAISS 2025/22-767

Type:

NAISS Small Compute

Principal Investigator:

Ricardo Caldas Santana

Affiliation:

Kungliga Tekniska högskolan

Start Date:

2025-06-01

End Date:

2026-06-01

Primary Classification:

10204: Human Computer Interaction (Social aspects at 50804)

Webpage:


Abstract

While LLMs have advanced complex communication, they cannot effectively process or produce non-verbal signals such as facial expressions, gestures, and gaze, which hinders natural human-machine communication. This project aims to bridge that gap by integrating multimodal social cues into foundation models to develop more sophisticated, embodied AI systems. The research tackles three key scientific challenges: integrating multimodal perception by fusing foundation models with embodied and task-specific tags, generating appropriate verbal and non-verbal responses, and developing novel metrics for comprehensive evaluation in real-world scenarios. The core scientific approach is to transform non-verbal and task-based contextual information into tokens that fine-tuned LLMs can interpret alongside verbal input, enhancing their social perception and decision-making. The methodology involves testing methods for tokenizing multimodal environments, using these data to improve dialogue models, and enabling the models to generate embodied multimodal behavior. Case studies focus on enjoyment perception in robot-elderly interactions and on multimodal behavior generation in collaborative game settings. By advancing social robots and virtual agents capable of effective human engagement, this work promises novel applications in human-centered domains such as collaborative manufacturing, education, healthcare, and entertainment, ultimately making human-robot interactions more intuitive and better aligned with human expectations.
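The core approach above, rendering non-verbal context as tokens that a dialogue model can read alongside verbal input, could be sketched roughly as follows. This is a minimal illustrative sketch only: the cue categories, token format, and function names are assumptions for exposition, not the project's actual design.

```python
# Hypothetical sketch: discretize non-verbal cues into special tokens
# that a fine-tuned dialogue LLM could consume alongside the utterance.
from dataclasses import dataclass


@dataclass
class SocialCues:
    """Categorical non-verbal observations for one dialogue turn (illustrative)."""
    gaze: str        # e.g. "at_robot", "averted"
    expression: str  # e.g. "smile", "neutral", "frown"
    gesture: str     # e.g. "nod", "point", "none"


def cues_to_tokens(cues: SocialCues) -> list[str]:
    """Render each cue as a special token string, e.g. '<gaze:at_robot>'."""
    return [
        f"<gaze:{cues.gaze}>",
        f"<expr:{cues.expression}>",
        f"<gesture:{cues.gesture}>",
    ]


def build_multimodal_prompt(utterance: str, cues: SocialCues) -> str:
    """Prepend the cue tokens to the verbal input for the dialogue model."""
    return " ".join(cues_to_tokens(cues) + [utterance])


prompt = build_multimodal_prompt(
    "Shall we play another round?",
    SocialCues(gaze="at_robot", expression="smile", gesture="nod"),
)
print(prompt)
# -> <gaze:at_robot> <expr:smile> <gesture:nod> Shall we play another round?
```

In practice such cue tokens would be registered as additional special tokens in the model's tokenizer before fine-tuning, so the model learns embeddings for them rather than splitting them into sub-words.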