SUPR
Multimodal Conversation Model for Turn-Taking Prediction
Dnr:

NAISS 2025/22-620

Type:

NAISS Small Compute

Principal Investigator:

Haotian Qi

Affiliation:

Kungliga Tekniska högskolan

Start Date:

2025-04-23

End Date:

2026-05-01

Primary Classification:

10204: Human Computer Interaction (Social aspects at 50804)

Webpage:

Allocation

Abstract

Turn-taking refers to the process by which speakers and listeners alternate roles during conversation. In human communication, individuals rely on visual cues—such as eye gaze and facial expressions—and audio signals—such as pauses and contextual speech cues—to anticipate who will speak next. The goal of this project is to develop a multimodal deep learning model that predicts turn-taking in multi-speaker environments using these cues. This study builds on earlier models that relied solely on audio signals by integrating both visual and audio inputs, ultimately aiming to improve turn-taking predictions and facilitate more natural interactions, particularly in human-robot communication contexts.
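
The multimodal integration described in the abstract can be illustrated with a minimal late-fusion sketch. This is not the project's actual architecture; the feature dimensions, the concatenation-based fusion, and the single linear scoring layer are illustrative assumptions standing in for learned deep encoders per modality.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LateFusionTurnTaking:
    """Toy late-fusion predictor (illustrative only): concatenates an
    audio feature vector and a visual feature vector, then applies a
    single linear layer with a sigmoid to score the probability of an
    upcoming speaker change. A real system would replace the random
    weights with learned per-modality deep encoders."""

    def __init__(self, audio_dim, visual_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Random, untrained weights over the fused feature vector.
        self.w = rng.normal(0.0, 0.1, audio_dim + visual_dim)
        self.b = 0.0

    def predict(self, audio_feat, visual_feat):
        # Late fusion: concatenate modality features, then score.
        fused = np.concatenate([audio_feat, visual_feat])
        return sigmoid(self.w @ fused + self.b)

model = LateFusionTurnTaking(audio_dim=40, visual_dim=16)
p = model.predict(np.zeros(40), np.zeros(16))  # probability in (0, 1)
```

With all-zero inputs and zero bias the linear score is 0, so the sketch outputs exactly 0.5, i.e. maximal uncertainty about a turn shift; trained modality encoders would push this toward 0 or 1 as informative cues appear.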