SUPR
Multimodal Conversation Model for Turn-Taking Prediction
Dnr:

NAISS 2025/22-620

Type:

NAISS Small Compute

Principal Investigator:

Haotian Qi

Affiliation:

Kungliga Tekniska högskolan

Start Date:

2025-04-23

End Date:

2026-05-01

Primary Classification:

10204: Human Computer Interaction (Social aspects at 50804)

Webpage:

Allocation

Abstract

Turn-taking refers to the process by which speakers and listeners alternate roles during conversation. In human communication, individuals rely on visual cues—such as eye gaze and facial expressions—and audio signals—such as pauses and contextual speech cues—to anticipate who will speak next. The goal of this project is to develop a multimodal deep learning model that predicts turn-taking in multi-speaker environments using these cues. This study builds on earlier models that relied solely on audio signals by integrating both visual and audio inputs, ultimately aiming to improve turn-taking predictions and facilitate more natural interactions, particularly in human-robot communication contexts.
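
The multimodal integration described in the abstract can be illustrated with a minimal late-fusion sketch. This is not the project's actual architecture; the feature dimensions, the concatenation-based fusion, and the single linear scoring layer are illustrative assumptions standing in for learned deep encoders per modality.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LateFusionTurnTaking:
    """Toy late-fusion predictor (illustrative only): concatenates an
    audio feature vector and a visual feature vector, then applies a
    single linear layer with a sigmoid to score the probability of an
    upcoming speaker change. A real system would replace the random
    weights with learned per-modality deep encoders."""

    def __init__(self, audio_dim, visual_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Random, untrained weights over the fused feature vector.
        self.w = rng.normal(0.0, 0.1, audio_dim + visual_dim)
        self.b = 0.0

    def predict(self, audio_feat, visual_feat):
        # Late fusion: concatenate modality features, then score.
        fused = np.concatenate([audio_feat, visual_feat])
        return sigmoid(self.w @ fused + self.b)

model = LateFusionTurnTaking(audio_dim=40, visual_dim=16)
p = model.predict(np.zeros(40), np.zeros(16))  # probability in (0, 1)
```

With all-zero inputs and zero bias the linear score is 0, so the sketch outputs exactly 0.5, i.e. maximal uncertainty about a turn shift; trained modality encoders would push this toward 0 or 1 as informative cues appear.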