Turn-taking refers to the process by which speakers and listeners alternate roles during conversation. In human communication, individuals rely on visual cues, such as eye gaze and facial expressions, and on audio signals, such as pauses and contextual speech cues, to anticipate who will speak next. The goal of this work is to develop a multimodal deep learning model that predicts turn-taking in multi-speaker environments using these cues. This study builds on earlier models that focused solely on audio signals by integrating both visual and audio inputs, ultimately aiming to improve turn-taking predictions and facilitate more natural interactions, particularly in human-robot communication contexts.
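To make the idea of combining the two modalities concrete, the sketch below shows one possible late-fusion design in PyTorch: separate audio and visual encoders feed a shared recurrent layer that outputs a per-frame turn-shift probability. The encoder sizes, feature dimensions, and fusion strategy are illustrative assumptions, not the specific architecture developed in this study.

```python
# Illustrative sketch only: feature dimensions and fusion strategy are assumptions.
import torch
import torch.nn as nn

class MultimodalTurnTakingModel(nn.Module):
    """Late-fusion model: audio and visual encoders feed a shared GRU that
    predicts, for each frame, the probability of an upcoming turn shift."""

    def __init__(self, audio_dim=40, visual_dim=136, hidden_dim=128):
        super().__init__()
        # Per-modality encoders (e.g., log-mel features for audio,
        # facial landmark / gaze features for video) -- hypothetical dims.
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        self.visual_enc = nn.Sequential(nn.Linear(visual_dim, hidden_dim), nn.ReLU())
        # Temporal model over the concatenated (fused) representation.
        self.rnn = nn.GRU(2 * hidden_dim, hidden_dim, batch_first=True)
        # Frame-level binary prediction: turn shift vs. hold.
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, time, audio_dim); visual_feats: (batch, time, visual_dim)
        fused = torch.cat([self.audio_enc(audio_feats),
                           self.visual_enc(visual_feats)], dim=-1)
        out, _ = self.rnn(fused)
        return torch.sigmoid(self.head(out)).squeeze(-1)  # (batch, time)

if __name__ == "__main__":
    model = MultimodalTurnTakingModel()
    audio = torch.randn(2, 100, 40)    # dummy audio features
    video = torch.randn(2, 100, 136)   # dummy visual features
    probs = model(audio, video)
    print(probs.shape)                 # torch.Size([2, 100])
```

Late fusion is only one option; an audio-only baseline corresponds to dropping the visual branch, which is the kind of comparison the study's integration of both modalities is meant to improve upon.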