Matcha-TTS is currently a fast TTS architecture based on conditional flow matching (CFM). CFM can be viewed as a fast counterpart to diffusion models, which makes it applicable wherever diffusion has been successful, including image synthesis and multimodal synthesis. The project aims to build on these advances in probabilistic synthesis to create a larger and stronger multimodal model.
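
To make the comparison with diffusion concrete, here is a minimal NumPy sketch of the idea behind conditional flow matching. It assumes the straight-line (optimal-transport) probability path commonly used in the flow matching literature. The function names `cfm_training_pair` and `euler_sample` and the parameter `sigma_min` are illustrative choices for this sketch and are not taken from the Matcha-TTS codebase. A network would be trained to regress the target velocity, and sampling then integrates the learned velocity field with a small number of ODE steps.

```python
import numpy as np


def cfm_training_pair(x0, x1, t, sigma_min=1e-4):
    """Build one regression example for conditional flow matching.

    x0: noise sample, x1: data sample, t: time in [0, 1].
    Returns the point x_t on the straight path from x0 to (roughly) x1
    and the target velocity u_t, which is the time derivative of x_t.
    """
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0
    return x_t, u_t


def euler_sample(velocity_fn, x0, n_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise = rng.standard_normal(4)
    data = rng.standard_normal(4)

    # Stand-in for a trained network: the exact target velocity for this pair.
    _, u = cfm_training_pair(noise, data, t=0.5)
    sample = euler_sample(lambda x, t: u, noise, n_steps=4)
    print(np.abs(sample - data).max())  # small: the path ends near the data point
```

Because the target velocity along this path is constant for a given noise and data pair, a handful of integration steps can already land close to the data point, which is the intuition behind describing the approach as fast relative to diffusion samplers that need many steps.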