Controllable Expressive Speech with Adversarially Trained Orthogonal Neural Codec.

Dnr:

NAISS 2025/22-1716

Type:

NAISS Small Compute

Principal Investigator:

Juliana Francis

Affiliation:

Kungliga Tekniska högskolan

Start Date:

2025-12-10

End Date:

2026-12-01

Primary Classification:

10210: Artificial Intelligence

Allocation

Mimer at C3SE: 500 GiB
Alvis at C3SE: 250 GPU-h/month

Abstract

The purpose of this project is to train a new neural codec for speech. The goal is to enable more expressive control in speech synthesis by using multiple layers of codebooks that operate at different levels of speech, together with training methods that ensure orthogonality between these differing scales of codebooks. Speech systems built on this codec could then support both local and global control and transfer of expressiveness. We will also use a secondary predictive model to enforce that each codebook encodes only certain speech features, and adversarially prevent the other codebooks from learning those features. Through this work, we aim to enable more controllable expressive speech synthesis that can be used in a streaming manner for real-time applications. The codec will initially be trained on English, and then in a multilingual setting that includes Swedish, a language that many current models do not cover.
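As an illustration only of what an orthogonality objective between codebooks at different scales could look like, the following minimal sketch computes the mean squared cosine similarity between the entries of two codebooks. The abstract does not specify the loss actually used, so the choice of penalty, the function names, and the toy "global" and "local" codebooks are all assumptions made for this example.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def orthogonality_penalty(codebook_a, codebook_b):
    """Mean squared cosine similarity over all pairs of entries (one from each codebook).

    The value is 0 when every entry of one codebook is orthogonal to every
    entry of the other, and grows as the codebooks share directions.
    This is an assumed penalty, not the project's stated objective.
    """
    a = [normalize(v) for v in codebook_a]
    b = [normalize(v) for v in codebook_b]
    total = 0.0
    for u in a:
        for w in b:
            dot = sum(x * y for x, y in zip(u, w))
            total += dot * dot
    return total / (len(a) * len(b))


# Hypothetical toy codebooks with 4-dimensional entries.
global_codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
local_codebook = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 2.0]]
overlapping_codebook = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

penalty_orthogonal = orthogonality_penalty(global_codebook, local_codebook)
penalty_overlapping = orthogonality_penalty(global_codebook, overlapping_codebook)
print(penalty_orthogonal, penalty_overlapping)
```

In a training setup, a term of this kind would typically be added to the codec's reconstruction loss with a weighting factor, so that codebooks at different scales are discouraged from encoding the same directions. The adversarial constraint described in the abstract is a separate mechanism and is not covered by this sketch.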