The purpose of this project is to train a new neural codec for speech. The goal is to enable more expressive control in speech synthesis by using multiple layers of codebooks that operate at different levels of speech, together with training methods that encourage orthogonality between the codebooks at these different scales. Speech systems built on this codec could then support both local and global control and transfer of expressiveness.
We will also use a secondary predictive model to enforce that only certain speech features are encoded by a given codebook, and adversarially prevent the other codebooks from learning these features.
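As a rough illustration of what an orthogonality constraint between codebooks at different scales might look like, the sketch below penalises the mean squared cosine similarity between every pair of entries drawn from two codebooks. This is only one possible formulation and is not the project's chosen training objective; the function name `cross_codebook_penalty` and the toy codebooks are hypothetical and exist purely for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (assumes it is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("codebook entries must be non-zero")
    return [x / norm for x in vec]


def cross_codebook_penalty(codebook_a, codebook_b):
    """Mean squared cosine similarity over all cross-codebook entry pairs.

    Hypothetical helper: the value is 0.0 when every entry of one codebook
    is orthogonal to every entry of the other, and grows towards 1.0 as
    the two codebooks span the same directions.
    """
    a_unit = [_normalize(v) for v in codebook_a]
    b_unit = [_normalize(v) for v in codebook_b]
    total = 0.0
    for a in a_unit:
        for b in b_unit:
            cos = sum(x * y for x, y in zip(a, b))
            total += cos * cos
    return total / (len(a_unit) * len(b_unit))


# Toy 4-dimensional codebooks (illustrative values only).
local_codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
global_codebook = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
overlapping_codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

print(cross_codebook_penalty(local_codebook, global_codebook))
print(cross_codebook_penalty(local_codebook, overlapping_codebook))
```

In a real training setup a term like this would be computed on learned embeddings inside an automatic-differentiation framework and added to the codec loss with a weighting factor; the pure-Python version here is only meant to make the quantity concrete.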
Through this work, we aim to enable more controllable, expressive speech synthesis that can run in a streaming manner for real-time applications. The codec will initially be trained on English, and then in a multilingual setting that includes Swedish, a language that many current models do not cover.