dc.description.abstract | Foreign accent conversion (FAC) aims to generate a synthetic voice that has the voice identity of a given non-native speaker (NNS) but the pronunciation patterns (i.e., accent) of a native speaker (NS). This synthetic voice is often referred to as a “Golden Speaker” in the computer-assisted pronunciation training literature. Prior FAC algorithms neither fully remove mispronunciations in the original non-native speech nor fully capture the voice quality of the non-native speaker. More importantly, most prior methods require a reference utterance from a native speaker at synthesis time, which limits the application scope of FAC in pronunciation training. This dissertation aims to address these issues by proposing solutions to three interrelated problems:
• Reducing mispronunciations in the accent-converted speech
• Improving the voice similarity between the accent conversions and the NNS
• Removing the need for an NS reference utterance at synthesis time
To address the first problem, I propose an approach that matches frames from the native reference speaker and non-native speakers based on their phonetic similarity. To generate accent conversions, I then use the paired frames to train a Gaussian Mixture Model (GMM) that converts the native reference utterance to match the voice identity of the non-native speaker. The algorithm outperforms earlier methods that match frames based on Dynamic Time Warping or acoustic similarity, improving ratings of acoustic quality and native accent while retaining the voice identity of the non-native speaker. I also show that this approach can be applied to non-parallel training data and achieve comparable performance.
To address the second problem, I develop a sequence-to-sequence speech synthesizer that maps speech embeddings (e.g., phonetic posteriorgrams) from the non-native speaker into the corresponding spectrograms. At inference time, I drive the synthesizer with a speech embedding from an NS reference utterance. Compared with the model presented in the first work, the proposed system produces speech that sounds clearer, more natural, and more similar to the non-native speaker, while significantly reducing the perceived accentedness compared with non-native utterances.
To address the third and final problem, I present a reference-free FAC system. First, I generate a synthetic golden speaker for the non-native speaker using the method proposed in the second work. Then, I train a pronunciation-correction model that maps the non-native speaker utterance into the synthetic golden speaker utterance. Both objective and subjective evaluations show that the reference-free FAC model generates speech that resembles the non-native speaker’s voice while being significantly less accented.
In the process of conducting this research, I also took a leading role in collecting, curating, and releasing a non-native speech corpus named L2-ARCTIC, which is the first open-source corpus of its kind and provides valuable resources for the speech community. I include descriptions of the curation process, data analysis, and applications of the corpus in this dissertation. | en |