
dc.contributor.advisor          Gutierrez-Osuna, Ricardo
dc.creator                      Zhao, Guanlong
dc.date.accessioned             2021-02-03T17:09:54Z
dc.date.available               2022-08-01T06:52:02Z
dc.date.created                 2020-08
dc.date.issued                  2020-06-22
dc.date.submitted               August 2020
dc.identifier.uri               https://hdl.handle.net/1969.1/192349
dc.description.abstract (en)

Foreign accent conversion (FAC) aims to generate a synthetic voice that has the voice identity of a given non-native speaker (NNS) but the pronunciation patterns (i.e., accent) of a native speaker (NS). This synthetic voice is often referred to as a “Golden Speaker” in the computer-assisted pronunciation training literature. Prior FAC algorithms do not fully remove mispronunciations in the original non-native speech or fully capture the voice quality of the non-native speaker. More importantly, most prior methods require a reference utterance from a native speaker at synthesis time, thus limiting the application scope of FAC in pronunciation training. This dissertation aims to address these issues by proposing solutions to three interrelated problems:

• Reducing mispronunciations in the accent-converted speech
• Improving the voice similarity between the accent conversions and the NNS
• Removing the need for an NS reference utterance at synthesis time

To address the first problem, I propose an approach that matches frames from the native reference speaker and the non-native speaker based on their phonetic similarity. To generate accent conversions, I then use the paired frames to train a Gaussian Mixture Model (GMM) that converts the native reference utterance to match the voice identity of the non-native speaker. The algorithm outperforms earlier methods that match frames based on Dynamic Time Warping or acoustic similarity, improving ratings of acoustic quality and native accent while retaining the voice identity of the non-native speaker. I also show that this approach can be applied to non-parallel training data and achieve comparable performance.

To address the second problem, I develop a sequence-to-sequence speech synthesizer that maps speech embeddings (e.g., phonetic posteriorgrams) from the non-native speaker into the corresponding spectrograms. At inference time, I drive the synthesizer with a speech embedding from an NS reference utterance. The proposed system produces speech that sounds clearer, more natural, and more similar to the non-native speaker than the model presented in the first work, while significantly reducing perceived accentedness compared with the original non-native utterances.

To address the third and final problem, I present a reference-free FAC system. First, I generate a synthetic golden speaker for the non-native speaker using the method proposed in the second work. Then, I train a pronunciation-correction model that maps the non-native speaker's utterances to the corresponding synthetic golden-speaker utterances. Both objective and subjective evaluations show that the reference-free FAC model generates speech that resembles the non-native speaker's voice while being significantly less accented.

In the process of conducting this research, I also took a leading role in collecting, curating, and releasing a non-native speech corpus named L2-ARCTIC, the first open-source corpus of its kind, which provides valuable resources for the speech community. I include descriptions of the curation process, data analysis, and applications of the corpus in this dissertation.
dc.format.mimetype              application/pdf
dc.language.iso                 en
dc.subject                      accent conversion  en
dc.subject                      voice conversion  en
dc.subject                      speech modification  en
dc.subject                      speech synthesis  en
dc.subject                      acoustic modeling  en
dc.title                        Foreign Accent Conversion with Neural Acoustic Modeling  en
dc.type                         Thesis  en
thesis.degree.department        Computer Science and Engineering  en
thesis.degree.discipline        Computer Science  en
thesis.degree.grantor           Texas A&M University  en
thesis.degree.name              Doctor of Philosophy  en
thesis.degree.level             Doctoral  en
dc.contributor.committeeMember  Choe, Yoonsuck
dc.contributor.committeeMember  Huang, Ruihong
dc.contributor.committeeMember  Vaid, Jyotsna
dc.type.material                text  en
dc.date.updated                 2021-02-03T17:10:02Z
local.embargo.terms             2022-08-01
local.etdauthor.orcid           0000-0002-6059-4053
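
The abstract above describes pairing native (NS) and non-native (NNS) frames by phonetic similarity and then training a GMM on the paired frames. The snippet below is a minimal illustrative sketch of that idea, not code from the dissertation: it assumes precomputed phonetic posteriorgrams (PPGs) and MFCCs as NumPy arrays, uses cosine distance as a stand-in similarity measure, and fits a joint GMM on the paired features in the style of classic GMM-based voice conversion.

    # Illustrative sketch only (not the dissertation's code): pair NS and NNS
    # frames by phonetic similarity, then fit a joint GMM on the paired acoustic
    # features, as in classic GMM-based voice conversion.
    # Assumed inputs: precomputed PPG and MFCC matrices (frames x dims) as NumPy
    # arrays; cosine distance stands in for whatever similarity measure is used.
    import numpy as np
    from scipy.spatial.distance import cdist
    from sklearn.mixture import GaussianMixture

    def pair_frames_by_ppg(ppg_ns, ppg_nns):
        """For each NS frame, find the index of the most phonetically similar NNS frame."""
        dist = cdist(ppg_ns, ppg_nns, metric="cosine")  # (T_ns, T_nns) pairwise distances
        return dist.argmin(axis=1)                      # nearest NNS frame per NS frame

    def fit_conversion_gmm(mfcc_ns, mfcc_nns, ppg_ns, ppg_nns, n_components=32):
        """Fit a joint GMM on [NS, NNS] feature pairs matched by phonetic similarity."""
        idx = pair_frames_by_ppg(ppg_ns, ppg_nns)
        joint = np.hstack([mfcc_ns, mfcc_nns[idx]])     # (T_ns, 2 * n_mfcc) joint vectors
        gmm = GaussianMixture(n_components=n_components, covariance_type="full")
        gmm.fit(joint)
        return gmm

At conversion time, such a joint GMM would typically be used to regress NNS-sounding features from the NS reference features (standard GMM voice-conversion regression), which is omitted here.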
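
The second approach in the abstract maps speech embeddings such as PPGs to spectrograms with a sequence-to-sequence synthesizer. The skeleton below is only a rough, frame-wise sketch of that mapping under assumed dimensions and layer choices; the dissertation's sequence-to-sequence model is more elaborate than this.

    # Illustrative sketch only (not the dissertation's architecture): map phonetic
    # posteriorgrams (PPGs) to mel-spectrogram frames with a recurrent network.
    # The PPG/mel dimensions, hidden size, and the bidirectional-GRU-plus-projection
    # design are assumptions made for brevity.
    import torch
    import torch.nn as nn

    class PPGToMel(nn.Module):
        def __init__(self, ppg_dim=144, mel_dim=80, hidden=256):
            super().__init__()
            self.encoder = nn.GRU(ppg_dim, hidden, batch_first=True, bidirectional=True)
            self.prenet = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())
            self.proj = nn.Linear(hidden, mel_dim)      # per-frame mel prediction

        def forward(self, ppg):                         # ppg: (batch, T, ppg_dim)
            enc, _ = self.encoder(ppg)                  # (batch, T, 2 * hidden)
            return self.proj(self.prenet(enc))          # (batch, T, mel_dim)

    # Training would minimize a spectrogram reconstruction loss against the
    # non-native speaker's mel-spectrograms; at inference time the model would be
    # driven by PPGs extracted from a native-speaker reference utterance.
    model = PPGToMel()
    mel = model(torch.randn(2, 100, 144))               # hypothetical batch -> (2, 100, 80)

The reference-free system described last could be sketched analogously as a second sequence-to-sequence model trained to map the non-native speaker's own features to the synthetic golden-speaker targets.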

