Foreign Accent Conversion with Neural Acoustic Modeling

Zhao, Guanlong

dc.contributor.advisor	Gutierrez-Osuna, Ricardo
dc.creator	Zhao, Guanlong
dc.date.accessioned	2021-02-03T17:09:54Z
dc.date.available	2022-08-01T06:52:02Z
dc.date.created	2020-08
dc.date.issued	2020-06-22
dc.date.submitted	August 2020
dc.identifier.uri	https://hdl.handle.net/1969.1/192349
dc.description.abstract	Foreign accent conversion (FAC) aims to generate a synthetic voice that has the voice identity of a given non-native speaker (NNS) but the pronunciation patterns (i.e., accent) of a native speaker (NS). This synthetic voice is often referred to as a “Golden Speaker” in the computer-assisted pronunciation training literature. Prior FAC algorithms do not fully remove mispronunciations in the original non-native speech or fully capture the voice quality of the non-native speaker. More importantly, most prior methods require a reference utterance from a native speaker at synthesis time, thus limiting the application scope of FAC in pronunciation training. This dissertation aims to address these issues by proposing solutions to three interrelated problems: (1) reducing mispronunciation in the accent-converted speech; (2) improving the voice similarity between the accent conversions and the NNS; and (3) removing the need for an NS reference utterance at synthesis time. To address the first problem, I propose an approach that matches frames from the native reference speaker and non-native speakers based on their phonetic similarity. To generate accent conversions, I then use the paired frames to train a Gaussian Mixture Model (GMM) that converts the native reference utterance to match the voice identity of the non-native speaker. The algorithm outperforms earlier methods that match frames based on Dynamic Time Warping or acoustic similarity, improving ratings of acoustic quality and native accent while retaining the voice identity of the non-native speaker. I also show that this approach can be applied to non-parallel training data and achieve comparable performance. To address the second problem, I develop a sequence-to-sequence speech synthesizer that maps speech embeddings (e.g., phonetic posteriorgrams) from the non-native speaker into the corresponding spectrograms. At inference time, I drive the synthesizer with a speech embedding from an NS reference utterance. The proposed system produces speech that sounds clearer, more natural, and more similar to the non-native speaker than the output of the model presented in the first work, while significantly reducing the perceived accentedness compared with non-native utterances. To address the third and final problem, I present a reference-free FAC system. First, I generate a synthetic golden speaker for the non-native speaker using the method proposed in the second work. Then, I train a pronunciation-correction model that maps the non-native speaker utterance into the synthetic golden speaker utterance. Both objective and subjective evaluations show that the reference-free FAC model generates speech that resembles the non-native speaker’s voice while being significantly less accented. In the process of conducting this research, I also took a leading role in collecting, curating, and releasing a non-native speech corpus named L2-ARCTIC, which is the first open-source corpus of its kind and provides valuable resources for the speech community. I include descriptions of the curation process, data analysis, and applications of the corpus in this dissertation.	en
dc.format.mimetype	application/pdf
dc.language.iso	en
dc.subject	accent conversion	en
dc.subject	voice conversion	en
dc.subject	speech modification	en
dc.subject	speech synthesis	en
dc.subject	acoustic modeling	en
dc.title	Foreign Accent Conversion with Neural Acoustic Modeling	en
dc.type	Thesis	en
thesis.degree.department	Computer Science and Engineering	en
thesis.degree.discipline	Computer Science	en
thesis.degree.grantor	Texas A&M University	en
thesis.degree.name	Doctor of Philosophy	en
thesis.degree.level	Doctoral	en
dc.contributor.committeeMember	Choe, Yoonsuck
dc.contributor.committeeMember	Huang, Ruihong
dc.contributor.committeeMember	Vaid, Jyotsna
dc.type.material	text	en
dc.date.updated	2021-02-03T17:10:02Z
local.embargo.terms	2022-08-01
local.etdauthor.orcid	0000-0002-6059-4053

Files in this item

Name: ZHAO-DISSERTATION-2020.pdf
Size: 7.974Mb
Format: PDF

This item appears in the following Collection(s)

Electronic Theses, Dissertations, and Records of Study (2002– )
Texas A&M University Theses, Dissertations, and Records of Study (2002– )