dc.contributor.advisor: Gutierrez-Osuna, Ricardo
dc.creator: Ding, Shaojin
dc.date.accessioned: 2022-02-23T18:08:46Z
dc.date.available: 2023-05-01T06:37:14Z
dc.date.created: 2021-05
dc.date.issued: 2021-04-15
dc.date.submitted: May 2021
dc.identifier.uri: https://hdl.handle.net/1969.1/195720
dc.description.abstract: Voice Conversion (VC) aims to transform the speech of a source speaker to sound as if a target speaker had produced it. As a closely related but more challenging research problem, foreign accent conversion (FAC) [1] aims to create a new voice that has the voice identity of a given non-native (L2) speaker and the accent of a native (L1) speaker. Prior VC and FAC approaches require a considerable amount of speech data from each target speaker for model training, which can be tedious to collect and demotivating for users in real-world scenarios. This dissertation addresses these problems by introducing three few-shot learning approaches for VC and FAC:

• Few-shot VC based on sparse representation
• Zero-shot VC based on a sequence-to-sequence (seq2seq) model
• Zero-shot FAC based on a seq2seq model

In the first approach, I develop a novel sparse representation for VC that requires as little as one minute of speech from each target speaker. The proposed approach consists of two complementary components: a Cluster-Structured Dictionary Learning module that learns a dictionary capturing the speakers' characteristics, and a Cluster-Selective Objective Function that computes the sparse representation carrying the linguistic content. The approach outperforms previous methods based on Gaussian Mixture Models (GMMs) and sparse representations, improving both the acoustic quality and the voice identity of the VC syntheses.

In the second approach, I create a seq2seq model for zero-shot VC that reduces the amount of target speech required from minutes to seconds. The model transforms a linguistic content representation (e.g., a phonetic posteriorgram) into a Mel-spectrogram, conditioned on the target speaker embedding. Moreover, I propose an adversarial training scheme that removes speaker-dependent cues from the phonetic posteriorgram. The approach synthesizes more natural speech than conventional methods based on GMMs and sparse representations. I also show that the adversarial training scheme further improves the voice identity of the synthesized speech.

In the third approach, I generalize the zero-shot VC model for use in FAC. Compared to zero-shot VC, this approach has an additional accent encoder that generates an accent embedding, which is consumed by the seq2seq model. As in the second approach, the seq2seq model transforms the linguistic content representation into a Mel-spectrogram, conditioned on the desired speaker embedding, but now also on the accent embedding. I show via perceptual studies that the proposed approach reduces the accentedness of the syntheses compared with a state-of-the-art seq2seq-based FAC approach, while retaining acoustic quality and voice identity. More importantly, it reduces the speech required from each L2 speaker from hours to seconds.

While conducting this research, I also played a leading role in developing Golden Speaker Builder, a web application that uses FAC algorithms for pronunciation training. I describe the design and implementation of this web application and the feedback collected from participants in the user studies.
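The conditioning mechanism described in the abstract — frame-level linguistic features (a phonetic posteriorgram) combined with utterance-level speaker and accent embeddings before seq2seq decoding — can be sketched as follows. This is a minimal NumPy illustration of the data flow only; all dimensions and variable names are assumptions for the sketch, not the thesis's actual architecture:

```python
import numpy as np

# Assumed shapes, for illustration only:
#   ppg:     (T, D_ppg)  phonetic posteriorgram (frame-level linguistic content)
#   spk_emb: (D_spk,)    utterance-level embedding from a speaker encoder
#   acc_emb: (D_acc,)    utterance-level embedding from an accent encoder
T, D_ppg, D_spk, D_acc = 100, 144, 256, 32
rng = np.random.default_rng(0)
ppg = rng.random((T, D_ppg))
spk_emb = rng.random(D_spk)
acc_emb = rng.random(D_acc)

# Broadcast the utterance-level embeddings to every frame and concatenate
# them with the frame-level linguistic features; the resulting sequence is
# what a seq2seq decoder would consume to predict the Mel-spectrogram.
cond = np.concatenate(
    [ppg,
     np.tile(spk_emb, (T, 1)),
     np.tile(acc_emb, (T, 1))],
    axis=1)
print(cond.shape)  # (100, 432) = (T, D_ppg + D_spk + D_acc)
```

For zero-shot VC (the second approach), the same sketch applies with the accent embedding omitted; adding the accent encoder is what generalizes the model to FAC.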
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.subject: Voice conversion
dc.subject: Foreign accent conversion
dc.subject: Few-shot learning
dc.subject: Zero-shot learning
dc.subject: Pronunciation training
dc.title: FEW-SHOT VOICE AND FOREIGN ACCENT CONVERSION AND ITS APPLICATIONS IN PRONUNCIATION TRAINING
dc.type: Thesis
thesis.degree.department: Computer Science and Engineering
thesis.degree.discipline: Computer Science
thesis.degree.grantor: Texas A&M University
thesis.degree.name: Doctor of Philosophy
thesis.degree.level: Doctoral
dc.contributor.committeeMember: Huang, Ruihong
dc.contributor.committeeMember: Chaspari, Theodora
dc.contributor.committeeMember: Liu, Tie
dc.type.material: text
dc.date.updated: 2022-02-23T18:08:47Z
local.embargo.terms: 2023-05-01
local.etdauthor.orcid: 0000-0002-2108-3111

