dc.contributor.advisor: Gutierrez-Osuna, Ricardo
dc.creator: Ding, Shaojin
dc.date.accessioned: 2022-02-23T18:08:46Z
dc.date.available: 2023-05-01T06:37:14Z
dc.date.created: 2021-05
dc.date.issued: 2021-04-15
dc.date.submitted: May 2021
dc.identifier.uri: https://hdl.handle.net/1969.1/195720
dc.description.abstract: Voice Conversion (VC) aims to transform the speech of a source speaker to sound as if a target speaker had produced it. As a closely related but more challenging research problem, foreign accent conversion (FAC) [1] aims to create a new voice that has the voice identity of a given non-native (L2) speaker and the accent of a native (L1) speaker. Prior VC and FAC approaches require a considerable amount of speech data from each target speaker for model training, which can be tedious to collect and demotivating for users in real-world scenarios. This dissertation addresses these problems by introducing three few-shot learning approaches for VC and FAC:

• Few-shot VC based on sparse representation
• Zero-shot VC based on a sequence-to-sequence (seq2seq) model
• Zero-shot FAC based on a seq2seq model

In the first approach, I develop a novel sparse representation for VC that requires as little as one minute of speech from each target speaker. The proposed approach consists of two complementary components: a Cluster-Structured Dictionary Learning module that learns a dictionary capturing the speakers' characteristics, and a Cluster-Selective Objective Function that computes the sparse representation carrying the linguistic content. The approach outperforms previous methods based on Gaussian Mixture Models (GMMs) and sparse representations, improving both the acoustic quality and the voice identity of the VC syntheses.

In the second approach, I create a seq2seq model for zero-shot VC that reduces the amount of target speech required from minutes to seconds. The model transforms a linguistic content representation (e.g., a phonetic posteriorgram) into a Mel-spectrogram, conditioned on the target speaker embedding. Moreover, I propose an adversarial training scheme that removes speaker-dependent cues from the phonetic posteriorgram. The approach synthesizes more natural speech than conventional methods based on GMMs and sparse representations. I also show that the adversarial training scheme further improves the voice identity of the synthesized speech.

In the third approach, I generalize the zero-shot VC model for use in FAC. Compared to zero-shot VC, this approach has an additional accent encoder that generates an accent embedding, which is consumed by the seq2seq model. As in the second approach, the seq2seq model transforms the linguistic content representation into a Mel-spectrogram, conditioned on the desired speaker embedding, but now also on the accent embedding. I show via perceptual studies that the proposed approach reduces the accentedness of the syntheses compared with a state-of-the-art seq2seq-based FAC approach, while retaining acoustic quality and voice identity. More importantly, it reduces the speech required from each L2 speaker from hours to seconds.

While conducting this research, I also played a leading role in developing Golden Speaker Builder, a web application that uses FAC algorithms for pronunciation training. I describe the design and implementation of this web application and the feedback collected from participants in the user studies.
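The conditioning mechanism described in the abstract — frame-level linguistic features (a phonetic posteriorgram) combined with utterance-level speaker and accent embeddings before seq2seq decoding — can be sketched as follows. This is a minimal NumPy illustration of the data flow only; all dimensions and variable names are assumptions for the sketch, not the thesis's actual architecture:

```python
import numpy as np

# Assumed shapes, for illustration only:
#   ppg:     (T, D_ppg)  phonetic posteriorgram (frame-level linguistic content)
#   spk_emb: (D_spk,)    utterance-level embedding from a speaker encoder
#   acc_emb: (D_acc,)    utterance-level embedding from an accent encoder
T, D_ppg, D_spk, D_acc = 100, 144, 256, 32
rng = np.random.default_rng(0)
ppg = rng.random((T, D_ppg))
spk_emb = rng.random(D_spk)
acc_emb = rng.random(D_acc)

# Broadcast the utterance-level embeddings to every frame and concatenate
# them with the frame-level linguistic features; the resulting sequence is
# what a seq2seq decoder would consume to predict the Mel-spectrogram.
cond = np.concatenate(
    [ppg,
     np.tile(spk_emb, (T, 1)),
     np.tile(acc_emb, (T, 1))],
    axis=1)
print(cond.shape)  # (100, 432) = (T, D_ppg + D_spk + D_acc)
```

For zero-shot VC (the second approach), the same sketch applies with the accent embedding omitted; adding the accent encoder is what generalizes the model to FAC.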
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.subject: Voice conversion
dc.subject: Foreign accent conversion
dc.subject: Few-shot learning
dc.subject: Zero-shot learning
dc.subject: Pronunciation training
dc.title: FEW-SHOT VOICE AND FOREIGN ACCENT CONVERSION AND ITS APPLICATIONS IN PRONUNCIATION TRAINING
dc.type: Thesis
thesis.degree.department: Computer Science and Engineering
thesis.degree.discipline: Computer Science
thesis.degree.grantor: Texas A&M University
thesis.degree.name: Doctor of Philosophy
thesis.degree.level: Doctoral
dc.contributor.committeeMember: Huang, Ruihong
dc.contributor.committeeMember: Chaspari, Theodora
dc.contributor.committeeMember: Liu, Tie
dc.type.material: text
dc.date.updated: 2022-02-23T18:08:47Z
local.embargo.terms: 2023-05-01
local.etdauthor.orcid: 0000-0002-2108-3111

