The full text of this item is not available at this time because the student has placed this item under an embargo for a period of time. The Libraries are not authorized to provide a copy of this work during the embargo period, even for Texas A&M users with NetID.
Deep Learning Approaches for Textual Data Generation
Abstract
Generation is a fundamental sub-area of artificial intelligence. Compared with the remarkable progress in image generation, textual data generation still faces many challenges and is far from perfect. This dissertation addresses key challenges in textual data generation under constraints. We focus on three topics: text-to-label generation, label-to-text generation, and text-to-text generation. For each topic, we discuss the major issues and propose approaches to address them in the context of a specific application.
First, we extend open-domain text-to-SQL parsing to the clinical domain and introduce a new task: automatically translating eligibility criteria into SQL queries. To mitigate the domain-shift problem, we create a new dataset, Criteria2SQL, which pairs eligibility criteria with SQL annotations, and summarize a set of grammar rules. Guided by these grammar rules, our proposed semantic parsing model can parse eligibility criteria into both simple SQL statements and domain-specific statements, significantly improving parsing accuracy.
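To make the task concrete, the following is a minimal, hypothetical sketch of the criteria-to-SQL input/output shape. The dissertation's parser is a neural semantic parser over learned grammar rules; this rule-based fragment, including the table name `patients` and the supported phrases, is purely illustrative.

```python
import re

# Hypothetical illustration of the criteria-to-SQL task: map a
# natural-language eligibility criterion onto a SQL query over an
# assumed "patients" table. The actual Criteria2SQL model is neural;
# this only shows the kind of mapping being learned.

def criterion_to_sql(criterion: str) -> str:
    """Translate a narrow family of numeric criteria, e.g.
    'age greater than 18' -> SELECT id FROM patients WHERE age > 18."""
    ops = {"greater than": ">", "less than": "<", "equal to": "="}
    for phrase, op in ops.items():
        m = re.match(rf"(\w+) {phrase} (\d+)", criterion.lower())
        if m:
            field, value = m.groups()
            return f"SELECT id FROM patients WHERE {field} {op} {value}"
    raise ValueError(f"unsupported criterion: {criterion!r}")
```

Real eligibility criteria are far less regular than this pattern admits, which is precisely why a learned parser with domain-specific grammar rules is needed.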
Training a generation model on a class-imbalanced dataset can lead to tedious and repetitive expressions in the generated sentences. To tackle this problem, we apply flexible templates to guide neural generation. We propose a novel framework for diversity-aware SQL-to-question generation, which extracts natural templates from cross-domain datasets and steers the generator to produce diverse, high-quality questions. Evaluation on two large-scale datasets demonstrates the effectiveness of our model in generating sentences that are both diverse and high quality.
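The template-guidance idea can be sketched as follows. This is a hypothetical, hand-written fragment: the templates and slot names are invented for illustration, whereas the dissertation's framework extracts templates from cross-domain data and fills them with a neural generator.

```python
import random

# Hypothetical sketch of template-guided SQL-to-question generation:
# choosing among several natural-language templates nudges the output
# toward diverse phrasings instead of one repetitive surface form.

TEMPLATES = [
    "What is the {column} of {entity}?",
    "Find the {column} for {entity}.",
    "Could you tell me {entity}'s {column}?",
]

def sql_to_question(column: str, entity: str, rng: random.Random) -> str:
    """Pick a template at random and fill its slots."""
    template = rng.choice(TEMPLATES)
    return template.format(column=column, entity=entity)
```

The key design point is that diversity comes from the template distribution rather than from sampling noise alone, so output quality can be controlled template by template.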
Privacy-preserving text generation approaches usually suffer from semantic inconsistency and quality degradation. Considering this limitation, we introduce a new measurement that first evaluates the privacy-quality trade-off limit of a generator, and then present an efficient authorship obfuscation model that rewrites the original text into privacy-preserving text at minimum editing cost. Experimental results show that our model improves the upper bound of the privacy-quality trade-off and can be adjusted to meet different privacy-protection needs.
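One way to picture the "minimum editing cost" notion is as a distance between the original and the obfuscated text. The dissertation's measurement is more involved; this sketch, using a simple sequence-similarity ratio, is only an assumed stand-in for illustration.

```python
import difflib

# Hypothetical proxy for editing cost in authorship obfuscation:
# 0.0 means the rewrite is identical to the original, values near 1.0
# mean it was almost fully rewritten. An obfuscator wants enough edits
# to hide authorship while keeping this cost (and meaning drift) low.

def edit_cost(original: str, rewritten: str) -> float:
    """Return a 0..1 cost based on character-level similarity."""
    return 1.0 - difflib.SequenceMatcher(None, original, rewritten).ratio()
```

Under this view, the privacy-quality trade-off limit asks how much authorship signal can be removed per unit of editing cost before semantic quality degrades.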
Citation
Yu, Xiaojing (2022). Deep Learning Approaches for Textual Data Generation. Doctoral dissertation, Texas A&M University. Available electronically from https://hdl.handle.net/1969.1/198003.