Optimization, Learning and Generation for Proteins: Docking Structures and Mapping Sequence–Function Relationships
Abstract
Proteins are the workhorse molecules of life. Understanding how proteins function is one of the most fundamental problems in molecular biology, and progress on it can drive a wide range of biological and pharmaceutical applications. However, determining protein mechanisms experimentally is expensive and time-consuming. This gap motivates the development of computational methods for protein science. The goal of this thesis is to investigate to what extent machine learning can uncover the underlying mechanisms of proteins. We concentrate on two primitives: predicting the 3D structures of protein–protein interactions (called protein docking) and understanding protein sequence–function relationships. Accordingly, we organize the thesis as follows:
First, we study protein docking. We introduce Bayesian Active Learning (BAL), the first optimization algorithm with uncertainty quantification (UQ) for protein docking. Extensive experiments demonstrate that BAL outperforms competing methods in both optimization and UQ. In addition, we generalize BAL to the meta-learning setting and propose LOIS: Learning to Optimize in Swarms. LOIS outperforms various optimization algorithms on general optimization tasks. Finally, we turn to the scoring problem in protein docking and introduce Energy-based Graph Convolutional Networks (EGCN), which learn energies directly from graph representations of docking models and perform better than competing methods.
Second, we focus on understanding the protein sequence–function relationship. We first study the forward problem of protein function prediction and introduce TALE: Transformer-based protein function Annotation with joint sequence-Label Embedding. TALE+, which combines TALE with a sequence similarity-based method, outperformed competing methods when only sequence input is available. We also study the inverse problem of protein design and describe novel conditional autoregressive deep generative models. By learning functional embeddings from the Gene Ontology (GO) graph and using them as conditional inputs, these models were able to model the distributions of protein sequences for given functions.
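As an illustration of the conditional autoregressive idea mentioned above, the following is a minimal sketch and not the architecture used in the thesis. It factorizes the sequence distribution as p(x | c) = prod_i p(x_i | x_{i-1}, c), where c stands in for a function embedding. The class name ToyConditionalAR, the random untrained weights, the start token, and the restriction to only the previous residue as context are all assumptions made for brevity; a realistic model would condition on the whole prefix and learn its weights from data.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
START = len(AMINO_ACIDS)  # index of a hypothetical start-of-sequence token


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class ToyConditionalAR:
    """Toy model of p(x | c) = prod_i p(x_i | x_{i-1}, c) with random weights."""

    def __init__(self, cond_dim, seed=0):
        rng = random.Random(seed)
        n = len(AMINO_ACIDS)
        # Logit contribution of the previous token (last row is the start token).
        self.w_prev = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n + 1)]
        # Logit contribution of the conditioning vector c (assumed embedding).
        self.w_cond = [[rng.gauss(0, 1) for _ in range(cond_dim)] for _ in range(n)]

    def next_probs(self, prev, cond):
        """Distribution over the next residue given the previous one and c."""
        logits = [
            self.w_prev[prev][k] + sum(w * c for w, c in zip(self.w_cond[k], cond))
            for k in range(len(AMINO_ACIDS))
        ]
        return softmax(logits)

    def sample(self, cond, length, seed=0):
        """Draw one sequence residue by residue; also return its log-probability."""
        rng = random.Random(seed)
        prev, seq, log_prob = START, [], 0.0
        for _ in range(length):
            probs = self.next_probs(prev, cond)
            prev = rng.choices(range(len(probs)), weights=probs)[0]
            log_prob += math.log(probs[prev])
            seq.append(AMINO_ACIDS[prev])
        return "".join(seq), log_prob

    def log_likelihood(self, seq, cond):
        """Sum of per-residue conditional log-probabilities for a given sequence."""
        prev, total = START, 0.0
        for ch in seq:
            k = AMINO_ACIDS.index(ch)
            total += math.log(self.next_probs(prev, cond)[k])
            prev = k
        return total
```

Changing the conditioning vector passed to sample or log_likelihood changes every per-residue distribution, which is the sense in which a single model of this kind can represent different sequence distributions for different functions.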
Subject
protein
machine learning
protein docking
optimization
uncertainty quantification
protein function prediction
protein design
deep learning
Citation
Cao, Yue (2021). Optimization, Learning and Generation for Proteins: Docking Structures and Mapping Sequence–Function Relationships. Doctoral dissertation, Texas A&M University. Available electronically from https://hdl.handle.net/1969.1/196384.