Optimization, Learning and Generation for Proteins: Docking Structures and Mapping Sequence–Function Relationships

Cao, Yue

Date

2021-12-08

Abstract

Proteins are the workhorse molecules of life. Understanding how proteins function is one of the most fundamental problems in molecular biology, and progress on it can drive a wide range of biological and pharmaceutical applications. However, determining protein mechanisms experimentally is expensive and time-consuming, which motivates the development of computational methods for protein science. The goal of this thesis is to investigate to what extent machine learning can uncover the underlying mechanisms of proteins. We concentrate on two primitives: predicting the 3D structures of protein–protein interactions (called protein docking) and understanding protein sequence–function relationships. The thesis is organized accordingly. First, we study protein docking. We introduce Bayesian Active Learning (BAL), the first optimization algorithm with uncertainty quantification (UQ) for protein docking. Extensive experiments demonstrate the superior performance of BAL against competitors on both optimization and UQ. In addition, we generalize BAL into the realm of meta-learning and propose LOIS: Learning to Optimize in Swarms. LOIS outperforms various optimization algorithms on general optimization tasks. Finally, we focus on the scoring problem in protein docking and introduce Energy-based Graph Convolutional Networks (EGCN), which learn energies directly from graph representations of docking models and perform better than competitors. Second, we focus on understanding the protein sequence–function relationship. We first study the forward problem of protein function prediction and introduce TALE: Transformer-based protein function Annotation with joint sequence-Label Embedding. TALE+, which combines TALE with a sequence similarity-based method, outperforms competing methods when only sequence input is available. We also study the inverse design problem and describe our novel conditional autoregressive deep generative models. By learning functional embeddings from the Gene Ontology (GO) graph as conditional inputs, these models are able to model the distributions of protein sequences for given functions.

Citation

Cao, Yue (2021). Optimization, Learning and Generation for Proteins: Docking Structures and Mapping Sequence–Function Relationships. Doctoral dissertation, Texas A&M University. Available electronically from https://hdl.handle.net/1969.1/196384.