Genetic Mutation Effect Prediction: Incorporating 1D Sequence and 3D Structure
Abstract
Accurately predicting the effects of genomic variations is critical, as those variations are closely related to human health and disease development. However, because of limited data and limited understanding of the mechanisms behind variation effects, especially for noncoding genomic variations, the need for variation effect prediction has not been fully addressed. This dissertation therefore proposes a few models that make better use of the available data and provide better interpretation of the mechanisms of variation effects. Specifically, it focuses on 1) noncoding variant effect prediction, by incorporating 1D genome sequence and 3D chromatin structure through multimodal deep learning models; and 2) coding variant effect prediction and design for desired properties, by incorporating 1D protein sequence and 3D protein structure through protein language models and structure-based protein docking and design.
The effects of genomic variations in the noncoding region (estimated to constitute 98% of the human genome) can be described as changes in epigenetic profiles. 3D chromatin structure, one of the major factors closely related to gene regulation, is barely used in previous methods. We have developed ncVarPred-1D3D, a multimodal learning model that incorporates 1D genome sequence and 3D chromatin structure for noncoding variant effect prediction. The usefulness of the proposed model is shown by correcting the inconsistency between 1D genome sequence and epigenetic profile, and by improving the prediction of epigenetic profiles and variant effects. Our experiments show that most epigenetic events in most cell lines can benefit from 3D chromatin structure information from matched or even unmatched cell lines. This is critical because 3D structure has been experimentally measured for only a limited number of cell lines. Our model can also provide biological interpretation for the identified transcription factor (TF) binding motifs, and it outperforms state-of-the-art sequence-only models in noncoding variant effect prediction tasks such as eQTL and pathogenicity prediction.
Unlike noncoding variant effects, coding variant effects can be analyzed by modeling changes in protein structure and function. We focus on two therapeutic applications. First, to contribute to the fight against the COVID-19 global pandemic, we have developed a pipeline to prioritize antibodies that broadly neutralize wild-type SARS-CoV-2 and its variants, using only the candidate antibody sequences, the SARS-CoV-2 RBD-hACE2 complex structure, and neutralization-activity fold-change information for a few variants. We also redesigned antibodies for higher neutralization using a principle-driven protein design method, interconnected cost function networks (iCFN). Second, we developed a machine learning model to generate and test mechanistic hypotheses for a newly discovered class of estrogen receptor-activating variants found in metastatic breast cancer patients. Our pipelines for coding variant effect prediction can be applied to new fields to solve new challenges.
Keywords
Noncoding variants, Protein design, 1D sequence, 3D structure