Multimodal Data Fusion and Machine Learning for Deciphering Protein-Protein Interactions
Abstract
Protein-protein interactions (PPIs) underlie many important biological processes. Because the number of potential PPIs in living organisms is vast, identifying each PPI experimentally is expensive, if not daunting, so computational methods have been developed in parallel to facilitate the task. Despite the various experimental and computational methods for determining or predicting PPIs, a knowledge gap remains in understanding these 3-dimensional interactions at atomic-level detail. This research project aims to leverage existing protein data and emerging machine learning tools to both predict and explain protein-protein interactions. Specifically, using several modalities of protein data, including 1D sequences and 2D structures, several models based on hierarchical recurrent neural networks (HRNN) and joint attention have been developed. These models predict whether two proteins interact (the probability of PPI) and, if they do, how they interact (the probabilities of their residue-residue contacts (RRC)). For PPI prediction, model I (which uses only 1D sequences) achieves an Area Under the Precision-Recall Curve (AUPRC) of 0.738. In a comparative analysis of model I against the state-of-the-art PPI-detect [1], precision, sensitivity and accuracy increased by 7.8%, 9.5% and 6.2%, respectively. For predicting the inter-protein RRC map, a gradual improvement was observed from model I to model II (which uses sequence pre-training and fine-tunes on inter-protein RRC maps) to model III (which uses both sequences and intra-protein RRC maps). On the validation set, the best AUPRC was 2.69e-4 (model III), compared with 2.49e-4 (model II) and 1.02e-4 (model I). Thus, model III improves AUPRC by 163% over model I and by 8.03% over model II; additionally, model II improves on model I by 144%.
The performance evaluations of these models show that abundant data in the 1D sequence modality alone is not sufficient to predict inter-protein RRC maps; even the limited combination of structure information with sequences in model II gives much better inter-protein RRC predictions. The full combination of sequences and intra-protein RRC maps in model III gives the best result.
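The relative AUPRC improvements quoted in the abstract can be reproduced with a short calculation. The sketch below is only an illustration: the three AUPRC values are taken from the abstract, while the helper name `relative_improvement` and the dictionary layout are assumptions introduced here and are not part of the original work.

```python
# AUPRC values for inter-protein RRC prediction on the validation set,
# copied from the abstract. The names below are illustrative only.
auprc = {"model I": 1.02e-4, "model II": 2.49e-4, "model III": 2.69e-4}


def relative_improvement(new: float, old: float) -> float:
    """Return the percentage gain of `new` over `old`."""
    return (new - old) / old * 100.0


# The three comparisons mentioned in the abstract.
comparisons = [
    ("model III", "model I"),
    ("model III", "model II"),
    ("model II", "model I"),
]

for better, worse in comparisons:
    gain = relative_improvement(auprc[better], auprc[worse])
    print(f"{better} vs {worse}: {gain:.2f}%")
```

Run as written, this gives roughly 163.7%, 8.03% and 144.1%, which suggests that the 163% and 144% figures in the abstract are truncated rather than rounded.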
Subject
Protein-protein interaction
Residue-residue contact
Universal Protein Resource (UniProt)
Protein Data Bank (PDB)
Hierarchical recurrent neural network
Area Under the Precision-Recall Curve
Area Under Curve - Receiver Operating Characteristics
Citation
Talukder, Arghamitra (2021). Multimodal Data Fusion and Machine Learning for Deciphering Protein-Protein Interactions. Undergraduate Research Scholars Program. Available electronically from https://hdl.handle.net/1969.1/200589.