dc.creator | Talukder, Arghamitra | |
dc.date.accessioned | 2023-11-15T14:12:33Z | |
dc.date.available | 2023-11-15T14:12:33Z | |
dc.date.created | 2021-05 | |
dc.date.issued | 2021-05-03 | |
dc.date.submitted | May 2021 | |
dc.identifier.uri | https://hdl.handle.net/1969.1/200589 | |
dc.description.abstract | Protein-protein interactions (PPIs) often underlie important biological processes. Because of the vast number of potential PPIs in living organisms, identifying each PPI experimentally can be expensive, if not daunting, so computational methods have been developed in parallel to facilitate the task. Despite the various experimental and computational methods available to determine or predict PPIs, a knowledge gap remains in understanding the 3-dimensional interactions in atomic-level detail. This research project aims to leverage existing protein data and emerging machine learning tools to both predict and explain protein-protein interactions. Specifically, using various modalities of protein data, including 1D sequences and 2D structures, several models based on hierarchical recurrent neural networks (HRNN) and joint attention have been developed. These models predict whether two proteins interact (the probability of PPI) and, if they do, how they interact (the probabilities of their residue-residue contacts (RRC)). PPI prediction with model I (which uses only 1D sequences) achieves an Area Under the Precision-Recall Curve (AUPRC) of 0.738. In a comparative analysis of model I against the state-of-the-art PPI-detect [1], precision, sensitivity and accuracy increased by 7.8%, 9.5% and 6.2%, respectively. For inter RRC map prediction, performance improves gradually from model I to model II (which uses sequence pre-training and inter RRC maps for fine-tuning) to model III (which uses both sequences and intra RRC maps). On the validation set, the best AUPRC reached 2.69e-4 (model III), compared with 2.49e-4 (model II) and 1.02e-4 (model I). Thus, model III shows a 163% AUPRC improvement over model I and an 8.03% improvement over model II; additionally, model II shows a 144% improvement over model I. 
The performance evaluations of these models show that the advantage of big data in the 1D sequence modality alone is not sufficient to predict inter RRC maps; rather, even a slight incorporation of structure information alongside sequence, as done in model II, gives much better inter RRC predictions. The full combination of sequences and intra RRC maps gives the best result. | |
dc.format.mimetype | application/pdf | |
dc.subject | Protein-protein interaction | |
dc.subject | Residue-residue contact | |
dc.subject | Universal Protein Resource (UniProt) | |
dc.subject | Protein Data Bank (PDB) | |
dc.subject | Hierarchical recurrent neural network | |
dc.subject | Area Under the Precision-Recall Curve | |
dc.subject | Area Under Curve - Receiver Operating Characteristics | |
dc.title | Multimodal Data Fusion and Machine Learning for Deciphering Protein-Protein Interactions | |
dc.type | Thesis | |
thesis.degree.department | Electrical and Computer Engineering | |
thesis.degree.discipline | Electrical Engineering | |
thesis.degree.grantor | Undergraduate Research Scholars Program | |
thesis.degree.name | B.S. | |
thesis.degree.level | Undergraduate | |
dc.contributor.committeeMember | Shen, Yang | |
dc.type.material | text | |
dc.date.updated | 2023-11-15T14:12:33Z | |
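
The relative AUPRC improvements quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the thesis: the helper name `relative_improvement` and the dictionary layout are assumptions, and the inputs are the rounded validation AUPRC values from the abstract, so the last digit of each result may differ slightly from the reported figures.

```python
def relative_improvement(new: float, old: float) -> float:
    """Percentage improvement of `new` over `old` (hypothetical helper)."""
    return (new - old) / old * 100.0

# Validation-set AUPRC for inter RRC map prediction, as quoted in the abstract.
auprc = {"model I": 1.02e-4, "model II": 2.49e-4, "model III": 2.69e-4}

iii_vs_i = relative_improvement(auprc["model III"], auprc["model I"])
iii_vs_ii = relative_improvement(auprc["model III"], auprc["model II"])
ii_vs_i = relative_improvement(auprc["model II"], auprc["model I"])

print(f"model III vs model I:  {iii_vs_i:.2f}%")
print(f"model III vs model II: {iii_vs_ii:.2f}%")
print(f"model II vs model I:   {ii_vs_i:.2f}%")
```

With these rounded inputs the three results fall at roughly 163.7%, 8.03% and 144.1%, consistent with the 163%, 8.03% and 144% stated in the abstract if the larger figures were truncated rather than rounded.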