
dc.creator: Talukder, Arghamitra
dc.date.accessioned: 2023-11-15T14:12:33Z
dc.date.available: 2023-11-15T14:12:33Z
dc.date.created: 2021-05
dc.date.issued: 2021-05-03
dc.date.submitted: May 2021
dc.identifier.uri: https://hdl.handle.net/1969.1/200589
dc.description.abstract: Protein-protein interactions (PPIs) often underlie important biological processes. Because living organisms contain a vast number of potential PPIs, identifying each one experimentally is expensive, if not daunting, so computational methods have been developed in parallel to facilitate the task. Despite the various experimental and computational methods for determining or predicting PPIs, a knowledge gap remains in understanding the 3-dimensional interactions at atomic-level detail. This research project aims to leverage existing protein data and emerging machine learning tools to both predict and explain protein-protein interactions. Specifically, using several modalities of protein data, including 1D sequences and 2D structures, several hierarchical recurrent neural network (HRNN) and joint attention based models have been developed. These models predict whether two proteins interact (the probability of PPI) and, if they do, how they interact (the probabilities of their residue-residue contacts (RRC)). PPI prediction with model I (which uses only 1D sequences) achieves an Area Under the Precision-Recall Curve (AUPRC) of 0.738. In a comparative analysis of model I against the state-of-the-art PPI-detect [1], precision, sensitivity and accuracy increased by 7.8%, 9.5% and 6.2%, respectively. For inter RRC map prediction, a gradual improvement was observed from model I to model II (which uses sequence pre-training and inter RRC maps for fine-tuning) to model III (which uses both sequences and intra RRC maps). On the validation set, the best AUPRC reached 2.69e-4 (model III), up from 2.49e-4 (model II) and 1.02e-4 (model I). Thus, model III improves AUPRC by 163% over model I and by 8.03% over model II; additionally, model II improves on model I by 144%.
The performance evaluations of these models show that the advantage of big data for the 1D modality alone is not sufficient to predict inter RRC maps; rather, even a slight combination of structure information with sequence, as done in model II, gives much better inter RRC predictions. The full combination of sequences and intra RRC maps shows the best result.
dc.format.mimetype: application/pdf
dc.subject: Protein-protein interaction
dc.subject: Residue-residue contact
dc.subject: Universal Protein Resource (UniProt)
dc.subject: Protein Data Bank (PDB)
dc.subject: Hierarchical recurrent neural network
dc.subject: Area Under the Precision-Recall Curve
dc.subject: Area Under Curve - Receiver Operating Characteristics
dc.title: Multimodal Data Fusion and Machine Learning for Deciphering Protein-Protein Interactions
dc.type: Thesis
thesis.degree.department: Electrical and Computer Engineering
thesis.degree.discipline: Electrical Engineering
thesis.degree.grantor: Undergraduate Research Scholars Program
thesis.degree.name: B.S.
thesis.degree.level: Undergraduate
dc.contributor.committeeMember: Shen, Yang
dc.type.material: text
dc.date.updated: 2023-11-15T14:12:33Z
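
The relative AUPRC improvements quoted in the abstract can be checked with a short calculation. The sketch below is illustrative only: the function name `relative_improvement` and the dictionary layout are assumptions made here, not part of the thesis, and the only inputs are the three validation-set AUPRC values stated in the abstract.

```python
# Illustrative check of the percentage improvements reported in the abstract.
# Only the three AUPRC values come from the record; everything else is a
# hypothetical helper written for this example.

def relative_improvement(new, baseline):
    """Return the percentage improvement of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0

# Validation-set AUPRC for inter RRC map prediction, as stated in the abstract.
auprc = {"model I": 1.02e-4, "model II": 2.49e-4, "model III": 2.69e-4}

iii_vs_i = relative_improvement(auprc["model III"], auprc["model I"])
iii_vs_ii = relative_improvement(auprc["model III"], auprc["model II"])
ii_vs_i = relative_improvement(auprc["model II"], auprc["model I"])

print(f"model III vs model I:  {iii_vs_i:.2f}%")
print(f"model III vs model II: {iii_vs_ii:.2f}%")
print(f"model II vs model I:   {ii_vs_i:.2f}%")
```

Computed this way, each value agrees with the reported 163%, 8.03% and 144% to within about one percentage point; the small differences come from rounding of the published AUPRC values.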

