Multimodal Data Fusion and Machine Learning for Deciphering Protein-Protein Interactions

Talukder, Arghamitra

dc.creator	Talukder, Arghamitra
dc.date.accessioned	2023-11-15T14:12:33Z
dc.date.available	2023-11-15T14:12:33Z
dc.date.created	2021-05
dc.date.issued	2021-05-03
dc.date.submitted	May 2021
dc.identifier.uri	https://hdl.handle.net/1969.1/200589
dc.description.abstract	Protein-protein interactions (PPIs) often underlie important biological processes. Because living organisms contain a vast number of potential PPIs, identifying each one experimentally is expensive, if not daunting, so computational methods have been developed in parallel to facilitate the task. Despite the various experimental and computational methods for determining or predicting PPIs, a knowledge gap remains in understanding the 3-dimensional interactions in atomic-level detail. This research project aims to leverage existing protein data and emerging machine learning tools to both predict and explain protein-protein interactions. Specifically, using various modalities of protein data, including 1D sequences and 2D structures, several models based on hierarchical recurrent neural networks (HRNN) and joint attention have been developed. These models predict whether two proteins interact (the probability of PPI) and, if they do, how they interact (the probabilities of their residue-residue contacts (RRC)). PPI prediction with model I (uses only 1D sequences) achieves an Area Under the Precision-Recall Curve (AUPRC) of 0.738. In a comparative analysis of model I against the state-of-the-art PPI-detect [1], precision, sensitivity and accuracy increased by 7.8%, 9.5% and 6.2%, respectively. For predicting inter RRC maps, a gradual improvement was observed from model I to model II (uses sequence pre-training and inter RRC maps to fine-tune) to model III (uses both sequences and intra RRC maps). On the validation set, the best AUPRC reached 2.69e-4 (model III), up from 2.49e-4 (model II) and 1.02e-4 (model I). Thus, model III improves AUPRC by 163% over model I and by 8.03% over model II; model II in turn improves on model I by 144%.
The performance evaluations of these models show that the abundance of data available for the 1D modality is not sufficient on its own to predict inter RRC maps; even a limited combination of structure information with sequences, as in model II, gives much better inter RRC predictions. The full combination of sequences and intra RRC maps gives the best results.
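The relative AUPRC improvements quoted in the abstract can be checked with a short calculation. The sketch below is an illustration only, not code from the thesis: the function name `relative_improvement` and the dictionary layout are assumptions, while the AUPRC values are the validation-set figures reported above.

```python
def relative_improvement(new, baseline):
    """Percentage improvement of `new` over `baseline` (hypothetical helper)."""
    return (new - baseline) / baseline * 100.0


# Validation-set AUPRC values for inter RRC prediction, as reported in the abstract.
auprc = {"model I": 1.02e-4, "model II": 2.49e-4, "model III": 2.69e-4}

# Compare each later model against each earlier one.
pairs = [
    ("model III", "model I"),
    ("model III", "model II"),
    ("model II", "model I"),
]
for new, old in pairs:
    gain = relative_improvement(auprc[new], auprc[old])
    print(f"{new} vs {old}: {gain:.2f}% AUPRC improvement")
```

Under these assumptions the computed gains agree with the abstract's figures of roughly 163%, 8.03% and 144% once rounding is taken into account.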
dc.format.mimetype	application/pdf
dc.subject	Protein-protein interaction
dc.subject	Residue-residue contact
dc.subject	Universal Protein Resource (UniProt)
dc.subject	Protein Data Bank (PDB)
dc.subject	Hierarchical recurrent neural network
dc.subject	Area Under the Precision-Recall Curve
dc.subject	Area Under Curve - Receiver Operating Characteristics
dc.title	Multimodal Data Fusion and Machine Learning for Deciphering Protein-Protein Interactions
dc.type	Thesis
thesis.degree.department	Electrical and Computer Engineering
thesis.degree.discipline	Electrical Engineering
thesis.degree.grantor	Undergraduate Research Scholars Program
thesis.degree.name	B.S.
thesis.degree.level	Undergraduate
dc.contributor.committeeMember	Shen, Yang
dc.type.material	text
dc.date.updated	2023-11-15T14:12:33Z

Files in this item

Name: TALUKDER-FINALTHESIS-2021.pdf
Size: 883.0Kb
Format: PDF

This item appears in the following Collection(s)

Undergraduate Research Scholars Capstone (2006–present)
