
dc.contributor.advisor: Shen, Yang
dc.creator: Zhou, Fangtong
dc.date.accessioned: 2021-01-08T20:46:31Z
dc.date.available: 2022-05-01T07:15:03Z
dc.date.created: 2020-05
dc.date.issued: 2020-04-16
dc.date.submitted: May 2020
dc.identifier.uri: https://hdl.handle.net/1969.1/191949
dc.description.abstract: This project focuses on developing machine learning methods to predict protein-ligand interactions. The proteins under study are nuclear receptors (NRs), which regulate hormone-triggered gene transcription and are often drug targets in cancer therapy. Compared to other proteins, the categorical labels for their ligand interactions are much more complex to annotate and to learn within a machine learning framework. The project aims to identify ligands for NRs through sequence-based deep learning while addressing these challenges. The main contributions of this project include the following. (1) Data curation: Databases were identified and curated, and a rule was set up to handle the complicated categorical labels for NR-ligand pairs. (2) Machine learning models: Shallow models, a two-step deep model, and a jointly trained deep model were trained. (3) Stratified validation sets: These were developed to tune the model's hyperparameters and improve model generalizability. (4) Transfer learning: This was applied to fine-tune models trained on other NRs so that novel ligands can be identified for orphan NRs. Specifically, categorical labels were first collected and curated for the identified data sets to enable model training and testing. Protein and ligand features were extracted by a recurrent neural network (RNN) encoder pre-trained on unlabeled data and then fed to various downstream supervised models, shallow or deep, for multi-class classification. Among the shallow supervised models, random forest showed the best results. For the deep supervised models, a convolutional neural network (CNN) was trained either after or jointly with the RNN. Comparisons between the shallow and deep models showed that, although training the deep models separately or jointly did not make a significant difference in model performance, there was a clear improvement from shallow to deep models.
Moreover, a stratified validation strategy was developed to further improve the generalizability of the model from the training set to the test sets. Lastly, considering the very different distributions of biological features between training NRs and orphan NRs, a transfer learning strategy was used to fine-tune the model and improve the performance of ligand identification for two orphan NRs. Future plans include exploring mutational effects, that is, the change in predicted label upon amino-acid substitutions, insertions, or deletions in NRs. (en)
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.subject: Nuclear Receptors (en)
dc.subject: Protein-Ligand Interactions (en)
dc.subject: Sequence-Based Deep Learning (en)
dc.subject: Random Forest (en)
dc.title: Identifying Nuclear Receptor Ligands through Sequence-Based Deep Learning (en)
dc.type: Thesis (en)
thesis.degree.department: Electrical and Computer Engineering (en)
thesis.degree.discipline: Electrical Engineering (en)
thesis.degree.grantor: Texas A&M University (en)
thesis.degree.name: Master of Science (en)
thesis.degree.level: Masters (en)
dc.contributor.committeeMember: Hu, Xia
dc.contributor.committeeMember: Narayanan, Krishna
dc.contributor.committeeMember: Qian, Xiaoning
dc.contributor.committeeMember: Tian, Chao
dc.type.material: text (en)
dc.date.updated: 2021-01-08T20:46:31Z
local.embargo.terms: 2022-05-01
local.etdauthor.orcid: 0000-0003-0750-1441

