Abstract
In the biological sciences, there has been tremendous growth in the amount of data generated by scientists, from the sequencing of DNA to the determination of three-dimensional protein structures. Because of the size and complexity of the data housed in biological databases, there is a pressing need to analyze this data in an automated fashion and extract knowledge from it. Such knowledge could be used to accomplish a number of important tasks, from understanding the function of a biological process to synthesizing proteins for use as effective pharmaceuticals. The goal of this research is to develop methods that analyze raw biological data and extract useful biochemical knowledge that will benefit the scientific community. This research focused on two separate biological databases. The first database, a collection of genetic sequences of immunoglobulin proteins, was used to study the diversity of antibody molecules through analysis of nucleotide sequences. The second database was a collection of three-dimensional protein structures. This data was used to study the contact environments surrounding disulfide bonds and to determine the constraints on the protein's three-dimensional structure. Through the development and application of machine learning techniques, we were able to extract knowledge from these data sets. We discovered that there are significant patterns in the types of amino acids found within the antigen-binding regions of immunoglobulins and that the diversity within these regions is lower than traditionally believed. Additionally, we propose a method to align and represent the contact environments of disulfide bonds. We then used this methodology to predict the conformation of a disulfide bond based on the biochemical composition of its surrounding contact environment. Finally, we outline how this methodology could help protein engineers create more stable disulfide bonds in proteins.
Hofle, Michael David (2001). Machine learning approaches to biochemical knowledge discovery. Master's thesis, Texas A&M University. Available electronically from
https://hdl.handle.net/1969.1/ETD-TAMU-2001-THESIS-H634.