Assessing the Statistical Significance of Extreme Values in Clustered Data
Searching a large database for the entry most similar to a query object is commonplace. In bioinformatics, for example, a BLAST search reports an E-value that reflects the significance of the similarity between two gene sequences. In the case discussed in this work, the database consists of clusters (triplets) of amino acids found near the interface regions of protein-protein interactions (PPIs). The Exploring Key Orientations (EKO) algorithm looks for structural similarity between a peptidomimetic scaffold compound and a triplet present on such a PPI, and we are interested in determining which triplets, from a large database of protein complexes, best fit the scaffold. Our goal is to determine when a "best match" acquired in this way is statistically significant. We do this by parameterizing the space of triplets to find clusters, modeling a density distribution on that space, and fitting a Weibull distribution to determine a p-value for a match. The inherently clustered nature of the triplet database affects the analysis of significance, and we propose a method to efficiently estimate the p-value of a match score.
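The Weibull step mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes that match scores are positive distances where smaller means a better fit, that a sample of "null" best-match scores from unrelated queries is available, and that a two-parameter Weibull describes them adequately. The function names `fit_weibull` and `weibull_p_value` and the simulated data are hypothetical, and the clustering correction that the work proposes is not modeled here.

```python
import math
import random


def fit_weibull(samples):
    """Maximum-likelihood fit of a two-parameter Weibull (shape k, scale lam)."""
    if len(samples) < 2 or min(samples) <= 0 or max(samples) == min(samples):
        raise ValueError("need at least two distinct, strictly positive samples")
    top = max(samples)
    logs = [math.log(x) for x in samples]
    mean_log = sum(logs) / len(logs)
    scaled = [x / top for x in samples]  # rescaling keeps y**k from overflowing

    def shape_equation(k):
        # Likelihood equation for the shape; it is increasing in k and its root is the MLE.
        weights = [y ** k for y in scaled]
        weighted_log = sum(w * l for w, l in zip(weights, logs)) / sum(weights)
        return weighted_log - 1.0 / k - mean_log

    lo, hi = 1e-3, 1.0
    while shape_equation(hi) < 0:  # expand the bracket until it contains the root
        hi *= 2.0
    for _ in range(100):  # plain bisection
        mid = 0.5 * (lo + hi)
        if shape_equation(mid) < 0:
            lo = mid
        else:
            hi = mid
    k = 0.5 * (lo + hi)
    lam = top * (sum(y ** k for y in scaled) / len(scaled)) ** (1.0 / k)
    return k, lam


def weibull_p_value(score, k, lam):
    """P(null best score <= score): a small value means an unusually good match."""
    return -math.expm1(-((score / lam) ** k))


# Simulated null scores (an assumption for illustration only, not real EKO data).
rng = random.Random(0)
null_scores = [rng.weibullvariate(1.5, 2.0) for _ in range(2000)]  # scale 1.5, shape 2.0
k, lam = fit_weibull(null_scores)
p = weibull_p_value(0.1, k, lam)  # p-value for an observed best score of 0.1
```

Under these assumptions, a best-match score far below the bulk of the null scores receives a p-value close to zero, while a score near the fitted scale receives a p-value of about 0.63.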
Parulekar, Advait U (2019). Assessing the Statistical Significance of Extreme Values in Clustered Data. Undergraduate Research Scholars Program. Available electronically from