Abstract
A framework for classifying clustering algorithms and a method for performing comparative analyses have been developed. Sixteen models are defined, and twenty-four algorithms from the literature are reviewed and classified within the framework. Four of these algorithms are selected for a comparative analysis experiment involving the clustering of document abstracts. Six objective functions are defined to measure the "goodness" of a cluster set produced by an algorithm. A detailed description is provided of the collection of 548 abstracts, which were selected from the Arms Control and Disarmament abstract journal. The abstracts were prepared in machine-readable form, and a series of computer programs was written to perform automatic indexing of the abstracts. The indexed abstract data base, referred to as an abstract-concept matrix, is clustered by the four algorithms. Two algorithms are the Rocchio and Dattola routines from the SMART information retrieval system. Two other algorithms, the Single-Link and the Maximal Complete Subgraph, are programmed for use in the experiment. The manual classification provided by the Library of Congress comprises a fifth cluster set for use in the analysis. The six objective function means are computed for each of the five cluster sets and scaled to a common mean and variance. The resulting 5 × 6 data matrix is subjected to an analysis of variance and rank tests. The results of the analyses suggest that the computer algorithms produced better clusters than the manual classification, and that the Single-Link and Maximal Complete Subgraph algorithms produced better clusters than the Rocchio and Dattola algorithms. Generally, the objective functions appear to be consistent judges of the "goodness" of the cluster sets. These conclusions are limited by the fact that only one set of clusters was tested in the experiment. The eight major computer programs written for this research are provided. Some recommendations for further research are presented.
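The scaling and ranking step described above can be illustrated with a minimal sketch. It assumes the 5 × 6 matrix holds one row per cluster set and one column per objective function, that "common mean and variance" means mean 0 and variance 1 per column, and that larger values indicate better clusters. None of these details is stated in the abstract. The function names are hypothetical and the numbers are arbitrary placeholders, not results from the dissertation.

```python
from statistics import mean, pstdev


def standardize_columns(matrix):
    """Scale each column to mean 0 and population variance 1 (assumed convention)."""
    scaled_cols = []
    for col in zip(*matrix):
        mu = mean(col)
        sigma = pstdev(col)  # must be nonzero for the column to be scalable
        scaled_cols.append([(x - mu) / sigma for x in col])
    return [list(row) for row in zip(*scaled_cols)]


def rank_columns(matrix):
    """Rank the rows within each column; rank 1 is the largest value.

    Ties are not given averaged ranks here; a real rank test would need that.
    """
    ranked_cols = []
    for col in zip(*matrix):
        order = sorted(range(len(col)), key=lambda i: col[i], reverse=True)
        ranks = [0] * len(col)
        for rank, i in enumerate(order, start=1):
            ranks[i] = rank
        ranked_cols.append(ranks)
    return [list(row) for row in zip(*ranked_cols)]


# Arbitrary placeholder means: 5 cluster sets (rows) x 6 objective functions (columns).
labels = ["set A", "set B", "set C", "set D", "set E"]
raw = [
    [1.0, 20.0, 0.30, 5.0, 100.0, 2.0],
    [2.0, 25.0, 0.35, 6.0, 120.0, 2.5],
    [4.0, 40.0, 0.50, 9.0, 180.0, 4.0],
    [5.0, 35.0, 0.55, 8.0, 170.0, 3.5],
    [3.0, 30.0, 0.20, 7.0, 90.0, 3.0],
]

scaled = standardize_columns(raw)
ranks = rank_columns(scaled)

# Mean rank per cluster set across the six objective functions.
for label, row in zip(labels, ranks):
    print(label, mean(row))
```

Standardizing each column first puts objective functions with different units on a comparable scale, so that an analysis of variance or a rank test across the rows is not dominated by whichever function has the largest raw values.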
Wright, William Randolph (1973). An experimental comparison of clustering techniques. Doctoral dissertation, Texas A&M University. Texas A&M University. Libraries. Available electronically from
https://hdl.handle.net/1969.1/DISSERTATIONS-158493.