Parallel Clustering and Classification

Clustering groups documents into sets of similar documents. The underlying assumption is that documents that are similar under a given similarity measure tend to be relevant, as a set, to the same queries. Given this assumption, queries can be processed more efficiently by comparing the query against a single representative for each cluster, called a centroid, rather than against every document. If the centroid is deemed relevant, the documents in the corresponding cluster are treated as relevant too. Our work presents an efficient and scalable parallel approach to clustering and classifying a large document corpus.
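The centroid-based query processing described above can be illustrated with a minimal, sequential sketch. It is not the parallel method from the publications below. The term-frequency vectors, the cosine similarity measure, the centroid computed as the mean vector of a cluster, the `threshold` value, and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Build a simple term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Mean of the cluster's term vectors (one possible centroid definition)."""
    total = Counter()
    for v in vectors:
        total.update(v)
    n = len(vectors)
    return {t: w / n for t, w in total.items()}


def search(query, clusters, threshold=0.2):
    """Return all documents of every cluster whose centroid matches the query.

    Only one comparison per cluster is made, instead of one per document.
    The threshold is an arbitrary illustrative value.
    """
    q = vectorize(query)
    hits = []
    for docs in clusters:
        c = centroid([vectorize(d) for d in docs])
        if cosine(q, c) >= threshold:
            hits.extend(docs)
    return hits


# Hypothetical, already-clustered toy corpus.
clusters = [
    ["parallel clustering of documents", "document clustering on parallel machines"],
    ["cooking pasta at home", "home pasta recipes"],
]
result = search("parallel document clustering", clusters)
```

In this toy corpus the query shares terms only with the first cluster's centroid, so `result` holds the two documents of that cluster, and the second cluster is skipped after a single comparison.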

Publications:
R. Cathey, E. Jensen, S. Beitzel, O. Frieder, and D. Grossman, "Exploiting Parallelism to Support Scalable Hierarchical Clustering," Journal of the American Society for Information Science and Technology, 58(8), June 2007.
E. Jensen, S. Beitzel, A. Pilotto, N. Goharian, and O. Frieder, "Parallelizing the Buckshot Algorithm for Efficient Document Clustering," Proceedings of the 2002 ACM International Conference on Information and Knowledge Management (ACM-CIKM), Washington D.C., November 2002.
A. Ruocco and O. Frieder, "Clustering and Classification of Large Document Bases in a Parallel Environment," Journal of the American Society for Information Science, 48(10), October 1997.