Lightweight Document Clustering
Abstract:
A lightweight document clustering method is described that operates in high dimensions, processes tens of thousands of documents, and groups them into several thousand clusters or, by varying a single parameter, into a few dozen clusters. The method uses a reduced indexing view of the original documents, where only the k best keywords of each document are indexed. An efficient procedure for clustering is specified in two parts: (a) compute the k most similar documents for each document in the collection and (b) group the documents into clusters using these similarity scores. The method has been evaluated on a database of over 50,000 customer service problem reports that are reduced to 3,000 clusters and 5,000 exemplar documents. Results demonstrate efficient clustering performance with excellent group similarity measures.
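The two-part procedure can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how keywords are selected, how similarity is scored, or how groups are formed, so the sketch assumes raw term frequency for keyword selection, a shared-keyword count for similarity, and threshold-based linking for grouping. The names `top_keywords`, `k_most_similar`, `group`, and `min_score` are hypothetical.

```python
from collections import Counter, defaultdict


def top_keywords(tokens, k):
    """Reduced indexing view: keep only k keywords per document.

    Assumption: 'best' is approximated by raw term frequency.
    """
    return {word for word, _ in Counter(tokens).most_common(k)}


def k_most_similar(doc_keywords, k):
    """Part (a): find the k most similar documents for each document.

    Assumption: similarity is the number of shared keywords, computed
    through an inverted index so only documents sharing a keyword are scored.
    """
    index = defaultdict(set)
    for doc_id, keywords in enumerate(doc_keywords):
        for word in keywords:
            index[word].add(doc_id)

    neighbours = []
    for doc_id, keywords in enumerate(doc_keywords):
        scores = Counter()
        for word in keywords:
            for other in index[word]:
                if other != doc_id:
                    scores[other] += 1
        # Highest score first; ties broken by document id for determinism.
        ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
        neighbours.append(ranked[:k])
    return neighbours


def group(neighbours, min_score):
    """Part (b): group documents using the similarity scores.

    Assumption: two documents join the same cluster when their score is at
    least min_score (union-find over the neighbour lists).
    """
    parent = list(range(len(neighbours)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for doc_id, ranked in enumerate(neighbours):
        for other, score in ranked:
            if score >= min_score:
                parent[find(doc_id)] = find(other)

    clusters = defaultdict(list)
    for doc_id in range(len(neighbours)):
        clusters[find(doc_id)].append(doc_id)
    return sorted(clusters.values())


# Toy problem reports, invented for illustration only.
docs = [
    "printer jam paper tray printer",
    "paper jam in printer tray",
    "password reset email login",
    "login password expired reset",
]
keywords = [top_keywords(text.split(), 5) for text in docs]
neighbours = k_most_similar(keywords, 2)
loose_clusters = group(neighbours, min_score=1)
strict_clusters = group(neighbours, min_score=4)
```

In this sketch `min_score` plays the role of a single tuning parameter: raising it yields more, smaller clusters, and lowering it yields fewer, larger ones. Whether the method's own parameter works this way is not stated in the abstract.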

