In this paper we propose a new information-theoretic divisive algorithm for word clustering applied to text classification. In previous work, such "distributional clustering" of features has been found to achieve significant improvements over feature selection in terms of classification accuracy, especially when the number of features is small [2, 29]. However, the existing clustering techniques are agglomerative in nature, resulting in (i) sub-optimal word clusters and (ii) high computational cost. In order to explicitly capture the optimality of word clusters in an information-theoretic framework, we first derive a global criterion for feature clustering. We then present a fast, divisive algorithm that monotonically decreases this objective function value, thus converging to a local minimum. We show that our algorithm minimizes the "within-cluster Jensen-Shannon divergence" while simultaneously maximizing the "between-cluster Jensen-Shannon divergence". In comparison to the previously proposed agglomerative strategies, our divisive algorithm achieves higher classification accuracy, especially when the number of features is small. We further show that feature clustering is an effective technique for building smaller class models in hierarchical classification. We present detailed experimental results on the 20 Newsgroups data set and a 3-level hierarchy of HTML documents collected from the Dmoz Open Directory.
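To make the "within-cluster Jensen-Shannon divergence" concrete, the following is a minimal sketch, not the paper's implementation. It assumes each word is represented by a class distribution p(C|w) and a prior weight, and that the within-cluster quantity is the prior-weighted generalized Jensen-Shannon divergence summed over clusters. The function names (`kl`, `weighted_js`, `within_cluster_js`) and the toy numbers are hypothetical.

```python
import math

def kl(p, q):
    """KL divergence KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def weighted_js(dists, weights):
    """Generalized Jensen-Shannon divergence of `dists` under `weights`.

    Computed as sum_i w_i * KL(p_i || m), where m = sum_i w_i * p_i
    and the weights are normalized to sum to 1.
    """
    total = sum(weights)
    w = [x / total for x in weights]
    mean = [sum(wi * p[j] for wi, p in zip(w, dists))
            for j in range(len(dists[0]))]
    return sum(wi * kl(p, mean) for wi, p in zip(w, dists))

def within_cluster_js(clusters, dists, priors):
    """Sum over clusters of (cluster prior mass) * (JS divergence inside it).

    Assumed form of the objective; lower values mean tighter word clusters.
    """
    total = 0.0
    for members in clusters:
        mass = sum(priors[t] for t in members)
        total += mass * weighted_js([dists[t] for t in members],
                                    [priors[t] for t in members])
    return total

# Toy data (assumed): four words, two classes, equal word priors.
dists = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
priors = [0.25, 0.25, 0.25, 0.25]

good = within_cluster_js([[0, 1], [2, 3]], dists, priors)  # similar words together
bad = within_cluster_js([[0, 2], [1, 3]], dists, priors)   # dissimilar words together
print(good < bad)
```

Under these assumptions, grouping words with similar class distributions gives a smaller within-cluster value than mixing dissimilar words, which is the behavior a divisive algorithm that decreases this objective would exploit.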
|
4595
|
Statistical Learning Theory
– Vapnik
- 1998
|
|
4433
|
Elements of Information Theory
– Cover, Thomas
- 1991
|
|
2789
|
A mathematical theory of communication
– Shannon
- 1948
|
|
1486
|
Indexing By Latent Semantic Analysis
– Deerwester, Dumais, et al.
- 1990
|
|
949
|
Pattern Classification
– Duda, Hart, et al.
- 2001
|
|
518
|
A comparative study on feature selection in text categorization
– Yang, Pedersen
- 1997
|
|
482
|
A Comparison of Event Models for Naive Bayes Text Classification
– McCallum, Nigam
- 1998
|
|
421
|
A re-examination of text categorization methods
– Yang, Liu
- 1999
|
|
418
|
On information and sufficiency
– Kullback, Leibler
- 1951
|
|
405
|
Distributional Clustering of English Words
– Pereira, Tishby, et al.
- 1993
|
|
337
|
On the optimality of the simple Bayesian classifier under zero-one loss
– Domingos, Pazzani
- 1997
|
|
293
|
Hierarchically classifying documents using very few words
– Koller, Sahami
- 1997
|
|
263
|
Probabilistic Latent Semantic Indexing
– Hofmann
- 1999
|
|
203
|
What Every Computer Scientist should know about Floating Point
– Goldberg
- 1991
|
|
180
|
Divergence measures based on the Shannon entropy
– Lin
- 1991
|
|
177
|
Concept decompositions for large sparse text data using clustering
– Dhillon, Modha
|
|
154
|
On bias, variance, 0/1-loss, and the curse-of-dimensionality
– Friedman
- 1997
|
|
153
|
Distributional clustering of words for text classification
– Baker, McCallum
- 1998
|
|
120
|
Measures of distributional similarity
– Lee
- 1999
|
|
112
|
Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow
– McCallum
- 1996
|
|
97
|
A mathematical theory of communication, The Bell System
– Shannon
- 1948
|
|
93
|
K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality
– Selim, Ismail
- 1984
|
|
92
|
Cluster Analysis of Multivariate Data: Efficiency versus Interpretability of Classifications (abstract)
– Forgy
- 1965
|
|
59
|
Using taxonomy, discriminants, and signatures for navigating in text databases
– Chakrabarti, Dom, et al.
- 1997
|
|
41
|
On information and sufficiency
– Kullback, Leibler
- 1951
|
|
38
|
Pattern classification
– Duda, Hart, et al.
- 2001
|
|
35
|
IEEE Standard for Binary Floating Point Arithmetic, Std 754-1985 edition
– ANSI/IEEE
- 1985
|
|
35
|
On feature distributional clustering for text categorization
– Bekkerman, El-Yaniv, et al.
- 2001
|
|
33
|
Introduction to Modern Information Retrieval
– Salton, McGill
- 1983
|
|
32
|
The complexity of the generalized Lloyd–Max problem
– Garey, Johnson, et al.
- 1982
|
|
27
|
On the optimality of the simple Bayesian classifier under zero-one loss
– Domingos, Pazzani
- 1997
|
|
9
|
Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications
– Forgy
- 1965
|
|
7
|
The power of word clusters for text classification
– Slonim, Tishby
- 2001
|
|
6
|
Conditions for the equivalence of hierarchical and non-hierarchical bayesian classifiers
– Mitchell
- 1998
|
|
5
|
Hierarchical Classification of Web Content
– Dumais, Chen
- 2000
|
|
4
|
Introduction to Modern Information Retrieval. McGraw-Hill Book Company
– Salton, McGill
- 1983
|
|
2
|
Distributional clustering of English words
– McGraw-Hill
- 1997
|
|
1
|
Probabilistic latent semantic indexing
– Theory
- 1998
|