| N. Slonim and N. Tishby. The Power of Word Clusters for Text Classification. In 23rd European Colloquium on Information Retrieval Research (ECIR), 2001, Best Paper Award. |
....new information-theoretic divisive algorithm for word clustering applied to text classification. In previous work, such distributional clustering of features has been found to achieve improvements over feature selection in terms of classification accuracy, especially at lower number of features [2, 28]. However the existing clustering techniques are agglomerative in nature and result in (i) sub-optimal word clusters and (ii) high computational cost. In order to explicitly capture the optimality of word clusters in an information-theoretic framework, we first derive a global criterion for feature ....
....Linear Discriminant Analysis, k-nearest neighbor etc. The problem is compounded when the documents are arranged in a hierarchy of classes and a full feature classifier is applied at each node of the hierarchy. A way to reduce dimensionality is by the distributional clustering of words features [25, 2, 28]. Each word cluster can then be treated as a single feature and thus dimensionality can be drastically reduced. As shown by [2, 28] such feature clustering is more effective than feature selection [30] especially at lower number of features. Also, feature clustering appears to preserve classi ....
[Article contains additional citation context not shown here]
N. Slonim and N. Tishby. The power of word clusters for text classification. In 23rd European Colloquium on Information Retrieval Research (ECIR), 2001.
....and Retrieval Clustering General Terms Algorithms Keywords Word Clustering, Feature Dimensionality Reduction 1. INTRODUCTION Word clustering techniques have been successfully used for text classification, with two main advantages: dimension reduction and improving classification accuracy [1, 2, 3, 6]. Information theoretic approach to word clustering considers the word distributions over categories to determine similar words. Such methods need labeled training data. Instead, we introduce a rule based, context dependent word clustering method, with the rules extracted from various domain ....
N. Slonim and N. Tishby. The power of word clusters for text classification. In ECIR, 2001.
....and preservation of mutual information. The Information Bottleneck algorithm yields a soft clustering of the data using a procedure similar to the deterministic annealing approach of [16]. A greedy agglomerative hard clustering version of the Information Bottleneck algorithm was used in [1, 19] to cluster words in order to reduce feature size for supervised text classification. For this same task, recently [6] proposed a divisive hard clustering algorithm that directly minimizes the loss in mutual information and was found to result in higher classification accuracies than [1, 19]. All ....
....in [1, 19] to cluster words in order to reduce feature size for supervised text classification. For this same task, recently [6] proposed a divisive hard clustering algorithm that directly minimizes the loss in mutual information and was found to result in higher classification accuracies than [1, 19]. All these algorithms were proposed for one-sided clustering. An agglomerative hard clustering version of the Information Bottleneck algorithm was used in [18] to cluster documents after clustering words. The work in [8] extended the above work to repetitively cluster documents and then words. ....
[Article contains additional citation context not shown here]
N. Slonim and N. Tishby. The power of word clusters for text classification. In 23rd European Colloquium on Information Retrieval Research (ECIR), 2001.
....new information-theoretic divisive algorithm for word clustering applied to text classification. In previous work, such distributional clustering of features has been found to achieve improvements over feature selection in terms of classification accuracy, especially at lower number of features [2, 28]. However the existing clustering techniques are agglomerative in nature and result in (i) sub-optimal word clusters and (ii) high computational cost. In order to explicitly capture the optimality of word clusters in an information-theoretic framework, we first derive a global criterion for feature ....
....Linear Discriminant Analysis, k-nearest neighbor etc. The problem is compounded when the documents are arranged in a hierarchy of classes and a full feature classifier is applied at each node of the hierarchy. A way to reduce dimensionality is by the distributional clustering of words features [25, 2, 28]. Each word cluster can then be treated as a single feature and thus dimensionality can be drastically reduced. As shown by [2, 28] such feature clustering is more effective than feature selection [30] especially at lower number of features. Also, feature clustering appears to preserve classi ....
[Article contains additional citation context not shown here]
N. Slonim and N. Tishby. The power of word clusters for text classification. In 23rd European Colloquium on Information Retrieval Research (ECIR), 2001.
....| w_a) P(c_j | w_b) The authors test a naive Bayes classifier that uses clusters constructed by their algorithm as features. The accuracy stays similar to that obtained using single words as features, while the number of features is reduced by up to three orders of magnitude. Slonim and Tishby [ST01] present essentially the same algorithm for word clustering by using an information bottleneck framework [TPB99] as a theoretical basis for it. The clusters are repeatedly joined, each time joining two clusters of the current partition into a single new cluster in a way that locally minimizes the ....
....Call one of the functions ClusterFeaturesAvg, ClusterFeaturesOr, ClusterFeaturesDep. Figure 3.5: Function ClusterFeatures 3.5 Feature Clustering Algorithm using Probability Average In this section we describe a feature clustering algorithm that, similarly to the algorithm of Slonim and Tishby [ST01] presented in Section 1.2, merges 20 smaller clusters into larger ones by using an information loss criterion. In our classifiers, both the presence and the absence of a feature in a text are used as evidence, while in both [BM98] and [ST01] only the presence of a feature is used as evidence. ....
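The greedy merging described in the two contexts above can be sketched as follows. This is a minimal illustration, not the code of [ST01] or of the citing thesis: it assumes the truncated criterion is the loss of mutual information between clusters and categories, which for merging clusters a and b equals p(a) KL(p(c|a) || m) + p(b) KL(p(c|b) || m), where m is the prior-weighted mixture of the two category distributions. The names `kl`, `merge_cost` and `greedy_merge`, and the toy priors and distributions, are hypothetical.

```python
import math
from itertools import combinations


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def merge_cost(pa, dist_a, pb, dist_b):
    """Mutual information lost by merging two clusters.

    pa, pb are the cluster priors (assumed > 0); dist_a, dist_b are the
    category distributions p(c | cluster). The cost is (pa + pb) times the
    Jensen-Shannon divergence with weights pa/(pa+pb) and pb/(pa+pb).
    """
    w = pa + pb
    mix = [(pa * x + pb * y) / w for x, y in zip(dist_a, dist_b)]
    return pa * kl(dist_a, mix) + pb * kl(dist_b, mix)


def greedy_merge(clusters, k):
    """Merge the cheapest pair repeatedly until k clusters remain.

    clusters is a list of (members, prior, category_distribution) tuples.
    """
    clusters = list(clusters)
    while len(clusters) > k:
        # Pick the pair whose merge loses the least information (a local choice).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: merge_cost(
                clusters[ij[0]][1], clusters[ij[0]][2],
                clusters[ij[1]][1], clusters[ij[1]][2],
            ),
        )
        (ma, pa, da), (mb, pb, db) = clusters[i], clusters[j]
        w = pa + pb
        merged = (ma + mb, w, [(pa * x + pb * y) / w for x, y in zip(da, db)])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


# Toy input: two sports-like words and one politics-like word.
toy = [
    (["ball"], 0.3, [0.9, 0.1]),
    (["goal"], 0.3, [0.8, 0.2]),
    (["vote"], 0.4, [0.1, 0.9]),
]
result = greedy_merge(toy, 2)
```

Each step evaluates every pair of current clusters, so the search is quadratic in the number of clusters per merge, which matches the computational cost that several of the citing contexts attribute to agglomerative methods.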
[Article contains additional citation context not shown here]
N. Slonim and N. Tishby. The power of word clusters for text classification. In 23rd European Colloquium on Information Retrieval Research, 2001.
....Analysis, k-nearest neighbor etc. The problem is compounded when the documents are arranged in a hierarchy of classes since a full feature classifier needs to be applied at each node of the hierarchy. A way to reduce dimensionality is by the distributional clustering of words features [25, 2, 28]. Each word cluster can be treated as a single feature and thus, dimensionality can be drastically reduced. As shown by [2, 28] such feature clustering is more effective than feature selection [30] especially at lower number of features. Also, feature clustering appears to preserve classi ....
....a full feature classifier needs to be applied at each node of the hierarchy. A way to reduce dimensionality is by the distributional clustering of words features [25, 2, 28]. Each word cluster can be treated as a single feature and thus, dimensionality can be drastically reduced. As shown by [2, 28], such feature clustering is more effective than feature selection [30] especially at lower number of features. Also, feature clustering appears to preserve classification accuracy as compared to a full feature classifier. Indeed in some cases of small training sets and noisy features, word ....
[Article contains additional citation context not shown here]
N. Slonim and N. Tishby. The power of word clusters for text classification. In 23rd European Colloquium on Information Retrieval Research (ECIR), 2001.
....divisive algorithm for word clustering applied to text classification. In previous work, such distributional clustering of features has been found to achieve significant improvements over feature selection in terms of classification accuracy, especially at lower number of features [2, 29]. However the existing clustering techniques are agglomerative in nature resulting in (i) sub-optimal word clusters and (ii) high computational cost. In order to explicitly capture the optimality of word clusters in an information-theoretic framework, we first derive a global criterion for ....
....Analysis, k-nearest neighbor etc. The problem is compounded when the documents are arranged in a hierarchy of classes since a full feature classifier needs to be applied at each node of the hierarchy. A way to reduce dimensionality is by the distributional clustering of words features [25, 2, 29]. Each word cluster can be treated as a single feature and thus, dimensionality can be drastically reduced. As shown by [2, 29] such feature clustering is more effective than feature selection [32] especially at lower number of features. Also, feature clustering appears to preserve classification ....
[Article contains additional citation context not shown here]
N. Slonim and N. Tishby. The power of word clusters for text classification. In 23rd European Colloquium on Information Retrieval Research (ECIR), 2001.
No context found.
N. Slonim and N. Tishby. The Power of Word Clusters for Text Classification. In 23rd European Colloquium on Information Retrieval Research (ECIR), 2001, Best Paper Award.
....overestimate of the performance of these algorithms, and thus penalizes the sequential IB algorithm in the comparisons below. 6.2 The evaluation method Unfortunately there is no clear standard about what should be referred to as a file header in this corpus. In particular, the results reported in [15, 16] stripped off the header including the subject line (as instructed in [9]). On the other hand, the results reported in [1, 5] do make use of the subject line, which in many cases contains useful information. To make our results comparable with [5] we decided to use the subject line in this ....
....15; 0 and maxL = 30. However, all algorithms except for sL1 attained full convergence in all 15 restarts and over all datasets after less than 30 loops. To gain some perspective about how hard the classification task is, we also present results of a supervised Naive Bayes (NB) classifier (see [16] for the details of the implementation). The test set for this classifier consisted of the same 500 documents in each data set while the training set consisted of different 500 documents randomly chosen from the appropriate categories. We repeated this process 10 times and averaged the results. ....
N. Slonim and N. Tishby. The power of word clusters for text classification. In 23rd European Colloquium on Information Retrieval Research (ECIR), 2001.
....words with only one occurrence. The remaining words (in the training set) were sorted by their contribution to the information about the category identity, i.e. by $I(w) \equiv P(w) \sum_{c \in C} P(c \mid w) \log \frac{P(c \mid w)}{P(c)}$. We used exactly the same Bayesian framework for the classification (see [16] for details). We emphasize that though in this representation one ignores the original order of the words (i.e. word context) there is massive empirical evidence for the high performance of this text classification scheme (e.g. [10]). 4.1.2 Text classification results In figure 4 we present the ....
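The word score in the context above, each word's contribution P(w) times the sum over categories of P(c|w) log(P(c|w)/P(c)), can be computed from a word-by-category count table. The sketch below is only an illustration under assumptions: the function name `word_information`, the nested-dictionary input layout, and the toy counts are hypothetical, probabilities are estimated by plain relative frequencies, and natural logarithms are used.

```python
import math


def word_information(counts):
    """Return {word: I(w)} for counts[word][category] = number of occurrences.

    I(w) = P(w) * sum_c P(c|w) * log(P(c|w) / P(c)), with all probabilities
    estimated as relative frequencies of the counts (an assumption).
    """
    total = sum(sum(row.values()) for row in counts.values())

    # Marginal count of each category across all words.
    cat_totals = {}
    for row in counts.values():
        for c, n in row.items():
            cat_totals[c] = cat_totals.get(c, 0) + n

    scores = {}
    for w, row in counts.items():
        n_w = sum(row.values())
        if n_w == 0:
            scores[w] = 0.0  # a word that never occurs carries no information
            continue
        p_w = n_w / total
        s = 0.0
        for c, n in row.items():
            if n == 0:
                continue  # 0 * log 0 is taken as 0
            p_c_given_w = n / n_w
            p_c = cat_totals[c] / total
            s += p_c_given_w * math.log(p_c_given_w / p_c)
        scores[w] = p_w * s
    return scores


# Toy counts: "ball" and "vote" are category-specific, "the" is not.
toy_counts = {
    "ball": {"sports": 4, "politics": 0},
    "vote": {"sports": 0, "politics": 4},
    "the": {"sports": 2, "politics": 2},
}
scores = word_information(toy_counts)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Summing the scores over all words gives the mutual information between words and categories, so sorting by the score ranks words by how much of that total each one contributes.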
N. Slonim and N. Tishby. The Power of Word Clusters for Text Classification. In 23rd European Colloquium on Information Retrieval Research (ECIR), 2001.