Feature selection for classification based on text hierarchy (1998) [30 citations — 2 self]
Abstract:
This paper describes automatic document categorization based on a large text hierarchy. We handle the large number of features and training examples by taking into account the hierarchical structure of examples and using feature selection for large text data. We experimentally evaluate feature subset selection on real-world text data collected from the existing Web hierarchy named Yahoo. In our learning experiments, a naive Bayesian classifier was used on text data using a feature-vector document representation that includes word sequences (n-grams) instead of just single words (unigrams). Experimental evaluation on real-world data collected from the Web shows that our approach gives promising results and can potentially be used for document categorization on the Web. Additionally, the best result on our data is achieved for a relatively small feature subset, while for larger subsets the performance drops substantially. The best performance among the six tested feature scoring measures was achieved by Odds ratio, a measure known from information retrieval.
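As a rough illustration of the Odds ratio measure named in the abstract, the sketch below scores each feature by log[P(W|pos)(1 - P(W|neg)) / ((1 - P(W|pos))P(W|neg))] and keeps the top-scoring features. It is not the authors' implementation: the function names, the `smoothing` parameter, and the use of document frequencies to estimate the probabilities are assumptions made for this example.

```python
import math
from collections import Counter


def odds_ratio_scores(pos_docs, neg_docs, smoothing=1.0):
    """Score features by log odds ratio between two classes of documents.

    pos_docs / neg_docs: lists of token lists (tokens may be words or n-grams).
    smoothing: hypothetical additive constant (an assumption of this sketch)
    that keeps the estimated probabilities away from 0 and 1.
    """
    # Count in how many documents of each class a feature occurs.
    pos_df = Counter(tok for doc in pos_docs for tok in set(doc))
    neg_df = Counter(tok for doc in neg_docs for tok in set(doc))
    n_pos, n_neg = len(pos_docs), len(neg_docs)

    scores = {}
    for tok in set(pos_df) | set(neg_df):
        # Smoothed estimates of P(W|pos) and P(W|neg).
        p_pos = (pos_df[tok] + smoothing) / (n_pos + 2 * smoothing)
        p_neg = (neg_df[tok] + smoothing) / (n_neg + 2 * smoothing)
        scores[tok] = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores


def select_top_features(scores, k):
    """Return the k highest-scoring features (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ranked[:k]]
```

With scoring of this kind, features that are frequent in the positive class and rare in the negative class rank highest, so a small value of `k` keeps only strongly class-indicative features.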
Citations
| 2489 | Induction of Decision Trees – Quinlan - 1986 |
| 512 | A comparative study on feature selection in text categorization – Yang, Pedersen - 1997 |
| 490 | Irrelevant features and the subset selection problem – John, Kohavi - 1994 |
| 347 | Fast Discovery of Association Rules – Agrawal - 1995 |
| 334 | On the optimality of the simple bayesian classifier under zero-one loss – Domingos, Pazzani - 1997 |
| 289 | Hierarchically classifying documents using very few words – Koller, Sahami - 1997 |
| 256 | A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for text categorization – Joachims - 1997 |
| 57 | On Biases in Estimating Multi-Valued Attributes – Kononenko - 1995 |
| 57 | Feature subset selection in text-learning – Mladenic - 1998 |
| 26 | The selection of good search terms – Rijsbergen, Harper, et al. - 1981 |
| 3 | Learning Machine: design and implementation – Grobelnik, Mladenić - 1998 |
| 1 | Efficient text categorization, Text Mining workshop – Mladenić, Grobelnik - 1998 |

