Feature selection for classification based on text hierarchy (1998)

by Marko Grobelnik
Text and the Web, Conference on Automated Learning and Discovery CONALD-98
http://pecan.srv.cs.cmu.edu/afs/cs.cmu.edu/local/mosaic/common/omega/Web/People/TextLearning/pww/papers/PWW/pwwCONALD98.ps.gz

Abstract:

This paper describes automatic document categorization based on a large text hierarchy. We handle the large number of features and training examples by taking into account the hierarchical structure of the examples and by using feature selection for large text data. We experimentally evaluate feature subset selection on real-world text data collected from the existing Web hierarchy named Yahoo. In our learning experiments, a naive Bayesian classifier was used on text data with a feature-vector document representation that includes word sequences (n-grams) instead of just single words (unigrams). Experimental evaluation on real-world data collected from the Web shows that our approach gives promising results and can potentially be used for document categorization on the Web. Additionally, the best result on our data is achieved for a relatively small feature subset, while for larger subsets the performance drops substantially. The best performance among the six tested feature scoring measures was achieved by the measure called Odds ratio, which is known from information retrieval.
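To make the feature selection step concrete, the following is a minimal Python sketch of n-gram feature extraction followed by Odds ratio scoring. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the formula, so the sketch assumes the commonly used definition log[P(W|pos)(1 - P(W|neg)) / ((1 - P(W|pos)) P(W|neg))], estimates probabilities from document frequencies with add-one style smoothing, and uses whitespace tokenization. The names `ngrams`, `odds_ratio_scores`, `select_top`, and the parameters `max_n` and `smoothing` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, max_n=2):
    """Return all word sequences of length 1..max_n as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def odds_ratio_scores(pos_docs, neg_docs, max_n=2, smoothing=1.0):
    """Score each n-gram feature by its log odds ratio (assumed definition).

    Probabilities are smoothed document frequencies, so neither the
    numerator nor the denominator of the ratio can become zero.
    """
    def doc_freq(docs):
        counts = Counter()
        for text in docs:
            # Count each feature at most once per document.
            counts.update(set(ngrams(text.lower().split(), max_n)))
        return counts

    pos_df, neg_df = doc_freq(pos_docs), doc_freq(neg_docs)
    scores = {}
    for feat in set(pos_df) | set(neg_df):
        p_pos = (pos_df[feat] + smoothing) / (len(pos_docs) + 2 * smoothing)
        p_neg = (neg_df[feat] + smoothing) / (len(neg_docs) + 2 * smoothing)
        scores[feat] = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores


def select_top(scores, k):
    """Keep the k highest-scoring features; ties are broken alphabetically."""
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]


if __name__ == "__main__":
    # Toy documents, invented purely for illustration.
    pos = ["machine learning text", "machine learning web"]
    neg = ["football match", "football web"]
    print(select_top(odds_ratio_scores(pos, neg), 3))
```

Features that occur mostly in the positive category receive high scores, features shared equally by both categories score near zero, and keeping only the top `k` corresponds to choosing a small feature subset before training the classifier.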

Citations

2489 Induction of Decision Trees – Quinlan - 1986
512 A comparative study on feature selection in text categorization – Yang, Pedersen - 1997
490 Irrelevant features and the subset selection problem – John, Kohavi - 1994
347 Fast Discovery of Association Rules – Agrawal - 1995
334 On the optimality of the simple bayesian classifier under zero-one loss – Domingos, Pazzani - 1997
289 Hierarchically classifying documents using very few words – Koller, Sahami - 1997
256 A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for text categorization – Joachims - 1997
57 On Biases in Estimating Multi-Valued Attributes – Kononenko - 1995
57 Feature subset selection in text-learning – Mladenic - 1998
26 The selection of good search terms – Rijsbergen, Harper, et al. - 1981
3 Learning Machine: design and implementation – Grobelnik, Mladenić - 1998
1 Efficient text categorization, Text Mining workshop – Mladenić, Grobelnik - 1998