An extensive empirical study of feature selection metrics for text classification (2003)

by George Forman, Isabelle Guyon, André Elisseeff
Venue: J. of Machine Learning Research
Citations: 495 (15 self)

BibTeX

@ARTICLE{Forman03anextensive,
    author = {George Forman and Isabelle Guyon and André Elisseeff},
    title = {An extensive empirical study of feature selection metrics for text classification},
    journal = {J. of Machine Learning Research},
    year = {2003},
    volume = {3},
    pages = {1289--1305}
}

Abstract

Machine learning for text classification is the cornerstone of document categorization, news filtering, document routing, and personalization. In text domains, effective feature selection is essential to make the learning task efficient and more accurate. This paper presents an empirical comparison of twelve feature selection methods (e.g. Information Gain) evaluated on a benchmark of 229 text classification problem instances that were gathered from Reuters, TREC, OHSUMED, etc. The results are analyzed from multiple goal perspectives (accuracy, F-measure, precision, and recall), since each is appropriate in different situations. The results reveal that a new feature selection metric we call 'Bi-Normal Separation' (BNS) outperformed the others by a substantial margin in most situations. This margin widened in tasks with high class skew, which is rampant in text classification problems and is particularly challenging for induction algorithms. A new evaluation methodology is offered that focuses on the needs of the data mining practitioner faced with a single dataset who seeks to choose the one metric (or pair of metrics) most likely to yield the best performance. From this perspective, BNS was the top single choice for all goals except precision, for which Information Gain yielded the best result most often. This analysis also revealed, for example, that Information Gain and Chi-Squared have correlated failures, and so they work poorly together. When choosing optimal pairs of metrics for each of the four performance goals, BNS is consistently a member of the pair; e.g., for greatest recall, the pair BNS + F1-measure yielded the best performance on the greatest number of tasks by a considerable margin.
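
The abstract names Bi-Normal Separation but does not state its formula. The sketch below assumes the commonly cited definition, BNS = |F^-1(tpr) - F^-1(fpr)|, where F^-1 is the inverse standard normal CDF, tpr is the fraction of positive documents containing the feature, and fpr is the fraction of negative documents containing it. The function names `bns_score` and `top_k_features`, the input layout, and the clipping constant `eps` are illustrative assumptions, not taken from this record.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def bns_score(tp: int, fp: int, pos: int, neg: int, eps: float = 0.0005) -> float:
    """Score one binary feature (e.g. word presence) with Bi-Normal Separation.

    tp  -- positive documents that contain the feature
    fp  -- negative documents that contain the feature
    pos -- total positive documents
    neg -- total negative documents
    eps -- assumed clipping bound; keeps inv_cdf away from 0 and 1,
           where the inverse normal CDF is undefined
    """
    if pos <= 0 or neg <= 0:
        raise ValueError("both classes need at least one document")
    tpr = min(max(tp / pos, eps), 1.0 - eps)
    fpr = min(max(fp / neg, eps), 1.0 - eps)
    return abs(_STD_NORMAL.inv_cdf(tpr) - _STD_NORMAL.inv_cdf(fpr))


def top_k_features(counts: dict, pos: int, neg: int, k: int) -> list:
    """Return the k feature names with the highest BNS score.

    counts maps a feature name to its (tp, fp) document counts.
    """
    ranked = sorted(
        counts,
        key=lambda name: bns_score(*counts[name], pos, neg),
        reverse=True,
    )
    return ranked[:k]
```

A feature that occurs equally often in both classes scores zero, while one concentrated in either class scores high, so the ranking treats strong negative indicators the same as strong positive ones.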

Keyphrases

text classification; extensive empirical study; feature selection metric; information gain; pair bns f1-measure; top single choice; multiple goal perspective; new feature selection; text domain; single dataset; new evaluation methodology; bi-normal separation; considerable margin; high class skew; document routing; induction algorithm; learning task efficient; optimal pair; twelve feature selection method; document categorization; text classification problem; substantial margin; empirical comparison; different situation; news filtering; data mining practitioner; text classification problem instance; effective feature selection; performance goal
