A re-examination of text categorization methods

Abstract:
This paper reports a controlled study with statistical significance tests on five text categorization methods: Support Vector Machines (SVM), a k-Nearest Neighbor (kNN) classifier, a neural network (NNet) approach, the Linear Least-squares Fit (LLSF) mapping and a Naive Bayes (NB) classifier. We focus on the robustness of these methods in dealing with a skewed category distribution, and their performance as a function of the training-set category frequency. Our results show that SVM, kNN and LLSF significantly outperform NNet and NB when the number of positive training instances per category is small (less than ten), and that all the methods perform comparably when the categories are sufficiently common (over 300 instances).
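One way to read "performance as a function of the training-set category frequency" is to group categories by how many positive training instances they have and average a per-category score within each group. The Python sketch below illustrates that idea only; it is not the authors' evaluation code. The function names, the choice of F1 as the score, the bin edges and the toy counts are all assumptions introduced for illustration.

```python
from collections import defaultdict


def f1_score(tp, fp, fn):
    """F1 from per-category counts; taken to be 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1_by_frequency(categories, bin_edges=(10, 300)):
    """Group categories by training frequency and macro-average F1 in each group.

    categories: iterable of (n_train_positives, tp, fp, fn) tuples.
    bin_edges:  ascending thresholds. (10, 300) loosely mirrors the
                "less than ten" and "over 300" regimes named in the abstract;
                the binning itself (and placing exactly 300 in the top bin)
                is an assumption of this sketch.
    """
    groups = defaultdict(list)
    for n_train, tp, fp, fn in categories:
        # Bin index = number of thresholds that n_train meets or exceeds.
        bin_index = sum(n_train >= edge for edge in bin_edges)
        groups[bin_index].append(f1_score(tp, fp, fn))
    return {b: sum(scores) / len(scores) for b, scores in sorted(groups.items())}


# Made-up counts purely to exercise the function (not results from the paper).
toy = [
    (5, 1, 1, 1),       # rare category: precision = recall = 0.5
    (8, 0, 2, 3),       # rare category with no true positives
    (400, 90, 10, 10),  # common category: precision = recall = 0.9
]
result = macro_f1_by_frequency(toy)
```

With real data, each tuple would come from one category's test-set contingency counts for one classifier, and comparing the per-bin averages across classifiers would show where the methods diverge.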