This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is significant because in many important text classification problems obtaining classification labels is expensive, while large quantities of unlabeled documents are readily available. We present a theoretical argument showing that, under common assumptions, unlabeled data contain information about the target function. We then introduce an algorithm for learning from labeled and unlabeled text, based on the combination of Expectation-Maximization with a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 30%.
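The training loop described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names (`train_nb`, `posterior`, `em_naive_bayes`), the multinomial word-count model, the Laplace smoothing constant `alpha`, and the fixed iteration count `n_iter` are all assumptions made for the example, since the abstract does not specify them.

```python
import numpy as np

def train_nb(X, R, alpha=1.0):
    """Fit multinomial naive Bayes parameters.

    X: word-count matrix (docs x vocab).
    R: class memberships (docs x classes); rows may be soft (probabilistic).
    alpha: Laplace smoothing constant (an assumption of this sketch).
    """
    n_classes = R.shape[1]
    # Smoothed class priors from the (possibly fractional) class counts.
    log_prior = np.log((R.sum(axis=0) + alpha) / (R.sum() + alpha * n_classes))
    # Expected word counts per class, weighted by class membership.
    word_counts = R.T @ X + alpha
    log_word = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))
    return log_prior, log_word

def posterior(X, log_prior, log_word):
    """Return P(class | document) for each row of X."""
    joint = X @ log_word.T + log_prior
    joint = joint - joint.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(joint)
    return p / p.sum(axis=1, keepdims=True)

def em_naive_bayes(X_lab, y_lab, X_unl, n_classes, n_iter=10):
    """Train on labeled documents, then iterate EM over all documents."""
    R_lab = np.eye(n_classes)[y_lab]          # hard labels as one-hot rows
    params = train_nb(X_lab, R_lab)           # initial classifier, labeled data only
    X_all = np.vstack([X_lab, X_unl])
    for _ in range(n_iter):
        R_unl = posterior(X_unl, *params)     # E-step: probabilistic labels
        params = train_nb(X_all, np.vstack([R_lab, R_unl]))  # M-step: retrain
    return params

# Toy usage with a 4-word vocabulary and two classes (made-up data).
X_lab = np.array([[3, 1, 0, 0], [0, 0, 2, 2]])
y_lab = np.array([0, 1])
X_unl = np.array([[2, 2, 0, 0], [0, 1, 3, 1], [4, 0, 0, 1], [0, 0, 1, 3]])
params = em_naive_bayes(X_lab, y_lab, X_unl, n_classes=2)
predicted = posterior(np.array([[1, 1, 0, 0], [0, 0, 1, 1]]), *params).argmax(axis=1)
```

The labeled documents keep their given labels in every iteration; only the unlabeled documents are relabeled in the E-step. A convergence test on the change in likelihood could replace the fixed iteration count.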