A sequential algorithm for training text classifiers (1994)

by D. Lewis, W. Gale
Venue: SIGIR-94

Results 1 - 10 of 612 citing documents

Machine Learning in Automated Text Categorization

by Fabrizio Sebastiani - ACM Computing Surveys, 2002
"... The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this p ..."
Abstract - Cited by 1658 (22 self) - Add to MetaCart
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.

Text Classification from Labeled and Unlabeled Documents using EM

by Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun, Tom Mitchell - Machine Learning, 1999
"... This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large qua ..."
Abstract - Cited by 1033 (19 self) - Add to MetaCart
This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve ...
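
The basic procedure this abstract describes is compact enough to sketch directly. Below is a minimal hard-EM version, assuming scikit-learn's MultinomialNB and hypothetical dense count matrices X_labeled and X_unlabeled; committing each unlabeled document to its most probable class is a simplification, since the paper instead weights unlabeled documents by their full class posteriors.

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    def em_naive_bayes(X_labeled, y_labeled, X_unlabeled, n_iter=10):
        # Train an initial classifier on the labeled documents only.
        clf = MultinomialNB().fit(X_labeled, y_labeled)
        X_all = np.vstack([X_labeled, X_unlabeled])
        for _ in range(n_iter):
            # E-step: probabilistically label the unlabeled documents.
            proba = clf.predict_proba(X_unlabeled)
            # M-step: retrain on all documents (hard-EM simplification:
            # each unlabeled document gets its most probable class).
            y_unlabeled = clf.classes_[proba.argmax(axis=1)]
            clf = MultinomialNB().fit(X_all, np.concatenate([y_labeled, y_unlabeled]))
        return clf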

A comparison of event models for Naive Bayes text classification

by Andrew McCallum, Kamal Nigam, 1998
"... Recent work in text classification has used two different first-order probabilistic models for classification, both of which make the naive Bayes assumption. Some use a multi-variate Bernoulli model, that is, a Bayesian Network with no dependencies between words and binary word features (e.g. Larkey ..."
Abstract - Cited by 1002 (27 self) - Add to MetaCart
Recent work in text classification has used two different first-order probabilistic models for classification, both of which make the naive Bayes assumption. Some use a multi-variate Bernoulli model, that is, a Bayesian Network with no dependencies between words and binary word features (e.g. Larkey and Croft 1996; Koller and Sahami 1997). Others use a multinomial model, that is, a unigram language model with integer word counts (e.g. Lewis and Gale 1994; Mitchell 1997). This paper aims to clarify the confusion by describing the differences and details of these two models, and by empirically comparing their classification performance on five text corpora. We find that the multi-variate Bernoulli model performs well with small vocabulary sizes, but that the multinomial model usually performs even better at larger vocabulary sizes, providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size.
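
A minimal way to reproduce this comparison with current tooling, assuming scikit-learn: BernoulliNB binarizes counts into presence/absence features while MultinomialNB consumes the integer counts directly, and the synthetic count matrix is a stand-in for a real corpus.

    import numpy as np
    from sklearn.naive_bayes import BernoulliNB, MultinomialNB
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.poisson(1.0, size=(200, 50))   # stand-in word-count matrix
    y = rng.integers(0, 2, size=200)       # stand-in binary labels

    for name, clf in [("multi-variate Bernoulli", BernoulliNB(binarize=0.5)),
                      ("multinomial", MultinomialNB())]:
        scores = cross_val_score(clf, X, y, cv=5)
        print(f"{name}: mean accuracy {scores.mean():.3f}")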

Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval

by David D. Lewis, 1998
"... The naive Bayes classifier, currently experiencing a renaissance in machine learning, has long been a core technique in information retrieval. We review some of the variations of naive Bayes models used for text retrieval and classification, focusing on the distributional assump- tions made abou ..."
Abstract - Cited by 496 (1 self) - Add to MetaCart
The naive Bayes classifier, currently experiencing a renaissance in machine learning, has long been a core technique in information retrieval. We review some of the variations of naive Bayes models used for text retrieval and classification, focusing on the distributional assumptions made about word occurrences in documents.

Citation Context

...rrange to compare posterior log odds of classes for each document individually, without comparisons across documents. Indeed, we know of many applications of multinomial models to text categorization [3, 14, 15, 25, 32, 34] but none to text retrieval. 5.3 Non-Distributional Approaches A variety of ad hoc approaches have been developed that more or less gracefully integrate term frequency and document length information ...

Support vector machine active learning for image retrieval

by Simon Tong, 2001
"... Relevance feedback is often a critical component when designing image databases. With these databases it is difficult to specify queries directly and explicitly. Relevance feedback interactively determinines a user’s desired output or query concept by asking the user whether certain proposed images ..."
Abstract - Cited by 448 (29 self) - Add to MetaCart
Relevance feedback is often a critical component when designing image databases. With these databases it is difficult to specify queries directly and explicitly. Relevance feedback interactively determines a user’s desired output or query concept by asking the user whether certain proposed images are relevant or not. For a relevance feedback algorithm to be effective, it must grasp a user’s query concept accurately and quickly, while also only asking the user to label a small number of images. We propose the use of a support vector machine active learning algorithm for conducting effective relevance feedback for image retrieval. The algorithm selects the most informative images to query a user and quickly learns a boundary that separates the images that satisfy the user’s query concept from the rest of the dataset. Experimental results show that our algorithm achieves significantly higher search accuracy than traditional query refinement schemes after just three to four rounds of relevance feedback.
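
The selection step the abstract describes can be sketched in a few lines, assuming scikit-learn's SVC with a linear kernel; X_pool and the labeled arrays are hypothetical image-feature matrices, and querying the points nearest the hyperplane is the simplest of the selection criteria considered.

    import numpy as np
    from sklearn.svm import SVC

    def select_queries(clf, X_pool, batch_size=20):
        # Images closest to the decision boundary are the most ambiguous under
        # the current model, hence the most informative to show the user.
        margins = np.abs(clf.decision_function(X_pool))
        return np.argsort(margins)[:batch_size]

    # One round of relevance feedback (the labeling step is the user):
    # clf = SVC(kernel="linear").fit(X_labeled, y_labeled)
    # query_indices = select_queries(clf, X_pool)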

Citation Context

...ch a query refinement scheme (we call it a query concept learner or learner hereafter) can be regarded as a machine learning task. In particular, it can be seen as a case of pool-based active learning [19, 23]. In pool-based active learning the learner has access to a pool of unlabeled data and can request the user's label for a certain number of instances in the pool. In the image retrieval domain, the un...

Learning Information Extraction Rules for Semi-structured and Free Text

by Stephen Soderland, Claire Cardie, Raymond Mooney - Machine Learning, 1999
"... . A wealth of on-line text information can be made available to automatic processing by information extraction (IE) systems. Each IE application needs a separate set of rules tuned to the domain and writing style. WHISK helps to overcome this knowledge-engineering bottleneck by learning text extract ..."
Abstract - Cited by 428 (9 self) - Add to MetaCart
A wealth of on-line text information can be made available to automatic processing by information extraction (IE) systems. Each IE application needs a separate set of rules tuned to the domain and writing style. WHISK helps to overcome this knowledge-engineering bottleneck by learning text extraction rules automatically. WHISK is designed to handle text styles ranging from highly structured to free text, including text that is neither rigidly formatted nor composed of grammatical sentences. Such semistructured text has largely been beyond the scope of previous systems. When used in conjunction with a syntactic analyzer and semantic tagging, WHISK can also handle extraction from free text such as news stories.

Keywords: natural language processing, information extraction, rule learning

1. Information extraction: As more and more text becomes available on-line, there is a growing need for systems that extract information automatically from text data. An information extraction (IE) sys...
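
WHISK's learned rules are pattern-matching rules whose wildcards skip intervening tokens and whose parenthesized slots mark the fields to extract. The regular expression below is not WHISK output, only a hand-written analogue on a rental-ad example of the kind the paper uses, to show what such a rule does.

    import re

    # Two extraction slots: number of bedrooms and monthly price.
    rule = re.compile(r"(\d+)\s*br\b.*?\$(\d+)", re.IGNORECASE)

    ad = "Capitol Hill - 2 br apartment, newly remodeled. $675 including utilities."
    m = rule.search(ad)
    if m:
        print({"bedrooms": m.group(1), "price": m.group(2)})
        # -> {'bedrooms': '2', 'price': '675'}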

Learning to Extract Symbolic Knowledge from the World Wide Web

by Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew McCallum, Tom Mitchell, Kamal Nigam, Sean Slattery, 1998
"... The World Wide Web is a vast source of information accessible to computers, but understandable only to humans. The goal of the research described here is to automatically create a computer understandable world wide knowledge base whose content mirrors that of the World Wide Web. Such a ..."
Abstract - Cited by 405 (30 self) - Add to MetaCart
The World Wide Web is a vast source of information accessible to computers, but understandable only to humans. The goal of the research described here is to automatically create a computer understandable world wide knowledge base whose content mirrors that of the World Wide Web. Such a ...

Citation Context

... Another useful line of research in text classification comes from basic ideas in probability and information theory. Bayes Rule has been the starting point for a number of classification algorithms [5, 6, 33, 35, 43, 46], and the Minimum Description Length principle has been used as the basis of an algorithm as well [32]. Another line of research has been to use symbolic learning methods for text classification. Nume...

LUBM: A benchmark for OWL knowledge base systems

by Yuanbo Guo, Zhengxiang Pan, Jeff Heflin - Semantic Web Journal, 2005
"... We describe our method for benchmarking Semantic Web knowledge base systems with respect to use in large OWL applications. We present the Lehigh University Benchmark (LUBM) as an example of how to design such benchmarks. The LUBM features an ontology for the university domain, synthetic OWL data sca ..."
Abstract - Cited by 362 (10 self) - Add to MetaCart
We describe our method for benchmarking Semantic Web knowledge base systems with respect to use in large OWL applications. We present the Lehigh University Benchmark (LUBM) as an example of how to design such benchmarks. The LUBM features an ontology for the university domain, synthetic OWL data scalable to an arbitrary size, fourteen extensional queries representing a variety of properties, and several performance metrics. The LUBM can be used to evaluate systems with different reasoning capabilities and storage mechanisms. We demonstrate this with an evaluation of two memory-based systems and two systems with persistent storage.

Citation Context

... systems considered. Thus, this metric should never be used without careful consideration of the parameters used in its calculation. We will come back to this in Section 5. First, we use an F-Measure-like metric [30, 24] to compute the tradeoff between query completeness and soundness, since essentially they are analogous to recall and precision in Information Retrieval respectively. In the formula below,...

Toward Optimal Active Learning through Sampling Estimation of Error Reduction

by Nicholas Roy, Andrew McCallum - In Proc. 18th International Conf. on Machine Learning, 2001
"... This paper presents an active learning method that directly optimizes expected future error. This is in contrast to many other popular techniques that instead aim to reduce version space size. These other methods are popular because for many learning models, closed form calculation of the expec ..."
Abstract - Cited by 352 (2 self) - Add to MetaCart
This paper presents an active learning method that directly optimizes expected future error. This is in contrast to many other popular techniques that instead aim to reduce version space size. These other methods are popular because for many learning models, closed form calculation of the expected future error is intractable. Our approach is made feasible by taking a sampling approach to estimating the expected reduction in error due to the labeling of a query. In experimental results on two real-world data sets we reach high accuracy very quickly, sometimes with four times fewer labeled examples than competing methods.
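
A minimal sketch of the sampling approach described above, assuming scikit-learn's MultinomialNB and using one minus the model's confidence on the unlabeled pool as a crude stand-in for expected future error (the paper's loss estimates are more careful). For each candidate query, the post-labeling error estimate is averaged over the candidate's possible labels, weighted by the current model's posterior.

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    def error_proxy(clf, X_pool):
        # Expected 0/1 error proxy: one minus the predicted-class probability.
        return np.mean(1.0 - clf.predict_proba(X_pool).max(axis=1))

    def select_query(X_lab, y_lab, X_pool):
        base = MultinomialNB().fit(X_lab, y_lab)
        pool_proba = base.predict_proba(X_pool)  # columns follow base.classes_
        best_idx, best_score = None, np.inf
        for i in range(X_pool.shape[0]):
            score = 0.0
            for j, c in enumerate(base.classes_):
                # Retrain as if example i had been labeled with class c.
                clf = MultinomialNB().fit(np.vstack([X_lab, X_pool[i:i + 1]]),
                                          np.append(y_lab, c))
                score += pool_proba[i, j] * error_proxy(clf, X_pool)
            if score < best_score:
                best_idx, best_score = i, score
        return best_idx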

Citation Context

...on cannot efficiently be found in closed form. Other, more widely used active learning methods attain practicality by optimizing a different, non-optimal criterion. For example, uncertainty sampling (Lewis & Gale, 1994) selects the example on which the current learner has lowest certainty; Query-by-Committee (Seung et al., 1992; Freund et al., 1997) selects examples that reduce the size of the version space (Mitche...
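
Uncertainty sampling itself, as characterized in this passage, reduces to a few lines; the sketch assumes any classifier exposing a scikit-learn-style predict_proba.

    import numpy as np

    def uncertainty_sample(clf, X_pool):
        # Query the pool example on which the current learner is least certain,
        # i.e. whose predicted-class probability is lowest.
        confidence = clf.predict_proba(X_pool).max(axis=1)
        return int(np.argmin(confidence))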

Learning Collaborative Information Filters

by Daniel Billsus, Michael J. Pazzani - In Proc. 15th International Conf. on Machine Learning, 1998
"... Predicting items a user would like on the basis of other users’ ratings for these items has become a well-established strategy adopted by many recommendation services on the Internet. Although this can be seen as a classification problem, algo-rithms proposed thus far do not draw on results from the ..."
Abstract - Cited by 345 (4 self) - Add to MetaCart
Predicting items a user would like on the basis of other users’ ratings for these items has become a well-established strategy adopted by many recommendation services on the Internet. Although this can be seen as a classification problem, algorithms proposed thus far do not draw on results from the machine learning literature. We propose a representation for collaborative filtering tasks that allows the application of virtually any machine learning algorithm. We identify the shortcomings of current collaborative filtering techniques and propose the use of learning algorithms paired with feature extraction techniques that specifically address the limitations of previous approaches. Our best-performing algorithm is based on the singular value decomposition of an initial matrix of user ratings, exploiting latent structure that essentially eliminates the need for users to rate common items in order to become predictors for one another's preferences. We evaluate the proposed algorithm on a large database of user ratings for motion pictures and find that our approach significantly outperforms current collaborative filtering algorithms.
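
The SVD step at the core of the abstract's best-performing algorithm can be sketched as follows, assuming a dense user-by-item rating matrix whose missing entries have already been filled in (per-item means are one common choice; the paper's full feature-extraction pipeline is more involved).

    import numpy as np

    def low_rank_ratings(R_filled, k=10):
        # Truncated SVD keeps the top-k latent factors: the structure that lets
        # users predict one another's preferences without co-rated items.
        U, s, Vt = np.linalg.svd(R_filled, full_matrices=False)
        return (U[:, :k] * s[:k]) @ Vt[:k, :]

    # predicted = low_rank_ratings(R_filled)
    # predicted[u, i] is the reconstructed rating of user u for item i.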

Citation Context

...and recall, commonly used for information retrieval tasks. It is important to evaluate precision and recall in conjunction, because it is easy to optimize either one separately. In order to do this, (Lewis and Gale, 1994) proposed the F-measure, a weighted combination of precision and recall that produces scores ranging from 0 to 1. Here we assign equal importance to precision and recall, i.e. F = (2⋅precision⋅recall)/(precision + recall).
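
Written out in full, the equally weighted F-measure from this snippet is a one-liner; the function below is a direct transcription, not library code.

    def f_measure(precision, recall):
        # Harmonic mean of precision and recall (equal weights, i.e. F1).
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    # e.g. f_measure(0.8, 0.5) -> 0.615...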
