David Lewis and Jason Catlett. Heterogeneous uncertainty sampling for supervised learning. In Machine Learning: Proceedings of the Eleventh Annual Conference, New Brunswick, New Jersey, 1994. Morgan Kaufmann.
....classification problems taken from the UC Irvine Repository [Cohen, 1995a] 2.1.2 Extensions to Ripper motivated by text Before running these experiments, Ripper was modified so as to be more appropriate for text categorization problems. One extension allows the user to specify a loss ratio [Lewis and Catlett, 1994]. A loss ratio indicates the ratio of the cost of a false negative to the cost 3 In Quinlan's scheme, $-\log_2 p \cdot e - \log_2(1 - p) \cdot (n - e)$ bits are used to encode a subset of $e$ elements from a set of size $n$, where $p$ represents the expected value of the fraction ....
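The encoding length quoted in the footnote above can be checked numerically. The following is a minimal sketch, assuming the formula reads as the number of bits needed to single out e elements from a set of n when each element is expected to belong to the subset with probability p; the function name `subset_encoding_bits` is hypothetical and does not come from the cited papers.

```python
import math


def subset_encoding_bits(n: int, e: int, p: float) -> float:
    """Bits to encode a subset of e elements from a set of size n.

    Implements -log2(p) * e - log2(1 - p) * (n - e), the expression
    quoted in the snippet. The name and signature are illustrative only.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if not 0 <= e <= n:
        raise ValueError("e must satisfy 0 <= e <= n")
    # Each selected element costs -log2(p) bits,
    # each unselected element costs -log2(1 - p) bits.
    return -math.log2(p) * e - math.log2(1.0 - p) * (n - e)
```

With p = 0.5 every element costs exactly one bit, so the total equals n regardless of e, which is a convenient sanity check on the expression.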
....corpus. 3 Experimental results 3.1 The AP titles corpus The first benchmark we will use is a corpus of AP newswire headlines, tagged as being relevant or irrelevant to topics like "federal budget" and "Nielsens ratings". This dataset is described in more detail elsewhere [Lewis and Gale, 1994; Lewis and Catlett, 1994]. The corpus contains 319,463 documents in the training set and 51,991 documents in the test set. The headlines are an average of nine words long, with a total vocabulary of 67,331 words. No preprocessing of the text was done, other than to convert all words to lower case and remove punctuation ....
....117 165 0.376 0.324 165 0.368 0.321 quayle 133 160 0.684 0.435 90 0.722 0.644 average 110.4 208.9 0.465 0.302 197.7 0.470 0.354 3.2 The Testbed To evaluate these learning methods we used ten text categorization problems described in the information retrieval literature [Lewis and Gale, 1994; Lewis and Catlett, 1994]. In each of these problems, the goal is to classify AP newswire headlines as relevant or irrelevant to topics like "federal budget" and "nielson ratings". The corpus of 371,454 pre-classified headlines is split into a training set of 319,463 titles and a test set of 51,991 titles. The ....
....encoding would likely be competitive even for large values of k. 3 3.5 Comparing FOIL and propositional learners We now turn to a crucial question: how does FOIL compare to existing propositional methods on these problems? We will focus initially on results appearing elsewhere in the literature. Lewis and Catlett [1994] reported results obtained with a Bayesian probabilistic classifier, and also with an extension of C4.5 that allows the user to specify a loss ratio. A loss ratio indicates the ratio of the cost of a false negative to the cost of a false positive; the goal of learning is to minimize ....
.... $\frac{\#\text{true positives}}{\#\text{true positives} + \#\text{false positives}}$ For convenience, we will define the precision of a classifier that always predicts negative as 1.00. 3 AN EXPERIMENTAL TESTBED In our experiments we will use the ten text categorization problems described in [Lewis and Gale, 1994; Lewis and Catlett, 1994]. In each of these problems, the goal is to classify the AP newswire headlines as relevant or irrelevant to topics like "federal budget" and "nielson ratings". A large corpus of 371,454 pre-classified headlines was split into a training set of 319,463 titles and a test set of 51,991 titles. ....
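The precision convention stated in the snippet above can be sketched in a few lines. This is a minimal illustration, assuming precision is the ratio of true positives to all positive predictions; the function name `precision` and its argument names are hypothetical, not taken from the cited paper.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Precision = #true positives / (#true positives + #false positives).

    Following the convention quoted in the snippet, a classifier that
    never predicts positive (zero positive predictions) is assigned
    a precision of 1.00 instead of an undefined 0/0.
    """
    predicted_positive = true_positives + false_positives
    if predicted_positive == 0:
        # Always-negative classifier: defined as 1.00 by convention.
        return 1.0
    return true_positives / predicted_positive
```

For example, three true positives and one false positive give a precision of 0.75, while a classifier with no positive predictions returns 1.0 under the stated convention.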
....the product of the number of examples and the number of features. Using too many features thus can be impractical for large datasets; for instance, Lewis and Catlett were forced for efficiency reasons to use only a subset of the possible features in their experiments on these datasets with C4.5 [Lewis and Catlett, 1994]. The storage required by FOIL6 is only linear in the size of the original corpus, which makes it computationally feasible to use the full set of word relations. However, although it is feasible, it may be unwise. Certainly not all of the features are necessary to define the concept, and it is ....