
## Toward Optimal Active Learning through Sampling Estimation of Error Reduction (2001)


### Download Links

- [www.cs.wustl.edu]
- [www.vis.uky.edu]
- [vis.uky.edu]
- [www-connex.lip6.fr]
- [www-poleia.lip6.fr]
- [csce.uark.edu]
- DBLP

### Other Repositories/Bibliography

Venue: In Proc. 18th International Conference on Machine Learning

Citations: 353 (2 self)

### Citations

3648 | Bagging predictors
- Breiman
- 1996
Citation Context: ...ning class tends to be very close to 1, and the losing classes have probabilities close to 0. We address this problem with a sampling-based approach to variance reduction, otherwise known as bagging (Breiman, 1996). From our original labeled training set of size s, a different training set is created by sampling s times with replacement from the original. The learner then creates a new classifier from this sam...
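
The resampling procedure this context describes (draw s examples with replacement from a labeled set of size s, then retrain) can be sketched as follows; the `train` callable is a hypothetical stand-in for whatever base learner is used:

```python
import random

def bagged_sample(training_set, seed=None):
    """Bootstrap replicate: sample len(training_set) times with
    replacement from the original labeled training set."""
    rng = random.Random(seed)
    return [rng.choice(training_set) for _ in range(len(training_set))]

def bag_committee(training_set, n_members, train, seed=0):
    """Build a committee by training one classifier per bootstrap
    replicate (`train` is a hypothetical callable mapping a
    training set to a classifier)."""
    return [train(bagged_sample(training_set, seed + i))
            for i in range(n_members)]
```

Each replicate omits roughly a third of the original examples on average, which is what makes the resulting committee members differ.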

2302 | Text categorization with support vector machines: learning with many relevant features
- Joachims
- 1998
Citation Context: ...ne such classification method that performs surprisingly well given its simplicity is naive Bayes. Naive Bayes is not always the best-performing classification algorithm for text (Nigam et al., 1999; Joachims, 1998), but it continues to be widely used for the purpose because it is efficient and simple to implement, and even against significantly more complex methods, it rarely trails far behind in accuracy. Thi...

1024 | A Comparison of Event Models for Naive Bayes Text Classification
- McCallum, Nigam
- 1998
Citation Context: ...ance, $x \in \mathcal{X}$, independently given the class. For text classification, the common variant of naive Bayes has unordered word counts for features, and uses a per-class multinomial to generate the words (McCallum & Nigam, 1998a). Let $w_t$ be the $t$th word in the dictionary of words $V$, and $\theta = (\theta_{y_j}; \theta_{w_t \mid y_j})_{y_j \in \mathcal{Y},\, w_t \in V}$ be the parameters of the model, where $\theta_{y_j}$ is the prior probability of class $y_j$ (otherwise writ...
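
For context, the multinomial model sketched in this snippet scores a document by its class prior times per-class word probabilities. A minimal, hypothetical illustration (the probability floor for unseen words and the data structures are assumptions, not the paper's implementation):

```python
import math
from collections import Counter

def nb_log_posterior(doc_words, prior, word_prob):
    """Unnormalized log posterior log P(y) + sum_t n_t * log P(w_t | y)
    for each class under the multinomial naive Bayes event model.
    prior: {class: P(y)}; word_prob: {class: {word: P(w | y)}}."""
    counts = Counter(doc_words)
    scores = {}
    for y, p_y in prior.items():
        score = math.log(p_y)
        for w, n in counts.items():
            # tiny floor for words unseen in this class (an assumption,
            # standing in for properly smoothed estimates)
            score += n * math.log(word_prob[y].get(w, 1e-9))
        scores[y] = score
    return scores
```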

734 | Support vector machine active learning with applications to text classification - Tong, Koller

679 | Active learning with statistical models.
- Cohn, Ghahramani, et al.
- 1996
Citation Context: ...ory from 5 others. The Error-Reduction Sampling algorithm reaches 82% accuracy in 5 documents, compared to 36 documents for both the Density-Weighted QBC and Random algorithms. statistical techniques (Cohn et al., 1996) that compute the reduction in error (or some equivalent quantity) in closed form; however, we approximate the reduction in error by repeated sampling. In this respect, we have attempted to bridge th...

666 | Divergence measures based on the Shannon entropy
- Lin
- 1991
Citation Context: ...on-Engelson and Dagan (1999) suggest using a probabilistic measure based on vote-entropy of the committee, whereas McCallum & Nigam explicitly measure disagreement using the Jensen-Shannon divergence (Lin, 1991; Pereira et al., 1993). However, they recognize that this error metric does not measure the impact that a labeled document had on classifier uncertainty on other unlabeled documents. They therefore f...
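
The Jensen-Shannon divergence referenced here (Lin, 1991) measures committee disagreement as the entropy of the averaged class distribution minus the average of the members' entropies; a direct sketch with uniform member weights:

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(pi * math.log(pi, 2) for pi in p if pi > 0)

def js_divergence(distributions):
    """Jensen-Shannon divergence (uniform weights) of a list of
    class distributions: H(mean) minus the mean of the H's."""
    k = len(distributions)
    n = len(distributions[0])
    mean = [sum(d[i] for d in distributions) / k for i in range(n)]
    return entropy(mean) - sum(entropy(d) for d in distributions) / k
```

Two committee members that put all their mass on different classes yield the maximum disagreement of 1 bit; identical members yield 0.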

653 | Generalization as Search
- Mitchell
- 1982
Citation Context: ..., 1994) selects the example on which the current learner has lowest certainty; Query-by-Committee (Seung et al., 1992; Freund et al., 1997) selects examples that reduce the size of the version space (Mitchell, 1982) (the size of the subset of parameter space that correctly classifies the labeled examples). Tong and Koller's Support Vector Machine method (2000a) is also based on reducing version space size. None...

631 | A Sequential Algorithm for Training Text Classifiers
- Lewis, Gale
- 1994
Citation Context: ...on cannot efficiently be found in closed form. Other, more widely used active learning methods attain practicality by optimizing a different, non-optimal criterion. For example, uncertainty sampling (Lewis & Gale, 1994) selects the example on which the current learner has lowest certainty; Query-by-Committee (Seung et al., 1992; Freund et al., 1997) selects examples that reduce the size of the version space (Mitche...
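
Uncertainty sampling as described in this context simply queries the unlabeled example on which the learner's posterior is least peaked. A minimal sketch, where `predict_proba` is a hypothetical callable returning the current learner's class posterior for an example:

```python
def uncertainty_query(unlabeled, predict_proba):
    """Return the example the current learner is least certain about,
    measured by the maximum class-posterior probability (lower max
    probability means a flatter, less certain posterior)."""
    return min(unlabeled, key=lambda x: max(predict_proba(x)))
```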

629 | Distributional clustering of English words
- Pereira, Tishby, et al.
- 1993
Citation Context: ... and Dagan (1999) suggest using a probabilistic measure based on vote-entropy of the committee, whereas McCallum & Nigam explicitly measure disagreement using the Jensen-Shannon divergence (Lin, 1991; Pereira et al., 1993). However, they recognize that this error metric does not measure the impact that a labeled document had on classifier uncertainty on other unlabeled documents. They therefore factored document densi...

432 | Selective sampling using the query by committee algorithm
- Freund, Seung, et al.
- 1997
Citation Context: ...different, non-optimal criterion. For example, uncertainty sampling (Lewis & Gale, 1994) selects the example on which the current learner has lowest certainty; Query-by-Committee (Seung et al., 1992; Freund et al., 1997) selects examples that reduce the size of the version space (Mitchell, 1982) (the size of the subset of parameter space that correctly classifies the labeled examples). Tong and Koller's Support Vect...

431 | Query by committee
- Seung, Opper, et al.
- 1992
Citation Context: ...ity by optimizing a different, non-optimal criterion. For example, uncertainty sampling (Lewis & Gale, 1994) selects the example on which the current learner has lowest certainty; Query-by-Committee (Seung et al., 1992; Freund et al., 1997) selects examples that reduce the size of the version space (Mitchell, 1982) (the size of the subset of parameter space that correctly classifies the labeled examples). Tong and ...

326 | Using maximum entropy for text classification.
- Nigam, Lafferty, et al.
- 1999
Citation Context: ...plex interactions. One such classification method that performs surprisingly well given its simplicity is naive Bayes. Naive Bayes is not always the best-performing classification algorithm for text (Nigam et al., 1999; Joachims, 1998), but it continues to be widely used for the purpose because it is efficient and simple to implement, and even against significantly more complex methods, it rarely trails far behind ...

320 | Employing EM and pool-based active learning for text classification
- McCallum, Nigam
- 1998
Citation Context: ...ance, $x \in \mathcal{X}$, independently given the class. For text classification, the common variant of naive Bayes has unordered word counts for features, and uses a per-class multinomial to generate the words (McCallum & Nigam, 1998a). Let $w_t$ be the $t$th word in the dictionary of words $V$, and $\theta = (\theta_{y_j}; \theta_{w_t \mid y_j})_{y_j \in \mathcal{Y},\, w_t \in V}$ be the parameters of the model, where $\theta_{y_j}$ is the prior probability of class $y_j$ (otherwise writ...

251 | Incremental and decremental support vector machine learning.
- Cauwenberghs, Poggio
- 2001
Citation Context: ...We describe an implementation in terms of naive Bayes, but the same technique could apply to any learning method in which incremental training is efficient, for example support vector machines (SVMs) (Cauwenberghs & Poggio, 2000). Our method estimates future error rate either by log-loss, using the entropy of the posterior class distribution on a sample of the unlabeled examples, or by 0-1 loss, using the posterior probabili...
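
The log-loss estimate mentioned in this context (entropy of the posterior class distribution averaged over a sample of unlabeled examples) can be sketched roughly as follows; this illustrates the quantity being estimated, not the authors' implementation, and `predict_proba` is a hypothetical stand-in for the current learner:

```python
import math

def expected_log_loss(unlabeled_sample, predict_proba):
    """Average entropy (in nats) of the learner's posterior class
    distribution over a sample of unlabeled examples; lower means
    the learner is more confident about the remaining pool."""
    total = 0.0
    for x in unlabeled_sample:
        total -= sum(p * math.log(p) for p in predict_proba(x) if p > 0)
    return total / len(unlabeled_sample)
```

An active learner can then score each candidate query by retraining with each possible label and computing this estimate on the result.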

129 | Query learning strategies using boosting and bagging
- Abe, Mamitsuka
- 1998
Citation Context: ... at all, or for which the distribution over classifier parameters is unclear. This “bagging approach” to sampling from the distribution over classifiers has been used in previous work related to QBC (Abe & Mamitsuka, 1998); see the related work section for more details. 4. Related Work Cohn et al. (1996) propose one of the first statistical analyses of active learning, demonstrating how to construct queries that maxim...

124 | Boosting in the limit: Maximizing the margin of learned ensembles.
- Grove, Schuurmans
- 1998
Citation Context: ...ximizing the classifier accuracy on the test data. This approach suggests that by maximizing the margin on training data, accuracy on test data is improved, an approach that is not always successful (Grove & Schuurmans, 1998). Furthermore, like the QBC algorithms before it, the QBC-by-boosting approach fails to maximize the margin on all unlabeled data, instead choosing to query the single instance with the smallest marg...

100 | Active learning with committees for text categorization - Liere, Tadepalli - 1997

81 | Selective sampling for nearest neighbor classifiers - Lindenbaum, Markovitch, et al. - 2004

65 | Committee-based sample selection for probabilistic classifiers - Argamon-Engelson, Dagan - 1999

55 | Bayesian averaging of classifiers and the overfitting problem
- Domingos
- 2000
Citation Context: ... from any individual classifier are completely extreme, the bagged posterior is more smooth and reflective of the true uncertainty. This approach has been shown not necessarily to reduce overfitting (Domingos, 2000), but it certainly does give better posterior probabilities. One interesting aspect of this approach is that it can be applied to any classifier, even ones that don't give class posterior probabilit...
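
Smoothing the posterior by averaging over bagged committee members, as this context describes, amounts to a simple mean of the members' class distributions; a hypothetical sketch:

```python
def bagged_posterior(x, members):
    """Average the class posteriors of bagged committee members,
    smoothing out individually extreme estimates; `members` is a
    list of callables, each returning a class distribution for x."""
    posts = [predict(x) for predict in members]
    k, n = len(posts), len(posts[0])
    return [sum(p[i] for p in posts) / k for i in range(n)]
```

Even if every individual member outputs a hard 0/1 distribution, the average reflects how often the members disagree.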