Bayesian Methods for Frequent Terms in Text: Models of Contagion and the Δ² Statistic (2005)
| Venue: | In Proceedings of the CSNA & INTERFACE Annual Meetings |
| Citations: | 4 - 1 self |
BibTeX
@INPROCEEDINGS{Airoldi05bayesianmethods,
author = {Edoardo M. Airoldi and William W. Cohen and Stephen E. Fienberg},
title = {Bayesian Methods for Frequent Terms in Text: Models of Contagion and the Δ² Statistic},
booktitle = {Proceedings of the CSNA \& INTERFACE Annual Meetings},
year = {2005}
}
Abstract
Most statistical approaches to modeling text implicitly assume that informative words are rare. This assumption is usually appropriate for topical retrieval and classification tasks; however, in non-topical classification and soft-clustering problems where classes and latent variables relate to sentiment or author, informative words can be frequent. In this paper we present a comprehensive set of statistical learning tools which treat words with higher frequencies of occurrence in a sensible manner. We introduce probabilistic models of contagion for classification and soft-clustering based on the Poisson and Negative-Binomial distributions, which share with the Multinomial the desirable properties of simplicity and analytic tractability. We then introduce the Δ² statistic to select features and avoid over-fitting.







