Statistical methods for automatically extracting information about associations between words or documents from large collections of text could have considerable impact in a number of areas, such as information retrieval and natural-language-based user interfaces. However, even huge bodies of text yield highly unreliable estimates of the probability of relatively common events, and, in fact, perfectly reasonable events may not occur in the training data at all. This is known as the sparse data problem. Traditional approaches to the sparse data problem use crude approximations. We propose a different solution: if we can organize the data into classes of similar events, then, when information about an event is lacking, we can estimate its behavior from information about similar events. This thesis presents two such similarity-based approaches, in which we generally measure similarity by the Kullback-Leibler divergence, an information-theoretic quantity. Our first approach is to build soft, hierarchical clusters: soft, because each event belongs to each cluster with some probability; hierarchical, because cluster centroids are iteratively