This thesis presents a new general probabilistic framework for text retrieval based on Bayesian decision theory. In this framework, queries and documents are modeled using statistical language models, user preferences are modeled through loss functions, and retrieval is cast as a risk minimization problem. This risk minimization framework not only unifies several existing retrieval models within one general probabilistic framework, but also facilitates the development of new principled approaches to text retrieval through the use of statistical language models. We explore three special cases of the framework. In the case of a two-stage language modeling approach, we show that excellent retrieval performance can be achieved without any ad hoc parameter tuning, by using statistical estimation methods to set the retrieval parameters completely automatically. In the case of a KL-divergence retrieval model, we demonstrate that retrieval performance can be improved by using better language models estimated from feedback documents. Finally, in the case of non-traditional aspect retrieval models, we show that language models can capture redundancy and sub-topics in documents, and can support “context-sensitive” ranking of documents based on both their relevance and their novelty.
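As a rough illustration of the KL-divergence retrieval model mentioned above, the following minimal sketch ranks documents by the negative KL divergence between a query language model and a smoothed document language model. The abstract does not specify any of the details used here: the Dirichlet-prior smoothing, the value of `mu`, the maximum-likelihood query model, the probability floor for unseen words, and all function names are assumptions made only for illustration.

```python
import math
from collections import Counter


def collection_model(docs):
    """Maximum-likelihood unigram model over all documents (background model)."""
    counts = Counter(tok for tokens in docs.values() for tok in tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def doc_prob(word, doc_counts, doc_len, background, mu):
    """p(w|d) with Dirichlet-prior smoothing (an assumed smoothing choice)."""
    # Tiny floor for words absent from the collection; purely illustrative.
    p_bg = background.get(word, 1e-12)
    return (doc_counts[word] + mu * p_bg) / (doc_len + mu)


def neg_kl(query_model, doc_tokens, background, mu):
    """Return -D(query || document) = sum_w p(w|q) * log(p(w|d) / p(w|q))."""
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    return sum(
        p_q * math.log(doc_prob(w, doc_counts, doc_len, background, mu) / p_q)
        for w, p_q in query_model.items()
        if p_q > 0
    )


def rank(query_tokens, docs, mu=2000.0):
    """Rank documents by negative KL divergence, best match first."""
    background = collection_model(docs)
    q_counts = Counter(query_tokens)
    q_len = len(query_tokens)
    # Maximum-likelihood query model; feedback documents could be used
    # to estimate a richer one, as the abstract suggests.
    query_model = {w: c / q_len for w, c in q_counts.items()}
    scored = [
        (doc_id, neg_kl(query_model, tokens, background, mu))
        for doc_id, tokens in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    toy_docs = {
        "d1": "language models for text retrieval".split(),
        "d2": "cooking pasta with tomato sauce".split(),
    }
    for doc_id, score in rank("language retrieval".split(), toy_docs):
        print(doc_id, round(score, 4))
```

In this sketch the query model is the only place where feedback information would enter: replacing the maximum-likelihood estimate with a model interpolated with feedback-document statistics would change the ranking without changing the scoring function.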