Results 1 - 10 of 245
A Formal Study of Information Retrieval Heuristics. SIGIR '04, 2004. Cited by 105 (19 self).
Empirical studies of information retrieval methods show that good retrieval performance is closely related to the use of various retrieval heuristics, such as TF-IDF weighting. One basic research question is thus what exactly are these "necessary" heuristics that seem to cause good retrieval performance. In this paper, we present a formal study of retrieval heuristics. We formally define a set of basic desirable constraints that any reasonable retrieval function should satisfy, and check these constraints on a variety of representative retrieval functions. We find that none of these retrieval functions satisfies all the constraints unconditionally. Empirical results show that when a constraint is not satisfied, it often indicates non-optimality of the method, and when a constraint is satisfied only for a certain range of parameter values, its performance tends to be poor when the parameter is out of the range. In general, we find that the empirical performance of a retrieval formula is tightly related to how well it satisfies these constraints. Thus the proposed constraints provide a good explanation of many empirical observations and make it possible to evaluate any existing or new retrieval formula analytically.
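The heuristics the abstract refers to can be made concrete. The following is a minimal sketch (not code from the paper) of a BM25-style single-term score, with comments noting two constraints of the kind the paper formalizes: the score should grow in term frequency with diminishing returns, and rarer terms should weigh more.

```python
import math

def term_score(tf: int, df: int, num_docs: int, k: float = 1.2) -> float:
    """BM25-style score for one query term in one document.

    Illustrates two constraints of the kind the paper formalizes:
    the score increases in tf with diminishing returns (TF saturation),
    and it decreases in df (rarer terms discriminate better).
    """
    tf_part = tf * (k + 1) / (tf + k)                 # saturating TF component
    idf_part = math.log((num_docs + 1) / (df + 0.5))  # IDF component
    return tf_part * idf_part
```

Checking such properties analytically, rather than only observing them empirically, is the paper's central idea.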
Inferring Query Performance Using Pre-Retrieval Predictors. In Proc. Symposium on String Processing and Information Retrieval, 2004. Cited by 104 (6 self).
The prediction of query performance is an interesting and important issue in Information Retrieval (IR). Current predictors involve the use of relevance scores, which are time-consuming to compute. Therefore, current predictors are not very suitable for practical applications. In this paper, we study a set of predictors of query performance, which can be generated prior to the retrieval process. The linear and non-parametric correlations of the predictors with query performance are thoroughly assessed on the TREC disk4 and disk5 (minus CR) collections. According to the results, some of the proposed predictors have significant correlation with query performance, showing that these predictors can be useful to infer query performance in practical applications.
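A pre-retrieval predictor needs only collection statistics, no scoring run. One simple predictor in this family is the average inverse document frequency of the query terms, sketched below (the exact predictors studied in the paper differ in detail):

```python
import math

def avg_idf(query_terms, doc_freqs, num_docs):
    """Average inverse document frequency of the query terms.

    A pre-retrieval predictor: computable from document frequencies
    alone, before any retrieval takes place. Higher values suggest a
    more discriminative (and often easier) query.
    """
    idfs = [math.log(num_docs / max(doc_freqs.get(t, 0), 1))
            for t in query_terms]
    return sum(idfs) / len(idfs)
```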
An Information-Theoretic Approach to Automatic Query Expansion. ACM Transactions on Information Systems, 2001. Cited by 96 (1 self).
Techniques for automatic query expansion from top retrieved documents have shown promise for improving retrieval effectiveness on large collections; however, they often rest on empirical grounds alone, and there is a shortage of cross-system comparisons. Using ideas from Information Theory, we present a computationally simple and theoretically justified method for assigning scores to candidate expansion terms. Such scores are used to select and weight expansion terms within Rocchio’s framework for query reweighting. We compare ranking with information-theoretic query expansion versus ranking with other query expansion techniques, showing that the former achieves better retrieval effectiveness on several performance measures. We also discuss the effect on retrieval effectiveness of the main parameters involved in automatic query expansion, such as data sparseness, query difficulty, number of selected documents, and number of selected terms, pointing out interesting relationships.
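The general shape of such an information-theoretic term score can be sketched as each term's contribution to the divergence between the language of the top-retrieved (pseudo-relevant) documents and the language of the whole collection. This is an illustrative sketch of the idea; the exact weighting in the paper may differ.

```python
import math
from collections import Counter

def kld_term_scores(feedback_docs, collection_docs):
    """Score candidate expansion terms by how much more probable they
    are in the pseudo-relevant set than in the whole collection.

    Each score is the term's contribution to KL(feedback || collection).
    Assumes feedback_docs are among collection_docs, so every feedback
    term has nonzero collection probability.
    """
    fb = Counter(t for d in feedback_docs for t in d)
    coll = Counter(t for d in collection_docs for t in d)
    fb_total, coll_total = sum(fb.values()), sum(coll.values())
    scores = {}
    for term, n in fb.items():
        p_fb = n / fb_total
        p_coll = coll[term] / coll_total
        scores[term] = p_fb * math.log(p_fb / p_coll)
    return scores
```

The top-scoring terms would then be selected and weighted within Rocchio's reweighting framework, as the abstract describes.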
An Effective Approach to Document Retrieval via Utilizing WordNet and Recognizing Phrases. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2004. Cited by 89 (11 self).
Noun phrases in queries are identified and classified into four types: proper names, dictionary phrases, simple phrases and complex phrases. A document has a phrase if all content words in the phrase are within a window of a certain size. The window sizes for different types of phrases are different and are determined using a decision tree. Phrases are more important than individual terms. Consequently, documents in response to a query are ranked with matching phrases given a higher priority. We utilize WordNet to disambiguate word senses of query terms. Whenever the sense of a query term is determined, its synonyms, hyponyms, words from its definition and its compound words are considered for possible additions to the query. Experimental results show that our approach yields between 23% and 31% improvements over the best-known results on the TREC 9, 10 and 12 collections for short (title only) queries, without using Web data.
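The window-based phrase test described above can be sketched directly: a document "has" the phrase if every content word of the phrase falls within some span of at most `window` token positions. This is a brute-force illustration, not the paper's implementation (which additionally tunes the window per phrase type with a decision tree).

```python
from itertools import product

def has_phrase(doc_tokens, phrase_words, window):
    """True if all content words of the phrase occur within some span
    of fewer than `window` consecutive token positions in the document.
    Brute force over occurrence combinations; fine for short documents.
    """
    positions = {w: [i for i, t in enumerate(doc_tokens) if t == w]
                 for w in phrase_words}
    if any(not p for p in positions.values()):
        return False  # some phrase word never occurs at all
    for combo in product(*positions.values()):
        if max(combo) - min(combo) < window:
            return True
    return False
```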
A Survey of Automatic Query Expansion in Information Retrieval. Cited by 55 (2 self).
The relative ineffectiveness of information retrieval systems is largely caused by the inaccuracy with which a query formed by a few keywords models the actual user information need. One well-known method to overcome this limitation is automatic query expansion (AQE), whereby the user’s original query is augmented by new features with a similar meaning. AQE has a long history in the information retrieval community, but it is only in recent years that it has reached a level of scientific and experimental maturity, especially in laboratory settings such as TREC. This survey presents a unified view of a large number of recent approaches to AQE that leverage various data sources and employ very different principles and techniques. The following questions are addressed: Why is query expansion important for improving search effectiveness? What are the main steps involved in the design and implementation of an AQE component? What approaches to AQE are available, and how do they compare? Which issues must still be resolved before AQE becomes a standard component of large operational information retrieval systems (e.g., search engines)?
Exploiting the Potential of Concept Lattices for Information Retrieval with CREDO. Journal of Universal Computer Science, 2004. Cited by 48 (2 self).
The recent advances in Formal Concept Analysis (FCA), together with the major changes faced by modern Information Retrieval (IR), provide unprecedented challenges and opportunities for FCA-based IR applications. The main advantage of FCA for IR is the possibility of creating a conceptual representation of a given document collection in the form of a document lattice, which may be used both to improve the retrieval of specific items and to drive the mining of the collection’s contents. In this paper, we examine the best features of FCA for solving IR tasks that could not be easily addressed by conventional systems, as well as the most critical aspects of building FCA-based IR applications. These observations have led to the development of CREDO, a system that allows the user to query Web documents and see retrieval results organized in a browsable concept lattice. This is the second major focus of the paper. We show that CREDO is especially useful for quickly locating the documents corresponding to the meaning of interest among those retrieved in response to an ambiguous query, or for mining the contents of the documents that reference a given entity. An on-line version of the system is available for testing.
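The document lattice at the heart of this approach is built from formal concepts: pairs (extent, intent) where the extent is exactly the set of documents containing every term in the intent, and the intent is exactly the set of terms shared by every document in the extent. A brute-force sketch of that construction (not CREDO's implementation, which must scale to Web results) looks like this:

```python
from itertools import chain, combinations

def formal_concepts(docs):
    """All formal concepts (extent, intent) of a small document-term
    incidence, found by closing every subset of documents.

    docs maps doc_id -> set of terms. Exponential; demo-sized inputs only.
    """
    all_terms = frozenset(t for terms in docs.values() for t in terms)

    def intent(extent):
        # Terms shared by every document in the extent.
        if not extent:
            return all_terms
        return frozenset.intersection(*(frozenset(docs[d]) for d in extent))

    def extent(terms):
        # Documents containing every term of the intent.
        return frozenset(d for d, ts in docs.items() if terms <= ts)

    concepts = set()
    for subset in chain.from_iterable(
            combinations(docs, r) for r in range(len(docs) + 1)):
        i = intent(frozenset(subset))
        concepts.add((extent(i), i))  # closure yields a formal concept
    return concepts
```

Ordering these concepts by extent inclusion yields the browsable lattice the abstract describes.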
Automatic Feature Selection in the Markov Random Field Model for Information Retrieval. In Proceedings of CIKM '07, 2007. Cited by 39 (6 self).
Previous applications of the Markov random field model for information retrieval have used manually chosen features. However, it is often difficult or impossible to know, a priori, the best set of features to use for a given task or data set. Therefore, there is a need to develop automatic feature selection techniques. In this paper we describe a greedy procedure for automatically selecting features to use within the Markov random field model for information retrieval. We also propose a novel, robust method for describing classes of textual information retrieval features. Experimental results, evaluated on standard TREC test collections, show that our feature selection algorithm produces models that are either significantly more effective than, or equally effective as, models with manually selected features, such as those used in the past.
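The greedy procedure described above has a standard forward-selection shape, sketched generically below. The paper's version scores candidate feature classes on TREC collections; here `evaluate` is an abstract stand-in for any held-out effectiveness measure, and all names are illustrative.

```python
def greedy_select(candidates, evaluate, max_features):
    """Greedy forward feature selection.

    Repeatedly add the candidate feature that most improves the
    effectiveness returned by evaluate(features); stop when no candidate
    helps or the budget is reached.
    """
    selected = []
    best = evaluate(selected)
    while len(selected) < max_features:
        gains = [(evaluate(selected + [f]), f)
                 for f in candidates if f not in selected]
        if not gains:
            break
        score, feature = max(gains)
        if score <= best:
            break  # no remaining feature improves the model
        selected.append(feature)
        best = score
    return selected, best
```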
Length Normalization in XML Retrieval. 2004. Cited by 37 (16 self).
XML retrieval is a departure from standard document retrieval in which each individual XML element, ranging from italicized words or phrases to full-blown articles, is a potentially retrievable unit. The distribution of XML element lengths is unlike what we usually observe in standard document collections, prompting us to revisit the issue of document length normalization. We perform a comparative analysis of arbitrary elements versus relevant elements, and show the importance of length as a parameter for XML retrieval. Within the language modeling framework, we investigate a range of techniques that deal with length either directly or indirectly. We observe a length bias introduced by the amount of smoothing, and show the importance of extreme length priors for XML retrieval. We also show that simply removing shorter elements from the index (by introducing a cut-off value) does not create an appropriate document length normalization. Even after increasing the minimal size of XML elements occurring in the index, the importance of an extreme length bias remains.
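The smoothing-induced length bias mentioned in the abstract can be seen in a standard query-likelihood score with Dirichlet smoothing: the element's own model gets weight |d| / (|d| + mu), so very short elements are scored almost entirely by the collection model. A generic sketch (parameter names illustrative, not the paper's code), with an optional explicit log-length prior:

```python
import math

def lm_score(query, doc_tokens, coll_prob, mu=2000.0, length_prior_beta=0.0):
    """Query log-likelihood of a non-empty element under Dirichlet
    smoothing, optionally adding a length prior beta * log|d| that
    pushes scores toward longer elements."""
    dlen = len(doc_tokens)
    counts = {}
    for t in doc_tokens:
        counts[t] = counts.get(t, 0) + 1
    score = length_prior_beta * math.log(dlen)
    for q in query:
        # Dirichlet-smoothed term probability: element evidence gets
        # weight dlen / (dlen + mu), collection evidence the remainder.
        score += math.log((counts.get(q, 0) + mu * coll_prob[q]) / (dlen + mu))
    return score
```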
University of Glasgow at TREC 2005: Experiments in Terabyte and Enterprise Tracks with Terrier. In Proceedings of TREC-05, 2005. Cited by 32 (17 self).
With our participation in TREC 2005, we continue experiments using Terrier, a modular and scalable Information Retrieval (IR) framework, in 4 tasks from the Terabyte and Enterprise tracks. In the Terabyte track, we investigate new Divergence From Randomness weighting models, and a novel query expansion approach that can take into account various document fields, namely content, title and anchor text. In addition, we test a new selective query expansion mechanism which determines the appropriateness of using query expansion on a per-query basis, using statistical information from a low-cost query performance predictor. In the Enterprise track, we investigate combining document fields evidence with other information occurring in an Enterprise setting. In the email known item task, we also investigate temporal and thread priors suitable for email search. In the expert search task, for each candidate, we generate profiles of expertise evidence from the W3C collection. Moreover, we propose a model for ranking these candidate profiles in response to a query.
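The selective query expansion mechanism mentioned above reduces to a per-query gate: run the more expensive expansion step only when a low-cost pre-retrieval predictor scores the query above a tuned threshold. A minimal sketch, with all names illustrative rather than Terrier's API:

```python
def maybe_expand(query_terms, predictor, expand, threshold):
    """Per-query selective expansion: expand only when a cheap
    pre-retrieval predictor deems the query likely to benefit."""
    if predictor(query_terms) >= threshold:
        return expand(query_terms)
    return list(query_terms)
```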