Results 1 - 10 of 100
Expected Reciprocal Rank for Graded Relevance. In CIKM '09, Hong Kong, China, 2009.
"... While numerous metrics for information retrieval are available in the case of binary relevance, there is only one commonly used metric for graded relevance, namely the Discounted Cumulative Gain (DCG). A drawback of DCG is its additive nature and the underlying independence assumption: a document in ..."
Abstract (Cited by 158, 12 self):
While numerous metrics for information retrieval are available in the case of binary relevance, there is only one commonly used metric for graded relevance, namely the Discounted Cumulative Gain (DCG). A drawback of DCG is its additive nature and the underlying independence assumption: a document in a given position always has the same gain and discount, independently of the documents shown above it. Inspired by the “cascade” user model, we present a new editorial metric for graded relevance which overcomes this difficulty and implicitly discounts documents which are shown below very relevant documents. More precisely, this new metric is defined as the expected reciprocal length of time that the user will take to find a relevant document. This can be seen as an extension of the classical reciprocal rank to the graded relevance case and we call this metric Expected Reciprocal Rank (ERR). We conduct an extensive evaluation on the query logs of a commercial search engine and show that ERR correlates better with click metrics than other editorial metrics.
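The cascade-based definition above translates directly into code. Below is a minimal sketch of ERR for one ranked list; the gain mapping R = (2^g - 1) / 2^g_max and the 0-4 grade scale are the conventions commonly used with ERR and are assumptions here, not quoted from the abstract.

```python
def expected_reciprocal_rank(grades, g_max=4):
    """ERR for one ranked list of graded relevance labels (sketch).

    grades: relevance grades of the ranked documents, top first.
    Each grade g is mapped to a stopping probability
    R = (2**g - 1) / 2**g_max; under the cascade model the user scans
    down the list and stops at rank r with probability
    R_r * prod_{i<r} (1 - R_i), contributing 1/r to the expectation.
    """
    err = 0.0
    p_continue = 1.0  # probability the user reaches the current rank
    for rank, g in enumerate(grades, start=1):
        r_stop = (2 ** g - 1) / 2 ** g_max
        err += p_continue * r_stop / rank
        p_continue *= 1.0 - r_stop
    return err

# A highly relevant document at rank 1 dominates whatever is shown below it:
print(expected_reciprocal_rank([4, 4, 0]))  # ~0.97
print(expected_reciprocal_rank([0, 4, 4]))  # ~0.49
```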
Comparing the sensitivity of information retrieval metrics. In Proc. of SIGIR, 2010.
"... ABSTRACT Information retrieval effectiveness is usually evaluated using measures such as Normalized Discounted Cumulative Gain (NDCG), Mean Average Precision (MAP) and Precision at some cutoff (Precision@k) on a set of judged queries. Recent research has suggested an alternative, evaluating informa ..."
Abstract (Cited by 38, 4 self):
Information retrieval effectiveness is usually evaluated using measures such as Normalized Discounted Cumulative Gain (NDCG), Mean Average Precision (MAP) and Precision at some cutoff (Precision@k) on a set of judged queries. Recent research has suggested an alternative, evaluating information retrieval systems based on user behavior. Particularly promising are experiments that interleave two rankings and track user clicks. According to a recent study, interleaving experiments can identify large differences in retrieval effectiveness with much better reliability than other click-based methods. We study interleaving in more detail, comparing it with traditional measures in terms of reliability, sensitivity and agreement. To detect very small differences in retrieval effectiveness, a reliable outcome with standard metrics requires about 5,000 judged queries, and this is about as reliable as interleaving with 50,000 user impressions. Amongst the traditional measures, NDCG has the strongest correlation with interleaving. Finally, we present some new forms of analysis, including an approach to enhance interleaving sensitivity.
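For reference, a minimal sketch of two of the offline measures named above (NDCG@k and Precision@k) computed from graded labels; the 2^g - 1 gain, the log2(rank + 1) discount, and the grade-1 relevance threshold are common conventions assumed here, not fixed by the abstract.

```python
import math

def dcg(grades, k):
    """Discounted cumulative gain at cutoff k, with 2^g - 1 gains
    and a log2(rank + 1) discount (one common convention)."""
    return sum((2 ** g - 1) / math.log2(r + 1)
               for r, g in enumerate(grades[:k], start=1))

def ndcg(grades, k):
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal
    (sorted) ordering of the same labels."""
    ideal = dcg(sorted(grades, reverse=True), k)
    return dcg(grades, k) / ideal if ideal > 0 else 0.0

def precision_at_k(grades, k, rel_threshold=1):
    """Precision@k: fraction of the top k with grade >= threshold."""
    return sum(g >= rel_threshold for g in grades[:k]) / k

print(ndcg([3, 0, 2, 1], k=4), precision_at_k([3, 0, 2, 1], k=4))
```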
Analysis of long queries in a large scale search log. In WSCD, 2009.
"... We propose to use the search log to study long queries, in order to understand the types of information needs that are behind them, and to design techniques to improve search effectiveness when they are used. Long queries arise in many different applications, such as CQA (community-based question an ..."
Abstract (Cited by 37, 7 self):
We propose to use the search log to study long queries, in order to understand the types of information needs that are behind them, and to design techniques to improve search effectiveness when they are used. Long queries arise in many different applications, such as CQA (community-based question answering) and literature search, and they have been studied to some extent using TREC data. They are also, however, quite common in web search, as can be seen by looking at the distribution of query lengths in a large scale search log. In this paper we analyze the long queries in the search log with the aim of identifying the characteristics of the most commonly occurring types of queries, and the issues involved with using them effectively in a search engine. In addition, we propose a simple yet effective method for evaluating the performance of the queries in the search log using a combination of the click data in the search log with the existing TREC corpora.
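As a small illustration of the kind of log analysis described, the sketch below tabulates the query-length distribution from a query log; the one-query-per-line file format, whitespace tokenization, and the 5-term cutoff for "long" queries are assumptions for the example only.

```python
from collections import Counter

def query_length_distribution(log_path):
    """Count queries by number of whitespace-separated terms.

    Assumes a plain-text log with one query per line; a real search
    log would need parsing of its own record format.
    """
    lengths = Counter()
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            terms = line.split()
            if terms:
                lengths[len(terms)] += 1
    return lengths

# e.g. share of "long" queries (5+ terms) in the log:
# dist = query_length_distribution("queries.txt")
# long_share = sum(c for n, c in dist.items() if n >= 5) / sum(dist.values())
```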
The K-armed Dueling Bandits Problem, 2009.
"... We study a partial-information online-learning problem where actions are restricted to noisy comparisons between pairs of strategies (also known as bandits). In contrast to conventional approaches that require the absolute reward of the chosen strategy to be quantifiable and observable, our setting ..."
Abstract (Cited by 30, 7 self):
We study a partial-information online-learning problem where actions are restricted to noisy comparisons between pairs of strategies (also known as bandits). In contrast to conventional approaches that require the absolute reward of the chosen strategy to be quantifiable and observable, our setting assumes only that (noisy) binary feedback about the relative reward of two chosen strategies is available. This type of relative feedback is particularly appropriate in applications where absolute rewards have no natural scale or are difficult to measure (e.g., user-perceived quality of a set of retrieval results, taste of food, product attractiveness), but where pairwise comparisons are easy to make. We propose a novel regret formulation in this setting, as well as present an algorithm that achieves (almost) information-theoretically optimal regret bounds (up to a constant factor).
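As a sketch of the kind of regret formulation meant here (the notation below is mine, not quoted from the paper): let epsilon(b, b') = P(b beats b') - 1/2 measure how strongly strategy b is preferred over b', and let b* be the best strategy. Charging each comparison by the gap of both chosen arms to b* gives a cumulative regret of the form

```latex
% Cumulative regret over T pairwise comparisons, charging each arm in
% the compared pair (b_t^{(1)}, b_t^{(2)}) by its gap to the best strategy b^*:
R_T \;=\; \sum_{t=1}^{T} \Big( \epsilon\big(b^*, b_t^{(1)}\big) + \epsilon\big(b^*, b_t^{(2)}\big) \Big),
\qquad \epsilon(b, b') \;=\; P(b \succ b') - \tfrac{1}{2}.
```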
Personalizing web search using long term browsing history. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM '11, 2011.
"... Personalizing web search results has long been recognized as an avenue to greatly improve the search experience. We present a personalization approach that builds a user interest profile using users ’ complete browsing behavior, then uses this model to rerank web results. We show that using a combin ..."
Abstract (Cited by 28, 0 self):
Personalizing web search results has long been recognized as an avenue to greatly improve the search experience. We present a personalization approach that builds a user interest profile using users’ complete browsing behavior, then uses this model to rerank web results. We show that using a combination of content and previously visited websites provides effective personalization. We extend previous work by proposing a number of techniques for filtering previously viewed content that greatly improve the user model used for personalization. Our approaches are compared to previous work in offline experiments and are evaluated against unpersonalized web search in large scale online tests. Large improvements are found in both cases.
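A minimal sketch of the general recipe described (build a profile from browsing history, then rerank): the term-count profile, the cosine match, the mixing weight alpha, and the assumption that the engine's original scores are already in [0, 1] are illustrative choices, not the paper's exact model.

```python
from collections import Counter
import math

def build_profile(browsed_pages):
    """Aggregate term counts over the text of previously browsed pages."""
    profile = Counter()
    for text in browsed_pages:
        profile.update(text.lower().split())
    return profile

def cosine(profile, doc_terms):
    """Cosine similarity between the profile and one result's terms."""
    doc = Counter(doc_terms)
    dot = sum(profile[t] * c for t, c in doc.items())
    norm = math.sqrt(sum(v * v for v in profile.values())) * \
           math.sqrt(sum(c * c for c in doc.values()))
    return dot / norm if norm else 0.0

def rerank(results, profile, alpha=0.3):
    """Mix the engine's original score with a profile-match score.

    results: dicts with a "text" snippet and a normalized "score".
    """
    scored = [(alpha * cosine(profile, doc["text"].lower().split())
               + (1 - alpha) * doc["score"], doc) for doc in results]
    return [doc for _, doc in sorted(scored, key=lambda x: x[0], reverse=True)]
```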
Large-Scale Validation and Analysis of Interleaved Search Evaluation, 2012.
"... Interleaving is an increasingly popular technique for evaluating information retrieval systems based on implicit user feedback. While a number of isolated studies have analyzed how this technique agrees with conventional offline evaluation approaches and other online techniques, a complete picture o ..."
Abstract (Cited by 27, 4 self):
Interleaving is an increasingly popular technique for evaluating information retrieval systems based on implicit user feedback. While a number of isolated studies have analyzed how this technique agrees with conventional offline evaluation approaches and other online techniques, a complete picture of its efficiency and effectiveness is still lacking. In this paper we extend and combine the body of empirical evidence regarding interleaving, and provide a comprehensive analysis of interleaving using data from two major commercial search engines and a retrieval system for scientific literature. In particular, we analyze the agreement of interleaving with manual relevance judgments and observational implicit feedback measures, estimate the statistical efficiency of interleaving, and explore the relative performance of different interleaving variants. We also show how to learn improved credit-assignment functions for clicks that further increase the sensitivity of interleaving.
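One widely used interleaving variant analyzed in this line of work is team-draft interleaving. The sketch below shows the core idea (alternating "team" picks in randomly ordered rounds, then crediting each click to the team that contributed the clicked document); tie-breaking details and more refined credit-assignment functions are simplified away here.

```python
import random

def team_draft_interleave(ranking_a, ranking_b):
    """Interleave two rankings; return the merged list and each doc's team."""
    interleaved, team = [], {}
    total = len(set(ranking_a) | set(ranking_b))
    while len(interleaved) < total:
        # Each round, both rankers contribute one document, in random order.
        for name, ranking in random.sample([("A", ranking_a), ("B", ranking_b)], 2):
            pick = next((d for d in ranking if d not in team), None)
            if pick is not None:
                interleaved.append(pick)
                team[pick] = name
    return interleaved, team

def credit(clicked_docs, team):
    """Count clicks per team; the ranker with more credited clicks wins."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in team:
            wins[team[doc]] += 1
    return wins
```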
Model Adaptation via Model Interpolation and Boosting for Web Search Ranking
"... This paper explores two classes of model adaptation methods for Web search ranking: Model Interpolation and error-driven learning approaches based on a boosting algorithm. The results show that model interpolation, though simple, achieves the best results on all the open test sets where the test dat ..."
Abstract (Cited by 25, 4 self):
This paper explores two classes of model adaptation methods for Web search ranking: model interpolation and error-driven learning approaches based on a boosting algorithm. The results show that model interpolation, though simple, achieves the best results on all the open test sets where the test data is very different from the training data. The tree-based boosting algorithm achieves the best performance on most of the closed test sets where the test data and the training data are similar, but its performance drops significantly on the open test sets due to the instability of trees. Several methods are explored to improve the robustness of the algorithm, with limited success.
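Model interpolation in this setting amounts to a convex combination of a background ranker and an in-domain ranker. A minimal sketch, where the mixing weight lam would be tuned on held-out in-domain data (an assumed procedure, not the paper's exact setup):

```python
def interpolate_scores(score_background, score_in_domain, lam=0.5):
    """Convex combination of two rankers' scores for one query-document pair."""
    return lam * score_in_domain + (1.0 - lam) * score_background

def rank_with_interpolation(candidates, bg_model, ind_model, lam=0.5):
    """Rank candidate documents by the interpolated score.

    bg_model / ind_model: callables mapping a feature vector to a relevance
    score, e.g. a ranker trained on general data and one adapted to the
    target domain.  candidates: iterable of (doc, feature_vector) pairs.
    """
    scored = [(interpolate_scores(bg_model(x), ind_model(x), lam), doc)
              for doc, x in candidates]
    return [doc for _, doc in sorted(scored, key=lambda s: s[0], reverse=True)]
```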
Crowdsourcing for Book Search Evaluation: Impact of HIT Design on Comparative System Ranking
"... The evaluation of information retrieval (IR) systems over special collections, such as large book repositories, is out of reach of traditional methods that rely upon editorial relevance judgments. Increasingly, the use of crowdsourcing to collect relevance labels has been regarded as a viable altern ..."
Abstract (Cited by 25, 9 self):
The evaluation of information retrieval (IR) systems over special collections, such as large book repositories, is out of reach of traditional methods that rely upon editorial relevance judgments. Increasingly, the use of crowdsourcing to collect relevance labels has been regarded as a viable alternative that scales with modest costs. However, crowdsourcing suffers from undesirable worker practices and low quality contributions. In this paper we investigate the design and implementation of effective crowdsourcing tasks in the context of book search evaluation. We observe the impact of aspects of the Human Intelligence Task (HIT) design on the quality of relevance labels provided by the crowd. We assess the output in terms of label agreement with a gold standard data set and observe the effect of the crowdsourced relevance judgments on the resulting system rankings. This enables us to observe the effect of crowdsourcing on the entire IR evaluation process. Using the test set and experimental runs from the INEX 2010 Book Track, we find that varying the HIT design and the pooling and document ordering strategies leads to considerable differences in agreement with the gold set labels. We then observe the impact of the crowdsourced relevance label sets on the relative system rankings using four IR performance metrics. System rankings based on MAP and Bpref remain less affected by different label sets, while Precision@10 and nDCG@10 lead to dramatically different system rankings, especially for labels acquired from HITs with weaker quality controls. Overall, we find that crowdsourcing can be an effective tool for the evaluation of IR systems, provided that care is taken when designing the HITs.
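A minimal sketch of the two kinds of measurement described (agreement of crowd labels with a gold set, and stability of system rankings across label sets). Simple percentage agreement and Kendall's tau are assumed here as the statistics, since the abstract does not name specific ones.

```python
from itertools import combinations

def label_agreement(crowd_labels, gold_labels):
    """Fraction of documents where the crowd label matches the gold label."""
    matches = sum(c == g for c, g in zip(crowd_labels, gold_labels))
    return matches / len(gold_labels)

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two orderings of the same set of system ids."""
    pos_a = {s: i for i, s in enumerate(rank_a)}
    pos_b = {s: i for i, s in enumerate(rank_b)}
    concordant = discordant = 0
    for s, t in combinations(rank_a, 2):
        sign = (pos_a[s] - pos_a[t]) * (pos_b[s] - pos_b[t])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)
```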
A probabilistic method for inferring preferences from clicks. In CIKM '11, 2011.
"... Evaluating rankers using implicit feedback, such as clicks on docu-ments in a result list, is an increasingly popular alternative to tradi-tional evaluation methods based on explicit relevance judgments. Previous work has shown that so-called interleaved comparison methods can utilize click data to ..."
Abstract (Cited by 24, 15 self):
Evaluating rankers using implicit feedback, such as clicks on documents in a result list, is an increasingly popular alternative to traditional evaluation methods based on explicit relevance judgments. Previous work has shown that so-called interleaved comparison methods can utilize click data to detect small differences between rankers and can be applied to learn ranking functions online. In this paper, we analyze three existing interleaved comparison methods and find that they are all either biased or insensitive to some differences between rankers. To address these problems, we present a new method based on a probabilistic interleaving process. We derive an unbiased estimator of comparison outcomes and show how marginalizing over possible comparison outcomes given the observed click data can make this estimator even more effective. We validate our approach using a recently developed simulation framework based on a learning to rank dataset and a model of click behavior. Our experiments confirm the results of our analysis and show that our method is both more accurate and more robust to noise than existing methods.
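A minimal sketch of the probabilistic interleaving step described above: each ranker is softened into a distribution over its documents (here proportional to 1/rank^tau, a common softening choice) and the interleaved list is built by repeatedly picking a ranker at random and sampling a not-yet-used document from its distribution. The unbiased estimator and the marginalization over assignments are omitted, and the parameter names are illustrative.

```python
import random

def rank_distribution(ranking, tau=3.0):
    """Distribution over a ranker's documents, proportional to 1/rank**tau."""
    weights = [1.0 / (r ** tau) for r in range(1, len(ranking) + 1)]
    total = sum(weights)
    return {doc: w / total for doc, w in zip(ranking, weights)}

def probabilistic_interleave(ranking_a, ranking_b, length, tau=3.0):
    """Interleave by sampling documents from randomly chosen rankers."""
    dists = {"A": rank_distribution(ranking_a, tau),
             "B": rank_distribution(ranking_b, tau)}
    length = min(length, len(set(ranking_a) | set(ranking_b)))
    interleaved, assignment, used = [], [], set()
    while len(interleaved) < length:
        name = random.choice(["A", "B"])
        remaining = {d: p for d, p in dists[name].items() if d not in used}
        if not remaining:  # this ranker has no unused documents left
            name = "B" if name == "A" else "A"
            remaining = {d: p for d, p in dists[name].items() if d not in used}
        docs, probs = zip(*remaining.items())
        doc = random.choices(docs, weights=probs, k=1)[0]
        interleaved.append(doc)
        assignment.append(name)  # which ranker contributed this position
        used.add(doc)
    return interleaved, assignment
```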
Learning to Re-Rank: Query-Dependent Image Re-Ranking Using Click Data
"... Our objective is to improve the performance of keyword based image search engines by re-ranking their baseline results. To this end, we address three limitations of existing search engines in this paper. First, there is no straightforward, fully automated way of going from textual queries to visual ..."
Abstract (Cited by 24, 2 self):
Our objective is to improve the performance of keyword based image search engines by re-ranking their baseline results. To this end, we address three limitations of existing search engines in this paper. First, there is no straightforward, fully automated way of going from textual queries to visual features. Image search engines are therefore forced to rely on static and textual features alone for ranking. Visual features are used only for secondary tasks such as finding similar images. Second, image rankers are trained on query-image pairs labeled with relevance judgments determined by human experts. Such labels are well known to be noisy due to various factors including ambiguous queries, unknown user intent and subjectivity in human judgments. This leads to learning a sub-optimal ranker. Finally, a static ranker is typically built to handle disparate user queries. The ranker is therefore unable to adapt its parameters to suit the query at hand which again leads to sub-optimal results. We demonstrate that all of these problems can be mitigated by employing a re-ranking algorithm that leverages aggregate user click data. We hypothesize that images clicked in response to a query are mostly relevant to the query. We therefore re-rank the original search results so as to promote images that are likely to be clicked to the top of the ranked list. Our re-ranking algorithm employs Gaussian Process regression to predict the normalized click count for each image, and combines it with the original ranking score. Our approach is shown to significantly boost the performance of the Bing image search engine on a wide range of queries, especially the tail queries.
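A minimal sketch of the re-ranking recipe described, using scikit-learn's GaussianProcessRegressor as a stand-in for the regression step; the feature representation, the RBF kernel, and the linear combination of predicted clicks with the original score are illustrative assumptions rather than the paper's exact model.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def rerank_by_predicted_clicks(candidate_features, original_scores,
                               train_features, train_click_counts,
                               alpha=0.5):
    """Predict normalized click counts for candidate images and mix the
    prediction with the original ranking score (sketch).

    candidate_features: (n_candidates, d) image features for one query.
    train_features / train_click_counts: clicked-image training data,
    with click counts normalized to [0, 1].
    Returns candidate indices ordered best-first.
    """
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                                  normalize_y=True)
    gp.fit(np.asarray(train_features), np.asarray(train_click_counts))
    predicted = gp.predict(np.asarray(candidate_features))
    combined = alpha * predicted + (1.0 - alpha) * np.asarray(original_scores)
    return np.argsort(-combined)
```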