How Does Clickthrough Data Reflect Retrieval Quality?
Cited by 31 (4 self)
Automatically judging the quality of retrieval functions based on observable user behavior holds promise for making retrieval evaluation faster, cheaper, and more user centered. However, the relationship between observable user behavior and retrieval quality is not yet fully understood. We present a sequence of studies investigating this relationship for an operational search engine on the arXiv.org e-print archive. We find that none of the eight absolute usage metrics we explore (e.g., number of clicks, frequency of query reformulations, abandonment) reliably reflect retrieval quality for the sample sizes we consider. However, we find that paired experiment designs adapted from sensory analysis produce accurate and reliable statements about the relative quality of two retrieval functions. In particular, we investigate two paired comparison tests that analyze clickthrough data from an interleaved presentation of ranking pairs, and we find that both give accurate and consistent results. We conclude that both paired comparison tests give substantially more accurate and sensitive evaluation results than absolute usage metrics in our domain.
Learning diverse rankings with multi-armed bandits
- In Proceedings of the 25th ICML, 2008
Cited by 27 (3 self)
Algorithms for learning to rank Web documents usually assume a document’s relevance is independent of other documents. This leads to learned ranking functions that produce rankings with redundant results. In contrast, user studies have shown that diversity at high ranks is often preferred. We present two online learning algorithms that directly learn a diverse ranking of documents based on users’ clicking behavior. We show that these algorithms minimize abandonment, or alternatively, maximize the probability that a relevant document is found in the top k positions of a ranking. Moreover, one of our algorithms asymptotically achieves optimal worst-case performance even if users’ interests change.
Modelling A User Population for Designing Information Retrieval Metrics
- Proceedings of the Second Workshop on Evaluating Information Access (EVIA 2008), 2008
Cited by 8 (5 self)
Although Average Precision (AP) has been the most widely-used retrieval effectiveness metric since the advent of the Text Retrieval Conference (TREC), the general belief among researchers is that it lacks a user model. In light of this, Robertson recently pointed out that AP can be interpreted as a special case of Normalised Cumulative Precision (NCP), computed as an expectation of precision over a population of users who eventually stop at different ranks in a list of retrieved documents. He regards AP as a crude version of NCP, in that the probability distribution of the user’s stopping behaviour is uniform across all relevant documents. In this paper, we generalise NCP further and demonstrate that AP and its graded-relevance version Q-measure are in fact reasonable metrics despite the above uniform probability assumption. From a probabilistic perspective, these metrics emphasise long-tail users who tend to dig deep into the ranked list, and thereby achieve high reliability. We also demonstrate that one of our new metrics, called NCU gu,β=1, maintains high correlation with AP and shows the highest discriminative power, i.e., the proportion of statistically significantly different system pairs given a confidence level, by utilising graded relevance in a novel way. Our experimental results are consistent across NTCIR and TREC.
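The interpretation of AP described in this abstract can be illustrated with a small sketch. The function names below are hypothetical, and the sketch assumes binary relevance and that every relevant document appears in the ranked list; it is not the paper's NCP or NCU implementation, only a check that AP coincides with expected precision when the stopping distribution is uniform over relevant documents.

```python
def average_precision(rels):
    """Standard AP for a ranked list of 0/1 relevance flags.

    Assumes all relevant documents for the topic appear in `rels`.
    """
    total_relevant = sum(rels)
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant


def expected_precision_at_stop(rels, stop_probs):
    """Expectation of precision over users who stop at different ranks.

    stop_probs[i] is the (assumed) probability that a user stops at rank i + 1.
    """
    hits = 0
    expectation = 0.0
    for rank, (rel, prob) in enumerate(zip(rels, stop_probs), start=1):
        hits += rel
        expectation += prob * hits / rank
    return expectation


def uniform_over_relevant(rels):
    """Stopping distribution that is uniform across the relevant documents."""
    total_relevant = sum(rels)
    return [rel / total_relevant if total_relevant else 0.0 for rel in rels]


if __name__ == "__main__":
    ranking = [1, 0, 1, 0, 0]  # hypothetical relevance flags in rank order
    print(average_precision(ranking))
    print(expected_precision_at_stop(ranking, uniform_over_relevant(ranking)))
```

Passing a different `stop_probs` vector to `expected_precision_at_stop` gives a different member of the same family of metrics, which is the sense in which the uniform assumption makes AP a special case.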
The Good, the Bad, and the Random: An Eye-Tracking Study of Ad Quality in Web Search
Cited by 7 (4 self)
We investigate how people interact with Web search engine result pages using eye-tracking. While previous research has focused on the visual attention devoted to the 10 organic search results, this paper examines other components of contemporary search engines, such as ads and related searches. We systematically varied the type of task (informational or navigational), the quality of the ads (relevant or irrelevant to the query), and the sequence in which ads of different quality were presented. We measured the effects of these variables on the distribution of visual attention and on task performance. Our results show significant effects of each variable. The amount of visual attention that people devote to organic results depends on both task type and ad quality. The amount of visual attention that people devote to ads depends on their quality, but not the type of task. Interestingly, the sequence and predictability of ad quality is also an important factor in determining how much people attend to ads. When the quality of ads varied randomly from task to task, people paid little attention to the ads, even when they were good. These results further our understanding of how attention devoted to search results is influenced by other page elements, and how previous search experiences influence how people attend to the current page.
A Probability Ranking Principle for Interactive Information Retrieval
- 2008
Cited by 6 (0 self)
The classical Probability Ranking Principle (PRP) forms the theoretical basis for probabilistic Information Retrieval (IR) models, which have dominated IR theory for about 20 years. However, the assumptions underlying the PRP often do not hold, and its view is too narrow for interactive information retrieval (IIR). In this paper, a new theoretical framework for interactive retrieval is proposed. The basic idea is that during IIR, a user moves between situations. In each situation, the system presents the user with a list of choices, about which s/he has to decide, and the first positive decision moves the user to a new situation. Each choice is associated with a number of cost and probability parameters. Based on these parameters, an optimum ordering of the choices can be derived: the PRP for IIR. The relationship of this rule to the classical PRP is described, and issues for further research are pointed out.
Including summaries in system evaluation
- Proc. SIGIR, 2009
Cited by 6 (1 self)
In batch evaluation of retrieval systems, performance is calculated based on predetermined relevance judgements applied to a list of documents returned by the system for a query. This evaluation paradigm, however, ignores the current standard operation of search systems which require the user to view summaries of documents prior to reading the documents themselves. In this paper we modify the popular IR metrics MAP and P@10 to incorporate the summary reading step of the search process, and study the effects on system rankings using TREC data. Based on a user study, we establish likely disagreements between relevance judgements of summaries and of documents, and use these values to seed simulations of summary relevance in the TREC data. Re-evaluating the runs submitted to the TREC Web Track, we find the average correlation between system rankings and the original TREC rankings is 0.8 (Kendall τ), which is lower than commonly accepted for system orderings to be considered equivalent. The system that has the highest MAP in TREC generally remains amongst the highest MAP systems when summaries are taken into account, but many other systems become equivalent to the top ranked system depending on the simulated summary relevance. Given that system orderings alter when summaries are taken into account, the small amount of effort required to judge summaries in addition to documents (19 seconds vs 88 seconds on average in our data) should be undertaken when constructing test collections.
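For readers unfamiliar with the Kendall τ figure quoted in this abstract, the following is a minimal sketch of how a rank correlation between two system orderings can be computed. The function name and the example system names are hypothetical, and the sketch assumes no tied ranks (the simple "tau-a" form); the abstract does not say which variant the authors used.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall tau between two rankings of the same systems.

    rank_a, rank_b: dicts mapping system name -> rank position (1 = best).
    Assumes both dicts have the same keys and contain no ties.
    """
    systems = list(rank_a)
    concordant = 0
    discordant = 0
    for x, y in combinations(systems, 2):
        # A pair is concordant when both rankings order x and y the same way.
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


if __name__ == "__main__":
    # Hypothetical rankings of four systems under two evaluation settings.
    original = {"s1": 1, "s2": 2, "s3": 3, "s4": 4}
    with_summaries = {"s1": 1, "s2": 3, "s3": 2, "s4": 4}
    print(kendall_tau(original, with_summaries))
```

A value of 1 means the two orderings agree on every pair of systems, and a value of -1 means they disagree on every pair.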
How do users find things with PubMed? Towards automatic utility evaluation with user simulations
- In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2008), 2008
Cited by 6 (2 self)
In the context of document retrieval in the biomedical domain, this paper explores the complex relationship between the quality of initial query results and the overall utility of an interactive retrieval system. We demonstrate that a content-similarity browsing tool can compensate for poor retrieval results, and that the relationship between retrieval performance and overall utility is non-linear. Arguments are advanced with user simulations, which characterize the relevance of documents that a user might encounter with different browsing strategies. With broader implications to IR, this work provides a case study of how user simulations can be exploited as a formative tool for automatic utility evaluation. Simulation-based studies provide researchers with an additional evaluation tool to complement interactive and Cranfield-style experiments.
Score standardization for inter-collection comparison of retrieval systems
- In Proceedings of the 31st ACM Conference on Research and Development in Information Retrieval (SIGIR), 2008
Cited by 5 (1 self)
The goal of system evaluation in information retrieval has always been to determine which of a set of systems is superior on a given collection. The tool used to determine system ordering is an evaluation metric such as average precision, which computes relative, collection-specific scores. We argue that a broader goal is achievable. In this paper we demonstrate that, by use of standardization, scores can be substantially independent of a particular collection, allowing systems to be compared even when they have been tested on different collections. Compared to current methods, our techniques provide richer information about system performance, improved clarity in outcome reporting, and greater simplicity in reviewing results from disparate sources.
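The abstract does not spell out the standardization procedure. One plausible reading, assumed here purely for illustration, is a per-topic z-score: each system's score on a topic is expressed relative to the mean and standard deviation of a set of reference systems on that same topic. The function name, the data layout, and the choice of population standard deviation are all assumptions of this sketch.

```python
from statistics import mean, pstdev


def standardize_by_topic(scores):
    """Convert raw per-topic scores to z-scores computed within each topic.

    scores: dict mapping system name -> list of per-topic scores, with every
    list in the same topic order. The systems in `scores` double as the
    reference set (an assumption of this sketch).
    """
    systems = list(scores)
    n_topics = len(scores[systems[0]])
    standardized = {system: [] for system in systems}
    for topic in range(n_topics):
        column = [scores[system][topic] for system in systems]
        mu = mean(column)
        sigma = pstdev(column)
        for system in systems:
            # A topic on which every system scores the same carries no signal.
            z = (scores[system][topic] - mu) / sigma if sigma > 0 else 0.0
            standardized[system].append(z)
    return standardized


if __name__ == "__main__":
    # Hypothetical average-precision scores for two systems on two topics.
    raw = {"A": [0.2, 0.8], "B": [0.4, 0.6]}
    print(standardize_by_topic(raw))
```

Because each topic is rescaled to zero mean and unit variance, a hard topic and an easy topic contribute on the same scale, which is what would make scores less dependent on the particular collection.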
User Adaptation: Good Results from Poor Systems
- Laboratory for Advanced Information Research (LAIR) Technical Report, 2008
Cited by 5 (0 self)
Several recent studies have found only a weak relationship between the performance of a retrieval system and the “success” achievable by human searchers. We hypothesize that searchers are successful precisely because they alter their behavior. To explore the possible causal relation between system performance and search behavior, we control system performance, hoping to elicit adaptive search behaviors. 36 subjects each completed 12 searches using either a standard system or one of two degraded systems. Using a general linear model, we isolate the main effect of system performance, by measuring and removing main effects due to searcher variation, topic difficulty, and the position of each search in the time series. We find that searchers using our degraded systems are as successful as those using the standard system, but that, in achieving this success, they alter their behavior in ways that could be measured, in real time, by a suitably instrumented system. Our findings suggest, quite generally, that some aspects of behavioral dynamics may provide unobtrusive indicators of system performance.
Recent Developments in the Evaluation of Information Retrieval Systems: Moving Towards Diversity and Practical Relevance
- 2007
Cited by 5 (1 self)
The evaluation of information retrieval systems has gained considerable momentum in the last few years. Several evaluation initiatives are concerned with diverse retrieval applications, innovative usage scenarios and different aspects of system performance. These evaluation initiatives have led to a considerable increase in system performance. Data for evaluation efforts include multilingual corpora, structured data, scientific documents, Web pages as well as multimedia objects. This paper gives an overview of the current activities of the major evaluation initiatives. Special attention is given to the current tracks and developments within TREC, CLEF and NTCIR. The evaluation tasks and issues, as well as some results, will be presented. Povzetek (summary, translated from Slovenian): This review article describes trends in information retrieval systems.

