Overview of the TREC 2004 Robust Retrieval Track

by Ellen M. Voorhees
Results 1 - 10 of 58

Diversifying Search Results

by Rakesh Agrawal, Alan Halverson, 2009
Cited by 283 (5 self)
We study the problem of answering ambiguous web queries in a setting where there exists a taxonomy of information, and where both queries and documents may belong to more than one category according to this taxonomy. We present a systematic approach to diversifying results that aims to minimize the risk of dissatisfaction of the average user. We propose an algorithm that well approximates this objective in general, and is provably optimal for a natural special case. Furthermore, we generalize several classical IR metrics, including NDCG, MRR, and MAP, to explicitly account for the value of diversification. We demonstrate empirically that our algorithm scores higher in these generalized metrics compared to results produced by commercial search engines.
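The greedy diversification objective this abstract describes can be illustrated with a small sketch. The category probabilities (`p_cat`) and per-document satisfaction scores (`quality`) below are invented for illustration, and the residual-utility update is one plausible reading of "minimize the risk of dissatisfaction of the average user", not the paper's exact algorithm:

```python
def diversify(docs, categories, p_cat, quality, k):
    """Greedily pick k documents.

    p_cat[c]        -- P(the query's intent is category c)  (assumed input)
    quality[(d, c)] -- P(doc d satisfies a user with intent c)  (assumed input)
    """
    selected = []
    # u[c] tracks the residual probability that intent c is still unsatisfied
    u = dict(p_cat)
    remaining = set(docs)
    while remaining and len(selected) < k:
        # marginal utility: probability mass of intents this doc newly satisfies
        best = max(remaining,
                   key=lambda d: sum(u[c] * quality.get((d, c), 0.0)
                                     for c in categories))
        selected.append(best)
        remaining.remove(best)
        for c in categories:
            u[c] *= 1.0 - quality.get((best, c), 0.0)
    return selected
```

Because the residual mass of an already-covered category shrinks after each pick, a second strong document for the dominant intent can lose to a weaker document covering an untouched intent, which is exactly the diversification effect described above.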

Less is More -- Probabilistic Models for Retrieving Fewer Relevant Documents

by Harr Chen et al., 2006
Cited by 148 (1 self)
Traditionally, information retrieval systems aim to maximize the number of relevant documents returned to a user within some window of the top. For that goal, the probability ranking principle, which ranks documents in decreasing order of probability of relevance, is provably optimal. However, there are many scenarios in which that ranking does not optimize for the user’s information need. One example is when the user would be satisfied with some limited number of relevant documents, rather than needing all relevant documents. We show that in such a scenario, an attempt to return many relevant documents can actually reduce the chances of finding any relevant documents. We consider a number of information retrieval metrics from the literature, including the rank of the first relevant result, the %no metric that penalizes a system only for retrieving no relevant results near the top, and the diversity of retrieved results when queries have multiple interpretations. We observe that given a probabilistic model of relevance, it is appropriate to rank so as to directly optimize these metrics in expectation. While doing so may be computationally intractable, we show that a simple greedy optimization algorithm that approximately optimizes the given objectives produces rankings for TREC queries that outperform the standard approach based on the probability ranking principle.
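The abstract's core point — that returning many probably-relevant documents can hurt the chance of returning any relevant one — can be made concrete with an expected 1-call computation. The sketch below assumes independent per-document relevance probabilities (a simplification the paper itself argues against; it is used here only to keep the arithmetic transparent):

```python
def expected_one_call(probs):
    """Probability that at least one of the given documents is relevant,
    where probs[i] = P(document i is relevant), assumed independent."""
    miss = 1.0
    for p in probs:
        miss *= 1.0 - p   # probability that every document seen so far misses
    return 1.0 - miss
```

Two result windows with the same total relevance mass can score very differently here — e.g. `[0.9, 0.9]` versus `[0.6, 0.6, 0.6]` — which is why optimizing this expectation directly can diverge from ranking by the probability ranking principle once dependence between documents enters the picture.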

What Makes a Query Difficult?

by David Carmel, Elad Yom-Tov, Adam Darlow, Dan Pelleg, 2006
Cited by 71 (6 self)
This work tries to answer the question of what makes a query difficult. It presents a novel model that captures the main components of a topic and the relationship between those components and topic difficulty. The three components of a topic are the textual expression describing the information need (the query or queries), the set of documents relevant to the topic (the Qrels), and the entire collection of documents. We show experimentally that topic difficulty strongly depends on the distances between these components. In the absence of knowledge about one of the model components, the model is still useful by approximating the missing component based on the other components. We demonstrate the applicability of the difficulty model for several uses such as predicting query difficulty, predicting the number of topic aspects expected to be covered by the search results, and analyzing the findability of a specific domain.
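One plausible way to realize the "distance between components" the model builds on is a divergence between term distributions, e.g. between a query's language model and the collection's. The Jensen-Shannon divergence below is an illustrative choice only; the paper's actual distance measure may differ:

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two term distributions,
    each given as a dict mapping term -> probability."""
    terms = set(p) | set(q)
    # mixture distribution
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}

    def kl(a):
        # KL(a || m); terms with zero mass in a contribute nothing
        return sum(a.get(t, 0.0) * math.log(a.get(t, 0.0) / m[t])
                   for t in terms if a.get(t, 0.0) > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)
```

A query whose term distribution sits far from both the Qrels and the collection would, under this reading, be predicted as difficult.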

Learning to estimate query difficulty: including applications to missing content detection and distributed information retrieval

by Elad Yom-Tov, Shai Fine, David Carmel, Adam Darlow - In SIGIR ’05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, 2005
Cited by 59 (5 self)
In this article we present novel learning methods for estimating the quality of results returned by a search engine in response to a query. Estimation is based on the agreement between the top results of the full query and the top results of its sub-queries. We demonstrate the usefulness of quality estimation for several applications, among them improvement of retrieval, detecting queries for which no relevant content exists in the document collection, and distributed information retrieval. Experiments on TREC data demonstrate the robustness and the effectiveness of our learning algorithms.
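The agreement signal described here can be approximated very simply: compare the top-n results of the full query with those of each sub-query. This overlap score is a hand-rolled stand-in for the paper's learned estimator, useful only to make the idea concrete:

```python
def topn_agreement(full_top, sub_tops, n=10):
    """Average fraction of the full query's top-n results that also appear
    in each sub-query's top-n. full_top is a ranked list of doc ids;
    sub_tops is a list of ranked lists, one per sub-query."""
    full = set(full_top[:n])
    if not sub_tops:
        return 0.0
    overlaps = [len(full & set(sub[:n])) / float(n) for sub in sub_tops]
    return sum(overlaps) / len(overlaps)
```

Low agreement (sub-queries retrieving very different documents than the full query) would, under this proxy, flag the query as one whose results are likely to be poor — e.g. because no document covers all its aspects, the missing-content case the abstract mentions.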

Citation Context

...However, many information retrieval (IR) systems suffer from radical variance in performance; even for systems that succeed very well on average, the quality of results is poor for some of the queries [22, 11]. Thus, it is desirable that IR systems be able to identify “difficult” queries in order to handle them properly. Estimating query difficulty is an attempt to quantify the quality of results returned ...

An Axiomatic Approach to Information Retrieval

by Hui Fang, 2007
Cited by 42 (10 self)
Abstract not found

Semantic Term Matching in Axiomatic Approaches to Information Retrieval

by Hui Fang, ChengXiang Zhai - In SIGIR, 2006
Cited by 40 (10 self)
A common limitation of many retrieval models, including the recently proposed axiomatic approaches, is that retrieval scores are solely based on exact (i.e., syntactic) matching of terms in the queries and documents, without allowing distinct but semantically related terms to match each other and contribute to the retrieval score. In this paper, we show that semantic term matching can be naturally incorporated into the axiomatic retrieval model through defining the primitive weighting function based on a semantic similarity function of terms. We define several desirable retrieval constraints for semantic term matching and use such constraints to extend the axiomatic model to directly support semantic term matching based on the mutual information of terms computed on some document set. We show that such extension can be efficiently implemented as query expansion. Experiment results on several representative data sets show that, with mutual information computed over the documents in either the target collection for retrieval or an external collection such as the Web, our semantic expansion consistently and substantially improves retrieval accuracy over the baseline axiomatic retrieval model. As a pseudo feedback method, our method also outperforms a state-of-the-art language modeling feedback method.
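The term-level statistic this expansion builds on — mutual information of terms computed over a document set — can be estimated from document-level co-occurrence counts. The pointwise formulation below is a simplified illustration and not necessarily the exact estimator used in the paper:

```python
import math

def pmi(term_x, term_y, docs):
    """Pointwise mutual information of two terms over a document set,
    where each element of docs is the set of terms occurring in one
    document. Returns 0.0 when either term (or the pair) never occurs."""
    n = len(docs)
    dx = sum(1 for d in docs if term_x in d)
    dy = sum(1 for d in docs if term_y in d)
    dxy = sum(1 for d in docs if term_x in d and term_y in d)
    if dxy == 0 or dx == 0 or dy == 0:
        return 0.0
    # log [ P(x, y) / (P(x) P(y)) ] with document-frequency estimates
    return math.log((dxy * n) / (dx * dy))
```

Terms scoring high against a query term under such a statistic become candidate expansion terms, weighted down relative to the original query terms as the constraints in the paper require.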

Document Representation and Query Expansion Models for Blog Recommendation

by Jaime Arguello, Jonathan L. Elsas, Jamie Callan, Jaime G. Carbonell
Cited by 39 (5 self)
We explore several different document representation models and two query expansion models for the task of recommending blogs to a user in response to a query. Blog relevance ranking differs from traditional document ranking in ad-hoc information retrieval in several ways: (1) the unit of output (the blog) is composed of a collection of documents (the blog posts) rather than a single document, (2) the query represents an ongoing – and typically multifaceted – interest in the topic rather than a passing ad-hoc information need and (3) due to the propensity of spam, splogs, and tangential comments, the blogosphere is particularly challenging to use as a source for high-quality query expansion terms. We address these differences at the document representation level, by comparing retrieval models that view either the blog or its constituent posts as the atomic units of retrieval, and at the query expansion level, by making novel use of the links and anchor text in Wikipedia to expand a user’s initial query. We develop two complementary models of blog retrieval that perform at comparable levels of precision and recall. We also show consistent and significant improvement across all models using our Wikipedia expansion strategy.
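The Wikipedia anchor-text expansion idea can be sketched as follows. The `anchor_index` structure (article title mapped to the anchor strings that link to it) is an assumed input built offline from a Wikipedia dump, and the scoring is deliberately naive — it is not the paper's method:

```python
from collections import Counter

def wikipedia_anchor_expansion(query_terms, anchor_index, top_k=5):
    """Expand a query with the most frequent anchor texts pointing at
    Wikipedia articles whose titles contain a query term.

    query_terms  -- list of lowercase query words
    anchor_index -- dict: article title -> list of anchor strings
                    (hypothetical structure, built offline)
    """
    counts = Counter()
    for title, anchors in anchor_index.items():
        if any(t in title.lower() for t in query_terms):
            counts.update(anchors)
    expansion = [a for a, _ in counts.most_common(top_k)
                 if a not in query_terms]
    return list(query_terms) + expansion
```

Because anchor texts are human-chosen paraphrases of the linked article, they tend to supply cleaner expansion terms than the spam-prone blogosphere itself, which is the motivation the abstract gives.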

Concept-Based Information Retrieval Using Explicit Semantic Analysis

by Ofer Egozi, Shaul Markovitch, Evgeniy Gabrilovich
Cited by 30 (0 self)
Information retrieval systems traditionally rely on textual keywords to index and retrieve documents. Keyword-based retrieval may return inaccurate and incomplete results when different keywords are used to describe the same concept in the documents and in the queries. Furthermore, the relationship between these related keywords may be semantic rather than syntactic, and capturing it thus requires access to comprehensive human world knowledge. Concept-based retrieval methods have attempted to tackle these difficulties by using manually built thesauri, by relying on term cooccurrence data, or by extracting latent word relationships and concepts from a corpus. In this article we introduce a new concept-based retrieval approach based on Explicit Semantic Analysis (ESA), a recently proposed method that augments keyword-based text representation with concept-based features, automatically extracted from massive human knowledge repositories such as Wikipedia. Our approach generates new text features automatically, and we have found that high-quality feature selection becomes crucial in this setting to make the retrieval more focused. However, due to the lack of labeled data, traditional feature selection methods cannot be used, hence we propose new methods that use self-generated labeled training data. The resulting system is evaluated on several TREC datasets, showing superior performance over previous state-of-the-art results.
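At its core, ESA represents text as a weighted vector over concepts (e.g. Wikipedia articles). A minimal sketch, assuming a precomputed term-to-concept weight table — in practice this table is built offline from the full Wikipedia corpus, not hand-written as here:

```python
def esa_vector(text_terms, term_concept_weights):
    """Build an ESA-style concept vector by summing, per concept, the
    weights contributed by each term of the text.

    term_concept_weights -- dict: term -> {concept: weight}
                            (assumed precomputed, e.g. TF-IDF of the term
                            in each Wikipedia article)
    """
    vec = {}
    for t in text_terms:
        for concept, w in term_concept_weights.get(t, {}).items():
            vec[concept] = vec.get(concept, 0.0) + w
    return vec
```

Retrieval then compares the concept vectors of query and document (e.g. by cosine similarity), so two texts sharing no keywords can still match through shared concepts — the gap the abstract identifies in keyword-only retrieval.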

Concept-Based Feature Generation and Selection for Information Retrieval

by Ofer Egozi, Evgeniy Gabrilovich, Shaul Markovitch - In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, 2008
Cited by 20 (5 self)
Traditional information retrieval systems use query words to identify relevant documents. In difficult retrieval tasks, however, one needs access to a wealth of background knowledge. We present a method that uses Wikipedia-based feature generation to improve retrieval performance. Intuitively, we expect that using extensive world knowledge is likely to improve recall but may adversely affect precision. High-quality feature selection is necessary to maintain high precision, but unlike in supervised learning, we do not have labeled training data for evaluating features. We present a new feature selection method that is inspired by pseudo-relevance feedback. We use the top-ranked and bottom-ranked documents retrieved by the bag-of-words method as representative sets of relevant and non-relevant documents. The generated features are then evaluated and filtered on the basis of these sets. Experiments on TREC data confirm the superior performance of our method compared to the previous state of the art.
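The pseudo-relevance-feedback-style selection described above can be sketched by scoring each generated feature on how much more strongly it fires on the top-ranked (pseudo-relevant) documents than on the bottom-ranked (pseudo-non-relevant) ones. The mean-difference score is a rough illustration, not the paper's exact criterion:

```python
def select_features(features, top_docs, bottom_docs, keep=100):
    """Keep the features that best separate pseudo-relevant from
    pseudo-non-relevant documents.

    features -- list of callables, feature(doc) -> numeric activation
    """
    def mean_score(f, docs):
        return sum(f(d) for d in docs) / len(docs) if docs else 0.0

    scored = sorted(
        features,
        key=lambda f: mean_score(f, top_docs) - mean_score(f, bottom_docs),
        reverse=True)
    return scored[:keep]
```

The bottom-ranked set stands in for the missing negative labels, which is what lets a supervised-style selection step run without any human relevance judgments.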

Citation Context

... values. We used paired t-test to assess the statistical significance of results; significant improvements (p < 0.05) are shown in bold. We also conducted experiments on the TREC Robust 2004 dataset (Voorhees 2005), and achieved similarly significant improvements (10.7% over the BOW baseline on the set of 50 new queries, and 29.4% over baseline for the set of 50 hard queries). We omit additional details of thi...

Search result diversity for informational queries

by Michael J. Welch, Junghoo Cho, Christopher Olston - In WWW, 2011
Cited by 17 (0 self)
Ambiguous queries constitute a significant fraction of search instances and pose real challenges to web search engines. With current approaches the top results for these queries tend to be homogeneous, making it difficult for users interested in less popular aspects to find relevant documents. While existing research in search diversification offers several solutions for introducing variety into the results, the majority of such work is predicated, implicitly or otherwise, on the assumption that a single relevant document will fulfill a user's information need, making them inadequate for many informational queries. In this paper we present a search diversification algorithm particularly suitable for informational queries by explicitly modeling that the user may need more than one page to satisfy their need. This modeling enables our algorithm to make a well-informed tradeoff between a user's desire for multiple relevant documents, probabilistic information about an average user's interest in the subtopics of a multifaceted query, and uncertainty in classifying documents into those subtopics. We evaluate the effectiveness of our algorithm against commercial search engine results and other modern ranking strategies, demonstrating notable improvement in multiple document scenarios.

Citation Context

... documents belong to a single category. Their algorithm does, however, contain potential weaknesses, which we explore in more depth in Section 5. Researchers have also considered meaningful ways to evaluate the performance of search diversification and subtopic retrieval algorithms. Classic ranked retrieval metrics such as NDCG, MRR, and MAP have been augmented [6, 1] to take user intent into account. Metrics such as search length (SL) [7] and k-call [5], and their aggregated forms, are well suited to evaluate diversification of search systems under single document assumptions. The %no metric [22] measures the ability of a system to retrieve at least one relevant result in the top ten. Other metrics, such as Subtopic recall and Subtopic precision [24], explicitly measure the subtopic coverage of a result set or the efficiency at which an algorithm represents the relevant subtopics. We use several of these existing metrics to evaluate the performance of our algorithm under single document scenarios. We also define the expected hits metric to evaluate diversification algorithms under the more general assumption that a user may require multiple documents. We will detail our metric in the ...
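Two of the metrics discussed in this snippet are easy to pin down concretely: k-call at a fixed depth, and the %no measure (fraction of queries with no relevant result in the top results). The sketch below uses the usual set-based notion of relevance; the depth of 10 matches the %no definition given above:

```python
def k_call(ranking, relevant, k, depth=10):
    """1 if at least k of the top `depth` results are relevant, else 0.
    1-call is the 'found at least one relevant document' special case."""
    hits = sum(1 for doc in ranking[:depth] if doc in relevant)
    return 1 if hits >= k else 0

def pct_no(rankings, relevants, depth=10):
    """Fraction of queries whose top `depth` results contain no relevant
    document; rankings and relevants are parallel per-query lists."""
    misses = sum(1 for r, rel in zip(rankings, relevants)
                 if k_call(r, rel, 1, depth) == 0)
    return misses / len(rankings)
```

Unlike averaged metrics such as MAP, %no only moves when a query goes from total failure to any success at all, which is why it stresses worst-case behavior.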


Developed at and hosted by The College of Information Sciences and Technology

© 2007-2019 The Pennsylvania State University