Sources of Evidence for Vertical Selection
Cited by 59 (14 self)
Abstract:
Web search providers often include search services for domain-specific subcollections, called verticals, such as news, images, videos, job postings, company summaries, and artist profiles. We address the problem of vertical selection, predicting relevant verticals (if any) for queries issued to the search engine’s main web search page. In contrast to prior query classification and resource selection tasks, vertical selection is associated with unique resources that can inform the classification decision. We focus on three sources of evidence: (1) the query string, from which features are derived independent of external resources, (2) logs of queries previously issued directly to the vertical, and (3) corpora representative of vertical content. We focus on 18 different verticals, which differ in terms of semantics, media type, size, and level of query traffic. We compare our method to prior work in federated search and retrieval effectiveness prediction. An in-depth error analysis reveals unique challenges across different verticals and provides insight into vertical selection for future work.
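The three evidence sources listed in the abstract can be pictured as a simple per-vertical score. The sketch below is illustrative only: the Dirichlet smoothing, the linear weights, and the toy verticals are assumptions, not the paper's actual model.

```python
# Illustrative sketch: combining (1) query-string evidence, (2) vertical
# query-log evidence, and (3) vertical corpus evidence into one score.
import math
from collections import Counter

def log_likelihood(query, language_model, vocab_size, mu=1.0):
    """Dirichlet-smoothed log-likelihood of the query under a unigram model."""
    total = sum(language_model.values())
    score = 0.0
    for term in query.split():
        p = (language_model.get(term, 0) + mu / vocab_size) / (total + mu)
        score += math.log(p)
    return score

def score_verticals(query, vertical_query_logs, vertical_corpora,
                    weights=(1.0, 1.0, 1.0)):
    """Score each vertical with the three evidence sources, combined linearly."""
    vocab = set()
    for lm in list(vertical_query_logs.values()) + list(vertical_corpora.values()):
        vocab |= set(lm)
    scores = {}
    for v in vertical_query_logs:
        # (1) query-string evidence: a trivial keyword trigger, for illustration
        string_ev = 1.0 if v in query.lower() else 0.0
        # (2) query-log evidence: likelihood under the vertical's past queries
        log_ev = log_likelihood(query, vertical_query_logs[v], len(vocab))
        # (3) corpus evidence: likelihood under the vertical's document corpus
        corpus_ev = log_likelihood(query, vertical_corpora[v], len(vocab))
        scores[v] = weights[0]*string_ev + weights[1]*log_ev + weights[2]*corpus_ev
    return scores

# Toy example: two verticals with tiny unigram models
logs = {"news": Counter({"election": 5, "results": 3}),
        "images": Counter({"sunset": 4, "wallpaper": 2})}
corpora = {"news": Counter({"election": 10, "senate": 4}),
           "images": Counter({"photo": 6, "sunset": 3})}
scores = score_verticals("election results", logs, corpora)
best = max(scores, key=scores.get)
```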
Classification-enhanced ranking
, 2010
Cited by 36 (13 self)
Abstract:
Many have speculated that classifying web pages can improve a search engine’s ranking of results. Intuitively, results should be more relevant when they match the class of a query. We present a simple framework for classification-enhanced ranking that uses clicks in combination with the classification of web pages to derive a class distribution for the query. We then go on to define a variety of features that capture the match between the class distributions of a web page and a query, the ambiguity of a query, and the coverage of a retrieved result relative to a query’s set of classes. Experimental results demonstrate that a ranker learned with these features significantly improves ranking over a competitive baseline. Furthermore, our methodology is agnostic with respect to the classification space and can be used to derive query classes for a variety of different taxonomies.
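The click-based step described above can be sketched as a click-weighted mixture of the clicked pages' class distributions, with query ambiguity measured as the entropy of the result. This is a hypothetical reconstruction; the page classes and click counts are made up, not taken from the paper.

```python
# Sketch: derive a query's class distribution from clicks on classified pages.
import math
from collections import defaultdict

def query_class_distribution(clicks, page_classes):
    """clicks: {url: click_count}; page_classes: {url: {class: prob}}.
    Returns the click-weighted class distribution for the query."""
    dist = defaultdict(float)
    total_clicks = sum(clicks.values())
    for url, count in clicks.items():
        for cls, p in page_classes.get(url, {}).items():
            dist[cls] += (count / total_clicks) * p
    return dict(dist)

def ambiguity(dist):
    """Entropy of the class distribution: one way to quantify query ambiguity."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

clicks = {"a.com": 8, "b.com": 2}
page_classes = {"a.com": {"sports": 0.9, "news": 0.1},
                "b.com": {"news": 1.0}}
dist = query_class_distribution(clicks, page_classes)
```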
Earlybird: Real-Time Search at Twitter
Cited by 26 (5 self)
Abstract:
The web today is increasingly characterized by social and real-time signals, which we believe represent two frontiers in information retrieval. In this paper, we present Earlybird, the core retrieval engine that powers Twitter’s real-time search service. Although Earlybird builds and maintains inverted indexes like nearly all modern retrieval engines, its index structures differ from those built to support traditional web search. We describe these differences and present the rationale behind our design. A key requirement of real-time search is the ability to ingest content rapidly and make it searchable immediately, while concurrently supporting low-latency, high-throughput query evaluation. These demands are met with a single-writer, multiple-reader concurrency model and the targeted use of memory barriers. Earlybird represents a point in the design space of real-time search engines that has worked well for Twitter’s needs. By sharing our experiences, we hope to spur additional interest and innovation in this exciting space.
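The single-writer, multiple-reader pattern mentioned above can be illustrated with a toy append-only posting list: the writer appends an entry and only then advances a published "tail" counter, and readers never look past the tail, so they always see fully written entries. This is not Earlybird's actual code (which is Java); in Java the publish step is where a memory barrier such as a volatile write would go, whereas CPython's GIL makes that implicit here.

```python
# Toy single-writer / multiple-reader append-only index (illustrative only).
class AppendOnlyPostings:
    def __init__(self):
        self._postings = []   # written only by the single writer thread
        self._tail = 0        # count of entries safe for readers to see

    def append(self, doc_id):        # writer only
        self._postings.append(doc_id)
        self._tail += 1              # "publish" after the write completes

    def snapshot(self):              # any reader
        n = self._tail               # read the published tail once
        return self._postings[:n]    # never read past the published tail

index = AppendOnlyPostings()
for doc in (101, 102, 103):
    index.append(doc)
visible = index.snapshot()
```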
The metadata triumvirate: Social annotations, anchor texts and search queries
- In IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
, 2008
Cited by 13 (0 self)
Abstract:
In this paper, we study and compare three different but related types of “metadata” about web documents: social annotations provided by readers of web documents, hyperlink anchor text provided by authors of web documents, and search queries of users trying to find web documents. We introduce a large research data set called CABS120k08, which we have created for this study from a variety of information sources such as AOL500k, the Open Directory Project, del.icio.us/Yahoo!, Google, and the WWW in general. We use this data set to investigate several characteristics of said metadata, including length, novelty, diversity, and similarity, and discuss theoretical and practical implications.
Classifying Search Queries Using the Web as a Source of Knowledge
Cited by 12 (0 self)
Abstract:
We propose a methodology for building a robust query classification system that can identify thousands of query classes, while dealing in real-time with the query volume of a commercial Web search engine. We use a pseudo relevance feedback technique: given a query, we determine its topic by classifying the Web search results retrieved by the query. Motivated by the needs of search advertising, we primarily focus on rare queries, which are the hardest from the point of view of machine learning, yet in aggregation account for a considerable fraction of search engine traffic. Empirical evaluation confirms that our methodology yields a considerably higher classification accuracy than previously reported. We believe that the proposed methodology will lead to better matching of online ads to rare queries and overall to a better user experience.
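The pseudo-relevance-feedback idea above — classify a query by classifying its retrieved results and aggregating their votes — can be sketched as follows. The retrieval function and result classifier here are stand-in stubs, not the paper's components.

```python
# Sketch: label a query with the majority class of its top-k search results.
from collections import Counter

def classify_query(query, retrieve, classify_doc, k=10):
    """Aggregate per-result class votes into a single query label."""
    results = retrieve(query)[:k]
    votes = Counter(classify_doc(doc) for doc in results)
    return votes.most_common(1)[0][0] if votes else None

# Stand-in stubs for illustration only
def retrieve(query):
    return ["doc_travel_1", "doc_travel_2", "doc_finance_1"]

def classify_doc(doc):
    return "travel" if "travel" in doc else "finance"

label = classify_query("cheap flights to lisbon", retrieve, classify_doc)
```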
An analysis framework for search sequences
- In CIKM ’09: Proceeding of the 18th ACM conference on Information and knowledge management
, 2009
Cited by 8 (1 self)
Abstract:
In this paper we present a general framework to study sequences of search activities performed by a user. Our framework provides (i) a vocabulary to discuss types of features, models, and tasks, (ii) straightforward feature re-use across problems, (iii) realistic baselines for many sequence analysis tasks we study, and (iv) a simple mechanism to develop baselines for sequence analysis tasks beyond those studied in this paper. Using this framework we study a set of fourteen sequence analysis tasks with a range of features and models. While we show that most tasks benefit from features based on recent history, we also identify two categories of “sequence-resistant” tasks for which simple classes of local features perform as well as richer features and models.
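The distinction above between local features and features based on recent history can be illustrated on a toy search session. The feature names and window size are hypothetical, not the paper's vocabulary.

```python
# Sketch: "local" features from the current action vs. "history" features
# summarizing the recent past of the search sequence.
def local_features(action):
    """Features visible in the current action alone."""
    return {"query_len": len(action["query"].split()),
            "clicked": int(action["clicked"])}

def history_features(sequence, window=3):
    """Features over the last `window` actions before the current one."""
    recent = sequence[-(window + 1):-1]  # exclude the current action
    return {"recent_clicks": sum(a["clicked"] for a in recent),
            "recent_actions": len(recent)}

session = [
    {"query": "jaguar", "clicked": False},
    {"query": "jaguar car", "clicked": True},
    {"query": "jaguar xf price", "clicked": True},
]
feats = {**local_features(session[-1]), **history_features(session)}
```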
Behavior-driven clustering of queries into topics
- In Proc. of CIKM
, 2011
Cited by 6 (0 self)
Abstract:
Categorization of web-search queries into semantically coherent topics is a crucial task for understanding the interest trends of search engine users and, therefore, for providing more intelligent personalization services. Query clustering usually relies on lexical and clickthrough data, while the information originating from the actions users take in submitting their queries is currently neglected. In particular, the intent that drives users to submit their requests is an important element for meaningful aggregation of queries. We propose a new intent-centric notion of topical query clusters and we define a query clustering technique that differs from existing algorithms in both methodology and the nature of the resulting clusters. Our method extracts topics from the query log by merging missions, i.e., activity fragments that express a coherent user intent, on the basis of their topical affinity. Our approach works in a bottom-up way, without any a-priori knowledge of topical categorization, and produces good-quality topics compared to state-of-the-art clustering techniques. It can also summarize topically-coherent missions that occur far away from each other, thus enabling more compact user profiling on a topical basis. Furthermore, such topical user profiling discriminates the stream of activity of a particular user from the activity of others, with the potential to predict future user search activity.
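The bottom-up merging of missions by topical affinity described above can be sketched with a greedy agglomeration over term sets. This is not the paper's algorithm: the affinity measure (Jaccard over query terms), the threshold, and the toy missions are all assumptions.

```python
# Sketch: greedily merge "missions" (term sets) whose topical affinity,
# here Jaccard similarity, exceeds a threshold.
def jaccard(a, b):
    return len(a & b) / len(a | b)

def merge_missions(missions, threshold=0.3):
    """Bottom-up merging with no a-priori topical categorization."""
    topics = [set(m) for m in missions]
    merged = True
    while merged:
        merged = False
        for i in range(len(topics)):
            for j in range(i + 1, len(topics)):
                if jaccard(topics[i], topics[j]) >= threshold:
                    topics[i] |= topics.pop(j)  # absorb mission j into i
                    merged = True
                    break
            if merged:
                break
    return topics

missions = [{"marathon", "training", "plan"},
            {"marathon", "training", "shoes"},
            {"tax", "return", "deadline"}]
topics = merge_missions(missions)
```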
Mining historic query trails to label long and rare . . .
, 2010
Cited by 5 (1 self)
Abstract:
Web search engines can perform poorly for long queries (i.e., those containing four or more terms), in part because of their high level of query specificity. The automatic assignment of labels to long queries can capture aspects of a user’s search intent that may not be apparent from the terms in the query. This affords search result matching or reranking based on queries and labels rather than the query text alone. Query labels can be derived from interaction logs generated from many users’ search result clicks or from query trails comprising the chain of URLs visited following query submission. However, since long queries are typically rare, they are difficult to label in this way because little or no historic log data exists for them. A subset of these queries may be amenable to labeling by detecting similarities between parts of a long and rare query and the queries which appear in logs. In this article, we present a comparison of four similarity algorithms for the automatic assignment of Open Directory Project category labels to long and rare queries, based solely on matching against similar satisfied query trails extracted from log data. Our findings show that although the similarity-matching algorithms we investigated have tradeoffs in terms of coverage and accuracy, one algorithm that bases similarity on a popular search result ranking function (effectively regarding potentially-similar queries as “documents”) …
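The last approach above — scoring logged queries against a long, rare query with a search-result-style ranking function, treating each logged query as a tiny "document" — can be sketched with a BM25-style scorer. The parameters and toy query log are invented for illustration.

```python
# Sketch: rank logged queries as "documents" for a long, rare query.
import math

def bm25_score(query_terms, doc_terms, df, n_docs, k1=1.2, b=0.75, avgdl=3.0):
    """Standard BM25 with the logged query playing the role of the document."""
    score = 0.0
    dl = len(doc_terms)
    for t in query_terms:
        tf = doc_terms.count(t)
        if tf == 0:
            continue  # term absent; also skips terms missing from df
        idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score

logged = [["hotels", "paris"], ["cheap", "hotels"], ["weather", "paris"]]
df = {"hotels": 2, "paris": 2, "cheap": 1, "weather": 1}
long_query = "cheap family hotels near paris".split()
best = max(logged, key=lambda d: bm25_score(long_query, d, df, len(logged)))
```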
Searchable Web Sites Recommendations
- In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM)
, 2011
Cited by 5 (0 self)
Abstract:
In this paper, we propose a new framework for searchable web site recommendation. Given a query, our system will recommend a list of searchable web sites ranked by relevance, which can be used to complement the web page results and ads from a search engine. We model the conditional probability of a searchable web site being relevant to a given query in terms of three main components: the language model of the query, the language model of the content within the web site, and the reputation of the web site’s searching capability (static rank). The language models for queries and searchable sites are built using information mined from client-side browsing logs. The static rank for each searchable site leverages features extracted from these client-side logs, such as the number of queries submitted to the site, and features extracted from general search engines, such as the number of web pages indexed for the site, the number of clicks per query, and the dwell time that a user spends on the search result page and on the clicked result web pages. We also learn a weight for each kind of feature to optimize the ranking performance. In our experiment, we discover 10.5 thousand searchable sites and use 5 million unique queries, extracted from one week of log data, to build and demonstrate the effectiveness of our searchable web site recommendation system.
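The three-component model above can be pictured as a log-linear score combining the site-content language model with a query-independent static rank. This is a rough sketch under invented weights, smoothing, and data, not the paper's learned model.

```python
# Sketch: score a searchable site for a query via its content language model
# plus a static rank, combined log-linearly.
import math

def site_score(query, site, w=(1.0, 1.0)):
    """Higher is better; combines content-LM likelihood and static rank."""
    content, static_rank = site["lm"], site["static_rank"]
    total = sum(content.values())
    # add-0.5 smoothed unigram likelihood of the query under the site content
    lm_score = sum(math.log((content.get(t, 0) + 0.5) / (total + 1.0))
                   for t in query.split())
    return w[0] * lm_score + w[1] * math.log(static_rank)

sites = {
    "recipes.example": {"lm": {"pasta": 9, "recipe": 6}, "static_rank": 0.8},
    "cars.example":    {"lm": {"engine": 7, "tires": 5}, "static_rank": 0.9},
}
best = max(sites, key=lambda s: site_score("pasta recipe", sites[s]))
```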
Normalized Web Distance Based Web Query Classification
- Journal of Computer Science
, 2012
Cited by 4 (0 self)
Abstract:
Abstract: Problem statement: The problem is to classify a given web query into a set of 67 target categories. The target categories are ranked based on their degree of similarity to a given query. Approach: The feature set is the set of intermediate categories retrieved from a directory search engine for a given query. Using direct mapping and Normalized Web Distance (NWD), the intermediate categories are mapped to the required target categories. The categories are then ranked based on three parameters of the intermediate categories, namely position, frequency, and a combination of frequency and position. Results: The results show that the third parameter gave better results and that a maximum of 40 search result pages ensures better results. Conclusion: With NWD as the similarity measure, precision and recall are found to increase by 10% over previous methods.
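The Normalized Web Distance used above follows the standard formula of Cilibrasi and Vitányi: values near 0 mean two terms co-occur almost whenever either occurs, larger values mean weaker association. The page counts below are invented for illustration.

```python
# Normalized Web Distance from page counts:
# NWD(x, y) = (max(log f(x), log f(y)) - log f(x, y))
#             / (log N - min(log f(x), log f(y)))
import math

def nwd(fx, fy, fxy, n):
    """fx, fy: pages containing x / y; fxy: pages with both; n: total pages."""
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Toy counts: "horse" on 1000 pages, "rider" on 500, both on 400, of 10**6
d_related = nwd(1000, 500, 400, 10**6)
d_unrelated = nwd(1000, 500, 5, 10**6)
```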