Results 1 - 10 of 17
A Word at a Time: Computing Word Relatedness using Temporal Semantic Analysis
"... Computing the degree of semantic relatedness of words is a key functionality of many language applications such as search, clustering, and disambiguation. Previous approaches to computing semantic relatedness mostly used static language resources, while essentially ignoring their temporal aspects. W ..."
Abstract - Cited by 43 (5 self)
Computing the degree of semantic relatedness of words is a key functionality of many language applications such as search, clustering, and disambiguation. Previous approaches to computing semantic relatedness mostly used static language resources, while essentially ignoring their temporal aspects. We believe that a considerable amount of relatedness information can also be found by studying patterns of word usage over time. Consider, for instance, a newspaper archive spanning many years. Two words such as “war” and “peace” might rarely co-occur in the same articles, yet their patterns of use over time might be similar. In this paper, we propose a new semantic relatedness model, Temporal Semantic Analysis (TSA), which captures this temporal information. The previous state-of-the-art method, Explicit Semantic Analysis (ESA), represented word semantics as a vector of concepts. TSA uses a more refined representation, where each concept is no longer a scalar but is instead represented as a time series over a corpus of temporally ordered documents. To the best of our knowledge, this is the first attempt to incorporate temporal evidence into models of semantic relatedness. Empirical evaluation shows that TSA provides consistent improvements over the state-of-the-art ESA results on multiple benchmarks.
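To make the representational shift concrete, here is a minimal Python sketch of relatedness computed over per-concept time series, in the spirit of the TSA idea described above. The concept names, the use of cosine similarity between temporal profiles, and the simple averaging over shared concepts are illustrative assumptions, not the paper's actual weighting or time-series comparison scheme.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors (0 if either is all zeros)."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return float(u @ v / (nu * nv)) if nu and nv else 0.0

def tsa_relatedness(profile_a, profile_b):
    """Relatedness of two words, each given as {concept: time series of usage}.

    For every concept shared by both words, compare the two temporal
    profiles; average the per-concept similarities.  This stands in for
    the paper's more refined concept weighting and comparison measures.
    """
    shared = set(profile_a) & set(profile_b)
    if not shared:
        return 0.0
    sims = [cosine(np.asarray(profile_a[c], float),
                   np.asarray(profile_b[c], float)) for c in shared]
    return float(np.mean(sims))

# Toy example: "war" and "peace" rarely co-occur, but their yearly usage
# curves under the hypothetical concept "conflict" move together.
war   = {"conflict": [3, 9, 20, 7, 2], "sports":    [1, 1, 0, 1, 1]}
peace = {"conflict": [2, 8, 18, 6, 1], "diplomacy": [4, 5, 6, 5, 4]}
print(tsa_relatedness(war, peace))
```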
Repeatable Evaluation of Search Services in Dynamic Environments
, 2007
"... In dynamic environments, such as the World Wide Web, a changing document collection, query population, and set of search services demands frequent repetition of search effectiveness (relevance) evaluations. Reconstructing static test collections, such as in TREC, requires considerable human effort, ..."
Abstract - Cited by 6 (0 self)
In dynamic environments, such as the World Wide Web, a changing document collection, query population, and set of search services demand frequent repetition of search effectiveness (relevance) evaluations. Reconstructing static test collections, such as in TREC, requires considerable human effort, as large collection sizes demand judgments deep into retrieved pools. In practice it is common to perform shallow evaluations over small numbers of live engines (often pairwise, engine A vs. engine B) without system pooling. Although these evaluations are not intended to construct reusable test collections, their utility depends on conclusions generalizing to the query population as a whole. We leverage the bootstrap estimate of the reproducibility probability of hypothesis tests to determine the query sample sizes required to ensure this, finding they are much larger than those required for static collections. We propose a semi-automatic evaluation framework to reduce this effort. We validate this framework against a manual evaluation of the top ten results of ten Web search engines across 896 queries in navigational and informational tasks. Augmenting manual judgments with pseudo-relevance judgments mined from Web taxonomies reduces both the chances of missing a correct pairwise conclusion, and those of finding an errant conclusion, by approximately 50%.
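The bootstrap estimate of the reproducibility probability mentioned above can be sketched as follows: resample the per-query score differences between two engines, re-run a paired significance test on each resample, and report the fraction of resamples that reach the same significant conclusion. The use of a one-sample t-test and the alpha level are assumptions for illustration, not necessarily the paper's exact choices.

```python
import numpy as np
from scipy import stats

def bootstrap_reproducibility(deltas, alpha=0.05, n_boot=2000, seed=0):
    """Estimate the probability that a paired comparison of engine A vs. B
    would reach the same significant conclusion on a fresh query sample.

    `deltas` are per-query effectiveness differences (A minus B).  Each
    bootstrap resample of queries is re-tested with a paired t-test; the
    estimate is the fraction of resamples that are significant in the same
    direction as the original sample.
    """
    rng = np.random.default_rng(seed)
    deltas = np.asarray(deltas, float)
    sign = np.sign(deltas.mean())
    hits = 0
    for _ in range(n_boot):
        sample = rng.choice(deltas, size=len(deltas), replace=True)
        t, p = stats.ttest_1samp(sample, 0.0)
        if p < alpha and np.sign(sample.mean()) == sign:
            hits += 1
    return hits / n_boot

# Larger query samples raise the reproducibility estimate for the same effect.
rng = np.random.default_rng(1)
for n in (25, 100, 400):
    d = rng.normal(0.02, 0.15, size=n)   # small simulated per-query advantage
    print(n, round(bootstrap_reproducibility(d), 2))
```

Running the toy example shows the estimate climbing with sample size, which is the quantity used to decide how many queries a repeatable evaluation needs.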
Mining Search and Browse Logs for Web Search: A Survey
"... Huge amounts of search log data have been accumulated at web search engines. Currently, a popular web search engine may every day receive billions of queries and collect tera-bytes of records about user search behavior. Beside search log data, huge amounts of browse log data have also been collected ..."
Abstract - Cited by 3 (0 self)
Huge amounts of search log data have been accumulated at web search engines. Currently, a popular web search engine may receive billions of queries and collect terabytes of records about user search behavior every day. Besides search log data, huge amounts of browse log data have also been collected through client-side browser plug-ins. Such massive amounts of search and browse log data provide great opportunities for mining the wisdom of crowds and improving web search. At the same time, designing effective and efficient methods to clean, process, and model log data also presents great challenges. In this survey, we focus on mining search and browse log data for web search. We start with an introduction to search and browse log data and an overview of frequently used data summarizations in log mining. We then elaborate on how log mining applications enhance the five major components of a search engine, namely query understanding, document understanding, document ranking, user understanding, and monitoring & feedback. For each aspect, we survey the major tasks, fundamental principles, and state-of-the-art methods.
Identifying Web search session patterns using cluster analysis: A comparison of three search environments
- J. Am. Soc. Inform. Sci. Tech. 2009
"... Session characteristics taken from large transaction logs of three Web search environments (academic Web site, public search engine, consumer health information por-tal) were modeled using cluster analysis to determine if coherent session groups emerged for each environ-ment and whether the types of ..."
Abstract - Cited by 2 (1 self)
Session characteristics taken from large transaction logs of three Web search environments (academic Web site, public search engine, consumer health information portal) were modeled using cluster analysis to determine if coherent session groups emerged for each environment and whether the types of session groups are similar across the three environments. The analysis revealed three distinct clusters of session behaviors common to each environment: “hit and run” sessions on focused topics, relatively brief sessions on popular topics, and sustained sessions using obscure terms with greater query modification. The findings also revealed shifts in session characteristics over time for one of the datasets, away from “hit and run” sessions toward more popular search topics. A better understanding of session characteristics can help system designers to develop more responsive systems to support search features that cater to identifiable groups of searchers based on their search behaviors. For example, the system may identify struggling searchers based on session behaviors that match those identified in the current study to provide context-sensitive help.
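A hedged sketch of the kind of session clustering the study describes: build one feature vector per session, standardize, and run k-means with three clusters. The specific features and values below are invented for illustration and are not the study's variables or data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Each row is one search session: [queries issued, session length in
# seconds, query reformulations, mean query-term popularity].  The
# feature set is illustrative, not the exact variables of the study.
sessions = np.array([
    [1,    20, 0, 0.9],   # "hit and run" style
    [1,    35, 0, 0.8],
    [3,   180, 1, 0.7],   # brief sessions on popular topics
    [4,   240, 2, 0.6],
    [9,   900, 6, 0.2],   # sustained sessions, obscure terms, many edits
    [12, 1300, 8, 0.1],
])

# Standardize so no single feature dominates, then look for three
# behavioral groups, mirroring the three clusters reported above.
X = StandardScaler().fit_transform(sessions)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels)
```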
The Effects of Query Bursts on Web Search
- WEB INTELLIGENCE
, 2010
"... A query burst is a period of heightened interest of users on a topic which yields a higher frequency of the search queries related to it. In this paper we examine the behavior of search engine users during a query burst, compared to before and after this period. The purpose of this study is to get i ..."
Abstract - Cited by 1 (0 self)
A query burst is a period of heightened user interest in a topic, which yields a higher frequency of the search queries related to it. In this paper we examine the behavior of search engine users during a query burst, compared to before and after this period. The purpose of this study is to gain insight into how search engines and content providers should respond to a query burst. We analyze one year of web-search logs, looking at query bursts from two perspectives. First, we adopt the user’s perspective, describing changes in users’ effort and interest while searching. Second, we look at the burst from the general content providers’ view, answering the question of under which conditions a content provider should “ride” a wave of increased interest to obtain a significant share of clicks.
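The notion of a query burst as a period of heightened query frequency can be illustrated with a simple detection heuristic: flag days whose volume exceeds a multiple of a trailing average. This is only an assumed stand-in for delimiting the before/during/after phases; the paper does not prescribe this procedure.

```python
import numpy as np

def burst_periods(counts, window=7, factor=2.0):
    """Flag positions where daily query volume exceeds `factor` times the
    trailing `window`-day average: a crude way to mark the burst phase of
    a query's time series."""
    counts = np.asarray(counts, float)
    flags = []
    for i, c in enumerate(counts):
        base = counts[max(0, i - window):i]
        baseline = base.mean() if len(base) else c
        flags.append(bool(baseline) and c > factor * baseline)
    return flags

daily = [10, 12, 9, 11, 10, 55, 80, 70, 20, 12, 11]
print(burst_periods(daily))
```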
Website community mining from query logs with two-phase clustering
- In Computational Linguistics and Intelligent Text Processing - Proceedings of 15th International Conference (Part II). CICLing
, 2014
"... Abstract. A website community refers to a set of websites that concentrate on the same or similar topics. There are two major challenges in website community mining task. First, the websites in the same topic may not have direct links among them because of competition concerns. Second, one website ..."
Abstract - Cited by 1 (1 self)
A website community refers to a set of websites that concentrate on the same or similar topics. There are two major challenges in the website community mining task. First, websites on the same topic may not link to each other directly because of competition concerns. Second, one website may contain information about several topics; accordingly, a website community mining method should be able to capture this and assign such a website to multiple communities. In this paper, we propose a method to automatically mine website communities by exploiting query log data from Web search. Query log data can be regarded as a comprehensive summarization of the real Web, and the queries that result in clicks on a particular website can be regarded as a summarization of that website's content. Websites on the same topic are indirectly connected by the queries that convey an information need in this topic; this observation helps us overcome the first challenge. The proposed two-phase method tackles the second challenge: in the first phase, we cluster the queries of the same host to obtain the different content aspects of the host; in the second phase, we further cluster the obtained content aspects from different hosts. Because of the two-phase clustering, one host may appear in more than one website community.
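A minimal sketch of the two-phase idea, assuming TF-IDF vectors and k-means for both phases (the paper's actual clustering components may differ): phase one clusters each host's clicked-for queries into content aspects, and phase two clusters the aspects across hosts, so a multi-topic host can land in several communities. The host names, queries, and cluster counts below are toy assumptions.

```python
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# (host, clicked-for query) pairs mined from a query log; toy data.
log = [
    ("sports.example",  "football scores"),  ("sports.example", "league table"),
    ("sports.example",  "laptop reviews"),   # the host also covers gadgets
    ("news.example",    "football results"), ("news.example", "election news"),
    ("gadgets.example", "laptop reviews"),   ("gadgets.example", "phone specs"),
]

# Phase 1: cluster each host's queries into content aspects.
aspects, owners = [], []
queries_by_host = defaultdict(list)
for host, q in log:
    queries_by_host[host].append(q)
for host, queries in queries_by_host.items():
    k = min(2, len(queries))                      # illustrative choice of k
    vec = TfidfVectorizer().fit(queries)
    q_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(
        vec.transform(queries))
    for lab in set(q_labels):
        text = " ".join(q for q, l in zip(queries, q_labels) if l == lab)
        aspects.append(text)                      # one pseudo-document per aspect
        owners.append(host)

# Phase 2: cluster aspects from all hosts; a host whose aspects fall into
# several aspect clusters ends up in several website communities.
vec = TfidfVectorizer().fit(aspects)
a_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    vec.transform(aspects))
communities = defaultdict(set)
for host, lab in zip(owners, a_labels):
    communities[lab].add(host)
print(dict(communities))
```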
and
"... Consider a database of time-series, where each datapoint in the series records the total number of users who asked for a specific query at an internet search engine. Storage and analysis of such logs can be very beneficial for a search company from multiple perspectives. First, from a data organizat ..."
Abstract
Consider a database of time series, where each datapoint in a series records the total number of users who issued a specific query at an internet search engine. Storage and analysis of such logs can be very beneficial for a search company from multiple perspectives. First, from a data organization perspective, because query Weblogs capture important trends and statistics, they can help enhance and optimize the search experience (keyword recommendation, discovery of news events). Second, Weblog data can provide an important polling mechanism for the microeconomic aspects of a search engine, since they can facilitate and promote the advertising facet of the search engine (understanding what users request and when they request it). Due to the sheer amount of time-series Weblogs, manipulating the logs in a compressed form is a pressing necessity for fast data processing and compact storage. Here, we explicate how to compute lower and upper bounds on the distance between time-series logs when working directly on their compressed form. Optimal distance estimation means tighter bounds, leading to better candidate selection/elimination and ultimately faster search performance. Our derivation of the optimal distance bounds is based on a careful analysis of the problem using optimization principles. The experimental evaluation suggests a clear performance advantage of the proposed method compared to previous compression/search techniques. The presented method results in a 10–30% improvement on distance estimations, which in turn leads to a 25–80% improvement on search performance.
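As an illustration of bounding distances directly on compressed series, the sketch below keeps the first k Fourier coefficients plus the residual energy of each series and derives a lower bound (Parseval on the kept coefficients) and an upper bound (triangle inequality on the discarded part). This is a simplified, non-optimal scheme, not the optimal bounds derived in the paper.

```python
import numpy as np

def compress(x, k):
    """Keep the first k DFT coefficients plus the energy of the rest."""
    X = np.fft.fft(np.asarray(x, float))
    return X[:k], float(np.sum(np.abs(X[k:]) ** 2))

def distance_bounds(cx, cy, n):
    """Lower/upper bounds on the Euclidean distance of two length-n series
    given only their compressed forms (kept coefficients + residual energy).

    By Parseval, ||x - y||^2 = (1/n) * sum_k |X_k - Y_k|^2.  The kept
    coefficients alone give the lower bound; bounding the discarded part
    with the triangle inequality gives the upper bound.
    """
    (Xk, ex), (Yk, ey) = cx, cy
    kept = np.sum(np.abs(Xk - Yk) ** 2)
    lb = np.sqrt(kept / n)
    ub = np.sqrt((kept + (np.sqrt(ex) + np.sqrt(ey)) ** 2) / n)
    return lb, ub

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=128))
y = np.cumsum(rng.normal(size=128))
lb, ub = distance_bounds(compress(x, 8), compress(y, 8), len(x))
print(round(lb, 2), round(float(np.linalg.norm(x - y)), 2), round(ub, 2))
```

Tighter bounds computed this way let a search over compressed logs discard more candidates before any series is decompressed.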
COLLECTION-LEVEL SUBJECT ACCESS IN AGGREGATIONS OF DIGITAL COLLECTIONS: METADATA APPLICATION AND USE BY
"... ii Problems in subject access to information organization systems have been under investigation for a long time. Focusing on item-level information discovery and access, researchers have identified a range of subject access problems, including quality and application of metadata, as well as the comp ..."
Abstract
Problems in subject access to information organization systems have been under investigation for a long time. Focusing on item-level information discovery and access, researchers have identified a range of subject access problems, including quality and application of metadata, as well as the complexity of user knowledge required for successful subject exploration. While aggregations of digital collections built in the United States and abroad generate collection-level metadata of various levels of granularity and richness, no research has yet focused on the role of collection-level metadata in user interaction with these aggregations. This dissertation research sought to bridge this gap by answering the question “How does collection-level metadata mediate scholarly subject access to aggregated digital collections?” This goal was achieved using three research methods: in-depth comparative content analysis of collection-level metadata in three large-scale aggregations of cultural heritage digital collections: Opening History, American Memory, and The European Library
Exploring the role of scale: comparative analysis of digital library user searching
"... ABSTRACT This poster reports preliminary results of a comparative study of user searching in two large-scale digital libraries with history focus in the United States: the federal-level and the state-level digital library. Similarities were observed in search query lengths and average numbers of se ..."
Abstract
This poster reports preliminary results of a comparative study of user searching in two large-scale digital libraries with a history focus in the United States: one federal-level and one state-level digital library. Similarities were observed in search query lengths and in the average number of search categories per query. At the same time, the study reveals significant differences in the level of use of advanced search options and in search query frequencies, as well as in the distribution of most search categories in user queries. The empirical data obtained in this exploratory study will inform further research and will be useful for professionals making decisions on resource description and user interfaces for digital libraries. Keywords: search queries, information behavior, federal-level digital libraries, state-level digital libraries, search log analysis.
Exploring Real-Time Temporal Query Auto-Completion
"... Query auto-completion (QAC) is a common interactive feature for assisting users during query formulation. Following each query input keystroke, QAC suggests queries prefixed by the input characters; allowing the user to avoid further cognitive and physical effort if any are acceptable. To rank sugge ..."
Abstract
Query auto-completion (QAC) is a common interactive feature for assisting users during query formulation. Following each keystroke of the query input, QAC suggests queries prefixed by the input characters, allowing the user to avoid further cognitive and physical effort if any suggestion is acceptable. To rank suggestions, QAC approaches typically aggregate past query popularity to determine the likelihood of a query being used again. Hence, QAC is usually very effective for consistently popular queries. However, as the web becomes increasingly real-time, more people are turning to search engines to find out about unpredictable emerging and ongoing events and phenomena. QAC approaches that rely on aggregating long-term historic query logs are not sensitive to very recent real-time events, because newly popular queries are outweighed by long-term popular queries, especially for less specific prefix lengths (e.g. 2 or 3 characters). We explore limiting the aggregation period of past query-log evidence to increase the temporal sensitivity of QAC. We vary the query-log aggregation period between 2 and 14 days, for prefix lengths of 2 to 5 characters. Experimentation simulates a real-time environment using the openly available MSN and AOL query-log datasets. Analysis indicates a linear relationship between prefix length and QAC performance when using different query-log aggregation periods. In particular, we find that QAC for shorter prefix lengths is optimal when a shorter query-log aggregation period is used and, vice versa, longer prefix lengths benefit from a longer query-log aggregation period.
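A small sketch of prefix-length-dependent aggregation for QAC, assuming a hypothetical mapping from prefix length to window size: completions are ranked by their frequency within the recent window only. The window values and the toy log are assumptions, not the tuned settings or data of the experiments.

```python
from collections import Counter
from datetime import datetime, timedelta

# Illustrative policy: short, ambiguous prefixes lean on recent evidence,
# while longer prefixes can use a longer aggregation window.  The mapping
# below is an assumption, not the paper's tuned values.
WINDOW_DAYS = {2: 2, 3: 4, 4: 9, 5: 14}

def autocomplete(log, prefix, now, k=5):
    """Rank query completions for `prefix` by their frequency within the
    prefix-length-dependent aggregation window ending at `now`.

    `log` is an iterable of (timestamp, query) pairs, e.g. parsed from the
    MSN or AOL query logs."""
    days = WINDOW_DAYS.get(len(prefix), 14)
    cutoff = now - timedelta(days=days)
    counts = Counter(q for ts, q in log
                     if ts >= cutoff and q.startswith(prefix))
    return [q for q, _ in counts.most_common(k)]

now = datetime(2006, 5, 31)
log = [
    (datetime(2006, 5, 30), "world cup schedule"),
    (datetime(2006, 5, 30), "world cup schedule"),
    (datetime(2006, 5, 29), "world news"),
    (datetime(2006, 3, 1),  "world of warcraft"),   # old but historically popular
]
print(autocomplete(log, "wo", now))
```

With the short two-day window for the two-character prefix, the recently bursting query outranks the historically popular one, which is the behavior the limited aggregation period is meant to produce.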