Results 1 - 10 of 20
Mining query logs: Turning search usage data into knowledge
- Foundations and Trends in Information Retrieval
, 2008
"... ..."
(Show Context)
Search-based Query Suggestion
"... In this paper, we proposed a unified strategy to combine query log and search results for query suggestion. In this way, we leverage both the users ’ search intentions for popular queries and the power of search engines for unpopular queries. The suggested queries are also ranked according to their ..."
Abstract - Cited by 34 (8 self)
In this paper, we propose a unified strategy that combines query logs and search results for query suggestion. In this way, we leverage both users’ search intentions for popular queries and the power of search engines for unpopular queries. The suggested queries are ranked according to their relevance and quality, and each suggestion is described with a rich snippet including a photo and a related description.
Event detection from evolution of click-through data
- Department of Computer Science and Technology, Tsinghua University
"... Previous efforts on event detection from the web have fo-cused primarily on web content and structure data ignoring the rich collection of web log data. In this paper, we propose the first approach to detect events from the click-through data, which is the log data of web search engines. The in-tuit ..."
Abstract - Cited by 15 (0 self)
Previous efforts on event detection from the web have focused primarily on web content and structure data, ignoring the rich collection of web log data. In this paper, we propose the first approach to detect events from click-through data, the log data of web search engines. The intuition behind event detection from click-through data is that such data is often event-driven, and each event can be represented as a set of query-page pairs that are not only semantically similar but also have similar evolution patterns over time. Given the click-through data, our proposed approach first segments it into a sequence of bipartite graphs based on a user-defined time granularity. Next, the sequence of bipartite graphs is represented as a vector-based graph, which records the semantic and evolutionary relationships between queries and pages. After that, the vector-based graph is transformed into its dual graph, where each node is a query-page pair that will be used to represent real-world events. The problem of event detection is then equivalent to clustering the dual graph of the vector-based graph. The clustering process is based on a two-phase graph-cut algorithm. In the first phase, query-page pairs are clustered by semantic similarity, so that each resulting cluster corresponds to a specific topic. In the second phase, query-page pairs related to the same topic are further clustered by the similarity of their evolution patterns, so that each cluster is expected to represent a specific event under that topic. Experiments with real click-through data collected from a commercial web search engine show that the proposed approach produces high-quality results.
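The first step described in the abstract, segmenting click-through data into a sequence of bipartite graphs by a user-defined time granularity, can be sketched as follows. The record format and the fixed-window bucketing are illustrative assumptions, not the paper's exact scheme:

```python
from collections import defaultdict

def segment_clickthrough(records, bucket_size):
    """Segment (timestamp, query, page) click records into a sequence of
    bipartite graphs, one per fixed-width time bucket.

    Each graph maps a query to the set of pages clicked for it within
    that bucket; the query-page pairs are the edges of the bipartite graph.
    """
    buckets = defaultdict(lambda: defaultdict(set))
    for ts, query, page in records:
        buckets[ts // bucket_size][query].add(page)
    # Return the graphs in time order.
    return [dict(buckets[k]) for k in sorted(buckets)]

records = [
    (0, "olympics", "bbc.com/olympics"),
    (5, "olympics", "nbc.com/games"),
    (12, "olympics tickets", "tickets.com"),
]
graphs = segment_clickthrough(records, bucket_size=10)
print(len(graphs))  # 2 buckets: [0, 10) and [10, 20)
```

The later stages (the vector-based graph, its dual, and the two-phase graph cut) build on this segmented sequence.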
Efficient Anomaly Monitoring Over Moving Object Trajectory Streams
"... Lately there exist increasing demands for online abnormality monitoring over trajectory streams, which are obtained from moving object tracking devices. This problem is challenging due to the requirement of high speed data processing within limited space cost. In this paper, we present a novel frame ..."
Abstract - Cited by 15 (0 self)
There are increasing demands for online anomaly monitoring over trajectory streams obtained from moving-object tracking devices. This problem is challenging because it requires high-speed data processing within a limited space budget. In this paper, we present a novel framework for monitoring anomalies over continuous trajectory streams. First, we illustrate the importance of distance-based anomaly monitoring over moving-object trajectories. Then, we exploit the local continuity of trajectories to build local clusters over trajectory streams and monitor anomalies via efficient pruning strategies. Finally, we propose a piecewise metric index structure that reschedules the joining order of local clusters to further reduce the time cost. Our extensive experiments demonstrate the effectiveness and efficiency of our methods.
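The distance-based anomaly notion the abstract builds on can be illustrated with a minimal sliding-window test: a point is flagged when fewer than k points in the window lie within distance d. This is the generic distance-based outlier definition, not the paper's cluster-based pruning framework; d and k are illustrative parameters:

```python
import math

def is_anomalous(point, window, d, k):
    """Flag `point` as a distance-based anomaly if fewer than `k` points
    in the sliding `window` lie within Euclidean distance `d` of it."""
    neighbors = 0
    for p in window:
        if math.dist(point, p) <= d:
            neighbors += 1
            if neighbors >= k:
                return False  # early exit: enough close neighbors, not anomalous
    return True

window = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0)]
print(is_anomalous((5.0, 5.0), window, d=1.0, k=2))    # True: far from all points
print(is_anomalous((0.05, 0.05), window, d=1.0, k=2))  # False: two near neighbors
```

The early exit is the simplest form of the pruning idea: monitoring can stop scanning as soon as enough close neighbors are found.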
On Nonmetric Similarity Search Problems in Complex Domains
, 2010
"... The task of similarity search is widely used in various areas of computing, including multimedia databases, data mining, bioinformatics, social networks, etc. In fact, retrieval of semantically unstructured data entities requires a form of aggregated qualification that selects entities relevant to a ..."
Abstract - Cited by 10 (3 self)
The task of similarity search is widely used in various areas of computing, including multimedia databases, data mining, bioinformatics, and social networks. In fact, retrieval of semantically unstructured data entities requires a form of aggregated qualification that selects entities relevant to a query; a popular mechanism of this kind is similarity querying. For a long time, database-oriented applications of similarity search restricted the definition of similarity to metric distances. Owing to its topological properties, metric similarity can be used to index a database effectively, which can then be queried efficiently by so-called metric access methods. However, with the increasing complexity of data entities across various domains, many similarity measures have appeared in recent years that are not metrics; we call them nonmetric similarity functions. In this paper we survey domains that employ nonmetric functions for effective similarity search, as well as methods for efficient nonmetric similarity search. First, we show that ongoing research in many of these domains requires complex representations of data entities. Such complex representations also allow us to model complex and computationally expensive similarity functions (often represented by various matching algorithms). However, the more complex a similarity function one develops, the more likely it is to be nonmetric. Second, we review state-of-the-art techniques for efficient (fast) nonmetric similarity search, covering both exact and approximate search. Finally, we discuss some open problems and possible future research trends.
Optimal Distance Bounds on Time-Series Data
"... Most data mining operations include an integral search component at their core. For example, the performance of similarity search or classification based on Nearest Neighbors is largely dependent on the underlying compression and distance estimation techniques. As data repositories grow larger, ther ..."
Abstract - Cited by 8 (3 self)
Most data mining operations include an integral search component at their core. For example, the performance of similarity search or Nearest-Neighbor classification depends largely on the underlying compression and distance estimation techniques. As data repositories grow larger, there is an explicit need not only to store the data in a compressed form, but also to facilitate mining operations directly on the compressed data. Naturally, the quality, or tightness, of the estimated distances on the compressed objects directly affects the search performance. We motivate our work within the setting of search-engine weblog repositories, where keyword demand trends over time are represented and stored as compressed time-series data. Search and analysis over such sequence data has important applications for search engines, including discovery of important news events, keyword recommendation, and efficient keyword-to-advertisement mapping. We present new mechanisms for very fast search operations over compressed time-series data, with a specific focus on weblog data. An important contribution of this work is the derivation of optimally tight bounds on the Euclidean distance estimate between compressed sequences. Since our methodology is applicable to sequential data in general, the proposed technique is of independent interest. Additionally, our distance estimation strategy is not tied to a specific compression methodology, but can be applied on top of any orthonormal compression technique (Fourier, Wavelet, PCA, etc.). The experimental results indicate that the new optimal bounds lead to a significant improvement in the pruning power of search compared to the previous state of the art, in many cases eliminating more than 80% of the candidate search sequences.
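The idea of bounding Euclidean distance from compressed representations can be illustrated with the standard triangle-inequality bounds for an orthonormal transform: keep the first m Fourier coefficients plus the energy of the discarded tail, then bound the unknown distance between the tails. The paper derives tighter, optimal bounds via optimization; this sketch shows only the baseline it improves on:

```python
import numpy as np

def fourier_compress(x, m):
    """Keep the first m orthonormal Fourier coefficients and the L2 norm
    (energy) of the discarded remainder."""
    c = np.fft.fft(x, norm="ortho")  # unitary DFT: preserves Euclidean distance
    return c[:m], np.linalg.norm(c[m:])

def distance_bounds(a, b):
    """Lower/upper bounds on the Euclidean distance between two sequences,
    given only their compressed forms (coefficients, tail energy).

    By Parseval, the true distance equals the distance between the full
    coefficient vectors. The kept part is known exactly; the distance
    between the discarded tails lies between |ea - eb| and ea + eb by the
    (reverse) triangle inequality.
    """
    (ca, ea), (cb, eb) = a, b
    kept = np.linalg.norm(ca - cb)
    lower = np.sqrt(kept**2 + (ea - eb)**2)
    upper = np.sqrt(kept**2 + (ea + eb)**2)
    return lower, upper

x = np.sin(np.linspace(0, 4 * np.pi, 64))
y = np.cos(np.linspace(0, 4 * np.pi, 64))
lo, up = distance_bounds(fourier_compress(x, 8), fourier_compress(y, 8))
true = np.linalg.norm(x - y)
print(lo <= true <= up)  # True: the bounds always bracket the true distance
```

Tighter bounds translate directly into pruning power: more candidate sequences can be discarded without decompressing them.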
Mining Related Queries from Web Search Engine Query Logs Using an Improved Association Rule Mining Model
"... With the overwhelming volume of information, the task of finding relevant information on a given topic on the Web is becoming increasingly difficult. Web search engines hence become one of the most popular solutions available on the Web. However, it has never been easy for novice users to organize a ..."
Abstract - Cited by 7 (0 self)
With the overwhelming volume of information, the task of finding relevant information on a given topic on the Web is becoming increasingly difficult, and Web search engines have hence become one of the most popular solutions available on the Web. However, it has never been easy for novice users to organize and represent their information needs using simple queries; users have to keep modifying their input queries until they get the expected results. It is therefore often desirable for search engines to give users suggestions for related queries. Moreover, by identifying related queries, search engines can potentially perform optimizations on their systems, such as query expansion and file indexing. In this work we propose a method that suggests a list of related queries given an initial input query. The related queries are based on the query log of queries previously submitted by human users, and can be identified using an enhanced association rule mining model. Users can utilize the suggested related queries to tune or redirect the search process. Our method not only discovers the related queries, but also ranks them according to the degree of their relatedness. Unlike many rival techniques, it also performs reasonably well on less frequent input queries.
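A minimal association-rule baseline for this task, suggesting related queries from session co-occurrence and ranking them by confidence, might look like the sketch below. It is a plain support/confidence model for illustration, not the paper's enhanced model; the session data is hypothetical:

```python
from collections import Counter
from itertools import combinations

def related_queries(sessions, query, min_support=2):
    """Suggest queries related to `query` from session co-occurrence,
    ranked by confidence = support(q, query) / support(query)."""
    pair_count = Counter()
    query_count = Counter()
    for session in sessions:
        uniq = set(session)
        query_count.update(uniq)
        pair_count.update(frozenset(p) for p in combinations(sorted(uniq), 2))
    suggestions = []
    for q in query_count:
        if q == query:
            continue
        support = pair_count[frozenset((q, query))]
        if support >= min_support:  # prune rules below the support threshold
            suggestions.append((q, support / query_count[query]))
    return sorted(suggestions, key=lambda t: -t[1])

sessions = [
    ["jaguar", "jaguar car"],
    ["jaguar", "jaguar car", "jaguar price"],
    ["jaguar", "jaguar animal"],
]
print(related_queries(sessions, "jaguar"))  # [('jaguar car', 0.6666666666666666)]
```

The support threshold is what makes less frequent queries hard for plain association rules, which is the weakness the abstract says the enhanced model addresses.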
The Effects of Query Bursts on Web Search
- WEB INTELLIGENCE
, 2010
"... A query burst is a period of heightened interest of users on a topic which yields a higher frequency of the search queries related to it. In this paper we examine the behavior of search engine users during a query burst, compared to before and after this period. The purpose of this study is to get i ..."
Abstract - Cited by 1 (0 self)
A query burst is a period of heightened user interest in a topic, which yields a higher frequency of search queries related to it. In this paper we examine the behavior of search-engine users during a query burst, compared to before and after this period. The purpose of this study is to gain insight into how search engines and content providers should respond to a query burst. We analyze one year of web-search logs, looking at query bursts from two perspectives. First, we adopt the user's perspective, describing changes in users' effort and interest while searching. Second, we look at the burst from the general content providers' view, answering the question of under which conditions a content provider should "ride" a wave of increased interest to obtain a significant share of clicks.
Consistent Phrase Relevance Measures
"... Measuring the relevance between a document and a phrase is fundamental to many information retrieval and matching tasks including on-line advertising. In this paper, we explore two approaches for measuring the relevance between a document and a phrase aiming to provide consistent relevance scores fo ..."
Abstract - Cited by 1 (0 self)
Measuring the relevance between a document and a phrase is fundamental to many information retrieval and matching tasks, including online advertising. In this paper, we explore two approaches for measuring the relevance between a document and a phrase that aim to provide consistent relevance scores for both in-document and out-of-document phrases. The first approach is a similarity-based method that represents both the document and the phrase as term vectors to derive a real-valued relevance score. The second approach takes as input the relevance estimates of some in-document phrases and uses Gaussian Process Regression to predict the score of a target out-of-document phrase. While both approaches work well, the best result is given by a Gaussian Process Regression model, which is significantly better than the similarity-based approach and 10% better than a baseline similarity method using bag-of-words vectors.
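The similarity-based approach mentioned in the abstract can be sketched as a bare-bones cosine similarity over raw term-frequency vectors. Real systems would weight the vectors (e.g. with tf-idf); the tokenization and example text here are illustrative:

```python
import math
from collections import Counter

def cosine_relevance(document_tokens, phrase_tokens):
    """Score phrase-document relevance as the cosine similarity of their
    raw term-frequency vectors."""
    d, p = Counter(document_tokens), Counter(phrase_tokens)
    dot = sum(d[t] * p[t] for t in p)
    norm = (math.sqrt(sum(v * v for v in d.values()))
            * math.sqrt(sum(v * v for v in p.values())))
    return dot / norm if norm else 0.0  # guard against empty inputs

doc = "machine learning improves web search ranking".split()
print(round(cosine_relevance(doc, ["web", "search"]), 3))  # 0.577
```

Note that a vector-overlap score like this is inherently biased toward in-document phrases, which is why the paper pairs it with a regression model for out-of-document phrases.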
Abstract
Consider a database of time series, where each datapoint in a series records the total number of users who issued a specific query at an internet search engine. Storage and analysis of such logs can be very beneficial for a search company from multiple perspectives. First, from a data-organization perspective, because query weblogs capture important trends and statistics, they can help enhance and optimize the search experience (keyword recommendation, discovery of news events). Second, weblog data can provide an important polling mechanism for the microeconomic aspects of a search engine, since they can facilitate and promote its advertising facet (understanding what users request and when they request it). Due to the sheer amount of time-series weblogs, manipulating the logs in a compressed form is a pressing necessity for fast data processing and compact storage. Here, we explicate how to compute lower and upper bounds on the distance between time-series logs when working directly on their compressed form. Optimal distance estimation means tighter bounds, leading to better candidate selection/elimination and ultimately faster search performance. Our derivation of the optimal distance bounds is based on a careful analysis of the problem using optimization principles. The experimental evaluation suggests a clear performance advantage for the proposed method compared to previous compression/search techniques. The presented method yields a 10-30% improvement in distance estimation, which in turn leads to a 25-80% improvement in search performance.