Gossiping personalized queries
- EDBT, volume 426 of ACM International Conference Proceeding Series, 2010
Cited by 14 (4 self)
This paper presents P3Q, a fully decentralized gossip-based protocol to personalize query processing in social tagging systems. P3Q dynamically associates each user with social acquaintances sharing similar tagging behaviours. Queries are gossiped among such acquaintances, computed on the fly in a collaborative, yet partitioned manner, and results are iteratively refined and returned to the querier. Analytical and experimental evaluations convey the scalability of P3Q for top-k query processing. More specifically, we show that on a 10,000-user delicious trace, with little storage at each user, the queries are accurately computed within reasonable time and bandwidth consumption. We also report on the inherent ability of P3Q to cope with users updating profiles and departing.
Efficient Distributed Top-k Query Processing with Caching
Cited by 4 (2 self)
Recently, there has been an increased interest in incorporating in database management systems rank-aware query operators, such as top-k queries, that allow users to retrieve only the most interesting data objects. In this paper, we propose a cache-based approach for efficiently supporting top-k queries in distributed database management systems. In large distributed systems, the query performance depends mainly on the network cost, measured as the number of tuples transmitted over the network. Ideally, only the k tuples that belong to the query result set should be transmitted. Nevertheless, a server cannot decide based only on its local data which tuples belong to the result set. Therefore, in this paper, we use caching of previous results to reduce the number of tuples that must be fetched over the network. To this end, our approach always delivers as many tuples as possible from cache and constructs a remainder query to fetch the remaining tuples. This is different from the existing distributed approaches that need to re-execute the entire top-k query when the cached entries are not sufficient to provide the result set. We demonstrate the feasibility and efficiency of our approach through implementation in a distributed database management system.
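The remainder-query idea in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a single remote source, an identical scoring function for the cached and the new query, and distinct scores. The names `top_k` and `answer_with_cache` are hypothetical.

```python
import heapq

def top_k(tuples, score, k):
    """Reference top-k: the k highest-scoring tuples, best first."""
    return heapq.nlargest(k, tuples, key=score)

def answer_with_cache(cached, remote, score, k):
    """Answer a top-k query from `cached` (an earlier top-k' result, best
    first) and fetch only the missing tuples with a remainder query.

    Returns (result, number_of_tuples_fetched). Assumes distinct scores and
    the same scoring function as the cached query (illustrative assumptions).
    """
    if len(cached) >= k:
        return cached[:k], 0  # fully answered from the cache
    # Remainder query: only tuples scoring strictly below the worst cached
    # tuple qualify, so nothing already in the cache is transmitted again.
    bound = score(cached[-1]) if cached else float("inf")
    candidates = [t for t in remote if score(t) < bound]
    fetched = top_k(candidates, score, k - len(cached))
    return cached + fetched, len(fetched)
```

With a cached top-2 result and a new top-4 query, only two tuples cross the network in this sketch, instead of the four a full re-execution would transmit.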
Multidimensional routing indices for efficient distributed query processing
- In Proc. of CIKM, 2009
Cited by 3 (1 self)
Traditional routing indices in peer-to-peer (P2P) networks are mainly designed for document retrieval applications and maintain aggregated one-dimensional values representing the number of documents that can be obtained in a certain direction in the network. In this paper, we introduce the concept of multidimensional routing indices (MRIs), which are suitable for handling multidimensional data represented by minimum bounding regions (MBRs). Depending on data distribution on peers, the aggregation of the MBRs may lead to MRIs that exhibit extremely poor performance, which renders them ineffective. Thus, focusing on a hybrid unstructured P2P network, we analyze the parameters for building MRIs of high selectivity. We present techniques that boost the query routing performance by detecting similar peers and grouping and reassigning these peers to other parts of the hybrid network in a distributed and scalable way. We demonstrate the advantages of our approach using large-scale simulations.
Efficient Processing of Exact Top-k Queries over Sorted Lists
- 2010
Cited by 1 (0 self)
The top-k query is employed in a wide range of applications to generate a ranked list of data that have the highest aggregate scores over certain attributes. As the pool of attributes for selection by individual queries may be large, the data are indexed with per-attribute sorted lists, and a threshold algorithm is applied on the lists involved in each query. The threshold algorithm executes in two phases: find a cut-off threshold for the top-k result scores, then evaluate all the records that could score above the threshold. In this paper, we focus on exact top-k queries that involve monotonic linear scoring functions over disk-resident sorted lists. We introduce a model for estimating the depths to which each sorted list needs to be processed
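For readers unfamiliar with threshold algorithms over sorted lists, the following is a minimal sketch of the textbook variant (interleaved sorted and random access), not the two-phase, depth-estimating method this paper proposes. It assumes in-memory lists, non-negative weights (so the linear score is monotonic), and that every record appears in every list. The function name `threshold_topk` is hypothetical.

```python
import heapq

def threshold_topk(lists, weights, k):
    """Top-k under a weighted-sum score over per-attribute sorted lists.

    lists:   one dict per attribute, mapping record id -> attribute value
    weights: non-negative weight per attribute (assumed, for monotonicity)
    Returns (best, depth): the top-k (score, id) pairs, best first, and the
    depth to which the sorted lists were read.
    """
    # Per-attribute lists ordered by value, highest first (sorted access).
    sorted_lists = [sorted(l.items(), key=lambda kv: -kv[1]) for l in lists]
    seen = {}   # record id -> full aggregate score
    best = []
    depth = 0
    n = len(sorted_lists[0]) if sorted_lists else 0
    while depth < n:
        for sl in sorted_lists:
            rid, _ = sl[depth]
            if rid not in seen:
                # Random access: look the record up in every list.
                seen[rid] = sum(w * l[rid] for w, l in zip(weights, lists))
        # No unseen record can score above the aggregate of the values
        # at the current depth.
        threshold = sum(w * sl[depth][1] for w, sl in zip(weights, sorted_lists))
        depth += 1
        best = heapq.nlargest(k, ((s, r) for r, s in seen.items()))
        if len(best) == k and best[-1][0] >= threshold:
            break  # the k-th best score already meets the threshold
    return best, depth
```

The returned depth shows how far each list had to be read; the paper's model aims to estimate such depths ahead of time.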
On Saying "Enough Already!" in MapReduce
Cited by 1 (1 self)
The MapReduce framework for parallel processing of massive data sets has attracted considerable attention recently, mainly due to its salient features that include scalability, simplicity, and fault tolerance. However, despite its merits, MapReduce follows a brute-force approach, which often results in performing redundant work. This is particularly evident in the case of rank-aware queries, such as top-k, where a bounded set of k tuples comprises the result set. To process such queries in MapReduce, the input data needs to be accessed in its entirety, in order to produce the correct result set. To address this limitation of lack of early termination, in this paper, we investigate different techniques that allow efficient processing of rank-aware queries, without accessing the input data exhaustively. We present various individual approaches that can be combined and demonstrate their advantages and shortcomings. Thus, we provide the first steps towards integrating efficient rank-aware processing in MapReduce.
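One simple way to picture early termination for top-k is sketched below in plain Python rather than an actual MapReduce runtime. It assumes each input split is already stored in descending score order, which is an illustrative assumption and not necessarily one of the techniques studied in the paper. The function names are hypothetical.

```python
import heapq
from itertools import islice

def map_local_topk(split, k):
    """Mapper: `split` yields (score, record) pairs in descending score
    order (assumed), so reading stops after the first k pairs."""
    return list(islice(split, k))

def reduce_global_topk(partials, k):
    """Reducer: merge the per-split candidates into the global top-k."""
    return heapq.nlargest(k, (t for part in partials for t in part))
```

Under that assumption each mapper reads k records instead of its whole split, and the reducer receives at most k candidates per split.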
Distributed Top-k Query Processing by Exploiting Skyline Summaries
Cited by 1 (0 self)
Recently, a trend has been observed towards supporting rank-aware query operators, such as top-k, that enable users to retrieve only a limited set of the most interesting data objects. As data nowadays is commonly stored distributed over multiple servers, a challenging problem is to support rank-aware queries in distributed environments. In this paper, we propose a novel approach, called DiTo, for efficient top-k processing over multiple servers, where each server stores autonomously a fraction of the data. Towards this goal, we exploit the inherent relationship of top-k and skyline objects, and we employ the skyline objects of servers as a data summarization mechanism for efficiently identifying the servers that store top-k results. Relying on a thresholding scheme, DiTo retrieves the top-k result set progressively, while the number of queried servers and transferred data is minimized. Furthermore, we extend DiTo to support data summarizations of bounded size, thus restricting the cost of summary distribution and maintenance. To this end, we study the challenging problem of finding an abstraction of the skyline set of fixed size that influences the performance of DiTo only slightly. Our experimental evaluation shows that DiTo performs efficiently and provides a viable solution when a high degree of distribution is required.
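The relationship between top-k and skyline objects that this abstract relies on can be shown with a short sketch: under a monotone scoring function, the best-scoring object of a data set always lies on its skyline, so a server's skyline is enough to bound the best score that server can offer. The code below is an illustration under assumed conventions (smaller values are better, weighted-sum scoring with non-negative weights), not DiTo itself. All function names are hypothetical.

```python
def dominates(p, q):
    """p dominates q: no worse in every dimension, better in at least one
    (smaller is better, by assumption)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Points not dominated by any other point (quadratic-time sketch)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def best_score(summary, weights):
    """Best (lowest) weighted-sum score among the points of a summary."""
    return min(sum(w * x for w, x in zip(weights, p)) for p in summary)

def servers_to_query(summaries, weights, threshold):
    """Ids of servers whose skyline shows they may hold an object scoring
    at or below `threshold`; the others can be skipped."""
    return [sid for sid, sky in summaries.items()
            if best_score(sky, weights) <= threshold]
```

A coordinator holding only the skylines can therefore skip servers whose best possible score cannot beat the current threshold.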
RIPPLE: A scalable framework for . . .
- 2014
We introduce a generic framework, termed RIPPLE, for processing rank queries in decentralized systems. Rank queries are particularly challenging, since the search area (i.e., which tuples qualify) cannot be determined by any peer individually. While our proposed framework is generic enough to apply to all decentralized structured systems, we show that when coupled with a particular distributed hash table (DHT) topology, it offers guaranteed worst-case performance. Specifically, rank query processing in our framework exhibits tunable polylogarithmic latency, in terms of the network size. Additionally, we provide a means to trade off latency for communication and processing cost. As a proof of concept, we apply RIPPLE for top-k query processing. Then, we consider skyline queries, and demonstrate that our framework results in a method that has better latency and lower overall communication cost than existing approaches over DHTs. Finally, we provide a RIPPLE-based approach for constructing a k-diversified set, which, to the best of our knowledge, is the first distributed solution for this problem. Extensive experiments with real and synthetic datasets validate the effectiveness of our framework.
Efficient Early Top-k Query Processing in Overloaded P2P Systems
Top-k query processing in P2P systems has focused on efficiently computing the top-k results while reducing network traffic and query response time. However, in overloaded P2P systems (with very high query loads), some peers may take a long time to answer, thus making the user wait a long time to obtain the final top-k result. In this paper, we address this problem, which we reformulate as early top-k query processing in P2P systems. First, to complement response time, we introduce two new metrics, stabilization time and cumulative quality gap, with which we formally define the problem. Then, we propose an efficient algorithm that dynamically adapts to query loads of peers in order to return to the user top-k results as soon as possible, without waiting for the final result. We validated our solution through simulations over a real dataset. The results show that our solution significantly outperforms baseline solutions by returning high quality top-k results to users in much better times.
Processing Top-N Relational Queries by Learning
A top-N selection query against a relation is to find the N tuples that satisfy the query condition the best but not necessarily completely. In this paper, we propose a new method for evaluating top-N queries against a relation. This method employs a learning-based strategy. Initially, this method finds and saves the optimal search spaces for a small number of random top-N queries. The learned knowledge is then used to evaluate new queries. Extensive experiments are carried out to measure the performance of this strategy and the results indicate that it is highly competitive with existing techniques for both low-dimensional and high-dimensional data. Furthermore, the knowledge base can be updated based on new user queries to reflect new query patterns so that frequently submitted queries can be processed most efficiently. The maintenance and stability of the knowledge base are also addressed in the paper.