Probabilistic skylines on uncertain data
In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB’07), Vienna, 2007
Cited by 103 (19 self)
Uncertain data are inherent in some important applications. Although a considerable amount of research has been dedicated to modeling uncertain data and answering some types of queries on uncertain data, how to conduct advanced analysis on uncertain data remains largely an open problem. In this paper, we tackle the problem of skyline analysis on uncertain data. We propose a novel probabilistic skyline model where an uncertain object has a probability of being in the skyline, and a p-skyline contains all the objects whose skyline probabilities are at least p. Computing probabilistic skylines on large uncertain data sets is challenging. We develop two efficient algorithms. The bottom-up algorithm computes the skyline probabilities of some selected instances of uncertain objects, and uses those instances to prune other instances and uncertain objects effectively. The top-down algorithm recursively partitions the instances of uncertain objects into subsets, and prunes subsets and objects aggressively. Our experimental results on both the real NBA player data set and the benchmark synthetic data sets show that probabilistic skylines are interesting and useful, and that our two algorithms are efficient on large data sets and complementary to each other in performance.
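The p-skyline definition in this abstract can be illustrated with a brute-force sketch. It is not the bottom-up or top-down algorithm from the paper. It assumes (none of this is stated in the abstract) that each uncertain object is a list of equally likely instances, that objects are independent, and that smaller values are better on every dimension. The names `dominates`, `skyline_probability`, and `p_skyline` are hypothetical.

```python
from math import prod


def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere.

    Assumption: smaller values are preferred on every dimension.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline_probability(name, objects):
    """Probability that object `name` is in the skyline.

    Assumptions: instances of an object are equally likely and objects are
    independent, so the chance that an instance u survives is the product,
    over all other objects, of the fraction of their instances that do NOT
    dominate u.
    """
    instances = objects[name]
    total = 0.0
    for u in instances:
        total += prod(
            1.0 - sum(dominates(v, u) for v in other) / len(other)
            for other_name, other in objects.items()
            if other_name != name
        )
    return total / len(instances)


def p_skyline(objects, p):
    """All objects whose skyline probability is at least p."""
    return {name for name in objects if skyline_probability(name, objects) >= p}


# Toy data: three uncertain objects in two dimensions.
objects = {
    "A": [(1, 1), (3, 3)],
    "B": [(2, 2)],
    "C": [(4, 4)],
}
result = p_skyline(objects, 0.5)
```

With this toy data, A and B each have skyline probability 0.5 and C has probability 0, so the 0.5-skyline is {A, B}. The sketch compares every instance with every other instance, which is exactly the cost the pruning algorithms in the paper are designed to avoid.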
Computing all skyline probabilities for uncertain data
In PODS, 2009
Cited by 23 (2 self)
Skyline computation is widely used in multi-criteria decision making. As research in uncertain databases draws increasing attention, skyline queries with uncertain data have also been studied, e.g. probabilistic skylines. The previous work requires “thresholding” for its efficiency – the efficiency relies on the assumption that points with skyline probabilities below a certain threshold can be ignored. But there are situations where “thresholding” is not desirable – low-probability events cannot be ignored when their consequences are significant. In such cases it is necessary to compute the skyline probabilities of all data items. We provide the first algorithm for this problem whose worst-case time complexity is sub-quadratic. The techniques we use are interesting in their own right, as they combine a space-partitioning technique with an existing dominance counting algorithm. The effectiveness of our algorithm is experimentally verified.
Categories and Subject Descriptors: H.2.4 [Database Management]: Systems, Query processing
Stochastic Skyline Operator
Cited by 11 (3 self)
In many applications involving multi-criteria optimal decision making, users may often want to make a personal trade-off among all optimal solutions. As a key feature, the skyline in a multidimensional space provides the minimum set of candidates for such purposes by removing all points not preferred by any (monotonic) utility/scoring function; that is, the skyline removes all objects not preferred by any user no matter how their preferences vary. Driven by many applications with uncertain data, the probabilistic skyline model is proposed to retrieve uncertain objects based on skyline probabilities. Nevertheless, skyline probabilities cannot capture the preferences of monotonic utility functions. Motivated by this, in this paper we propose a novel skyline operator, namely the stochastic skyline. In light of the expected utility principle, the stochastic skyline is guaranteed to provide the minimum set of candidates for the optimal solutions over all possible monotonic multiplicative utility functions. In contrast to the conventional skyline or the probabilistic skyline computation, we show that the problem of stochastic skyline is NP-complete with respect to the dimensionality. Novel algorithms are developed to efficiently compute the stochastic skyline over multidimensional uncertain data; they run in polynomial time if the dimensionality is fixed. We also show, by theoretical analysis and experiments, that the size of the stochastic skyline is quite similar to that of the conventional skyline over certain data. Comprehensive experiments demonstrate that our techniques are efficient and scalable regarding both CPU and I/O costs.
Randomized Multipass Streaming Skyline Algorithms
VLDB'09, 2009
Cited by 11 (1 self)
We consider external algorithms for skyline computation without preprocessing. Our goal is to develop an algorithm with a good worst-case guarantee while performing well on average. Due to the nature of disks, it is desirable that such algorithms access the input as a stream (even if in multiple passes). Using randomization, which has proved useful in many applications, we present an efficient multi-pass streaming algorithm, RAND, for skyline computation. As far as we are aware, RAND is the first randomized skyline algorithm in the literature. RAND is near-optimal for the streaming model, which we prove via a simple lower bound. Additionally, our algorithm is distributable and can handle partially ordered domains on each attribute. Finally, we demonstrate the robustness of RAND via extensive experiments on both real and synthetic datasets. RAND is comparable to the existing algorithms in the average case and is additionally tolerant to simple modifications of the data, while other algorithms degrade considerably with such variation.
A Unified Approach for Computing Top-k Pairs in Multidimensional Space
Cited by 10 (5 self)
Top-k pairs queries have many real applications. k-closest pairs queries, k-furthest pairs queries and their bichromatic variants are some examples of top-k pairs queries that rank the pairs on distance functions. While these queries have received significant research attention, there does not exist a unified approach that can efficiently answer all these queries. Moreover, there is no existing work that supports top-k pairs queries based on generic scoring functions. In this paper, we present a unified approach that supports a broad class of top-k pairs queries including the queries mentioned above. Our proposed approach allows the users to define a local scoring function for each attribute involved in the query and a global scoring function that computes the final score of each pair by combining its scores on different attributes. We propose efficient internal and external memory algorithms, and our theoretical analysis shows that the expected performance of the algorithms is optimal when two or fewer attributes are involved. Our approach does not require any pre-built indexes, is easy to implement and has a low memory requirement. We conduct extensive experiments to demonstrate the efficiency of our proposed approach.
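The query model in this abstract (one local scoring function per attribute, combined by a global scoring function) can be made concrete with a naive baseline. This sketch enumerates every pair, so it is quadratic and is not the paper's algorithm. The function name `top_k_pairs`, the convention that a lower score ranks higher, and the sample points are all assumptions made for illustration.

```python
import heapq
from itertools import combinations


def top_k_pairs(objects, k, local_scores, global_score):
    """Return the k pairs with the smallest global score (brute force).

    local_scores: one function per attribute, applied to the two attribute values.
    global_score: combines the list of local scores into a single number.
    Assumption: a lower score means a better rank.
    """
    def score(pair):
        a, b = pair
        return global_score([f(x, y) for f, x, y in zip(local_scores, a, b)])

    return heapq.nsmallest(k, combinations(objects, 2), key=score)


def abs_diff(x, y):
    """Local score for one attribute: absolute difference."""
    return abs(x - y)


points = [(0, 0), (1, 0), (5, 5), (5, 6)]

# k closest pairs under the L1 distance: sum the per-attribute differences.
closest = top_k_pairs(points, 2, [abs_diff, abs_diff], sum)

# k furthest pairs: negate the global score so the largest distance ranks first.
furthest = top_k_pairs(points, 1, [abs_diff, abs_diff], lambda s: -sum(s))
```

Swapping the global function is enough to move between closest-pairs and furthest-pairs queries, which is the kind of generality the abstract describes.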
(Approximate) uncertain skylines
In ICDT, 2011
Cited by 8 (1 self)
Given a set of points with uncertain locations, we consider the problem of computing the probability of each point lying on the skyline, that is, the probability that it is not dominated by any other input point. If each point’s uncertainty is described as a probability distribution over a discrete set of locations, we improve the best known exact solution. We also suggest why we believe our solution might be optimal. Next, we describe simple, near-linear time approximation algorithms for computing the probability of each point lying on the skyline. In addition, some of our methods can be adapted to construct data structures that can efficiently determine the probability of a query point lying on the skyline.
Efficiently Monitoring Top-k Pairs over Sliding Windows
Cited by 6 (4 self)
Top-k pairs queries have received significant attention from the research community. k-closest pairs queries, k-furthest pairs queries and their variants are among the most well-studied special cases of top-k pairs queries. In this paper, we present the first approach to answer a broad class of top-k pairs queries over sliding windows. Our framework handles multiple top-k pairs queries, and each query is allowed to use a different scoring function, a different value of k and a different size of the sliding window. Although the number of possible pairs in the sliding window is quadratic in the number of objects N in the sliding window, we efficiently answer the top-k pairs query by maintaining a small subset of pairs called the K-skyband, which is expected to consist of O(K log(N/K)) pairs. For all the queries that use the same scoring function, we need to maintain only one K-skyband. We present efficient techniques for K-skyband maintenance and query answering. We conduct a detailed complexity analysis and show that the expected cost of our approach is reasonably close to the lower bound cost. We experimentally verify this by comparing our approach with a specially designed supreme algorithm that assumes the existence of an oracle and meets the lower bound cost.
Threshold Query Optimization for Uncertain Data
Cited by 6 (0 self)
The probabilistic threshold query (PTQ) is one of the most common queries in uncertain databases, where all results satisfying the query with probabilities that meet the threshold requirement are returned. PTQ is used widely in nearest-neighbor queries, range queries, ranking queries, etc. In this paper, we investigate the general PTQ for arbitrary SQL queries that involve selections, projections and joins. The uncertain database model that we use is one that combines both attribute and tuple uncertainty as well as correlations between arbitrary attribute sets. We address the PTQ optimization problem that aims at improving the efficiency of PTQ query execution by enabling alternative query plan enumeration for optimization. We propose general optimization rules as well as rules specifically for selections, projections and joins. We introduce a threshold operator (τ-operator) to the query plan and show it is generally desirable to push down the τ-operator as much as possible. Our PTQ optimizations are evaluated in a real uncertain database management system. Our experiments on both real and synthetic data sets show that the optimizations improve the PTQ query processing time.
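One simple case where pushing the τ-operator below a join is safe can be sketched as follows. The sketch assumes independent tuple-level uncertainty only, which is much narrower than the paper's model with attribute uncertainty and correlations. Under that assumption a joined tuple has probability p_left * p_right, which never exceeds either input probability, so an input tuple below the threshold cannot contribute to the answer. The functions `threshold` and `join` and the sample relations are hypothetical.

```python
def threshold(tuples, tau):
    """tau-operator: keep (row, probability) entries with probability >= tau."""
    return [(row, p) for row, p in tuples if p >= tau]


def join(left, right):
    """Equi-join on the first field of each row.

    Assumption: tuples are independent, so the probability of a joined
    tuple is the product of the two input probabilities.
    """
    return [
        ((l_row, r_row), l_p * r_p)
        for l_row, l_p in left
        for r_row, r_p in right
        if l_row[0] == r_row[0]
    ]


left = [((1, "a"), 0.9), ((2, "b"), 0.4)]
right = [((1, "x"), 0.8), ((2, "y"), 0.9), ((1, "z"), 0.5)]
tau = 0.6

# Plan 1: join everything, then apply the threshold at the top of the plan.
late = threshold(join(left, right), tau)

# Plan 2: push the threshold below the join, then re-check at the top.
pushed = threshold(join(threshold(left, tau), threshold(right, tau)), tau)
```

Both plans return the same single result here, but the pushed-down plan joins one left tuple with two right tuples instead of two with three. The final threshold is still needed in the second plan because two tuples that each pass τ can have a product below it.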
Parallel Computation of Skyline and Reverse Skyline Queries Using MapReduce
Cited by 3 (0 self)
The skyline operator and its variants such as dynamic skyline and reverse skyline operators have attracted considerable attention recently due to their broad applications. However, computing such operators is challenging today because applications increasingly deal with big data. For such data-intensive applications, the MapReduce framework has been widely used recently. In this paper, we propose efficient parallel algorithms for processing the skyline and its variants using MapReduce. We first build histograms to effectively prune out non-skyline (non-reverse-skyline) points in advance. We next partition data based on the regions divided by the histograms and compute candidate (reverse) skyline points for each region independently using MapReduce. Finally, we check whether each candidate point is actually a (reverse) skyline point in every region independently. Our performance study confirms the effectiveness and scalability of the proposed algorithms.
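The partition, local skyline, and final check pipeline described here can be sketched in plain Python. No MapReduce framework is used, and the histogram-based pruning step is omitted. The sketch splits points into vertical strips on the first attribute, and the final check runs in one place instead of per region. All of these are simplifications, not the paper's method. Smaller values are assumed to be better, and every function name is hypothetical.

```python
from collections import defaultdict


def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere.

    Assumption: smaller values are preferred on every dimension.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def local_skyline(points):
    """Points not dominated by any other point in the same list (quadratic)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def map_phase(points, cell_width):
    """Assign each point to a region keyed by its first coordinate."""
    regions = defaultdict(list)
    for p in points:
        regions[int(p[0] // cell_width)].append(p)
    return regions


def reduce_phase(regions):
    """Compute candidates per region, then filter candidates against each other."""
    candidates = [p for pts in regions.values() for p in local_skyline(pts)]
    return local_skyline(candidates)


points = [(1, 9), (2, 7), (3, 8), (5, 3), (6, 4), (8, 1), (9, 2)]
skyline = reduce_phase(map_phase(points, cell_width=4))
```

The result matches a skyline computed directly on the full input. Every global skyline point survives in its own region, and any other candidate is dominated by a global skyline point that is also a candidate. Each call to `local_skyline` inside `reduce_phase` touches only one region, so those calls are the part that could run in parallel.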
A Generic Framework for Top-k Pairs and Top-k Objects Queries over Sliding Windows
Transactions on Knowledge and Data Engineering
Cited by 2 (2 self)
Top-k pairs and top-k objects queries have received significant attention from the research community. In this paper, we present the first approach to answer a broad class of top-k pairs and top-k objects queries over sliding windows. Our framework handles multiple top-k queries, and each query is allowed to use a different scoring function, a different value of k and a different size of the sliding window. Furthermore, the framework allows the users to define arbitrarily complex scoring functions and supports out-of-order data streams. For all the queries that use the same scoring function, we need to maintain only one K-skyband. We present efficient techniques for K-skyband maintenance and query answering. We conduct a detailed complexity analysis and show that the expected cost of our approach is reasonably close to the lower bound cost. For top-k pairs queries, we demonstrate the efficiency of our approach by comparing it with a specially designed supreme algorithm that assumes the existence of an oracle and meets the lower bound cost. For top-k objects queries, our experimental results demonstrate the superiority of our algorithm over the state-of-the-art algorithm.