Results 1 - 10 of 70
Probabilistic skylines on uncertain data
In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB’07), Vienna, 2007
Cited by 103 (19 self)
Uncertain data are inherent in some important applications. Although a considerable amount of research has been dedicated to modeling uncertain data and answering some types of queries on uncertain data, how to conduct advanced analysis on uncertain data remains an open problem at large. In this paper, we tackle the problem of skyline analysis on uncertain data. We propose a novel probabilistic skyline model where an uncertain object may take a probability to be in the skyline, and a p-skyline contains all the objects whose skyline probabilities are at least p. Computing probabilistic skylines on large uncertain data sets is challenging. We develop two efficient algorithms. The bottom-up algorithm computes the skyline probabilities of some selected instances of uncertain objects, and uses those instances to prune other instances and uncertain objects effectively. The top-down algorithm recursively partitions the instances of uncertain objects into subsets, and prunes subsets and objects aggressively. Our experimental results on both the real NBA player data set and the benchmark synthetic data sets show that probabilistic skylines are interesting and useful, and our two algorithms are efficient on large data sets, and complementary to each other in performance.
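To make the p-skyline definition concrete, here is a minimal brute-force sketch. It is not the bottom-up or top-down algorithm from the abstract. It assumes that each uncertain object is a finite list of equally likely instances, that objects are independent, and that smaller values are better; the names `skyline_probabilities` and `p_skyline` are illustrative only.

```python
from typing import Dict, List, Sequence

Point = Sequence[float]


def dominates(a: Point, b: Point) -> bool:
    """a dominates b: no worse in every dimension, strictly better in one.
    Assumption: smaller values are better."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline_probabilities(objects: Dict[str, List[Point]]) -> Dict[str, float]:
    """Brute-force skyline probability of every uncertain object."""
    result = {}
    for name, instances in objects.items():
        total = 0.0
        for u in instances:
            # Probability that no other object takes an instance dominating u.
            prob = 1.0
            for other, other_instances in objects.items():
                if other == name:
                    continue
                dominating = sum(1 for v in other_instances if dominates(v, u))
                prob *= 1.0 - dominating / len(other_instances)
            total += prob
        # Instances are assumed equally likely, so average over them.
        result[name] = total / len(instances)
    return result


def p_skyline(objects: Dict[str, List[Point]], p: float) -> List[str]:
    """Names of the objects whose skyline probability is at least p."""
    probs = skyline_probabilities(objects)
    return sorted(name for name, pr in probs.items() if pr >= p)


objects = {
    "A": [(1, 1), (3, 3)],
    "B": [(2, 2), (4, 4)],
}
probs = skyline_probabilities(objects)  # A: 0.75, B: 0.25
result = p_skyline(objects, 0.5)        # ["A"]
```

The cost is quadratic in the total number of instances, which is the kind of expense the pruning in the two proposed algorithms is meant to avoid.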
Selecting Stars: The k Most Representative Skyline Operator
In Proc. of the Int. IEEE Conf. on Data Engineering (ICDE), 2007
Cited by 93 (3 self)
Skyline computation has many applications including multi-criteria decision making. In this paper, we study the problem of selecting k skyline points so that the number of points, which are dominated by at least one of these k skyline points, is maximized. We first present an efficient dynamic programming based exact algorithm in a 2d-space. Then, we show that the problem is NP-hard when the dimensionality is 3 or more and it can be approximately solved by a polynomial time algorithm with the guaranteed approximation ratio 1 − 1/e. To speed up the computation, an efficient, scalable, index-based randomized algorithm is developed by applying the FM probabilistic counting technique. A comprehensive performance evaluation demonstrates that our randomized technique is very efficient, highly accurate, and scalable.
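The selection objective above is an instance of maximum coverage, for which the standard greedy heuristic is known to reach the 1 − 1/e ratio. The sketch below shows that greedy selection. The abstract does not confirm that this is exactly the authors' algorithm. It assumes distinct points, smaller values being better, and an illustrative function name `greedy_representatives`.

```python
from typing import Dict, List, Set, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """Assumption: smaller values are better."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points: List[Point]) -> List[Point]:
    """Points not dominated by any other point (quadratic scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def greedy_representatives(points: List[Point], k: int) -> Tuple[List[Point], int]:
    """Pick up to k skyline points greedily by marginal coverage.

    Returns the chosen points and how many points they dominate in total.
    """
    sky = skyline(points)
    # For each skyline point, the indices of the points it dominates.
    covers: Dict[Point, Set[int]] = {
        s: {i for i, p in enumerate(points) if dominates(s, p)} for s in sky
    }
    chosen: List[Point] = []
    covered: Set[int] = set()
    for _ in range(min(k, len(sky))):
        # Take the skyline point that adds the most not-yet-covered points.
        best = max(
            (s for s in sky if s not in chosen),
            key=lambda s: len(covers[s] - covered),
        )
        chosen.append(best)
        covered |= covers[best]
    return chosen, len(covered)


points = [(1, 4), (4, 1), (2, 5), (3, 6), (5, 2), (6, 3), (7, 4)]
one = greedy_representatives(points, 1)  # ([(1, 4)], 3)
two = greedy_representatives(points, 2)  # ([(1, 4), (4, 1)], 5)
```

In the example, (7, 4) is dominated by both skyline points, so the second pick adds only two new points rather than three.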
On High Dimensional Skylines
EDBT 2006
Cited by 52 (6 self)
In many decision-making applications, the skyline query is frequently used to find a set of dominating data points (called skyline points) in a multidimensional dataset. In a high-dimensional space, skyline points no longer offer any interesting insights as there are too many of them. In this paper, we introduce a novel metric, called skyline frequency, that compares and ranks the interestingness of data points based on how often they are returned in the skyline when different numbers of dimensions (i.e., subspaces) are considered. Intuitively, a point with a high skyline frequency is more interesting as it can be dominated on fewer combinations of the dimensions. Thus, the problem becomes one of finding top-k frequent skyline points. But the algorithms thus far proposed for skyline computation typically do not scale well with dimensionality. Moreover, frequent skyline computation requires that skylines be computed for each of an exponential number of subsets of the dimensions. We present efficient approximate algorithms to address these twin difficulties. Our extensive performance study shows that our approximate algorithm can run fast and compute the correct result on large data sets in high-dimensional spaces.
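As an illustration of the skyline frequency metric, the following sketch counts, for every point, the number of non-empty subspaces in which it is a skyline point. This is the exact exponential enumeration that the abstract describes as the difficulty, not the approximate algorithms it proposes. It assumes smaller values are better; the function names are illustrative.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates_in(a: Point, b: Point, dims: Sequence[int]) -> bool:
    """a dominates b when only the dimensions in dims are considered.
    Assumption: smaller values are better."""
    return all(a[d] <= b[d] for d in dims) and any(a[d] < b[d] for d in dims)


def skyline_frequency(points: List[Point]) -> List[int]:
    """For each point, the number of non-empty subspaces whose skyline contains it."""
    n_dims = len(points[0])
    freq = [0] * len(points)
    for size in range(1, n_dims + 1):
        for dims in combinations(range(n_dims), size):  # 2^d - 1 subspaces in total
            for i, p in enumerate(points):
                if not any(dominates_in(q, p, dims) for q in points):
                    freq[i] += 1
    return freq


def top_k_frequent(points: List[Point], k: int) -> List[Tuple[Point, int]]:
    """The k points with the highest skyline frequency (ties keep input order)."""
    freq = skyline_frequency(points)
    order = sorted(range(len(points)), key=lambda i: -freq[i])
    return [(points[i], freq[i]) for i in order[:k]]


points = [(1, 3, 2), (2, 1, 3), (3, 2, 1), (2, 2, 2)]
freq = skyline_frequency(points)  # [4, 4, 4, 1]
```

In this example all four points are in the full-space skyline, yet (2, 2, 2) appears in no proper subspace skyline, so its frequency is only 1.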
Parallelizing skyline queries for scalable distribution
In EDBT’06, 2006
Cited by 51 (2 self)
Skyline queries help users make intelligent decisions over complex data, where different and often conflicting criteria are considered. Current skyline computation methods are restricted to centralized query processors, limiting scalability and imposing a single point of failure. In this paper, we address the problem of parallelizing skyline query execution over a large number of machines by leveraging content-based data partitioning. We present a novel distributed skyline query processing algorithm (DSL) that discovers skyline points progressively. We propose two mechanisms, recursive region partitioning and dynamic region encoding, to enforce a partial order on query propagation in order to pipeline query execution. Our analysis shows that DSL is optimal in terms of the total number of local query invocations across all machines. In addition, simulations and measurements of a deployed system show that our system load balances communication and processing costs across cluster machines, providing incremental scalability and significant performance improvement over alternative distribution mechanisms.
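The property that makes skyline queries parallelizable at all can be shown in a few lines: every global skyline point is also a skyline point of its own partition, so merging local skylines and filtering once more gives the exact result. The sketch below is not DSL. It uses round-robin partitioning instead of content-based partitioning and has no progressive or pipelined execution. The function names are illustrative, and smaller values are assumed to be better.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """Assumption: smaller values are better."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points: List[Point]) -> List[Point]:
    """Points not dominated by any other point (quadratic scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def partitioned_skyline(points: List[Point], n_parts: int) -> List[Point]:
    """Compute local skylines per partition, then merge and filter once more."""
    # Round-robin split; each part stands in for the data held by one machine.
    parts = [points[i::n_parts] for i in range(n_parts)]
    local = [skyline(part) for part in parts]
    # Only local skyline points can be global skyline points.
    candidates = [p for sk in local for p in sk]
    return skyline(candidates)


points = [(1, 5), (2, 2), (5, 1), (3, 3), (4, 4), (2, 6), (6, 2)]
merged = partitioned_skyline(points, 3)  # same set as skyline(points)
```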
Robust cardinality and cost estimation for skyline operator
In ICDE, 2006
Cited by 50 (0 self)
Incorporating the skyline operator inside the relational engine requires solving the cardinality estimation and the cost estimation problem, hitherto unaddressed. We propose robust techniques to estimate the cardinality and the computational cost of Skyline, and through an empirical comparison, show that our technique is substantially more effective than traditional approaches. Finally, we show through an implementation in Microsoft SQL Server that skyline queries can substantially benefit from our techniques.
SUBSKY: Efficient computation of skylines in subspaces
In ICDE, 2006
Cited by 49 (7 self)
Given a set of multidimensional points, the skyline contains the best points according to any preference function that is monotone on all axes. In practice, applications that require skyline analysis usually provide numerous candidate attributes, and various users depending on their interests may issue queries regarding different (small) subsets of the dimensions. Formally, given a relation with a large number (e.g., > 10) of attributes, a query aims at finding the skyline in an arbitrary subspace with a low dimensionality (e.g., 2). The existing algorithms do not support subspace skyline retrieval efficiently because they (i) require scanning the entire database at least once, or (ii) are optimized for one particular subspace but incur significant overhead for other subspaces. In this paper, we propose a technique, SUBSKY, which settles the problem using a single B-tree, and can be implemented in any relational database. The core of SUBSKY is a transformation that converts multidimensional data to 1D values, and enables several effective pruning heuristics. Extensive experiments with real data confirm that SUBSKY outperforms alternative approaches significantly in both efficiency and scalability.
Efficient Skyline Computation over Low-Cardinality Domains
2007
Cited by 49 (1 self)
Current skyline evaluation techniques follow a common paradigm that eliminates data elements from skyline consideration by finding other elements in the dataset that dominate them. The performance of such techniques is heavily influenced by the underlying data distribution (i.e. whether the dataset attributes are correlated, independent, or anticorrelated). In this paper, we propose the Lattice Skyline Algorithm (LS) that is built around a new paradigm for skyline evaluation on datasets with attributes that are drawn from low-cardinality domains. LS continues to apply even if one attribute has high cardinality. Many skyline applications naturally have such data characteristics, and previous skyline methods have not exploited this property. We show that for typical dimensionalities, the complexity of LS is linear in the number of input tuples. Furthermore, we show that the performance of LS is independent of the input data distribution. Finally, we demonstrate through extensive experimentation on both real and synthetic datasets that LS can result in a significant performance advantage over existing techniques.
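The lattice idea can be illustrated with a simplified sketch. It is inspired by the abstract and is not the authors' LS implementation. It assumes every attribute is an integer in `range(size)` for a small known `size`, that smaller values are better, and the function name `lattice_skyline` is illustrative. Tuples are never compared with each other: one pass marks the occupied lattice cells, and one pass over the lattice propagates the "dominated" flag.

```python
from itertools import product
from typing import List, Sequence, Set, Tuple

Cell = Tuple[int, ...]


def lattice_skyline(tuples: List[Cell], domain_sizes: Sequence[int]) -> List[Cell]:
    """Skyline over low-cardinality integer domains via a value lattice.

    Assumptions: attribute d takes values in range(domain_sizes[d]) and
    smaller values are better.
    """
    present: Set[Cell] = set(tuples)  # one pass over the input tuples
    dominated: Set[Cell] = set()
    # Lexicographic order visits every immediate predecessor of a cell
    # (one coordinate lowered by 1) before the cell itself.
    for cell in product(*(range(s) for s in domain_sizes)):
        for d, v in enumerate(cell):
            if v == 0:
                continue
            pred = cell[:d] + (v - 1,) + cell[d + 1:]
            # A cell is dominated if a strictly better cell is occupied,
            # either directly below it or further down the chain.
            if pred in present or pred in dominated:
                dominated.add(cell)
                break
    return [t for t in tuples if t not in dominated]


tuples = [(0, 2), (1, 1), (2, 0), (1, 2), (2, 2), (2, 1)]
sky = lattice_skyline(tuples, (3, 3))  # [(0, 2), (1, 1), (2, 0)]
```

The running time is proportional to the number of tuples plus the lattice size times the dimensionality, independent of how the tuples are distributed, which mirrors the two claims in the abstract under the stated assumptions.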
Refreshing the sky: the compressed skycube with efficient support for frequent updates
In SIGMOD, 2006
Cited by 45 (0 self)
The skyline query is important in many applications such as multi-criteria decision making, data mining, and user-preference queries. Given a set of d-dimensional objects, the skyline query finds the objects that are not dominated by others. In practice, different users may be interested in different dimensions of the data, and issue queries on any subset of d dimensions. This paper focuses on supporting concurrent and unpredictable subspace skyline queries in frequently updated databases. Simply computing and storing the skyline objects of every subspace in a skycube would incur an expensive update cost. In this paper, we investigate the important issue of updating the skycube in a dynamic environment. To balance the query cost and update cost, we propose a new structure, the compressed skycube, which concisely represents the complete skycube. We thoroughly explore the properties of the compressed skycube and provide an efficient object-aware update scheme. Experimental results show that the compressed skycube is both query and update efficient.
Distance-based Representative Skyline
Cited by 42 (2 self)
Given an integer k, a representative skyline contains the k skyline points that best describe the tradeoffs among different dimensions offered by the full skyline. Although this topic has been previously studied, the existing solution may sometimes produce k points that appear in an arbitrarily tiny cluster, and therefore, fail to be representative. Motivated by this, we propose a new definition of representative skyline that minimizes the distance between a non-representative skyline point and its nearest representative. We also study algorithms for computing distance-based representative skylines. In 2D space, there is a dynamic programming algorithm that guarantees the optimal solution. For dimensionality at least 3, we prove that the problem is NP-hard, and give a 2-approximate polynomial time algorithm. Using a multidimensional access method, our algorithm can directly report the representative skyline, without retrieving the full skyline. We show that our representative skyline not only better captures the contour of the entire skyline than the previous method, but also can be computed much faster.
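If the objective is read as minimizing the largest distance from any skyline point to its nearest representative (an assumption, since the abstract does not state the aggregate), the problem is a k-center instance. The classical farthest-first traversal is a 2-approximation for k-center. The sketch below shows that heuristic on an already computed skyline with Euclidean distance. The abstract does not confirm that it matches the paper's algorithm, and it does not use a multidimensional access method; the function names are illustrative.

```python
import math
from typing import List, Tuple

Point = Tuple[float, ...]


def farthest_first_representatives(skyline_points: List[Point], k: int) -> List[Point]:
    """Farthest-first traversal: a classical 2-approximation for k-center."""
    reps = [skyline_points[0]]  # arbitrary seed (assumption: the first point)
    while len(reps) < min(k, len(skyline_points)):
        # Add the skyline point that is farthest from its nearest representative.
        nxt = max(
            skyline_points,
            key=lambda p: min(math.dist(p, r) for r in reps),
        )
        reps.append(nxt)
    return reps


def representation_error(skyline_points: List[Point], reps: List[Point]) -> float:
    """Largest distance from a skyline point to its nearest representative."""
    return max(min(math.dist(p, r) for r in reps) for p in skyline_points)


sky = [(0, 10), (1, 9), (2, 8), (9, 1), (10, 0)]
reps = farthest_first_representatives(sky, 2)  # [(0, 10), (10, 0)]
err = representation_error(sky, reps)          # sqrt(8), reached at (2, 8)
```

Because each new representative is the point that is currently worst served, the chosen points spread along the contour rather than clustering, which is the behaviour the abstract asks for.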
Efficient Processing of Top-k Dominating Queries on Multi-Dimensional Data
2007
Cited by 40 (2 self)
The top-k dominating query returns k data objects which dominate the highest number of objects in a dataset. This query is an important tool for decision support since it provides data analysts an intuitive way for finding significant objects. In addition, it combines the advantages of top-k and skyline queries without sharing their disadvantages: (i) the output size can be controlled, (ii) no ranking functions need to be specified by users, and (iii) the result is independent of the scales at different dimensions. Despite their importance, top-k dominating queries have not received adequate attention from the research community. In this paper, we design specialized algorithms that apply on indexed multidimensional data and fully exploit the characteristics of the problem. Experiments on synthetic datasets demonstrate that our algorithms significantly outperform a previous skyline-based approach, while our results on real datasets show the meaningfulness of top-k dominating queries.
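The query semantics can be stated as a short brute-force sketch. It only pins down what is being computed. It is quadratic and uses no index, unlike the specialized algorithms in the abstract. Smaller values are assumed to be better, and the function name `top_k_dominating` is illustrative.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """Assumption: smaller values are better."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def top_k_dominating(points: List[Point], k: int) -> List[Tuple[int, Point]]:
    """The k points that dominate the most other points, with their scores."""
    scores = [(sum(dominates(p, q) for q in points), p) for p in points]
    # Stable sort by descending score, so ties keep input order.
    scores.sort(key=lambda s: -s[0])
    return scores[:k]


points = [(1, 1), (2, 3), (3, 2), (4, 4), (0, 5)]
top2 = top_k_dominating(points, 2)  # [(3, (1, 1)), (1, (2, 3))]
```

In the example, (0, 5) is a skyline point but dominates nothing, so it ranks last. The score depends only on comparisons within each dimension, which is why the result is independent of the scales of the dimensions.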