Results 1 - 10 of 10
Probabilistic skylines on uncertain data
In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB’07), Vienna, 2007
Cited by 103 (19 self)
Uncertain data are inherent in some important applications. Although a considerable amount of research has been dedicated to modeling uncertain data and answering some types of queries on uncertain data, how to conduct advanced analysis on uncertain data remains an open problem at large. In this paper, we tackle the problem of skyline analysis on uncertain data. We propose a novel probabilistic skyline model where an uncertain object may take a probability to be in the skyline, and a p-skyline contains all the objects whose skyline probabilities are at least p. Computing probabilistic skylines on large uncertain data sets is challenging. We develop two efficient algorithms. The bottom-up algorithm computes the skyline probabilities of some selected instances of uncertain objects, and uses those instances to prune other instances and uncertain objects effectively. The top-down algorithm recursively partitions the instances of uncertain objects into subsets, and prunes subsets and objects aggressively. Our experimental results on both the real NBA player data set and the benchmark synthetic data sets show that probabilistic skylines are interesting and useful, and our two algorithms are efficient on large data sets, and complementary to each other in performance.
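To make the p-skyline idea concrete, here is a minimal brute-force sketch in Python. It is not the bottom-up or top-down algorithm from the paper. It rests on assumptions the abstract does not state: each object is a list of equally likely instances, objects are independent, smaller values are better in every dimension, and an instance's skyline probability is the product, over all other objects, of the fraction of that object's instances that do not dominate it. The function names are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere
    (assumption: smaller values are better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline_probability(name, objects):
    """Average, over the instances of `name`, of the probability that no
    other object dominates that instance (equal instance weights assumed)."""
    instances = objects[name]
    total = 0.0
    for u in instances:
        prob = 1.0
        for other, other_instances in objects.items():
            if other == name:
                continue
            dominating = sum(dominates(v, u) for v in other_instances)
            prob *= 1.0 - dominating / len(other_instances)
        total += prob
    return total / len(instances)


def p_skyline(objects, p):
    """Names of all objects whose skyline probability is at least p."""
    return sorted(n for n in objects if skyline_probability(n, objects) >= p)


# Hypothetical toy data: object name -> list of instances (2-D points).
toy = {"A": [(1, 1), (3, 3)], "B": [(2, 2)], "C": [(4, 4)]}
print(p_skyline(toy, 0.5))
```

On the toy data, A and B each have skyline probability 0.5 and C has 0, so the 0.5-skyline contains A and B. The scan compares every instance against every other instance, which is the cost the pruning algorithms described above aim to avoid.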
Randomized Multi-pass Streaming Skyline Algorithms
VLDB'09, 2009
Cited by 11 (1 self)
We consider external algorithms for skyline computation without preprocessing. Our goal is to develop an algorithm with a good worst-case guarantee while performing well on average. Due to the nature of disks, it is desirable that such algorithms access the input as a stream (even if in multiple passes). Using the tools of randomness, which have proved useful in many applications, we present an efficient multi-pass streaming algorithm, RAND, for skyline computation. As far as we are aware, RAND is the first randomized skyline algorithm in the literature. RAND is near-optimal for the streaming model, which we prove via a simple lower bound. Additionally, our algorithm is distributable and can handle partially ordered domains on each attribute. Finally, we demonstrate the robustness of RAND via extensive experiments on both real and synthetic datasets. RAND is comparable to the existing algorithms in the average case and is additionally tolerant to simple modifications of the data, while other algorithms degrade considerably with such variation.
Representative Skylines using Threshold-based Preference Distributions
Cited by 8 (1 self)
The study of computing skylines and their variants has received considerable attention in recent years. Skylines are essentially sets of most interesting (undominated) tuples in a database. However, since the number of tuples in a skyline is often too large to be useful to potential users, much research effort has been devoted to identifying a smaller subset of (say k) “representative skyline” points. Several different definitions/formulations of representative skylines have been considered in the literature. Most of these formulations (i.e., objective functions) are intuitive in the sense they try to achieve some kind of clustering “spread” over the entire skyline, with k representative points. In this work, we have taken a more principled approach in defining the representative skyline objective. One of our major contributions is to formulate and solve the problem of displaying k representative skyline points such that the probability that a random user would
Prominent Streak Discovery in Sequence Data
Cited by 5 (2 self)
This paper studies the problem of prominent streak discovery in sequence data. Given a sequence of values, a prominent streak is a long consecutive subsequence consisting of only large (small) values. For finding prominent streaks, we make the observation that prominent streaks are skyline points in two dimensions – streak interval length and minimum value in the interval. Our solution thus hinges upon the idea to separate the two steps in prominent streak discovery – candidate streak generation and skyline operation over candidate streaks. For candidate generation, we propose the concept of local prominent streak (LPS). We prove that prominent streaks are a subset of LPSs and the number of LPSs is less than the length of a data sequence, in comparison with the quadratic number of candidates produced by a brute-force baseline method. We develop efficient algorithms based on the concept of LPS. The nonlinear LPS-based method (NLPS) considers a superset of LPSs as candidates, and the linear LPS-based method (LLPS) further guarantees to consider only LPSs. The results of experiments using multiple real datasets verified the effectiveness of the proposed methods and showed orders-of-magnitude performance improvement over the baseline method.
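The two-step view described above (generate candidate streaks, then take a skyline over them) can be sketched with the brute-force baseline. This is a minimal illustration in Python, not the LPS-based methods. It assumes the "large values" variant, where a streak is better if it is longer or has a larger minimum value. The function name and the tuple layout are hypothetical.

```python
def prominent_streaks(values):
    """Brute-force baseline: enumerate every interval [l, r] (quadratically
    many candidates), then keep those not dominated in (length, minimum)."""
    n = len(values)
    candidates = []  # tuples of (l, r, length, minimum)
    for l in range(n):
        lowest = values[l]
        for r in range(l, n):
            lowest = min(lowest, values[r])
            candidates.append((l, r, r - l + 1, lowest))

    def dominated(c):
        # Some other candidate is at least as long and has at least as
        # large a minimum, and is strictly better in one of the two.
        _, _, length, lowest = c
        return any(
            o_len >= length and o_min >= lowest
            and (o_len > length or o_min > lowest)
            for _, _, o_len, o_min in candidates
        )

    return [c for c in candidates if not dominated(c)]


for l, r, length, lowest in prominent_streaks([3, 1, 4, 4, 2]):
    print(f"interval [{l}, {r}]: length {length}, minimum {lowest}")
```

For the sequence 3, 1, 4, 4, 2 the undominated streaks are the whole sequence (length 5, minimum 1), positions 2 to 4 (length 3, minimum 2), and positions 2 to 3 (length 2, minimum 4). The naive dominance check here is deliberately simple and slow; it only illustrates why shrinking the candidate set matters.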
Stable matching
Cited by 3 (1 self)
Recommendations for two-way selections using skyline view queries
Ranking Large Temporal Data
Cited by 2 (0 self)
Ranking temporal data has not been studied until recently, even though ranking is an important operator (being promoted as a first-class citizen) in database systems. However, only the instant top-k queries on temporal data have been studied, where objects with the k highest scores at a query time instance t are to be retrieved. The instant top-k definition clearly comes with limitations (sensitive to outliers, difficult to choose a meaningful query time t). A more flexible and general ranking operation is to rank objects based on the aggregation of their scores in a query interval, which we dub the aggregate top-k query on temporal data. For example, return the top-10 weather stations having the highest average temperature from 10/01/2010 to 10/07/2010; find the top-20 stocks having the largest total transaction volumes from 02/05/2011 to 02/07/2011. This work presents a comprehensive study of this problem by designing both exact and approximate methods (with approximation quality guarantees). We also provide theoretical analysis of the construction cost, the index size, and the update and query costs of each approach. Extensive experiments on large real datasets clearly demonstrate the efficiency, the effectiveness, and the scalability of our methods compared to the baseline methods.
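The aggregate top-k query can be illustrated with a plain linear scan. The sketch below is a naive baseline in Python, not one of the exact or approximate indexed methods the abstract refers to. It assumes the data arrive as discrete (object, time, score) readings, which the abstract does not specify, and the function and parameter names are hypothetical.

```python
import heapq
from collections import defaultdict


def aggregate_top_k(readings, t_start, t_end, k, agg="avg"):
    """Rank objects by the sum or average of their scores with
    t_start <= time <= t_end and return the k best as (object, score)."""
    if agg not in ("avg", "sum"):
        raise ValueError("agg must be 'avg' or 'sum'")
    totals = defaultdict(float)
    counts = defaultdict(int)
    for obj, t, score in readings:
        if t_start <= t <= t_end:
            totals[obj] += score
            counts[obj] += 1
    if agg == "avg":
        scores = {obj: totals[obj] / counts[obj] for obj in totals}
    else:
        scores = dict(totals)
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical readings: (station, day, temperature).
readings = [
    ("s1", 1, 10.0), ("s1", 2, 20.0),
    ("s2", 1, 30.0), ("s2", 3, 0.0),
    ("s3", 2, 18.0),
]
print(aggregate_top_k(readings, 1, 2, k=2))
```

Over days 1 to 2 the averages are 15.0 for s1, 30.0 for s2 and 18.0 for s3, so the top-2 by average are s2 and s3. Every query rescans all readings, which is the cost an index is meant to remove.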
Identifying Interesting Instances for Probabilistic Skylines
2009
Uncertain data arises from various applications such as sensor networks, scientific data management, data integration, and location-based applications. While significant research efforts have been dedicated to modeling, managing and querying uncertain data, advanced analysis of uncertain data is still in its early stages. In this paper, we focus on skyline analysis of uncertain data, modeled as uncertain objects with probability distributions over a set of possible values called instances. Computing the exact skyline probabilities of instances is expensive, and unnecessary when the user is only interested in instances with skyline probabilities over a certain threshold. We propose two filtering schemes for this case: a preliminary scheme that bounds an instance’s skyline probability for filtering, and an elaborate scheme that
Skyline · Skyline distance · Skyline boundary
2011
Skyline has been widely recognized as being useful for multi-criteria decision-making applications. While most of the existing work computes skylines in various contexts, in this paper, we consider a novel problem: how far away is a point from the skyline? We propose a novel notion of skyline distance that measures the minimum cost of upgrading a point to the skyline given a cost function. Skyline distance can be regarded as a measure of multidimensional competence and can be used to rank possible choices in recommendation systems. Computing skyline distances efficiently is far from trivial and cannot be handled by any straightforward extension of the existing skyline computation methods. To tackle this problem, we systematically explore several directions. We first present a dynamic programming method. Then, we investigate the boundary of skylines and develop a sort-projection method that utilizes the skyline boundary in calculating skyline distances. Last, we develop a space partitioning method to further improve the performance. We report extensive experimental results which show that our methods are efficient and scalable.
Top-k Queries on Temporal Data
The database community has devoted an extensive amount of effort to indexing and querying temporal data in the past decades. However, insufficient attention has been paid to temporal ranking queries. More precisely, given any time instance t, the query asks for the top-k objects at time t with respect to some score attribute. Some generic indexing structures based on R-trees do support ranking queries on temporal data, but as they are not tailored for such queries, the performance is far from satisfactory. We present the Seb-tree, a simple indexing scheme that supports temporal ranking queries much more efficiently. The Seb-tree answers a top-k query for any time instance t in the optimal number of I/Os in expectation, namely $O(\log_B \frac{N}{B} + \frac{k}{B})$ I/Os, where N is the size of the data set and B is the disk block size. The index has near-linear size (for constant and reasonable kmax values, where kmax is the maximum value of the query parameter k), can be constructed in near-linear time, and also supports insertions and deletions without affecting its query performance guarantee. Most of all, the Seb-tree is especially appealing in practice due to its simplicity, as it uses the B-tree as the only building block. Extensive experiments on a number of large data sets show that the Seb-tree is more than an order of magnitude faster than the R-tree-based indexes for temporal ranking queries.
Identifying Interesting Instances for Probabilistic Skylines
Significant research efforts have recently been dedicated to modeling and querying uncertain data. In this paper, we focus on skyline analysis of uncertain data, modeled as uncertain objects with probability distributions over a set of possible values called instances. Computing the exact skyline probabilities of instances is expensive, and unnecessary when the user is only interested in instances with skyline probabilities over a certain threshold. We propose two filtering schemes for this case: a preliminary scheme that bounds an instance’s skyline probability for filtering, and an elaborate scheme that uses an instance’s bounds to filter other instances based on the dominance relationship. We experimentally demonstrate the effectiveness of our filtering schemes on both real and synthetic data sets and show the efficiency of our schemes compared with other algorithms.