Results 1 - 10
of
11
Probabilistic skylines on uncertain data
- In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB’07), Viena
, 2007
"... Uncertain data are inherent in some important applications. Although a considerable amount of research has been dedicated to modeling uncertain data and answering some types of queries on uncertain data, how to conduct advanced analysis on uncertain data remains an open problem at large. In this pap ..."
Abstract
-
Cited by 39 (10 self)
- Add to MetaCart
Uncertain data are inherent in some important applications. Although a considerable amount of research has been dedicated to modeling uncertain data and answering some types of queries on uncertain data, how to conduct advanced analysis on uncertain data remains an open problem at large. In this paper, we tackle the problem of skyline analysis on uncertain data. We propose a novel probabilistic skyline model where an uncertain object may take a probability to be in the skyline, and a p-skyline contains all the objects whose skyline probabilities are at least p. Computing probabilistic skylines on large uncertain data sets is challenging. We develop two efficient algorithms. The bottom-up algorithm computes the skyline probabilities of some selected instances of uncertain objects, and uses those instances to prune other instances and uncertain objects effectively. The top-down algorithm recursively partitions the instances of uncertain objects into subsets, and prunes subsets and objects aggressively. Our experimental results on both the real NBA player data set and the benchmark synthetic data sets show that probabilistic skylines are interesting and useful, and our two algorithms are efficient on large data sets, and complementary to each other in performance. 1.
Computing all skyline probabilities for uncertain data
- In PODS
, 2009
"... Skyline computation is widely used in multi-criteria decision making. As research in uncertain databases draws increasing attention, skyline queries with uncertain data have also been studied, e.g. probabilistic skylines. The previous work requires “thresholding ” for its efficiency – the efficiency ..."
Abstract
-
Cited by 8 (2 self)
- Add to MetaCart
Skyline computation is widely used in multi-criteria decision making. As research in uncertain databases draws increasing attention, skyline queries with uncertain data have also been studied, e.g. probabilistic skylines. The previous work requires “thresholding ” for its efficiency – the efficiency relies on the assumption that points with skyline probabilities below a certain threshold can be ignored. But there are situations where “thresholding”is not desirable – low probability events cannot be ignored when their consequences are significant. In such cases it is necessary to compute skyline probabilities of all data items. We provide the first algorithm for this problem whose worst-case time complexity is sub-quadratic. The techniques we use are interesting in their own right, as they rely on a space partitioning technique combined with using the existing dominance counting algorithm. The effectiveness of our algorithm is experimentally verified. Categories and Subject Descriptors H.2.4 [Database Management]: Systems—Query processing;
A Unified Approach for Computing Top-k Pairs in Multidimensional Space
"... Abstract—Top-k pairs queries have many real applications. k closest pairs queries, k furthest pairs queries and their bichromatic variants are some of the examples of the top-k pairs queries that rank the pairs on distance functions. While these queries have received significant research attention, ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
Abstract—Top-k pairs queries have many real applications. k closest pairs queries, k furthest pairs queries and their bichromatic variants are some of the examples of the top-k pairs queries that rank the pairs on distance functions. While these queries have received significant research attention, there does not exist a unified approach that can efficiently answer all these queries. Moreover, there is no existing work that supports top-k pairs queries based on generic scoring functions. In this paper, we present a unified approach that supports a broad class of top-k pairs queries including the queries mentioned above. Our proposed approach allows the users to define a local scoring function for each attribute involved in the query and a global scoring function that computes the final score of each pair by combining its scores on different attributes. We propose efficient internal and external memory algorithms and our theoretical analysis shows that the expected performance of the algorithms is optimal when two or less attributes are involved. Our approach does not require any pre-built indexes, is easy to implement and has low memory requirement. We conduct extensive experiments to demonstrate the efficiency of our proposed approach. I.
Stochastic Skyline Operator
"... Abstract — In many applications involving the multiple criteria optimal decision making, users may often want to make a personal trade-off among all optimal solutions. As a key feature, the skyline in a multi-dimensional space provides the minimum set of candidates for such purposes by removing all ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
Abstract — In many applications involving the multiple criteria optimal decision making, users may often want to make a personal trade-off among all optimal solutions. As a key feature, the skyline in a multi-dimensional space provides the minimum set of candidates for such purposes by removing all points not preferred by any (monotonic) utility/scoring functions; that is, the skyline removes all objects not preferred by any user no mater how their preferences vary. Driven by many applications with uncertain data, the probabilistic skyline model is proposed to retrieve uncertain objects based on skyline probabilities. Nevertheless, skyline probabilities cannot capture the preferences of monotonic utility functions. Motivated by this, in this paper we propose a novel skyline operator, namely stochastic skyline. In the light of the expected utility principle, stochastic skyline guarantees to provide the minimum set of candidates for the optimal solutions over all possible monotonic multiplicative utility functions. In contrast to the conventional skyline or the probabilistic skyline computation, we show that the problem of stochastic skyline is NP-complete with respect to the dimensionality. Novel and efficient algorithms are developed to efficiently compute stochastic skyline over multidimensional uncertain data, which run in polynomial time if the dimensionality is fixed. We also show, by theoretical analysis and experiments, that the size of stochastic skyline is quite similar to that of conventional skyline over certain data. Comprehensive experiments demonstrate that our techniques are efficient and scalable regarding both CPU and IO costs. I.
Threshold Query Optimization for Uncertain Data
"... The probabilistic threshold query (PTQ) is one of the most common queries in uncertain databases, where all results satisfying the query with probabilities that meet the threshold requirement are returned. PTQ is used widely in nearest-neighbor queries, range queries, ranking queries, etc. In this p ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
The probabilistic threshold query (PTQ) is one of the most common queries in uncertain databases, where all results satisfying the query with probabilities that meet the threshold requirement are returned. PTQ is used widely in nearest-neighbor queries, range queries, ranking queries, etc. In this paper, we investigate the general PTQ for arbitrary SQL queries that involve selections, projections and joins. The uncertain database model that we use is one that combines both attribute and tuple uncertainty as well as correlations between arbitrary attribute sets. We address the PTQ optimization problem that aims at improving the efficiency of PTQ query execution by enabling alternative query plan enumeration for optimization. We propose general optimization rules as well as rules specifically for selections, projections and joins. We introduce a threshold operator (τ-operator) to the query plan and show it is generally desirable to push down the τ-operator as much as possible. Our PTQ optimizations are evaluated in a real uncertain database management system. Our experiments on both real and synthetic data sets show that the optimizations improve the PTQ query processing time.
Probabilities and Sets in Preference Querying
, 2010
"... I am deeply indebted to my adviser Dr. Jan Chomicki for his continuous encouragement, guidance and support in every stage of my PhD study. His enthusiasm, knowledge and diligence have been a constant source of inspiration for me. Through him, I see what a great researcher should be. His vision and e ..."
Abstract
- Add to MetaCart
I am deeply indebted to my adviser Dr. Jan Chomicki for his continuous encouragement, guidance and support in every stage of my PhD study. His enthusiasm, knowledge and diligence have been a constant source of inspiration for me. Through him, I see what a great researcher should be. His vision and experience guided me consistently through the dissertation writing and made this experience enjoyable. I am very grateful to the other brilliant faculty members who served on my committee: Dr. Hung Q. Ngo and Dr. Michalis Petropoulos. Dr. Hung Q. Ngo, whose lectures were by far my favorites, enlightened me on numerous math puzzles, one of which later turned out to be very useful in this dissertation. Dr. Michalis Petropoulos provided warm encouragement and valuable comments throughout the course of writing this dissertation. I would also like to thank the distinguished database researchers at AT&T Labs, Dr. Graham Cormode, Dr. Lukasz Golab, Dr. Flip Korn and Dr. Divesh Srivastava, with whom I had the honor to work with. Their passion, curiosity and spirits are contagious. They generously shared their vision and provided insightful comments on early drafts of my work and other research problems. This dissertation is dedicated to my beloved husband Chao Chen, my most loving parents Ling Wei and Zhemin Zhang, and my most loyal friends Lan, Jing, Michelle, Sixia and Slawek. It simply would not have been possible without their unconditional love and support.
k-Selection Query over Uncertain Data
"... Abstract. This paper studies a new query on uncertain data, called k-selection query. Given an uncertain dataset of N objects, where each object is associated with a preference score and a presence probability, a k-selection query returns k objects such that the expected score of the “best available ..."
Abstract
- Add to MetaCart
Abstract. This paper studies a new query on uncertain data, called k-selection query. Given an uncertain dataset of N objects, where each object is associated with a preference score and a presence probability, a k-selection query returns k objects such that the expected score of the “best available ” objects is maximized. This query is useful in many application domains such as entity web search and decision making. In evaluating k-selection queries, we need to compute the expected best score (EBS) for candidate k-selection sets and search for the optimal selection set with the highest EBS. Those operations are costly due to the extremely large search space. In this paper, we identify several important properties of k-selection queries, including EBS decomposition, query recursion, and EBS bounding. Based upon these properties, we first present a dynamic programming (DP) algorithm that answers the query in O(k · N) time. Further, we propose a Bounding-and-Pruning (BP) algorithm, that exploits effective search space pruning strategies to find the optimal selection without accessing all objects. We evaluate the DP and BP algorithms using both synthetic and real data. The results show that the proposed algorithms outperform the baseline approach by several orders of magnitude. 1
Identifying Interesting Instances for Probabilistic Skylines ⋆
"... Abstract. Significant research efforts have recently been dedicated to modeling and querying uncertain data. In this paper, we focus on skyline analysis of uncertain data, modeled as uncertain objects with probability distributions over a set of possible values called instances. Computing the exact ..."
Abstract
- Add to MetaCart
Abstract. Significant research efforts have recently been dedicated to modeling and querying uncertain data. In this paper, we focus on skyline analysis of uncertain data, modeled as uncertain objects with probability distributions over a set of possible values called instances. Computing the exact skyline probabilities of instances is expensive, and unnecessary when the user is only interested in instances with skyline probabilities over a certain threshold. We propose two filtering schemes for this case: a preliminary scheme that bounds an instance’s skyline probability for filtering, and an elaborate scheme that uses an instance’s bounds to filter other instances based on the dominance relationship. We experimentally demonstrate the effectiveness of our filtering schemes on both real and synthetic data sets and show the efficiency of our schemes compared with other algorithms. 1
THE UNIVERSITY OF
, 2010
"... Top-k pairs queries have many real applications. k closest pairs queries, k furthest pairs queries and their bichromatic variants are few examples of the top-k pairs queries that rank the pairs on distance functions. While these queries have received significant research attention, there does not ex ..."
Abstract
- Add to MetaCart
Top-k pairs queries have many real applications. k closest pairs queries, k furthest pairs queries and their bichromatic variants are few examples of the top-k pairs queries that rank the pairs on distance functions. While these queries have received significant research attention, there does not exist a unified approach that can efficiently answer all these queries. Moreover, there is no existing work that supports top-k pairs queries based on generic ranking functions. In this paper, we present a unified approach that supports a broad class of top-k pairs queries including the queries mentioned above. Our proposed approach allows users to define a local scoring function for each attribute involved in the query and a global scoring function that computes the final score of a pair by combining its scores on different attributes. The proposed framework also supports the skyline pairs queries; that is, return the pairs that are not dominated by any other pair. We propose efficient internal and external memory algorithms and our theoretical analysis shows that the expected performance of the algorithms is optimal when two or less attributes are involved. Our approach does not

