Results 1-10 of 21
Shape Fitting on Point Sets with Probability Distributions
Cited by 15 (6 self)
We consider problems on data sets where each data point has uncertainty described by an individual probability distribution. We develop several frameworks and algorithms for calculating statistics on these uncertain data sets. Our examples focus on geometric shape fitting problems. We prove approximation guarantees for the algorithms with respect to the full probability distributions. We then empirically demonstrate that our algorithms are simple and practical, solving for a constant hidden by asymptotic analysis so that a user can reliably trade speed and size for accuracy.
Efficient and Effective Similarity Search over Probabilistic Data based on Earth Mover’s Distance
2010
Cited by 12 (3 self)
Probabilistic data is arriving as a new deluge, driven by technical advances in geographical tracking, multimedia processing, sensor networks, and RFID. While similarity search is an important functionality supporting the manipulation of probabilistic data, it raises new challenges for traditional relational databases. The problem stems from the limited effectiveness of the distance metrics supported by existing database systems. On the other hand, some more complex distance operators have proven their value through better distinguishing ability in the probabilistic domain. In this paper, we discuss the similarity search problem with the Earth Mover’s Distance, which is the most successful distance metric on probabilistic histograms and an expensive operator with cubic complexity. We present a new database approach to answer range queries and k-nearest neighbor queries on probabilistic data, on the basis of Earth Mover’s Distance. Our solution utilizes the primal-dual theory in linear programming and deploys B+ tree index structures for effective candidate pruning. Extensive experiments show that our proposal dramatically improves the scalability of probabilistic databases.
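To make the distance concrete, here is a minimal sketch of the Earth Mover’s Distance in one easy special case: two histograms over the same ordered bins, with unit distance between adjacent bins and equal total mass. In that case the distance reduces to a sum of absolute cumulative differences and needs no linear program. This is not the indexing method of the paper above; the function name `emd_1d` and the unit-spacing assumption are illustrative only.

```python
from itertools import accumulate


def emd_1d(p, q):
    """Earth Mover's Distance between two histograms on the same ordered bins.

    Assumes adjacent bins are one unit apart and both histograms carry the
    same total mass (for example, both are probability distributions).
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    if abs(sum(p) - sum(q)) > 1e-9:
        raise ValueError("histograms must have equal total mass")
    # The running surplus after bin i is the mass that must be carried
    # across the boundary between bin i and bin i + 1.
    surplus = accumulate(a - b for a, b in zip(p, q))
    return sum(abs(s) for s in surplus)
```

For example, `emd_1d([1, 0, 0], [0, 0, 1])` moves one unit of mass across two bin boundaries. The general case with an arbitrary ground distance is a transportation problem, which is where the cubic cost mentioned in the abstract comes from.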
UPI: A Primary Index for Uncertain Databases
PVLDB
Cited by 8 (0 self)
Uncertain data management has received growing attention from industry and academia. Many efforts have been made to optimize uncertain databases, including the development of special index data structures. However, none of these efforts have explored primary (clustered) indexes for uncertain databases, despite the fact that clustering has the potential to offer substantial speedups for non-selective analytic queries on large uncertain databases. In this paper, we propose a new index called a UPI (Uncertain Primary Index) that clusters heap files according to uncertain attributes with both discrete and continuous uncertainty distributions. Because uncertain attributes may have several possible values, a UPI on an uncertain attribute duplicates tuple data once for each possible value. To prevent the size of the UPI from becoming unmanageable, its size is kept small by placing low-probability tuples in a special Cutoff Index that is consulted only when queries for low-probability values are run. We also propose several other optimizations, including techniques to improve secondary index performance and techniques to reduce maintenance costs and fragmentation by buffering changes to the table and writing updates in sequential batches. Finally, we develop cost models for UPIs to estimate query performance in various settings to help automatically select tuning parameters of a UPI. We have implemented a prototype UPI and experimented on two real datasets. Our results show that UPIs can significantly (up to two orders of magnitude) improve the performance of uncertain queries both over clustered and unclustered attributes. We also show that our buffering techniques mitigate table fragmentation and keep the maintenance cost as low as or even lower than using an unclustered heap file.
Closest Pair and the Post Office Problem for Stochastic Points
Cited by 7 (2 self)
Given a (master) set M of n points in d-dimensional Euclidean space, consider drawing a random subset that includes each point m_i ∈ M with an independent probability p_i. How difficult is it to compute elementary statistics about the closest pair of points in such a subset? For instance, what is the probability that the distance between the closest pair of points in the random subset is no more than ℓ, for a given value ℓ? Or, can we preprocess the master set M such that given a query point q, we can efficiently estimate the expected distance from q to its nearest neighbor in the random subset? We obtain hardness results and approximation algorithms for stochastic problems of this kind.
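The first question in this abstract can be stated precisely with a brute-force sketch that enumerates every possible subset and sums the probability of those whose closest pair lies within ℓ. This takes exponential time and is only usable for a handful of points; it illustrates the problem definition and is not one of the paper's algorithms. The function name `prob_closest_pair_within` is hypothetical.

```python
from itertools import combinations, product
from math import dist


def prob_closest_pair_within(points, probs, ell):
    """Exact probability that some two present points are within distance ell.

    points: list of coordinate tuples (the master set M)
    probs:  independent inclusion probability for each point
    Enumerates all 2^n subsets, so n must be small.
    """
    total = 0.0
    for mask in product([False, True], repeat=len(points)):
        # Probability of drawing exactly this subset.
        weight = 1.0
        for present, p in zip(mask, probs):
            weight *= p if present else 1.0 - p
        chosen = [pt for pt, present in zip(points, mask) if present]
        # Subsets with fewer than two points have no pair and never qualify.
        if any(dist(a, b) <= ell for a, b in combinations(chosen, 2)):
            total += weight
    return total
```

With points `(0, 0)`, `(1, 0)`, `(10, 0)`, each present with probability 0.5, and ℓ = 1, only the first two points can form a qualifying pair, so the result is the probability that both are present.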
Threshold Query Optimization for Uncertain Data
Cited by 6 (0 self)
The probabilistic threshold query (PTQ) is one of the most common queries in uncertain databases, where all results satisfying the query with probabilities that meet the threshold requirement are returned. PTQ is used widely in nearest-neighbor queries, range queries, ranking queries, etc. In this paper, we investigate the general PTQ for arbitrary SQL queries that involve selections, projections and joins. The uncertain database model that we use is one that combines both attribute and tuple uncertainty as well as correlations between arbitrary attribute sets. We address the PTQ optimization problem that aims at improving the efficiency of PTQ query execution by enabling alternative query plan enumeration for optimization. We propose general optimization rules as well as rules specifically for selections, projections and joins. We introduce a threshold operator (τ-operator) to the query plan and show it is generally desirable to push down the τ-operator as much as possible. Our PTQ optimizations are evaluated in a real uncertain database management system. Our experiments on both real and synthetic data sets show that the optimizations improve the PTQ query processing time.
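The intuition behind pushing the threshold down can be sketched for the simplest case: a join of two relations whose tuples exist independently. A joined tuple's probability is then the product of its inputs' probabilities, so any input below the threshold can be dropped before the join. The independence assumption, the `(key, value, prob)` tuple layout, and the name `threshold_join` are illustrative; the paper's model also covers attribute uncertainty and correlations, which this sketch does not.

```python
def threshold_join(left, right, tau):
    """Equi-join two lists of (key, value, prob) tuples, keeping results with
    probability >= tau. Assumes tuples exist independently of one another.
    """
    # Push-down step: since p1 * p2 <= min(p1, p2), an input tuple with
    # probability below tau can never contribute to a qualifying result.
    left = [t for t in left if t[2] >= tau]
    right = [t for t in right if t[2] >= tau]
    results = []
    for k1, v1, p1 in left:
        for k2, v2, p2 in right:
            joint = p1 * p2
            if k1 == k2 and joint >= tau:
                results.append((k1, v1, v2, joint))
    return results
```

Filtering before the join shrinks both inputs without changing the answer, which is the effect the abstract attributes to pushing the τ-operator down the plan.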
On the Most Likely Convex Hull of Uncertain Points
Cited by 4 (1 self)
Consider a set of d-dimensional points where the existence or the location of each point is determined by a probability distribution. The convex hull of this set is a random variable distributed over exponentially many choices. We are interested in finding the most likely convex hull, namely, the one with the maximum probability of occurrence. We investigate this problem under two natural models of uncertainty: the point (also called the tuple) model where each point (site) has a fixed position s_i but only exists with some probability π_i, for 0 < π_i ≤ 1, and the multipoint model where each point has multiple possible locations or it may not appear at all. We show that the most likely hull under the point model can be computed in O(n^3) time for n points in d = 2 dimensions, but it is NP-hard for d ≥ 3 dimensions. On the other hand, we show that the problem is NP-hard under the multipoint model even for d = 2 dimensions. We also present hardness results for approximating the probability of the most likely hull. While we focus on the most likely hull for concreteness, our results hold for other natural definitions of a probabilistic hull.
Geometric Computations on Indecisive and Uncertain Points
Cited by 3 (1 self)
We study computing geometric problems on uncertain points. An uncertain point is a point that does not have a fixed location, but rather is described by a probability distribution. When these probability distributions are restricted to a finite number of locations, the points are called indecisive points. In particular, we focus on geometric shape-fitting problems and on building compact distributions to describe how the solutions to these problems vary with respect to the uncertainty in the points. Our main results are: (1) a simple and efficient randomized approximation algorithm for calculating the distribution of any statistic on uncertain data sets; (2) a polynomial, deterministic and exact algorithm for computing the distribution of answers for any LP-type problem on an indecisive point set; and (3) the development of shape inclusion probability (SIP) functions, which capture the ambient distribution of shapes fit to uncertain or indecisive point sets and are admissible to the two algorithmic constructions.
Range Counting Coresets for Uncertain Data
2013
Cited by 2 (1 self)
We study coresets for various types of range counting queries on uncertain data. In our model each uncertain point has a probability density describing its location, sometimes defined as k distinct locations. Our goal is to construct a subset of the uncertain points, including their locational uncertainty, so that range counting queries can be answered by just examining this subset. We study three distinct types of queries. RE queries return the expected number of points in a query range. RC queries return the number of points in the range with probability at least a threshold. RQ queries return the probability that fewer than some threshold fraction of the points are in the range. In both RC and RQ coresets the threshold is provided as part of the query. For each type of query we provide coreset constructions with approximation-size tradeoffs. We show that random sampling can be used to construct each type of coreset, and we also provide significantly improved bounds using discrepancy-based approaches on axis-aligned range queries.
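The RE and RC query definitions can be sketched directly for the case where each uncertain point has a finite list of weighted locations and the query range is an axis-aligned box. The sketch below evaluates the queries on the full data set, which is the answer a coreset would approximate; it does not construct a coreset. The data layout (a list of `(location, probability)` pairs per point) and the function names are assumptions for illustration.

```python
def prob_in_range(point, lo, hi):
    """Probability that one uncertain point falls in the box [lo, hi].

    point: list of (location, probability) pairs whose probabilities sum to 1
    lo, hi: lower and upper corner of an axis-aligned box
    """
    return sum(
        w for loc, w in point
        if all(l <= x <= h for x, l, h in zip(loc, lo, hi))
    )


def re_query(points, lo, hi):
    """RE query: expected number of points inside the box
    (by linearity of expectation, the sum of per-point probabilities)."""
    return sum(prob_in_range(pt, lo, hi) for pt in points)


def rc_query(points, lo, hi, tau):
    """RC query: number of points inside the box with probability >= tau."""
    return sum(1 for pt in points if prob_in_range(pt, lo, hi) >= tau)
```

Given three points whose probabilities of landing in the box are 0.5, 1.0, and 0.75, `re_query` returns 2.25 and `rc_query` with a threshold of 0.6 returns 2.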
Effectively Indexing the Multi-Dimensional Uncertain Objects for Range Searching
Cited by 1 (1 self)
The range searching problem is fundamental in a wide spectrum of applications such as radio frequency identification (RFID) networks, location-based services (LBS), and the global positioning system (GPS). Because uncertainty is inherent in those applications, there is strong demand to address it in range search, since traditional techniques cannot be applied due to the inherent difference between uncertain data and traditional data. In this paper, we propose a novel indexing structure, named U-Quadtree, to organize uncertain objects in a multi-dimensional space such that range searches can be answered efficiently by applying filtering techniques. In particular, based on some insights into range search on uncertain data, we propose a cost model that carefully considers various factors that may impact the performance of range searching. We then propose an effective and efficient index construction algorithm that builds the optimal U-Quadtree with respect to the cost model. Comprehensive experiments demonstrate that our technique outperforms existing work on range searching over multi-dimensional uncertain objects.