Scalable Probabilistic Similarity Ranking in Uncertain Databases
Cited by 8 (5 self)
This paper introduces a scalable approach for probabilistic top-k similarity ranking on uncertain vector data. Each uncertain object is represented by a set of vector instances that are assumed to be mutually exclusive. The objective is to rank the uncertain data according to their distance to a reference object. We propose a framework that incrementally computes, for each object instance and ranking position, the probability of the object falling at that ranking position. The resulting rank probability distribution can serve as input for several state-of-the-art probabilistic ranking models. Existing approaches compute this probability distribution by applying the Poisson binomial recurrence technique of quadratic complexity. In this paper we theoretically as well as experimentally show that our framework reduces this to a linear-time complexity while having the same memory requirements, facilitated by incremental accessing of the uncertain vector instances in increasing order of their distance to the reference object. Furthermore, we show how the output of our method can be used to apply probabilistic top-k ranking for the objects, according to different state-of-the-art definitions. We conduct an experimental evaluation on synthetic and real data, which demonstrates the efficiency of our approach.
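As a rough illustration of the baseline this abstract mentions, the Poisson binomial recurrence can be sketched as follows. This is not the paper's linear-time framework; it is a minimal sketch of the quadratic recurrence, assuming the other objects are independent and that `closer_probs[j]` (a hypothetical input) is the probability that object j is closer to the reference than the target instance.

```python
def rank_distribution(closer_probs):
    """Return dist where dist[i] is the probability that exactly i of the
    other objects are closer to the reference than the target.

    Assumption: each entry of closer_probs is an independent Bernoulli
    probability. The nested loop makes this quadratic in the number of objects.
    """
    dist = [1.0]  # with no other objects, zero are closer with certainty
    for p in closer_probs:
        nxt = [0.0] * (len(dist) + 1)
        for i, mass in enumerate(dist):
            nxt[i] += mass * (1.0 - p)   # this object is not closer
            nxt[i + 1] += mass * p       # this object is closer
        dist = nxt
    return dist


if __name__ == "__main__":
    # Two other objects, each closer with probability 0.5.
    print(rank_distribution([0.5, 0.5]))
```

Entry `dist[i]` is the probability that the target occupies rank position i + 1, which is the kind of rank probability distribution the abstract describes as input to probabilistic ranking models.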
(Approximate) uncertain skylines
In ICDT, 2011
Cited by 8 (1 self)
Given a set of points with uncertain locations, we consider the problem of computing the probability of each point lying on the skyline, that is, the probability that it is not dominated by any other input point. If each point’s uncertainty is described as a probability distribution over a discrete set of locations, we improve the best known exact solution. We also suggest why we believe our solution might be optimal. Next, we describe simple, near-linear time approximation algorithms for computing the probability of each point lying on the skyline. In addition, some of our methods can be adapted to construct data structures that can efficiently determine the probability of a query point lying on the skyline.
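To make the problem statement concrete, the skyline probability for discrete location distributions can be computed by brute force as sketched below. This is only a direct restatement of the definition, not the improved exact algorithm or the approximation schemes from the paper. It assumes that points are independent, that smaller coordinates are better, and that `points[i]` is a hypothetical list of `(location, probability)` pairs.

```python
def dominates(q, p):
    """q dominates p if q is no worse in every coordinate and strictly
    better in at least one (assumption: smaller values are better)."""
    return all(a <= b for a, b in zip(q, p)) and any(a < b for a, b in zip(q, p))


def skyline_probabilities(points):
    """points[i] is a list of (location, probability) alternatives.
    Returns, for each point, the probability that no other point dominates it,
    assuming the points are mutually independent."""
    result = []
    for i, own in enumerate(points):
        total = 0.0
        for loc, prob in own:
            survive = 1.0
            for j, other in enumerate(points):
                if j == i:
                    continue
                # Probability that point j lands on a location dominating loc.
                dominated = sum(w for o, w in other if dominates(o, loc))
                survive *= 1.0 - dominated
            total += prob * survive
        result.append(total)
    return result


if __name__ == "__main__":
    pts = [
        [((1, 1), 1.0)],
        [((2, 2), 0.5), ((0, 3), 0.5)],
    ]
    print(skyline_probabilities(pts))
```

The triple loop costs time proportional to the square of the total number of locations, which is the kind of cost the exact and near-linear methods in the abstract aim to beat.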
Efficient Probabilistic Reverse Nearest Neighbor Query Processing on Uncertain Data
Cited by 8 (1 self)
Given a query object q, a reverse nearest neighbor (RNN) query in a common certain database returns the objects having q as their nearest neighbor. A new challenge for databases is dealing with uncertain objects. In this paper we consider probabilistic reverse nearest neighbor (PRNN) queries, which return the uncertain objects having the query object as nearest neighbor with a sufficiently high probability. We propose an algorithm for efficiently answering PRNN queries using new pruning mechanisms taking distance dependencies into account. We compare our algorithm to state-of-the-art approaches recently proposed. Our experimental evaluation shows that our approach is able to significantly outperform previous approaches. In addition, we show how our approach can easily be extended to PRkNN (where k > 1) query processing for which there is currently no efficient solution.
Ranking Continuous Probabilistic Datasets
Cited by 8 (3 self)
Ranking is a fundamental operation in data analysis and decision support, and plays an even more crucial role if the dataset being explored exhibits uncertainty. This has led to much work in understanding how to rank uncertain datasets in recent years. In this paper, we address the problem of ranking when the tuple scores are uncertain, and the uncertainty is captured using continuous probability distributions (e.g. Gaussian distributions). We present a comprehensive solution to compute the values of a parameterized ranking function (PRF) [18] for arbitrary continuous probability distributions (and thus rank the uncertain dataset); PRF can be used to simulate or approximate many other ranking functions proposed in prior work. We develop exact polynomial time algorithms for some continuous probability distribution classes, and efficient approximation schemes with provable guarantees for arbitrary probability distributions. Our algorithms can also be used for exact or approximate evaluation of k-nearest neighbor queries over uncertain objects, whose positions are modeled using continuous probability distributions. Our experimental evaluation over several datasets illustrates the effectiveness of our approach at efficiently ranking uncertain datasets with continuous attribute uncertainty.
Attribute and object selection queries on objects with probabilistic attributes
ACM Transactions on Database Systems (ACM TODS), 2012
Cited by 7 (6 self)
Modern data processing techniques such as entity resolution, data cleaning, information extraction, and automated tagging often produce results consisting of objects whose attributes may contain uncertainty. This uncertainty is frequently captured in the form of a set of multiple mutually exclusive value choices for each uncertain attribute along with a measure of probability for alternative values. However, the lay end user, as well as some end applications, might not be able to interpret the results if outputted in such a form. Thus, the question is how to present such results to the user in practice, for example, to support attribute-value selection and object selection queries the user might be interested in. Specifically, in this article we study the problem of maximizing the quality of these selection queries on top of such a probabilistic representation. The quality is measured using the standard and commonly used set-based quality metrics. We formalize the problem and then develop efficient approaches that provide high-quality answers for these queries. The comprehensive empirical evaluation over three different domains demonstrates the advantage of our approach over existing techniques.
Ranking Distributed Probabilistic Data
2009
Cited by 7 (1 self)
Ranking queries are essential tools to process large amounts of probabilistic data that encode exponentially many possible deterministic instances. In many applications where uncertainty and fuzzy information arise, data are collected from multiple sources in distributed, networked locations, e.g., distributed sensor fields with imprecise measurements, multiple scientific institutes with inconsistency in their scientific data. Due to the network delay and the economic cost associated with communicating large amounts of data over a network, a fundamental problem in these scenarios is to retrieve the global top-k tuples from all distributed sites with minimum communication cost. Using the well-founded notion of the expected rank of each tuple across all possible worlds as the basis of ranking, this work designs both communication- and computation-efficient algorithms for retrieving the top-k tuples with the smallest ranks from distributed sites. Extensive experiments using both synthetic and real data sets confirm the efficiency and superiority of our algorithms over the straightforward approach of forwarding all data to the server.
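The notion of expected rank can be illustrated with a small centralized sketch. This is not the distributed algorithm from the paper; it corresponds more closely to the "forward all data to the server" baseline. It assumes independent tuples with discrete score alternatives, and it assumes that a tuple's rank in a possible world is the number of tuples with a strictly higher score. All function names are hypothetical.

```python
from itertools import product


def expected_ranks(tuples):
    """tuples[i] is a list of (score, probability) alternatives.
    By linearity of expectation, the expected rank of tuple i is the sum
    over j != i of P(score_j > score_i), assuming independent tuples."""
    ranks = []
    for i, own in enumerate(tuples):
        exp = 0.0
        for j, other in enumerate(tuples):
            if j == i:
                continue
            exp += sum(p * q for s, p in own for t, q in other if t > s)
        ranks.append(exp)
    return ranks


def expected_ranks_bruteforce(tuples):
    """Same quantity by enumerating every possible world (exponential cost);
    included only to cross-check the definition on tiny inputs."""
    ranks = [0.0] * len(tuples)
    for world in product(*tuples):
        weight = 1.0
        for _, p in world:
            weight *= p
        for i, (s, _) in enumerate(world):
            ranks[i] += weight * sum(1 for t, _ in world if t > s)
    return ranks


def top_k(tuples, k):
    """Indices of the k tuples with the smallest expected rank."""
    ranks = expected_ranks(tuples)
    return sorted(range(len(tuples)), key=lambda i: ranks[i])[:k]


if __name__ == "__main__":
    data = [[(10, 0.5), (2, 0.5)], [(5, 1.0)], [(1, 1.0)]]
    print(expected_ranks(data), top_k(data, 2))
```

The brute-force variant shows why the possible-worlds view is expensive: the number of worlds grows exponentially with the number of uncertain tuples, whereas the pairwise formulation avoids enumerating them.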
Clustering Large Probabilistic Graphs
Cited by 6 (0 self)
We study the problem of clustering probabilistic graphs. Similar to the problem of clustering standard graphs, probabilistic graph clustering has numerous applications, such as finding complexes in probabilistic protein-protein interaction networks and discovering groups of users in affiliation networks. We extend the edit-distance based definition of graph clustering to probabilistic graphs. We establish a connection between our objective function and correlation clustering to propose practical approximation algorithms for our problem. A benefit of our approach is that our objective function is parameter-free. Therefore, the number of clusters is part of the output. We also develop methods for testing the statistical significance of the output clustering and study the case of noisy clusterings. Using a real protein-protein interaction network and ground-truth data, we show that our methods discover the correct number of clusters and identify established protein relationships. Finally, we show the practicality of our techniques using a large social network of Yahoo! users consisting of one billion edges.
A novel probabilistic pruning approach to speed up similarity queries in uncertain databases
Cited by 6 (3 self)
In this paper, we propose a novel, effective and efficient probabilistic pruning criterion for probabilistic similarity queries on uncertain data. Our approach supports a general uncertainty model using continuous probabilistic density functions to describe the (possibly correlated) uncertain attributes of objects. In a nutshell, the problem to be solved is to compute the PDF of the random variable denoted by the probabilistic domination count: Given an uncertain database object B, an uncertain reference object R and a set D of uncertain database objects in a multidimensional space, the probabilistic domination count denotes the number of uncertain objects in D that are closer to R than B. This domination count can be used to answer a wide range of probabilistic similarity queries. Specifically, we propose a novel geometric pruning filter and introduce an iterative filter-refinement strategy for conservatively and progressively estimating the probabilistic domination count in an efficient way while keeping correctness according to the possible world semantics. In an experimental evaluation, we show that our proposed technique allows tight probability bounds for the probabilistic domination count to be acquired quickly, even for large uncertain databases.
Threshold Query Optimization for Uncertain Data
Cited by 6 (0 self)
The probabilistic threshold query (PTQ) is one of the most common queries in uncertain databases, where all results satisfying the query with probabilities that meet the threshold requirement are returned. PTQ is used widely in nearest-neighbor queries, range queries, ranking queries, etc. In this paper, we investigate the general PTQ for arbitrary SQL queries that involve selections, projections and joins. The uncertain database model that we use is one that combines both attribute and tuple uncertainty as well as correlations between arbitrary attribute sets. We address the PTQ optimization problem that aims at improving the efficiency of PTQ query execution by enabling alternative query plan enumeration for optimization. We propose general optimization rules as well as rules specifically for selections, projections and joins. We introduce a threshold operator (τ-operator) to the query plan and show it is generally desirable to push down the τ-operator as much as possible. Our PTQ optimizations are evaluated in a real uncertain database management system. Our experiments on both real and synthetic data sets show that the optimizations improve the PTQ query processing time.
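The idea of pushing a threshold operator down a query plan can be illustrated in the simplest possible setting. The sketch below is not the paper's model, which also covers attribute uncertainty and correlations. It assumes a much simpler one in which each row carries a single existence probability and the selection predicate is deterministic, so the predicate never changes a row's probability. The names `threshold` and `select` are hypothetical.

```python
def threshold(rows, tau):
    """Keep only rows whose probability meets the threshold tau.
    rows is a list of (value, probability) pairs (simplified model)."""
    return [(v, p) for v, p in rows if p >= tau]


def select(rows, pred):
    """Deterministic selection: keeps a row's probability unchanged."""
    return [(v, p) for v, p in rows if pred(v)]


if __name__ == "__main__":
    rows = [("a", 0.9), ("b", 0.2), ("c", 0.7)]
    pred = lambda v: v != "a"

    late = threshold(select(rows, pred), 0.5)    # threshold applied last
    early = select(threshold(rows, 0.5), pred)   # threshold pushed down
    print(late, early)
```

In this restricted setting both plans return the same rows, while the pushed-down plan hands fewer rows to the selection. Operators that combine probabilities, such as joins or duplicate-eliminating projections, need more care, which is where dedicated optimization rules come in.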
Duplicate Detection in Probabilistic Data
Cited by 5 (2 self)
Collected data often contains uncertainties. Probabilistic databases have been proposed to manage uncertain data. To combine data from multiple autonomous probabilistic databases, an integration of probabilistic data has to be performed. Until now, however, data integration approaches have focused on the integration of certain source data (relational or XML). There is no work on the integration of uncertain source data so far. In this paper, we present a first step towards a concise consolidation of probabilistic data. We focus on duplicate detection as a representative and essential step in an integration process. We present techniques for identifying multiple probabilistic representations of the same real-world entities.