Results 1–10 of 19
Approximate Confidence Computation in Probabilistic Databases
Cited by 28 (5 self)
Abstract—This paper introduces a deterministic approximation algorithm with error guarantees for computing the probability of propositional formulas over discrete random variables. The algorithm is based on an incremental compilation of formulas into decision diagrams using three types of decompositions: Shannon expansion, independence partitioning, and product factorization. With each decomposition step, lower and upper bounds on the probability of the partially compiled formula can be quickly computed and checked against the allowed error. This algorithm can be effectively used to compute approximate confidence values of answer tuples to positive relational algebra queries on general probabilistic databases (c-tables with discrete probability distributions). We further tune our algorithm so as to capture all known tractable conjunctive queries without self-joins on tuple-independent probabilistic databases: in this case, the algorithm requires time polynomial in the input size even for exact computation. We implemented the algorithm as an extension of the SPROUT query engine. An extensive experimental effort shows that it consistently outperforms state-of-the-art approximation techniques by several orders of magnitude.
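The bounding idea behind this abstract can be sketched in Python for DNF formulas over independent Boolean variables — a toy simplification of the paper's general decision-diagram compilation, with all function names chosen here for illustration:

```python
def clause_prob(clause, probs):
    """P(conjunctive clause) over independent Boolean variables.
    clause: {var: polarity}; probs[var] = P(var = True)."""
    p = 1.0
    for v, positive in clause.items():
        p *= probs[v] if positive else 1.0 - probs[v]
    return p

def quick_bounds(dnf, probs):
    """Cheap bounds before full compilation: any single clause gives a
    lower bound; the union bound over all clauses gives an upper bound."""
    per_clause = [clause_prob(c, probs) for c in dnf]
    return max(per_clause), min(1.0, sum(per_clause))

def shannon_prob(dnf, probs):
    """Exact probability of a DNF via repeated Shannon expansion."""
    if not dnf:
        return 0.0
    if any(len(c) == 0 for c in dnf):  # an empty clause is satisfied
        return 1.0
    v = next(iter(dnf[0]))             # variable to expand on
    def cofactor(val):
        rest = []
        for c in dnf:
            if v not in c:
                rest.append(c)
            elif c[v] == val:          # literal satisfied: drop the literal
                rest.append({u: s for u, s in c.items() if u != v})
            # literal falsified: drop the whole clause
        return rest
    p = probs[v]
    return (p * shannon_prob(cofactor(True), probs)
            + (1.0 - p) * shannon_prob(cofactor(False), probs))
```

For x ∧ (y ∨ z) with all variable probabilities 0.5, the quick bounds are [0.25, 0.5], and Shannon expansion gives the exact value 0.375 — illustrating how the bounds can sandwich the answer before compilation finishes.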
Ranking with Uncertain Scoring Functions: Semantics and Sensitivity Measures
Cited by 12 (3 self)
Ranking queries report the top-K results according to a user-defined scoring function. A widely used scoring function is the weighted summation of multiple scores. Oftentimes, users cannot precisely specify the weights in such functions in order to produce the preferred order of results. Adopting uncertain/incomplete scoring functions (e.g., using weight ranges and partially specified weight preferences) can better capture the user's preferences in this scenario. In this paper, we study two aspects of uncertain scoring functions. The first aspect is the semantics of ranking queries, and the second is the sensitivity of computed results to refinements made by the user. We formalize and solve multiple problems under both aspects, and present novel techniques that compute query results efficiently to comply with the interactive nature of these problems.
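One semantics a weight range induces can be sketched concretely: with a two-score weighted sum and a weight interval, the score difference between two items is linear in the weight, so certain dominance reduces to checking the interval endpoints. This is an illustrative Python sketch under that two-score assumption, not the paper's algorithm:

```python
def score(item, w):
    """Weighted sum of two component scores, with weight w on the first."""
    s1, s2 = item
    return w * s1 + (1.0 - w) * s2

def certainly_outranks(a, b, w_lo, w_hi):
    """True iff a scores at least as high as b for EVERY weight in
    [w_lo, w_hi]. The score difference is linear in w, so checking the
    two endpoints of the range suffices."""
    return all(score(a, w) >= score(b, w) for w in (w_lo, w_hi))
```

With a = (0.9, 0.4) and b = (0.5, 0.6), a dominates b over [0.5, 0.8] but not over [0.3, 0.8], so narrowing the weight range (a user refinement) can turn an uncertain order into a certain one.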
Efficient Probabilistic Reverse Nearest Neighbor Query Processing on Uncertain Data
Cited by 8 (1 self)
Given a query object q, a reverse nearest neighbor (RNN) query in a conventional certain database returns the objects having q as their nearest neighbor. A new challenge for databases is dealing with uncertain objects. In this paper we consider probabilistic reverse nearest neighbor (PRNN) queries, which return the uncertain objects having the query object as nearest neighbor with a sufficiently high probability. We propose an algorithm for efficiently answering PRNN queries using new pruning mechanisms that take distance dependencies into account. We compare our algorithm to recently proposed state-of-the-art approaches. Our experimental evaluation shows that our approach significantly outperforms previous approaches. In addition, we show how our approach can easily be extended to PRkNN (where k > 1) query processing, for which there is currently no efficient solution.
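The query semantics can be made concrete with a naive Monte Carlo baseline over possible worlds — a sampling sketch for intuition (the paper's contribution is pruning that avoids exactly this kind of enumeration; all names here are illustrative):

```python
import math
import random

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def prnn_monte_carlo(q, objects, tau, trials=1000, seed=0):
    """objects: {name: [(point, prob), ...]}, each object's alternatives
    summing to 1. Estimates, for each uncertain object o, the probability
    that q is o's nearest neighbor, and returns names with estimate >= tau."""
    rng = random.Random(seed)
    names = list(objects)
    hits = {n: 0 for n in names}
    for _ in range(trials):
        # sample one possible world: one alternative per object
        world = {}
        for n in names:
            pts, weights = zip(*objects[n])
            world[n] = rng.choices(pts, weights=weights)[0]
        for n in names:
            d_q = dist(world[n], q)
            if all(d_q <= dist(world[n], world[o]) for o in names if o != n):
                hits[n] += 1
    return {n for n in names if hits[n] / trials >= tau}
```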
Ranking Continuous Probabilistic Datasets
Cited by 8 (3 self)
Ranking is a fundamental operation in data analysis and decision support, and plays an even more crucial role if the dataset being explored exhibits uncertainty. This has led to much work on understanding how to rank uncertain datasets in recent years. In this paper, we address the problem of ranking when the tuple scores are uncertain, and the uncertainty is captured using continuous probability distributions (e.g., Gaussian distributions). We present a comprehensive solution to compute the values of a parameterized ranking function (PRF) [18] for arbitrary continuous probability distributions (and thus rank the uncertain dataset); PRF can be used to simulate or approximate many other ranking functions proposed in prior work. We develop exact polynomial time algorithms for some continuous probability distribution classes, and efficient approximation schemes with provable guarantees for arbitrary probability distributions. Our algorithms can also be used for exact or approximate evaluation of k-nearest neighbor queries over uncertain objects whose positions are modeled using continuous probability distributions. Our experimental evaluation over several datasets illustrates the effectiveness of our approach at efficiently ranking uncertain datasets with continuous attribute uncertainty.
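A PRF, as described in the abstract, weights the probability of each tuple landing at each rank. A sampling stand-in for the exact computation (assuming independent Gaussian scores; the paper's algorithms are exact or come with guarantees, unlike this sketch):

```python
import random

def rank_distribution(score_dists, trials=2000, seed=1):
    """Estimate P(tuple i lands at rank j) when each tuple's score is an
    independent Gaussian (mu, sigma); higher scores rank first."""
    rng = random.Random(seed)
    n = len(score_dists)
    counts = [[0] * n for _ in range(n)]
    for _ in range(trials):
        scores = [rng.gauss(mu, sigma) for mu, sigma in score_dists]
        order = sorted(range(n), key=lambda i: -scores[i])
        for rank, i in enumerate(order):
            counts[i][rank] += 1
    return [[c / trials for c in row] for row in counts]

def prf(positional_probs, weights):
    """PRF(i) = sum_j w_j * P(tuple i at rank j); tuples are then
    ordered by this value."""
    return [sum(w * p for w, p in zip(weights, row))
            for row in positional_probs]
```

Choosing weights (1, 0, 0, ...) recovers the probability of being ranked first; other weight vectors approximate other ranking semantics, which is the flexibility the abstract refers to.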
Querying parse trees of stochastic context-free grammars
 Proc. 13th International Conference on Database Theory (ICDT), ACM
Cited by 7 (0 self)
Abstract—Stochastic context-free grammars (SCFGs) have long been recognized as useful for a large variety of tasks, including natural language processing, morphological parsing, speech recognition, information extraction, Web-page wrapping, and even analysis of RNA. A string and an SCFG jointly represent a probabilistic interpretation of the meaning of the string, in the form of a (possibly infinite) probability space of parse trees. The problem of evaluating a query over this probability space is considered under the conventional semantics of querying a probabilistic database. For general SCFGs, extremely simple queries may have results that include irrational probabilities. But for a large subclass of SCFGs (one that includes all the standard studied subclasses of SCFGs) and the language of tree-pattern queries with projection (and child/descendant edges), it is shown that query results have rational probabilities with a polynomial-size bit representation and, more importantly, an efficient query-evaluation algorithm is presented.
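The probability space of parse trees the abstract mentions can be summed over with the classic inside algorithm (probabilistic CYK) when the grammar is in Chomsky normal form — a standard building block, sketched here in Python rather than the paper's query-evaluation machinery:

```python
from collections import defaultdict

def string_probability(rules, start, words):
    """Inside algorithm (probabilistic CYK): the total probability, over
    all parse trees, that a CNF SCFG derives `words`.
    rules: (lhs, rhs, prob) with rhs a 1-tuple (a terminal) or a 2-tuple
    of nonterminals; the probabilities of each lhs should sum to 1."""
    n = len(words)
    inside = defaultdict(float)  # (i, j, A) -> P(A =>* words[i:j])
    for i, w in enumerate(words):
        for lhs, rhs, p in rules:
            if rhs == (w,):
                inside[(i, i + 1, lhs)] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):       # split point
                for lhs, rhs, p in rules:
                    if len(rhs) == 2:
                        b, c = rhs
                        inside[(i, j, lhs)] += (
                            p * inside[(i, k, b)] * inside[(k, j, c)])
    return inside[(0, n, start)]
```

For the toy grammar S → A A (0.6) | 'a' (0.4), A → 'a' (1.0), the string "a a" has probability 0.6 and "a" has probability 0.4.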
Attribute and object selection queries on objects with probabilistic attributes
 ACM Transactions on Database Systems (ACM TODS)
, 2012
Cited by 7 (6 self)
Modern data processing techniques such as entity resolution, data cleaning, information extraction, and automated tagging often produce results consisting of objects whose attributes may contain uncertainty. This uncertainty is frequently captured in the form of a set of multiple mutually exclusive value choices for each uncertain attribute, along with a measure of probability for alternative values. However, the lay end-user, as well as some end-applications, might not be able to interpret the results if output in such a form. Thus, the question is how to present such results to the user in practice, for example to support attribute-value selection and object selection queries the user might be interested in. Specifically, in this article we study the problem of maximizing the quality of these selection queries on top of such a probabilistic representation. The quality is measured using standard and commonly used set-based quality metrics. We formalize the problem and then develop efficient approaches that provide high-quality answers for these queries. A comprehensive empirical evaluation over three different domains demonstrates the advantage of our approach over existing techniques.
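The optimization the abstract describes — picking a deterministic answer set that maximizes an expected set-based metric such as F1 — can be illustrated with a Monte Carlo sketch. It assumes independent object memberships and searches only probability-sorted prefixes (a structural property that holds in many similar settings, asserted here only as a heuristic); the article's exact methods differ:

```python
import random

def expected_f1(include, probs, trials=2000, seed=2):
    """Monte Carlo estimate of E[F1] when object i truly satisfies the
    selection with probability probs[i], independently, and we answer
    with the index set `include`."""
    rng = random.Random(seed)
    inc = set(include)
    total = 0.0
    for _ in range(trials):
        truth = {i for i, p in enumerate(probs) if rng.random() < p}
        tp = len(inc & truth)
        denom = len(inc) + len(truth)
        total += (2.0 * tp / denom) if denom else 1.0
    return total / trials

def best_prefix_answer(probs):
    """Heuristic: search only prefixes of the objects sorted by
    probability descending, and keep the prefix with the best E[F1]."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    best, best_q = [], expected_f1([], probs)
    for k in range(1, len(order) + 1):
        q = expected_f1(order[:k], probs)
        if q > best_q:
            best, best_q = order[:k], q
    return best, best_q
```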
Maximizing expected utility for stochastic combinatorial optimization problems
 In FOCS
, 2011
Cited by 5 (3 self)
We study the stochastic versions of a broad class of combinatorial problems where the weights of the elements in the input dataset are uncertain. The class of problems that we study includes shortest paths, minimum weight spanning trees, and minimum weight matchings over probabilistic graphs, and other combinatorial problems like knapsack. We observe that the expected value is inadequate in capturing different types of risk-averse or risk-prone behaviors, and instead we consider a more general objective: to maximize the expected utility of the solution for some given utility function, rather than the expected weight (expected weight becomes a special case). We show that we can obtain a polynomial time approximation algorithm with additive error ε for any ε > 0, if there is a pseudo-polynomial time algorithm for the exact version of the problem (this is true for the problems mentioned above) and the maximum value of the utility function is bounded by a constant. Our result generalizes several prior results on stochastic shortest path, stochastic spanning tree, and stochastic knapsack. Our algorithm for utility maximization makes use of the separability of exponential utility and a technique to decompose a general utility function into exponential utility functions, which may be useful in other stochastic optimization problems.
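The separability of exponential utility that the abstract leans on is easy to see: for independent weights, E[exp(−λ·Σ w_i)] = Π E[exp(−λ w_i)], so the expectation factors edge by edge. A small Python check of that identity (illustrative names; discrete per-edge distributions assumed):

```python
import math
import random

def expected_exp_utility_exact(edge_dists, lam):
    """For independent edge weights w_i, E[exp(-lam * sum_i w_i)] factors
    into a product of per-edge expectations -- the separability of the
    exponential utility. edge_dists: per edge, a list of (value, prob)."""
    out = 1.0
    for dist in edge_dists:
        out *= sum(p * math.exp(-lam * v) for v, p in dist)
    return out

def expected_exp_utility_sampled(edge_dists, lam, trials=20000, seed=3):
    """Plain Monte Carlo of the same expectation, as a sanity check."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        w = 0.0
        for dist in edge_dists:
            values, probs = zip(*dist)
            w += rng.choices(values, weights=probs)[0]
        total += math.exp(-lam * w)
    return total / trials
```

Because the exact form is a product over edges, it can be folded into dynamic programs for shortest path or spanning tree, which is what makes the exponential family the convenient basis for decomposing general utilities.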
Ranking queries on uncertain data
, 2011
Cited by 3 (1 self)
Uncertain data is inherent in a few important applications. It is far from trivial to extend ranking queries (also known as top-k queries), a popular type of query on certain data, to uncertain data. In this paper, we cast ranking queries on uncertain data using three parameters: rank threshold k, probability threshold p, and answer set size threshold l. Systematically, we identify four types of ranking queries on uncertain data. First, a probability threshold top-k query computes the uncertain records taking a probability of at least p to be in the top-k list. Second, a top-(k, l) query returns the top-l uncertain records whose probabilities of being ranked among the top-k are the largest. Third, the p-rank of an uncertain record is the smallest number k such that the record takes a probability of at least p to be ranked in the top-k list. A rank threshold top-k query retrieves the records whose p-ranks are at most k. Last, a top-(p, l) query returns the top-l uncertain records with the smallest p-ranks. To answer such ranking queries, we present an efficient exact algorithm, a fast sampling algorithm, and a Poisson …
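The core quantity behind all four query types — the probability that a record appears in the top-k — has a simple exact form in the tuple-independent case: it is the record's existence probability times the probability that at most k−1 higher-scored records exist, a Poisson-binomial tail computable by dynamic programming. A Python sketch under that tuple-independence assumption:

```python
def topk_probability(probs, k):
    """probs: existence probabilities of tuple-independent records, sorted
    by score descending. Returns, per record i, the probability it makes
    the top-k answer: p_i times P(at most k-1 higher-scored records
    exist), computed with a Poisson-binomial DP in O(n*k)."""
    result = []
    # dp[j] = P(exactly j of the records seen so far exist), capped at k
    dp = [1.0] + [0.0] * k
    for p in probs:
        result.append(p * sum(dp[:k]))
        new = [0.0] * (k + 1)
        for j in range(k + 1):
            new[j] += dp[j] * (1.0 - p)       # record absent
            new[min(j + 1, k)] += dp[j] * p   # record present (capped)
        dp = new
    return result
```

A probability threshold top-k query then keeps records with value ≥ p, while thresholding on the smallest k achieving that value gives the p-rank.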
Design and Implementation of the SPROUT Query Engine for Probabilistic Databases
Cited by 2 (0 self)
I would like to thank a number of people for their assistance and support. First of all, I cannot overstate my indebtedness to my supervisor, Dan Olteanu. Since the conception of the projects, Dan has been a source of constant encouragement, sound advice, great company, and lots of good ideas. More generally, he has taught me how to become a database researcher. Without his guidance, SPROUT and MayBMS would not have been possible. Aside from my supervisor, I must also thank Christoph Koch. He overcame several obstacles to provide me with generous financial support. The work on MayBMS was supported by a one-year scholarship from Cornell University. What is more, he initially proposed the implementation of MayBMS and guided me during its execution. I would also like to express my gratitude to those who helped in numerous other ways, from proofreading to providing much-needed feedback. In this regard, I would like to thank Sebastian Ordyniak, Margarita Satraki, Hao Wu and Haoxian Zhao.
Anytime approximation in probabilistic databases
, 2013
Cited by 2 (2 self)
This article describes an approximation algorithm for computing the probability of propositional formulas over discrete random variables. It incrementally refines lower and upper bounds on the probability of the formulas until the desired absolute or relative error guarantee is reached. This algorithm is used by the SPROUT query engine to approximate the probabilities of results to relational algebra queries on expressive probabilistic databases.
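The anytime behavior described here — refining lower and upper bounds until an error guarantee holds — can be illustrated with a much simpler bounding scheme than the article's decomposition-based one: fold in one DNF clause at a time, using the exact probability of the processed prefix as the lower bound and a union bound over the remainder as the upper bound (an illustrative Python stand-in, not SPROUT's algorithm):

```python
from itertools import product

def clause_prob(clause, probs):
    """P(conjunctive clause) over independent Boolean variables."""
    out = 1.0
    for v, positive in clause.items():
        out *= probs[v] if positive else 1.0 - probs[v]
    return out

def dnf_prob_exact(clauses, probs):
    """Brute-force P(c1 or ... or cm) over the variables mentioned."""
    vars_ = sorted({v for c in clauses for v in c})
    total = 0.0
    for vals in product([False, True], repeat=len(vars_)):
        world = dict(zip(vars_, vals))
        weight = 1.0
        for v in vars_:
            weight *= probs[v] if world[v] else 1.0 - probs[v]
        if any(all(world[v] == s for v, s in c.items()) for c in clauses):
            total += weight
    return total

def anytime_dnf(clauses, probs, eps):
    """Anytime loop: the processed prefix gives a lower bound, a union
    bound over the rest an upper bound; stop as soon as the midpoint is
    guaranteed within eps (absolute) of the true probability."""
    for k in range(len(clauses) + 1):
        lo = dnf_prob_exact(clauses[:k], probs)
        hi = min(1.0, lo + sum(clause_prob(c, probs) for c in clauses[k:]))
        if hi - lo <= 2.0 * eps:
            return (lo + hi) / 2.0, lo, hi
```

The loop can be interrupted after any iteration and still report a sound [lo, hi] interval, which is what makes the scheme "anytime"; tightening eps just forces more clauses to be folded in.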