Read-Once Functions and Query Evaluation in Probabilistic Databases
Cited by 23 (2 self)
Probabilistic databases hold promise of being a viable means for large-scale uncertainty management, increasingly needed in a number of real-world application domains. However, query evaluation in probabilistic databases remains a computational challenge. Prior work on efficient exact query evaluation in probabilistic databases has largely concentrated on query-centric formulations (e.g., safe plans, hierarchical queries), in that they consider only characteristics of the query and not the data in the database. It is easy to construct examples where a supposedly hard query run on an appropriate database gives rise to a tractable query evaluation problem. In this paper, we develop efficient query evaluation techniques that leverage characteristics of both the query and the data in the database. We focus on tuple-independent databases, where the query evaluation problem is equivalent to computing marginal probabilities of Boolean formulas associated with the result tuples. Query evaluation is easy if the Boolean formulas can be factorized into a form that has every variable appearing at most once (called read-once); this suggests a naive approach that incorporates previously developed Boolean formula factorization algorithms into the query evaluation. We then develop novel, more efficient factorization algorithms that work for a large subclass of queries (specifically, conjunctive queries without self-joins) by exploiting the unique structure of the result tuple Boolean formulas. We empirically demonstrate that our proposed techniques are (1) orders of magnitude faster than generic inference algorithms when used to evaluate general read-once functions, and (2) for the special case of hierarchical queries, they rival the efficiency of prior techniques specifically designed to handle such queries.
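To illustrate why a read-once form makes evaluation easy, the following is a minimal sketch (not the paper's algorithm) that computes the probability of an already factorized formula over independent variables in one bottom-up pass. The class names `Var`, `And`, `Or` and the function `read_once_probability` are hypothetical names introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class And:
    children: tuple


@dataclass(frozen=True)
class Or:
    children: tuple


def read_once_probability(node, p):
    """Probability of a read-once formula whose variables are independent.

    Because no variable repeats, the children of every And/Or node depend on
    disjoint sets of variables and are therefore independent of each other.
    """
    if isinstance(node, Var):
        return p[node.name]
    if isinstance(node, And):
        result = 1.0
        for child in node.children:
            result *= read_once_probability(child, p)
        return result
    if isinstance(node, Or):
        none_true = 1.0
        for child in node.children:
            none_true *= 1.0 - read_once_probability(child, p)
        return 1.0 - none_true
    raise TypeError(f"unexpected node: {node!r}")


# The formula x1 y1 OR x1 y2 factorizes into the read-once form x1 (y1 OR y2).
formula = And((Var("x1"), Or((Var("y1"), Var("y2")))))
p = {"x1": 0.5, "y1": 0.5, "y2": 0.5}
result = read_once_probability(formula, p)  # 0.5 * (1 - 0.5 * 0.5) = 0.375
```

The pass visits each node once, so the cost is linear in the size of the factorized formula; the hard part, which the abstract addresses, is finding that factorization when it exists.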
Dissociation and Propagation for Efficient Query Evaluation over Probabilistic Databases
2010
Cited by 11 (8 self)
Queries over probabilistic databases are either safe, in which case they can be evaluated entirely in a relational database engine, or unsafe, in which case they need to be evaluated with a general-purpose inference engine at a high cost. This paper proposes a new approach by which every query is evaluated like a safe query inside the database engine, by using a new method called dissociation. A dissociated query is obtained by adding extraneous variables to some atoms until the query becomes safe. We show that the probability of the original query and that of the dissociated query correspond to two well-known scoring functions on graphs, namely graph reliability (which is #P-hard) and the propagation score (which is related to PageRank and is in PTIME): when restricted to graphs, standard query probability is graph reliability, while the dissociated probability is the propagation score. We define a propagation score for conjunctive queries without self-joins and prove (i) that it is always an upper bound for query reliability, and (ii) that both scores coincide for all safe queries. Given the widespread and successful use of graph propagation methods in practice, we argue for the dissociation method as a good and efficient way to rank probabilistic query results, especially for those queries which are highly intractable for exact probabilistic inference.
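The contrast between the two graph scores can be sketched on a small example. The code below is an illustration under assumptions, not the paper's definitions: `reliability` enumerates all edge subsets (exponential time), and `propagation` uses one common recursive form of a propagation score on a DAG, where each node combines the scores of its predecessors as if they were independent. The function names and the example graph are hypothetical.

```python
from itertools import product


def reliability(edges, source, target):
    """Exact source-target reliability by enumerating every edge subset."""
    edge_list = list(edges.items())
    total = 0.0
    for present in product([False, True], repeat=len(edge_list)):
        weight = 1.0
        adjacency = {}
        for keep, ((u, v), p) in zip(present, edge_list):
            weight *= p if keep else 1.0 - p
            if keep:
                adjacency.setdefault(u, []).append(v)
        # Depth-first search over the edges present in this possible world.
        seen, stack = {source}, [source]
        while stack:
            node = stack.pop()
            for nxt in adjacency.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if target in seen:
            total += weight
    return total


def propagation(edges, source, target):
    """Assumed propagation score on a DAG: incoming paths treated as independent."""
    memo = {source: 1.0}

    def score(node):
        if node not in memo:
            none_arrives = 1.0
            for (u, v), p in edges.items():
                if v == node:
                    none_arrives *= 1.0 - score(u) * p
            memo[node] = 1.0 - none_arrives
        return memo[node]

    return score(target)


# Both paths from s to t share the edge (s, a), so they are correlated.
edges = {("s", "a"): 0.5, ("a", "b"): 0.5, ("a", "c"): 0.5,
         ("b", "t"): 0.5, ("c", "t"): 0.5}
exact = reliability(edges, "s", "t")      # 7/32 = 0.21875
bound = propagation(edges, "s", "t")      # 1 - (1 - 1/8)**2 = 0.234375
```

On this graph the propagation score overestimates the reliability because it counts the shared edge once per path, which matches the upper-bound behavior the abstract describes.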
Faster Query Answering in Probabilistic Databases using Read-Once Functions
Cited by 8 (2 self)
A Boolean expression is in read-once form if each of its variables appears exactly once. When the variables denote independent events in a probability space, the probability of the event denoted by the whole expression in read-once form can be computed in polynomial time (whereas the general problem for arbitrary expressions is #P-complete). Known approaches to checking the read-once property seem to require putting these expressions in disjunctive normal form. In this paper, we tell a better story for a large subclass of Boolean event expressions: those that are generated by conjunctive queries without self-joins on tuple-independent probabilistic databases. We first show that, given a tuple-independent representation and the provenance graph of an SPJ query plan without self-joins, we can efficiently compute the co-occurrence graph of a result event expression without using its DNF. From this, the read-once form can already be computed efficiently, if it exists, using existing techniques. Our second and key contribution is a complete, efficient, and simple-to-implement algorithm for computing the read-once forms (whenever they exist) directly, using a new concept, that of a co-table graph, which can be significantly smaller than the co-occurrence graph.
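For orientation, the co-occurrence graph mentioned above is commonly defined from the DNF: its vertices are the variables, and two variables are adjacent when they appear together in some clause. The sketch below builds it that way purely to illustrate the definition; it is the DNF-based route that the paper avoids, and the function name `cooccurrence_graph` and the sample expression are assumptions made here.

```python
from itertools import combinations


def cooccurrence_graph(dnf):
    """Return the set of undirected edges (u, v), u < v, of the co-occurrence graph.

    `dnf` is a list of clauses, each clause a set of variable names that are
    conjoined; the clauses themselves are disjoined.
    """
    edges = set()
    for clause in dnf:
        for u, v in combinations(sorted(clause), 2):
            edges.add((u, v))
    return edges


# x1 y1 OR x1 y2 OR x2 y2
dnf = [{"x1", "y1"}, {"x1", "y2"}, {"x2", "y2"}]
graph = cooccurrence_graph(dnf)
```

Here the graph is the path y1, x1, y2, x2. Building it from the DNF costs time proportional to the DNF size, which can be much larger than the provenance graph the abstract starts from.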
On the Tractability of Query Compilation and Bounded Treewidth
2012
Cited by 5 (0 self)
We consider the problem of computing the probability of a Boolean function, which generalizes the model counting problem. Given an OBDD for such a function, its probability can be computed in time linear in the size of the OBDD. In this paper we investigate the connection between treewidth and the size of the OBDD. Bounded treewidth has proven to be applicable to many graph problems that are NP-hard in general but become tractable on graphs with bounded treewidth. However, it is less well understood how bounded treewidth can be used for the probability computation problem of a Boolean function. We introduce a new notion of treewidth of a Boolean function, called the expression treewidth, as the smallest treewidth of any DAG-expression representing the function. Our new no
Queries with Difference on Probabilistic Databases
Cited by 3 (1 self)
We study the feasibility of the exact and approximate computation of the probability of relational queries with difference on tuple-independent databases. We show that even the difference between two “safe” conjunctive queries without self-joins is “unsafe” for exact computation. We turn to approximation and design an FPRAS for a large class of relational queries with difference, limited by how difference is nested and by the nature of the subtracted subqueries. We give examples of inapproximable queries outside this class.
Local Structure and Determinism in Probabilistic Databases
Cited by 2 (0 self)
While extensive work has been done on evaluating queries over tuple-independent probabilistic databases, query evaluation over correlated data has received much less attention, even though support for correlations is essential for many natural applications of probabilistic databases, e.g., information extraction, data integration, and computer vision. In this paper, we develop a novel approach for efficiently evaluating probabilistic queries over correlated databases where correlations are represented using a factor graph, a class of graphical models widely used for capturing correlations and performing statistical inference. Our approach exploits the specific values of the factor parameters and the determinism in the correlations, collectively called local structure, to reduce the complexity of query evaluation. Our framework is based on arithmetic circuits, factorized representations of probability distributions that can exploit such local structure. Traditionally, arithmetic circuits are generated following a compilation process and cannot be updated directly. We introduce a generalization of arithmetic circuits, called annotated arithmetic circuits, and a novel algorithm for updating them, which enables us to answer probabilistic queries efficiently. We present a comprehensive experimental analysis and show speedups of at least one order of magnitude in many cases.
Approximate Lifted Inference with Probabilistic Databases
Cited by 2 (1 self)
This paper proposes a new approach for approximate evaluation of #P-hard queries with probabilistic databases. In our approach, every query is evaluated entirely in the database engine by evaluating a fixed number of query plans, each providing an upper bound on the true probability, then taking their minimum. We provide an algorithm that takes into account important schema information to enumerate only the minimal necessary plans among all possible plans. Importantly, this algorithm is a strict generalization of all known results on PTIME self-join-free conjunctive queries: a query is safe if and only if our algorithm returns one single plan. We also apply three relational query optimization techniques to evaluate all minimal safe plans very fast. We give a detailed experimental evaluation of our approach and, in the process, provide a new way of thinking about the value of probabilistic methods over non-probabilistic methods for ranking query answers.
Oblivious bounds on the probability of Boolean functions
ACM Trans. Database Syst. (TODS)
Cited by 2 (2 self)
This paper develops upper and lower bounds for the probability of Boolean functions by treating multiple occurrences of variables as independent and assigning them new individual probabilities. We call this approach dissociation and give an exact characterization of optimal oblivious bounds, i.e., when the new probabilities are chosen independent of the probabilities of all other variables. Our motivation comes from the weighted model counting problem (or, equivalently, the problem of computing the probability of a Boolean function), which is #P-hard in general. By performing several dissociations, one can transform a Boolean formula whose probability is difficult to compute into one whose probability is easy to compute, and which is guaranteed to provide an upper or lower bound on the probability of the original formula by choosing appropriate probabilities for the dissociated variables. Our new bounds shed light on the connection between previous relaxation-based and model-based approximations and unify them as concrete choices in a larger design space. We also show how our theory allows a standard relational database management system (DBMS) to both upper and lower bound hard probabilistic queries in guaranteed polynomial time.
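A small worked example may help make dissociation concrete. The sketch below checks the bounds by brute force on one formula; it is an illustration, not the paper's characterization. The helper `probability`, the variable names, and the particular choice of new probabilities are assumptions: the upper bound reuses the original probability of x for both copies, and the lower bound uses 1 - sqrt(1 - p), which can be verified algebraically to bound this specific formula from below.

```python
from itertools import product
from math import sqrt


def probability(formula, p):
    """Brute-force probability of `formula` over independent variables in `p`."""
    names = sorted(p)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        world = dict(zip(names, values))
        weight = 1.0
        for name in names:
            weight *= p[name] if world[name] else 1.0 - p[name]
        if formula(world):
            total += weight
    return total


def original(w):
    # x occurs twice: x y OR x z
    return (w["x"] and w["y"]) or (w["x"] and w["z"])


def dissociated(w):
    # The two occurrences of x become independent copies x1 and x2.
    return (w["x1"] and w["y"]) or (w["x2"] and w["z"])


exact = probability(original, {"x": 0.5, "y": 0.5, "z": 0.5})

# Upper bound: both copies keep the original probability of x.
upper = probability(dissociated, {"x1": 0.5, "x2": 0.5, "y": 0.5, "z": 0.5})

# Lower bound: copies get p' with (1 - p')**2 == 1 - p(x).
p_low = 1.0 - sqrt(1.0 - 0.5)
lower = probability(dissociated, {"x1": p_low, "x2": p_low, "y": 0.5, "z": 0.5})
```

With all original probabilities set to 0.5, the exact value is 0.375 and the upper bound is 0.4375. The dissociated formula is read-once, so in practice its probability would be computed directly rather than by enumeration; the brute force here only serves to check the inequalities.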
Oblivious Bounds on the Probability of Boolean Functions
2013
Cited by 1 (1 self)
This paper develops upper and lower bounds for the probability of Boolean functions by treating multiple occurrences of variables as independent and assigning them new individual probabilities. We call this approach dissociation and give an exact characterization of optimal oblivious bounds, i.e., when the new probabilities are chosen independent of the probabilities of all other variables. Our motivation comes from the weighted model counting problem (or, equivalently, the problem of computing the probability of a Boolean function), which is #P-hard in general. By performing several dissociations, one can transform a Boolean formula whose probability is difficult to compute into one whose probability is easy to compute, and which is guaranteed to provide an upper or lower bound, respectively, on the probability of the original formula. Our new bounds shed light on the connection between previous relaxation-based and model-based approximations in the literature and unify them as concrete choices in a larger design space. We also show how our theory allows a standard relational database management system (DBMS) to both upper and lower bound hard probabilistic queries.
Deliverable D4.1
2010
Objectives of WP4: The broad objective is to develop a framework to deal with missing data, in particular to develop new techniques and algorithms for handling (1) missing values in XML documents through an approach based on uncertainty, and (2) missing data through automatic recovery. The two objectives are represented by tasks T4.1 (develop a foundational framework for dealing with missing data in XML) and T4.2 (develop a foundational framework for recovering missing metadata), respectively. Task T4.3 (integrate a prototype implementation of the new algorithms into the Software Library of T1.1 of WP1) is scheduled to start later. Main results: The key achievements of the first year are on (i) modelling uncertainty in XML, (ii) tractability of the main computational tasks, and (iii) regular expression inference. The first key achievement includes comprehensive analyses of expressiveness and succinctness of probabilistic XML models based on recursive Markov chains, and of the interaction of incompleteness and constraints. The second achievement includes efficient algorithms for query evaluation on incomplete XML, and exact and approximate query evaluation algorithms on probabilistic data. As both achievements discuss modelling aspects of missing data as well as the development of new tools and algorithms, they contribute to milestones I and III mentioned above. The third main achievement addresses the issue of missing data through automatic recovery. In particular, we developed new methods and tools for deterministic regular expression inference, which constitutes one of the cornerstones of XML Schema (XSD) inference. As the latter will be included in the schema library, the third main achievement contributes to milestones II and III of the project. Dissemination: The results are published in [1, 6, 8, 2, 11, 12, 7, 4, 5]. The full version of [1] is under submission to the Journal of the ACM, the top journal for computer science research.
Paper [3] is under submission. Publication venues include internationally leading database and