Results 1 - 10 of 13
Read-Once Functions and Query Evaluation in Probabilistic Databases
"... Probabilistic databases hold promise of being a viable means for large-scale uncertainty management, increasingly needed in a number of real world applications domains. However, query evaluation in probabilistic databases remains a computational challenge. Prior work on efficient exact query evaluat ..."
Abstract - Cited by 23 (2 self)
Probabilistic databases hold the promise of being a viable means for large-scale uncertainty management, increasingly needed in a number of real-world application domains. However, query evaluation in probabilistic databases remains a computational challenge. Prior work on efficient exact query evaluation in probabilistic databases has largely concentrated on query-centric formulations (e.g., safe plans, hierarchical queries), in that they only consider characteristics of the query and not the data in the database. It is easy to construct examples where a supposedly hard query run on an appropriate database gives rise to a tractable query evaluation problem. In this paper, we develop efficient query evaluation techniques that leverage characteristics of both the query and the data in the database. We focus on tuple-independent databases, where the query evaluation problem is equivalent to computing marginal probabilities of Boolean formulas associated with the result tuples. Query evaluation is easy if the Boolean formulas can be factorized into a form in which every variable appears at most once (called read-once); this suggests a naive approach that incorporates previously developed Boolean formula factorization algorithms into query evaluation. We then develop novel, more efficient factorization algorithms that work for a large subclass of queries (specifically, conjunctive queries without self-joins) by exploiting the unique structure of the result tuples' Boolean formulas. We empirically demonstrate that our proposed techniques are (1) orders of magnitude faster than generic inference algorithms when used to evaluate general read-once functions, and (2) for the special case of hierarchical queries, they rival the efficiency of prior techniques specifically designed to handle such queries.
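To make the read-once observation concrete, here is a minimal Python sketch (not the paper's factorization algorithm) that evaluates the probability of an already factorized read-once lineage bottom-up; the formula, probabilities, and node encoding are illustrative assumptions.

```python
from functools import reduce

def prob(node):
    """Probability of a read-once formula over independent variables.

    node is ('var', p), ('and', children) or ('or', children). Since every
    variable appears at most once, sibling subtrees are independent, so AND
    multiplies probabilities and OR combines complements.
    """
    kind, payload = node
    if kind == 'var':
        return payload
    child = [prob(c) for c in payload]
    if kind == 'and':
        return reduce(lambda a, b: a * b, child, 1.0)
    return 1.0 - reduce(lambda a, b: a * (1.0 - b), child, 1.0)

# Read-once form x1 (y1 OR y2) OR x2 y3 of a result tuple's lineage.
f = ('or', [('and', [('var', 0.9), ('or', [('var', 0.5), ('var', 0.4)])]),
            ('and', [('var', 0.2), ('var', 0.7)])])
print(prob(f))  # ~0.6818, computed in time linear in the formula size
```

If the lineage is not read-once, this bottom-up rule is no longer exact, which is where the paper's factorization algorithms come in.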
Dissociation and Propagation for Efficient Query Evaluation over Probabilistic Databases
, 2010
"... Queries over probabilistic databases are either safe, in which case they can be evaluated entirely in a relational database engine, or unsafe, in which case they need to be evaluated with a general-purpose inference engine at a high cost. This paper proposes a new approach by which every query is e ..."
Abstract - Cited by 11 (8 self)
Queries over probabilistic databases are either safe, in which case they can be evaluated entirely in a relational database engine, or unsafe, in which case they need to be evaluated with a general-purpose inference engine at a high cost. This paper proposes a new approach by which every query is evaluated like a safe query inside the database engine, using a new method called dissociation. A dissociated query is obtained by adding extraneous variables to some atoms until the query becomes safe. We show that the probability of the original query and that of the dissociated query correspond to two well-known scoring functions on graphs, namely graph reliability (which is #P-hard) and the propagation score (which is related to PageRank and is in PTIME): when restricted to graphs, standard query probability is graph reliability, while the dissociated probability is the propagation score. We define a propagation score for conjunctive queries without self-joins and prove (i) that it is always an upper bound for query reliability, and (ii) that both scores coincide for all safe queries. Given the widespread and successful use of graph propagation methods in practice, we argue for the dissociation method as a good and efficient way to rank probabilistic query results, especially for those queries which are intractable for exact probabilistic inference.
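As a numeric illustration of the reliability-versus-propagation claim (not the paper's algorithm), the sketch below compares exact s-t reliability on a tiny made-up DAG of independent probabilistic edges with a propagation score computed in topological order while treating incoming contributions as independent; on this instance the propagation score indeed upper-bounds reliability.

```python
from itertools import product

# Hypothetical DAG with independent edge probabilities (names are illustrative).
edges = {('s', 'a'): 0.5, ('s', 'b'): 0.5, ('a', 'b'): 0.5,
         ('a', 't'): 0.5, ('b', 't'): 0.5}
topo = ['s', 'a', 'b', 't']  # a topological order of the nodes

def reaches(present):
    """True if t is reachable from s using only the edges in `present`."""
    seen = {'s'}
    for _ in topo:                      # enough relaxation rounds for this DAG
        for (u, v) in present:
            if u in seen:
                seen.add(v)
    return 't' in seen

# Exact s-t reliability (#P-hard in general): enumerate all edge subsets.
reliability = 0.0
for bits in product([0, 1], repeat=len(edges)):
    present = [e for e, b in zip(edges, bits) if b]
    weight = 1.0
    for e, b in zip(edges, bits):
        weight *= edges[e] if b else 1.0 - edges[e]
    if reaches(present):
        reliability += weight

# Propagation score (PTIME): push scores forward, treating the contributions
# of incoming edges as independent -- the dissociated version of the query.
score = {'s': 1.0}
for v in topo[1:]:
    miss = 1.0
    for (u, w), p in edges.items():
        if w == v:
            miss *= 1.0 - score[u] * p
    score[v] = 1.0 - miss

print(reliability, score['t'])   # 0.46875 vs 0.484375 on this toy graph
assert score['t'] >= reliability
```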
Faster Query Answering in Probabilistic Databases using Read-Once Functions
"... A boolean expression is in read-once form if each of its variables appears exactly once. When the variables denote independent events in a probability space, the probability of the event denoted by the whole expression in read-once form can be computed in polynomial time (whereas the general problem ..."
Abstract - Cited by 8 (2 self)
A Boolean expression is in read-once form if each of its variables appears exactly once. When the variables denote independent events in a probability space, the probability of the event denoted by the whole expression in read-once form can be computed in polynomial time (whereas the general problem for arbitrary expressions is #P-complete). Known approaches to checking the read-once property seem to require putting these expressions in disjunctive normal form. In this paper, we tell a better story for a large subclass of Boolean event expressions: those that are generated by conjunctive queries without self-joins on tuple-independent probabilistic databases. We first show that, given a tuple-independent representation and the provenance graph of an SPJ query plan without self-joins, we can efficiently compute the co-occurrence graph of a result event expression without using its DNF. From this, the read-once form, if it exists, can already be computed efficiently using existing techniques. Our second and key contribution is a complete, efficient, and simple-to-implement algorithm for computing the read-once forms (whenever they exist) directly, using a new concept, the co-table graph, which can be significantly smaller than the co-occurrence graph.
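For illustration only, the sketch below builds the co-occurrence graph (variables adjacent iff they occur together in some clause) directly from a tiny made-up DNF lineage; the paper's point is precisely that this graph can be computed from the provenance without materializing the DNF, and that the co-table graph can be smaller still.

```python
from itertools import combinations

# Illustrative DNF lineage of one result tuple; each clause is a set of
# tuple variables (hypothetical names).
clauses = [{'r1', 's1'}, {'r1', 's2'}, {'r2', 's3'}]

cooccurrence = {v: set() for clause in clauses for v in clause}
for clause in clauses:
    for x, y in combinations(sorted(clause), 2):
        cooccurrence[x].add(y)
        cooccurrence[y].add(x)

for v in sorted(cooccurrence):
    print(v, sorted(cooccurrence[v]))
# Edges: r1-s1, r1-s2, r2-s3. Here the graph is P4-free, and the lineage has
# the read-once form r1 (s1 OR s2) OR r2 s3, recoverable with existing
# (e.g., cotree-based) techniques as the abstract notes.
```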
On the Tractability of Query Compilation and Bounded Treewidth
, 2012
"... We consider the problem of computing the probability of a Boolean function, which generalizes the model counting problem. Given an OBDD for such a function, its probability can be computed in linear time in the size of the OBDD. In this paper we investigate the connection between treewidth and the s ..."
Abstract - Cited by 5 (0 self)
We consider the problem of computing the probability of a Boolean function, which generalizes the model counting problem. Given an OBDD for such a function, its probability can be computed in linear time in the size of the OBDD. In this paper we investigate the connection between treewidth and the size of the OBDD. Bounded treewidth has proven to be applicable to many graph problems that are NP-hard in general but become tractable on graphs of bounded treewidth. However, it is less well understood how bounded treewidth can be used for the problem of computing the probability of a Boolean function. We introduce a new notion of treewidth of a Boolean function, called the expression treewidth, as the smallest treewidth of any DAG-expression representing the function.
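A minimal sketch of the linear-time step mentioned at the start of the abstract: given an OBDD (here hand-built with an assumed node encoding) and independent variable probabilities, one bottom-up pass over the shared DAG yields the probability of the function.

```python
# Hand-built OBDD for (x AND y) OR z with order x < y < z; each internal node
# is (variable, low_child, high_child) and terminals are True/False.
prob = {'x': 0.5, 'y': 0.3, 'z': 0.8}      # independent marginals (illustrative)
nodes = {
    'nz': ('z', False, True),
    'ny': ('y', 'nz', True),
    'nx': ('x', 'nz', 'ny'),               # the z-node is shared
}

memo = {True: 1.0, False: 0.0}

def obdd_prob(node):
    """Each node is visited once, so this is linear in the OBDD size."""
    if node not in memo:
        var, lo, hi = nodes[node]
        memo[node] = (1 - prob[var]) * obdd_prob(lo) + prob[var] * obdd_prob(hi)
    return memo[node]

print(obdd_prob('nx'))   # 0.83 = P((x AND y) OR z)
```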
Queries with Difference on Probabilistic Databases
"... We study the feasibility of the exact and approximate computation of the probability of relational queries with difference on tuple-independent databases. We show that even the difference between two “safe ” conjunctive queries without self-joins is “unsafe ” for exact computation. We turn to approx ..."
Abstract - Cited by 3 (1 self)
We study the feasibility of the exact and approximate computation of the probability of relational queries with difference on tuple-independent databases. We show that even the difference between two “safe” conjunctive queries without self-joins is “unsafe” for exact computation. We turn to approximation and design an FPRAS for a large class of relational queries with difference, limited by how difference is nested and by the nature of the subtracted subqueries. We give examples of inapproximable queries outside this class.
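The abstract's target quantity, for one answer tuple, is the probability that the first query's lineage holds and the second's does not. Purely as an illustration of that quantity (this naive Monte Carlo is not the paper's FPRAS, and the lineages and probabilities are made up), one can sample tuple-independent possible worlds:

```python
import random

# Hypothetical tuple variables with independent marginal probabilities.
p = {'r1': 0.6, 'r2': 0.5, 's1': 0.7, 's2': 0.4}

# Illustrative lineages of one answer tuple for Q1 and Q2; the difference
# query asks for P(F1 and not F2).
def F1(w): return (w['r1'] and w['s1']) or (w['r2'] and w['s2'])
def F2(w): return w['r2'] and w['s1']

def estimate(n=200_000, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        world = {t: rng.random() < q for t, q in p.items()}
        hits += F1(world) and not F2(world)
    return hits / n

print(estimate())   # approximates P(F1 AND NOT F2) for this difference query
```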
Local Structure and Determinism in Probabilistic Databases
"... While extensive work has been done on evaluating queries over tuple-independent probabilistic databases, query evaluation over correlated data has received much less attention even though the support for correlations is essential for many natural applications of probabilistic databases, e.g., inform ..."
Abstract - Cited by 2 (0 self)
While extensive work has been done on evaluating queries over tuple-independent probabilistic databases, query evaluation over correlated data has received much less attention, even though support for correlations is essential for many natural applications of probabilistic databases, e.g., information extraction, data integration, and computer vision. In this paper, we develop a novel approach for efficiently evaluating probabilistic queries over correlated databases where correlations are represented using factor graphs, a class of graphical models widely used for capturing correlations and performing statistical inference. Our approach exploits the specific values of the factor parameters and the determinism in the correlations, collectively called local structure, to reduce the complexity of query evaluation. Our framework is based on arithmetic circuits, factorized representations of probability distributions that can exploit such local structure. Traditionally, arithmetic circuits are generated by a compilation process and cannot be updated directly. We introduce a generalization of arithmetic circuits, called annotated arithmetic circuits, and a novel algorithm for updating them, which enables us to answer probabilistic queries efficiently. We present a comprehensive experimental analysis and show speed-ups of at least one order of magnitude in many cases.
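To make "arithmetic circuits as factorized representations" concrete, here is a tiny evaluator over an assumed node encoding for a two-variable model A → B with illustrative parameters; it shows ordinary circuit evaluation only, not the paper's annotated circuits or their update algorithm.

```python
# Leaves are ('param', value) for factor parameters and ('ind', name) for
# evidence indicators; internal nodes are ('+', children) or ('*', children).
# Circuit for A -> B with P(a)=0.6, P(b|a)=0.9, P(b|~a)=0.2 (made-up values).
ac = ('+', [
    ('*', [('ind', 'a'),  ('param', 0.6),
           ('+', [('*', [('ind', 'b'),  ('param', 0.9)]),
                  ('*', [('ind', '~b'), ('param', 0.1)])])]),
    ('*', [('ind', '~a'), ('param', 0.4),
           ('+', [('*', [('ind', 'b'),  ('param', 0.2)]),
                  ('*', [('ind', '~b'), ('param', 0.8)])])]),
])

def evaluate(node, indicators):
    kind, payload = node
    if kind == 'param':
        return payload
    if kind == 'ind':
        return indicators[payload]
    if kind == '+':
        return sum(evaluate(c, indicators) for c in payload)
    out = 1.0                            # kind == '*'
    for c in payload:
        out *= evaluate(c, indicators)
    return out

# P(B=b): sum out A (both indicators on), clamp B's indicators to the evidence.
print(evaluate(ac, {'a': 1, '~a': 1, 'b': 1, '~b': 0}))  # 0.6*0.9 + 0.4*0.2 = 0.62
```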
Approximate Lifted Inference with Probabilistic Databases
"... This paper proposes a new approach for approximate evaluation of #P-hard queries with probabilistic databases. In our approach, every query is evaluated entirely in the database engine by evaluat-ing a fixed number of query plans, each providing an upper bound on the true probability, then taking th ..."
Abstract - Cited by 2 (1 self)
This paper proposes a new approach for approximate evaluation of #P-hard queries with probabilistic databases. In our approach, every query is evaluated entirely in the database engine by evaluating a fixed number of query plans, each providing an upper bound on the true probability, then taking their minimum. We provide an algorithm that takes into account important schema information to enumerate only the minimal necessary plans among all possible plans. Importantly, this algorithm is a strict generalization of all known results on PTIME self-join-free conjunctive queries: a query is safe if and only if our algorithm returns a single plan. We also apply three relational query optimization techniques to evaluate all minimal safe plans very fast. We give a detailed experimental evaluation of our approach and, in the process, provide a new way of thinking about the value of probabilistic methods over non-probabilistic methods for ranking query answers.
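A numeric sketch of the "minimum over plan bounds" idea for the classic unsafe query q :- R(x), S(x,y), T(y) on a made-up instance (this is not the paper's plan-enumeration algorithm, and the two plans below are just the obvious groupings): each dissociated plan over-estimates the exact answer probability, and their minimum is the reported bound.

```python
from itertools import product

# Illustrative instance (tuple probabilities made up; S is deterministic here).
R = {1: 0.9, 2: 0.5}            # x -> P(R(x))
T = {1: 0.5, 2: 0.8}            # y -> P(T(y))
S = {(1, 1), (1, 2), (2, 1)}    # certain S-tuples

# Exact answer probability by brute force over the possible worlds of R and T.
exact = 0.0
for rbits in product([0, 1], repeat=len(R)):
    for tbits in product([0, 1], repeat=len(T)):
        rw = dict(zip(R, rbits))
        tw = dict(zip(T, tbits))
        weight = 1.0
        for x in R:
            weight *= R[x] if rw[x] else 1.0 - R[x]
        for y in T:
            weight *= T[y] if tw[y] else 1.0 - T[y]
        if any(rw[x] and tw[y] for (x, y) in S):
            exact += weight

# Two dissociated ("safe-looking") plans, each an upper bound on the answer.
def plan_group_by_x():
    miss = 1.0
    for x in R:
        inner = 1.0
        for (x2, y) in S:
            if x2 == x:
                inner *= 1.0 - T[y]
        miss *= 1.0 - R[x] * (1.0 - inner)
    return 1.0 - miss

def plan_group_by_y():
    miss = 1.0
    for y in T:
        inner = 1.0
        for (x, y2) in S:
            if y2 == y:
                inner *= 1.0 - R[x]
        miss *= 1.0 - T[y] * (1.0 - inner)
    return 1.0 - miss

bounds = [plan_group_by_x(), plan_group_by_y()]
print(exact, min(bounds))       # 0.835 vs 0.853 on this instance
assert min(bounds) >= exact     # each plan over-estimates; take the minimum
```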
Oblivious bounds on the probability of Boolean functions
- ACM Trans. Database Syst. (TODS)
"... This paper develops upper and lower bounds for the probability of Boolean functions by treating multiple occurrences of variables as independent and assigning them new individual probabilities. We call this ap-proach dissociation and give an exact characterization of optimal oblivious bounds, i.e. w ..."
Abstract - Cited by 2 (2 self)
This paper develops upper and lower bounds for the probability of Boolean functions by treating multiple occurrences of variables as independent and assigning them new individual probabilities. We call this approach dissociation and give an exact characterization of optimal oblivious bounds, i.e., when the new probabilities are chosen independent of the probabilities of all other variables. Our motivation comes from the weighted model counting problem (or, equivalently, the problem of computing the probability of a Boolean function), which is #P-hard in general. By performing several dissociations, one can transform a Boolean formula whose probability is difficult to compute into one whose probability is easy to compute, and which is guaranteed to provide an upper or lower bound on the probability of the original formula by choosing appropriate probabilities for the dissociated variables. Our new bounds shed light on the connection between previous relaxation-based and model-based approximations and unify them as concrete choices in a larger design space. We also show how our theory allows a standard relational database management system (DBMS) to both upper and lower bound hard probabilistic queries in guaranteed polynomial time.
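A small numeric check of the dissociation idea on f = (x AND y) OR (x AND z), where x occurs twice: keeping the original probability for both copies gives an upper bound, while a symmetric choice with (1-q)^2 = 1-p_x gives a lower bound on this example. The parameter choice follows my reading of the abstract's optimal oblivious bounds; all numbers are illustrative, and the assert only checks this instance.

```python
# f = (x AND y) OR (x AND z): x occurs twice, so evaluating the occurrences
# as if independent is not exact. Dissociate x into copies x1, x2 and pick
# their probabilities obliviously (looking only at p_x).
px, py, pz = 0.3, 0.6, 0.7                       # illustrative marginals

exact = px * (1 - (1 - py) * (1 - pz))           # f == x AND (y OR z)

def dissociated(q1, q2):
    """P((x1 AND y) OR (x2 AND z)) with independent copies x1, x2."""
    return 1 - (1 - q1 * py) * (1 - q2 * pz)

upper = dissociated(px, px)                      # copies keep the original p_x
q = 1 - (1 - px) ** 0.5                          # symmetric: (1-q)^2 = 1-p_x
lower = dissociated(q, q)
print(lower, exact, upper)
assert lower <= exact <= upper
```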
Oblivious Bounds on the Probability of Boolean Functions
, 2013
"... This paper develops upper and lower bounds for the probability of Boolean functions by treating multiple occurrences of variables as independent and assigning them new individual probabilities. We call this approach dissociation and give an exact characterization of optimal oblivious bounds, i.e. wh ..."
Abstract - Cited by 1 (1 self)
This paper develops upper and lower bounds for the probability of Boolean functions by treating multiple occurrences of variables as independent and assigning them new individual probabilities. We call this approach dissociation and give an exact characterization of optimal oblivious bounds, i.e., when the new probabilities are chosen independent of the probabilities of all other variables. Our motivation comes from the weighted model counting problem (or, equivalently, the problem of computing the probability of a Boolean function), which is #P-hard in general. By performing several dissociations, one can transform a Boolean formula whose probability is difficult to compute into one whose probability is easy to compute, and which is guaranteed to provide an upper or lower bound, respectively, on the probability of the original formula. Our new bounds shed light on the connection between previous relaxation-based and model-based approximations in the literature and unify them as concrete choices in a larger design space. We also show how our theory allows a standard relational database management system (DBMS) to both upper and lower bound hard probabilistic queries.
Deliverable D4.1
, 2010
"... Objectives of WP4 The broad objective is to develop a framework to deal with missing data, in particular to develop new techniques and algorithms for handling 1. missing values in XML documents through an approach based on uncertainty, and 2. missing data through automatic recovery. The two objectiv ..."
Abstract
Objectives of WP4: The broad objective is to develop a framework to deal with missing data, in particular to develop new techniques and algorithms for handling (1) missing values in XML documents through an approach based on uncertainty, and (2) missing data through automatic recovery. The two objectives are represented by tasks T4.1 (develop a foundational framework for dealing with missing data in XML) and T4.2 (develop a foundational framework for recovering missing metadata), respectively. Task T4.3 (integrate a prototype implementation of the new algorithms into the Software Library of T1.1 of WP1) is scheduled to start later. Main results: The key achievements of the first year concern (i) modelling uncertainty in XML, (ii) tractability of the main computational tasks, and (iii) regular expression inference. The first key achievement includes comprehensive analyses of the expressiveness and succinctness of probabilistic XML models based on recursive Markov chains, and of the interaction of incompleteness and constraints. The second achievement includes efficient algorithms for query evaluation on incomplete XML, and exact and approximate query evaluation algorithms on probabilistic data. As both achievements address modelling aspects of missing data as well as the development of new tools and algorithms, they contribute to milestones I and III mentioned above. The third main achievement addresses the issue of missing data through automatic recovery. In particular, we developed new methods and tools for deterministic regular expression inference, which constitutes one of the cornerstones of XML Schema (XSD) inference. As the latter will be included in the schema library, the third main achievement contributes to milestones II and III of the project. Dissemination: The results are published in [1, 6, 8, 2, 11, 12, 7, 4, 5]. The full version of [1] is under submission to the Journal of the ACM, the top journal for computer science research. Paper [3] is under submission.