ULDBs: Databases with uncertainty and lineage
- In VLDB
, 2006
Cited by 310 (32 self)
This paper introduces ULDBs, an extension of relational databases with simple yet expressive constructs for representing and manipulating both lineage and uncertainty. Uncertain data and data lineage are two important areas of data management that have been considered extensively in isolation; however, many applications require the features in tandem. Fundamentally, lineage enables simple and consistent representation of uncertain data, it correlates uncertainty in query results with uncertainty in the input data, and query processing with lineage and uncertainty together presents computational benefits over treating them separately. We show that the ULDB representation is complete, and that it permits straightforward implementation of many relational operations. We define two notions of ULDB minimality, data-minimal and lineage-minimal, and study minimization of ULDB representations under both notions. With lineage, derived relations are no longer self-contained: their uncertainty depends on uncertainty in the base data. We provide an algorithm for the new operation of extracting a database subset in the presence of interconnected uncertainty. Finally, we show how ULDBs enable a new approach to query processing in probabilistic databases. ULDBs form the basis of the Trio system under development at Stanford.
Probabilistic data exchange
- In Proc. ICDT
, 2010
Cited by 40 (7 self)
The work reported here lays the foundations of data exchange in the presence of probabilistic data. This requires rethinking the very basic concepts of traditional data exchange, such as solution, universal solution, and the certain answers of target queries. We develop a framework for data exchange over probabilistic databases, and make a case for its coherence and robustness. This framework applies to arbitrary schema mappings, and finite or countably infinite probability spaces on the source and target instances. After establishing this framework and formulating the key concepts, we study the application of the framework to a concrete and practical setting where probabilistic databases are compactly encoded by means of annotations formulated over random Boolean variables. In this setting, we study the problems of testing for the existence of solutions and universal solutions, materializing such solutions, and evaluating target queries (for unions of conjunctive queries) in both the exact sense and the approximate sense. For each of the problems, we carry out a complexity analysis based on properties of the annotation, in various classes of dependencies. Finally, we show that the framework and results easily and completely generalize to allow not only the data, but also the schema mapping itself to be probabilistic.
Exploiting Shared Correlations in Probabilistic Databases
, 2008
Cited by 36 (6 self)
There has been a recent surge in work in probabilistic databases, propelled in large part by the huge increase in noisy data sources: sensor data, experimental data, data from uncurated sources, and many others. There is a growing need for database management systems that can efficiently represent and query such data. In this work, we show how data characteristics can be leveraged to make the query evaluation process more efficient. In particular, we exploit what we refer to as shared correlations, where the same uncertainties and correlations occur repeatedly in the data. Shared correlations occur mainly for two reasons: (1) uncertainty and correlations usually come from general statistics and rarely vary on a tuple-to-tuple basis; (2) the query evaluation procedure itself tends to re-introduce the same correlations. Prior work has shown that the query evaluation problem on probabilistic databases is equivalent to a probabilistic inference problem on an appropriately constructed probabilistic graphical model (PGM). We leverage this by introducing a new data structure, called the random variable elimination graph (rv-elim graph), that can be built from the PGM obtained from query evaluation. We develop techniques based on bisimulation that can be used to compress the rv-elim graph by exploiting the presence of shared correlations in the PGM; the compressed rv-elim graph can then be used to run inference. We validate our methods by evaluating them empirically and show that even with a few shared correlations significant speed-ups are possible.
Containment of Conjunctive Queries on Annotated Relations
Cited by 30 (6 self)
We study containment and equivalence of (unions of) conjunctive queries on relations annotated with elements of a commutative semiring. Such relations and the semantics of positive relational queries on them were introduced in a recent paper as a generalization of set semantics, bag semantics, incomplete databases, and databases annotated with various kinds of provenance information. We obtain positive decidability results and complexity characterizations for databases with lineage, why-provenance, and provenance polynomial annotations, for both conjunctive queries and unions of conjunctive queries. At least one of these results is surprising given that provenance polynomial annotations seem “more expressive” than bag semantics and under the latter, containment of unions of conjunctive queries is known to be undecidable. The decision procedures rely on interesting variations on the notion of containment mappings. We also show that for any positive semiring (a very large class) and conjunctive queries without self-joins, equivalence is the same as isomorphism.
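To illustrate what "relations annotated with elements of a commutative semiring" can look like, here is a minimal Python sketch. It is not taken from the paper: the names `Semiring`, `BOOL`, `BAG`, and `join_project`, the dict-based relation encoding, and the sample data are all assumptions made for illustration. It evaluates the single conjunctive query Q(x) :- R(x, y), S(y), multiplying annotations for the join and adding them for the projection, so that the same code yields set semantics under the Boolean semiring and bag semantics under the natural numbers.

```python
from collections import namedtuple

# A commutative semiring described by its two constants and two operations.
Semiring = namedtuple("Semiring", "zero one plus times")

BOOL = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)  # set semantics
BAG = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)             # bag semantics


def join_project(K, R, S):
    """Evaluate Q(x) :- R(x, y), S(y) over K-annotated relations.

    R maps (x, y) tuples to annotations, S maps (y,) tuples to annotations.
    Joining multiplies annotations; projecting away y adds them.
    """
    out = {}
    for (x, y), r_ann in R.items():
        s_ann = S.get((y,), K.zero)  # absent tuples carry the zero annotation
        term = K.times(r_ann, s_ann)
        out[(x,)] = K.plus(out.get((x,), K.zero), term)
    # Tuples annotated with zero are not part of the result.
    return {t: a for t, a in out.items() if a != K.zero}


# Hypothetical sample data: annotations are multiplicities.
R_bag = {("a", "b"): 2, ("a", "c"): 1, ("d", "e"): 3}
S_bag = {("b",): 3, ("c",): 1}

# The same relations viewed as sets: every present tuple is annotated True.
R_set = {t: True for t in R_bag}
S_set = {t: True for t in S_bag}

bag_result = join_project(BAG, R_bag, S_bag)
set_result = join_project(BOOL, R_set, S_set)
print(bag_result, set_result)
```

Under bag semantics the answer tuple ("a",) receives multiplicity 2*3 + 1*1 = 7, while under set semantics it is simply present; ("d",) drops out in both cases because S has no matching tuple.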
The Dichotomy of Probabilistic Inference for Unions of Conjunctive Queries
Cited by 16 (7 self)
We study the complexity of computing the probability of a query on a probabilistic database. The queries that we consider are unions of conjunctive queries, UCQ: equivalently, these are positive, existential First Order Logic sentences, or non-recursive datalog programs. The databases that we consider are tuple-independent. We prove the following dichotomy theorem. For every UCQ query, either its probability can be computed in polynomial time in the size of the database, or it is hard for FP^#P. Our result also has applications to the problem of computing the probability of positive, Boolean expressions, and establishes a dichotomy for such classes based on their structure. For the tractable case, we give a very simple algorithm that alternates between two steps: applying the inclusion/exclusion formula, and removing one existential variable. A key and novel feature of this algorithm is that it avoids computing terms that cancel out in the inclusion/exclusion formula; in other words, it only computes those terms whose Möbius function in an appropriate lattice is non-zero. We show that this simple feature is a key ingredient needed to ensure completeness. For the hardness proof, we give a reduction from the counting problem for positive, partitioned 2CNF, which is known to be #P-complete. The hardness proof is non-trivial, and uses techniques from logic and from classical algebra.
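As a rough illustration of the setting (not the paper's algorithm), the Python sketch below computes the probability of a small Boolean query on a tuple-independent database by enumerating all possible worlds, and checks the result against the inclusion/exclusion formula for a union of two conjunctive queries. The function name `query_probability`, the tuple labels, and the probabilities are hypothetical. The enumeration is exponential in the number of tuples and serves only as a reference point.

```python
from itertools import product


def query_probability(tuple_probs, query):
    """Sum the probabilities of all possible worlds in which `query` holds.

    tuple_probs maps each tuple label to its marginal probability; tuples are
    assumed independent. `query` takes the set of present tuples and returns
    a bool.
    """
    names = list(tuple_probs)
    total = 0.0
    for present in product([False, True], repeat=len(names)):
        world = {n for n, is_in in zip(names, present) if is_in}
        weight = 1.0
        for n, is_in in zip(names, present):
            weight *= tuple_probs[n] if is_in else 1.0 - tuple_probs[n]
        if query(world):
            total += weight
    return total


# Hypothetical database with three independent tuples.
probs = {"R(a)": 0.5, "S(a)": 0.4, "T(a)": 0.25}


def q1(world):
    # Conjunctive query q1 = R(a) AND S(a)
    return "R(a)" in world and "S(a)" in world


def q2(world):
    # Conjunctive query q2 = S(a) AND T(a)
    return "S(a)" in world and "T(a)" in world


# Probability of the union q1 OR q2 by brute-force enumeration.
brute = query_probability(probs, lambda w: q1(w) or q2(w))

# Inclusion/exclusion: P(q1 or q2) = P(q1) + P(q2) - P(q1 and q2).
incl_excl = (
    query_probability(probs, q1)
    + query_probability(probs, q2)
    - query_probability(probs, lambda w: q1(w) and q2(w))
)
print(brute, incl_excl)
```

With these numbers both routes give 0.2 + 0.1 - 0.05 = 0.25, up to floating-point rounding.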
Capturing data uncertainty in high-volume stream processing
- In CIDR
, 2009
Cited by 14 (2 self)
We present the design and development of a data stream system that captures data uncertainty from data collection to query processing to final result generation. Our system focuses on data that is naturally modeled as continuous random variables such as many types of sensor data. To provide an end-to-end solution, our system employs probabilistic modeling and inference to generate uncertainty description for raw data, and then a suite of statistical techniques to capture changes of uncertainty as data propagates through query operators. To cope with high-volume streams, we explore advanced approximation techniques for both space and time efficiency. We are currently working with a group of scientists to evaluate our system using traces collected from real-world applications for hazardous weather monitoring and for object tracking and monitoring.
Bridging the gap between intensional and extensional query evaluation in probabilistic databases. EDBT
, 2010
Top-k dominating queries in uncertain databases
- In EDBT, 2009
Cited by 12 (0 self)
Due to the existence of uncertain data in a wide spectrum of real applications, uncertain query processing has become increasingly important, which dramatically differs from handling certain data in a traditional database. In this paper, we formulate and tackle an important query, namely probabilistic top-k dominating (PTD) query, in the uncertain database. In particular, a PTD query retrieves k uncertain objects that are expected to dynamically dominate the largest number of uncertain objects. We propose an effective pruning approach to reduce the PTD search space, and present an efficient query procedure to answer PTD queries. Furthermore, approximate PTD query processing and the case where the PTD query is issued from an uncertain query object are also discussed. Extensive experiments have demonstrated the efficiency and effectiveness of our proposed PTD query processing approaches.
Database Foundations for Scalable RDF Processing
- In Reasoning Web
Cited by 9 (2 self)
As more and more data is provided in RDF format, storing huge amounts of RDF data and efficiently processing queries on such data is becoming increasingly important. The first part of the lecture will introduce state-of-the-art techniques for scalably storing and querying RDF with relational systems, including alternatives for storing RDF, efficient index structures, and query optimization techniques. As centralized RDF repositories have limitations in scalability and failure tolerance, decentralized architectures have been proposed. The second part of the lecture will highlight system architectures and strategies for distributed RDF processing. We cover search engines as well as federated query processing, highlight differences to classic federated database systems, and discuss efficient techniques for distributed query processing in general and for RDF data in particular. Moreover, for the last part of this chapter, we argue that extracting knowledge from the Web is an excellent showcase – and potentially one of the biggest challenges – for the scal-