Sensitivity Analysis and Explanations for Robust Query Evaluation in Probabilistic Databases
In SIGMOD, 2011
Cited by 14 (0 self)
Probabilistic database systems have successfully established themselves as a tool for managing uncertain data. However, much of the research in this area has focused on efficient query evaluation and has largely ignored two key issues that commonly arise in uncertain data management. The first is how to provide explanations for query results, e.g., “Why is this tuple in my result?” or “Why does this output tuple have such high probability?” The second is how to determine the sensitive input tuples for a given query: users want to know which input tuples can substantially alter the output when their probabilities are modified, since they may be unsure about the input probability values. Existing systems provide, in addition to the output probabilities, the lineage/provenance of each output tuple, which is a Boolean formula indicating the dependence of the output tuple on the input tuples. However, lineage does not immediately provide a quantitative relationship, and it is not informative when we have multiple output tuples. In this paper, we propose a unified framework that can handle both of the issues mentioned above to facilitate robust query processing. We formally define the notions of influence and explanations and provide algorithms to determine the top-ℓ influential set of variables and the top-ℓ set of explanations for a variety of queries, including conjunctive queries, probabilistic threshold queries, top-k queries and aggregation queries. Further, our framework naturally enables highly efficient incremental evaluation when input probabilities are modified (e.g., if uncertainty is resolved). Our preliminary experimental results demonstrate the benefits of our framework for performing robust query processing over probabilistic databases.
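To make the idea of a "sensitive input tuple" concrete, here is a minimal sketch. The abstract does not give the paper's formal definition of influence, so this assumes one common reading: the influence of a tuple variable is the change in output probability between fixing that variable to true and fixing it to false, with all input tuples independent. The lineage formula, the variable names `x1` to `x3`, their probabilities, and the function names are all hypothetical, and brute-force enumeration of possible worlds stands in for whatever algorithms the paper actually uses.

```python
from itertools import product

def formula_probability(lineage, probs, fixed=None):
    """Probability that a Boolean lineage formula is true, assuming
    independent tuple variables, by enumerating all possible worlds.
    Variables listed in `fixed` are pinned to the given truth value."""
    fixed = fixed or {}
    free = [v for v in probs if v not in fixed]
    total = 0.0
    for values in product([False, True], repeat=len(free)):
        world = dict(zip(free, values))
        world.update(fixed)
        weight = 1.0
        for v, present in zip(free, values):
            weight *= probs[v] if present else 1.0 - probs[v]
        if lineage(world):
            total += weight
    return total

def influence(lineage, probs, var):
    """Assumed definition: P(lineage | var true) - P(lineage | var false)."""
    return (formula_probability(lineage, probs, {var: True})
            - formula_probability(lineage, probs, {var: False}))

def top_influential(lineage, probs, ell):
    """The ell variables with the largest absolute influence."""
    ranked = sorted(probs,
                    key=lambda v: abs(influence(lineage, probs, v)),
                    reverse=True)
    return ranked[:ell]

# Hypothetical lineage of one output tuple: (x1 AND x2) OR x3
lineage = lambda w: (w["x1"] and w["x2"]) or w["x3"]
probs = {"x1": 0.9, "x2": 0.5, "x3": 0.2}

print(formula_probability(lineage, probs))
print({v: influence(lineage, probs, v) for v in probs})
print(top_influential(lineage, probs, 2))
```

Under this assumed definition the output probability is linear in each input probability, so the influence is also the rate at which the output changes as that one input probability is adjusted. Enumeration is exponential in the number of variables and is only meant to illustrate the quantity being ranked.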
Probabilistic Databases with MarkoViews
Cited by 12 (5 self)
Most of the work on query evaluation in probabilistic databases has focused on the simple tuple-independent data model, where all tuples are independent random events. Several efficient query evaluation techniques exist in this setting, such as safe plans, algorithms based on OBDDs, tree-decomposition and a variety of approximation algorithms. However, complex data analytics tasks often require complex correlations between tuples, and here query evaluation is significantly more expensive, or more restrictive. In this paper, we propose MVDB as a framework both for representing complex correlations and for efficient query evaluation. An MVDB specifies correlations by defining views, called MarkoViews, on the probabilistic relations and declaring the weights of the views' outputs. An MVDB is a (very large) Markov Logic Network. We make two sets of contributions. First, we show that query evaluation on an MVDB is equivalent to evaluating a Union of Conjunctive Queries (UCQ) over a tuple-independent database. The translation is exact (thus allowing the techniques developed for tuple-independent databases to be carried over to MVDB), yet it is novel and quite non-obvious (some resulting probabilities may be negative!). This translation by itself, though, may not lead to much gain, since the translated query becomes more complicated as we try to capture more correlations. Our second contribution is to propose a new query evaluation strategy that exploits offline compilation to speed up online query evaluation. Here we utilize and extend our prior work on compilation of UCQ. We experimentally validate our techniques on a large probabilistic database with MarkoViews inferred from the DBLP data.
Database Foundations for Scalable RDF Processing
In Reasoning Web
Cited by 9 (2 self)
As more and more data is provided in RDF format, storing huge amounts of RDF data and efficiently processing queries on such data is becoming increasingly important. The first part of the lecture will introduce state-of-the-art techniques for scalably storing and querying RDF with relational systems, including alternatives for storing RDF, efficient index structures, and query optimization techniques. As centralized RDF repositories have limitations in scalability and failure tolerance, decentralized architectures have been proposed. The second part of the lecture will highlight system architectures and strategies for distributed RDF processing. We cover search engines as well as federated query processing, highlight differences to classic federated database systems, and discuss efficient techniques for distributed query processing in general and for RDF data in particular. Moreover, for the last part of this chapter, we argue that extracting knowledge from the Web is an excellent showcase – and potentially one of the biggest challenges – for the scal-
Local Structure and Determinism in Probabilistic Databases
Cited by 2 (0 self)
While extensive work has been done on evaluating queries over tuple-independent probabilistic databases, query evaluation over correlated data has received much less attention, even though support for correlations is essential for many natural applications of probabilistic databases, e.g., information extraction, data integration, and computer vision. In this paper, we develop a novel approach for efficiently evaluating probabilistic queries over correlated databases where correlations are represented using a factor graph, a class of graphical models widely used for capturing correlations and performing statistical inference. Our approach exploits the specific values of the factor parameters and the determinism in the correlations, collectively called local structure, to reduce the complexity of query evaluation. Our framework is based on arithmetic circuits, factorized representations of probability distributions that can exploit such local structure. Traditionally, arithmetic circuits are generated by a compilation process and cannot be updated directly. We introduce a generalization of arithmetic circuits, called annotated arithmetic circuits, and a novel algorithm for updating them, which enables us to answer probabilistic queries efficiently. We present a comprehensive experimental analysis and show speed-ups of at least one order of magnitude in many cases.
Lineage for Markovian Stream Event Queries
Imprecise, sequential data, such as location sequences inferred from RFID/GPS, are often represented as Markovian (probabilistic, temporally-correlated) streams. Event queries, which detect instances of specific patterns in these streams, have become the standard tool for analysis of these streams; however, many data mining applications require richer information such as how a pattern is matched, how long the match is, or what stream elements matched specific pattern predicates. Such queries can dramatically increase the power of applications, but they cannot be answered by existing tools. In this paper, we present novel techniques for processing the above queries on Markovian streams. Central to our approach are algorithms for computing and manipulating the lineage of Markovian stream event queries. We provide formal definitions and linear-time algorithms for computing lineage, which may be exponentially-sized in the length of the input stream. We additionally demonstrate the importance of flexible lineage projections, and provide definitions of, and two efficient algorithms for, these projections. We evaluate all algorithms on two real-world data sets (location from RFID and words from spoken audio), and demonstrate that lineage can greatly increase the analytical power of applications while incurring small processing overhead.
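As background for what an event query over a Markovian stream computes, the sketch below evaluates the probability that a simple two-step pattern occurs at least once in a short stream. It is not the paper's lineage algorithm, which the abstract does not describe. It only illustrates the underlying match probability, under several assumptions: the stream is a homogeneous Markov chain (one transition table for every step), the pattern is "state `first` immediately followed by state `second`", and the state names `hall` and `office`, the probabilities, and the function name `match_probability` are all hypothetical.

```python
def match_probability(initial, transition, first, second, length):
    """Probability that `first` is immediately followed by `second`
    at least once in a Markov chain with `length` timesteps.

    initial:    dict state -> probability at the first timestep
    transition: dict state -> dict next_state -> probability
    """
    states = list(initial)
    # unmatched[s]: probability of being in state s at the current
    # timestep with no pattern match so far.
    unmatched = dict(initial)
    matched = 0.0
    for _ in range(length - 1):
        next_unmatched = {s: 0.0 for s in states}
        for prev, mass in unmatched.items():
            for cur in states:
                step = mass * transition[prev][cur]
                if prev == first and cur == second:
                    # Pattern completed: this mass stays matched.
                    matched += step
                else:
                    next_unmatched[cur] += step
        unmatched = next_unmatched
    return matched

# Hypothetical stream of inferred locations.
initial = {"hall": 1.0, "office": 0.0}
transition = {
    "hall":   {"hall": 0.7, "office": 0.3},
    "office": {"hall": 0.4, "office": 0.6},
}

print(match_probability(initial, transition, "hall", "office", 3))
```

The loop runs once per stream element and does work proportional to the square of the number of states, so it is linear in the stream length. Tracking which stream elements produced each match, as the lineage in the abstract does, would require carrying more than a single probability per state.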
Sensitivity Analysis and Explanations for Robust Query Evaluation in Probabilistic Databases
2011
Probabilistic database systems have successfully established themselves as a tool for managing uncertain data. However, much of the research in this area has focused on efficient query evaluation and has largely ignored two key issues that commonly arise in uncertain data management. The first is how to provide explanations for query results, e.g., “Why is this tuple in my result?” or “Why does this output tuple have such high probability?” The second is how to determine the sensitive input tuples for a given query: users want to know which input tuples can substantially alter the output when their probabilities are modified, since they may be unsure about the input probability values. Existing systems provide, in addition to the output probabilities, the lineage/provenance of each output tuple, which is a Boolean formula indicating the dependence of the output tuple on the input tuples. However, lineage does not immediately provide a quantitative relationship, and it is not informative when we have multiple output tuples. In this paper, we propose a unified framework that can handle both of the issues mentioned above and facilitate robust query processing. We formally define the notions of influence and explanations and provide algorithms to determine the top-ℓ influential set of variables and the top-ℓ set of explanations for a variety of queries, including conjunctive queries, probabilistic threshold queries, top-k queries and aggregation queries. Further, our framework naturally enables highly efficient, incremental evaluation when the input probabilities are modified, i.e., if the user decides to change the probability of an input tuple (e.g., if the uncertainty is resolved). Our preliminary experimental results demonstrate the benefits of our framework for performing robust query processing over probabilistic databases.
Querying and Learning in Probabilistic Databases
Probabilistic Databases (PDBs) lie at the expressive intersection of databases, first-order logic, and probability theory. PDBs employ logical deduction rules to process Select-Project-Join (SPJ) queries, which form the basis for a variety of declarative query languages such as Datalog, Relational Algebra, and SQL. They employ logical consistency constraints to resolve data inconsistencies, and they represent query answers via logical lineage formulas (a.k.a. “data provenance”) to trace the dependencies between these answers and the input tuples that led to their derivation. While the literature on PDBs dates back to more than 25 years of research, only fairly recently the key role of lineage for establishing a closed and complete representation model of relational operations over this kind of probabilistic data was discovered. Although PDBs benefit from their efficient and scalable database infrastructures for data storage and indexing, they couple the data computation with probabilistic inference, the latter of which remains a #P-hard problem also in the context of PDBs. In this chapter, we provide a review on the key concepts of PDBs with a particular focus on our own recent research results related to this field. We highlight a number of ongoing research challenges related to PDBs, and we keep referring to an information extraction (IE) scenario as a running application to manage uncertain and temporal facts obtained from IE techniques directly inside a PDB setting.