Results 11 - 20
of
91
Tiresias: the database oracle for how-to queries
- In SIGMOD
, 2012
"... How-To queries answer fundamental data analysis questions of the form: “How should the input change in order to achieve the desired output”. As a Reverse Data Management problem, the evaluation of how-to queries is harder than their “forward ” counterpart: hypothetical, or what-if queries. In this p ..."
Abstract
-
Cited by 16 (1 self)
- Add to MetaCart
(Show Context)
How-To queries answer fundamental data analysis questions of the form: “How should the input change in order to achieve the desired output”. As a Reverse Data Management problem, the evaluation of how-to queries is harder than their “forward ” counterpart: hypothetical, or what-if queries. In this paper, we present Tiresias, the first system that provides support for how-to queries, allowing the definition and integrated evaluation of a large set of constrained optimization problems, specifically Mixed Integer Programming problems, on top of a relational database system. Tiresias generates the problem variables, constraints and objectives by issuing standard SQL statements, allowing for its integration with any RDBMS. The contributions of this work are the following: (a) we define how-to queries using possible world semantics, and propose the specification language TiQL (for Tiresias Query Language) based on simple extensions to standard Datalog. (b) We define translation rules that generate a Mixed Integer Program (MIP) from TiQL specifications, which can be solved using existing tools. (c) Tiresias implements powerful “dataaware” optimizations that are beyond the capabilities of modern MIP solvers, dramatically improving the system performance. (d) Finally, an extensive performance evaluation on the TPC-H dataset demonstrates the effectiveness of these optimizations, particularly highlighting the ability to apply divide-and-conquer methods to break MIP problems into smaller instances.
10ˆ(10ˆ6) Worlds and Beyond: Efficient Representation and Processing of Incomplete Information
- VLDBJ
"... We present a decomposition-based approach to managing probabilistic information. We introduce world-set decompositions (WSDs), a space-efficient and complete representation system for finite sets of worlds. We study the problem of efficiently evaluating relational algebra queries on world-sets repre ..."
Abstract
-
Cited by 15 (0 self)
- Add to MetaCart
We present a decomposition-based approach to managing probabilistic information. We introduce world-set decompositions (WSDs), a space-efficient and complete representation system for finite sets of worlds. We study the problem of efficiently evaluating relational algebra queries on world-sets represented by WSDs. We also evaluate our technique experimentally in a large census data scenario and show that it is both scalable and efficient.
Probabilistic String Similarity Joins
"... Edit distance based string similarity join is a fundamental operator in string databases. Increasingly, many applications in data cleaning, data integration, and scientific computing have to deal with fuzzy information in string attributes. Despite the intensive efforts devoted in processing (determ ..."
Abstract
-
Cited by 14 (0 self)
- Add to MetaCart
(Show Context)
Edit distance based string similarity join is a fundamental operator in string databases. Increasingly, many applications in data cleaning, data integration, and scientific computing have to deal with fuzzy information in string attributes. Despite the intensive efforts devoted in processing (deterministic) string joins and managing probabilistic data respectively, modeling and processing probabilistic strings is still a largely unexplored territory. This work studies the string join problem in probabilistic string databases, using the expected edit distance (EED) as the similarity measure. We first discuss two probabilistic string models to capture the fuzziness in string values in real-world applications. The string-level model is complete, but may be expensive to represent and process. The character-level model has a much more succinct representation when uncertainty in strings only exists at certain positions. Since computing the EED between two probabilistic strings is prohibitively expensive, we have designed efficient and effective pruning techniques that can be easily implemented in existing relational database engines for both models. Extensive experiments on real data have demonstrated order-of-magnitude improvements of our approaches over the baseline.
Sensitivity Analysis and Explanations for Robust Query Evaluation in Probabilistic Databases
- In SIGMOD
, 2011
"... Probabilistic database systems have successfully established themselves as a tool for managing uncertain data. However, much of the research in this area has focused on efficient query evaluation and has largely ignored two key issues that commonly arise in uncertain data management: First, how to p ..."
Abstract
-
Cited by 14 (0 self)
- Add to MetaCart
(Show Context)
Probabilistic database systems have successfully established themselves as a tool for managing uncertain data. However, much of the research in this area has focused on efficient query evaluation and has largely ignored two key issues that commonly arise in uncertain data management: First, how to provide explanations for query results, e.g., “Why is this tuple in my result? ” or “Why does this output tuple have such high probability?”. Second, the problem of determining the sensitive input tuples for the given query, e.g., users are interested to know the input tuples that can substantially alter the output, when their probabilities are modified (since they may be unsure about the input probability values). Existing systems provide the lineage/provenance of each of the output tuples in addition to the output probabilities, which is a boolean formula indicating the dependence of the output tuple on the input tuples. However, lineage does not immediately provide a quantitative relationship and it is not informative when we have multiple output tuples. In this paper, we propose a unified framework that can handle both the issues mentioned above to facilitate robust query processing. We formally define the notions of influence and explanations and provide algorithms to determine the top-ℓ influential set of variables and the top-ℓ set of explanations for a variety of queries, including conjunctive queries, probabilistic threshold queries, top-k queries and aggregation queries. Further, our framework naturally enables highly efficient incremental evaluation when input probabilities are modified (e.g., if uncertainty is resolved). Our preliminary experimental results demonstrate the benefits of our framework for performing robust query processing over probabilistic databases.
Bridging the gap between intensional and extensional query evaluation in probabilistic databases. EDBT
, 2010
"... ..."
(Show Context)
Probabilistic Databases with MarkoViews
"... Most of the work on query evaluation in probabilistic databases has focused on the simple tuple-independent data model, where all tuples are independent random events. Several efficient query evaluation techniques exists in this setting, such as safe plans, algorithms based on OBDDs, tree-decomposit ..."
Abstract
-
Cited by 12 (5 self)
- Add to MetaCart
(Show Context)
Most of the work on query evaluation in probabilistic databases has focused on the simple tuple-independent data model, where all tuples are independent random events. Several efficient query evaluation techniques exists in this setting, such as safe plans, algorithms based on OBDDs, tree-decomposition and a variety of approximation algorithms. However, complex data analytics tasks often require complex correlations between tuples, and here query evaluation is significantly more expensive, or more restrictive. In this paper, we propose MVDB as a framework both for representing complex correlations and for efficient query evaluation. An MVDB specifies correlations by views, called MarkoViews, on the probabilistic relations and declaring the weights of the view’s outputs. An MVDB is a (very large) Markov Logic Network. We make two sets of contributions. First, we show that query evaluation on an MVDB is equivalent to evaluating a Union of Conjunctive Query(UCQ) over a tuple-independent database. The translation is exact (thus allowing the techniques developed for tuple independent databases to be carried over to MVDB), yet it is novel and quite non-obvious (some resulting probabilities may be negative!). This translation in itself though may not lead to much gain since the translated query gets complicated as we try to capture more correlations. Our second contribution is to propose a new query evaluation strategy that exploits offline compilation to speed up online query evaluation. Here we utilize and extend our prior work on compilation of UCQ. We validate experimentally our techniques on a large probabilistic database with MarkoViews inferred from the DBLP data. 1.
Dissociation and Propagation for Efficient Query Evaluation over Probabilistic Databases
, 2010
"... Queries over probabilistic databases are either safe, in which case they can be evaluated entirely in a relational database engine, or unsafe, in which case they need to be evaluated with a general-purpose inference engine at a high cost. This paper proposes a new approach by which every query is e ..."
Abstract
-
Cited by 11 (8 self)
- Add to MetaCart
(Show Context)
Queries over probabilistic databases are either safe, in which case they can be evaluated entirely in a relational database engine, or unsafe, in which case they need to be evaluated with a general-purpose inference engine at a high cost. This paper proposes a new approach by which every query is evaluated like a safe query inside the database engine, by using a new method called dissociation. A dissociated query is obtained by adding extraneous variables to some atoms until the query becomes safe. We show that the probability of the original query and that of the dissociated query correspond to two well-known scoring functions on graphs, namely graph reliability (which is #P-hard), and the propagation score (which is related to PageRank and is in PTIME): When restricted to graphs, standard query probability is graph reliability, while the dissociated probability is the propagation score. We define a propagation score for conjunctive queries without self-joins and prove (i) that it is is always an upper bound for query reliability, and (ii) that both scores coincide for all safe queries. Given the widespread and successful use of graph propagation methods in practice, we argue for the dissociation method as a good and efficient way to rank probabilistic query results, especially for those queries which are highly intractable for exact probabilistic inference.
Probabilistic Similarity Search for Uncertain Time Series
, 2009
"... A probabilistic similarity query over uncertain data assigns to each uncertain database object o a probability indicating the likelihood that o meets the query predicate. In this paper, we formalize the notion of uncertain time series and introduce two novel and important types of probabilistic rang ..."
Abstract
-
Cited by 10 (2 self)
- Add to MetaCart
A probabilistic similarity query over uncertain data assigns to each uncertain database object o a probability indicating the likelihood that o meets the query predicate. In this paper, we formalize the notion of uncertain time series and introduce two novel and important types of probabilistic range queries over uncertain time series. Furthermore, we propose an original approximate representation of uncertain time series that can be used to efficiently support both new query types by upper and lower bounding the Euclidean distance.
Lineage processing over correlated probabilistic databases
- In SIGMOD Conference
, 2010
"... In this paper, we address the problem of scalably evaluating conjunctive queries over correlated probabilistic databases containing tuple or attribute uncertainties. Like previous work, we adopt a two-phase approach where we first compute lineages of the output tuples, and then compute the probabili ..."
Abstract
-
Cited by 9 (2 self)
- Add to MetaCart
(Show Context)
In this paper, we address the problem of scalably evaluating conjunctive queries over correlated probabilistic databases containing tuple or attribute uncertainties. Like previous work, we adopt a two-phase approach where we first compute lineages of the output tuples, and then compute the probabilities of the lineage formulas. However unlike previous work, we allow for arbitrary and complex correlations to be present in the data, captured via a forest of junction trees. We observe that evaluating even read-once (tree structured) lineages (e.g., those generated by hierarchical conjunctive queries), polynomially computable over tuple independent probabilistic databases, is #P-complete for lightly correlated probabilistic databases like Markov sequences. We characterize the complexity of exact computation of the probability of the lineage formula on a correlated database using a parameter called lwidth (analogous to the notion of treewidth). For lineages that result in low lwidth, we compute exact probabilities using a novel message passing algorithm, and for lineages that induce large lwidths, we develop approximate Monte Carlo algorithms to estimate the result probabilities. We scale our algorithms to very large correlated probabilistic databases using the previously proposed INDSEP data structure. To mitigate the complexity of lineage evaluation, we develop optimization techniques to process a batch of lineages by sharing computation across formulas, and to exploit any independence relationships that may exist in the data. Our experimental study illustrates the benefits of using our algorithms for processing lineage formulas over correlated probabilistic databases.
On the Complexity of Query Answering over Incomplete XML Documents
"... Previous studies of incomplete XML documents have identified three main sources of incompleteness – in structural information, data values, and labeling – and addressed data complexity of answering analogs of unions of conjunctive queries under the open world assumption. It is known that structural ..."
Abstract
-
Cited by 8 (6 self)
- Add to MetaCart
(Show Context)
Previous studies of incomplete XML documents have identified three main sources of incompleteness – in structural information, data values, and labeling – and addressed data complexity of answering analogs of unions of conjunctive queries under the open world assumption. It is known that structural incompleteness leads to intractability, while incompleteness in data values and labeling still permits efficient computation of certain answers. The goal of this paper is to provide a complete picture of the complexity of query answering over incomplete XML documents. We look at more expressive languages, at other semantic assumptions, and at both data and combined complexity of query answering, to see whether some well-behaving tractable classes have been missed. To incorporate non-positive features into query languages, we look at gentle ways of introducing negation via inequalities and/or Boolean combinations of positive queries, as well as the analog of relational calculus. We also look at the closed world assumption which, due to the hierarchical structure of XML, has two variations. For all combinations of languages and semantics of incompleteness we determine data and combined complexity of computing certain answers. We show that structural incompleteness leads to intractability under all assumptions, while by dropping it we can recover efficient evaluation algorithms for some queries that go beyond those previously studied.