Results 11–20 of 91
Tiresias: the database oracle for how-to queries
 In SIGMOD
, 2012
Abstract

Cited by 16 (1 self)
How-to queries answer fundamental data analysis questions of the form: “How should the input change in order to achieve the desired output?” As a Reverse Data Management problem, the evaluation of how-to queries is harder than their “forward” counterpart: hypothetical, or what-if, queries. In this paper, we present Tiresias, the first system that provides support for how-to queries, allowing the definition and integrated evaluation of a large set of constrained optimization problems, specifically Mixed Integer Programming problems, on top of a relational database system. Tiresias generates the problem variables, constraints, and objectives by issuing standard SQL statements, allowing for its integration with any RDBMS. The contributions of this work are the following: (a) we define how-to queries using possible-world semantics, and propose the specification language TiQL (for Tiresias Query Language) based on simple extensions to standard Datalog. (b) We define translation rules that generate a Mixed Integer Program (MIP) from TiQL specifications, which can be solved using existing tools. (c) Tiresias implements powerful “data-aware” optimizations that are beyond the capabilities of modern MIP solvers, dramatically improving the system performance. (d) Finally, an extensive performance evaluation on the TPC-H dataset demonstrates the effectiveness of these optimizations, particularly highlighting the ability to apply divide-and-conquer methods to break MIP problems into smaller instances.
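The how-to semantics described above can be illustrated with a tiny brute-force sketch (this is not TiQL or the Tiresias MIP translation; the relation name, values, and budget are invented): given an input relation and a desired constraint on its aggregate output, find a smallest change to the input that achieves it.

```python
from itertools import combinations

# Hypothetical input relation Shipments(id, cost); desired output:
# SUM(cost) <= budget. The how-to answer: the fewest tuples to delete.
shipments = {"s1": 40, "s2": 35, "s3": 30, "s4": 10}
budget = 70

def fewest_deletions(costs, limit):
    """Return a smallest set of ids whose removal brings the total under limit."""
    ids = list(costs)
    for k in range(len(ids) + 1):          # try smaller changes first
        for drop in combinations(ids, k):
            if sum(c for i, c in costs.items() if i not in drop) <= limit:
                return set(drop)
    return set(ids)

answer = fewest_deletions(shipments, budget)   # {"s1", "s2"}
```

A real system like Tiresias avoids this exponential enumeration by compiling the specification into a MIP and handing it to a solver; the sketch only shows what is being optimized.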
10^(10^6) Worlds and Beyond: Efficient Representation and Processing of Incomplete Information
 VLDBJ
Abstract

Cited by 15 (0 self)
We present a decomposition-based approach to managing probabilistic information. We introduce world-set decompositions (WSDs), a space-efficient and complete representation system for finite sets of worlds. We study the problem of efficiently evaluating relational algebra queries on world-sets represented by WSDs. We also evaluate our technique experimentally in a large census data scenario and show that it is both scalable and efficient.
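A minimal sketch of the decomposition idea (the schema and values are invented, and real WSDs carry more machinery): a set of worlds is stored as a product of independent components, so 2 × 3 = 6 worlds cost only 2 + 3 = 5 stored alternatives.

```python
from itertools import product

# Two independent components of one uncertain tuple: 2 name alternatives
# times 3 age alternatives encode 6 worlds in 5 rows.
components = [
    [{"name": "Smith"}, {"name": "Smyth"}],      # uncertain name
    [{"age": 30}, {"age": 31}, {"age": 32}],     # uncertain age
]

def enumerate_worlds(comps):
    """Expand the decomposition back into the explicit set of worlds."""
    for combo in product(*comps):
        world = {}
        for part in combo:
            world.update(part)
        yield world

worlds = list(enumerate_worlds(components))      # 6 explicit worlds
```

The space saving is the point: with n independent components of k alternatives each, k^n worlds are represented in k·n rows.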
Probabilistic String Similarity Joins
Abstract

Cited by 14 (0 self)
Edit-distance-based string similarity join is a fundamental operator in string databases. Increasingly, many applications in data cleaning, data integration, and scientific computing have to deal with fuzzy information in string attributes. Despite the intensive efforts devoted to processing (deterministic) string joins and to managing probabilistic data, modeling and processing probabilistic strings is still a largely unexplored territory. This work studies the string join problem in probabilistic string databases, using the expected edit distance (EED) as the similarity measure. We first discuss two probabilistic string models to capture the fuzziness in string values in real-world applications. The string-level model is complete, but may be expensive to represent and process. The character-level model has a much more succinct representation when uncertainty in strings only exists at certain positions. Since computing the EED between two probabilistic strings is prohibitively expensive, we have designed efficient and effective pruning techniques that can be easily implemented in existing relational database engines for both models. Extensive experiments on real data have demonstrated order-of-magnitude improvements of our approaches over the baseline.
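A sketch of the string-level model and the EED measure described above (the example distributions are invented): each probabilistic string is a distribution over plain strings, and EED is the probability-weighted average of pairwise edit distances — exactly the exhaustive computation the paper's pruning techniques are designed to avoid.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def expected_edit_distance(s_dist, t_dist):
    """EED over the string-level model: average over both worlds."""
    return sum(ps * pt * levenshtein(s, t)
               for s, ps in s_dist.items()
               for t, pt in t_dist.items())

eed = expected_edit_distance({"cat": 1.0}, {"cat": 0.5, "bat": 0.5})
```

With m and n possible worlds per string this costs m·n edit-distance computations per pair, which is why joins need bounds that prune pairs without expanding the worlds.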
Sensitivity Analysis and Explanations for Robust Query Evaluation in Probabilistic Databases
 In SIGMOD
, 2011
Abstract

Cited by 14 (0 self)
Probabilistic database systems have successfully established themselves as a tool for managing uncertain data. However, much of the research in this area has focused on efficient query evaluation and has largely ignored two key issues that commonly arise in uncertain data management. First, how to provide explanations for query results, e.g., “Why is this tuple in my result?” or “Why does this output tuple have such high probability?”. Second, the problem of determining the sensitive input tuples for a given query: users want to know which input tuples can substantially alter the output when their probabilities are modified (since they may be unsure about the input probability values). Existing systems provide, in addition to the output probabilities, the lineage/provenance of each output tuple: a Boolean formula indicating how the output tuple depends on the input tuples. However, lineage does not immediately provide a quantitative relationship, and it is not informative when we have multiple output tuples. In this paper, we propose a unified framework that handles both issues to facilitate robust query processing. We formally define the notions of influence and explanation, and provide algorithms to determine the top-ℓ influential set of variables and the top-ℓ set of explanations for a variety of queries, including conjunctive queries, probabilistic threshold queries, top-k queries, and aggregation queries. Further, our framework naturally enables highly efficient incremental evaluation when input probabilities are modified (e.g., if uncertainty is resolved). Our preliminary experimental results demonstrate the benefits of our framework for performing robust query processing over probabilistic databases.
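One common way to quantify the sensitivity the abstract describes (assumed here, and not necessarily the paper's exact definition): for independent input tuples, the influence of tuple x on an output is P(lineage | x present) − P(lineage | x absent), i.e. the partial derivative of the multilinear output probability in p_x. The lineage and probabilities below are invented.

```python
def lineage_prob(p):
    # Output tuple's lineage: x OR y, over independent input tuples.
    return 1 - (1 - p["x"]) * (1 - p["y"])

def influence(f, p, var):
    """Sensitivity of output probability f to input tuple `var`."""
    return f({**p, var: 1.0}) - f({**p, var: 0.0})

p = {"x": 0.9, "y": 0.5}
inf_x = influence(lineage_prob, p, "x")   # 1.0 - 0.5 = 0.5
inf_y = influence(lineage_prob, p, "y")   # 1.0 - 0.9 = 0.1
```

Ranking tuples by this quantity yields a top-ℓ influential set for a single output; handling many outputs and query classes efficiently is the harder problem the paper addresses.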
Bridging the gap between intensional and extensional query evaluation in probabilistic databases
 In EDBT
, 2010
Probabilistic Databases with MarkoViews
Abstract

Cited by 12 (5 self)
Most of the work on query evaluation in probabilistic databases has focused on the simple tuple-independent data model, where all tuples are independent random events. Several efficient query evaluation techniques exist in this setting, such as safe plans, algorithms based on OBDDs, tree decompositions, and a variety of approximation algorithms. However, complex data analytics tasks often require complex correlations between tuples, and here query evaluation is significantly more expensive, or more restrictive. In this paper, we propose MVDB as a framework both for representing complex correlations and for efficient query evaluation. An MVDB specifies correlations by views, called MarkoViews, on the probabilistic relations, declaring the weights of the views’ outputs. An MVDB is a (very large) Markov Logic Network. We make two sets of contributions. First, we show that query evaluation on an MVDB is equivalent to evaluating a union of conjunctive queries (UCQ) over a tuple-independent database. The translation is exact (thus allowing the techniques developed for tuple-independent databases to be carried over to MVDBs), yet it is novel and quite non-obvious (some resulting probabilities may be negative!). This translation in itself may not lead to much gain, though, since the translated query gets complicated as we try to capture more correlations. Our second contribution is a new query evaluation strategy that exploits offline compilation to speed up online query evaluation. Here we utilize and extend our prior work on compilation of UCQs. We validate our techniques experimentally on a large probabilistic database with MarkoViews inferred from the DBLP data.
Dissociation and Propagation for Efficient Query Evaluation over Probabilistic Databases
, 2010
Abstract

Cited by 11 (8 self)
Queries over probabilistic databases are either safe, in which case they can be evaluated entirely in a relational database engine, or unsafe, in which case they need to be evaluated with a general-purpose inference engine at a high cost. This paper proposes a new approach by which every query is evaluated like a safe query inside the database engine, using a new method called dissociation. A dissociated query is obtained by adding extraneous variables to some atoms until the query becomes safe. We show that the probability of the original query and that of the dissociated query correspond to two well-known scoring functions on graphs, namely graph reliability (which is #P-hard) and the propagation score (which is related to PageRank and is in PTIME): when restricted to graphs, standard query probability is graph reliability, while the dissociated probability is the propagation score. We define a propagation score for conjunctive queries without self-joins and prove (i) that it is always an upper bound for query reliability, and (ii) that both scores coincide for all safe queries. Given the widespread and successful use of graph propagation methods in practice, we argue for the dissociation method as a good and efficient way to rank probabilistic query results, especially for those queries which are highly intractable for exact probabilistic inference.
Probabilistic Similarity Search for Uncertain Time Series
, 2009
Abstract

Cited by 10 (2 self)
A probabilistic similarity query over uncertain data assigns to each uncertain database object o a probability indicating the likelihood that o meets the query predicate. In this paper, we formalize the notion of uncertain time series and introduce two novel and important types of probabilistic range queries over uncertain time series. Furthermore, we propose an original approximate representation of uncertain time series that can be used to efficiently support both new query types by upper and lower bounding the Euclidean distance.
Lineage processing over correlated probabilistic databases
 In SIGMOD Conference
, 2010
Abstract

Cited by 9 (2 self)
In this paper, we address the problem of scalably evaluating conjunctive queries over correlated probabilistic databases containing tuple or attribute uncertainties. Like previous work, we adopt a two-phase approach where we first compute lineages of the output tuples, and then compute the probabilities of the lineage formulas. However, unlike previous work, we allow arbitrary and complex correlations to be present in the data, captured via a forest of junction trees. We observe that evaluating even read-once (tree-structured) lineages (e.g., those generated by hierarchical conjunctive queries), which are polynomially computable over tuple-independent probabilistic databases, is #P-complete for lightly correlated probabilistic databases like Markov sequences. We characterize the complexity of exact computation of the probability of the lineage formula on a correlated database using a parameter called lwidth (analogous to the notion of treewidth). For lineages that result in low lwidth, we compute exact probabilities using a novel message passing algorithm, and for lineages that induce large lwidths, we develop approximate Monte Carlo algorithms to estimate the result probabilities. We scale our algorithms to very large correlated probabilistic databases using the previously proposed INDSEP data structure. To mitigate the complexity of lineage evaluation, we develop optimization techniques to process a batch of lineages by sharing computation across formulas, and to exploit any independence relationships that may exist in the data. Our experimental study illustrates the benefits of using our algorithms for processing lineage formulas over correlated probabilistic databases.
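The tuple-independent baseline the abstract contrasts against can be sketched directly (tuple names and probabilities are invented): for a read-once (tree-structured) lineage over independent tuples, one bottom-up pass suffices, with AND multiplying child probabilities and OR complementing their product. The paper's point is that even this easy case becomes #P-complete once the tuples are correlated.

```python
def lineage_prob(node, p):
    """node is ('var', name), ('and', children), or ('or', children)."""
    kind, payload = node
    if kind == "var":
        return p[payload]
    child = [lineage_prob(c, p) for c in payload]
    if kind == "and":                 # independent conjunction
        out = 1.0
        for q in child:
            out *= q
        return out
    out = 1.0                         # kind == "or": 1 - prod(1 - q)
    for q in child:
        out *= 1.0 - q
    return 1.0 - out

# Read-once formula (t1 AND t2) OR t3 over independent tuples.
formula = ("or", [("and", [("var", "t1"), ("var", "t2")]), ("var", "t3")])
result = lineage_prob(formula, {"t1": 0.5, "t2": 0.5, "t3": 0.2})   # 0.4
```

The independence assumption is exactly what a junction-tree-correlated database breaks: there, AND and OR can no longer be combined locally, which motivates the message-passing and Monte Carlo algorithms above.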
On the Complexity of Query Answering over Incomplete XML Documents
Abstract

Cited by 8 (6 self)
Previous studies of incomplete XML documents have identified three main sources of incompleteness – in structural information, data values, and labeling – and addressed the data complexity of answering analogs of unions of conjunctive queries under the open-world assumption. It is known that structural incompleteness leads to intractability, while incompleteness in data values and labeling still permits efficient computation of certain answers. The goal of this paper is to provide a complete picture of the complexity of query answering over incomplete XML documents. We look at more expressive languages, at other semantic assumptions, and at both the data and combined complexity of query answering, to see whether some well-behaved tractable classes have been missed. To incorporate non-positive features into query languages, we look at gentle ways of introducing negation via inequalities and/or Boolean combinations of positive queries, as well as the analog of relational calculus. We also look at the closed-world assumption which, due to the hierarchical structure of XML, has two variations. For all combinations of languages and semantics of incompleteness we determine the data and combined complexity of computing certain answers. We show that structural incompleteness leads to intractability under all assumptions, while by dropping it we can recover efficient evaluation algorithms for some queries that go beyond those previously studied.