Scorpion: Explaining Away Outliers in Aggregate Queries
Cited by 18 (4 self)
Database users commonly explore large data sets by running aggregate queries that project the data down to a smaller number of points and dimensions, and visualizing the results. Often, such visualizations will reveal outliers that correspond to errors or surprising features of the input data set. Unfortunately, databases and visualization systems do not provide a way to work backwards from an outlier point to the common properties of the (possibly many) unaggregated input tuples that correspond to that outlier. We propose Scorpion, a system that takes a set of user-specified outlier points in an aggregate query result as input and finds predicates that explain the outliers in terms of properties of the input tuples that are used to compute the selected outlier results. Specifically, this explanation identifies predicates that, when applied to the input data, cause the outliers to disappear from the output. To find such predicates, we develop a notion of influence of a predicate on a given output, and design several algorithms that efficiently search for maximum-influence predicates over the input data. We show that these algorithms can quickly find outliers in two real data sets (from a sensor deployment and a campaign finance data set), and run orders of magnitude faster than a naive search algorithm while providing comparable quality on a synthetic data set.
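As a rough illustration of the influence idea in the Scorpion abstract, the Python sketch below scores a predicate by how much an average changes, per removed tuple, when the matching input tuples are deleted. The names `influence` and `avg`, the `temp` field, the sample rows, and the per-tuple normalisation are all assumptions made for this example; the paper's own definition and search algorithms are not reproduced here.

```python
def avg(values):
    """Plain arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def influence(rows, predicate, key="temp"):
    """Hypothetical score: drop in AVG(key) per input row removed by `predicate`."""
    kept = [r for r in rows if not predicate(r)]
    removed = len(rows) - len(kept)
    if removed == 0 or not kept:
        # Nothing removed, or nothing left to aggregate: treat as no influence.
        return 0.0
    before = avg([r[key] for r in rows])
    after = avg([r[key] for r in kept])
    return (before - after) / removed


# Illustrative input tuples behind one suspiciously high average.
rows = [
    {"sensor": 1, "temp": 20},
    {"sensor": 2, "temp": 20},
    {"sensor": 3, "temp": 80},
]

# Candidate predicates: one per sensor id.
scores = {s: influence(rows, lambda r, s=s: r["sensor"] == s) for s in (1, 2, 3)}
best = max(scores, key=scores.get)
```

Here the predicate `sensor == 3` gets the highest score, because deleting its tuple pulls the average from 40 down to 20. Enumerating every candidate predicate this way corresponds to the naive search the abstract compares against, not to the faster algorithms it proposes.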
Top-k Query Processing in Probabilistic Databases with Non-Materialized Views
2012
Cited by 8 (4 self)
In this paper, we investigate a novel approach of computing confidence bounds for top-k ranking queries in probabilistic databases with non-materialized views. Unlike prior approaches, we present an exact pruning algorithm for finding the top-ranked query answers according to their marginal probabilities without the need to first materialize all answer candidates via the views. Specifically, we consider conjunctive queries over multiple levels of select-project-join views, the latter of which are cast into Datalog rules, where also the rules themselves may be uncertain, i.e., be valid with some degree of confidence. To our knowledge, this work is the first to address integrated data and confidence computations in the context of probabilistic databases by considering confidence bounds over partially evaluated query answers with first-order lineage formulas. We further extend our query processing techniques by a tool-suite of scheduling strategies based on selectivity estimation and the expected impact of subgoals on the final confidence of answer candidates. Experiments with large datasets demonstrate drastic runtime improvements over both sampling and decomposition-based methods, even ...
Aggregation in Probabilistic Databases via Knowledge Compilation
Cited by 8 (3 self)
This paper presents a query evaluation technique for positive relational algebra queries with aggregates on a representation system for probabilistic data based on the algebraic structures of semiring and semimodule. The core of our evaluation technique is a procedure that compiles semimodule and semiring expressions into so-called decomposition trees, for which the computation of the probability distribution can be done in polynomial time in the size of the tree and of the distributions represented by its nodes. We give syntactic characterisations of tractable queries with aggregates by exploiting the connection between query tractability and polynomial-time decomposition trees. A prototype of the technique is incorporated in the probabilistic database engine SPROUT. We report on performance experiments with custom datasets and TPC-H data.
A Formal Approach to Finding Explanations for Database Queries
Cited by 7 (2 self)
As a consequence of the popularity of big data, many users with a variety of backgrounds seek to extract high-level information from datasets collected from various sources and combined using data integration techniques. A major challenge for research in data management is to develop tools to assist users in explaining observed query outputs. In this paper, we introduce a principled approach to provide explanations for answers to SQL queries based on intervention: removal of tuples from the database that significantly affect the query answers. We provide a formal definition of intervention in the presence of multiple relations which can interact with each other through foreign keys. First, we give a set of recursive rules to compute the intervention for any given explanation in polynomial time (data complexity). Then, we give simple and efficient algorithms based on SQL queries that can compute the top-K explanations by using standard database management systems under certain conditions. We evaluate the quality and performance of our approach by experiments on real datasets.
A Demonstration of DBWipes: Clean as You Query
Cited by 4 (1 self)
As data analytics becomes mainstream, and the complexity of the underlying data and computation grows, it will be increasingly important to provide tools that help analysts understand the underlying reasons when they encounter errors in the result. While data provenance has been a large step in providing tools to help debug complex workflows, its current form has limited utility when debugging aggregation operators that compute a single output from a large collection of inputs. Traditional provenance will return the entire input collection, which has very low precision. In contrast, users are seeking precise descriptions of the inputs that caused the errors. We propose a Ranked Provenance System, which identifies subsets of inputs that influenced the output error, describes each subset with human-readable predicates and orders them by contribution to the error. In this demonstration, we will present DBWipes, a novel data cleaning system that allows users to execute aggregate queries, and interactively detect, understand, and clean errors in the query results. Conference attendees will explore anomalies in campaign donations from the current US presidential election and in readings from a 54-node sensor deployment.
Descriptive and Prescriptive Data Cleaning
Cited by 4 (0 self)
Data cleaning techniques usually rely on some quality rules to identify violating tuples, and then fix these violations using some repair algorithms. Oftentimes, the rules, which are related to the business logic, can only be defined on some target report generated by transformations over multiple data sources. This creates a situation where the violations detected in the report are decoupled in space and time from the actual source of errors. In addition, applying the repair on the report would need to be repeated whenever the data sources change. Finally, even if repairing the report is possible and affordable, this would be of little help towards identifying and analyzing the actual sources of errors for future prevention of violations at the target. In this paper, we propose a system to address this decoupling. The system takes quality rules defined over the output of a transformation and computes explanations of the errors seen on the output. This is performed both at the target level to describe these errors and at the source level to prescribe actions to solve them. We present scalable techniques to detect, propagate, and explain errors. We also study the effectiveness and efficiency of our techniques using the TPC-H benchmark for different scenarios and classes of quality rules.
Anytime approximation in probabilistic databases
2013
Cited by 2 (2 self)
This article describes an approximation algorithm for computing the probability of propositional formulas over discrete random variables. It incrementally refines lower and upper bounds on the probability of the formulas until the desired absolute or relative error guarantee is reached. This algorithm is used by the SPROUT query engine to approximate the probabilities of results to relational algebra queries on expressive probabilistic databases.
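The bound-refinement loop described in this abstract can be pictured with a deliberately simple Python sketch. It assumes independent Boolean variables and walks through complete possible worlds one at a time, which is not the article's algorithm and scales exponentially; the function name `anytime_probability` and its parameters are hypothetical. Only an absolute error target is handled.

```python
from itertools import product


def anytime_probability(probs, formula, epsilon):
    """Return (lower, upper) bounds on P(formula) with upper - lower <= epsilon.

    probs   -- dict: variable name -> probability of being True (assumed independent)
    formula -- callable taking a dict {name: bool} and returning bool
    """
    names = list(probs)
    lower, upper = 0.0, 1.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        # Probability mass of this single possible world.
        mass = 1.0
        for name in names:
            mass *= probs[name] if world[name] else 1.0 - probs[name]
        if formula(world):
            lower += mass   # world satisfies the formula: raise the lower bound
        else:
            upper -= mass   # world falsifies it: tighten the upper bound
        if upper - lower <= epsilon:
            break           # requested absolute error reached, stop early
    return lower, upper


probs = {"x": 0.5, "y": 0.5}
conj = lambda w: w["x"] and w["y"]
coarse = anytime_probability(probs, conj, epsilon=0.5)
exact = anytime_probability(probs, conj, epsilon=0.0)
```

With `epsilon=0.5` the loop stops after two worlds; with `epsilon=0.0` it visits all four and the two bounds meet at the exact probability.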
Deduction with contradictions in datalog
In ICDT, 2014
Cited by 1 (1 self)
We study deduction in the presence of inconsistencies. Following previous works, we capture deduction via datalog programs and inconsistencies through violations of functional dependencies (FDs). We study and compare two semantics for datalog with FDs: the first, of a logical nature, is based on inferring facts one at a time, while never violating the FDs; the second, of an operational nature, consists in a fixpoint computation in which maximal sets of facts consistent with the FDs are inferred at each stage. Both semantics are nondeterministic, yielding sets of possible worlds. We introduce a PTIME (in the size of the extensional data) algorithm that, given a datalog program, a set of FDs and an input instance, produces a c-table representation of the set of possible worlds. Then, we propose to quantify nondeterminism with probabilities, by means of a probabilistic semantics. We consider the problem of capturing possible worlds along with their probabilities via probabilistic c-tables. We then study classical computational problems in this novel context. We consider the problems of computing the probabilities of answers, of identifying most likely supports for answers, and of determining the extensional facts that are most influential for deriving a particular fact. We show that the interplay of recursion and FDs leads to novel technical challenges in the context of these problems.
Algebraic model counting
2012
Cited by 1 (0 self)
Weighted model counting (WMC) is a well-known inference task on knowledge bases, used for probabilistic inference in graphical models. We introduce algebraic model counting (AMC), a generalization of WMC to a semiring structure. We show that AMC generalizes many well-known tasks in a variety of domains such as probabilistic inference, soft constraints and network and database analysis. Furthermore, we investigate AMC from a knowledge compilation perspective and show that all AMC tasks can be evaluated using sd-DNNF circuits. We identify further characteristics of AMC instances that allow for the use of even more succinct circuits.
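To make the semiring generalization concrete, the following Python sketch evaluates one formula under three different semirings by brute-force enumeration of its models. This is only an illustration of the sum-of-products structure: it does not use sd-DNNF circuits or any knowledge compilation, and the function name `amc`, its parameter names, and the example formula are assumptions made here.

```python
from itertools import product
from operator import add, mul


def amc(variables, formula, label, plus, times, zero, one):
    """Sum (plus) over all models of the product (times) of their literal labels."""
    total = zero
    for values in product([False, True], repeat=len(variables)):
        model = dict(zip(variables, values))
        if not formula(model):
            continue  # only satisfying assignments contribute
        weight = one
        for var in variables:
            weight = times(weight, label(var, model[var]))
        total = plus(total, weight)
    return total


variables = ["a", "b"]
a_or_b = lambda m: m["a"] or m["b"]
p = {"a": 0.5, "b": 0.25}
prob_label = lambda var, value: p[var] if value else 1.0 - p[var]

# Probability semiring (+, *): weighted model counting.
wmc = amc(variables, a_or_b, prob_label, add, mul, 0.0, 1.0)
# Counting: every literal labelled 1 gives the number of models.
count = amc(variables, a_or_b, lambda var, value: 1, add, mul, 0, 1)
# (max, *) semiring: weight of the most probable model.
best = amc(variables, a_or_b, prob_label, max, mul, 0.0, 1.0)
```

Swapping `plus`, `times`, and the labels changes the task without touching the evaluation loop, which is the point of phrasing model counting over a semiring.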
Cleaning Uncertain Data for Top-k Queries
Cited by 1 (0 self)
The information managed in emerging applications, such as sensor networks, location-based services, and data integration, is inherently imprecise. To handle data uncertainty, probabilistic databases have been recently developed. In this paper, we study how to quantify the ambiguity of answers returned by a probabilistic top-k query. We develop efficient algorithms to compute the quality of this query under the possible world semantics. We further address the cleaning of a probabilistic database, in order to improve top-k query quality. Cleaning involves the reduction of ambiguity associated with the database entities. For example, the uncertainty of a temperature value acquired from a sensor can be reduced, or cleaned, by requesting its newest value from the sensor. While this “cleaning operation” may produce a better query result, it may involve a cost and fail. We investigate the problem of selecting entities to be cleaned under a limited budget. Particularly, we propose an optimal solution and several heuristics. Experiments show that the greedy algorithm is efficient and close to optimal.