Results 1 - 10 of 14
Scorpion: Explaining Away Outliers in Aggregate Queries
"... Database users commonly explore large data sets by running aggregate queries that project the data down to a smaller number of points and dimensions, and visualizing the results. Often, such visualizations will reveal outliers that correspond to errors or surprising features of the input data set. U ..."
Abstract
-
Cited by 18 (4 self)
Database users commonly explore large data sets by running aggregate queries that project the data down to a smaller number of points and dimensions, and visualizing the results. Often, such visualizations will reveal outliers that correspond to errors or surprising features of the input data set. Unfortunately, databases and visualization systems do not provide a way to work backwards from an outlier point to the common properties of the (possibly many) unaggregated input tuples that correspond to that outlier. We propose Scorpion, a system that takes a set of user-specified outlier points in an aggregate query result as input and finds predicates that explain the outliers in terms of properties of the input tuples that are used to compute the selected outlier results. Specifically, this explanation identifies predicates that, when applied to the input data, cause the outliers to disappear from the output. To find such predicates, we develop a notion of influence of a predicate on a given output, and design several algorithms that efficiently search for maximum influence predicates over the input data. We show that these algorithms can quickly find outliers in two real data sets (from a sensor deployment and a campaign finance data set), and run orders of magnitude faster than a naive search algorithm while providing comparable quality on a synthetic data set.
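As a rough illustration of the influence notion mentioned in this abstract, the sketch below scores a candidate predicate by how much removing its matching tuples moves an aggregate output toward an expected value. The function names, the toy sensor data, and the normalization by the number of removed tuples are assumptions for illustration, not the paper's actual definition or search algorithm.

```python
# Sketch of the influence idea from the Scorpion abstract: score a candidate
# predicate by how much deleting the matching input tuples moves an aggregate
# output toward the value the analyst expected. Names and data are illustrative.

def influence(tuples, predicate, aggregate, target):
    """Change in |aggregate - target| per removed tuple when the predicate's
    tuples are deleted from the input."""
    kept = [t for t in tuples if not predicate(t)]
    removed = len(tuples) - len(kept)
    if removed == 0:
        return 0.0
    before = abs(aggregate(tuples) - target)
    after = abs(aggregate(kept) - target)
    return (before - after) / removed

# Toy example: an AVG(temperature) output is an outlier because of a few
# readings from one suspicious sensor.
readings = [
    {"sensor": "s1", "temp": 20.0},
    {"sensor": "s1", "temp": 21.0},
    {"sensor": "s2", "temp": 95.0},   # suspicious
    {"sensor": "s2", "temp": 97.0},   # suspicious
]
avg_temp = lambda ts: sum(t["temp"] for t in ts) / len(ts)
target_value = 20.5  # what the analyst expected this output to be

print(influence(readings, lambda t: t["sensor"] == "s2", avg_temp, target_value))  # high
print(influence(readings, lambda t: t["sensor"] == "s1", avg_temp, target_value))  # negative
```

Scorpion's actual contribution is the efficient search over the space of such predicates; this sketch only shows how a single fixed predicate could be scored.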
Aggregation in Probabilistic Databases via Knowledge Compilation
"... This paper presents a query evaluation technique for positive relational algebra queries with aggregates on a representation system for probabilistic data based on the algebraic structures of semiring and semimodule. The core of our evaluation technique is a procedure that compiles semimodule and se ..."
Abstract
-
Cited by 8 (3 self)
This paper presents a query evaluation technique for positive relational algebra queries with aggregates on a representation system for probabilistic data based on the algebraic structures of semiring and semimodule. The core of our evaluation technique is a procedure that compiles semimodule and semiring expressions into so-called decomposition trees, for which the computation of the probability distribution can be done in polynomial time in the size of the tree and of the distributions represented by its nodes. We give syntactic characterisations of tractable queries with aggregates by exploiting the connection between query tractability and polynomial-time decomposition trees. A prototype of the technique is incorporated in the probabilistic database engine SPROUT. We report on performance experiments with custom datasets and TPC-H data.
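One building block the abstract alludes to, computing the distribution at a decomposition-tree node in polynomial time, can be illustrated for a SUM node with independent children: the node's distribution is the convolution of the children's distributions. The sketch below uses hypothetical names and toy tuple-level distributions and shows only that step, not the compilation procedure or the SPROUT engine.

```python
# At a decomposition-tree node whose children are probabilistically independent,
# the distribution of their SUM is the convolution of the children's distributions.

from collections import defaultdict

def convolve(dist_a, dist_b):
    """Distribution of X + Y for independent X ~ dist_a, Y ~ dist_b.
    Each distribution is a dict {value: probability}."""
    out = defaultdict(float)
    for va, pa in dist_a.items():
        for vb, pb in dist_b.items():
            out[va + vb] += pa * pb
    return dict(out)

# Two independent uncertain tuples, each contributing its value or 0.
t1 = {0: 0.3, 10: 0.7}   # present with probability 0.7, value 10
t2 = {0: 0.5, 4: 0.5}    # present with probability 0.5, value 4

print(convolve(t1, t2))   # distribution of SUM over the two tuples
```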
Top-k Query Processing in Probabilistic Databases with Non-Materialized Views, 2012
"... In this paper, we investigate a novel approach of computing confidence bounds for top-k ranking queries in probabilistic databases with non-materialized views. Unlike prior approaches, we present an exact pruning algorithm for finding the top-ranked query answers according to their marginal probabil ..."
Abstract
-
Cited by 8 (4 self)
In this paper, we investigate a novel approach to computing confidence bounds for top-k ranking queries in probabilistic databases with non-materialized views. Unlike prior approaches, we present an exact pruning algorithm for finding the top-ranked query answers according to their marginal probabilities without the need to first materialize all answer candidates via the views. Specifically, we consider conjunctive queries over multiple levels of select-project-join views, the latter of which are cast into Datalog rules, where the rules themselves may also be uncertain, i.e., be valid with some degree of confidence. To our knowledge, this work is the first to address integrated data and confidence computations in the context of probabilistic databases by considering confidence bounds over partially evaluated query answers with first-order lineage formulas. We further extend our query processing techniques by a tool-suite of scheduling strategies based on selectivity estimation and the expected impact of subgoals on the final confidence of answer candidates. Experiments with large datasets demonstrate drastic runtime improvements over both sampling and decomposition-based methods.
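The pruning idea in the abstract, discarding an answer candidate once its probability interval can no longer reach the top-k, can be sketched as below. The bound values and the dictionary representation are illustrative assumptions; the actual algorithm interleaves such tests with refining bounds over first-order lineage formulas.

```python
# Prune answer candidates whose upper confidence bound cannot reach the top-k.

def prune(bounds, k):
    """bounds: dict candidate -> (lower, upper) on its marginal probability.
    Drop every candidate whose upper bound is below the k-th largest lower
    bound; such candidates can never enter the top-k."""
    threshold = sorted((lo for lo, _ in bounds.values()), reverse=True)[k - 1]
    return {c: b for c, b in bounds.items() if b[1] >= threshold}

partial = {"a1": (0.80, 0.95), "a2": (0.60, 0.70),
           "a3": (0.10, 0.55), "a4": (0.05, 0.40)}
# a3 and a4 are pruned: their upper bounds fall below 0.60,
# the 2nd largest lower bound.
print(prune(partial, 2))
```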
A Formal Approach to Finding Explanations for Database Queries
"... As a consequence of the popularity of big data, many users with a variety of backgrounds seek to extract high level in-formation from datasets collected from various sources and combined using data integration techniques. A major chal-lenge for research in data management is to develop tools to assi ..."
Abstract
-
Cited by 7 (2 self)
As a consequence of the popularity of big data, many users with a variety of backgrounds seek to extract high level information from datasets collected from various sources and combined using data integration techniques. A major challenge for research in data management is to develop tools to assist users in explaining observed query outputs. In this paper we introduce a principled approach to provide explanations for answers to SQL queries based on intervention: removal of tuples from the database that significantly affect the query answers. We provide a formal definition of intervention in the presence of multiple relations which can interact with each other through foreign keys. First we give a set of recursive rules to compute the intervention for any given explanation in polynomial time (data complexity). Then we give simple and efficient algorithms based on SQL queries that can compute the top-K explanations by using standard database management systems under certain conditions. We evaluate the quality and performance of our approach by experiments on real datasets.
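A minimal sketch of intervention across relations linked by a foreign key, as described above: deleting tuples from one relation cascades to dependent tuples, and the change in the query answer measures the explanation. The table names, columns, and cascading rule here are illustrative, not the paper's formal definition.

```python
# Intervention with a foreign key: removing source tuples cascades to dependent
# tuples, and the aggregate answer is recomputed on what remains.

def intervene(authors, papers, removed_author_ids):
    """Remove the given authors plus every paper referencing them (FK cascade),
    then return the remaining paper count per year."""
    kept_authors = [a for a in authors if a["id"] not in removed_author_ids]
    kept_ids = {a["id"] for a in kept_authors}
    kept_papers = [p for p in papers if p["author_id"] in kept_ids]
    counts = {}
    for p in kept_papers:
        counts[p["year"]] = counts.get(p["year"], 0) + 1
    return counts

authors = [{"id": 1}, {"id": 2}]
papers = [{"author_id": 1, "year": 2013}, {"author_id": 1, "year": 2013},
          {"author_id": 2, "year": 2013}]

print(intervene(authors, papers, set()))    # {2013: 3} original answer
print(intervene(authors, papers, {1}))      # {2013: 1} after the intervention
```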
A Demonstration of DBWipes: Clean as You Query
"... As data analytics becomes mainstream, and the complexity of the underlying data and computation grows, it will be increasingly important to provide tools that help analysts understand the underlying reasons when they encounter errors in the result. While data provenance has been a large step in prov ..."
Abstract
-
Cited by 4 (1 self)
As data analytics becomes mainstream, and the complexity of the underlying data and computation grows, it will be increasingly important to provide tools that help analysts understand the underlying reasons when they encounter errors in the result. While data provenance has been a large step in providing tools to help debug complex workflows, its current form has limited utility when debugging aggregation operators that compute a single output from a large collection of inputs. Traditional provenance will return the entire input collection, which has very low precision. In contrast, users are seeking precise descriptions of the inputs that caused the errors. We propose a Ranked Provenance System, which identifies subsets of inputs that influenced the output error, describes each subset with human readable predicates and orders them by contribution to the error. In this demonstration, we will present DBWipes, a novel data cleaning system that allows users to execute aggregate queries, and interactively detect, understand, and clean errors in the query results. Conference attendees will explore anomalies in campaign donations from the current US presidential election and in readings from a 54-node sensor deployment.
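The ranked-provenance idea can be sketched as ordering human-readable candidate predicates by a contribution score. The score used below (share of the total donation amount) is a placeholder, not DBWipes' actual influence measure, and the data is invented.

```python
# Describe candidate input subsets with readable predicates and order them by
# their estimated contribution to the output error.

def rank_explanations(candidates, score):
    """candidates: dict of readable predicate string -> tuple subset."""
    ranked = sorted(candidates.items(), key=lambda kv: score(kv[1]), reverse=True)
    return [(desc, score(subset)) for desc, subset in ranked]

donations = [
    {"state": "TX", "amount": 9000}, {"state": "TX", "amount": 8800},
    {"state": "MA", "amount": 120}, {"state": "MA", "amount": 90},
]
candidates = {
    "state = 'TX'": [d for d in donations if d["state"] == "TX"],
    "state = 'MA'": [d for d in donations if d["state"] == "MA"],
}
# Toy score: how much of the total amount the subset accounts for.
total = sum(d["amount"] for d in donations)
print(rank_explanations(candidates, lambda sub: sum(d["amount"] for d in sub) / total))
```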
Descriptive and Prescriptive Data Cleaning
"... Data cleaning techniques usually rely on some quality rules to identify violating tuples, and then fix these violations us-ing some repair algorithms. Oftentimes, the rules, which are related to the business logic, can only be defined on some tar-get report generated by transformations over multiple ..."
Abstract
-
Cited by 4 (0 self)
Data cleaning techniques usually rely on some quality rules to identify violating tuples, and then fix these violations using some repair algorithms. Oftentimes, the rules, which are related to the business logic, can only be defined on some target report generated by transformations over multiple data sources. This creates a situation where the violations detected in the report are decoupled in space and time from the actual source of errors. In addition, applying the repair on the report would need to be repeated whenever the data sources change. Finally, even if repairing the report is possible and affordable, this would be of little help towards identifying and analyzing the actual sources of errors for future prevention of violations at the target. In this paper, we propose a system to address this decoupling. The system takes quality rules defined over the output of a transformation and computes explanations of the errors seen on the output. This is performed both at the target level to describe these errors and at the source level to prescribe actions to solve them. We present scalable techniques to detect, propagate, and explain errors. We also study the effectiveness and efficiency of our techniques using the TPC-H Benchmark for different scenarios and classes of quality rules.
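One step described above, propagating rule violations detected on the target report back to the contributing source tuples, can be sketched with an explicit lineage map. The rule, report rows, and lineage identifiers below are invented for illustration.

```python
# Follow the lineage of report rows that violate a quality rule back to the
# source tuples that produced them.

def propagate_violations(report, lineage, rule):
    """Return the source tuple ids contributing to report rows violating the rule.
    lineage maps a report row id to the set of source tuple ids it derives from."""
    suspects = set()
    for row in report:
        if not rule(row):
            suspects |= lineage[row["id"]]
    return suspects

report = [
    {"id": "r1", "region": "EU", "revenue": -500},   # violates revenue >= 0
    {"id": "r2", "region": "US", "revenue": 1200},
]
lineage = {"r1": {"orders:17", "orders:23"}, "r2": {"orders:31"}}

print(propagate_violations(report, lineage, lambda r: r["revenue"] >= 0))
# {'orders:17', 'orders:23'} -> candidate source errors to describe or repair
```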
Anytime approximation in probabilistic databases, 2013
"... This article describes an approximation algorithm for computing the probability of propositional formulas over discrete random variables. It incrementally refines lower and upper bounds on the probability of the formulas until the desired absolute or relative error guarantee is reached. This algori ..."
Abstract
-
Cited by 2 (2 self)
This article describes an approximation algorithm for computing the probability of propositional formulas over discrete random variables. It incrementally refines lower and upper bounds on the probability of the formulas until the desired absolute or relative error guarantee is reached. This algorithm is used by the SPROUT query engine to approximate the probabilities of results to relational algebra queries on expressive probabilistic databases.
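The refine-until-the-guarantee-holds loop described in this abstract can be illustrated with a deliberately restricted toy: bounding the probability of a disjunction of clauses over disjoint independent variables, stopping once the absolute gap is below epsilon. The clause probabilities and bounding rules below are simplifying assumptions; the article's algorithm handles general propositional formulas over discrete random variables.

```python
# Keep a [lower, upper] interval on the probability of a formula and tighten it
# one step at a time until the absolute error guarantee holds. Toy restriction:
# the formula is an OR of clauses over disjoint independent variables.

def anytime_or(clause_probs, epsilon):
    """Refine bounds on P(c1 or ... or cn) one clause at a time."""
    covered = 0.0            # exact probability of the clauses processed so far
    remaining = list(clause_probs)
    while remaining:
        lower = covered
        # Union bound on the unprocessed clauses gives a valid upper bound.
        upper = covered + (1 - covered) * min(1.0, sum(remaining))
        if upper - lower <= epsilon:
            return lower, upper
        p = remaining.pop(0)
        covered = 1 - (1 - covered) * (1 - p)   # clauses are independent here
    return covered, covered   # all clauses processed: bounds collapse

print(anytime_or([0.3, 0.2, 0.05, 0.01, 0.005], epsilon=0.05))  # (0.44, 0.4764)
```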
Cleaning Uncertain Data for Top-k Queries
"... Abstract — The information managed in emerging applications, such as sensor networks, location-based services, and data integration, are inherently imprecise. To handle data uncertainty, probabilistic databases have been recently developed. In this paper, we study how to quantify the ambiguity of an ..."
Abstract
-
Cited by 1 (0 self)
The information managed in emerging applications, such as sensor networks, location-based services, and data integration, is inherently imprecise. To handle data uncertainty, probabilistic databases have been recently developed. In this paper, we study how to quantify the ambiguity of answers returned by a probabilistic top-k query. We develop efficient algorithms to compute the quality of this query under the possible world semantics. We further address the cleaning of a probabilistic database, in order to improve top-k query quality. Cleaning involves the reduction of ambiguity associated with the database entities. For example, the uncertainty of a temperature value acquired from a sensor can be reduced, or cleaned, by requesting its newest value from the sensor. While this “cleaning operation” may produce a better query result, it may involve a cost and fail. We investigate the problem of selecting entities to be cleaned under a limited budget. Particularly, we propose an optimal solution and several heuristics. Experiments show that the greedy algorithm is efficient and close to optimal.
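The budgeted-cleaning heuristic mentioned at the end of the abstract can be sketched as a greedy benefit-per-cost selection. The gain and cost numbers below are invented, and the paper's quality measure for top-k ambiguity is not modeled here.

```python
# Greedy budgeted cleaning: repeatedly pick the uncertain entity with the
# largest expected quality improvement per unit cost until the budget runs out.

def greedy_clean(entities, budget):
    """entities: list of dicts with 'name', 'gain' (expected quality improvement
    if cleaned) and 'cost'. Returns the chosen names."""
    chosen = []
    ordered = sorted(entities, key=lambda e: e["gain"] / e["cost"], reverse=True)
    for e in ordered:
        if e["cost"] <= budget:
            chosen.append(e["name"])
            budget -= e["cost"]
    return chosen

sensors = [
    {"name": "s7",  "gain": 0.40, "cost": 2},
    {"name": "s3",  "gain": 0.25, "cost": 1},
    {"name": "s12", "gain": 0.30, "cost": 3},
]
print(greedy_clean(sensors, budget=3))   # ['s3', 's7'] under this toy scoring
```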
Algebraic model counting, 2012
"... Abstract Weighted model counting (WMC) is a well-known inference task on knowledge bases, used for probabilistic inference in graphical models. We introduce algebraic model counting (AMC), a generalization of WMC to a semiring structure. We show that AMC generalizes many well-known tasks in a varie ..."
Abstract
-
Cited by 1 (0 self)
Weighted model counting (WMC) is a well-known inference task on knowledge bases, used for probabilistic inference in graphical models. We introduce algebraic model counting (AMC), a generalization of WMC to a semiring structure. We show that AMC generalizes many well-known tasks in a variety of domains such as probabilistic inference, soft constraints and network and database analysis. Furthermore, we investigate AMC from a knowledge compilation perspective and show that all AMC tasks can be evaluated using sd-DNNF circuits. We identify further characteristics of AMC instances that allow for the use of even more succinct circuits.
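A small sketch of the AMC evaluation scheme described above: the same circuit is evaluated under two semirings by swapping the addition, multiplication, and literal-labelling functions. The circuit encoding and labelling functions are illustrative; correctness for arbitrary semirings relies on circuit properties (smoothness, determinism, decomposability) that this toy only satisfies by construction.

```python
# Evaluate a d-DNNF-style circuit bottom-up: OR nodes use the semiring's
# addition, AND nodes its multiplication, and literals are mapped by a label.

def amc(node, plus, times, label):
    kind = node[0]
    if kind == "lit":
        return label(node[1])
    children = [amc(c, plus, times, label) for c in node[1]]
    combine = plus if kind == "or" else times
    result = children[0]
    for value in children[1:]:
        result = combine(result, value)
    return result

# Circuit for (x AND y) OR (NOT x AND y): an OR of two decomposable AND nodes.
circuit = ("or", [("and", [("lit", "x"), ("lit", "y")]),
                  ("and", [("lit", "-x"), ("lit", "y")])])

probs = {"x": 0.6, "-x": 0.4, "y": 0.5, "-y": 0.5}
# Probability semiring: (+, *) with probabilistic literal weights.
print(amc(circuit, lambda a, b: a + b, lambda a, b: a * b, lambda l: probs[l]))  # 0.5
# Counting semiring: every literal weighs 1, giving the number of models (2).
print(amc(circuit, lambda a, b: a + b, lambda a, b: a * b, lambda l: 1))
```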
Deduction with contradictions in datalog, In ICDT, 2014
"... We study deduction in the presence of inconsistencies. Following previous works, we capture deduction via datalog programs and in-consistencies through violations of functional dependencies (FDs). We study and compare two semantics for datalog with FDs: the first, of a logical nature, is based on in ..."
Abstract
-
Cited by 1 (1 self)
We study deduction in the presence of inconsistencies. Following previous works, we capture deduction via datalog programs and inconsistencies through violations of functional dependencies (FDs). We study and compare two semantics for datalog with FDs: the first, of a logical nature, is based on inferring facts one at a time, while never violating the FDs; the second, of an operational nature, consists in a fixpoint computation in which maximal sets of facts consistent with the FDs are inferred at each stage. Both semantics are nondeterministic, yielding sets of possible worlds. We introduce a PTIME (in the size of the extensional data) algorithm, that given a datalog program, a set of FDs and an input instance, produces a c-table representation of the set of possible worlds. Then, we propose to quantify nondeterminism with probabilities, by means of a probabilistic semantics. We consider the problem of capturing possible worlds along with their probabilities via probabilistic c-tables. We then study classical computational problems in this novel context. We consider the problems of computing the probabilities of answers, of identifying most likely supports for answers, and of determining the extensional facts that are most influential for deriving a particular fact. We show that the interplay of recursion and FDs leads to novel technical challenges in the context of these problems.
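The first semantics described above, inferring facts one at a time while never violating the FDs, can be sketched by trying candidate facts in every order and collecting the resulting consistent sets, each of which is one possible world. The single FD, the fixed candidate facts (standing in for rule-derived facts), and the relation shape below are illustrative only.

```python
# Derive facts one at a time, rejecting any fact that would violate a functional
# dependency; different derivation orders yield different possible worlds.

from itertools import permutations

def violates_fd(facts, new_fact):
    """Toy FD: the first attribute determines the second (e.g. person -> city)."""
    return any(f[0] == new_fact[0] and f[1] != new_fact[1] for f in facts)

def derive(candidates, order):
    """Add candidate facts in the given order, skipping FD violations."""
    world = []
    for i in order:
        if not violates_fd(world, candidates[i]):
            world.append(candidates[i])
    return frozenset(world)

# Two candidate derivations that disagree on Alice's city.
candidates = [("alice", "paris"), ("alice", "rome"), ("bob", "oslo")]
worlds = {derive(candidates, order) for order in permutations(range(len(candidates)))}
for w in worlds:
    print(sorted(w))   # two possible worlds, differing on Alice's city
```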