Consistent Query Answering: Five Easy Pieces
, 2007
Abstract

Cited by 80 (3 self)
Consistent query answering (CQA) is an approach to querying inconsistent databases without repairing them first. This invited talk introduces the basics of CQA, and discusses selected issues in this area. The talk concludes with a summary of other relevant work and an outline of potential future research topics.
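The repair-based semantics behind CQA can be made concrete with a small sketch. The following Python model is illustrative only (not from the talk): under a single key constraint, a repair keeps exactly one tuple per key, and a consistent answer is one returned in every repair. All relation and attribute names are made up.

```python
# A minimal sketch of consistent query answering (CQA), assuming a key
# constraint: each employee name should appear in at most one tuple.
from itertools import product

# Inconsistent relation: 'alice' violates the key (two salaries).
emp = [("alice", 50000), ("alice", 55000), ("bob", 40000)]

def repairs(rel):
    """Enumerate key repairs: pick exactly one tuple per key value."""
    groups = {}
    for t in rel:
        groups.setdefault(t[0], []).append(t)
    for choice in product(*groups.values()):
        yield set(choice)

def consistent_answers(rel, query):
    """Answers that hold in *every* repair of the inconsistent relation."""
    answer_sets = [query(r) for r in repairs(rel)]
    return set.intersection(*answer_sets)

# Query: names of employees earning at least 45000. 'alice' earns at
# least 45000 in both repairs, so she is a consistent answer; 'bob' is not.
q = lambda r: {name for (name, sal) in r if sal >= 45000}
print(consistent_answers(emp, q))  # {'alice'}
```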
Improving Data Quality: Consistency and Accuracy
Abstract

Cited by 73 (15 self)
Two central criteria for data quality are consistency and accuracy. Inconsistencies and errors in a database often emerge as violations of integrity constraints. Given a dirty database D, one needs automated methods to make it consistent, i.e., find a repair D′ that satisfies the constraints and “minimally” differs from D. Equally important is to ensure that the automatically generated repair D′ is accurate, or makes sense, i.e., D′ differs from the “correct” data within a predefined bound. This paper studies effective methods for improving both data consistency and accuracy. We employ a class of conditional functional dependencies (CFDs) proposed in [6] to specify the consistency of the data, which are able to capture inconsistencies and errors beyond what their traditional counterparts can catch. To improve the consistency of the data, we propose two algorithms: one for automatically computing a repair D′ that satisfies a given set of CFDs, and the other for incrementally finding a repair in response to updates to a clean database. We show that both problems are intractable. Although our algorithms are necessarily heuristic, we experimentally verify that the methods are effective and efficient. Moreover, we develop a statistical method that guarantees that the repairs found by the algorithms are accurate above a predefined rate without incurring excessive user interaction.
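One common repair heuristic for CFD violations can be sketched as follows. This is an illustration, not the paper's algorithm: for tuples matching the CFD's pattern, the right-hand-side attribute is forced to agree by taking the majority value within each left-hand-side group, changing only that attribute. Field names and data are hypothetical.

```python
# Sketch of heuristic CFD-based repairing (majority vote on the RHS).
# CFD: when country == "UK", zip determines city.
from collections import Counter

rows = [
    {"country": "UK", "zip": "EH4", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4", "city": "London"},    # error
    {"country": "US", "zip": "EH4", "city": "Anywhere"},  # pattern not matched
]

def repair_cfd(rows):
    # Group the tuples that match the CFD's pattern (country = 'UK') by
    # the LHS attribute (zip), then overwrite the RHS (city) with the
    # majority value in each group -- a minimal, single-attribute change.
    groups = {}
    for r in rows:
        if r["country"] == "UK":
            groups.setdefault(r["zip"], []).append(r)
    for grp in groups.values():
        majority = Counter(r["city"] for r in grp).most_common(1)[0][0]
        for r in grp:
            r["city"] = majority
    return rows

repaired = repair_cfd(rows)
```

After the repair, all UK tuples with zip EH4 agree on the majority city, while the US tuple, which falls outside the CFD's pattern, is untouched.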
Conditional functional dependencies for capturing data inconsistencies
 TODS
Abstract

Cited by 68 (12 self)
We propose a class of integrity constraints for relational databases, referred to as conditional functional dependencies (CFDs), and study their applications in data cleaning. In contrast to traditional functional dependencies (FDs) that were developed mainly for schema design, CFDs aim at capturing the consistency of data by enforcing bindings of semantically related values. For static analysis of CFDs we investigate the consistency problem, which is to determine whether or not there exists a nonempty database satisfying a given set of CFDs, and the implication problem, which is to decide whether or not a set of CFDs entails another CFD. We show that while any set of traditional FDs is trivially consistent, the consistency problem is NP-complete for CFDs, but it is in PTIME when either the database schema is predefined or no attributes involved in the CFDs have a finite domain. For the implication analysis of CFDs, we provide an inference system analogous to Armstrong’s axioms for FDs, and show that the implication problem is coNP-complete for CFDs, in contrast to the linear-time complexity for their traditional counterpart. We also present an algorithm for computing a minimal cover of a set of CFDs. Since CFDs allow data bindings, in some cases CFDs may be physically large, complicating detection of constraint violations. We develop techniques for detecting CFD violations in SQL as well as novel techniques for checking multiple constraints in a single query.
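What a CFD expresses can be shown with a small checker. This sketch assumes a simplified encoding (attribute names, a dict-based pattern with "_" as wildcard); it illustrates the semantics, not the paper's analysis algorithms.

```python
# Illustrative satisfaction check for a CFD (lhs -> rhs, pattern):
# rows matching the pattern on the lhs attributes must agree on the rhs
# whenever they agree on the lhs. "_" in the pattern is a wildcard.

def satisfies_cfd(rows, lhs, rhs, pattern):
    def matches(row):
        return all(pattern.get(a, "_") in ("_", row[a]) for a in lhs)
    seen = {}
    for row in rows:
        if not matches(row):
            continue  # the CFD only constrains rows matching its pattern
        key = tuple(row[a] for a in lhs)
        if key in seen and seen[key] != row[rhs]:
            return False  # two matching rows agree on lhs, differ on rhs
        seen[key] = row[rhs]
    return True

rows = [
    {"country": "UK", "zip": "EH4", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4", "city": "London"},
    {"country": "US", "zip": "EH4", "city": "Boston"},  # not constrained
]
# CFD (country = 'UK', zip = _ : [country, zip] -> city) is violated:
assert not satisfies_cfd(rows, lhs=["country", "zip"], rhs="city",
                         pattern={"country": "UK", "zip": "_"})
```

Unlike a plain FD, the pattern lets the dependency hold only on the UK portion of the data, which is exactly the extra expressiveness the abstract describes.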
10^(10^6) Worlds and Beyond: Efficient Representation and Processing of Incomplete Information
, 2006
Abstract

Cited by 66 (8 self)
Current systems and formalisms for representing incomplete information generally suffer from at least one of two weaknesses. Either they are not strong enough for representing results of simple queries, or the handling and processing of the data, e.g. for query evaluation, is intractable. In this paper, we present a decomposition-based approach to addressing this problem. We introduce world-set decompositions (WSDs), a space-efficient formalism for representing any finite set of possible worlds over relational databases. WSDs are therefore a strong representation system for any relational query language. We study the problem of efficiently evaluating relational algebra queries on sets of worlds represented by WSDs. We also evaluate our technique experimentally in a large census data scenario and show that it is both scalable and efficient.
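The space savings behind the decomposition idea can be sketched in a few lines. This toy encoding is an assumption for illustration: independent uncertain fields are stored as separate components, and the world set is their Cartesian product, so storage is the *sum* of the alternatives while the number of worlds is their *product*.

```python
# Toy world-set decomposition: 3 * 2 = 6 worlds stored as 3 + 2 = 5
# alternatives. Field values are made up.
from itertools import product
from math import prod

components = [
    [{"name": "Smith"}, {"name": "Smyth"}, {"name": "Smithe"}],
    [{"age": 35}, {"age": 36}],
]

# Number of represented worlds, computed without enumerating them.
n_worlds = prod(len(c) for c in components)
assert n_worlds == 6

# Materializing a world merges one alternative from each component.
worlds = [{k: v for alt in combo for k, v in alt.items()}
          for combo in product(*components)]
assert {"name": "Smith", "age": 35} in worlds
```

With many independent uncertain fields the gap becomes exponential, which is the effect the census-scale experiments in the abstract rely on.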
Conditional functional dependencies for data cleaning
 In ICDE
, 2007
Abstract

Cited by 62 (6 self)
We propose a class of constraints, referred to as conditional functional dependencies (CFDs), and study their applications in data cleaning. In contrast to traditional functional dependencies (FDs) that were developed mainly for schema design, CFDs aim at capturing the consistency of data by incorporating bindings of semantically related values. For CFDs we provide an inference system analogous to Armstrong’s axioms for FDs, as well as consistency analysis. Since CFDs allow data bindings, a large number of individual constraints may hold on a table, complicating detection of constraint violations. We develop techniques for detecting CFD violations in SQL as well as novel techniques for checking multiple constraints in a single query. We experimentally evaluate the performance of our CFD-based methods for inconsistency detection. This not only yields a constraint theory for CFDs but is also a step toward a practical constraint-based method for improving data quality.
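SQL-based violation detection can be sketched in the spirit of the abstract (the paper's actual queries differ): select the rows matching the CFD's pattern, group by the left-hand side, and flag groups with more than one right-hand-side value. Table and column names below are hypothetical.

```python
# Sketch of detecting CFD violations with a SQL query, using sqlite3.
# CFD: (country = 'UK' : zip -> city).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cust (country TEXT, zip TEXT, city TEXT)")
conn.executemany("INSERT INTO cust VALUES (?,?,?)", [
    ("UK", "EH4", "Edinburgh"),
    ("UK", "EH4", "London"),   # violates zip -> city under country = 'UK'
    ("US", "10001", "NYC"),    # outside the CFD's pattern
])

# Groups of pattern-matching rows whose RHS is not unique are violations.
violations = conn.execute("""
    SELECT zip FROM cust
    WHERE country = 'UK'
    GROUP BY zip
    HAVING COUNT(DISTINCT city) > 1
""").fetchall()
print(violations)  # [('EH4',)]
```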
Large-scale deduplication with constraints using Dedupalog
 In Proceedings of the 25th International Conference on Data Engineering (ICDE)
Abstract

Cited by 47 (3 self)
We present a declarative framework for collective deduplication of entity references in the presence of constraints. Constraints occur naturally in many data cleaning domains and can improve the quality of deduplication. An example of a constraint is “each paper has a unique publication venue”; if two paper references are duplicates, then their associated conference references must be duplicates as well. Our framework supports collective deduplication, meaning that we can dedupe both paper references and conference references collectively in the example above. Our framework is based on a simple declarative Datalog-style language with precise semantics. Most previous work on deduplication either ignores constraints or uses them in an ad hoc, domain-specific manner. We also present efficient algorithms to support the framework. Our algorithms have precise theoretical guarantees for a large subclass of our framework. We show, using a prototype implementation, that our algorithms scale to very large datasets. We provide thorough experimental results over real-world data demonstrating the utility of our framework for high-quality and scalable deduplication.
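The paper's own example constraint can be rendered as a toy: a hard rule "duplicate papers imply duplicate venues" propagated through union-find clusters. This illustrates the collective semantics only, not the Dedupalog language or its algorithms; all identifiers are invented.

```python
# Collective deduplication sketch: merging two paper references forces
# their venue references to merge as well.

class UnionFind:
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:           # path halving
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

papers, venues = UnionFind(), UnionFind()
venue_of = {"p1": "VLDB-05", "p2": "vldb 2005"}

# A similarity matcher decided p1 and p2 are duplicates ...
papers.union("p1", "p2")

# ... so the hard constraint propagates the merge to their venues.
for a, b in [("p1", "p2")]:
    if papers.find(a) == papers.find(b):
        venues.union(venue_of[a], venue_of[b])

assert venues.find("VLDB-05") == venues.find("vldb 2005")
```

Clustering papers and venues in one pass, with merges flowing between the two clusterings, is what makes the deduplication "collective".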
Extending Dependencies with Conditions
Abstract

Cited by 40 (10 self)
This paper introduces a class of conditional inclusion dependencies (CINDs), which extend traditional inclusion dependencies (INDs) by enforcing bindings of semantically related data values. We show that CINDs are useful not only in data cleaning but also in contextual schema matching [7]. To make effective use of CINDs in practice, it is often necessary to reason about them. The most important static analysis issue concerns consistency: determining whether or not a given set of CINDs has conflicts. Another issue concerns implication, i.e., deciding whether a set of CINDs entails another CIND. We give a full treatment of the static analyses of CINDs, and show that CINDs retain most of the nice properties of traditional INDs: (a) CINDs are always consistent; (b) CINDs are finitely axiomatizable, i.e., there exists a sound and complete inference system for implication of CINDs; and (c) the implication problem for CINDs has the same complexity as its traditional counterpart, namely PSPACE-complete, in the absence of attributes with a finite domain, but it is EXPTIME-complete in the general setting. In addition, we investigate the interaction between CINDs and conditional functional dependencies (CFDs), an extension of functional dependencies proposed in [9]. We show that the consistency problem for the combination of CINDs and CFDs becomes undecidable. In light of the undecidability, we provide heuristic algorithms for the consistency analysis of CFDs and CINDs, and experimentally verify the effectiveness and efficiency of our algorithms.
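A CIND conditions an inclusion on a pattern, which a small checker can illustrate. The schemas and the dependency below are made up for illustration: orders of type 'book' must reference an item present in the book relation, while other orders are unconstrained.

```python
# Illustrative check for a conditional inclusion dependency (CIND):
# (orders[item; type = 'book'] ⊆ books[id]) -- the inclusion applies
# only to tuples satisfying the condition type = 'book'.

def satisfies_cind(orders, books):
    book_ids = {b["id"] for b in books}
    return all(o["item"] in book_ids
               for o in orders if o["type"] == "book")

orders = [{"item": "b1", "type": "book"},
          {"item": "c9", "type": "cd"}]   # not constrained by the CIND
books = [{"id": "b1"}]
assert satisfies_cind(orders, books)

orders.append({"item": "b7", "type": "book"})  # dangling reference
assert not satisfies_cind(orders, books)
```

A traditional IND would force *every* order's item into books; the condition is what makes the same data valid for CDs but invalid for books.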
Towards Certain Fixes with Editing Rules and Master Data
Abstract

Cited by 38 (11 self)
A variety of integrity constraints have been studied for data cleaning. While these constraints can detect the presence of errors, they fall short of guiding us to correct the errors. Indeed, data repairing based on these constraints may not find certain fixes that are absolutely correct, and worse, may introduce new errors when repairing the data. We propose a method for finding certain fixes, based on master data, a notion of certain regions, and a class of editing rules. A certain region is a set of attributes that are assured correct by the users. Given a certain region and master data, editing rules tell us what attributes to fix and how to update them. We show how the method can be used in data monitoring and enrichment. We develop techniques for reasoning about editing rules, to decide whether they lead to a unique fix and whether they are able to fix all the attributes in a tuple, relative to master data and a certain region. We also provide an algorithm to identify minimal certain regions, such that a certain fix is warranted by editing rules and master data as long as one of the regions is correct. We experimentally verify the effectiveness and scalability of the algorithm.
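The interplay of certain regions, master data, and editing rules can be sketched in a few lines. The rule, the attribute names, and the master relation below are all hypothetical; the point is only the mechanism: because the zip is user-certified correct, the fix copied from master data is certain rather than heuristic.

```python
# Sketch of an editing rule backed by master data: if 'zip' lies in the
# certified-correct region, copy city and state from the master tuple.

master = {"EH4": {"city": "Edinburgh", "state": "Scotland"}}

def apply_editing_rule(tup, certain_region):
    """Editing rule (zip -> city, state): applies only when the user has
    certified the zip attribute correct and master data covers it."""
    if "zip" in certain_region and tup["zip"] in master:
        tup.update(master[tup["zip"]])
    return tup

t = {"zip": "EH4", "city": "London", "state": ""}
fixed = apply_editing_rule(t, certain_region={"zip"})
assert fixed["city"] == "Edinburgh" and fixed["state"] == "Scotland"
```

Without the certified region the rule would not fire, which is how the method avoids introducing new errors the way constraint-only repairing can.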
World-set decompositions: Expressiveness and efficient algorithms
 In Proc. ICDT
, 2007
Abstract

Cited by 37 (12 self)
Uncertain information is commonplace in real-world data management scenarios. The ability to represent large sets of possible instances (worlds) while supporting efficient storage and processing is an important challenge in this context. The recent formalism of world-set decompositions (WSDs) provides a space-efficient representation for uncertain data that also supports scalable processing. WSDs are complete for finite world-sets in that they can represent any finite set of possible worlds. For possibly infinite world-sets, we show that a natural generalization of WSDs precisely captures the expressive power of c-tables. We then show that several important problems are efficiently solvable on WSDs while they are NP-hard on c-tables. Finally, we give a polynomial-time algorithm for factorizing WSDs, i.e., an efficient algorithm for minimizing such representations.
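The factorization problem the abstract mentions can be shown at toy scale. This sketch only tests one candidate split (the paper's algorithm is far more general): an explicit world set factors into independent per-field components exactly when it equals the Cartesian product of its per-field alternatives.

```python
# Toy check that a world set admits a product factorization, i.e. can be
# stored as independent components instead of an explicit list of worlds.
from itertools import product

worlds = {("Smith", 35), ("Smith", 36), ("Smyth", 35), ("Smyth", 36)}

names = {w[0] for w in worlds}  # per-field alternatives
ages = {w[1] for w in worlds}

# The two fields are independent iff the world set is the full product:
# then 4 worlds can be stored as 2 + 2 alternatives.
factorizable = worlds == set(product(names, ages))
assert factorizable

# Removing one world breaks independence, so no such factorization exists.
assert {("Smith", 35)} != set()  # sanity
assert (worlds - {("Smith", 35)}) != set(product(names, ages))
```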
Repair Checking in Inconsistent Databases: Algorithms and Complexity
 In Proc. of the International Conference on Database Theory (ICDT)
Abstract

Cited by 37 (3 self)
Managing inconsistency in databases has long been recognized as an important problem. One of the most promising approaches to coping with inconsistency in databases is the framework of database repairs, which has been the topic of an extensive investigation over the past several years. Intuitively, a repair of an inconsistent database is a consistent database that differs from the given inconsistent database in a minimal way. So far, most of the work in this area has addressed the problem of obtaining the consistent answers to a query posed on an inconsistent database. Repair checking is the following decision problem: given two databases r and r′, is r′ a repair of r? Although repair checking is a fundamental algorithmic problem about inconsistent databases, it has not received as much attention as consistent query answering. In this paper, we give a polynomial-time algorithm for subset-repair checking under integrity constraints that are the union of a weakly acyclic set of local-as-view (LAV) tuple-generating dependencies and a set of equality-generating dependencies. This result significantly generalizes earlier work for subset-repair checking when the integrity constraints are the union of an acyclic set of inclusion dependencies and a set of functional dependencies. We also give a polynomial-time algorithm for symmetric-difference repair checking, when the integrity constraints form a weakly acyclic set of LAV tgds. After this, we establish a number of complexity-theoretic results that delineate the boundary between tractability and intractability for the repair-checking problem. Specifically, we show that the aforementioned tractability …
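The definition of subset-repair checking can be read off directly as code. This brute-force sketch is a concrete statement of the problem, not the paper's polynomial-time algorithm: r2 is a subset repair of r1 under an FD iff r2 is a consistent subset of r1 and no consistent subset of r1 strictly contains it. The FD and relations below are invented.

```python
# Brute-force subset-repair check under a single functional dependency.
from itertools import combinations

def consistent(rel, fd):
    """Does the relation satisfy FD: attribute lhs determines rhs?"""
    lhs, rhs = fd
    seen = {}
    for t in rel:
        if t[lhs] in seen and seen[t[lhs]] != t[rhs]:
            return False
        seen[t[lhs]] = t[rhs]
    return True

def is_subset_repair(r1, r2, fd):
    if not (set(r2) <= set(r1) and consistent(r2, fd)):
        return False
    extra = [t for t in r1 if t not in r2]
    # Maximality: adding back any nonempty set of leftover tuples must
    # break the FD, otherwise r2 was not a minimal-change repair.
    return all(not consistent(list(r2) + list(c), fd)
               for k in range(1, len(extra) + 1)
               for c in combinations(extra, k))

fd = (0, 1)  # attribute 0 determines attribute 1
r1 = [("a", 1), ("a", 2), ("b", 3)]
assert is_subset_repair(r1, [("a", 1), ("b", 3)], fd)
assert not is_subset_repair(r1, [("b", 3)], fd)  # consistent but not maximal
```

The exponential maximality loop is exactly what the paper's results replace with polynomial-time checks for the constraint classes it identifies.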