Improving Data Quality: Consistency and Accuracy
Cited by 72 (15 self)
Two central criteria for data quality are consistency and accuracy. Inconsistencies and errors in a database often emerge as violations of integrity constraints. Given a dirty database D, one needs automated methods to make it consistent, i.e., find a repair D′ that satisfies the constraints and “minimally” differs from D. Equally important is to ensure that the automatically generated repair D′ is accurate, or makes sense, i.e., D′ differs from the “correct” data within a predefined bound. This paper studies effective methods for improving both data consistency and accuracy. We employ a class of conditional functional dependencies (CFDs) proposed in [6] to specify the consistency of the data, which are able to capture inconsistencies and errors beyond what their traditional counterparts can catch. To improve the consistency of the data, we propose two algorithms: one for automatically computing a repair D′ that satisfies a given set of CFDs, and the other for incrementally finding a repair in response to updates to a clean database. We show that both problems are intractable. Although our algorithms are necessarily heuristic, we experimentally verify that the methods are effective and efficient. Moreover, we develop a statistical method that guarantees that the repairs found by the algorithms are accurate above a predefined rate without incurring excessive user interaction.
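To make the notion of a CFD violation concrete, the following is a minimal sketch and not the repair algorithm of the paper. It assumes a CFD is written as (X → Y, pattern), where each pattern cell is either a constant or the wildcard "_"; the relation `cust`, its attributes, and the function name `cfd_violations` are hypothetical and chosen only for illustration.

```python
from itertools import combinations

WILDCARD = "_"


def matches(value, pattern):
    """A pattern cell matches any value if it is the wildcard, else by equality."""
    return pattern == WILDCARD or value == pattern


def cfd_violations(rows, lhs, rhs, pattern):
    """Return tuples of row indexes that violate the CFD (lhs -> rhs, pattern)."""
    violations = []
    # Only rows matching the LHS pattern are in the scope of the CFD.
    scoped = [i for i, r in enumerate(rows)
              if all(matches(r[a], pattern[a]) for a in lhs)]
    # Single-tuple violations: a constant in the RHS pattern is not matched.
    for i in scoped:
        if not all(matches(rows[i][b], pattern[b]) for b in rhs):
            violations.append((i,))
    # Pair violations: two scoped rows agree on the LHS but differ on the RHS.
    for i, j in combinations(scoped, 2):
        same_lhs = all(rows[i][a] == rows[j][a] for a in lhs)
        diff_rhs = any(rows[i][b] != rows[j][b] for b in rhs)
        if same_lhs and diff_rhs:
            violations.append((i, j))
    return violations


# Hypothetical customer data: country code, area code, zip, street, city.
cust = [
    {"cc": "44", "ac": "131", "zip": "EH4", "street": "Mayfield", "city": "EDI"},
    {"cc": "44", "ac": "131", "zip": "EH4", "street": "Crichton", "city": "EDI"},
    {"cc": "01", "ac": "908", "zip": "07974", "street": "Tree Ave", "city": "NYC"},
    {"cc": "01", "ac": "908", "zip": "07974", "street": "Elm St", "city": "MH"},
]

# "Within country code 44, zip determines street."
by_zip = cfd_violations(cust, ["cc", "zip"], ["street"],
                        {"cc": "44", "zip": "_", "street": "_"})
# "Country code 01 with area code 908 implies city MH."
by_city = cfd_violations(cust, ["cc", "ac"], ["city"],
                         {"cc": "01", "ac": "908", "city": "MH"})
```

A plain FD "zip determines street" would also flag the two rows with country code 01, whereas the conditional version restricts the check to the rows selected by the pattern.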
Conditional functional dependencies for capturing data inconsistencies
 TODS
Cited by 68 (13 self)
We propose a class of integrity constraints for relational databases, referred to as conditional functional dependencies (CFDs), and study their applications in data cleaning. In contrast to traditional functional dependencies (FDs) that were developed mainly for schema design, CFDs aim at capturing the consistency of data by enforcing bindings of semantically related values. For static analysis of CFDs we investigate the consistency problem, which is to determine whether or not there exists a nonempty database satisfying a given set of CFDs, and the implication problem, which is to decide whether or not a set of CFDs entails another CFD. We show that while any set of traditional FDs is trivially consistent, the consistency problem is NP-complete for CFDs, but it is in PTIME when either the database schema is predefined or no attributes involved in the CFDs have a finite domain. For the implication analysis of CFDs, we provide an inference system analogous to Armstrong’s axioms for FDs, and show that the implication problem is coNP-complete for CFDs, in contrast to the linear-time complexity for their traditional counterpart. We also present an algorithm for computing a minimal cover of a set of CFDs. Since CFDs allow data bindings, in some cases CFDs may be physically large, complicating detection of constraint violations. We develop techniques for detecting CFD violations in SQL as well as novel techniques for checking multiple ...
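The idea of detecting CFD violations in SQL can be sketched with a pair of queries over a data table and a pattern table. This is a simplified illustration under assumptions, not the queries given in the paper: the tables `cust` and `tp`, their columns, and the convention of storing the wildcard as the string '_' are all hypothetical. It uses Python's standard `sqlite3` module with an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cust (id INTEGER PRIMARY KEY, cc TEXT, zip TEXT, street TEXT);
-- Pattern table for the CFD [cc, zip] -> [street]; '_' is a wildcard.
CREATE TABLE tp (cc TEXT, zip TEXT, street TEXT);
INSERT INTO cust VALUES
  (1, '44', 'EH4',   'Mayfield'),
  (2, '44', 'EH4',   'Crichton'),
  (3, '01', '07974', 'Tree Ave'),
  (4, '01', '07974', 'Elm St'),
  (5, '31', '1011',  'Damrak');
INSERT INTO tp VALUES ('44', '_', '_'), ('31', '1011', 'Dam');
""")

# Query 1: single-tuple violations, i.e. a row matches the LHS pattern
# but disagrees with a constant in the RHS pattern.
single = conn.execute("""
SELECT t.id
FROM cust t, tp
WHERE (tp.cc = '_' OR t.cc = tp.cc)
  AND (tp.zip = '_' OR t.zip = tp.zip)
  AND tp.street <> '_' AND t.street <> tp.street
ORDER BY t.id
""").fetchall()

# Query 2: multi-tuple violations, i.e. rows matching the LHS pattern
# that agree on (cc, zip) but carry more than one street.
multi = conn.execute("""
SELECT t.cc, t.zip
FROM cust t, tp
WHERE (tp.cc = '_' OR t.cc = tp.cc)
  AND (tp.zip = '_' OR t.zip = tp.zip)
  AND tp.street = '_'
GROUP BY t.cc, t.zip
HAVING COUNT(DISTINCT t.street) > 1
ORDER BY t.cc, t.zip
""").fetchall()
```

Because the patterns live in a table rather than in the query text, the two queries stay the same size however many pattern rows the CFD has.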
Increasing the expressivity of conditional functional dependencies without extra complexity
 In Proceedings of the International Conference on Data Engineering
Cited by 23 (6 self)
The paper proposes an extension of CFDs [1], referred to as extended Conditional Functional Dependencies (eCFDs). In contrast to CFDs, eCFDs specify patterns of semantically related values in terms of disjunction and inequality, and are capable of catching inconsistencies that arise in practice but cannot be detected by CFDs. The increase in expressive power does not incur extra complexity: we show that the satisfiability and implication analyses of eCFDs remain NP-complete and coNP-complete, respectively, the same as their CFD counterparts. In light of the intractability, we present an algorithm that approximates the maximum number of eCFDs that are satisfiable. In addition, we revise SQL techniques for detecting CFD violations, and show that violations of multiple eCFDs can be captured via a single pair of SQL queries. We also introduce an incremental SQL technique for detecting eCFD violations in response to database updates. We experimentally verify the effectiveness and efficiency of our SQL-based detection methods.
Propagating functional dependencies with conditions
 In: VLDB (2008)
Cited by 12 (3 self)
The dependency propagation problem is to determine, given a view defined on data sources and a set of dependencies on the sources, whether another dependency is guaranteed to hold on the view. This paper investigates dependency propagation for recently proposed conditional functional dependencies (CFDs). The need for this study is evident in data integration, exchange and cleaning since dependencies on data sources often only hold conditionally on the view. We investigate dependency propagation for views defined in various fragments of relational algebra, CFDs as view dependencies, and for source dependencies given as either CFDs or traditional functional dependencies (FDs). (a) We establish lower and upper bounds, all matching, ranging from PTIME to undecidable. These not only provide the first results for CFD propagation, but also extend the classical work of FD propagation by giving new complexity bounds in the presence of finite domains. (b) We provide the first algorithm for computing a minimal cover of all CFDs propagated via SPC views; the algorithm has the same complexity as one of the most efficient algorithms for computing a cover of FDs propagated via a projection view, despite the increased expressive power of CFDs and SPC views. (c) We experimentally verify that the algorithm is efficient.
Discovering Functional Dependencies and Association Rules by Navigating in a Lattice of OLAP Views
Cited by 6 (3 self)
Discovering dependencies in data is a well-known problem in database theory. The most common rules are Functional Dependencies (FDs), Conditional Functional Dependencies (CFDs) and Association Rules (ARs). Many tools can display those rules as lists, but those lists are often too long for inspection by users. We propose a new way to display and navigate through those rules. Display is based on On-Line Analytical Processing (OLAP), presenting a set of rules as a cube, where dimensions correspond to the premises of rules. Cubes reflect the hierarchy that exists between FDs, CFDs and ARs. Navigation is based on a lattice, where nodes are OLAP views and edges are OLAP navigation links, and it guides users from cube to cube. We present an illustrative example with the help of our prototype.
Analyses and Validation of Conditional Dependencies with Built-in Predicates
Cited by 4 (1 self)
This paper proposes a natural extension of conditional functional dependencies (CFDs [14]) and conditional inclusion dependencies (CINDs [8]), denoted by CFDᵖs and CINDᵖs, respectively, by specifying patterns of data values with =, <, ≤, > and ≥ predicates. As data quality rules, CFDᵖs and CINDᵖs are able to capture errors that commonly arise in practice but cannot be detected by CFDs and CINDs. We establish two sets of results for central technical problems associated with CFDᵖs and CINDᵖs. (a) One concerns the satisfiability and implication problems for CFDᵖs and CINDᵖs, taken separately or together. These are important for, e.g., deciding whether data quality rules are dirty themselves, and for removing redundant rules. We show that despite the increased expressive power, the static analyses of CFDᵖs and CINDᵖs retain the same complexity as their CFD and CIND counterparts. (b) The other concerns validation of CFDᵖs and CINDᵖs. We show that given a set Σ of CFDᵖs and CINDᵖs on a database D, a set of SQL queries can be automatically generated that, when evaluated against D, return all tuples in D that violate some dependencies in Σ. This provides commercial DBMS with an immediate capability to detect errors based on CFDᵖs and CINDᵖs. Key words: functional dependency, inclusion dependency, data quality
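Pattern cells that carry built-in predicates can be illustrated with a small sketch. It covers only the single-tuple check of one constant pattern and is not the SQL generation described in the abstract; the cell encoding as an (operator, constant) pair, the attribute names, and the sample rule are assumptions made for this example.

```python
import operator

# Supported built-in predicates, keyed by their textual form.
OPS = {
    "=": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def cell_matches(value, cell):
    """A cell is either the wildcard '_' or a pair (operator, constant)."""
    if cell == "_":
        return True
    op, const = cell
    return OPS[op](value, const)


def violates(row, lhs_pattern, rhs_pattern):
    """True if the row is selected by the LHS pattern but fails the RHS pattern."""
    in_scope = all(cell_matches(row[a], c) for a, c in lhs_pattern.items())
    rhs_ok = all(cell_matches(row[b], c) for b, c in rhs_pattern.items())
    return in_scope and not rhs_ok


# Hypothetical rule: orders priced at 100 or more must have shipping cost 0.
lhs = {"price": (">=", 100)}
rhs = {"shipping": ("=", 0)}

flagged = violates({"price": 120, "shipping": 5}, lhs, rhs)
clean = violates({"price": 120, "shipping": 0}, lhs, rhs)
out_of_scope = violates({"price": 80, "shipping": 5}, lhs, rhs)
```

A pattern restricted to equality, as in plain CFDs, could not express the price threshold in this rule with a single pattern row.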
Mining Constant Conditional Functional Dependencies for Improving Data Quality
This paper applies data mining techniques to data cleaning, where they are effective in discovering Constant Conditional Functional Dependencies (CCFDs) from relational databases. These CCFDs are used as business rules for context-dependent data validation. Conditional Functional Dependencies (CFDs) are an extension of Functional Dependencies (FDs) that captures the consistency of data by supporting patterns of semantically related constants. The approach builds on the hierarchy between FDs, CFDs and Association Rules: unions of Association Rules are CFDs, while unions of CFDs are FDs. This paper proposes that the algorithms used for Association Rule discovery be reused for CCFD mining, i.e., for CFDs with constant patterns only. Three algorithms for CCFD mining, namely CCFDFPGrowth, CCFDAprioriClose and CCFDZartMNR, are provided in this paper. CCFDFPGrowth uses the FP-growth algorithm to find frequent itemsets and then generates rules as constant patterns from the set of frequent itemsets using a modified Agrawal association rule generation algorithm. CCFDAprioriClose uses the Apriori algorithm to find frequent closed itemsets and then generates rules as constant patterns from the set of frequent closed itemsets using a modified Agrawal association rule generation algorithm. CCFDZartMNR uses the Zart algorithm to find closed itemsets and minimal generators and then generates minimal non-redundant rules from the set of closed itemsets. Experimental results on two real-world data sets show that this approach performs well across several dimensions such as recall, runtime and scalability.
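The link between constant CFDs and association rules can be sketched by brute force: a constant rule is kept when enough rows match its left-hand side and every such row carries the same right-hand value, which corresponds to an association rule with 100% confidence. This is a naive enumeration for illustration only, not CCFDFPGrowth, CCFDAprioriClose or CCFDZartMNR, and it does not prune redundant rules; the function name, parameters and sample rows are hypothetical.

```python
from itertools import combinations


def mine_constant_cfds(rows, min_support=2, max_lhs=2):
    """Enumerate constant rules (lhs constants -> rhs constant) that hold
    on every matching row and are supported by at least min_support rows."""
    attrs = sorted(rows[0])
    rules = []
    for size in range(1, max_lhs + 1):
        for lhs in combinations(attrs, size):
            for rhs in attrs:
                if rhs in lhs:
                    continue
                # Group the RHS values by the LHS value combination.
                groups = {}
                for r in rows:
                    key = tuple(r[a] for a in lhs)
                    groups.setdefault(key, []).append(r[rhs])
                for key, vals in groups.items():
                    # Enough support and a single RHS value: the rule holds.
                    if len(vals) >= min_support and len(set(vals)) == 1:
                        rules.append((dict(zip(lhs, key)), (rhs, vals[0])))
    return rules


# Hypothetical rows: country code and city.
rows = [
    {"cc": "44", "city": "EDI"},
    {"cc": "44", "city": "EDI"},
    {"cc": "01", "city": "NYC"},
    {"cc": "01", "city": "MH"},
]
rules = mine_constant_cfds(rows, min_support=2, max_lhs=1)
```

The frequent-itemset algorithms named in the abstract avoid this exhaustive enumeration, which grows quickly with the number of attributes.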
Incorporating Cardinality Constraints and Synonym Rules into Conditional Functional Dependencies
We propose an extension of conditional functional dependencies (CFDs), denoted by CFDᶜs, to express cardinality constraints, domain-specific conventions, and patterns of semantically related constants in a uniform constraint formalism. We show that despite the increased expressive power, the satisfiability and implication problems for CFDᶜs remain NP-complete and coNP-complete, respectively, the same as their counterparts for CFDs. We also identify tractable special cases. Key words: computational complexity, databases, specification languages