Results 1–10 of 66
Discovering Conditional Functional Dependencies
In IEEE International Conference on Data Engineering (ICDE), 2009
Cited by 47 (6 self)
Abstract
This paper investigates the discovery of conditional functional dependencies (CFDs). CFDs are a recent extension of functional dependencies (FDs) that supports patterns of semantically related constants, and can be used as rules for cleaning relational data. However, finding CFDs is an expensive process that involves intensive manual effort. To effectively identify data cleaning rules, we develop techniques for discovering CFDs from sample relations. We provide three methods for CFD discovery. The first, referred to as CFDMiner, is based on techniques for mining closed itemsets, and is used to discover constant CFDs, namely, CFDs with constant patterns only. The other two algorithms are developed for discovering general CFDs. The first algorithm, referred to as CTANE, is a levelwise algorithm that extends TANE, a well-known algorithm for mining FDs. The other, referred to as FastCFD, is based on the depth-first approach used in FastFD, a method for discovering FDs. It leverages closed-itemset mining to reduce the search space. Our experimental results demonstrate the following. (a) CFDMiner can be multiple orders of magnitude faster than CTANE and FastCFD for constant CFD discovery. (b) CTANE works well when a given sample relation is large, but it does not scale well with the arity of the relation. (c) FastCFD is far more efficient than CTANE when the arity of the relation is large.
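As a minimal sketch of the "constant CFD" notion above: a constant CFD pairs an FD with a pattern of constants, so validating one amounts to checking that every tuple matching the left-hand-side constants also matches the right-hand-side constants. The rule ([country = 'UK'] → [capital = 'London']) and the attribute names are hypothetical examples, not from the paper.

```python
def violates_constant_cfd(tuples, lhs_pattern, rhs_pattern):
    """Return the tuples matching the LHS constants but not the RHS constants."""
    return [
        t for t in tuples
        if all(t.get(a) == v for a, v in lhs_pattern.items())
        and not all(t.get(a) == v for a, v in rhs_pattern.items())
    ]

rows = [
    {"country": "UK", "capital": "London"},
    {"country": "UK", "capital": "Edinburgh"},  # violates the pattern
    {"country": "NL", "capital": "Amsterdam"},  # LHS pattern does not apply
]
bad = violates_constant_cfd(rows, {"country": "UK"}, {"capital": "London"})
```

Discovery algorithms such as CFDMiner search for patterns like this automatically; the check above is only the validation step, run once a candidate pattern is in hand.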
Interaction between record matching and data repairing
In SIGMOD, 2011
Cited by 23 (10 self)
Abstract
Central to a data cleaning system are record matching and data repairing. Matching aims to identify tuples that refer to the same real-world object, while repairing makes a database consistent by using constraints to fix errors in the data. These are treated as separate processes in current data cleaning systems, based on heuristic solutions. This paper studies a new problem, namely, the interaction between record matching and data repairing. We show that repairing can effectively help us identify matches, and vice versa. To capture the interaction, we propose a uniform framework that seamlessly unifies repairing and matching operations, to clean a database based on integrity constraints, matching rules and master data. We give a full treatment of fundamental problems associated with data cleaning via matching and repairing, including the static analyses of constraints and rules taken together, and the complexity, termination and determinism analyses of data cleaning. We show that these problems are hard, ranging from NP-complete or coNP-complete to PSPACE-complete. Nevertheless, we propose efficient algorithms to clean data via both matching and repairing. The algorithms find deterministic fixes and reliable fixes based on confidence and entropy analysis, respectively, which are more accurate than possible fixes generated by heuristics. We experimentally verify that our techniques significantly improve the accuracy of record matching and data repairing taken as separate processes, using real-life data.
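The interaction the abstract describes can be illustrated with a toy sketch under assumed rules (the rules, attribute names and master data below are illustrative, not the paper's): a repair that copies a missing zip code from master data turns a non-match into a match under a matching rule requiring equal email and zip.

```python
# Hypothetical master data, keyed by record id.
master = {"e1": {"zip": "10001"}}

def repair(rec):
    """Deterministic fix: fill a missing zip from master data when the id matches."""
    m = master.get(rec["id"])
    if m and not rec.get("zip"):
        rec = {**rec, "zip": m["zip"]}
    return rec

def matches(r1, r2):
    """Assumed matching rule: same email and same zip => same real-world entity."""
    return r1["email"] == r2["email"] and r1.get("zip") == r2.get("zip")

a = {"id": "e1", "email": "x@y.z", "zip": None}
b = {"id": "e9", "email": "x@y.z", "zip": "10001"}
before = matches(a, b)          # False: the missing zip blocks the match
after = matches(repair(a), b)   # True: repairing a enables the match
```

The reverse direction also holds in the paper's framework: a confirmed match can supply values that make a repair deterministic rather than heuristic.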
Holistic data cleaning: Put violations into context
In ICDE, 2013
Cited by 19 (9 self)
Abstract
Data cleaning is an important problem, and data quality rules are the most promising way to address it with a declarative approach. Previous work has focused on specific formalisms, such as functional dependencies (FDs), conditional functional dependencies (CFDs), and matching dependencies (MDs), and these have always been studied in isolation. Moreover, such techniques are usually applied in a pipeline or interleaved. In this work we tackle the problem in a novel, unified framework. First, we let users specify quality rules using denial constraints with ad-hoc predicates. This language subsumes existing formalisms and can express rules involving numerical values, with predicates such as "greater than" and "less than". More importantly, we exploit the interaction of the heterogeneous constraints by encoding them in a conflict hypergraph. Such a holistic view of the conflicts is the starting point for a novel definition of repair context, which allows us to automatically compute repairs of better quality than previous approaches in the literature. Experimental results on real datasets show that the holistic approach outperforms previous algorithms in terms of quality and efficiency of the repair.
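To make the conflict-hypergraph idea concrete, here is a small sketch: violations of a denial constraint with numeric predicates are collected as hyperedges over tuple ids. The constraint ("no two employees where one has lower rank but higher salary") and the attribute names are hypothetical, chosen in the spirit of the "greater than"/"less than" predicates above.

```python
from itertools import combinations

def conflict_hyperedges(rows):
    """Collect each violating tuple pair as a hyperedge (a set of tuple ids)."""
    edges = []
    for (i, t1), (j, t2) in combinations(enumerate(rows), 2):
        for a, b in ((t1, t2), (t2, t1)):  # the constraint is order-sensitive
            if a["rank"] < b["rank"] and a["salary"] > b["salary"]:
                edges.append({i, j})
                break
    return edges

rows = [
    {"rank": 1, "salary": 90},
    {"rank": 2, "salary": 80},  # conflicts with row 0: lower rank, higher salary
    {"rank": 3, "salary": 95},
]
```

In the full framework, hyperedges from all constraint types share one graph, so a repair can resolve several heterogeneous violations at once instead of fixing them rule by rule.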
NADEEF: a commodity data cleaning system
In SIGMOD, 2013
Cited by 15 (6 self)
Abstract
Despite the increasing importance of data quality and the rich theoretical and practical contributions in all aspects of data cleaning, there is no single end-to-end off-the-shelf solution to (semi-)automate the detection and the repairing of violations w.r.t. a set of heterogeneous and ad-hoc quality constraints. In short, there is no commodity platform, similar to general-purpose DBMSs, that can be easily customized and deployed to solve application-specific data quality problems. In this paper, we present NADEEF, an extensible, generalized and easy-to-deploy data cleaning platform. NADEEF distinguishes between a programming interface and a core to achieve generality and extensibility. The programming interface allows users to specify multiple types of data quality rules, which uniformly define what is wrong with the data and (possibly) how to repair it, by writing code that implements predefined classes. We show that the programming interface can be used to express many types of data quality rules beyond the well-known CFDs (FDs), MDs and ETL rules. Treating user-implemented interfaces as black boxes, the core provides algorithms to detect errors and to clean data. The core is designed to allow cleaning algorithms to cope with multiple rules holistically, i.e., detecting and repairing data errors without differentiating between various types of rules. We showcase two implementations of core repairing algorithms. These two implementations demonstrate the extensibility of our core, which can also be replaced by other user-provided algorithms. Using real-life data, we experimentally verify the generality, extensibility, and effectiveness of our system.
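A rule-as-class interface of the kind described above might be sketched as follows. The class and method names are illustrative only, not NADEEF's actual API: each rule implements a detection method returning violations and a repair method proposing cell-level fixes, so the core can treat all rule types uniformly.

```python
class Rule:
    """Hypothetical generic rule interface: detect violations, propose repairs."""
    def detect(self, rows):
        raise NotImplementedError
    def repair(self, violation, rows):
        raise NotImplementedError

class FDRule(Rule):
    """An FD X -> Y expressed through the generic interface."""
    def __init__(self, x, y):
        self.x, self.y = x, y

    def detect(self, rows):
        seen, out = {}, []
        for i, t in enumerate(rows):
            key = t[self.x]
            if key in seen and rows[seen[key]][self.y] != t[self.y]:
                out.append((seen[key], i))   # pair of conflicting tuple ids
            else:
                seen.setdefault(key, i)
        return out

    def repair(self, violation, rows):
        i, j = violation
        # Propose overwriting the later tuple with the first-seen value.
        return (j, self.y, rows[i][self.y])

rows = [{"zip": "10001", "city": "NYC"}, {"zip": "10001", "city": "NY"}]
rule = FDRule("zip", "city")
v = rule.detect(rows)
fix = rule.repair(v[0], rows)
```

The point of such an interface is that the core never inspects rule semantics: an MD or an ETL rule would plug in through the same two methods.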
A Revival of Integrity Constraints for Data Cleaning
In Proc. VLDB Endow., 2008
Propagating functional dependencies with conditions
In VLDB, 2008
Cited by 12 (3 self)
Abstract
The dependency propagation problem is to determine, given a view defined on data sources and a set of dependencies on the sources, whether another dependency is guaranteed to hold on the view. This paper investigates dependency propagation for recently proposed conditional functional dependencies (CFDs). The need for this study is evident in data integration, exchange and cleaning, since dependencies on data sources often hold only conditionally on the view. We investigate dependency propagation for views defined in various fragments of relational algebra, with CFDs as view dependencies, and with source dependencies given as either CFDs or traditional functional dependencies (FDs). (a) We establish lower and upper bounds, all matching, ranging from PTIME to undecidable. These not only provide the first results for CFD propagation, but also extend the classical work on FD propagation by giving new complexity bounds in the presence of finite domains. (b) We provide the first algorithm for computing a minimal cover of all CFDs propagated via SPC views; the algorithm has the same complexity as one of the most efficient algorithms for computing a cover of FDs propagated via a projection view, despite the increased expressive power of CFDs and SPC views. (c) We experimentally verify that the algorithm is efficient.
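The classical building block behind propagation reasoning is attribute-set closure under a set of FDs, which decides whether an FD X → Y is implied (and hence still holds on a projection view keeping X and Y). This is a plain-FD sketch of that standard algorithm; the paper's CFD setting additionally tracks pattern tableaux, which is not shown here.

```python
def closure(attrs, fds):
    """fds: list of (lhs_set, rhs_set) pairs. Returns the closure of attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Fire every FD whose left side is already covered.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

fds = [({"A"}, {"B"}), ({"B"}, {"C"})]
# A -> C is implied, so it propagates to any projection view keeping A and C:
holds = {"C"} <= closure({"A"}, fds)
```

The closure loop runs to a fixpoint, so transitive chains of any length are picked up; this is what makes the projection case tractable for plain FDs.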
On Codd Families of Keys over Incomplete Relations
2010
Cited by 11 (8 self)
Abstract
Keys allow a database management system to uniquely identify tuples in a database. Consequently, the class of keys is of great significance for almost all data processing tasks. In the relational model of data, keys have received considerable interest and are well understood. However, for efficient means of data processing most commercial relational database systems deviate from the relational model. For example, tuples may contain only partial information in the sense that they contain so-called null values to represent incomplete information. Codd’s principle of entity integrity says that every tuple of every relation must not contain a null value on any attribute of the primary key. Therefore, a key over partial relations enforces both uniqueness and totality of tuples on the attributes of the key. On the basis of these two requirements, we study the resulting class of keys over relations that permit occurrences of Zaniolo’s null value ‘no-information’. We show that the interaction of this class of keys is different from the interaction of the class of keys over total relations. We establish a finite ground axiomatization, and an algorithm for deciding the associated implication problem in linear time. Further, we characterize Armstrong relations for an arbitrarily given set of keys; that is, we give a sufficient and necessary condition for a partial relation to satisfy a key precisely when it is implied by a given set of keys. We also establish an algorithm that computes an Armstrong relation
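The two requirements above (uniqueness and totality on the key attributes) are easy to check directly; here is a small sketch, with `None` standing in for Zaniolo's 'no-information' null. The relation encoding as a list of dicts is an illustrative choice, not the paper's formalism.

```python
def satisfies_key(rows, key):
    """A key over a partial relation must be total and unique on its attributes."""
    seen = set()
    for t in rows:
        vals = tuple(t.get(a) for a in key)
        if any(v is None for v in vals):
            return False  # totality fails: a null on a key attribute
        if vals in seen:
            return False  # uniqueness fails: duplicate key values
        seen.add(vals)
    return True

rows_bad = [{"id": 1, "name": "a"}, {"id": None, "name": "b"}]  # null on the key
rows_good = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
```

Note that under this semantics two tuples that are both null on a key attribute already fail totality, so the subtler question of whether nulls "match" each other never arises for keys.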
Estimating the Confidence of Conditional Functional Dependencies
Cited by 9 (3 self)
Abstract
Conditional functional dependencies (CFDs) have recently been proposed as extensions of classical functional dependencies that apply to a certain subset of the relation, as specified by a pattern tableau. Calculating the support and confidence of a CFD (i.e., the size of the applicable subset and the extent to which it satisfies the CFD) gives valuable information about data semantics and data quality. While computing the support is easier, computing the confidence exactly is expensive if the relation is large, and estimating it from a random sample of the relation is unreliable unless the sample is large. We study how to efficiently estimate the confidence of a CFD with a small number of passes (one or two) over the input using small space. Our solutions are based on a variety of sampling and sketching techniques, and apply both when the pattern tableau is known in advance and in the harder case where it is given only after the data have been seen. We analyze our algorithms, and show that they can guarantee a small additive error; we also show that relative error guarantees are not possible. We demonstrate the power of these methods empirically, with a detailed study using both real and synthetic data. These experiments show that it is possible to estimate the CFD confidence very accurately with summaries which are much smaller than the size of the data they represent.
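One common way to define the confidence measure above, for a plain FD X → Y, is the largest fraction of tuples that can be kept so the dependency holds exactly, i.e., per X-group keep the most common Y value. This sketch computes it exactly on small data (the CFD case would first filter the applicable subset via the pattern tableau; the paper's contribution is estimating this quantity in small space, which is not attempted here).

```python
from collections import Counter, defaultdict

def fd_confidence(rows, x, y):
    """Exact confidence of X -> Y: keep the majority Y value per X-group."""
    groups = defaultdict(Counter)
    for t in rows:
        groups[t[x]][t[y]] += 1
    kept = sum(c.most_common(1)[0][1] for c in groups.values())
    return kept / len(rows)

rows = [
    {"zip": "1", "city": "A"},
    {"zip": "1", "city": "A"},
    {"zip": "1", "city": "B"},  # minority value in its group
    {"zip": "2", "city": "C"},
]
# Keep 2 of 3 tuples in group "1" and 1 of 1 in group "2": confidence 3/4.
```

Computing this exactly needs per-group counts over the whole relation, which is exactly why the paper resorts to sampling and sketching when the data are large.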
The LLUNATIC Data-Cleaning Framework
Cited by 8 (3 self)
Abstract
Data cleaning (or data repairing) is considered a crucial problem in many database-related tasks. It consists of making a database consistent with respect to a set of given constraints. In recent years, repairing methods have been proposed for several classes of constraints. However, these methods rely on ad hoc decisions and tend to hard-code the strategy used to repair conflicting values. As a consequence, there is currently no general algorithm to solve database repairing problems that involve different kinds of constraints and different strategies for selecting preferred values. In this paper we develop a uniform framework to solve this problem. We propose a new semantics for repairs, and a chase-based algorithm to compute minimal solutions. We implemented the framework in a DBMS-based prototype, and we report experimental results that confirm its good scalability and superior quality in computing repairs.
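As a minimal stand-in for the chase-style repair idea (this is not LLUNATIC's algorithm, just an illustration of the fixpoint pattern): when an FD X → Y is violated, a chase step overwrites the less-preferred Y value, and steps repeat until no violation remains. The preference used here, "keep the more frequent value", is an assumed stand-in for the framework's configurable strategies.

```python
from collections import Counter

def chase_fd(rows, x, y):
    """Repeat chase steps for X -> Y until a fixpoint (no violations) is reached."""
    rows = [dict(t) for t in rows]  # work on a copy
    changed = True
    while changed:
        changed = False
        groups = {}
        for t in rows:
            groups.setdefault(t[x], []).append(t)
        for group in groups.values():
            vals = Counter(t[y] for t in group)
            if len(vals) > 1:  # violation: same X, different Y
                best = vals.most_common(1)[0][0]  # assumed preference
                for t in group:
                    if t[y] != best:
                        t[y] = best
                        changed = True
    return rows

rows = [
    {"zip": "1", "city": "A"},
    {"zip": "1", "city": "A"},
    {"zip": "1", "city": "B"},
    {"zip": "2", "city": "C"},
]
repaired = chase_fd(rows, "zip", "city")
```

The framework's actual semantics is richer (it can mark cells with special values rather than committing to one side), but the fixpoint structure is the same.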
Differential Dependencies: Reasoning and Discovery
2011
Cited by 8 (5 self)
Abstract
The importance of difference semantics (e.g., “similar” or “dissimilar”) has recently been recognized for declaring dependencies among various types of data, such as numerical values or text values. We propose a novel form of Differential Dependencies (DDs), which specify constraints on differences, called differential functions, instead of the identification functions used in traditional dependency notations like functional dependencies. Informally, a differential dependency states that if two tuples have distances on attributes X agreeing with a certain differential function, then their distances on attributes Y should also agree with the corresponding differential function on Y. For example, [date(≤ 7)] → [price(< 100)] states that the price difference between any two days within a week of each other should be less than 100 dollars. Such differential dependencies are useful in various applications, for example, violation detection, data partition, query optimization, record linkage, etc. In this article, we first address several theoretical issues of differential dependencies, including formal definitions of DDs and differential keys, the subsumption order relation of differential functions, implication of DDs, the closure of a differential function, a sound and complete inference system, and minimal covers for DDs. Then, we investigate a practical problem, that is, how to discover DDs and differential keys from a given dataset. Due to the intrinsic hardness, we develop several pruning methods to improve the discovery efficiency in practice. Finally, through an extensive experimental evaluation on real datasets, we demonstrate the discovery performance and the effectiveness of DDs in several real applications.
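The abstract's own example, [date(≤ 7)] → [price(< 100)], can be checked directly: any two tuples whose date distance is within 7 must have price distance under 100. This sketch hard-codes that one DD with integer day offsets for brevity; a general checker would take the differential functions as parameters.

```python
def dd_violations(rows):
    """Return pairs of tuple ids violating [date(<= 7)] -> [price(< 100)]."""
    out = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if abs(rows[i]["date"] - rows[j]["date"]) <= 7 \
                    and not abs(rows[i]["price"] - rows[j]["price"]) < 100:
                out.append((i, j))
    return out

rows = [
    {"date": 1, "price": 200},
    {"date": 5, "price": 350},   # within a week of row 0, but price gap is 150
    {"date": 20, "price": 90},   # more than a week away: LHS does not apply
]
```

Note how the LHS differential function acts as a filter, much like a CFD's pattern tableau: pairs more than a week apart impose no constraint at all.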