An important step in data warehousing and Enterprise Data Integration is cleaning data of discrepancies in structure and content. Current commercial solutions for data cleaning involve many iterations of time-consuming “auditing” to find errors, and long-running transformations to fix them. Users need to endure long waits and often write complex transformation programs. In this paper, we present an interactive data cleaning system that tightly integrates transformation and discrepancy detection. Users gradually build transformations by adding or undoing transforms, in an intuitive, graphical manner through a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. In the background, the system automatically infers the structure of the data in terms of user-defined domains and applies suitable algorithms to check the data for discrepancies, flagging them as they are found. This allows users to gradually construct a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays. We choose and adapt a small set of transforms from the literature and describe methods for their graphical specification and interactive application. We combine the Minimum Description Length principle with the traditional database notion of user-defined types to automatically extract suitable structures for data values, in an extensible fashion. Such structure extraction is also applied in the graphical specification of transforms, to infer transforms from examples. We also describe methods for optimizing the final sequence of transforms to minimize memory allocations and copies.
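The interaction described above, where users add or undo transforms and see the effect at once on the records visible on screen, can be illustrated with a minimal sketch. This is not the system's implementation: the names `TransformSession`, `split_column`, and `preview` are hypothetical, and the sketch assumes that a transform is a pure function from one row to one row and that only the on-screen rows are recomputed.

```python
from typing import Callable, List

Row = List[str]
Transform = Callable[[Row], Row]


class TransformSession:
    """Hypothetical holder for an ordered, undoable sequence of transforms."""

    def __init__(self) -> None:
        self._transforms: List[Transform] = []

    def add(self, transform: Transform) -> None:
        # Append a transform to the end of the sequence.
        self._transforms.append(transform)

    def undo(self) -> None:
        # Drop the most recently added transform, if there is one.
        if self._transforms:
            self._transforms.pop()

    def preview(self, visible_rows: List[Row]) -> List[Row]:
        # Apply the whole sequence, but only to the rows currently on screen.
        result = []
        for row in visible_rows:
            current = list(row)
            for transform in self._transforms:
                current = transform(current)
            result.append(current)
        return result


def split_column(index: int, sep: str) -> Transform:
    """Hypothetical transform: split one column into two at the first `sep`."""

    def apply(row: Row) -> Row:
        left, _, right = row[index].partition(sep)
        return row[:index] + [left.strip(), right.strip()] + row[index + 1:]

    return apply


session = TransformSession()
on_screen = [["Smith, John", "1990"], ["Doe, Jane", "1985"]]

session.add(split_column(0, ","))
after_split = session.preview(on_screen)

session.undo()
after_undo = session.preview(on_screen)
```

Because the source rows are never modified and the sequence is replayed on demand, undoing a transform amounts to removing it from the list and previewing again.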