An Interactive Framework for Data Transformation and Cleaning

unknown authors
http://www.cs.berkeley.edu/~rshankar/pwheel/pwheel5.pdf

Abstract:

An important step in data warehousing and Enterprise Data Integration is cleaning data of discrepancies in structure and content. Current commercial solutions for data cleaning involve many iterations of time-consuming “auditing” to find errors, and long-running transformations to fix them. Users must endure long waits and often have to write complex transformation programs. In this paper, we present an interactive data cleaning system that tightly integrates transformation and discrepancy detection. Users gradually build transformations by adding or undoing transforms in an intuitive, graphical manner through a spreadsheet-like interface; the effect of a transform is shown at once on the records visible on screen. In the background, the system automatically infers the structure of the data in terms of user-defined domains and applies suitable algorithms to check the data for discrepancies, flagging them as they are found. This allows users to construct a transformation gradually as discrepancies are found, and to clean the data without writing complex programs or enduring long delays. We choose and adapt a small set of transforms from the literature and describe methods for their graphical specification and interactive application. We combine the Minimum Description Length principle with the traditional database notion of user-defined types to automatically extract suitable structures for data values, in an extensible fashion. Such structure extraction is also applied in the graphical specification of transforms, to infer transforms from examples. We also describe methods for optimizing the final sequence of transforms to minimize memory allocations and copies.
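
To make the idea of choosing a structure by Minimum Description Length concrete, the following is a minimal Python sketch and not the system described in the paper. The domain names (`Integer`, `Word`, `Any`), their regular expressions, and the cost model (a flag bit per value, a per-character cost of log2 of the alphabet size, and 8 bits per character for values stored as exceptions) are all assumptions chosen for illustration. The sketch picks the domain that encodes a column in the fewest bits and then flags the values that do not fit it.

```python
import math
import re

# Hypothetical user-defined domains: name -> (regex, alphabet size).
# These are illustrative assumptions, not domains taken from the paper.
DOMAINS = {
    "Integer": (re.compile(r"[0-9]+"), 10),
    "Word": (re.compile(r"[A-Za-z]+"), 52),
    "Any": (re.compile(r".*", re.DOTALL), 256),
}

RAW_BITS_PER_CHAR = math.log2(256)  # cost of storing a character verbatim


def description_length(domain, values):
    """Bits needed to encode `values` with `domain` under a simplified cost model."""
    pattern, alphabet = DOMAINS[domain]
    bits = math.log2(len(DOMAINS))  # cost of naming the chosen domain
    for v in values:
        bits += 1  # one flag bit: does this value match the domain?
        if pattern.fullmatch(v):
            bits += len(v) * math.log2(alphabet)
        else:
            bits += len(v) * RAW_BITS_PER_CHAR  # exception, stored verbatim
    return bits


def infer_domain(values):
    """Return the domain name with the smallest total description length."""
    return min(DOMAINS, key=lambda d: description_length(d, values))


def discrepancies(values):
    """Return the values that do not match the inferred domain."""
    pattern, _ = DOMAINS[infer_domain(values)]
    return [v for v in values if not pattern.fullmatch(v)]


column = ["1999", "2000", "2001", "19x9", "1987"]
print(infer_domain(column))
print(discrepancies(column))
```

A restrictive domain such as `Integer` encodes matching values cheaply but pays full price for every exception, while a permissive domain such as `Any` never has exceptions but is expensive for every value. Minimizing the total length balances the two, which is why a single malformed value in an otherwise numeric column is reported as a discrepancy rather than changing the inferred structure.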
