Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem (1998)  (43 citations)
Mauricio Hernandez, Salvatore Stolfo
Data Mining and Knowledge Discovery



View or download:
columbia.edu/~sal/hpapers/mp.ps
utexas.edu/course/ee380l/199...mp.ps.gz

From:  columbia.edu/~sal...recentpapers

Abstract: The problem of merging multiple databases of information about common entities is frequently encountered in KDD and decision support applications in large commercial and government organizations. The problem we study is often called the Merge/Purge problem and is difficult to solve both in scale and accuracy. Large repositories of data typically have numerous duplicate information entries about the same entities that are difficult to cull together without an intelligent "equational theory" that ...

Cited by:
Schema Matching using Duplicates - Alexander Bilke
Semantic Overlay Clusters within Super-Peer Networks - Löser, Naumann, Siberski.. (2003)
Enhancing Data Analysis with Noise Removal - Hui Xiong

Similar documents (at the sentence level):
49.4%:   A Generalization of Band Joins and The Merge/Purge Problem - Hernandez (1996)
21.6%:   The Merge/Purge Problem for Large Databases - Hernandez, Stolfo (1995)

Active bibliography (related documents):
2.1:   Real-world Data is Dirty: Data Cleansing and The Merge/Purge.. - Hernandez, Stolfo (1998)
0.2:   Unknown - Information Systems Vol
0.2:   Learning Object Identification Rules for Information.. - Tejada, Knoblock, Minton (2001)

Similar documents based on text:
0.4:   Data Cleansing: Beyond Integrity Analysis - Maletic, Marcus (2000)
0.3:   Automated Identification of Errors in Data Sets - Maletic, Marcus (2000)
0.3:   Utilizing Association Rules for the Identification of Errors.. - Marcus, Maletic (2000)

Related documents from co-citation:
11:   An extensible framework for data cleaning - Galhardas, Florescu et al. - 2000
11:   The Merge/Purge Problem for Large Databases - Hernandez, Stolfo - 1995
10:   A theory for record linkage - Fellegi, Sunter - 1969

BibTeX entry:

M. A. Hernandez and S. J. Stolfo. Real-world data is dirty: Data Cleansing and the Merge/Purge problem. Data Mining and Knowledge Discovery, 2(1):9-37, 1998. http://citeseer.ist.psu.edu/hernandez98realworld.html

@article{ hernandez98realworld,
    author = "Mauricio A. Hernandez and Salvatore J. Stolfo",
    title = "Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem",
    journal = "Data Mining and Knowledge Discovery",
    volume = "2",
    number = "1",
    pages = "9-37",
    year = "1998",
    url = "citeseer.ist.psu.edu/hernandez98realworld.html" }

Citations (may not include all citations):
317   A Comparative Analysis of Methodologies for Database Schema .. - Batini, Lenzerini et al. - 1986
104   Techniques for Automatically Correcting Words in Text - Kukich - 1992
66   A Theory for Record Linkage - Fellegi, Sunter - 1969
61   AlphaSort: A RISC Machine Sort - Nyberg, Barclay et al. - 1994
48   A Comparative Review of Selected Methods for Learning from E.. - Dietterich, Michalski - 1983
46   From Data Mining to Knowledge Discovery in Databases - Fayyad, Piatetsky-Shapiro et al. - 1996
43   Duplicate Record Elimination in Large Data Files - Bitton, DeWitt - 1983
35   An Efficient Domain-independent Algorithm for Detecting Appr.. - Monge, Elkan - 1997
31   Technical Report CMU-CS - Forgy, User's - 1981
23   Automatic spelling correction in scientific and scholarly te.. - Pollock, Zamora - 1987
21   Multiprocessor Transitive Closure Algorithms - Agrawal, Jagadish - 1988
19   A fuzzy representation of data for relational databases - Buckles, Petry - 1982
15   Probability Scoring for Spelling Correction - Church, Gale - 1991
8   The FinCEN Artificial Intelligence System: Identifying Poten.. - Senator, Goldberg et al. - 1995
7   The Merge/Purge Problem for Large Databases - Hernandez, Stolfo - 1995
7   Not the Path to Perdition: The Utility of Similarity-Based L.. - Lebowitz - 1986
4   Physical Database Design in Multiprocessor Database Systems - Ghandeharizadeh - 1990
3   Clustering Techniques: The User's Dilemma - Dubes, Jain - 1976
3   Analyzing Foster Childrens' Foster Home Payments Database - Clark - 1995
2   Fuzzy Database Systems -- Challenges and Opportunities of a .. - George, Petry et al. - 1996
1   A Hierarchical Clustering Strategy for Very Large Fuzzy Data.. - Buckley - 1995



Documents on the same site (http://www.cs.columbia.edu/~sal/recent-papers.html):
Algorithms For Mining System Audit Data - Lee, Stolfo, Mok (1999)

CiteSeer.IST - Copyright Penn State and NEC