Abstract: The problem of merging multiple databases of information about common entities is
frequently encountered in KDD and decision support applications in large commercial
and government organizations. The problem we study is often called the Merge/Purge
problem and is difficult to solve both in scale and accuracy. Large repositories of
data typically have numerous duplicate information entries about the same entities
that are difficult to cull together without an intelligent "equational theory" that ...
Cited by:
Schema Matching using Duplicates - Alexander Bilke Technische
Semantic Overlay Clusters within Super-Peer Networks - Löser, Naumann, Siberski.. (2003)
Enhancing Data Analysis with Noise Removal - Hui Xiong Member
Similar documents (at the sentence level):
49.4%: A Generalization of Band Joins and The Merge/Purge Problem - Hernandez (1996)
21.6%: The Merge/Purge Problem for Large Databases - Hernandez, Stolfo (1995)
Active bibliography (related documents):
2.1: Real-world Data is Dirty: Data Cleansing and The Merge/Purge.. - Hernandez, Stolfo (1998)
0.2: Unknown - Information Systems Vol
0.2: Learning Object Identification Rules for Information.. - Tejada, Knoblock, Minton (2001)
Similar documents based on text:
0.4: Data Cleansing: Beyond Integrity Analysis - Maletic, Marcus (2000)
0.3: Automated Identification of Errors in Data Sets - Maletic, Marcus (2000)
0.3: Utilizing Association Rules for the Identification of Errors.. - Marcus, Maletic (2000)
Related documents from co-citation:
11: An extensible framework for data cleaning - Galhardas, Florescu et al. - 2000
11: The Merge/Purge Problem for Large Databases - Hernandez, Stolfo - 1995
10: A theory for record linkage - Fellegi, Sunter - 1969
BibTeX entry:
M. A. Hernandez and S. J. Stolfo. Real-world data is dirty: Data Cleansing and the Merge/Purge problem. Data Mining and Knowledge Discovery, 2(1):9-37, 1998. http://citeseer.ist.psu.edu/hernandez98realworld.html
@article{ hernandez98realworld,
author = "Mauricio A. Hernandez and Salvatore J. Stolfo",
title = "Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem",
journal = "Data Mining and Knowledge Discovery",
volume = "2",
number = "1",
pages = "9-37",
year = "1998",
url = "citeseer.ist.psu.edu/hernandez98realworld.html" }
Citations (may not include all citations):
317: A Comparative Analysis of Methodologies for Database Schema .. - Batini, Lenzerini et al. - 1986
104: Techniques for Automatically Correcting Words in Text - Kukich - 1992
66: A Theory for Record Linkage - Fellegi, Sunter - 1969
61: AlphaSort: A RISC Machine Sort - Nyberg, Barclay et al. - 1994
48: A Comparative Review of Selected Methods for Learning from E.. - Dietterich, Michalski - 1983
46: From Data Mining to Knowledge Discovery in Databases - Fayyad, Piatetsky-Shapiro et al. - 1996
43: Duplicate Record Elimination in Large Data Files - Bitton, DeWitt - 1983
35: An Efficient Domain-independent Algorithm for Detecting Appr.. - Monge, Elkan - 1997
31: Technical Report CMU-CS - Forgy, User's - 1981
23: Automatic spelling correction in scientific and scholarly te.. - Pollock, Zamora - 1987
21: Multiprocessor Transitive Closure Algorithms - Agrawal, Jagadish - 1988
19: A fuzzy representation of data for relational databases - Buckles, Petry - 1982
15: Probability Scoring for Spelling Correction - Church, Gale - 1991
8: The FinCEN Artificial Intelligence System: Identifying Poten.. - Senator, Goldberg et al. - 1995
7: Purge Problem for Large Databases - Hernandez, Stolfo - 1995
7: Not the Path to Perdition: The Utility of Similarity-Based L.. - Lebowitz - 1986
4: Physical Database Design in Multiprocessor Database Systems - Ghandeharizadeh - 1990
3: Clustering Techniques: The User's Dilemma - Dubes, Jain - 1976
3: Analyzing Foster Childrens' Foster Home Payments Database - Clark - 1995
2: Fuzzy Database Systems -- Challenges and Opportunities of a .. - George, Petry et al. - 1996
1: A Hierarchical Clustering Strategy for Very Large Fuzzy Data.. - Buckley - 1995
Documents on the same site (http://www.cs.columbia.edu/~sal/recent-papers.html):
Algorithms For Mining System Audit Data - Lee, Stolfo, Mok (1999)
CiteSeer.IST - Copyright Penn State and NEC