43 citations found.
M. A. Hernández and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1):9--37, 1998.

This paper is cited in the following contexts:

A Comparison of Fast Blocking Methods for Record Linkage - Baxter, Christen, Churches (2003)

....Cohen and Richman [3] and others [9] have proposed the use of high-dimensional similarity indexing to improve the efficiency of blocking methods. The similarity of blocking to clustering has previously been observed [3, 11]. We compare Standard Blocking [8], the Sorted Neighbourhood method [6], Bigram Indexing [1] and Canopy Clustering with TFIDF [11]. This paper's contribution is to empirically compare the speed-up and accuracy (sensitivity and specificity) performance of these blocking methods. Blocking methods directly affect sensitivity (if record pairs of true matches are not in ....

....the resulting number of record pair comparisons is O(n^2/b) [4]. This is of course the ideal case, hardly ever achievable with real data. Thus, the number of record pair comparisons can be dominated by the largest block. 2.2 Sorted Neighbourhood. The Sorted Neighbourhood (SN) method [6] sorts the records based on a sorting key and then moves a window of fixed size w sequentially over the sorted records. Records within the window are then paired with each other and included in the candidate record pair list. The use of the window limits the number of possible record pair ....
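
The windowing step described in the excerpt above can be sketched roughly as follows. This is an illustrative reading of the description only, not code from the cited or citing papers; the function name `sorted_neighbourhood_pairs`, its parameters, and the use of the whole record string as the sorting key are assumptions made for the example.

```python
from typing import Callable, Sequence, Set, Tuple


def sorted_neighbourhood_pairs(
    records: Sequence[str],
    key: Callable[[str], str],
    window: int,
) -> Set[Tuple[int, int]]:
    """Return candidate pairs (i, j), i < j, of record indices.

    Hypothetical helper: sorts record indices by a sorting key, then pairs
    each record with the next (window - 1) records in sorted order, which is
    equivalent to pairing all records inside each sliding window of size w.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    # Indices of the records in sorted-key order.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs: Set[Tuple[int, int]] = set()
    for pos, i in enumerate(order):
        # Records that share at least one window with record i.
        for j in order[pos + 1 : pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = ["smith john", "smyth john", "jones mary", "smith jon", "jonas mary"]
    print(sorted(sorted_neighbourhood_pairs(sample, key=lambda r: r, window=2)))
```

With a window of w and n records, each record is paired with at most w - 1 successors, so the candidate list grows roughly linearly in n rather than quadratically. Which near-duplicates end up adjacent depends entirely on the chosen key.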

[Article contains additional citation context not shown here]

M. Hernandez and S. Stolfo. Real-world data is dirty: data cleansing and the merge/purge problem. Journal of Data Mining and Knowledge Discovery, 1(2), 1998.


Record Linkage: Current Practice and Future Directions - Gu, Baxter, Vickers..

....system model called the Induction Record Linkage Model. This model shows how training data can be incorporated in the record linkage system, where it is available. Other alternative record linkage systems that have been proposed include AJAX [40], WHIRL [24], Intelliclean [66], Merge/Purge [48], and SchemaSQL [63]. Some of the designs and architectures for the many available academic, government and commercial systems are found in Appendix A. 3.2 Standardisation Methods. Standardisation is also called data cleaning or attribute level reconciliation. Without standardisation, many true ....

....Some commercial systems provide tools to derive phonetic codes for specific populations worldwide [97]. Kelley [57] developed an algorithm of choosing the best blocking scheme in light of the trade-off between computation cost and false non-match (negative) rates. Sorted Neighbourhood Method (SNM) [48]: SNM involves scanning the N sorted records from sources A and B using a fixed size of window, w. Every pair of records falling within the window are compared. SNM requires wN record comparisons. Note that the error rate induced by SNM is critically dependent on the choice of sorting keys. ....

[Article contains additional citation context not shown here]

M.A. Hernandez and S.J. Stolfo. Real-world data is dirty: data cleansing and the merge/purge problem. Journal of Data Mining and Knowledge Discovery, 1(2), 1998.


A Comparison of Fast Blocking Methods for Record Linkage - Baxter, Christen (2003)

....Cohen and Richman [3] and others [9] have proposed the use of high-dimensional similarity indexing to improve the efficiency of blocking methods. The similarity of blocking to clustering has previously been observed [3, 11]. We compare Standard Blocking [8], the Sorted Neighbourhood method [6], Bigram Indexing [1] and Canopy Clustering with TFIDF [11]. This paper's contribution is to empirically compare the speed-up and accuracy (sensitivity and specificity) performance of these blocking methods. Blocking methods directly affect sensitivity (if record pairs of true matches are not in ....

....the resulting number of record pair comparisons is O(n^2/b) [4]. This is of course the ideal case, hardly ever achievable with real data. Thus, the number of record pair comparisons can be dominated by the largest block. 2.2 Sorted Neighbourhood. The Sorted Neighbourhood (SN) method [6] sorts the records based on a sorting key and then moves a window of fixed size w sequentially over the sorted records. Records within the window are then paired with each other and included in the candidate record pair list. The use of the window limits the number of possible record pair ....

[Article contains additional citation context not shown here]

M. Hernandez and S. Stolfo. Real-world data is dirty: data cleansing and the merge/purge problem. Journal of Data Mining and Knowledge Discovery, 1(2), 1998.


Learning Domain-Independent String Transformation Weights.. - Tejada, Knoblock (2002)   (7 citations)

....Passive Atlas. The Passive Atlas system includes the candidate generator for proposing candidate mappings and a single C4.5 decision tree learner for learning the mapping rules, which is similar to previous methods for addressing the merge/purge problem of removing duplicate records in a database [15, 9]. The second system is a baseline experiment that runs the candidate generator only and requires the user to review the ranked list of candidate mappings to choose an optimal mapping threshold. In this experiment only the stemming transformation is used, which is similar to an information ....

....In the database community the problem of object identification is also known as the merge/purge problem, a form of data cleaning. Domain specific techniques for correcting format inconsistencies [4, 3, 12] have been applied by many object identification systems to measure text similarity [8, 9, 10, 21]. The main concern with domain specific transformations is that it is very expensive, in terms of user involvement, to generate a comprehensive set of transformations that are specific not only to the domain, but also to the data that is being integrated. There are also approaches [15, 17] that ....

M. Hernandez and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. In Data Mining and Knowledge Discovery, pages 1--31, New York, NY, 1998.


Interactive Deduplication using Active Learning - Sarawagi, Bhamidipaty (2002)   (16 citations)

....function that is guaranteed to bring together all duplicates. Examples of such grouping functions are, year of publication for citation entries and the first letter of the last name for address lists. Pairs are formed only within records of a group. Similar windowing ideas have been exploited in [11]. Sampling In the active learning phase, when we are learning a deduplication function, we may not need to work on the entire set of records D. Sampling appears like a natural recourse in such cases. However, simple random sampling will not work here because in most cases the number of duplicates ....
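
As a rough illustration of the grouping idea in the excerpt above (pairs are formed only within records that share a group), a minimal sketch might look like this. It is not taken from the citing paper; the helper name `blocked_pairs`, the sample names, and the first-letter grouping function are assumptions chosen to mirror the address-list example.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable, Dict, Hashable, List, Sequence, Set, Tuple


def blocked_pairs(
    records: Sequence[str],
    group_key: Callable[[str], Hashable],
) -> Set[Tuple[int, int]]:
    """Return candidate pairs (i, j), i < j, formed only within a group.

    Hypothetical helper: records are bucketed by a grouping function, and
    every pair of records inside the same bucket becomes a candidate pair.
    """
    groups: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, record in enumerate(records):
        groups[group_key(record)].append(idx)
    pairs: Set[Tuple[int, int]] = set()
    for members in groups.values():
        # Indices are appended in increasing order, so each pair has i < j.
        pairs.update(combinations(members, 2))
    return pairs


if __name__ == "__main__":
    # Made-up last names; grouping by first letter, as in the excerpt.
    last_names = ["Smith", "Smyth", "Jones", "Stolfo", "Jonas"]
    print(sorted(blocked_pairs(last_names, group_key=lambda name: name[0])))
```

A grouping function of this kind only helps if true duplicates always land in the same group, which is the guarantee the excerpt asks for. A typo in the first letter, for instance, would separate a duplicate pair.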

....Recently, there has been renewed interest in the database community on the data cleaning problem [26, 9, 25] comprising several aspects, including, data segmentation, deduplication, outlier detection, standardization and schema mapping. For the specific problem of deduplication, most recent work [11, 22] has concentrated on the performance aspects assuming that the deduplication function is input by the user. The problem of deduplication has long been relevant for library cataloging; see [29] for a survey. [16, 12] concentrate on hand coding deduplication functions for the bibliography ....

M. A. Hernandez and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1):9--37, 1998.


Managing Data Quality in Cooperative Information Systems - Mecella, Scannapieco.. (2002)   (1 citation)

....organizations, in general with different data semantics. For the assessment (i) and the heterogeneity (iv) issues, some of the results already achieved for traditional systems can be borrowed, specifically: the assessment phase can be based on the results achieved in the data cleaning area [16, 20, 22, 12], as well as on the results in the data warehouse area [24, 43]; heterogeneity has been widely addressed in the literature, focusing on schema and data integration issues [1, 7, 27, 42, 11, 25]. Methods and techniques for exchanging quality information (ii) have been only partially ad ....

....the terms of comparison have to be derived from the real world, thus verification of semantic accuracy may be expensive. A systematic way to check semantic accuracy when several data sources are available is to compare the information related to the same instance stored in different databases [22, 31]. Completeness. Completeness is the degree to which values of a schema element are present in the schema element instance (i.e. it is the number of schema elements having a corresponding value in the schema instance). In evaluating completeness, it is important to consider the meaning of null ....

M.A. Hernandez and S.J. Stolfo, Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem, Journal of Data Mining and Knowledge Discovery 1 (1998), no. 2.


Managing Data Quality in Cooperative Information Systems - Mecella, Scannapieco.. (2002)   (1 citation)

....organizations, in general with different data semantics. For the assessment (i) and the heterogeneity (iv) issues, some of the results already achieved for traditional systems can be borrowed, specifically: the assessment phase can be based on the results achieved in the data cleaning area [25, 26, 27, 28], as well as on the results in the data warehouse area [29, 30]; heterogeneity has been widely addressed in the literature, focusing on schema and data integration issues [31, 32, 33, 34, 35, 36]. Methods and techniques for exchanging quality information (ii) have been only partially addressed ....

....the terms of comparison have to be derived from the real world, thus verification of semantic accuracy may be expensive. A systematic way to check semantic accuracy when several data sources are available is to compare the information related to the same instance stored in different databases [27, 43]. .... (i.e. it is the number of schema elements having a corresponding value in the schema instance). In evaluating completeness, it is important to consider the meaning of null values of an attribute, depending on the attribute being mandatory, optional, or inapplicable: a null value for a ....

M.A. Hernandez and S.J. Stolfo, "Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem," Journal of Data Mining and Knowledge Discovery, vol. 1, no. 2, 1998.


Enhancing Data Analysis with Noise Removal - Hui Xiong Member   Self-citation (Data)

No context found.

M.A. Hernandez and S.J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2:9--37, 1998.


Enhancing Data Analysis with Noise Removal - Hui Xiong Student   Self-citation (Data)

No context found.

M.A. Hernandez and S.J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2:9--37, 1998.


An Interactive Framework for Data Cleaning - Vijayshankar Raman Joseph (2000)   (1 citation)  Self-citation (Data)

No context found.

M. Hernandez and S. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1), 1997.


A Parallel Open Source Data Linkage System - Christen, Churches, Hegland (2004)   Self-citation (Data)

No context found.

Hernandez, M.A. and Stolfo, S.J.: Real-world data is dirty: Data cleansing and the merge/purge problem. In Data Mining and Knowledge Discovery 2, Kluwer Academic Publishers, 1998.


Alias Detection in Link Data Sets - Hsiung (2004)   Self-citation (Data)

No context found.

M. Hernandez and S. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. In Journal of Data Mining and Knowledge Discovery, 1997.


Schema Matching using Duplicates - Alexander Bilke Technische   (1 citation)

No context found.

M. A. Hernández and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1):9--37, 1998.


Semantic Overlay Clusters within Super-Peer Networks - Löser, Naumann, Siberski.. (2003)

No context found.

M. A. Hernández and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1):9--37, 1998.


Physical-Digital Divide - Shawn Jeffery Gustavo

No context found.

M. A. Hernandez et al.. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1):9--37, 1998.


A Multidimensional Model for Information Quality in.. - Missier, Batini

No context found.

M.A. Hernandez and S.J. Stolfo. Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem. Journal of Data Mining and Knowledge Discovery, 1(2), 1998.


The Object Identification Framework - Neiling, Jurk (2003)   (1 citation)

No context found.

M. Hernandez and S. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1):9--37, 1998.


A Comparison of Fast Blocking Methods for Record Linkage - Rohan Baxter (2003)

No context found.

M. Hernandez and S. Stolfo. Real-world data is dirty: data cleansing and the merge/purge problem. J. DMKD, 1(2), 1998.


Mining Reference Tables for Automatic Text Segmentation - Eugene Agichtein Columbia (2004)   (1 citation)

No context found.

M. A. Hernandez and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1):9--37, 1998.


TAILOR: A Record Linkage Toolbox - Elfeky, Verykios, Elmagarmid (2002)   (11 citations)

No context found.

M.A. Hernandez and S.J. Stolfo. Real World Data is Dirty: Data Cleansing and the Merge/Purge Problem. Journal of Data Mining and Knowledge Discovery, 2(1), pages 9-37, 1998.


Adaptive Filtering for Efficient Record Linkage - Lifang Gu Csiro

No context found.

M. Hernandez and S. Stolfo. Real-world data is dirty: data cleansing and the merge/purge problem. Journal of Data Mining and Knowledge Discovery, 1(2), 1998.


Record Linkage: A Machine Learning Approach, A.. - Elfeky, Verykios, .. (2003)

No context found.

M. Hernandez and S. Stolfo. Real World Data is Dirty: Data Cleansing and the Merge/Purge Problem. Journal of Data Mining and Knowledge Discovery, 2(1), pages 9-37, 1998.
