Winkler, W. E. 1999. The state of record linkage and current research problems. Statistics of Income Division, Internal Revenue Service Publication R99/04. Available from http://www.census.gov/srd/www/byname.html.
....the initial matching model. Unless such data is readily available from previous studies, human effort is required to validate the training data. To the best of our knowledge, no techniques have been proposed for this purpose. A more extensive discussion of future directions can be found in [44]. ....
William E. Winkler, The State of Record Linkage and Current Research Problems, Proceedings of the Survey Methods Section (1999), 73--79.
....the optimal rule for minimising the number of possible matches given desired Type I and Type II errors, assuming conditional independence. We also described the EM method with the conditional independence assumption. The EM method has been derived without assuming conditional independence [109, 74]. We now consider several alternative decision models for deciding whether a record pair should be a match, non-match or possible match. Statistical Models Copas and Hilton [28] propose a matching algorithm that depends on the statistical characteristics of the errors which are likely to arise. ....
....of this approach is that the model fit can be used to consider the validity of the modelling approach. Predictive Models Predictive models for learning the parameters (threshold values and attribute weights) have recently been proposed. Adequate training data is needed to train these models [109, 64]. Proposed models have included: logistic regression [87], although this was found not to work for census data in [64]; support vector machines [10]; and decision trees [103]. Active learning techniques have also been proposed to optimise efficiency in selection of training records [103] ....
[Article contains additional citation context not shown here]
W.E. Winkler. The State of Record Linkage and Current Research Problems. Technical Report RR/1999.
....data mining algorithms from discovering important regularities. This problem is typically handled during a tedious manual deduplication process. Some previous work has addressed the problem of identifying duplicate records, where it was referred to as record linkage [Fellegi and Sunter, 1969; Winkler, 1999], the merge/purge problem [Hernández and Stolfo, 1995], duplicate detection [Monge and Elkan, 1997; Sarawagi and Bhamidipaty, 2002], hardening soft databases [Cohen et al., 2000], reference matching [McCallum et al., 2000], object identification [Tejada et al., 2002] and entity name ....
....Work Fellegi and Sunter [Fellegi and Sunter, 1969] developed a formal theory for record linkage and offered statistical methods for estimating matching parameters and error rates. In more recent work in statistics, Winkler proposed using EM-based methods for obtaining optimal matching rules [Winkler, 1999]. That work was highly specialized for the domain of census records and used hand-tuned similarity measures. McCallum et al. introduced the efficient canopies clustering algorithm [McCallum et al., 2000] for the task of matching scientific citations. Monge and Elkan developed the iterative ....
William E. Winkler. The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Bureau of the Census, Washington, DC, 1999.
....vary greatly depending on how informative the fields are, it is necessary to weight fields according to their contribution to the true distance between records. While statistical aspects of combining similarity scores for individual fields have been addressed in previous work on record linkage [25], availability of labeled duplicates allows a more direct approach that uses a binary classifier that computes a pairing function [4]. Given a database that contains records composed of k different fields and a set D = {d_1, ..., d_m} of distance metrics, we can represent any ....
....The problem of finding a similarity threshold for separating duplicates from non-duplicates arises at this point. A trivial solution would be to use the binary classification results to label some records as duplicates, and others as non-duplicates. A traditional approach to this problem [25], however, requires assigning two thresholds: one that separates pairs of records that are high-confidence duplicates, and another for possible duplicates that should be reviewed by a human expert. Since relative costs of labeling a non-duplicate as a duplicate (false positives) and overlooking ....
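The two-threshold scheme described in this excerpt can be sketched as a three-way decision on a pair's similarity score. This is a minimal illustration only: the function name, the label strings, and the threshold values 0.4 and 0.8 are assumptions made for the example and do not come from the cited papers.

```python
def classify_pair(score, lower, upper):
    """Three-way decision on a similarity score using two thresholds.

    `lower` and `upper` are hypothetical parameters; in practice they would
    be chosen from the relative costs of false positives and false negatives.
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score >= upper:
        return "duplicate"           # high-confidence duplicate
    if score >= lower:
        return "possible duplicate"  # left for review by a human expert
    return "non-duplicate"


# Illustrative scores and thresholds (made-up values).
labels = [classify_pair(s, lower=0.4, upper=0.8) for s in (0.95, 0.55, 0.10)]
```

Setting the two thresholds to the same value removes the review band, which corresponds to the plain binary labeling the excerpt calls the trivial solution.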
[Article contains additional citation context not shown here]
W. E. Winkler. The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Bureau of the Census, Washington, DC, 1999.
....rules that compute the extent of match between individual attributes and combine the match scores by using thresholds and several if-then-else conditions. One way of reducing the tedium of hand-coding is to relegate the task of finding the deduplication function to a machine learning algorithm [31, 32, 14]. The algorithm takes as input training examples consisting of pairs of duplicates and non-duplicates. A second input is a collection of various kinds of simple, domain-specific matching functions on various attributes, provided by a domain expert. The learning algorithm can then use the examples ....
....105 were duplicates a skewness of 0.25 . Similarity functions We designed 20 similarity functions on each of the two datasets. For each text attribute, we had three functions: NGrams match (with ngrams of size 3), fraction of overlapping words, and approximate edit distance as described in [31, 32]. For integer fields like year and page, we had a special number match that tolerated shift by 1. For attributes that are likely to get wrongly segmented as a neighboring field, we created new text fields .... [Figure 7: Running time; panels: Address Dataset, Bibliography Dataset]
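Three of the similarity functions named in this excerpt (3-gram match, fraction of overlapping words, and a number match tolerating a shift of 1) might look roughly like the sketch below. The excerpt does not give their exact definitions, so the normalisations used here (Jaccard overlap for n-grams, division by the smaller word set) and all function names are assumptions.

```python
def ngrams(text, n=3):
    """Set of character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_match(a, b, n=3):
    """Jaccard overlap of character n-gram sets (assumed normalisation)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0  # strings shorter than n have no n-grams
    return len(ga & gb) / len(ga | gb)


def word_overlap(a, b):
    """Shared words as a fraction of the smaller word set (assumed)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / min(len(wa), len(wb))


def number_match(x, y, tolerance=1):
    """1.0 if two integers differ by at most `tolerance`, else 0.0."""
    return 1.0 if abs(x - y) <= tolerance else 0.0
```

For instance, `number_match(1999, 2000)` counts as a match under the default tolerance, while `number_match(1999, 2001)` does not.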
[Article contains additional citation context not shown here]
W. E. Winkler. The state of record linkage and current research problems. RR99/04, http://www.census.gov/srd/papers/pdf/rr99-04.pdf, 1999.
....is important in removing duplicates from a relation that has been drawn from the union of many different information sources. Previous work in this area includes work in distance functions for matching [14, 3, 9, 8] and scalable matching [2] and clustering [13] algorithms. Work in record linkage [15, 10, 21, 20, 7] is similar but does not rely as heavily on textual similarities. In this paper we synthesize many of these ideas. We present techniques for entity name matching and clustering that are scalable and adaptive, in the sense that accuracy can be improved by training. 2. LEARNING TO MATCH AND CLUSTER ....
.... similarity metrics for entity names (e.g. [9, 14]) or general frameworks for manually implementing similarity metrics (e.g. [8]). The core idea of learning distance functions for entity pairs is not new: there is a substantial literature on the record linkage problem in statistics (e.g. [10, 20]), much of which is based on a record linkage theory proposed by Fellegi and Sunter [7]. The maximum entropy learning approach we use has an advantage over Fellegi-Sunter in that it does not require features to be independent, allowing a broader range of potential similarity features to be used; at the ....
[Article contains additional citation context not shown here]
W. E. Winkler. The state of record linkage and current research problems. Statistics of Income Division, Internal Revenue Service Publication R99/04. Available from http://www.census.gov/srd/www/byname.html, 1999.
....preventing data mining algorithms from discovering important regularities. Such problems are typically handled during a tedious manual data cleaning or de-duping process. Some previous work has addressed the problem of identifying duplicate records, where it is referred to as record linkage [17, 25], the merge/purge problem [10], duplicate detection [14], hardening soft databases [3], and reference matching [12]. Typically, a fixed textual similarity metric such as edit distance [22] or vector space cosine similarity [21] is used to determine whether two values or records are alike enough to ....
....and similarity across individual fields can vary greatly, it is necessary to weight fields according to their contribution to the true similarity between records. While statistical aspects of combining similarity scores for individual fields have been addressed in previous work on record linkage [25], availability of labeled duplicates allows a more direct approach that uses a binary classifier [2]. Given a database that contains records composed of k different fields and a set D = {d_1, ..., d_m} of distance metrics, we can represent any pair of records by an mk-dimensional vector of ....
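The pair representation in this excerpt, where each of m distance metrics is applied to each of k fields, can be sketched as follows. The two toy metrics and the sample records are hypothetical stand-ins chosen for brevity; the cited paper's actual metrics are not shown in the excerpt.

```python
def exact_mismatch(a, b):
    """Toy metric: 0.0 if the field values are identical, else 1.0."""
    return 0.0 if a == b else 1.0


def length_gap(a, b):
    """Toy metric: relative difference in string length."""
    longest = max(len(a), len(b))
    return abs(len(a) - len(b)) / longest if longest else 0.0


def pair_vector(rec_a, rec_b, metrics):
    """Apply every metric to every aligned field pair.

    For records with k fields and m metrics the result has m * k entries,
    ordered field by field.
    """
    if len(rec_a) != len(rec_b):
        raise ValueError("records must have the same number of fields")
    return [metric(fa, fb)
            for fa, fb in zip(rec_a, rec_b)
            for metric in metrics]


# Two records with k = 2 fields (author, year) and m = 2 metrics.
vec = pair_vector(("W. E. Winkler", "1999"),
                  ("W. Winkler", "1999"),
                  [exact_mismatch, length_gap])
```

A vector of this form is what a binary classifier trained on labeled duplicate and non-duplicate pairs would take as input.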
[Article contains additional citation context not shown here]
W. E. Winkler. The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Bureau of the Census, Washington, DC, 1999.
No context found.
Winkler, W. E. The State of Record Linkage and Current Research Problems, Statistical Society of Canada, Proceedings of the Section on Survey Methods, (1999), 73--79 (longer version report rr99/04 available at
No context found.
Winkler, W. E. 1999. The state of record linkage and current research problems. Statistics of Income Division, Internal Revenue Service Publication R99/04. Available from http://www.census.gov/srd/www/byname.html.
No context found.
W. E. Winkler. The state of record linkage and current research problems. Technical report, U. S. Bureau of the Census, 1999.
No context found.
Winkler, W. E.: The state of record linkage and current research problems. In: Proceedings of the Survey Methods Section. (1999) 73--79
No context found.
Winkler, W.E.: The State of Record Linkage and Current Research Problems. RR99/03, US Bureau of the Census, 1999.
No context found.
W. E. Winkler. The state of record linkage and current research problems. Technical Report, Statistical Research Division, U.S. Bureau of the Census, 1999.
No context found.
Winkler, W.E.: The State of Record Linkage and Current Research Problems. Research Report RR 1999-04, US Bureau of the Census, 1999.
No context found.
W.E. Winkler. The State of Record Linkage and Current Research Problems. U.S. Bureau of the Census, Research Report, 1999. http://www.census.gov/srd/papers/pdf/rr9904.pdf.
No context found.
W. Winkler. The state of record linkage and current research problems. Technical Report RR/1999.
No context found.
W.E. Winkler. The State of Record Linkage and Current Research Problems. U.S. Bureau of the Census, Research Report, 1999. http://www.census.gov/srd/papers/pdf/rr9904.pdf.
No context found.
Winkler, W.E.: The State of Record Linkage and Current Research Problems. Research Report RR99/03, US Bureau of the Census, 1999.
No context found.
W. E. Winkler. The state of record linkage and current research problems. Statistics of Income Division, Internal Revenue Service Publication R99/04, 1999.
No context found.
W. E. Winkler. The state of record linkage and current research problems. Technical Report, Statistical Research Division, U.S. Bureau of the Census, 1999.
No context found.
W. Winkler. The State of Record Linkage and Current Research Problems. U.S. Bureau of the Census, Research Report, 1999.
No context found.
W.E. Winkler, The State of Record Linkage and Current Research Problems, U.S. Census Bureau Research Report RR99/04, 1999.