Results 1  10
of
332
Duplicate Record Detection: A Survey
, 2007
"... Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standa ..."
Abstract

Cited by 448 (11 self)
 Add to MetaCart
(Show Context)
Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and with a brief discussion of the big open problems in the area.
A Probabilistic Framework for SemiSupervised Clustering
, 2004
"... Unsupervised clustering can be significantly improved using supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to same or different clusters. In recent years, a number of algorithms have been proposed for enhancing clustering quality by employing such supe ..."
Abstract

Cited by 275 (14 self)
 Add to MetaCart
(Show Context)
Unsupervised clustering can be significantly improved using supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to same or different clusters. In recent years, a number of algorithms have been proposed for enhancing clustering quality by employing such supervision. Such methods use the constraints to either modify the objective function, or to learn the distance measure. We propose a probabilistic model for semisupervised clustering based on Hidden Markov Random Fields (HMRFs) that provides a principled framework for incorporating supervision into prototypebased clustering. The model generalizes a previous approach that combines constraints and Euclidean distance learning, and allows the use of a broad range of clustering distortion measures, including Bregman divergences (e.g., Euclidean distance and Idivergence) and directional similarity measures (e.g., cosine similarity). We present an algorithm that performs partitional semisupervised clustering of data by minimizing an objective function derived from the posterior energy of the HMRF model. Experimental results on several text data sets demonstrate the advantages of the proposed framework. 1.
Integrating Constraints and Metric Learning in SemiSupervised Clustering
 In ICML
, 2004
"... Semisupervised clustering employs a small amount of labeled data to aid unsupervised learning. Previous work in the area has utilized supervised data in one of two approaches: 1) constraintbased methods that guide the clustering algorithm towards a better grouping of the data, and 2) distanc ..."
Abstract

Cited by 248 (7 self)
 Add to MetaCart
Semisupervised clustering employs a small amount of labeled data to aid unsupervised learning. Previous work in the area has utilized supervised data in one of two approaches: 1) constraintbased methods that guide the clustering algorithm towards a better grouping of the data, and 2) distancefunction learning methods that adapt the underlying similarity metric used by the clustering algorithm. This paper provides new methods for the two approaches as well as presents a new semisupervised clustering algorithm that integrates both of these techniques in a uniform, principled framework. Experimental results demonstrate that the unified approach produces better clusters than both individual approaches as well as previously proposed semisupervised clustering algorithms.
Aggregating inconsistent information: ranking and clustering
, 2005
"... We address optimization problems in which we are given contradictory pieces of input information and the goal is to find a globally consistent solution that minimizes the number of disagreements with the respective inputs. Specifically, the problems we address are rank aggregation, the feedback arc ..."
Abstract

Cited by 226 (17 self)
 Add to MetaCart
We address optimization problems in which we are given contradictory pieces of input information and the goal is to find a globally consistent solution that minimizes the number of disagreements with the respective inputs. Specifically, the problems we address are rank aggregation, the feedback arc set problem on tournaments, and correlation and consensus clustering. We show that for all these problems (and various weighted versions of them), we can obtain improved approximation factors using essentially the same remarkably simple algorithm. Additionally, we almost settle a longstanding conjecture of BangJensen and Thomassen and show that unless NP⊆BPP, there is no polynomial time algorithm for the problem of minimum feedback arc set in tournaments.
Get out the vote: Determining support or opposition from Congressional floordebate transcripts
 In Proceedings of EMNLP
, 2006
"... We investigate whether one can determine from the transcripts of U.S. Congressional floor debates whether the speeches represent support of or opposition to proposed legislation. To address this problem, we exploit the fact that these speeches occur as part of a discussion; this allows us to use sou ..."
Abstract

Cited by 151 (4 self)
 Add to MetaCart
We investigate whether one can determine from the transcripts of U.S. Congressional floor debates whether the speeches represent support of or opposition to proposed legislation. To address this problem, we exploit the fact that these speeches occur as part of a discussion; this allows us to use sources of information regarding relationships between discourse segments, such as whether a given utterance indicates agreement with the opinion expressed by another. We find that the incorporation of such information yields substantial improvements over classifying speeches in isolation. 1
Conditional models of identity uncertainty with application to noun coreference
 In NIPS17, Lawrence K. Saul, Yair Weiss, and Léon Bottou, Eds
, 2005
"... Coreference analysis, also known as record linkage or identity uncertainty, is a difficult and important problem in natural language processing, databases, citation matching and many other tasks. This paper introduces several discriminative, conditionalprobability models for coreference analysi ..."
Abstract

Cited by 144 (16 self)
 Add to MetaCart
(Show Context)
Coreference analysis, also known as record linkage or identity uncertainty, is a difficult and important problem in natural language processing, databases, citation matching and many other tasks. This paper introduces several discriminative, conditionalprobability models for coreference analysis, all examples of undirected graphical models. Unlike many historical approaches to coreference, the models presented here are relational—they do not assume that pairwise coreference decisions should be made independently from each other. Unlike other relational models of coreference that are generative, the conditional model here can incorporate a great variety of features of the input without having to be concerned about their dependencies—paralleling the advantages of conditional random fields over hidden Markov models. We present positive results on noun phrase coreference in two standard text data sets. 1
Maximum margin clustering
 Advances in Neural Information Processing Systems 17
, 2005
"... We propose a new method for clustering based on finding maximum margin hyperplanes through data. By reformulating the problem in terms of the implied equivalence relation matrix, we can pose the problem as a convex integer program. Although this still yields a difficult computational problem, the ha ..."
Abstract

Cited by 136 (4 self)
 Add to MetaCart
(Show Context)
We propose a new method for clustering based on finding maximum margin hyperplanes through data. By reformulating the problem in terms of the implied equivalence relation matrix, we can pose the problem as a convex integer program. Although this still yields a difficult computational problem, the hardclustering constraints can be relaxed to a softclustering formulation which can be feasibly solved with a semidefinite program. Since our clustering technique only depends on the data through the kernel matrix, we can easily achieve nonlinear clusterings in the same manner as spectral clustering. Experimental results show that our maximum margin clustering technique often obtains more accurate results than conventional clustering methods. The real benefit of our approach, however, is that it leads naturally to a semisupervised training method for support vector machines. By maximizing the margin simultaneously on labeled and unlabeled training data, we achieve state of the art performance by using a single, integrated learning principle. 1
Active SemiSupervision for Pairwise Constrained Clustering
 Proc. 4th SIAM Intl. Conf. on Data Mining (SDM2004
"... Semisupervised clustering uses a small amount of supervised data to aid unsupervised learning. One typical approach specifies a limited number of mustlink and cannotlink constraints between pairs of examples. This paper presents a pairwise constrained clustering framework and a new method for acti ..."
Abstract

Cited by 136 (9 self)
 Add to MetaCart
(Show Context)
Semisupervised clustering uses a small amount of supervised data to aid unsupervised learning. One typical approach specifies a limited number of mustlink and cannotlink constraints between pairs of examples. This paper presents a pairwise constrained clustering framework and a new method for actively selecting informative pairwise constraints to get improved clustering performance. The clustering and active learning methods are both easily scalable to large datasets, and can handle very high dimensional data. Experimental and theoretical results confirm that this active querying of pairwise constraints significantly improves the accuracy of clustering when given a relatively small amount of supervision. 1
Clustering with qualitative information
 In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science
, 2003
"... We consider the problem of clustering a collection of elements based on pairwise judgments of similarity and dissimilarity. Bansal, Blum and Chawla [1] cast the problem thus: given a graph G whose edges are labeled “+ ” (similar) or “− ” (dissimilar), partition the vertices into clusters so that ..."
Abstract

Cited by 122 (9 self)
 Add to MetaCart
(Show Context)
We consider the problem of clustering a collection of elements based on pairwise judgments of similarity and dissimilarity. Bansal, Blum and Chawla [1] cast the problem thus: given a graph G whose edges are labeled “+ ” (similar) or “− ” (dissimilar), partition the vertices into clusters so that the number of pairs correctly (resp. incorrectly) classified with respect to the input labeling is maximized (resp. minimized). Complete graphs, where the classifier labels every edge, and general graphs, where some edges are not labeled, are both worth studying. We answer several questions left open in [1] and provide a sound overview of clustering with qualitative information. We give a factor 4 approximation for minimization on complete graphs, and a factor O(log n) approximation for general graphs. For the maximization version, a PTAS for complete graphs is shown in [1]; we give a factor 0.7664 approximation for general graphs, noting that a PTAS is unlikely by proving APXhardness. We also prove the APXhardness of minimization on complete graphs. 1.