Results 1 - 10
of
332
Duplicate Record Detection: A Survey
, 2007
"... Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standa ..."
Abstract
-
Cited by 448 (11 self)
- Add to MetaCart
(Show Context)
Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and with a brief discussion of the big open problems in the area.
A Probabilistic Framework for Semi-Supervised Clustering
, 2004
"... Unsupervised clustering can be significantly improved using supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to same or different clusters. In recent years, a number of algorithms have been proposed for enhancing clustering quality by employing such supe ..."
Abstract
-
Cited by 275 (14 self)
- Add to MetaCart
(Show Context)
Unsupervised clustering can be significantly improved using supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to same or different clusters. In recent years, a number of algorithms have been proposed for enhancing clustering quality by employing such supervision. Such methods use the constraints to either modify the objective function, or to learn the distance measure. We propose a probabilistic model for semisupervised clustering based on Hidden Markov Random Fields (HMRFs) that provides a principled framework for incorporating supervision into prototype-based clustering. The model generalizes a previous approach that combines constraints and Euclidean distance learning, and allows the use of a broad range of clustering distortion measures, including Bregman divergences (e.g., Euclidean distance and I-divergence) and directional similarity measures (e.g., cosine similarity). We present an algorithm that performs partitional semi-supervised clustering of data by minimizing an objective function derived from the posterior energy of the HMRF model. Experimental results on several text data sets demonstrate the advantages of the proposed framework. 1.
Integrating Constraints and Metric Learning in Semi-Supervised Clustering
- In ICML
, 2004
"... Semi-supervised clustering employs a small amount of labeled data to aid unsupervised learning. Previous work in the area has utilized supervised data in one of two approaches: 1) constraint-based methods that guide the clustering algorithm towards a better grouping of the data, and 2) distanc ..."
Abstract
-
Cited by 248 (7 self)
- Add to MetaCart
Semi-supervised clustering employs a small amount of labeled data to aid unsupervised learning. Previous work in the area has utilized supervised data in one of two approaches: 1) constraint-based methods that guide the clustering algorithm towards a better grouping of the data, and 2) distance-function learning methods that adapt the underlying similarity metric used by the clustering algorithm. This paper provides new methods for the two approaches as well as presents a new semi-supervised clustering algorithm that integrates both of these techniques in a uniform, principled framework. Experimental results demonstrate that the unified approach produces better clusters than both individual approaches as well as previously proposed semisupervised clustering algorithms.
Aggregating inconsistent information: ranking and clustering
, 2005
"... We address optimization problems in which we are given contradictory pieces of input information and the goal is to find a globally consistent solution that minimizes the number of disagreements with the respective inputs. Specifically, the problems we address are rank aggregation, the feedback arc ..."
Abstract
-
Cited by 226 (17 self)
- Add to MetaCart
We address optimization problems in which we are given contradictory pieces of input information and the goal is to find a globally consistent solution that minimizes the number of disagreements with the respective inputs. Specifically, the problems we address are rank aggregation, the feedback arc set problem on tournaments, and correlation and consensus clustering. We show that for all these problems (and various weighted versions of them), we can obtain improved approximation factors using essentially the same remarkably simple algorithm. Additionally, we almost settle a long-standing conjecture of Bang-Jensen and Thomassen and show that unless NP⊆BPP, there is no polynomial time algorithm for the problem of minimum feedback arc set in tournaments.
Get out the vote: Determining support or opposition from Congressional floor-debate transcripts
- In Proceedings of EMNLP
, 2006
"... We investigate whether one can determine from the transcripts of U.S. Congressional floor debates whether the speeches represent support of or opposition to proposed legislation. To address this problem, we exploit the fact that these speeches occur as part of a discussion; this allows us to use sou ..."
Abstract
-
Cited by 151 (4 self)
- Add to MetaCart
We investigate whether one can determine from the transcripts of U.S. Congressional floor debates whether the speeches represent support of or opposition to proposed legislation. To address this problem, we exploit the fact that these speeches occur as part of a discussion; this allows us to use sources of information regarding relationships between discourse segments, such as whether a given utterance indicates agreement with the opinion expressed by another. We find that the incorporation of such information yields substantial improvements over classifying speeches in isolation. 1
Conditional models of identity uncertainty with application to noun coreference
- In NIPS17, Lawrence K. Saul, Yair Weiss, and Léon Bottou, Eds
, 2005
"... Coreference analysis, also known as record linkage or identity uncer-tainty, is a difficult and important problem in natural language process-ing, databases, citation matching and many other tasks. This paper intro-duces several discriminative, conditional-probability models for coref-erence analysi ..."
Abstract
-
Cited by 144 (16 self)
- Add to MetaCart
(Show Context)
Coreference analysis, also known as record linkage or identity uncer-tainty, is a difficult and important problem in natural language process-ing, databases, citation matching and many other tasks. This paper intro-duces several discriminative, conditional-probability models for coref-erence analysis, all examples of undirected graphical models. Unlike many historical approaches to coreference, the models presented here are relational—they do not assume that pairwise coreference decisions should be made independently from each other. Unlike other relational models of coreference that are generative, the conditional model here can incorporate a great variety of features of the input without having to be concerned about their dependencies—paralleling the advantages of con-ditional random fields over hidden Markov models. We present positive results on noun phrase coreference in two standard text data sets. 1
Maximum margin clustering
- Advances in Neural Information Processing Systems 17
, 2005
"... We propose a new method for clustering based on finding maximum margin hyperplanes through data. By reformulating the problem in terms of the implied equivalence relation matrix, we can pose the problem as a convex integer program. Although this still yields a difficult computational problem, the ha ..."
Abstract
-
Cited by 136 (4 self)
- Add to MetaCart
(Show Context)
We propose a new method for clustering based on finding maximum margin hyperplanes through data. By reformulating the problem in terms of the implied equivalence relation matrix, we can pose the problem as a convex integer program. Although this still yields a difficult computational problem, the hard-clustering constraints can be relaxed to a soft-clustering formulation which can be feasibly solved with a semidefinite program. Since our clustering technique only depends on the data through the kernel matrix, we can easily achieve nonlinear clusterings in the same manner as spectral clustering. Experimental results show that our maximum margin clustering technique often obtains more accurate results than conventional clustering methods. The real benefit of our approach, however, is that it leads naturally to a semi-supervised training method for support vector machines. By maximizing the margin simultaneously on labeled and unlabeled training data, we achieve state of the art performance by using a single, integrated learning principle. 1
Active Semi-Supervision for Pairwise Constrained Clustering
- Proc. 4th SIAM Intl. Conf. on Data Mining (SDM-2004
"... Semi-supervised clustering uses a small amount of supervised data to aid unsupervised learning. One typical approach specifies a limited number of must-link and cannotlink constraints between pairs of examples. This paper presents a pairwise constrained clustering framework and a new method for acti ..."
Abstract
-
Cited by 136 (9 self)
- Add to MetaCart
(Show Context)
Semi-supervised clustering uses a small amount of supervised data to aid unsupervised learning. One typical approach specifies a limited number of must-link and cannotlink constraints between pairs of examples. This paper presents a pairwise constrained clustering framework and a new method for actively selecting informative pairwise constraints to get improved clustering performance. The clustering and active learning methods are both easily scalable to large datasets, and can handle very high dimensional data. Experimental and theoretical results confirm that this active querying of pairwise constraints significantly improves the accuracy of clustering when given a relatively small amount of supervision. 1
Clustering with qualitative information
- In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science
, 2003
"... We consider the problem of clustering a collection of el-ements based on pairwise judgments of similarity and dis-similarity. Bansal, Blum and Chawla [1] cast the problem thus: given a graph G whose edges are labeled “+ ” (sim-ilar) or “− ” (dissimilar), partition the vertices into clus-ters so that ..."
Abstract
-
Cited by 122 (9 self)
- Add to MetaCart
(Show Context)
We consider the problem of clustering a collection of el-ements based on pairwise judgments of similarity and dis-similarity. Bansal, Blum and Chawla [1] cast the problem thus: given a graph G whose edges are labeled “+ ” (sim-ilar) or “− ” (dissimilar), partition the vertices into clus-ters so that the number of pairs correctly (resp. incorrectly) classified with respect to the input labeling is maximized (resp. minimized). Complete graphs, where the classifier la-bels every edge, and general graphs, where some edges are not labeled, are both worth studying. We answer several questions left open in [1] and provide a sound overview of clustering with qualitative information. We give a factor 4 approximation for minimization on complete graphs, and a factor O(log n) approximation for general graphs. For the maximization version, a PTAS for complete graphs is shown in [1]; we give a factor 0.7664 approximation for general graphs, noting that a PTAS is unlikely by proving APX-hardness. We also prove the APX-hardness of minimization on complete graphs. 1.