Results 1-10 of 206
Distance metric learning, with application to clustering with side-information
 In Advances in Neural Information Processing Systems 15
, 2002
Cited by 818 (13 self)
Many algorithms rely critically on being given a good metric over their inputs. For instance, data can often be clustered in many "plausible" ways, and if a clustering algorithm such as K-means initially fails to find one that is meaningful to a user, the only recourse may be for the user to manually tweak the input space's metric until sufficiently good clusters are found. For these and other applications requiring good metrics, it is desirable that we provide a more systematic way for users to indicate what they consider "similar." For instance, we may ask them to provide examples. In this paper, we present an algorithm that, given examples of similar (and, if desired, dissimilar) pairs of points in ℝ^n, learns a distance metric over ℝ^n that respects these relationships. Our method is based on posing metric learning as a convex optimization problem, which allows us to give efficient, local-optima-free algorithms. We also demonstrate empirically that the learned metrics can be used to significantly improve clustering performance.
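The abstract does not say what form the learned metric takes. As a minimal sketch, assume it is a Mahalanobis-style distance defined by a symmetric positive semidefinite matrix A, a common parameterization in metric learning. The function name and the matrices below are illustrative assumptions, and the matrix "stretched" is hand-picked rather than learned by any optimization.

```python
import math

def mahalanobis(x, y, A):
    """Distance sqrt((x - y)^T A (x - y)) for a symmetric PSD matrix A."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    # Compute A @ diff row by row, then take the dot product with diff.
    A_diff = [sum(a_ij * d_j for a_ij, d_j in zip(row, diff)) for row in A]
    return math.sqrt(sum(d_i * v_i for d_i, v_i in zip(diff, A_diff)))

# With the identity matrix the distance reduces to the Euclidean distance.
identity = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical matrix that weights the first coordinate more heavily.
stretched = [[4.0, 0.0], [0.0, 1.0]]

d_euclid = mahalanobis([0.0, 0.0], [1.0, 1.0], identity)
d_learned = mahalanobis([0.0, 0.0], [1.0, 1.0], stretched)
```

Under this assumption, learning a metric amounts to choosing A so that pairs marked similar end up close and pairs marked dissimilar end up far apart.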
From Instance-level Constraints to Space-level Constraints: Making the Most of Prior Knowledge in Data Clustering
, 2002
Cited by 201 (2 self)
We present an improved method for clustering in the presence of very limited supervisory information, given as pairwise instance constraints. By allowing instance-level constraints to have space-level inductive implications, we are able to successfully incorporate constraints for a wide range of data set types. Our method greatly improves on the previously studied constrained k-means algorithm, generally requiring less than half as many constraints to achieve a given accuracy on a range of real-world data, while also being more robust when over-constrained. We additionally discuss an active learning algorithm which increases the value of constraints even further.
Active Semi-Supervision for Pairwise Constrained Clustering
 Proc. 4th SIAM Intl. Conf. on Data Mining (SDM-2004)
Cited by 136 (9 self)
Semi-supervised clustering uses a small amount of supervised data to aid unsupervised learning. One typical approach specifies a limited number of must-link and cannot-link constraints between pairs of examples. This paper presents a pairwise constrained clustering framework and a new method for actively selecting informative pairwise constraints to get improved clustering performance. The clustering and active learning methods are both easily scalable to large datasets, and can handle very high dimensional data. Experimental and theoretical results confirm that this active querying of pairwise constraints significantly improves the accuracy of clustering when given a relatively small amount of supervision.
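To make the must-link and cannot-link vocabulary concrete, the sketch below counts how many constraints of each kind a given cluster assignment violates. It is an illustration only, not the paper's clustering or active-selection method; the function name and the toy data are assumptions.

```python
def count_violations(labels, must_link, cannot_link):
    """Return (violated must-links, violated cannot-links) for an assignment.

    labels[i] is the cluster index of example i; each constraint is a pair
    of example indices.
    """
    # A must-link pair is violated when its two examples land in different clusters.
    ml = sum(1 for i, j in must_link if labels[i] != labels[j])
    # A cannot-link pair is violated when its two examples share a cluster.
    cl = sum(1 for i, j in cannot_link if labels[i] == labels[j])
    return ml, cl

labels = [0, 0, 1, 1]
must_link = [(0, 1), (1, 2)]
cannot_link = [(0, 3), (2, 3)]
violations = count_violations(labels, must_link, cannot_link)
```

Here one must-link pair, (1, 2), and one cannot-link pair, (2, 3), are violated.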
Spectral learning
 In IJCAI
, 2003
Cited by 106 (6 self)
We present a simple, easily implemented spectral learning algorithm which applies equally whether we have no supervisory information, pairwise link constraints, or labeled examples. In the unsupervised case, it performs consistently with other spectral clustering algorithms. In the supervised case, our approach achieves high accuracy on the categorization of thousands of documents given only a few dozen labeled training documents for the 20 Newsgroups data set. Furthermore, its classification accuracy increases with the addition of unlabeled documents, demonstrating effective use of unlabeled data. By using normalized affinity matrices which are both symmetric and stochastic, we also obtain both a probabilistic interpretation of our method and certain guarantees of performance.
Clustering with Constraints: Feasibility Issues and the k-Means Algorithm
, 2005
Cited by 90 (9 self)
Recent work has looked at extending the k-Means algorithm to incorporate background information in the form of instance level must-link and cannot-link constraints. We introduce two ways of specifying additional background information in the form of δ and ε constraints that operate on all instances but which can be interpreted as conjunctions or disjunctions of instance level constraints and hence are easy to implement. We present complexity results for the feasibility of clustering under each type of constraint individually and several types together. A key finding is that determining whether there is a feasible solution satisfying all constraints is, in general, NP-complete. Thus, an iterative algorithm such as k-Means should not try to find a feasible partitioning at each iteration. This motivates our derivation of a new version of the k-Means algorithm that minimizes the constrained vector quantization error but at each iteration does not attempt to satisfy all constraints. Using standard UCI datasets, we find that using constraints improves accuracy as others have reported, but we also show that our algorithm reduces the number of iterations until convergence. Finally, we illustrate these benefits and our new constraint types on a complex real world object identification problem using the infrared detector on an Aibo robot.
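Although deciding feasibility is NP-complete in general, one necessary condition is cheap to check: because must-link is transitive, a cannot-link pair whose two endpoints fall in the same must-link connected component can never be satisfied. The sketch below tests only that condition with a union-find structure. It is an illustration under that stated scope, not the paper's algorithm, and it does not decide feasibility for a fixed number of clusters; the function names are assumptions.

```python
def find(parent, i):
    """Return the representative of i's component, compressing the path."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i

def conflicting_cannot_links(n, must_link, cannot_link):
    """List cannot-link pairs that contradict the must-link closure."""
    parent = list(range(n))
    # Merge the components of every must-link pair (transitive closure).
    for i, j in must_link:
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            parent[ri] = rj
    # A cannot-link inside a single component cannot be satisfied.
    return [(i, j) for i, j in cannot_link
            if find(parent, i) == find(parent, j)]

conflicts = conflicting_cannot_links(
    5, must_link=[(0, 1), (1, 2)], cannot_link=[(0, 2), (3, 4)]
)
```

An empty result does not prove that a feasible clustering exists; it only rules out this one kind of contradiction.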
Non-Redundant Data Clustering
, 2004
Cited by 87 (3 self)
Data clustering is a popular approach for automatically finding classes, concepts, or groups of patterns. In practice this discovery process should avoid redundancies with existing knowledge about class structures or groupings, and reveal novel, previously unknown aspects of the data. In order to deal with this problem, we present an extension of the information bottleneck framework, called coordinated conditional information bottleneck, which takes negative relevance information into account by maximizing a conditional mutual information score subject to constraints. Algorithmically, one can apply an alternating optimization scheme that can be used in conjunction with different types of numeric and non-numeric attributes. We present experimental results for applications in text mining and computer vision.
Segmentation given partial grouping constraints
 IEEE Transactions on Pattern Analysis and Machine Intelligence
, 2004
Cited by 80 (4 self)
We consider data clustering problems where partial grouping is known a priori. We formulate such biased grouping problems as a constrained optimization problem, where structural properties of the data define the goodness of a grouping and partial grouping cues define the feasibility of a grouping. We enforce grouping smoothness and fairness on labeled data points so that sparse partial grouping information can be effectively propagated to the unlabeled data. Considering the normalized cuts criterion in particular, our formulation leads to a constrained eigenvalue problem. By generalizing the Rayleigh-Ritz theorem to projected matrices, we find the global optimum in the relaxed continuous domain by eigendecomposition, from which a near-global optimum to the discrete labeling problem can be obtained effectively. We apply our method to real image segmentation problems, where partial grouping priors can often be derived based on a crude spatial attentional map that binds places with common salient features or focuses on expected object locations. We demonstrate not only that it is possible to integrate both image structures and priors in a single grouping process, but also that objects can be segregated from the background without specific object knowledge. Index Terms: Grouping, image segmentation, graph partitioning, bias, spatial attention, semi-supervised clustering, partially labeled classification.
Document clustering with committees
 In Proc. of SIGIR’02
, 2002
Cited by 75 (4 self)
Document clustering is useful in many information retrieval tasks: document browsing, organization and viewing of retrieval results, generation of Yahoo-like hierarchies of documents, etc. The general goal of clustering is to group data elements such that the intra-group similarities are high and the inter-group similarities are low. We present a clustering algorithm called CBC (Clustering By Committee) that is shown to produce higher quality clusters in document clustering tasks as compared to several well known clustering algorithms. It initially discovers a set of tight clusters (high intra-group similarity), called committees, that are well scattered in the similarity space (low inter-group similarity). The union of the committees is but a subset of all elements. The algorithm proceeds by assigning elements to their most similar committee. Evaluating cluster quality has always been a difficult task. We present a new evaluation methodology that is based on the editing distance between output clusters and manually constructed classes (the answer key). This evaluation measure is more intuitive and easier to interpret than previous evaluation measures.
Measuring constraint-set utility for partitional clustering algorithms
 In: Proceedings of the Tenth European Conference on Principles and Practice of Knowledge Discovery in Databases
, 2006
Cited by 49 (5 self)
Clustering with constraints is an active area of machine learning and data mining research. Previous empirical work has convincingly shown that adding constraints to clustering improves the performance of a variety of algorithms. However, in most of these experiments, results are averaged over different randomly chosen constraint sets from a given set of labels, thereby masking interesting properties of individual sets. We demonstrate that constraint sets vary significantly in how useful they are for constrained clustering; some constraint sets can actually decrease algorithm performance. We create two quantitative measures, informativeness and coherence, that can be used to identify useful constraint sets. We show that these measures can also help explain differences in performance for four particular constrained clustering algorithms.
SpotSigs: robust and efficient near duplicate detection in large web collections
 In SIGIR ’08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
, 2008
Cited by 45 (2 self)
Motivated by our work with political scientists who need to manually analyze large Web archives of news sites, we present SpotSigs, a new algorithm for extracting and matching signatures for near duplicate detection in large Web crawls. Our spot signatures are designed to favor natural-language portions of Web pages over advertisements and navigational bars. The contributions of SpotSigs are twofold: 1) by combining stopword antecedents with short chains of adjacent content terms, we create robust document signatures with a natural ability to filter out noisy components of Web pages that would otherwise distract pure n-gram-based approaches such as Shingling; 2) we provide an exact and efficient, self-tuning matching algorithm that exploits a novel combination of collection partitioning and inverted index pruning for high-dimensional similarity search. Experiments confirm an increase in combined precision and recall of more than 24 percent over state-of-the-art approaches such as Shingling or I-Match and up to a factor of 3 faster execution times than Locality Sensitive Hashing (LSH), over a demonstrative “Gold Set” of manually assessed near-duplicate news articles as well as the TREC WT10g Web collection.
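The idea of pairing a stopword antecedent with a short chain of following content terms can be sketched roughly as below. This is a simplified illustration based only on the abstract: the chain length, the stopword and antecedent lists, and the plain set-based Jaccard comparison are all assumptions, and the partitioning and index pruning of the matching algorithm are not reproduced.

```python
def spot_signatures(tokens, antecedents, stopwords, chain_len=2):
    """Collect (antecedent, chain) pairs, where chain holds the next
    chain_len non-stopword tokens after each antecedent occurrence."""
    sigs = set()
    for pos, tok in enumerate(tokens):
        if tok not in antecedents:
            continue
        chain = []
        for nxt in tokens[pos + 1:]:
            if nxt in stopwords:
                continue  # skip stopwords, keep only content terms
            chain.append(nxt)
            if len(chain) == chain_len:
                break
        if len(chain) == chain_len:  # drop incomplete chains at the end
            sigs.add((tok, tuple(chain)))
    return sigs

def jaccard(a, b):
    """Set-based Jaccard similarity; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

STOP = {"the", "a", "is", "of", "at", "on"}  # hypothetical stopword list
ANTE = {"the", "is"}                         # hypothetical antecedents

doc1 = "the quick fox is jumping at the lazy dog".split()
doc2 = "the quick fox is jumping on the lazy dog today".split()
s1 = spot_signatures(doc1, ANTE, STOP)
s2 = spot_signatures(doc2, ANTE, STOP)
similarity = jaccard(s1, s2)
```

In this toy case the two documents differ in a stopword and a trailing word, yet produce the same three signatures, which hints at why such signatures tolerate small edits.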