
CiteSeerX

Semi-supervised clustering with user feedback (2003)

by D Cohn, R Caruana, A McCallum
Results 1 - 10 of 122 citing documents

A Probabilistic Framework for Semi-Supervised Clustering

by Sugato Basu, 2004
Abstract - Cited by 271 (14 self)
Unsupervised clustering can be significantly improved using supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to same or different clusters. In recent years, a number of algorithms have been proposed for enhancing clustering quality by employing such supervision. Such methods use the constraints to either modify the objective function, or to learn the distance measure. We propose a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs) that provides a principled framework for incorporating supervision into prototype-based clustering. The model generalizes a previous approach that combines constraints and Euclidean distance learning, and allows the use of a broad range of clustering distortion measures, including Bregman divergences (e.g., Euclidean distance and I-divergence) and directional similarity measures (e.g., cosine similarity). We present an algorithm that performs partitional semi-supervised clustering of data by minimizing an objective function derived from the posterior energy of the HMRF model. Experimental results on several text data sets demonstrate the advantages of the proposed framework.

Citation Context

... adaptive distance measures have been used for semi-supervised clustering, including string-edit distance trained using Expectation Maximization (EM) [10], KL divergence trained using gradient descent [13], Euclidean distance modified by a shortest-path algorithm [27], or Mahalanobis distances trained using convex optimization [39]. We propose a principled probabilistic framework based on Hidden Markov ...
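The objective described in this abstract combines a clustering distortion term with penalties for violated pairwise constraints. The following is a minimal sketch of that idea; the function names, the squared-Euclidean distortion choice, and the uniform penalty weight `w` are illustrative assumptions, not taken from the paper:

```python
# Minimal sketch of an HMRF-style semi-supervised clustering objective:
# distortion to assigned centroids plus a penalty for each violated
# pairwise constraint. Names and the uniform penalty weight `w` are
# illustrative, not from the paper.

def sq_dist(x, c):
    """Squared Euclidean distance (one choice of distortion measure)."""
    return sum((a - b) ** 2 for a, b in zip(x, c))

def hmrf_objective(points, labels, centroids, must_link, cannot_link, w=1.0):
    """Distortion plus constraint-violation penalties."""
    cost = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    cost += w * sum(1 for i, j in must_link if labels[i] != labels[j])
    cost += w * sum(1 for i, j in cannot_link if labels[i] == labels[j])
    return cost

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
centroids = [(0.05, 0.0), (5.0, 5.0)]
# An assignment satisfying the must-link (0,1) and cannot-link (0,2)
# scores lower than one that splits the must-linked pair.
good = hmrf_objective(points, [0, 0, 1], centroids, [(0, 1)], [(0, 2)])
bad = hmrf_objective(points, [0, 1, 1], centroids, [(0, 1)], [(0, 2)])
assert good < bad
```

Minimizing such a cost trades off fitting the data against honoring the supervision, which is the posterior-energy view the abstract mentions.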

Integrating Constraints and Metric Learning in Semi-Supervised Clustering

by Mikhail Bilenko, Sugato Basu, Raymond J. Mooney - In ICML, 2004
Abstract - Cited by 245 (7 self)
Semi-supervised clustering employs a small amount of labeled data to aid unsupervised learning. Previous work in the area has utilized supervised data in one of two approaches: 1) constraint-based methods that guide the clustering algorithm towards a better grouping of the data, and 2) distance-function learning methods that adapt the underlying similarity metric used by the clustering algorithm. This paper provides new methods for the two approaches as well as presents a new semi-supervised clustering algorithm that integrates both of these techniques in a uniform, principled framework. Experimental results demonstrate that the unified approach produces better clusters than both individual approaches as well as previously proposed semi-supervised clustering algorithms.
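The distance-function-learning half of this approach can be illustrated with a diagonal weighted Euclidean metric, where learned per-dimension weights change which points count as close. The weights and data below are purely illustrative; the actual algorithm learns such weights jointly with the clustering (potentially one metric per cluster):

```python
# Sketch of metric adaptation with a diagonal weighted Euclidean
# distance: larger weights make a dimension matter more. Weights and
# data are illustrative, not learned as in the paper.

def weighted_sq_dist(x, y, w):
    return sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y))

a, b = (0.0, 0.0), (0.0, 3.0)
# Under uniform weights the pair is far apart; downweighting the second
# (supposedly irrelevant) dimension pulls a must-linked pair together.
far = weighted_sq_dist(a, b, (1.0, 1.0))
near = weighted_sq_dist(a, b, (1.0, 0.01))
assert near < far
```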

Semi-supervised Clustering by Seeding

by Sugato Basu, Arindam Banerjee, R. Mooney - In Proceedings of the 19th International Conference on Machine Learning (ICML-2002), 2002
Abstract - Cited by 206 (17 self)
Semi-supervised clustering uses a small amount of labeled data to aid and bias the clustering of unlabeled data. This paper explores the use of labeled data to generate initial seed clusters, as well as the use of constraints generated from labeled data to guide the clustering process. It introduces two semi-supervised variants of KMeans clustering that can be viewed as instances of the EM algorithm, where labeled data provides prior information about the conditional distributions of hidden category labels. Experimental results demonstrate the advantages of these methods over standard random seeding and COP-KMeans, a previously developed semi-supervised clustering algorithm.
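The seeding idea above replaces random initialization: each cluster's starting centroid is the mean of the labeled seed points for that class. A minimal sketch (function and variable names are ours):

```python
# Sketch of Seeded-KMeans-style initialization: labeled seed points
# determine each cluster's initial centroid instead of random starts.
from collections import defaultdict

def seed_centroids(seed_points, seed_labels):
    groups = defaultdict(list)
    for x, c in zip(seed_points, seed_labels):
        groups[c].append(x)
    # The mean of the seeds in each class becomes that cluster's start.
    return {c: tuple(sum(dim) / len(pts) for dim in zip(*pts))
            for c, pts in groups.items()}

seeds = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
labels = [0, 0, 1]
init = seed_centroids(seeds, labels)
assert init == {0: (1.0, 0.0), 1: (10.0, 10.0)}
```

Standard k-means (or EM) iterations then proceed from these informed starting points; the constrained variant additionally keeps the seed points fixed in their labeled clusters.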

Clustering with instance-level constraints

by Kiri Lou Wagstaff - In Proceedings of the Seventeenth International Conference on Machine Learning, 2000
Abstract - Cited by 202 (7 self)
One goal of research in artificial intelligence is to automate tasks that currently require human expertise; this automation is important because it saves time and brings problems that were previously too large to be solved into the feasible domain. Data analysis, or the ability to identify meaningful patterns and trends in large volumes of data, is an important task that falls into this category. Clustering algorithms are a particularly useful group of data analysis tools. These methods are used, for example, to analyze satellite images of the Earth to identify and categorize different land and foliage types or to analyze telescopic observations to determine what distinct types of astronomical bodies exist and to categorize each observation. However, most existing clustering methods apply general similarity techniques rather than making use of problem-specific information. This dissertation first presents a novel method for converting existing clustering algorithms into constrained clustering algorithms. The resulting methods are able to accept domain-specific information in the form of constraints on the output clusters. At the most general level, each constraint is an instance-level statement

Citation Context

...problem-specific information in unsupervised approaches. These new algorithms are neither supervised nor unsupervised but fall somewhere in between; they are sometimes referred to as semi-supervised (Cohn et al., 2003; Basu et al., 2002). In this chapter, we first provide some background on the various types of clustering algorithms (Section 2.1) before moving on to discuss how others have enhanced those algorithm...
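The instance-level constraints this dissertation introduces are enforced in COP-KMeans by a feasibility test at assignment time: a point may only join a cluster that contradicts none of its already-assigned constrained partners. A sketch of that check (names and the index-based constraint encoding are illustrative):

```python
# Sketch of a COP-KMeans-style feasibility test: before assigning point
# `idx` to `cluster`, check its already-assigned constrained partners.
# `assignment` maps point index -> cluster id.

def violates(idx, cluster, assignment, must_link, cannot_link):
    for i, j in must_link:
        if idx in (i, j):
            other = j if idx == i else i
            if other in assignment and assignment[other] != cluster:
                return True  # must-linked partner sits in another cluster
    for i, j in cannot_link:
        if idx in (i, j):
            other = j if idx == i else i
            if other in assignment and assignment[other] == cluster:
                return True  # cannot-linked partner already in this cluster
    return False

assignment = {0: 0}
assert violates(1, 1, assignment, [(0, 1)], []) is True   # breaks must-link
assert violates(1, 0, assignment, [(0, 1)], []) is False  # honors must-link
assert violates(2, 0, assignment, [], [(0, 2)]) is True   # breaks cannot-link
```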

Learning a distance metric from relative comparisons

by Matthew Schultz, Thorsten Joachims - In Proc. Advances in Neural Information Processing Systems, 2003
Abstract - Cited by 191 (0 self)
This paper presents a method for learning a distance metric from relative comparisons such as “A is closer to B than A is to C”. Taking a Support Vector Machine (SVM) approach, we develop an algorithm that provides a flexible way of describing qualitative training data as a set of constraints. We show that such constraints lead to a convex quadratic programming problem that can be solved by adapting standard methods for SVM training. We empirically evaluate the performance and the modelling flexibility of the algorithm on a collection of text documents.

Citation Context

...here. Secondly, their method does not use regularization. Related are also techniques for semi-supervised clustering, as it is also considered in [11]. While [10] does not change the distance metric, [2] uses gradient descent to adapt a parameterized distance metric according to user feedback. Other related work are dimension reduction techniques such as Multidimensional Scaling (MDS) [4] and Latent ...
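Each relative comparison from the abstract becomes a margin constraint on a parameterized metric: “A is closer to B than A is to C” requires d_w(A, C) - d_w(A, B) ≥ margin. The SVM formulation solves for the weights subject to many such constraints; the sketch below only checks a single triplet under a given diagonal metric (names, data, and the margin value are illustrative):

```python
# Sketch of the relative-comparison constraint: "A is closer to B than
# A is to C" holds under diagonal metric w when
#   d_w(A, C) - d_w(A, B) >= margin.

def weighted_sq_dist(x, y, w):
    return sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y))

def triplet_satisfied(a, b, c, w, margin=1.0):
    return weighted_sq_dist(a, c, w) - weighted_sq_dist(a, b, w) >= margin

a, b, c = (0.0, 0.0), (0.0, 1.0), (3.0, 0.0)
# Whether a comparison is satisfied depends on the metric's weights:
assert triplet_satisfied(a, b, c, (1.0, 1.0)) is True   # d(a,c)=9, d(a,b)=1
assert triplet_satisfied(a, b, c, (0.0, 1.0)) is False  # d(a,c)=0, d(a,b)=1
```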

Active Semi-Supervision for Pairwise Constrained Clustering

by Sugato Basu, Arindam Banerjee, Raymond J. Mooney - Proc. 4th SIAM Intl. Conf. on Data Mining (SDM-2004)
Abstract - Cited by 134 (9 self)
Semi-supervised clustering uses a small amount of supervised data to aid unsupervised learning. One typical approach specifies a limited number of must-link and cannot-link constraints between pairs of examples. This paper presents a pairwise constrained clustering framework and a new method for actively selecting informative pairwise constraints to get improved clustering performance. The clustering and active learning methods are both easily scalable to large datasets, and can handle very high dimensional data. Experimental and theoretical results confirm that this active querying of pairwise constraints significantly improves the accuracy of clustering when given a relatively small amount of supervision.

Citation Context

...ints while clustering. Other work with the pairwise constrained clustering model includes learning distance metrics for clustering from pairwise constraints [17, 22, 34]. In this domain, Cohn et al. [8] have proposed iterative user feedback to acquire constraints, but it was not an active learning algorithm. Active learning in the classification framework is a long-studied problem, where different pri...
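A hedged sketch of the exploration idea behind active constraint selection: farthest-first traversal picks well-separated points, so pairwise queries among them are likely to reveal distinct clusters. The actual algorithm's explore and consolidate phases differ in detail; this only shows the traversal itself:

```python
# Farthest-first traversal (sketch): repeatedly add the point whose
# nearest already-chosen point is farthest away. Querying the user about
# pairs among such points tends to yield informative cannot-links.

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def farthest_first(points, k, start=0):
    chosen = [start]
    while len(chosen) < k:
        nxt = max((i for i in range(len(points)) if i not in chosen),
                  key=lambda i: min(sq_dist(points[i], points[j])
                                    for j in chosen))
        chosen.append(nxt)
    return chosen

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (10.0, 0.0)]
# The near-duplicate point 1 is skipped in favor of spread-out points.
assert farthest_first(points, 3) == [0, 3, 2]
```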

Generative model-based document clustering: a comparative study

by Shi Zhong - Knowledge and Information Systems, 2005
Abstract - Cited by 48 (0 self)
Semi-supervised learning has become an attractive methodology for improving classification models and is often viewed as using unlabeled data to aid supervised learning. However, it can also be viewed as using labeled data to help clustering, namely, semi-supervised clustering. Viewing semi-supervised learning from a clustering angle is useful in practical situations when the set of labels available in labeled data are not complete, i.e., unlabeled data contain new classes that are not present in labeled data. This paper analyzes several multinomial model-based semi-supervised document clustering methods under a principled model-based clustering framework. The framework naturally leads to a deterministic annealing extension of existing semi-supervised clustering approaches. We compare three (slightly) different semi-supervised approaches for clustering documents: Seeded damnl, Constrained damnl, and Feedback-based damnl, where damnl stands for multinomial model-based deterministic annealing algorithm. The first two are extensions of the seeded k-means and constrained k-means algorithms studied by Basu et al. (2002); the last one is motivated by Cohn et al. (2003). Through empirical experiments on text datasets, we show that: (a) deterministic annealing can often significantly improve the performance of semi-supervised clustering; (b) the constrained approach is the best when available labels are complete whereas the feedback-based approach excels when available labels are incomplete.

Citation Context

...priate unlabeled data instances to existing categories. In semi-supervised clustering, labeled data can be used as initial seeds (Basu et al., 2002), constraints (Wagstaff et al., 2001), or feedback (Cohn et al., 2003). All these existing approaches are based on model-based clustering (Zhong & Ghosh, 2003) where each cluster is represented by its “centroid”. Seeded approaches use labeled data only to help initia...
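The deterministic-annealing extension softens cluster assignments with a temperature parameter: at high temperature assignments are near-uniform, and gradually lowering it hardens them toward the nearest centroid, which helps avoid poor local optima. A minimal sketch of temperature-controlled assignment (names and the Euclidean-style distance inputs are illustrative; the paper's damnl algorithms use multinomial models):

```python
# Sketch of deterministic-annealing-style soft assignment: a Boltzmann
# posterior over clusters given distances, controlled by temperature T.
import math

def soft_assign(dists, T):
    weights = [math.exp(-d / T) for d in dists]
    z = sum(weights)
    return [w / z for w in weights]

dists = [1.0, 2.0]               # point is slightly closer to cluster 0
hot = soft_assign(dists, 100.0)  # high T: near-uniform assignment
cold = soft_assign(dists, 0.1)   # low T: nearly hard assignment
assert abs(hot[0] - 0.5) < 0.01
assert cold[0] > 0.99
```

Annealing runs such updates while decreasing T on a schedule, interleaved with centroid re-estimation.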

Topic-bridged PLSA for Cross-Domain Text Classification

by Gui-rong Xue, Wenyuan Dai, Qiang Yang, Yong Yu
Abstract - Cited by 44 (2 self)
In many Web applications, such as blog classification and newsgroup classification, labeled data are in short supply. It often happens that obtaining labeled data in a new domain is expensive and time consuming, while there may be plenty of labeled data in a related but different domain. Traditional text classification approaches are not able to cope well with learning across different domains. In this paper, we propose a novel cross-domain text classification algorithm which extends the traditional probabilistic latent semantic analysis (PLSA) algorithm to integrate labeled and unlabeled data, which come from different but related domains, into a unified probabilistic model. We call this new model Topic-bridged PLSA, or TPLSA. By exploiting the common topics between two domains, we transfer knowledge across different domains through a topic-bridge to help the text classification in the target domain. A unique advantage of our method is its ability to maximally mine knowledge that can be transferred between domains, resulting in superior performance when compared to other state-of-the-art text classification approaches. Experimental evaluation on different kinds of datasets shows that our proposed algorithm can improve the performance of cross-domain text classification significantly.

Citation Context

...ter). It finds a balance between satisfying these constraints and optimizing the original clustering objective function. Several semi-supervised clustering algorithms have been proposed, including [1][3][10]. Our algorithm is essentially a classification algorithm in which the constraints given by the training data provide a class structure. It will be shown theoretically and empirically that our alg...

Learning to Combine Trained Distance Metrics for Duplicate Detection in Databases

by Mikhail Bilenko, Raymond J. Mooney, 2002
Abstract - Cited by 42 (3 self)
The problem of identifying approximately duplicate records in databases has previously been studied as record linkage, the merge/purge problem, hardening soft databases, and field matching. Most existing approaches have focused on efficient algorithms for locating potential duplicates rather than precise similarity metrics for comparing records. In this paper, we present a domain-independent method for improving duplicate detection accuracy using machine learning. First, trainable distance metrics are learned for each field, adapting to the specific notion of similarity that is appropriate for the field's domain. Second, a classifier is employed that uses several diverse metrics for each field as distance features and classifies pairs of records as duplicates or non-duplicates. We also propose an extended model of learnable string distance which improves over an existing approach. Experimental results on real and synthetic datasets show that our method outperforms traditional techniques.

Citation Context

... and deleted tokens, it would be desirable to develop learning methods for token-based metrics, such as Jaccard similarity or vector-space cosine distance. Previous work on semi-supervised clustering [4] has shown the usefulness of a similar approach: learning weights of individual words when calculating distance between documents using Kullback-Leibler divergence. Another area for future work lie...
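The two-stage design from the abstract, a per-field similarity feeding a classifier over record pairs, can be sketched with the token-based Jaccard similarity mentioned in the citation context. The linear combiner and its weights below merely stand in for the trained per-field metrics and classifier, and all names are illustrative:

```python
# Sketch of two-stage duplicate detection: per-field similarities
# (token-based Jaccard here) become features for a combiner scoring
# record pairs. A linear combiner stands in for the trained classifier.

def jaccard(a, b):
    """Token-set Jaccard similarity between two field strings."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def record_score(rec_a, rec_b, field_sims, weights):
    feats = [sim(x, y) for sim, x, y in zip(field_sims, rec_a, rec_b)]
    return sum(w * f for w, f in zip(weights, feats))

dup = record_score(("john smith", "42 oak st"),
                   ("john r smith", "42 oak street"),
                   (jaccard, jaccard), (0.5, 0.5))
non = record_score(("john smith", "42 oak st"),
                   ("mary jones", "9 elm ave"),
                   (jaccard, jaccard), (0.5, 0.5))
assert dup > non  # the near-duplicate pair scores higher
```

Thresholding (or classifying) the combined score then labels pairs as duplicates or non-duplicates.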

Unsupervised and semisupervised clustering: a brief survey

by Nizar Grira, Michel Crucianu, Nozha Boujemaa - 7th ACM SIGMM international workshop on Multimedia information retrieval
Abstract - Cited by 39 (0 self)
Clustering (or cluster analysis) aims to organize a collection of data items into clusters, such that items within a cluster are more “similar” to each other than they are to items in the other clusters. This notion of similarity can be expressed in very different ways, according to the purpose of the study, to domain-specific assumptions and to prior knowledge of the problem. Clustering is usually performed when no information is available concerning the membership of data items to predefined classes. For this reason, clustering is traditionally seen as part of unsupervised learning. We nevertheless speak here of unsupervised clustering to distinguish it from a more recent and less common approach that makes use of a small amount of supervision to “guide” or “adjust” clustering (see section 2). To support the extensive use of clustering in computer vision, pattern recognition, information retrieval, data mining, etc., very many different methods were developed in several communities. Detailed surveys of this domain can be found in [25], [27] or [26]. In the following, we attempt to briefly review a few core concepts of cluster analysis and describe categories of clustering methods that are best represented in the literature. We also take this opportunity to provide some pointers to more recent work on clustering.

Citation Context

...available constraints can be easier satisfied. Several similarity measures were employed for similarity-adapting semi-supervised clustering: the Jensen-Shannon divergence trained with gradient descent [10], the Euclidean distance modified by a shortest-path algorithm [28] or Mahalanobis distances adjusted by convex optimization [38], [8]. Among the clustering algorithms using such adapted similarity me...


Developed at and hosted by The College of Information Sciences and Technology

© 2007-2019 The Pennsylvania State University