Results 11–20 of 206
Correlation Clustering in General Weighted Graphs
Theoretical Computer Science, 2006
Abstract

Cited by 41 (0 self)
We consider the following general correlation-clustering problem [1]: given a graph with real non-negative edge weights and a 〈+〉/〈−〉 edge labeling, partition the vertices into clusters to minimize the total weight of cut 〈+〉 edges and uncut 〈−〉 edges. Thus, 〈+〉 edges with large weights (representing strong correlations between endpoints) encourage those endpoints to belong to a common cluster, while 〈−〉 edges with large weights encourage the endpoints to belong to different clusters. In contrast to most clustering problems, correlation clustering specifies neither the desired number of clusters nor a distance threshold for clustering; both of these parameters are effectively chosen to be the best possible by the problem definition. Correlation clustering was introduced by Bansal, Blum, and Chawla [1], motivated by both document clustering and agnostic learning. They proved NP-hardness and gave constant-factor approximation algorithms for the special case in which the graph is complete (full information) and every edge has the same weight. We give an O(log n)-approximation algorithm for the general case based on a linear-programming rounding and the “region-growing” technique. We also prove that this linear program has a gap of Ω(log n), and therefore our approximation is tight under this approach. We also give an O(r^3)-approximation algorithm for K_{r,r}-minor-free graphs. On the other hand, we show that the problem is equivalent to minimum multicut, and therefore APX-hard and difficult to approximate better than Θ(log n).
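As a concrete illustration of the objective described in this abstract, the sketch below scores a candidate partition by summing the weight of cut 〈+〉 edges and uncut 〈−〉 edges. The tiny graph, weights, and function names are invented for the example, not taken from the paper.

```python
def cc_cost(edges, labels):
    """Correlation-clustering objective: total weight of cut '+' edges
    plus uncut '-' edges, for a given vertex-to-cluster assignment."""
    cost = 0.0
    for u, v, w, sign in edges:
        same_cluster = labels[u] == labels[v]
        if sign == '+' and not same_cluster:
            cost += w          # '+' edge split across clusters: penalized
        elif sign == '-' and same_cluster:
            cost += w          # '-' edge kept inside one cluster: penalized
    return cost

# Toy instance: a heavy '+' edge a-b, a '-' edge b-c, a light '+' edge a-c.
edges = [('a', 'b', 2.0, '+'), ('b', 'c', 1.0, '-'), ('a', 'c', 0.5, '+')]

one_cluster = cc_cost(edges, {'a': 0, 'b': 0, 'c': 0})  # pays the '-' edge
split_c = cc_cost(edges, {'a': 0, 'b': 0, 'c': 1})      # pays the 0.5 '+' edge
print(one_cluster, split_c)  # 1.0 0.5
```

Note that, as the abstract emphasizes, neither a number of clusters nor a threshold is supplied: the better partition (splitting c off, here) emerges from the objective alone.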
Unsupervised and semi-supervised clustering: a brief survey
 7th ACM SIGMM international workshop on Multimedia information retrieval
Abstract

Cited by 40 (0 self)
Clustering (or cluster analysis) aims to organize a collection of data items into clusters, such that items within a cluster are more “similar” to each other than they are to items in the other clusters. This notion of similarity can be expressed in very different ways, according to the purpose of the study, to domain-specific assumptions, and to prior knowledge of the problem. Clustering is usually performed when no information is available concerning the membership of data items to predefined classes. For this reason, clustering is traditionally seen as part of unsupervised learning. We nevertheless speak here of unsupervised clustering to distinguish it from a more recent and less common approach that makes use of a small amount of supervision to “guide” or “adjust” clustering (see Section 2). To support the extensive use of clustering in computer vision, pattern recognition, information retrieval, data mining, etc., a great many different methods have been developed in several communities. Detailed surveys of this domain can be found in [25], [27] or [26]. In the following, we attempt to briefly review a few core concepts of cluster analysis and describe the categories of clustering methods that are best represented in the literature. We also take this opportunity to provide some pointers to more recent work on clustering.
Flexible constrained spectral clustering
In KDD, 2010
Abstract

Cited by 38 (4 self)
Constrained clustering has been well-studied for algorithms like K-means and hierarchical agglomerative clustering. However, how to encode constraints into spectral clustering remains a developing area. In this paper, we propose a flexible and generalized framework for constrained spectral clustering. In contrast to some previous efforts that implicitly encode Must-Link and Cannot-Link constraints by modifying the graph Laplacian or the resultant eigenspace, we present a more natural and principled formulation, which preserves the original graph Laplacian and explicitly encodes the constraints. Our method offers several practical advantages: it can encode the degree of belief (weight) in Must-Link and Cannot-Link constraints; it guarantees a lower bound on how well the given constraints are satisfied, using a user-specified threshold; and it can be solved deterministically in polynomial time through generalized eigendecomposition. Furthermore, by inheriting the objective function from spectral clustering and explicitly encoding the constraints, much of the existing analysis of spectral clustering techniques remains valid. Consequently, our work can be posed as a natural extension to unconstrained spectral clustering and be interpreted as finding the normalized min-cut of a labeled graph. We validate the effectiveness of our approach with empirical results on real-world data sets, with applications to constrained image segmentation and clustering benchmark data sets with both binary and degree-of-belief constraints.
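The abstract's idea of weighting Must-Link and Cannot-Link constraints by degree of belief can be sketched with a constraint matrix Q whose quadratic form u'Qu scores how well a relaxed cluster-indicator vector u agrees with the constraints. This is only an illustrative fragment under our own naming, not the paper's full generalized-eigendecomposition algorithm.

```python
import numpy as np

def constraint_matrix(n, must_links, cannot_links):
    """Q[i, j] > 0 encodes Must-Link belief between i and j;
    Q[i, j] < 0 encodes Cannot-Link belief. Magnitude = degree of belief."""
    Q = np.zeros((n, n))
    for i, j, w in must_links:
        Q[i, j] = Q[j, i] = w      # prefer same cluster
    for i, j, w in cannot_links:
        Q[i, j] = Q[j, i] = -w     # prefer different clusters
    return Q

def satisfaction(u, Q):
    """u' Q u: large when the relaxed indicator u agrees with Q."""
    return float(u @ Q @ u)

Q = constraint_matrix(4, must_links=[(0, 1, 1.0)], cannot_links=[(0, 2, 1.0)])
u_good = np.array([1.0, 1.0, -1.0, 0.0])  # 0 and 1 together, 2 apart
u_bad = np.array([1.0, -1.0, 1.0, 0.0])   # violates both constraints
print(satisfaction(u_good, Q), satisfaction(u_bad, Q))  # 4.0 -4.0
```

A threshold on this score is one way to picture the paper's "lower-bound how well the constraints are satisfied" guarantee, though the actual formulation couples the score with the graph Laplacian.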
Clustering by Committee
, 2003
Abstract

Cited by 36 (1 self)
… children, the narratives that capture our thoughts, and the stories that shape our world. In this work, we present some recent advances in automatically acquiring knowledge from text. We propose a general-purpose clustering algorithm called CBC (Clustering By Committee), with which we organize documents according to topic as well as discover concepts and word senses. We explore the value of these systems by experimenting with two novel evaluation methodologies that attempt to define what a word sense is and to assess the quality of a particular clustering.
Simultaneous Unsupervised Learning of Disparate Clusterings
Abstract

Cited by 35 (0 self)
Most clustering algorithms produce a single clustering for a given data set, even when the data can be clustered naturally in multiple ways. In this paper, we address the difficult problem of uncovering disparate clusterings from the data in a totally unsupervised manner. We propose two new approaches for this problem. In the first approach, we aim to find good clusterings of the data that are also decorrelated with one another. To this end, we give a new and tractable characterization of decorrelation between clusterings, and present an objective function to capture it. We provide an iterative “decorrelated” k-means-type algorithm to minimize this objective function. In the second approach, we model the data as a sum of mixtures and associate each mixture with a clustering. This approach leads us to the problem of learning a convolution of mixture distributions. Though the latter problem can be formulated as one of factorial learning [8, 13, 16], the existing formulations and methods do not perform well on many real high-dimensional data sets. We propose a new regularized factorial learning framework that is more suitable for capturing the notion of disparate clusterings in modern, high-dimensional data sets. The resulting algorithm does well in uncovering multiple clusterings, and is much improved over existing methods. We evaluate our methods on two real-world data sets: a music data set from the text-mining domain, and a portrait data set from the computer-vision domain. Our methods achieve substantially higher accuracy than existing factorial learning as well as traditional clustering algorithms.
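To make "disparate clusterings" concrete, one simple proxy (our own illustration; the paper gives its own characterization of decorrelation) compares two clusterings by how often they make the same together/apart decision for each pair of points. Disparate clusterings should score low on this agreement measure.

```python
from itertools import combinations

def pair_agreement(c1, c2):
    """Fraction of point pairs on which two clusterings make the same
    together/apart decision; 1.0 means identical pair structure."""
    pairs = list(combinations(sorted(c1), 2))
    same = sum((c1[a] == c1[b]) == (c2[a] == c2[b]) for a, b in pairs)
    return same / len(pairs)

# Two hypothetical clusterings of four items, e.g. by color vs. by shape.
by_color = {'a': 0, 'b': 0, 'c': 1, 'd': 1}
by_shape = {'a': 0, 'b': 1, 'c': 0, 'd': 1}

print(pair_agreement(by_color, by_color))  # identical clusterings: 1.0
print(pair_agreement(by_color, by_shape))  # disparate clusterings: 1/3
```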
Active co-analysis of a set of shapes
ACM Trans. on Graphics (SIGGRAPH Asia), 2012
Abstract

Cited by 33 (9 self)
Figure 1: Overview of our active co-analysis: (a) We start with an initial unsupervised co-segmentation of the input set. (b) During active learning, the system automatically suggests constraints that would refine the results, and the user interactively adds constraints as appropriate. In this example, the user adds a cannot-link constraint (in red) and a must-link constraint (in blue) between segments. (c) The constraints are propagated to the set and the co-segmentation is refined. The process from (b) to (c) is repeated until the desired result is obtained. Unsupervised co-analysis of a set of shapes is a difficult problem, since the geometry of the shapes alone cannot always fully describe the semantics of the shape parts. In this paper, we propose a semi-supervised learning method in which the user actively assists in the co-analysis by iteratively providing inputs that progressively constrain the system. We introduce a novel constrained clustering method based on a spring system, which embeds elements to better respect their inter-distances in feature space together with the user-given set of constraints. We also present an active learning method that suggests to the user where his input is likely to be most effective in refining the results. We show that each single pair of constraints affects many relations across the set. Thus, the method requires only a sparse set of constraints to quickly converge toward a consistent and error-free semantic labeling of the set.
Near-duplicate Detection by Instance-level Constrained Clustering
In Proceedings of the 29th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '06), 2006
Abstract

Cited by 33 (5 self)
For the task of near-duplicate document detection, neither traditional fingerprinting techniques used in the database community nor bag-of-words comparison approaches used in the information retrieval community are sufficiently accurate. This is due to the fact that the characteristics of near-duplicate documents differ from those of both “almost-identical” documents in the data-cleaning task and “relevant” documents in the search task. This paper presents an instance-level constrained clustering approach for near-duplicate detection. The framework incorporates information such as document attributes and content structure into the clustering process to form near-duplicate clusters. On several collections of public comments sent to U.S. government agencies on proposed new regulations, the experimental results demonstrate that our approach outperforms other near-duplicate detection algorithms and is about as effective as human assessors.
Detecting Communities in Social Networks using Max-Min Modularity
Abstract

Cited by 33 (4 self)
Many datasets can be described in the form of graphs or networks, where nodes represent entities and edges represent relationships between pairs of entities. A common property of these networks is their community structure: clusters of densely connected groups of vertices, with only sparser connections between groups. The identification of such communities relies on some notion of clustering or density measure, which defines the communities that can be found. However, previous community detection methods usually apply the same structural measure to all kinds of networks, despite their distinct features. In this paper, we present a new community mining measure, Max-Min Modularity, which considers both connected pairs and criteria defined by domain experts in finding communities, and we then specify a hierarchical clustering algorithm to detect communities in networks. When applied to real-world networks for which the community structures are already known, our method shows improvement over previous algorithms. In addition, when applied to randomly generated networks for which we only have approximate information about communities, it gives promising results, showing the algorithm's robustness against noise.
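Max-Min Modularity extends the standard (Newman) modularity measure. As background, the sketch below computes standard modularity Q = Σ_c (e_c/m − (d_c/2m)²) for an undirected graph, where m is the edge count, e_c the number of intra-community edges, and d_c the total degree of community c; the example graph is ours, not from the paper.

```python
def modularity(edges, communities):
    """Standard Newman modularity for an undirected graph:
    Q = sum over communities c of (e_c / m - (d_c / 2m)^2)."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    q = 0.0
    for c in set(communities.values()):
        members = {n for n, cc in communities.items() if cc == c}
        e_c = sum(1 for u, v in edges if u in members and v in members)
        d_c = sum(degree[n] for n in members)
        q += e_c / m - (d_c / (2 * m)) ** 2
    return q

# Two triangles joined by a single bridge edge: a clear two-community graph.
edges = [('a', 'b'), ('b', 'c'), ('a', 'c'),
         ('d', 'e'), ('e', 'f'), ('d', 'f'),
         ('c', 'd')]
two_communities = {'a': 0, 'b': 0, 'c': 0, 'd': 1, 'e': 1, 'f': 1}
one_community = {n: 0 for n in 'abcdef'}

print(modularity(edges, two_communities))  # 5/14 ~ 0.357: structure found
print(modularity(edges, one_community))    # 0.0: no structure detected
```

Max-Min Modularity then adds a penalty term for unrelated pairs placed together, using expert-defined criteria; that extension is not reproduced here.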
Learning with constrained and unlabeled data
In CVPR, 2005
Abstract

Cited by 28 (3 self)
Classification problems arise abundantly in many computer vision tasks, whether of a supervised, semi-supervised or unsupervised nature. Even when class labels are not available, a user still might favor certain grouping solutions over others. This bias can be expressed either by providing a clustering criterion or cost function and, in addition to that, by specifying pairwise constraints on the assignment of objects to classes. In this work, we discuss a unifying formulation for labeled and unlabeled data that can incorporate constrained data for model fitting. Our approach models the constraint information by the maximum entropy principle. This modeling strategy allows us (i) to handle constraint violations and soft constraints and, at the same time, (ii) to speed up the optimization process. Experimental results on face classification and image segmentation indicate that the proposed algorithm is computationally efficient and generates superior groupings when compared with alternative techniques.
Spectral Domain-Transfer Learning
KDD '08, 2008
Abstract

Cited by 27 (3 self)
Traditional spectral classification has proved effective in dealing with both labeled and unlabeled data when these data are from the same domain. In many real-world applications, however, we wish to make use of labeled data from one domain (called in-domain) to classify the unlabeled data in a different domain (out-of-domain). This problem often arises when obtaining labeled data in one domain is difficult while there are plenty of labeled data from a related but different domain. In general, this is a transfer learning problem, where we wish to classify the unlabeled data through the labeled data even though these data are not from the same domain. In this paper, we formulate this domain-transfer learning problem under a novel spectral classification framework, where the objective function is introduced to seek consistency between the in-domain supervision and the out-of-domain intrinsic structure. Through optimization of the cost function, the label information from the in-domain data is effectively transferred to help classify the unlabeled data from the out-of-domain. We conduct extensive experiments to evaluate our method and show that our algorithm achieves significant improvements in classification performance over many state-of-the-art algorithms.