Results 1–10 of 294
Clustering: Science or Art?
 NIPS 2009 Workshop on Clustering Theory
, 2009
Abstract

Cited by 27 (1 self)
This paper deals with the question of whether the quality of different clustering algorithms can be compared by a general, scientifically sound procedure which is independent of particular clustering algorithms. In our opinion, the major obstacle is the difficulty of evaluating a clustering algorithm without taking into account the context: why does the user cluster his data in the first place, and what does he want to do with the clustering afterwards? We suggest that clustering should not be treated as an application-independent mathematical problem, but should always be studied in the context of its end use. Different techniques to evaluate clustering algorithms have to be developed for different uses of clustering. To simplify this procedure it will be useful to build a “taxonomy of clustering problems” to identify clustering applications which can be treated in a unified way. Preamble: Every year, dozens of papers on clustering algorithms get published. Researchers continuously invent new clustering algorithms and work on improving existing ones.
A Game-Theoretic Approach to Hypergraph Clustering
, 2009
Abstract

Cited by 26 (2 self)
Hypergraph clustering refers to the process of extracting maximally coherent groups from a set of objects using high-order (rather than pairwise) similarities. Traditional approaches to this problem are based on the idea of partitioning the input data into a user-defined number of classes, thereby obtaining the clusters as a by-product of the partitioning process. In this paper, we provide a radically different perspective to the problem. In contrast to the classical approach, we attempt to provide a meaningful formalization of the very notion of a cluster and we show that game theory offers an attractive and unexplored perspective that serves well our purpose. Specifically, we show that the hypergraph clustering problem can be naturally cast into a non-cooperative multi-player “clustering game”, whereby the notion of a cluster is equivalent to a classical game-theoretic equilibrium concept. From the computational viewpoint, we show that the problem of finding the equilibria of our clustering game is equivalent to locally optimizing a polynomial function over the standard simplex, and we provide a discrete-time dynamics to perform this optimization. Experiments are presented which show the superiority of our approach over state-of-the-art hypergraph clustering techniques.
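In the pairwise special case of this formulation, the discrete-time dynamics is the classical replicator update: it locally maximises the quadratic form xᵀAx over the standard simplex, and the support of the fixed point is read off as one cluster. A minimal sketch of that pairwise case (the paper's formulation covers general polynomials for higher-order similarities; the toy matrix `A` and threshold below are illustrative):

```python
import numpy as np

def replicator_dynamics(A, iters=1000, tol=1e-12):
    """Discrete-time replicator dynamics: locally maximises x^T A x over
    the standard simplex for a non-negative symmetric similarity matrix A.
    The support of the limit point is interpreted as one cluster."""
    n = A.shape[0]
    x = np.full(n, 1.0 / n)            # start from the simplex barycentre
    for _ in range(iters):
        Ax = A @ x
        x_new = x * Ax / (x @ Ax)      # multiplicative update, stays on simplex
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x

# Toy similarity graph: a tight triangle {0, 1, 2} and a lone edge {3, 4}
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
x = replicator_dynamics(A)
cluster = np.where(x > 1e-6)[0]        # mass concentrates on the triangle
```

Peeling off one equilibrium at a time (and removing its support) yields successive clusters without fixing their number in advance.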
MAD-Bayes: MAP-based Asymptotic Derivations from Bayes
Abstract

Cited by 13 (2 self)
The classical mixture of Gaussians model is related to K-means via small-variance asymptotics: as the covariances of the Gaussians tend to zero, the negative log-likelihood of the mixture of Gaussians model approaches the K-means objective, and the EM algorithm approaches the K-means algorithm. Kulis & Jordan (2012) used this observation to obtain a novel K-means-like algorithm from a Gibbs sampler for the Dirichlet process (DP) mixture. We instead consider applying small-variance asymptotics directly to the posterior in Bayesian nonparametric models. This framework is independent of any specific Bayesian inference algorithm, and it has the major advantage that it generalizes immediately to a range of models beyond the DP mixture. To illustrate, we apply our framework to the feature learning setting, where the beta process and Indian buffet process provide an appropriate Bayesian nonparametric prior. We obtain a novel objective function that goes beyond clustering to learn (and penalize new) groupings for which we relax the mutual exclusivity and exhaustivity assumptions of clustering. We demonstrate several other algorithms, all of which are scalable and simple to implement. Empirical results demonstrate the benefits of the new framework.
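The Kulis & Jordan derivation this paper builds on yields the DP-means algorithm: Lloyd-style updates plus a rule that opens a new cluster whenever a point lies farther than a penalty λ from every existing centre. A minimal sketch (the toy data and the value of λ are illustrative):

```python
import numpy as np

def dp_means(X, lam, iters=50):
    """DP-means: the K-means-like algorithm obtained from small-variance
    asymptotics of the DP mixture (Kulis & Jordan, 2012). `lam` is the
    penalty paid for opening a new cluster."""
    centers = [X.mean(axis=0)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        for i, x in enumerate(X):
            d2 = np.array([np.sum((x - c) ** 2) for c in centers])
            j = int(d2.argmin())
            if d2[j] > lam:                # farther than lam from every centre:
                centers.append(x.copy())   # open a new cluster at this point
                j = len(centers) - 1
            labels[i] = j
        # recompute centres; keep the old centre if a cluster went empty
        centers = [X[labels == k].mean(axis=0) if np.any(labels == k)
                   else centers[k] for k in range(len(centers))]
    return labels, centers

# Two tight, well-separated blobs; lam sits between the two length scales
X = np.vstack([np.random.default_rng(0).normal(0, 0.1, (10, 2)),
               np.random.default_rng(1).normal(8, 0.1, (10, 2))])
labels, centers = dp_means(X, lam=4.0)
```

Note that the number of clusters is not an input: it emerges from the trade-off between fit and the per-cluster penalty λ, mirroring how the DP concentration parameter survives the asymptotics.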
Helping Users Sort Faster with Adaptive Machine Learning Recommendations
 In Proceedings of the 13th International Conference on Human-Computer Interaction
, 2011
Abstract

Cited by 12 (0 self)
Abstract. Sorting and clustering large numbers of documents can be an overwhelming task: manual solutions tend to be slow, while machine learning systems often present results that don't align well with users' intents. We created and evaluated a system for helping users sort large numbers of documents into clusters. iCluster has the capability to recommend new items for existing clusters and appropriate clusters for items. The recommendations are based on a learning model that adapts over time: as the user adds more items to a cluster, the system's model improves and the recommendations become more relevant. Thirty-two subjects used iCluster to sort hundreds of data items both with and without recommendations; we found that recommendations allow users to sort items more rapidly. A pool of 161 raters then assessed the quality of the resulting clusters, finding that clusters generated with recommendations were of statistically indistinguishable quality. Both the manual and assisted methods were substantially better than a fully automatic method.
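The paper does not publish iCluster's model, but the interaction it describes — ranking unassigned items against what the user has already placed in a cluster, with the ranking adapting as members are added — can be sketched with a generic centroid-plus-cosine-similarity recommender. All names and the toy vectors here are illustrative, not the paper's method:

```python
import numpy as np

def recommend_for_cluster(item_vecs, member_ids, k=3):
    """Rank items not yet in the cluster by cosine similarity to the
    centroid of its current members. As the user adds members, the
    centroid (the 'model') adapts and the ranking changes."""
    centroid = item_vecs[member_ids].mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    sims = item_vecs @ centroid / np.linalg.norm(item_vecs, axis=1)
    sims[member_ids] = -np.inf           # never re-recommend current members
    return np.argsort(-sims)[:k]

# Toy feature vectors: items 0-2 are alike, item 3 is different
items = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]])
recs = recommend_for_cluster(items, [0, 1], k=2)   # item 2 should rank first
```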
Better Cross Company Defect Prediction
Abstract

Cited by 11 (0 self)
Abstract—How can we find data for quality prediction? Early in the life cycle, projects may lack the data needed to build such predictors. Prior work assumed that relevant training data was found nearest to the local project. But is this the best approach? This paper introduces the Peters filter, which is based on the following conjecture: when local data is scarce, more information exists in other projects. Accordingly, this filter selects training data via the structure of other projects. To assess the performance of the Peters filter, we compare it with two other approaches for quality prediction: within-company learning, and cross-company learning with the Burak filter (the state-of-the-art relevancy filter). This paper finds that: 1) within-company predictors are weak for small datasets; 2) the Peters filter + cross-company data builds better predictors than both within-company learning and the Burak filter + cross-company data; and 3) the Peters filter builds 64% more useful predictors than both the within-company and Burak filter + cross-company approaches. Hence, we recommend the Peters filter for cross-company learning. Index Terms—Cross-company; defect prediction; data mining
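The abstract's conjecture — let the structure of the *other* projects drive selection, rather than searching for neighbours of the local data — can be illustrated with a simplified filter in that spirit: each cross-company instance nominates its nearest local instance, and per local instance only the closest nominator is kept as training data. This is a loose sketch of the idea, not the paper's exact algorithm:

```python
import numpy as np

def cross_company_filter(local_X, cross_X):
    """Select cross-company training rows: each cross instance nominates
    its nearest local instance; for every local instance we keep only the
    closest of its nominators. Selection is driven by the cross data's
    structure, not by a local nearest-neighbour search."""
    # squared distances: cross rows x local columns
    d = ((cross_X[:, None, :] - local_X[None, :, :]) ** 2).sum(axis=-1)
    nominee = d.argmin(axis=1)           # nearest local point per cross point
    keep = []
    for j in range(len(local_X)):
        voters = np.where(nominee == j)[0]
        if voters.size:
            keep.append(int(voters[d[voters, j].argmin()]))
    return sorted(set(keep))             # row indices into cross_X

local = np.array([[0.0, 0.0], [10.0, 10.0]])
cross = np.array([[1.0, 1.0], [2.0, 2.0], [9.0, 9.0], [50.0, 50.0]])
selected = cross_company_filter(local, cross)   # outlier [50, 50] is dropped
```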
Efficient kernel clustering using random Fourier features
 In Proceedings of ICDM’12
, 2012
Abstract

Cited by 9 (1 self)
Abstract—Kernel clustering algorithms have the ability to capture the nonlinear structure inherent in many real-world data sets and thereby achieve better clustering performance than Euclidean distance based clustering algorithms. However, their quadratic computational complexity renders them non-scalable to large data sets. In this paper, we employ random Fourier maps, originally proposed for large-scale classification, to accelerate kernel clustering. The key idea behind the use of random Fourier maps for clustering is to project the data into a low-dimensional space where the inner product of the transformed data points approximates the kernel similarity between them. An efficient linear clustering algorithm can then be applied to the points in the transformed space. We also propose an improved scheme which uses the top singular vectors of the transformed data matrix to perform clustering, and yields a better approximation of kernel clustering under appropriate conditions. Our empirical studies demonstrate that the proposed schemes can be efficiently applied to large data sets containing millions of data points, while achieving accuracy similar to that achieved by state-of-the-art kernel clustering algorithms. Keywords—Kernel clustering; kernel k-means; random Fourier features; scalability
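The key idea — an explicit low-dimensional map whose inner products approximate an RBF kernel — is a few lines of NumPy (random Fourier features in the Rahimi & Recht style; the feature dimension and bandwidth below are illustrative). Ordinary k-means can then run on `Z` instead of kernel k-means on the full n × n kernel matrix:

```python
import numpy as np

def rff_map(X, D, gamma, rng):
    """Random Fourier features: a map z with z(x) . z(y) approximating
    exp(-gamma * ||x - y||^2), i.e. the RBF kernel, via D random
    cosine features."""
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(X.shape[1], D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
Z = rff_map(X, D=2000, gamma=0.5, rng=rng)
# inner products in the transformed space track the true kernel values
k_true = np.exp(-0.5 * np.sum((X[0] - X[1]) ** 2))
k_hat = Z[0] @ Z[1]
```

The paper's improved scheme corresponds to clustering on the top singular vectors of `Z` rather than on `Z` itself, which tightens the approximation under the stated conditions.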
Automatic virtual machine clustering based on Bhattacharyya distance for multi-cloud systems
 in Proc. of International Workshop on Multi-cloud Applications and Federated Clouds
, 2013
Abstract

Cited by 6 (5 self)
Size and complexity of modern data centers pose scalability issues for the resource monitoring system supporting management operations, such as server consolidation. When we pass from cloud to multi-cloud systems, scalability issues are exacerbated by the need to manage geographically distributed data centers and exchange monitored data across them. While existing solutions typically consider every Virtual Machine (VM) as a black box with independent characteristics, we claim that scalability issues in multi-cloud systems could be addressed by clustering together VMs that show similar behaviors in terms of resource usage. In this paper, we propose an automated methodology to cluster VMs starting from the usage of multiple resources, assuming no knowledge of the services executed on them. This innovative methodology exploits the Bhattacharyya distance to measure the similarity of the probability distributions of VM resource usage, and automatically selects the most relevant resources to consider for the clustering process. The methodology is evaluated through a set of experiments with data from a cloud provider. We show that our proposal achieves high and stable performance in terms of automatic VM clustering. Moreover, we estimate the reduction in the amount of data collected to support system management in the considered scenario, thus showing how the proposed methodology may reduce the monitoring requirements in multi-cloud systems.
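The distance at the core of the methodology compares two usage distributions through their Bhattacharyya coefficient. A minimal sketch, where the toy histograms stand in for binned samples of a VM's resource usage:

```python
import numpy as np

def bhattacharyya_distance(p, q, eps=1e-12):
    """Bhattacharyya distance -ln(BC) between two discrete distributions,
    where BC = sum_i sqrt(p_i * q_i): 0 for identical distributions,
    large for non-overlapping ones."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()                      # normalise histograms to
    q = q / q.sum()                      # probability distributions
    bc = np.sum(np.sqrt(p * q))          # Bhattacharyya coefficient
    return -np.log(max(bc, eps))         # eps guards against log(0)

# identical CPU-usage histograms -> distance ~0; disjoint ones -> large
d_same = bhattacharyya_distance([0.2, 0.5, 0.3], [0.2, 0.5, 0.3])
d_diff = bhattacharyya_distance([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])
```

Computing this pairwise per resource yields a VM-by-VM distance matrix that any off-the-shelf clustering algorithm can consume.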
Semi-supervised Clustering by Input Pattern Assisted Pairwise Similarity Matrix Completion
Abstract

Cited by 6 (2 self)
Many semi-supervised clustering algorithms have been proposed to improve the clustering accuracy by effectively exploring the available side information, which is usually in the form of pairwise constraints. However, there are two main shortcomings of the existing semi-supervised clustering algorithms. First, they have to deal with non-convex optimization problems, leading to clustering results that are sensitive to the initialization. Second, none of these algorithms is equipped with a theoretical guarantee regarding the clustering performance. We address these limitations by developing a framework for semi-supervised clustering based on input pattern assisted matrix completion. The key idea is to cast clustering into a matrix completion problem, and solve it efficiently by exploiting the correlation between input patterns and cluster assignments. Our analysis shows that under appropriate conditions, only O(log n) pairwise constraints are needed to accurately recover the true cluster partition. We verify the effectiveness of the proposed algorithm by comparing it to state-of-the-art semi-supervised clustering algorithms on several benchmark datasets.
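The casting of clustering as matrix completion can be illustrated with a bare-bones hard-impute loop: observed entries of the pairwise similarity matrix come from must-link (1) / cannot-link (0) constraints, and unobserved entries are filled by a low-rank SVD fit. The paper's algorithm additionally exploits input patterns and carries recovery guarantees; this sketch shows only the completion idea on a toy partition:

```python
import numpy as np

def hard_impute(S_obs, mask, rank=2, iters=100):
    """Fill the unobserved entries of a partial cluster-similarity matrix
    (1 = must-link, 0 = cannot-link) with a rank-`rank` SVD fit, keeping
    the observed entries fixed at every step."""
    S = np.where(mask, S_obs, 0.5)              # initialise unknowns at 0.5
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(S, full_matrices=False)
        low = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # best rank-k fit
        S = np.where(mask, S_obs, low)          # re-impose the observations
    return S

# true partition {0, 1} / {2, 3}; four constraint entries are hidden
S_true = np.array([[1, 1, 0, 0],
                   [1, 1, 0, 0],
                   [0, 0, 1, 1],
                   [0, 0, 1, 1]], dtype=float)
mask = np.ones((4, 4), dtype=bool)
for i, j in [(0, 3), (3, 0), (1, 2), (2, 1)]:
    mask[i, j] = False
S_hat = hard_impute(S_true, mask)   # hidden entries are recovered as ~0
```

The rank-2 structure of a two-cluster similarity matrix is what lets so few observed constraints pin down the rest, which is the intuition behind the O(log n) sample bound.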
Clustering on Grassmann Manifolds via Kernel Embedding with Application to Action Analysis
Abstract

Cited by 5 (2 self)
With the aim of improving the clustering of data (such as image sequences) lying on Grassmann manifolds, we propose to embed the manifolds into Reproducing Kernel Hilbert Spaces. To this end, we define a measure of cluster distortion and embed the manifolds such that the distortion is minimised. We show that the optimal solution is a generalised eigenvalue problem that can be solved very efficiently. Experiments on several clustering tasks (including human action clustering) show that in comparison to the recent intrinsic Grassmann k-means algorithm, the proposed approach obtains notable improvements in clustering accuracy, while also being several orders of magnitude faster.
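The reduction to a generalised eigenvalue problem A v = λ B v is what makes the embedding fast to compute. A minimal self-contained solver via the standard Cholesky reduction (the small matrices below are illustrative placeholders, not the paper's distortion and constraint matrices):

```python
import numpy as np

def gen_eigh(A, B):
    """Solve the symmetric-definite generalised eigenproblem A v = lambda B v
    via Cholesky reduction: with B = L L^T and w = L^T v, the problem becomes
    the ordinary symmetric eigenproblem (L^-1 A L^-T) w = lambda w."""
    L = np.linalg.cholesky(B)
    Linv = np.linalg.inv(L)
    vals, W = np.linalg.eigh(Linv @ A @ Linv.T)
    V = np.linalg.solve(L.T, W)          # map eigenvectors back: v = L^-T w
    return vals, V

A = np.array([[2.0, 1.0], [1.0, 2.0]])   # placeholder "distortion" matrix
B = np.array([[1.0, 0.0], [0.0, 2.0]])   # placeholder SPD "constraint" matrix
vals, V = gen_eigh(A, B)
residual = A @ V[:, 0] - vals[0] * (B @ V[:, 0])   # should be ~0
```

Because such problems reduce to a single symmetric eigendecomposition, they avoid the iterative optimisation that makes intrinsic manifold k-means slow.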