Results 1–10 of 417
Survey of clustering algorithms. IEEE Transactions on Neural Networks, 2005.
Cited by 499 (4 self).
Data analysis plays an indispensable role in understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several closely related topics, such as proximity measures and cluster validation, are also discussed.
SurroundSense: Mobile Phone Localization via Ambience
Cited by 156 (3 self).
A growing number of mobile computing applications are centered around the user’s location. The notion of location is broad, ranging from physical coordinates (latitude/longitude) to logical labels (like Starbucks, McDonalds). While extensive research has been performed in physical localization, there have been few attempts at recognizing logical locations. This paper argues that the increasing number of sensors on mobile phones presents new opportunities for logical localization. We postulate that ambient sound, light, and color in a place convey a photoacoustic signature that can be sensed by the phone’s camera and microphone. Inbuilt accelerometers in some phones may also be useful in inferring broad classes of user motion, often dictated by the nature of the place. By combining these optical, acoustic, and motion attributes, it may be feasible to construct an identifiable fingerprint for logical localization. Hence, users in adjacent stores can be separated logically, even when their physical positions are extremely close. We propose SurroundSense, a mobile-phone-based system that explores logical localization via ambience fingerprinting. Evaluation results from 51 different stores show that SurroundSense can achieve an average accuracy of 87% when all sensing modalities are employed. We believe this is an encouraging result, opening new possibilities in indoor localization.
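The core idea of ambience fingerprinting can be sketched in a few lines: summarize sensor readings into a feature vector and match a new reading to the nearest labeled fingerprint. This is only an illustrative toy, assuming made-up store names and a crude feature set (sound mean/variance, light level, a small color histogram); SurroundSense's actual features and matching are considerably more sophisticated.

```python
import math

def fingerprint(sound_levels, light_lux, color_hist):
    """Summarize ambient readings into a small feature vector (toy features)."""
    mean = sum(sound_levels) / len(sound_levels)
    var = sum((s - mean) ** 2 for s in sound_levels) / len(sound_levels)
    return [mean, var, light_lux] + list(color_hist)

def nearest_store(reading, labeled):
    """Match a reading to the closest stored fingerprint (Euclidean distance)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(labeled, key=lambda name: dist(reading, labeled[name]))

# Hypothetical labeled fingerprints for two adjacent stores
labeled = {
    "cafe":      fingerprint([60, 62, 61], 300, [0.5, 0.3, 0.2]),
    "bookstore": fingerprint([40, 41, 39], 500, [0.2, 0.4, 0.4]),
}
print(nearest_store(fingerprint([59, 63, 60], 310, [0.5, 0.3, 0.2]), labeled))
```

Even with these crude features, a noisy reading taken inside the "cafe" maps back to the cafe fingerprint, which is the sense in which adjacent stores become logically separable.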
A local search approximation algorithm for k-means clustering. 2004.
Cited by 113 (1 self).
In k-means clustering we are given a set of n data points in d-dimensional space ℜ^d and an integer k, and the problem is to determine a set of k points in ℜ^d, called centers, to minimize the mean squared distance from each data point to its nearest center. No exact polynomial-time algorithms are known for this problem. Although asymptotically efficient approximation algorithms exist, these algorithms are not practical due to the very high constant factors involved. There are many heuristics that are used in practice, but we know of no bounds on their performance. We consider the question of whether there exists a simple and practical approximation algorithm for k-means clustering. We present a local improvement heuristic based on swapping centers in and out. We prove that this yields a (9 + ε)-approximation algorithm. We present an example showing that any approach based on performing a fixed number of swaps achieves an approximation factor of at least (9 − ε) in all sufficiently high dimensions. Thus, our approximation factor is almost tight for algorithms based on performing a fixed number of swaps. To establish the practical value of the heuristic, we present an empirical study that shows that, when combined with
The effectiveness of Lloyd-type methods for the k-means problem. In FOCS, 2006.
Cited by 84 (3 self).
We investigate variants of Lloyd’s heuristic for clustering high-dimensional data in an attempt to explain its popularity (a half century after its introduction) among practitioners, and in order to suggest improvements in its application. We propose and justify a clusterability criterion for data sets. We present variants of Lloyd’s heuristic that quickly lead to provably near-optimal clustering solutions when applied to well-clusterable instances. This is the first performance guarantee for a variant of Lloyd’s heuristic. The provision of a guarantee on output quality does not come at the expense of speed: some of our algorithms are candidates for being faster in practice than currently used variants of Lloyd’s method. In addition, our other algorithms are faster on well-clusterable instances than recently proposed approximation algorithms, while maintaining similar guarantees on clustering quality. Our main algorithmic contribution is a novel probabilistic seeding process for the starting configuration of a Lloyd-type iteration.
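A distance-proportional seeding step, in the spirit of the probabilistic seeding the abstract mentions, can be sketched as follows: each new center is drawn with probability proportional to the squared distance from the nearest center chosen so far, so well-separated clusters tend to each receive a seed. The exact procedure in the paper differs; this is only an illustration of the general flavor.

```python
import random

def d2_seed(points, k, rng):
    """Pick k seeds; each new seed is drawn with probability
    proportional to squared distance from the nearest existing seed."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sum((p - c) ** 2 for p, c in zip(pt, ctr))
                  for ctr in centers) for pt in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

rng = random.Random(0)
pts = [(0.0, 0.0), (0.2, 0.1), (9.0, 9.0), (9.1, 8.9)]
print(d2_seed(pts, 2, rng))
```

With two tight, well-separated groups the second seed almost always lands in the group the first seed missed, which is why a subsequent Lloyd iteration starts near a good configuration on well-clusterable instances.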
Maximum margin clustering made practical. IEEE Transactions on Neural Networks, 2009.
Fast Approximate Spectral Clustering. 2009.
Cited by 58 (1 self).
Spectral clustering refers to a flexible class of clustering procedures that can produce high-quality clusterings on small data sets but which has limited applicability to large-scale problems due to its computational complexity of O(n^3), with n the number of data points. We extend the range of spectral clustering by developing a general framework for fast approximate spectral clustering in which a distortion-minimizing local transformation is first applied to the data. This framework is based on a theoretical analysis that provides a statistical characterization of the effect of local distortion on the misclustering rate. We develop two concrete instances of our general framework, one based on local k-means clustering (KASP) and one based on random projection trees (RASP). Extensive experiments show that these algorithms can achieve significant speedups with little degradation in clustering accuracy. Specifically, our algorithms outperform k-means by a large margin in terms of accuracy, and run several times faster than approximate spectral clustering based on the Nyström method, with comparable accuracy and significantly smaller memory footprint. Remarkably, our algorithms make it possible for a single machine to spectral cluster data sets with a million observations within several minutes.
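The KASP pipeline can be sketched on toy data: compress n points to m representative centroids with a cheap Lloyd-style k-means, run the expensive spectral step only on the m representatives, then give every original point its representative's cluster label. The spectral step below is a bare-bones bipartition (sign of the second eigenvector of the symmetrically normalized Gaussian affinity, computed by power iteration), not the authors' implementation; the point is the n-to-m compression.

```python
import math
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd(points, m, rng, iters=20):
    """Cheap Lloyd-style k-means used only to compress the data."""
    centers = rng.sample(points, m)
    for _ in range(iters):
        groups = [[] for _ in range(m)]
        for p in points:
            groups[min(range(m), key=lambda i: dist2(p, centers[i]))].append(p)
        centers = [tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

def spectral_bipartition(reps, sigma=1.0, steps=200):
    """Split reps in two by the sign of the 2nd eigenvector of the
    symmetrically normalized affinity S = D^{-1/2} W D^{-1/2}."""
    m = len(reps)
    w = [[math.exp(-dist2(a, b) / (2 * sigma ** 2)) for b in reps] for a in reps]
    d = [sum(row) for row in w]
    s = [[w[i][j] / math.sqrt(d[i] * d[j]) for j in range(m)] for i in range(m)]
    top = [math.sqrt(di) for di in d]        # leading eigenvector of S
    v = [(-1.0) ** i for i in range(m)]      # arbitrary start vector
    for _ in range(steps):                   # power iteration, deflating `top`
        proj = sum(vi * ti for vi, ti in zip(v, top)) / sum(t * t for t in top)
        v = [vi - proj * ti for vi, ti in zip(v, top)]
        v = [sum(s[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm for x in v]
    return [0 if x >= 0 else 1 for x in v]   # sign gives the two sides

rng = random.Random(0)
pts = ([(rng.gauss(0, 0.2), rng.gauss(0, 0.2)) for _ in range(50)] +
       [(rng.gauss(8, 0.2), rng.gauss(8, 0.2)) for _ in range(50)])
reps = lloyd(pts, 4, rng)                    # n = 100 points -> m = 4 representatives
rep_label = spectral_bipartition(reps)
labels = [rep_label[min(range(4), key=lambda i: dist2(p, reps[i]))] for p in pts]
```

The O(n^3) eigendecomposition cost now depends only on m, which is the source of the speedups reported in the abstract; the theoretical analysis bounds how much the local distortion from the compression can hurt the misclustering rate.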
Convergence of the Lloyd algorithm for computing centroidal Voronoi tessellations. SIAM Journal on Numerical Analysis.
Cited by 45 (4 self).
Centroidal Voronoi tessellations (CVTs) are Voronoi tessellations of a bounded geometric domain such that the generating points of the tessellations are also the centroids (mass centers) of the corresponding Voronoi regions with respect to a given density function. Centroidal Voronoi tessellations may also be defined in more abstract and more general settings. Due to the natural optimization properties enjoyed by CVTs, they have many applications in diverse fields. The Lloyd algorithm is one of the most popular iterative schemes for computing CVTs, but its theoretical analysis is far from complete. In this paper, some new analytical results on the local and global convergence of the Lloyd algorithm are presented. These results are derived through careful utilization of the optimization properties shared by CVTs. Numerical experiments are also provided to substantiate the theoretical analysis.
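In one dimension the Lloyd iteration for a CVT is especially transparent: the Voronoi regions of the generators are intervals whose boundaries are midpoints between consecutive generators, and each generator moves to the density-weighted centroid of its interval. The sketch below assumes a uniform density rho(x) = 1 on [0, 1] (so each centroid is an interval midpoint); the fixed point is then the uniform partition, with generators at 1/6, 1/2, 5/6 for three points.

```python
def lloyd_cvt_1d(gens, iters=100):
    """Lloyd iteration for a 1-D CVT on [0, 1] with uniform density."""
    for _ in range(iters):
        g = sorted(gens)
        # Voronoi boundaries are midpoints between consecutive generators
        bounds = [0.0] + [(a + b) / 2 for a, b in zip(g, g[1:])] + [1.0]
        # with rho = 1, the mass centroid of [l, r] is its midpoint
        gens = [(l + r) / 2 for l, r in zip(bounds, bounds[1:])]
    return gens

print(lloyd_cvt_1d([0.1, 0.2, 0.9]))
```

Even from a badly skewed start the iterates converge to the uniform CVT; quantifying this convergence (locally and globally, and for general densities) is exactly what the paper's analysis addresses.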
Correlation Clustering in General Weighted Graphs. Theoretical Computer Science, 2006.
Cited by 41 (0 self).
We consider the following general correlation-clustering problem [1]: given a graph with real nonnegative edge weights and a ⟨+⟩/⟨−⟩ edge labeling, partition the vertices into clusters to minimize the total weight of cut ⟨+⟩ edges and uncut ⟨−⟩ edges. Thus, ⟨+⟩ edges with large weights (representing strong correlations between endpoints) encourage those endpoints to belong to a common cluster, while ⟨−⟩ edges with large weights encourage the endpoints to belong to different clusters. In contrast to most clustering problems, correlation clustering specifies neither the desired number of clusters nor a distance threshold for clustering; both of these parameters are effectively chosen to be the best possible by the problem definition. Correlation clustering was introduced by Bansal, Blum, and Chawla [1], motivated by both document clustering and agnostic learning. They proved NP-hardness and gave constant-factor approximation algorithms for the special case in which the graph is complete (full information) and every edge has the same weight. We give an O(log n)-approximation algorithm for the general case based on a linear-programming rounding and the “region-growing” technique. We also prove that this linear program has a gap of Ω(log n), and therefore our approximation is tight under this approach. We also give an O(r^3)-approximation algorithm for K_{r,r}-minor-free graphs. On the other hand, we show that the problem is equivalent to minimum multicut, and therefore APX-hard and difficult to approximate better than Θ(log n).
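The objective itself is simple enough to state in code, which makes the trade-off concrete: every cut ⟨+⟩ edge and every uncut ⟨−⟩ edge pays its weight. This is only an illustration of the objective, not the paper's LP-rounding algorithm; edges are tuples (u, v, weight, sign).

```python
def disagreement(edges, cluster_of):
    """Total weight of cut '+' edges plus uncut '-' edges."""
    cost = 0.0
    for u, v, w, sign in edges:
        same = cluster_of[u] == cluster_of[v]
        if (sign == '+' and not same) or (sign == '-' and same):
            cost += w
    return cost

edges = [('a', 'b', 2.0, '+'), ('b', 'c', 1.0, '-'), ('a', 'c', 3.0, '+')]
good = {'a': 0, 'b': 0, 'c': 0}   # one cluster: pays only the b-c '-' edge
split = {'a': 0, 'b': 0, 'c': 1}  # separating c: pays the heavier a-c '+' edge
print(disagreement(edges, good), disagreement(edges, split))
```

Note that no number of clusters is supplied anywhere: the minimizer of this cost implicitly decides how many clusters to form, which is the feature the abstract highlights.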
Minimum spanning tree partitioning algorithm for microaggregation. IEEE Transactions on Knowledge and Data Engineering, 2005.
Cited by 31 (5 self).
This paper presents a clustering algorithm for partitioning a minimum spanning tree with a constraint on minimum group size. The problem is motivated by microaggregation, a disclosure limitation technique in which similar records are aggregated into groups containing a minimum of k records. Heuristic clustering methods are needed since the minimum information loss microaggregation problem is NP-hard. Our MST partitioning algorithm for microaggregation is sufficiently efficient to be practical for large data sets and yields results that are comparable to the best available heuristic methods for microaggregation. For data that contain pronounced clustering effects, our method results in significantly lower information loss. Our algorithm is general enough to accommodate different measures of information loss and can be used for other clustering applications that have a constraint on minimum group size.
Index Terms: clustering, partitioning, minimum spanning tree, microdata protection, disclosure control.
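A greatly simplified sketch of the MST-partitioning-with-minimum-group-size idea: build an MST with Kruskal's algorithm, then greedily remove the heaviest tree edges whose removal leaves at least k nodes on each side. The paper's actual cut rules and information-loss measures are more refined; this only illustrates the structure of the approach.

```python
def kruskal(n, edges):
    """MST via Kruskal; edges are (weight, u, v) over vertices 0..n-1."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    mst = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            mst.append((w, u, v))
    return mst

def partition(n, edges, k):
    """Cut heaviest MST edges while both sides keep >= k nodes."""
    mst = kruskal(n, edges)
    adj = {i: set() for i in range(n)}
    for _, u, v in mst:
        adj[u].add(v); adj[v].add(u)
    def side(u, banned):
        seen, stack = {u}, [u]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if (x, y) != banned and (y, x) != banned and y not in seen:
                    seen.add(y); stack.append(y)
        return seen
    for w, u, v in sorted(mst, reverse=True):    # heaviest edges first
        if v not in adj[u]:
            continue
        a, b = side(u, (u, v)), side(v, (u, v))
        if len(a) >= k and len(b) >= k:          # both groups stay legal
            adj[u].discard(v); adj[v].discard(u)
    labels, next_label = {}, 0                   # label remaining components
    for s in range(n):
        if s not in labels:
            for x in side(s, None):
                labels[x] = next_label
            next_label += 1
    return labels

# two tight groups of three, joined by a heavy edge
edges = [(1.0, 0, 1), (1.0, 1, 2), (9.0, 2, 3), (1.0, 3, 4), (1.0, 4, 5),
         (8.0, 0, 3), (8.5, 2, 5)]
labels = partition(6, edges, 3)
print(labels)
```

With k = 3, only the heavy bridging edge can be cut; removing any light edge would strand a group smaller than k, which is exactly the constraint microaggregation imposes.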
HigherOrder Correlation Clustering for Image Segmentation
Cited by 29 (2 self).
For many state-of-the-art computer vision algorithms, image segmentation is an important preprocessing step. Several image segmentation algorithms have been proposed, though many suffer from high computational load and numerous hand-tuned parameters. Correlation clustering, a graph-partitioning algorithm often used in natural language processing and document clustering, has the potential to perform better than previously proposed image segmentation algorithms. We improve the basic correlation clustering formulation by taking into account higher-order cluster relationships. This improves clustering in the presence of local boundary ambiguities. We first apply pairwise correlation clustering to image segmentation over a pairwise superpixel graph and then develop higher-order correlation clustering over a hypergraph that considers higher-order relations among superpixels. Fast inference is possible via linear programming relaxation, and effective parameter learning is possible via a structured support vector machine. Experimental results on various datasets show that the proposed higher-order correlation clustering outperforms other state-of-the-art image segmentation algorithms.