Survey of clustering algorithms
 IEEE TRANSACTIONS ON NEURAL NETWORKS
, 2005
Cited by 483 (4 self)
Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several tightly related topics, proximity measure, and cluster validation, are also discussed.
Robust Data Clustering
, 2003
Cited by 276 (8 self)
We address the problem of robust clustering by combining data partitions (forming a clustering ensemble) produced by multiple clusterings. We formulate robust clustering under an information-theoretical framework; mutual information is the underlying concept used in the definition of quantitative measures of agreement or consistency between data partitions. Robustness is assessed by variance of the cluster membership, based on bootstrapping. We propose and analyze a voting mechanism on pairwise associations of patterns for combining data partitions. We show that the proposed technique attempts to optimize the mutual-information-based criteria, although the optimality is not ensured in all situations. This evidence accumulation method is demonstrated by combining the well-known K-means algorithm to produce clustering ensembles. Experimental results show the ability of the technique to identify clusters with arbitrary shapes and sizes.
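As a rough illustration of the agreement measure this abstract refers to, the sketch below computes the mutual information between two partitions of the same patterns, plus a normalized variant. The function names and the normalization 2I/(H(a)+H(b)) are assumptions chosen for illustration; the abstract does not state which normalization the paper uses.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a partition given as a label list."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two partitions of the same n patterns."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    # Sum over the label pairs that actually co-occur.
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def normalized_mi(a, b):
    """Assumed normalization: 2 * I(a, b) / (H(a) + H(b)), in [0, 1]."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a + h_b == 0:
        # Both partitions put every pattern in one cluster: identical.
        return 1.0
    return 2 * mutual_information(a, b) / (h_a + h_b)
```

Two partitions that differ only in how the clusters are named score 1 under this normalization, while statistically independent partitions score 0, which is what makes the measure usable for comparing clusterings whose labels carry no meaning.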
Data Clustering: 50 Years Beyond K-Means
, 2008
Cited by 274 (6 self)
Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into taxonomic ranks: domain, kingdom, phylum, class, etc. Cluster analysis is the formal study of algorithms and methods for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is exploratory in nature: to find structure in data. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. In spite of the fact that K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used. This speaks to the difficulty of designing a general-purpose clustering algorithm and the ill-posed problem of clustering. We provide a brief overview of clustering, summarize well-known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during data clustering, and large-scale data clustering.
Analysis of Planar Shapes Using Geodesic Paths on Shape Spaces
 IEEE Transactions on Pattern Analysis and Machine Intelligence
, 2004
Cited by 172 (39 self)
For analyzing shapes of planar, closed curves, we propose differential geometric representations of curves using their direction functions and curvature functions. Shapes are represented as elements of infinite-dimensional spaces and their pairwise differences are quantified using the lengths of geodesics connecting them on these spaces. We use a Fourier basis to represent tangents to the shape spaces and then use a gradient-based shooting method to solve for the tangent that connects any two shapes via a geodesic.
Person re-identification by symmetry-driven accumulation of local features
 IEEE Conf. Computer Vision and Pattern Recognition
, 2010
Cited by 148 (5 self)
In this paper, we present an appearance-based method for person re-identification. It consists of extracting features that model three complementary aspects of human appearance: the overall chromatic content, the spatial arrangement of colors into stable regions, and the presence of recurrent local motifs with high entropy. All this information is derived from different body parts and weighted appropriately by exploiting perceptual principles of symmetry and asymmetry. In this way, robustness against very low resolution, occlusions, and changes in pose, viewpoint, and illumination is achieved. The approach applies to situations where the number of candidates varies continuously, considering single images or multiple frames for each individual. It has been tested on several public benchmark datasets (VIPeR, iLIDS, ETHZ), achieving new state-of-the-art performance.
Feature selection for unsupervised learning
 Journal of Machine Learning Research
, 2004
Cited by 139 (4 self)
In this paper, we identify two issues involved in developing an automated feature subset selection algorithm for unlabeled data: the need for finding the number of clusters in conjunction with feature selection, and the need for normalizing the bias of feature selection criteria with respect to dimension. We explore the feature selection problem and these issues through FSSEM (Feature Subset Selection using Expectation-Maximization (EM) clustering) and through two different performance criteria for evaluating candidate feature subsets: scatter separability and maximum likelihood. We present proofs on the dimensionality biases of these feature criteria, and present a cross-projection normalization scheme that can be applied to any criterion to ameliorate these biases. Our experiments show the need for feature selection, the need for addressing these two issues, and the effectiveness of our proposed solutions.
Data Clustering Using Evidence Accumulation
, 2002
Cited by 125 (12 self)
the results of multiple clusterings. Initially, n d-dimensional data is decomposed into a large number of compact clusters; the K-means algorithm performs this decomposition, with several clusterings obtained by N random initializations of the K-means. Taking the co-occurrences of pairs of patterns in the same cluster as votes for their association, the data partitions are mapped into a co-association matrix of patterns. This n × n matrix represents a new similarity measure between patterns. The final clusters are obtained by applying a MST-based clustering algorithm on this matrix. Results on both synthetic and real data show the ability of the method to identify arbitrarily shaped clusters in multidimensional data.
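The voting scheme in this abstract can be sketched in a few lines of plain Python. Everything named here (kmeans_labels, coassociation, threshold_clusters, the 0.5 vote threshold) is a hypothetical choice for illustration, not the authors' implementation. In particular, the final step groups patterns by connected components over strongly co-associated pairs, which gives the same groups as cutting weak edges of a spanning tree (single-link); the abstract does not say how the paper cuts the tree.

```python
import random


def dist2(p, q):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_labels(points, k, rng, iters=20):
    """Plain Lloyd-style K-means from a random initialization."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        # Update step: move each non-empty center to its members' mean.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels


def coassociation(points, k, runs, seed=0):
    """Fraction of K-means runs in which each pair shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    votes = [[0] * n for _ in range(n)]
    for _ in range(runs):
        labels = kmeans_labels(points, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    votes[i][j] += 1
    return [[v / runs for v in row] for row in votes]


def threshold_clusters(coassoc, threshold=0.5):
    """Link pairs whose co-association exceeds the threshold (assumed
    cut rule) and return connected components as cluster labels."""
    n = len(coassoc)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if coassoc[i][j] > threshold:
                parent[find(i)] = find(j)
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]
```

A call such as threshold_clusters(coassociation(points, k=2, runs=30)) takes a list of coordinate tuples and returns one integer label per pattern. Because the final grouping depends only on how often pairs were placed together, an occasional poor K-means run is outvoted by the others.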
Simultaneous feature selection and clustering using mixture models
 IEEE TRANS. PATTERN ANAL. MACH. INTELL
, 2004
Cited by 118 (1 self)
Clustering is a common unsupervised learning technique used to discover group structure in a set of data. While there exist many algorithms for clustering, the important issue of feature selection, that is, what attributes of the data should be used by the clustering algorithms, is rarely touched upon. Feature selection for clustering is difficult because, unlike in supervised learning, there are no class labels for the data and, thus, no obvious criteria to guide the search. Another important problem in clustering is the determination of the number of clusters, which clearly impacts and is influenced by the feature selection issue. In this paper, we propose the concept of feature saliency and introduce an expectation-maximization (EM) algorithm to estimate it, in the context of mixture-based clustering. Due to the introduction of a minimum message length model selection criterion, the saliency of irrelevant features is driven toward zero, which corresponds to performing feature selection. The criterion and algorithm are then extended to simultaneously estimate the feature saliencies and the number of clusters.
Segmentation of multivariate mixed data via lossy coding and compression
 IEEE Transactions on Pattern Analysis and Machine Intelligence
, 2007
Cited by 108 (17 self)
In this paper, based on ideas from lossy data coding and compression, we present a simple but effective technique for segmenting multivariate mixed data that are drawn from a mixture of Gaussian distributions, which are allowed to be almost degenerate. The goal is to find the optimal segmentation that minimizes the overall coding length of the segmented data, subject to a given distortion. By analyzing the coding length/rate of mixed data, we formally establish some strong connections of data segmentation to many fundamental concepts in lossy data compression and rate-distortion theory. We show that a deterministic segmentation is approximately the (asymptotically) optimal solution for compressing mixed data. We propose a very simple and effective algorithm that depends on a single parameter, the allowable distortion. At any given distortion, the algorithm automatically determines the corresponding number and dimension of the groups and does not involve any parameter estimation. Simulation results reveal intriguing phase-transition-like behaviors of the number of segments when changing the level of distortion or the amount of outliers. Finally, we demonstrate how this technique can be readily applied to segment real imagery and bioinformatic data. Index Terms: multivariate mixed data, data segmentation, data clustering, rate distortion, lossy coding, lossy compression, image segmentation, microarray data clustering.
Fast approximate energy minimization with label costs
, 2010
Cited by 108 (9 self)
The α-expansion algorithm [7] has had a significant impact in computer vision due to its generality, effectiveness, and speed. Thus far it can only minimize energies that involve unary, pairwise, and specialized higher-order terms. Our main contribution is to extend α-expansion so that it can simultaneously optimize “label costs” as well. An energy with label costs can penalize a solution based on the set of labels that appear in it. The simplest special case is to penalize the number of labels in the solution. Our energy is quite general, and we prove optimality bounds for our algorithm. A natural application of label costs is multi-model fitting, and we demonstrate several such applications in vision: homography detection, motion segmentation, and unsupervised image segmentation. Our C++/MATLAB implementation is publicly available.
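To make the notion of a label cost concrete, the sketch below only evaluates such an energy for a given labeling; it does not implement α-expansion. The function name, the argument layout, and the Potts-style pairwise term are assumptions for illustration, covering the simplest case the abstract mentions (a per-label penalty for each label that appears in the solution).

```python
def labeling_energy(labels, unary, edges, smooth_weight, label_cost):
    """Energy = unary terms + Potts pairwise terms + per-label costs.

    labels        -- chosen label for each site (list of ints)
    unary         -- unary[i][l] is the cost of giving site i label l
    edges         -- list of (i, j) neighboring site pairs
    smooth_weight -- penalty for each edge whose endpoints disagree
                     (Potts model, an assumption of this sketch)
    label_cost    -- label_cost[l] is paid once if label l is used at all
    """
    data_term = sum(unary[i][lab] for i, lab in enumerate(labels))
    smooth_term = sum(smooth_weight for i, j in edges
                      if labels[i] != labels[j])
    # The label-cost term depends only on the SET of labels in use.
    cost_term = sum(label_cost[lab] for lab in set(labels))
    return data_term + smooth_term + cost_term
```

Because the last term is charged once per distinct label, a labeling that fits the data slightly worse but uses fewer labels can end up with lower total energy, which is the behavior that makes label costs useful for multi-model fitting.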