Results 11-20 of 116
Clustering large data sets with mixed numeric and categorical values
 In The First Pacific-Asia Conference on Knowledge Discovery and Data Mining
, 1997
Abstract

Cited by 58 (3 self)
Efficient partitioning of large data sets into homogeneous clusters is a fundamental problem in data mining. The standard hierarchical clustering methods provide no solution for this problem due to their computational inefficiency. The k-means based methods are promising for their efficiency in processing large data sets. However, their use is often limited to numeric data. In this paper we present a k-prototypes algorithm which is based on the k-means paradigm but removes the numeric data limitation whilst preserving its efficiency. In the algorithm, objects are clustered against k prototypes. A method is developed to dynamically update the k prototypes in order to maximise the intra-cluster similarity of objects. When applied to numeric data the algorithm is identical to k-means. To assist interpretation of clusters we use decision tree induction algorithms to create rules for clusters. These rules, together with other statistics about clusters, can assist data miners to understand and identify interesting clusters.
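As a rough illustration of the update scheme this abstract describes, the sketch below clusters mixed records against k prototypes, using a squared-Euclidean cost on numeric attributes plus a weighted mismatch count on categorical ones. The function signature, the `gamma` weight, and the fixed iteration count are assumptions for the sketch, not the paper's implementation:

```python
import random

def k_prototypes(data, num_idx, cat_idx, k, gamma=1.0, iters=10):
    """Minimal k-prototypes sketch for mixed numeric/categorical records.

    data: list of records (lists); num_idx / cat_idx: indices of numeric
    and categorical attributes; gamma: weight of categorical mismatches.
    """
    protos = random.sample(data, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            # cost = squared Euclidean (numeric) + gamma * mismatches (categorical)
            d = [sum((x[i] - p[i]) ** 2 for i in num_idx)
                 + gamma * sum(x[i] != p[i] for i in cat_idx)
                 for p in protos]
            clusters[d.index(min(d))].append(x)
        for c, members in enumerate(clusters):
            if not members:
                continue
            proto = list(protos[c])
            for i in num_idx:            # numeric part: cluster mean
                proto[i] = sum(x[i] for x in members) / len(members)
            for i in cat_idx:            # categorical part: cluster mode
                vals = [x[i] for x in members]
                proto[i] = max(set(vals), key=vals.count)
            protos[c] = proto
    return protos
```

Numeric parts of a prototype are updated to the cluster mean and categorical parts to the cluster mode, which is what makes the scheme collapse to plain k-means on purely numeric data.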
Optimal cluster preserving embedding of non-metric proximity data
 IEEE Trans. Pattern Analysis and Machine Intelligence
, 2003
Abstract

Cited by 53 (4 self)
Abstract—For several major applications of data analysis, objects are often not represented as feature vectors in a vector space, but rather by a matrix gathering pairwise proximities. Such pairwise data often violates metricity and, therefore, cannot be naturally embedded in a vector space. Concerning the problem of unsupervised structure detection or clustering, in this paper a new embedding method for pairwise data into Euclidean vector spaces is introduced. We show that all clustering methods which are invariant under additive shifts of the pairwise proximities can be reformulated as grouping problems in Euclidean spaces. The most prominent property of this constant shift embedding framework is the complete preservation of the cluster structure in the embedding space. Restating pairwise clustering problems in vector spaces has several important consequences, such as the statistical description of the clusters by way of cluster prototypes, the generic extension of the grouping procedure to a discriminative prediction rule, and the applicability of standard preprocessing methods like denoising or dimensionality reduction. Index Terms—Clustering, pairwise proximity data, cost function, embedding, MDS.
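The constant shift embedding idea can be sketched with standard linear algebra: double-center the dissimilarity matrix, shift its spectrum just enough to restore positive semidefiniteness, and read coordinates off the eigendecomposition. This is a minimal sketch that treats D as squared dissimilarities; the function name and API are assumptions:

```python
import numpy as np

def constant_shift_embedding(D):
    """Embed a symmetric, zero-diagonal (possibly non-metric) dissimilarity
    matrix D into Euclidean coordinates via a minimal constant shift."""
    n = D.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n      # centering projector
    Sc = -0.5 * Q @ D @ Q                    # centered similarity (classical MDS)
    lam_min = np.linalg.eigvalsh(Sc)[0]
    if lam_min < 0:
        Sc = Sc + (-lam_min) * Q             # minimal shift restoring PSD
    lam, V = np.linalg.eigh(Sc)
    lam = np.clip(lam, 0.0, None)            # clip tiny numerical negatives
    return V * np.sqrt(lam)
```

Because the shift only adds a constant to every off-diagonal dissimilarity, clustering costs that are invariant under additive shifts are unchanged, which is the cluster-preservation property the abstract highlights.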
A Theory of Proximity-Based Clustering: Structure Detection by Optimization
 Pattern Recognition
, 1999
Abstract

Cited by 43 (8 self)
In this paper, a systematic optimization approach for clustering proximity or similarity data is developed. Starting from fundamental invariance and robustness properties, a set of axioms is proposed and discussed to distinguish different cluster compactness and separation criteria. The approach covers the case of sparse proximity matrices, and is extended to nested partitionings for hierarchical data clustering. To solve the associated optimization problems, a rigorous mathematical framework for deterministic annealing and mean-field approximation is presented. Efficient optimization heuristics are derived in a canonical way, which also clarifies the relation to stochastic optimization by Gibbs sampling. Similarity-based clustering techniques have a broad range of possible applications in computer vision, pattern recognition, and data analysis. As a major practical application we present a novel approach to the problem of unsupervised texture segmentation, which relies on statistical...
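The paper concerns pairwise proximity data, but the flavor of deterministic annealing with mean-field assignments is easiest to see in the simpler central-clustering setting: soft assignments p(c|x) proportional to exp(-||x - y_c||^2 / T), with the temperature T lowered on a schedule. Everything below, from the cooling rate to the inner-loop count, is an assumed toy configuration, not the paper's method:

```python
import numpy as np

def deterministic_annealing_kmeans(X, k, T0=10.0, Tmin=0.01, cooling=0.8, seed=0):
    """Toy deterministic-annealing clustering: the solution anneals from one
    fuzzy cluster at high temperature to k nearly hard clusters at low T."""
    rng = np.random.default_rng(seed)
    # jittered init breaks the symmetry that the annealing then amplifies
    Y = X[rng.choice(len(X), k, replace=False)] \
        + 1e-3 * rng.standard_normal((k, X.shape[1]))
    T = T0
    while T > Tmin:
        for _ in range(20):
            d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
            logits = -d2 / T
            logits -= logits.max(axis=1, keepdims=True)
            P = np.exp(logits)
            P /= P.sum(axis=1, keepdims=True)       # mean-field assignments
            Y = (P.T @ X) / P.sum(axis=0)[:, None]  # centroid update
        T *= cooling
    return Y, P.argmax(axis=1)
```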
Stability-based model selection
 In Advances in Neural Information Processing Systems
, 2002
Abstract

Cited by 42 (7 self)
Model selection is linked to model assessment, which is the problem of comparing different models, or model parameters, for a specific learning task. For supervised learning, the standard practical technique is cross-validation, which is not applicable for semi-supervised and unsupervised settings. In this paper, a new model assessment scheme is introduced which is based on a notion of stability. The stability measure yields an upper bound to cross-validation in the supervised case, but extends to semi-supervised and unsupervised problems. In the experimental part, the performance of the stability measure is studied for model order selection in comparison to standard techniques in this area.
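One concrete reading of such a stability measure, offered as illustration only: cluster two disjoint halves of the data independently, transfer one solution onto the other, and score the best label agreement. The k-means stand-in, the nearest-centroid transfer, and the brute-force permutation matching are assumed details, not the paper's scheme:

```python
import numpy as np
from itertools import permutations

def kmeans(X, k, rng, iters=25):
    """Plain Lloyd's algorithm, used here as a stand-in base clusterer."""
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        lab = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if (lab == c).any():
                C[c] = X[lab == c].mean(0)
    return C, lab

def stability_score(X, k, n_splits=10, seed=0):
    """Mean best-case label agreement between clusterings of disjoint halves.
    Higher = the model order k is more stable on this data."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_splits):
        idx = rng.permutation(len(X))
        A, B = X[idx[: len(X) // 2]], X[idx[len(X) // 2:]]
        ca, _ = kmeans(A, k, rng)
        _, lb = kmeans(B, k, rng)
        # transfer A's solution to B by nearest A-centroid
        transfer = ((B[:, None, :] - ca[None, :, :]) ** 2).sum(-1).argmin(1)
        # align label names by brute force (fine for small k)
        best = max(np.mean(lb == np.array(p)[transfer])
                   for p in permutations(range(k)))
        scores.append(best)
    return float(np.mean(scores))
```

Model order selection then amounts to computing this score for each candidate k and preferring stable values.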
Histogram Clustering for Unsupervised Image Segmentation
 Proceedings of CVPR ’99
, 1999
Abstract

Cited by 30 (1 self)
This paper introduces a novel statistical mixture model for probabilistic grouping of distributional (histogram) data. Adopting the Bayesian framework, we propose to perform annealed maximum a posteriori estimation to compute optimal clustering solutions. In order to accelerate the optimization process, an efficient multiscale formulation is developed. We present a prototypical application of this method for the unsupervised segmentation of textured images based on local distributions of Gabor coefficients. Benchmark results indicate superior performance compared to K-means clustering and proximity-based algorithms.
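A heavily simplified sketch of grouping histograms: treat each cluster as a prototype distribution and alternate a KL-divergence assignment step with prototype averaging. This hard-EM multinomial stand-in is an assumption for illustration, not the paper's annealed MAP estimation:

```python
import numpy as np

def histogram_clustering(H, k, iters=30, seed=0, eps=1e-12):
    """Cluster rows of a (n_histograms x n_bins) count matrix H into k groups,
    each represented by a prototype distribution."""
    rng = np.random.default_rng(seed)
    H = H / H.sum(1, keepdims=True)               # normalize to distributions
    P = H[rng.choice(len(H), k, replace=False)]   # initial prototypes
    for _ in range(iters):
        # KL(h || p) up to the constant entropy term: -sum_b h_b log p_b
        cost = -(H[:, None, :] * np.log(P[None, :, :] + eps)).sum(-1)
        lab = cost.argmin(1)
        for c in range(k):
            if (lab == c).any():
                P[c] = H[lab == c].mean(0)        # average member distributions
    return P, lab
```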
Path-Based Pairwise Data Clustering with Application to Texture Segmentation
, 2001
Abstract

Cited by 21 (5 self)
Most cost-function-based clustering or partitioning methods measure the compactness of groups of data. In contrast to this picture of a point source in feature space, some data sources are spread out on a low-dimensional manifold which is embedded in a high-dimensional data space. This property is adequately captured by the criterion of connectedness, which is approximated by graph-theoretic partitioning methods.
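The connectedness criterion can be made concrete with the minimax ("path-based") effective dissimilarity: the distance between two points is the smallest achievable largest edge over all connecting paths, so points linked through a dense elongated manifold become close. The Floyd-Warshall-style recursion below is a standard construction offered as illustration, not the paper's exact algorithm:

```python
import numpy as np

def path_based_distances(D):
    """Minimax path distances from a pairwise dissimilarity matrix D:
    E[i, j] = min over paths i -> j of the maximum edge on the path."""
    E = D.copy()
    n = len(E)
    for m in range(n):
        # relax through intermediate point m, minimax-style
        E = np.minimum(E, np.maximum(E[:, m:m + 1], E[m:m + 1, :]))
    return E
```

Running an ordinary compactness-based clusterer on E instead of D then rewards groups that are connected, not just tight.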
On spatial quantization of color images
 in Proceedings of the European Conference on Computer Vision
, 1998
Abstract

Cited by 20 (4 self)
Image quantization and digital halftoning are fundamental image processing problems in computer vision and graphics. Both steps are generally performed sequentially and, in most cases, independently of each other. Color reduction with a pixelwise defined distortion measure and the halftoning process with its local averaging neighborhood typically optimize different quality criteria or, frequently, follow a heuristic approach without reference to any quality measure. In this paper we propose a new model to simultaneously quantize and halftone color images. The method is based on a rigorous cost-function approach which optimizes a quality criterion derived from a simplified model of human perception. It incorporates spatial and contextual information into the quantization and thus overcomes the artificial separation of quantization and halftoning. Optimization is performed by an efficient multiscale procedure which substantially alleviates the computational burden.
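A sketch of the kind of cost such a joint model optimizes: squared error between the low-pass filtered original and the low-pass filtered quantized image, where the low-pass kernel mimics the eye's local averaging. This simplified perceptual model is an assumption for illustration, not the paper's criterion:

```python
import numpy as np

def spatial_quantization_cost(image, assignment, palette, kernel):
    """Cost of mapping each pixel to a palette entry (assignment holds palette
    indices), judged after local averaging with a perceptual kernel."""
    quantized = palette[assignment]

    def lowpass(img):
        # valid 2D convolution with the averaging kernel
        kh, kw = kernel.shape
        h, w = img.shape[:2]
        out = np.zeros((h - kh + 1, w - kw + 1) + img.shape[2:])
        for i in range(kh):
            for j in range(kw):
                out += kernel[i, j] * img[i:i + h - kh + 1, j:j + w - kw + 1]
        return out

    return float(((lowpass(image) - lowpass(quantized)) ** 2).sum())
```

Minimizing this jointly over palette and assignment couples quantization and halftoning: a checkerboard of two palette colors can match a mid-tone after averaging, which a pixelwise distortion measure would never reward.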
Scale-based Clustering using the Radial Basis Function Network
 IEEE Trans. Neural Networks
, 1996
Abstract

Cited by 18 (3 self)
This paper shows how scale-based clustering can be done using the Radial Basis Function (RBF) Network, with the RBF width as the scale parameter and a dummy target as the desired output. The technique suggests the "right" scale at which the given data set should be clustered, thereby providing a solution to the problem of determining the number of RBF units and the widths required to get a good network solution. The network compares favorably with other standard techniques on benchmark clustering examples. Properties that are required of non-Gaussian basis functions, if they are to serve in alternative clustering networks, are identified. The work on the whole points out an important role played by the width parameter in the RBFN, when observed over several scales, and provides a fundamental link to the scale space theory developed in computational vision. The work described here is supported in part by the National Science Foundation under grant ECS9307632 and in part by ONR Contract N...
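As a crude stand-in for sweeping the RBF width, the sketch below links points closer than a scale parameter sigma and takes the cluster count that persists over the widest range of scales as the "right" one. The connected-components heuristic is an assumed simplification, not the RBFN construction:

```python
import numpy as np

def clusters_at_scale(X, sigma):
    """Count connected groups when points closer than sigma are linked."""
    n = len(X)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    adj = D < sigma
    labels = -np.ones(n, dtype=int)
    cur = 0
    for i in range(n):                 # depth-first component labeling
        if labels[i] >= 0:
            continue
        stack = [i]
        labels[i] = cur
        while stack:
            j = stack.pop()
            for nb in np.flatnonzero(adj[j]):
                if labels[nb] < 0:
                    labels[nb] = cur
                    stack.append(nb)
        cur += 1
    return cur

def stable_scale(X, sigmas):
    """Scale-space heuristic: the cluster count surviving the widest sigma range."""
    counts = [clusters_at_scale(X, s) for s in sigmas]
    return max(set(counts), key=counts.count)
```

The persistence idea is the same one the abstract attributes to the width parameter: structure that survives over many scales is taken to be real.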
Data Clustering and Learning
, 2002
Abstract

Cited by 15 (1 self)
Intelligent data analysis extracts symbolic information and relations between objects from quantitative or qualitative data. A prominent class of methods comprises clustering or grouping principles, which are designed to discover and extract structures hidden in data sets [Jain and Dubes, 1988]. The parameters which