Results 1 - 10 of 53
Survey of clustering algorithms
IEEE Transactions on Neural Networks, 2005
"... Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the ..."
Abstract
-
Cited by 499 (4 self)
Data analysis plays an indispensable role in understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several tightly related topics, such as proximity measures and cluster validation, are also discussed.
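The survey's recurring ingredients (a clustering algorithm, a proximity measure, and a validation index) can be made concrete with a small sketch; the synthetic data, library calls, and parameter choices below are illustrative assumptions and are not taken from the paper.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic benchmark data: three well-separated groups (an assumption).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# One clustering algorithm (k-means) ...
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# ... and one validation index (silhouette), which uses Euclidean distance
# as its proximity measure by default.
print("silhouette score:", silhouette_score(X, labels))
```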
Survey of clustering data mining techniques
2002
"... Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in math ..."
Abstract
-
Cited by 408 (0 self)
Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective, clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique …
Bayesian Approaches to Gaussian Mixture Modelling
IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998
"... ..."
(Show Context)
On Fitting Mixture Models
1999
"... Consider the problem of fitting a finite Gaussian mixture, with an unknown number of components, to observed data. This paper proposes a new minimum description length (MDL) type criterion, termed MMDL (for mixture MDL), to select the number of components of the model. MMDL is based on the ident ..."
Abstract
-
Cited by 27 (4 self)
Consider the problem of fitting a finite Gaussian mixture, with an unknown number of components, to observed data. This paper proposes a new minimum description length (MDL) type criterion, termed MMDL (for mixture MDL), to select the number of components of the model. MMDL is based on the identification of an "equivalent sample size" for each component, which does not coincide with the full sample size. We also introduce an algorithm based on the standard expectation-maximization (EM) approach together with a new agglomerative step, called agglomerative EM (AEM). The experiments reported here show that MMDL outperforms existing criteria of comparable computational cost. The good behavior of AEM, namely its robustness with respect to initialization, is also illustrated experimentally.
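The MMDL criterion and the agglomerative EM step are not reproduced here; as a hedged illustration of the same model-selection problem, the sketch below fits Gaussian mixtures by EM for several candidate component counts and keeps the count with the lowest BIC, a generic MDL/BIC-style stand-in. The synthetic data and candidate range are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data drawn from two well-separated Gaussian groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, (200, 2)), rng.normal(3, 1, (200, 2))])

# Fit by EM for each candidate component count and score each fit;
# BIC stands in for an MDL-type criterion (lower is better).
scores = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
          for k in range(1, 6)}
best_k = min(scores, key=scores.get)
print("selected number of components:", best_k)   # expected: 2
```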
A Fast and Robust General Purpose Clustering Algorithm
In Pacific Rim International Conference on Artificial Intelligence, 2000
"... General purpose and highly applicable clustering methods are usually required during the early stages of knowledge discovery exercises. k-Means has been adopted as the prototype of iterative model-based clustering because of its speed, simplicity and capability to work within the format of very larg ..."
Abstract
-
Cited by 25 (2 self)
General purpose and highly applicable clustering methods are usually required during the early stages of knowledge discovery exercises. k-Means has been adopted as the prototype of iterative model-based clustering because of its speed, simplicity and capability to work within the format of very large databases. However, k-Means has several disadvantages derived from its statistical simplicity. We propose an algorithm that remains very efficient, generally applicable, and multi-dimensional, but is more robust to noise and outliers. We achieve this by using the discrete median rather than the mean as the estimator of the center of a cluster. Comparison with k-Means, Expectation Maximization and Gibbs sampling demonstrates the advantages of our algorithm.
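As a rough sketch of the idea of using an in-sample (discrete) centre estimate instead of the mean, the following generic medoid-style update picks, for each cluster, the member point with the smallest total distance to the other members. It is not the authors' algorithm; the function name, data handling, and iteration budget are assumptions.

```python
import numpy as np

def k_medoids_like(X, k, iters=20, seed=0):
    """Lloyd-style loop whose cluster centre is always an actual data point."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assignment step: nearest centre under Euclidean distance.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: the discrete centre is the member point with the
        # smallest total distance to the other members of its cluster.
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue
            within = np.linalg.norm(members[:, None, :] - members[None, :, :], axis=2)
            centers[j] = members[within.sum(axis=1).argmin()]
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2)),
               [[100.0, 100.0]]])            # two groups plus one gross outlier
labels, centers = k_medoids_like(X, k=2)
print(np.round(centers, 1))                  # inspect the two discrete centres
```

Because each centre is an actual data point, a single extreme value cannot drag it arbitrarily far, which is the robustness property the abstract emphasizes.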
A comparative study of RNN for outlier detection in data mining
In ICDM, 2002
"... We have proposed replicator neural networks (RNNs) as an outlier detecting algorithm [15]. Here we compare RNN for outlier detection with three other methods using both publicly available statistical datasets (generally small) and data mining datasets (generally much larger and generally real data). ..."
Abstract
-
Cited by 22 (0 self)
We have proposed replicator neural networks (RNNs) as an outlier-detecting algorithm [15]. Here we compare RNNs for outlier detection with three other methods using both publicly available statistical datasets (generally small) and data mining datasets (generally much larger and generally real data). The smaller datasets provide insights into the relative strengths and weaknesses of RNNs against the compared methods. The larger datasets particularly test the scalability and practicality of application. This paper also develops a methodology for comparing outlier detectors and provides performance benchmarks against which new outlier detection methods can be assessed.
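Only the core reconstruction-error idea is sketched below: a network trained to reproduce its own input ranks points by how poorly it reconstructs them. The replicator architecture of the cited work is not reproduced; the linear bottleneck, synthetic data, and settings are assumptions for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Inliers near a 2-D subspace of a 5-D space; outliers scattered off it.
rng = np.random.default_rng(0)
Z = rng.normal(size=(300, 2))
A = rng.normal(size=(2, 5))
inliers = Z @ A + 0.05 * rng.normal(size=(300, 5))
outliers = rng.uniform(-4, 4, size=(10, 5))
X = np.vstack([inliers, outliers])
Xs = StandardScaler().fit_transform(X)

# Replicator-style network: map the input back onto itself through a
# narrow bottleneck, then score points by reconstruction error.
net = MLPRegressor(hidden_layer_sizes=(2,), activation="identity",
                   max_iter=5000, random_state=0)
net.fit(Xs, Xs)
err = ((net.predict(Xs) - Xs) ** 2).mean(axis=1)
print("highest-error indices:", np.argsort(err)[-10:])  # typically the injected outliers
```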
Visualization and analysis of full-waveform airborne laser scanner data
In The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
"... Recent development of airborne laser scanner systems allows for digitization and recording of the full-waveform, i.e. the received signal of the reflected laser pulse. In this paper, some visualizations and analysis of waveform data are presented. The purpose is to study how waveform data can be use ..."
Abstract
-
Cited by 20 (0 self)
Recent development of airborne laser scanner systems allows for digitization and recording of the full waveform, i.e. the received signal of the reflected laser pulse. In this paper, some visualizations and analyses of waveform data are presented. The purpose is to study how waveform data can be used to extract additional information. As a first step, 3D point data are extracted and parameterized. The approach for extracting 3D point data is based on unsupervised learning, where a mixture of Gaussian distributions is fitted to the waveforms to detect the peaks using the EM algorithm. The performance of this approach is compared to the real-time echo extraction performed by the system. The number of points extracted per waveform is studied, and examples illustrate where additional points have been extracted.
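As a simplified stand-in for the peak-detection step described above (not the authors' implementation), the sketch below runs a small amplitude-weighted EM fit of a one-dimensional Gaussian mixture to a synthetic waveform and reads the estimated peak positions off the component means. The waveform, the fixed two-component assumption, and the iteration count are illustrative.

```python
import numpy as np

# Synthetic waveform: two overlapping return pulses sampled over time.
t = np.linspace(0, 100, 400)
wave = 3 * np.exp(-0.5 * ((t - 30) / 3) ** 2) + 2 * np.exp(-0.5 * ((t - 62) / 4) ** 2)
w = wave / wave.sum()                              # amplitudes used as weights

K = 2
mu = np.array([20.0, 80.0])                        # crude initial peak guesses
sigma = np.array([5.0, 5.0])
pi = np.full(K, 1.0 / K)
for _ in range(100):
    # E-step: responsibilities of each component for each time sample.
    dens = pi * np.exp(-0.5 * ((t[:, None] - mu) / sigma) ** 2) / sigma
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step, weighted by the waveform amplitude at each sample.
    Nk = (w[:, None] * resp).sum(axis=0)
    mu = (w[:, None] * resp * t[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((w[:, None] * resp * (t[:, None] - mu) ** 2).sum(axis=0) / Nk)
    pi = Nk / Nk.sum()
print("estimated peak positions:", np.round(mu, 1))  # roughly 30 and 62
```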
Unsupervised selection and estimation of finite mixture models
In Proc. Int. Conf. Pattern Recognition, 2000
"... We describe a new method for fitting mixture models to multivariate data which performs component selection and does not require external initialization. The novelty of our approach includes: an MML-like (minimum message length) model selection criterion; inclusion of the criterion into the expectat ..."
Abstract
-
Cited by 17 (2 self)
We describe a new method for fitting mixture models to multivariate data which performs component selection and does not require external initialization. The novelty of our approach includes: an MML-like (minimum message length) model selection criterion; inclusion of the criterion into the expectation-maximization (EM) algorithm (increasing its ability to escape from local maxima); and an initialization strategy based on the interpretation of EM as a self-annealing algorithm.
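The MML criterion itself is not reproduced here; as a loosely related illustration of the "start with many components and let the unneeded ones vanish" behaviour the abstract describes, the sketch below uses a variational Bayesian mixture, which tends to drive the weights of superfluous components toward zero. The data, prior strength, and 0.02 cut-off are assumptions.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Three Gaussian groups, but the fit is started with ten components.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-4, 1, (200, 2)),
               rng.normal(0, 1, (200, 2)),
               rng.normal(4, 1, (200, 2))])

bgm = BayesianGaussianMixture(n_components=10, weight_concentration_prior=0.01,
                              max_iter=500, random_state=0).fit(X)
# Components whose weight has collapsed are treated as annihilated.
print("effective components:", int((bgm.weights_ > 0.02).sum()))  # often 3 here
```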
Learning non-linear image manifolds by global alignment of local linear models
2005
"... Appearance based methods, based on statistical models of the pixels values in an image (region) rather than geometrical object models, are increasingly popular in computer vision. In many applications the number of degrees of freedom (DOF) in the image generating process is much lower than the numb ..."
Abstract
-
Cited by 15 (0 self)
Appearance based methods, based on statistical models of the pixel values in an image (region) rather than geometrical object models, are increasingly popular in computer vision. In many applications the number of degrees of freedom (DOF) in the image generating process is much lower than the number of pixels in the image. If there is a smooth function that maps the DOF to the pixel values, then the images are confined to a low-dimensional manifold embedded in the image space. We propose a method based on probabilistic mixtures of factor analyzers to (i) model the density of images sampled from such manifolds and (ii) recover global parameterizations of the manifold. A globally non-linear probabilistic two-way mapping between coordinates on the manifold and images is obtained by combining several locally valid linear mappings. We propose a parameter estimation scheme that improves upon an existing scheme, and experimentally compare the presented approach to self-organizing maps, generative topographic mapping, and mixtures of factor analyzers. In addition, we show that the approach also applies to finding mappings between different embeddings of the same manifold.
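The paper's global alignment step is its main contribution and is not sketched here; the snippet below only illustrates the "locally valid linear models" half of the idea by soft-partitioning data from a curved manifold with a Gaussian mixture and fitting a low-dimensional PCA per component. The S-curve data, component count, and latent dimension are assumptions.

```python
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA

# 3-D points lying on a 2-D manifold (the classic S-curve).
X, _ = make_s_curve(n_samples=1000, random_state=0)
labels = GaussianMixture(n_components=6, random_state=0).fit_predict(X)

# One local linear chart (2-D PCA) per mixture component.
charts = {}
for k in np.unique(labels):
    pca = PCA(n_components=2).fit(X[labels == k])
    charts[k] = pca
    print(f"component {k}: local 2-D variance explained "
          f"{pca.explained_variance_ratio_.sum():.2f}")
```

Locally, each two-dimensional chart explains most of the variance because the manifold itself is two-dimensional; stitching those charts into one global parameterization is what the cited method adds.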
An evaluation of criteria for measuring the quality of clusters
In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI '99), 1999
"... An important problem in clustering is how to decide what is the best set of clusters for a given data set, in terms of both the number of clusters and the member-ship of those clusters. In this paper we develop four criteria for measuring the quality of different sets of clusters. These criteria are ..."
Abstract
-
Cited by 13 (0 self)
An important problem in clustering is how to decide what is the best set of clusters for a given data set, in terms of both the number of clusters and the membership of those clusters. In this paper we develop four criteria for measuring the quality of different sets of clusters. These criteria are designed so that different criteria prefer cluster sets that generalise at different levels of granularity. We evaluate the suitability of these criteria for non-hierarchical clustering of the results returned by a search engine. We also compare the number of clusters chosen by these criteria with the number of clusters chosen by a group of human subjects. Our results demonstrate that our criteria match the variability exhibited by human subjects, indicating there is no single perfect criterion. Instead, it is necessary to select the correct criterion to match a human subject's generalisation needs.
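The four criteria developed in the paper are not reproduced; the sketch below only illustrates the general task they address, scoring alternative cluster sets at different numbers of clusters, using two standard validity indices as stand-ins. The synthetic data and candidate range are assumptions.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

# Synthetic data with four groups; score cluster sets for k = 2..7.
X, _ = make_blobs(n_samples=500, centers=4, random_state=7)
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k,
          round(calinski_harabasz_score(X, labels), 1),   # higher is better
          round(davies_bouldin_score(X, labels), 2))      # lower is better
```

Different indices can prefer different values of k, which is the same observation the paper makes about its criteria matching different levels of granularity.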