Results 1–10 of 132
Cluster analysis and mathematical programming
 Mathematical Programming, 1997
On Clustering Properties of Hierarchical Self-Organizing Maps
 Artificial Neural Networks, 1992
Abstract

Cited by 62 (6 self)
A very important theoretical result giving impetus to increasing interest in neural networks is that a multilayer feedforward network can approximate any function to arbitrary precision, or, as a classifier, it can form arbitrarily complex class boundaries [2]. In difficult practical classification problems, such as pattern recognition and machine vision, the class boundaries will inevitably be very complex due to variations and distortions in the input images. To reduce the amount of training data needed, the number of independent weights in the classifier must be reduced [1]. The tradeoff is between the capability of the classifier and the amount of training data. In machine vision problems it is often possible to acquire large amounts of training data as long as manual classification of the objects is not required. Thus unsupervised methods can be used in the preprocessing stage without large extra cost. The essential requirement for the preprocessor is that the (unknown) class boundaries should be simpler than in the original data, while any two separable classes should remain separable. Since the class boundaries are not known, the best preprocessing can do is to follow the distributions of the data samples; in other words, clustering.
A methodology for clustering XML documents by structure
 Information Systems, 2006
Abstract

Cited by 50 (0 self)
The processing and management of XML data are popular research issues. However, operations based on the structure of XML data have not received strong attention. These operations involve, among others, the grouping of structurally similar XML documents. Such grouping results from the application of clustering methods with distances that estimate the similarity between tree structures. This paper presents a framework for clustering XML documents by structure. Modeling the XML documents as rooted ordered labeled trees, we study the use of structural distance metrics in hierarchical clustering algorithms to detect groups of structurally similar XML documents. We suggest the use of structural summaries for trees to improve the performance of the distance calculation and, at the same time, to maintain or even improve its quality. Our approach is tested using a prototype testbed.
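As a hedged illustration of the kind of structural distance such a framework might use (not necessarily the paper's metric), one can summarize each document tree by its set of root-to-leaf label paths and compare the summaries with a Jaccard distance. The `(label, [children])` tree encoding and all names below are assumptions:

```python
# Illustrative structural distance between labeled trees, based on
# root-to-leaf label paths. Trees are (label, [children]) tuples.

def label_paths(tree, prefix=()):
    """Return the set of root-to-leaf label paths of a tree."""
    label, children = tree
    path = prefix + (label,)
    if not children:
        return {path}
    return set().union(*(label_paths(c, path) for c in children))

def structural_distance(t1, t2):
    """Jaccard distance between the path summaries of two trees."""
    p1, p2 = label_paths(t1), label_paths(t2)
    return 1.0 - len(p1 & p2) / len(p1 | p2)
```

Identical trees get distance 0, trees with no shared paths get distance 1, so the measure can feed directly into a hierarchical clustering routine.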
A nonparametric statistical approach to clustering via mode identification
 Journal of Machine Learning Research
Abstract

Cited by 34 (11 self)
A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at
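The Modal EM and Ridgeline EM algorithms themselves are not reproduced here; the following is a rough 1-D sketch of the underlying idea the abstract describes: each point ascends a Gaussian kernel density estimate to a local mode, and points sharing a mode form one cluster. The bandwidth, tolerances, and function names are assumptions:

```python
# Illustrative mode-seeking clustering on a 1-D Gaussian KDE.
import math

def ascend(x, data, h, steps=200):
    """Fixed-point ascent of a 1-D Gaussian KDE starting from x."""
    for _ in range(steps):
        w = [math.exp(-((x - d) ** 2) / (2 * h * h)) for d in data]
        x_new = sum(wi * di for wi, di in zip(w, data)) / sum(w)
        if abs(x_new - x) < 1e-8:
            break
        x = x_new
    return x

def mode_clusters(data, h=1.0):
    """Label each point by the KDE mode it ascends to."""
    modes, labels = [], []
    for x in data:
        m = ascend(x, data, h)
        for k, mk in enumerate(modes):
            if abs(m - mk) < 1e-3:  # same mode, up to tolerance
                labels.append(k)
                break
        else:
            modes.append(m)
            labels.append(len(modes) - 1)
    return labels
```

Increasing `h` merges nearby modes, which is the intuition behind the hierarchical extension the abstract mentions.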
On the uniqueness of the selection criterion in neighbor-joining
 Journal of Classification
Abstract

Cited by 30 (1 self)
The Neighbor-Joining (NJ) method of Saitou and Nei is the most widely used distance-based method in phylogenetic analysis. Central to the method is the selection criterion, the formula used to choose which pair of objects to amalgamate next. Here we analyze the NJ selection criterion using an axiomatic approach. We show that any selection criterion that is linear, permutation equivariant, statistically consistent, and based solely on distance data will give the same trees as those created by NJ.
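For reference, the NJ selection criterion the abstract analyzes picks the pair (i, j) minimizing Q(i, j) = (n - 2) d(i, j) - Σ_k d(i, k) - Σ_k d(j, k). A minimal sketch (the function name is illustrative, and ties are broken by first occurrence):

```python
# Sketch of the Neighbor-Joining selection criterion (the "Q matrix").

def nj_pair(d):
    """Return the pair (i, j) minimizing the NJ Q criterion.

    d is a symmetric distance matrix given as a list of lists.
    """
    n = len(d)
    row_sum = [sum(row) for row in d]
    best, best_q = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - row_sum[i] - row_sum[j]
            if q < best_q:
                best, best_q = (i, j), q
    return best
```

On additive (tree-like) distances this criterion provably selects a cherry of the true tree, which is the consistency property the axiomatic analysis builds on.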
Minimum spanning trees for gene expression data clustering
 Genome Informatics, 2001
Abstract

Cited by 26 (2 self)
This paper describes a new framework for microarray gene-expression data clustering. The foundation of this framework is a minimum spanning tree (MST) representation of a set of multidimensional gene expression data. A key property of this representation is that each cluster of the expression data corresponds to one subtree of the MST, which rigorously converts a multidimensional clustering problem to a tree partitioning problem. We have demonstrated that though the inter-data relationship is greatly simplified in the MST representation, no essential information is lost for the purpose of clustering. Two key advantages in representing a set of multidimensional data as an MST are: (1) the simple structure of a tree facilitates efficient implementations of rigorous clustering algorithms, which otherwise are highly computationally challenging; and (2) as an MST-based clustering does not depend on the detailed geometric shape of a cluster, it can overcome many of the problems faced by classical clustering algorithms. Based on the MST representation, we have developed a number of rigorous and efficient clustering algorithms, including two with guaranteed global optimality. We have implemented these algorithms in a software package, EXCAVATOR. To demonstrate its effectiveness, we have tested it on two data sets: expression data from the yeast Saccharomyces cerevisiae, and Arabidopsis expression data in response to chitin elicitation.
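A minimal sketch of the key property the abstract describes (not the EXCAVATOR implementation): build an MST of the points and cut its longest edges so that each remaining subtree is one of k clusters. Euclidean distance, the pure-Python Kruskal construction, and all names are assumptions:

```python
# Illustrative MST clustering: cut the k-1 longest tree edges.
import math
from itertools import combinations

def mst_clusters(points, k):
    """Return a cluster label per point by MST edge removal."""
    n = len(points)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Kruskal: collect the n-1 MST edges.
    mst = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            mst.append((w, i, j))
    # Keep only the n-k shortest MST edges; the subtrees are clusters.
    mst.sort()
    parent = list(range(n))
    for w, i, j in mst[: n - k]:
        parent[find(i)] = find(j)
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]
```

Because only tree edges are cut, each cluster is exactly one subtree of the MST, the correspondence the abstract emphasizes.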
An interior point algorithm for minimum sum of squares clustering
 SIAM J. Sci. Comput., 1997
Abstract

Cited by 26 (10 self)
An exact algorithm is proposed for minimum sum-of-squares nonhierarchical clustering, i.e., for partitioning a given set of points from a Euclidean m-space into a given number of clusters in order to minimize the sum of squared distances from all points to the centroid of the cluster to which they belong. This problem is expressed as a constrained hyperbolic program in 0-1 variables. The resolution method combines an interior point algorithm, i.e., a weighted analytic center column generation method, with branch-and-bound. The auxiliary problem of determining the entering column (i.e., the oracle) is an unconstrained hyperbolic program in 0-1 variables with a quadratic numerator and linear denominator. It is solved through a sequence of unconstrained quadratic programs in 0-1 variables. To accelerate resolution, variable neighborhood search heuristics are used both to get a good initial solution and to solve the auxiliary problem quickly as long as global optimality is not reached. Estimated bounds for the dual variables are deduced from the heuristic solution and used in the resolution process as a trust region. Proved minimum sum-of-squares partitions are determined for the first time for several fairly large data sets from the literature, including Fisher's 150 iris. Key words: classification and discrimination, cluster analysis, interior-point methods, combinatorial optimization
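The interior-point method itself is far beyond a snippet, but the objective it minimizes exactly is simple to state in code. This small sketch (illustrative names, plain Python) evaluates the sum of squared distances from each point to its cluster centroid, i.e., the quantity for which the paper proves global minima:

```python
# Minimum sum-of-squares clustering objective for a given partition.

def sum_of_squares(points, labels):
    """Total squared distance from each point to its cluster centroid."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    total = 0.0
    for pts in groups.values():
        dim = len(pts[0])
        centroid = [sum(p[d] for p in pts) / len(pts) for d in range(dim)]
        total += sum(
            sum((p[d] - centroid[d]) ** 2 for d in range(dim)) for p in pts
        )
    return total
```

Heuristics such as k-means only find local minima of this objective; the paper's contribution is certifying the global minimum for nontrivial data sets.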
Faster reliable phylogenetic analysis
 1999
Abstract

Cited by 25 (5 self)
We present fast new algorithms for phylogenetic reconstruction from distance data or weighted quartets. The methods are conservative: they will only return edges that are well supported by the input data. This approach is not only philosophically attractive; the conservative tree estimate can be used as a basis for further tree refinement or divide-and-conquer algorithms. The capability to process quartet data allows these algorithms to be used in tandem with ordinal or qualitative phylogenetic analysis methods. We provide algorithms for three standard conservative phylogenetic constructions: the Buneman tree, the Refined Buneman tree, and split decomposition. We introduce and exploit combinatorial formalisms involving trees, quartets, and splits, and make particular use of an attractive duality between unrooted trees, splits, and dissimilarities on one hand, and rooted trees, clusters, and similarity measures on the other. Using these techniques, we achieve O(n) improvements in the time complexity of the best previously published algorithms (where n is the number of studied species). Our algorithms will be included in the next edition of the popular SplitsTree software package.
Minimum Spanning Tree Based Clustering Algorithms
Abstract

Cited by 24 (0 self)
The minimum spanning tree clustering algorithm is known to be capable of detecting clusters with irregular boundaries. In this paper, we propose two minimum spanning tree based clustering algorithms. The first algorithm produces a k-partition of a set of points for any given k. The algorithm constructs a minimum spanning tree of the point set and removes edges that satisfy a predefined criterion. The process is repeated until k clusters are produced. The second algorithm partitions a point set into a group of clusters by maximizing the overall standard deviation reduction, without a given k value. We present our experimental results comparing our proposed algorithms to k-means and EM. We also apply our algorithms to image color clustering and compare our algorithms to the standard minimum spanning tree clustering algorithm.
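The paper's own edge-removal criteria are not reproduced here; as a hedged illustration of one classic "predefined criterion" for MST clustering (Zahn-style inconsistency, given here only as an example), an MST edge can be flagged for removal when its weight exceeds the mean MST edge weight by more than c standard deviations. The function name and default c are assumptions:

```python
# Illustrative inconsistency criterion over MST edge weights.
import statistics

def inconsistent_edges(mst_weights, c=2.0):
    """Indices of MST edges more than c standard deviations above the mean."""
    mean = statistics.mean(mst_weights)
    sd = statistics.pstdev(mst_weights)
    return [i for i, w in enumerate(mst_weights) if w > mean + c * sd]
```

Removing the flagged edges splits the tree into subtrees, each taken as one cluster; repeating with tightened thresholds can drive the partition toward a desired k.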