Results 1 - 10 of 149
Survey of clustering algorithms
- IEEE TRANSACTIONS ON NEURAL NETWORKS, 2005
"... Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the ..."
- Cited by 499 (4 self)
Data analysis plays an indispensable role in understanding various phenomena. Cluster analysis, exploratory analysis performed with little or no prior knowledge, comprises research developed across a wide variety of communities. This diversity, on one hand, equips us with many tools; on the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications on some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive effort. Several closely related topics, such as proximity measures and cluster validation, are also discussed.
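As a concrete illustration of the proximity measures such a survey covers, the short sketch below contrasts two standard choices. The function names and the toy vectors are mine, not taken from the paper.

```python
import numpy as np

def euclidean_distance(x, y):
    """Dissimilarity measure: 0 for identical vectors, grows with separation."""
    return float(np.linalg.norm(np.asarray(x, float) - np.asarray(y, float)))

def cosine_similarity(x, y):
    """Similarity measure in [-1, 1] that depends only on direction, not magnitude."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Two profiles with the same shape but different scale:
# nearly identical in the cosine sense, far apart in the Euclidean sense.
a, b = np.array([1.0, 2.0, 3.0]), np.array([10.0, 20.0, 30.0])
print(euclidean_distance(a, b), cosine_similarity(a, b))
```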
Minimum sum-squared residue co-clustering of gene expression data
- In SDM, 2004
"... Microarray experiments have been extensively used for simultaneously measuring DNA expression levels of thousands of genes in genome research. A key step in the analysis of gene expression data is the clustering of genes into groups that show similar expression values over a range of conditions. Sin ..."
- Cited by 116 (6 self)
Microarray experiments have been used extensively to measure the expression levels of thousands of genes simultaneously in genome research. A key step in the analysis of gene expression data is the clustering of genes into groups that show similar expression values over a range of conditions. Since only a small subset of the genes participates in any cellular process of interest, focusing on subsets of genes and conditions lowers the noise induced by the other genes and conditions; a co-cluster characterizes such a subset of interest. Cheng and Church [3] introduced an effective measure of co-cluster quality based on mean squared residue. In this paper, we use two similar squared-residue measures and propose two fast k-means-like co-clustering algorithms corresponding to the two residue measures. Our algorithms discover k row clusters and l column clusters simultaneously while monotonically decreasing the respective squared residues. Our co-clustering algorithms inherit the simplicity, efficiency, and wide applicability of the k-means algorithm. Minimizing the residues may also be formulated as a trace optimization problem, which admits a spectral relaxation that we use as a principled initialization for our iterative algorithms. We further enhance our algorithms with an incremental local search strategy that helps avoid empty clusters and escape poor local minima. We illustrate co-clustering results on a yeast cell cycle dataset and a human B-cell lymphoma dataset. Our experiments show that our co-clustering algorithms are efficient and able to discover coherent co-clusters.
Keywords: gene expression, co-clustering, biclustering, residue, spectral relaxation
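To make the alternating, k-means-like idea concrete, here is a minimal sketch that greedily reduces a sum-squared residue by reassigning rows and columns in turn. It uses only the simpler residue (deviation from the co-cluster mean), omits the spectral initialization and incremental local search described in the abstract, and all names are mine; it is not the authors' implementation.

```python
import numpy as np

def _cocluster_means(A, rows, cols, k, l):
    """Mean of each (row-cluster, column-cluster) block; 0 for empty blocks."""
    mu = np.zeros((k, l))
    for r in range(k):
        for c in range(l):
            block = A[np.ix_(rows == r, cols == c)]
            mu[r, c] = block.mean() if block.size else 0.0
    return mu

def cocluster_sketch(A, k, l, n_iter=20, seed=0):
    """Alternately reassign rows and columns to greedily reduce the
    sum-squared residue  sum_ij (A[i, j] - mean of its co-cluster)^2."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    rows = rng.integers(k, size=m)      # row-cluster label of each row
    cols = rng.integers(l, size=n)      # column-cluster label of each column
    for _ in range(n_iter):
        # Row pass: move each row to the row cluster giving it the smallest residue.
        mu = _cocluster_means(A, rows, cols, k, l)
        row_cost = np.stack([((A - mu[r][cols]) ** 2).sum(axis=1) for r in range(k)])
        rows = row_cost.argmin(axis=0)
        # Column pass, with block means refreshed for the new row assignment.
        mu = _cocluster_means(A, rows, cols, k, l)
        col_cost = np.stack([((A - mu[rows, c][:, None]) ** 2).sum(axis=0) for c in range(l)])
        cols = col_cost.argmin(axis=0)
    return rows, cols

# Toy usage: a 6 x 4 matrix with an obvious 2 x 2 block structure.
A = np.array([[5, 5, 0, 0]] * 3 + [[0, 0, 5, 5]] * 3, dtype=float)
print(cocluster_sketch(A, k=2, l=2))
```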
Clique relaxations in social network analysis: The maximum k-plex problem
- 2006
"... This paper introduces and studies the maximum k-plex problem, which arises in social network analysis, but can also be used in several other important application areas, including wireless networks, telecommunications, and graph-based data mining. We establish NP-completeness of the decision version ..."
- Cited by 40 (5 self)
This paper introduces and studies the maximum k-plex problem, which arises in social network analysis but can also be applied in several other important areas, including wireless networks, telecommunications, and graph-based data mining. We establish NP-completeness of the decision version of the problem on arbitrary graphs. An integer programming formulation is presented and a basic polyhedral study of the problem is carried out. A branch-and-cut implementation is discussed, and computational results on proposed benchmark instances and real-life scale-free graphs are provided.
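For readers unfamiliar with the relaxation: a k-plex is a vertex set in which every member is adjacent to all but at most k - 1 of the others (k = 1 recovers a clique). The sketch below only checks that definition and finds a maximum k-plex by brute force on tiny graphs; it illustrates the problem, not the paper's integer programming or branch-and-cut approach, and the adjacency-dictionary input format is my own assumption.

```python
from itertools import combinations

def is_k_plex(adj, S, k):
    """True if every vertex of S is adjacent to at least |S| - k vertices of S
    (equivalently, misses at most k - 1 of the others). `adj` maps each vertex
    to the set of its neighbours."""
    S = set(S)
    return all(len(adj[v] & (S - {v})) >= len(S) - k for v in S)

def max_k_plex_bruteforce(adj, k):
    """Try subsets from largest to smallest and return the first k-plex found.
    Exponential; meant only to make the definition concrete on toy graphs."""
    vertices = list(adj)
    for size in range(len(vertices), 0, -1):
        for S in combinations(vertices, size):
            if is_k_plex(adj, S, k):
                return set(S)
    return set()

# Toy usage: a 4-cycle is a 2-plex but not a clique (1-plex).
cycle = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
print(max_k_plex_bruteforce(cycle, k=2))   # {1, 2, 3, 4}
print(max_k_plex_bruteforce(cycle, k=1))   # a single edge, e.g. {1, 2}
```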
Tree-Based Pursuit: Algorithm and Properties
- 2005
"... This paper proposes a tree-based pursuit algorithm that efficiently trades off complexity and approximation performance for overcomplete signal expansions. Finding the sparsest representation of a signal using a redundant dictionary is, in general, a NP-Hard problem. Even sub-optimal algorithms such ..."
- Cited by 30 (7 self)
This paper proposes a tree-based pursuit algorithm that efficiently trades off complexity and approximation performance for overcomplete signal expansions. Finding the sparsest representation of a signal in a redundant dictionary is, in general, an NP-hard problem, and even sub-optimal algorithms such as Matching Pursuit remain highly complex. We propose a structuring strategy that can be applied to any redundant set of functions and that groups similar atoms together. A similarity measure based on coherence allows a highly redundant sub-dictionary of atoms to be represented by a single element, called a molecule. When the clustering is applied recursively, first on atoms and then on molecules, it naturally leads to a tree structure. We then present a new pursuit algorithm that uses the structure created by clustering as a decision tree. This tree-based algorithm offers a significant complexity reduction with respect to Matching Pursuit, as it prunes large parts of the dictionary while traversing the tree. Recent results on incoherent dictionaries are extended to molecules, while the highly redundant nature of the dictionary stays hidden behind the tree structure. We then derive recovery conditions on the structured dictionary under which tree-based pursuit is guaranteed to converge. Experimental results show that the complexity gain offered by tree-based pursuit generally comes with only a small penalty in approximation performance: the dimensionality of the problem is reduced by the tree construction without significant loss of information.
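The core idea, greedy selection guided by a tree of molecules instead of a full dictionary scan, can be sketched in a few lines. This is only my illustration of the descend-and-project loop; the node structure, names, and fixed iteration count are assumptions, and the paper's actual construction of molecules and its recovery analysis are not reproduced here.

```python
import numpy as np

class Node:
    """Tree node: an internal node holds a 'molecule' (representative direction)
    and children; a leaf holds a dictionary atom (unit-norm vector)."""
    def __init__(self, vector, children=None):
        self.vector = np.asarray(vector, float) / np.linalg.norm(vector)
        self.children = children or []

def tree_based_pursuit(signal, root, n_atoms=10):
    """Greedy sketch: at each iteration descend the tree, always following the
    child whose molecule/atom correlates most with the residual, then subtract
    the projection on the selected leaf atom. This mirrors Matching Pursuit but
    explores one root-to-leaf path per iteration instead of the full dictionary."""
    residual = np.asarray(signal, float).copy()
    chosen = []
    for _ in range(n_atoms):
        node = root
        while node.children:                      # descend to a leaf
            node = max(node.children,
                       key=lambda c: abs(np.dot(c.vector, residual)))
        coeff = float(np.dot(node.vector, residual))
        residual -= coeff * node.vector           # project out the chosen atom
        chosen.append((node, coeff))
    return chosen, residual
```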
Approximate kernel k-means: Solution to large scale kernel clustering
- In Proceedings of the International Conference on Knowledge Discovery and Data Mining
"... Digital data explosion mandates the development of scalable tools to organize the data in a meaningful and easily accessible form. Clustering is a commonly used tool for data organization. However, many clustering algorithms designed to handle large data sets assume linear separability of data and h ..."
- Cited by 27 (6 self)
The digital data explosion mandates the development of scalable tools to organize data in a meaningful and easily accessible form. Clustering is a commonly used tool for data organization. However, many clustering algorithms designed to handle large data sets assume linear separability of the data and hence do not perform well on real-world data sets. While kernel-based clustering algorithms can capture the non-linear structure in data, they do not scale well in speed and memory when the number of objects to be clustered exceeds tens of thousands. We propose an approximation scheme for kernel k-means, termed approximate kernel k-means, that reduces both the computational complexity and the memory requirements by employing a randomized approach. We show both analytically and empirically that the performance of approximate kernel k-means is similar to that of kernel k-means, but with dramatically reduced run-time complexity and memory requirements.
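The sampling idea can be illustrated with a Nystrom-style stand-in: compute the kernel only against a random subset of landmark points, map everything into the landmark-induced feature space, and run ordinary Lloyd iterations there. This is a sketch in the same spirit, with names and the RBF kernel chosen by me; the paper's algorithm instead restricts the cluster centers to the span of the sampled points and works with the reduced kernel blocks directly.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def sampled_kernel_kmeans(X, n_clusters, n_samples=100, gamma=1.0, seed=0):
    """Kernel values are computed only against m sampled landmarks; every point
    is mapped into the landmark-induced feature space (Nystrom map) and plain
    k-means runs there. Illustrative only, not the paper's exact algorithm."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(n_samples, len(X)), replace=False)
    S = X[idx]                               # sampled landmark points
    K_ss = rbf_kernel(S, S, gamma)           # m x m kernel among landmarks
    K_ns = rbf_kernel(X, S, gamma)           # n x m kernel to landmarks
    w, V = np.linalg.eigh(K_ss)              # small eigendecomposition
    w = np.clip(w, 1e-12, None)
    features = K_ns @ V / np.sqrt(w)         # n x m approximate feature map

    # Plain Lloyd iterations on the approximate features.
    centers = features[rng.choice(len(features), n_clusters, replace=False)]
    for _ in range(50):
        d = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for c in range(n_clusters):
            if (labels == c).any():
                centers[c] = features[labels == c].mean(axis=0)
    return labels
```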
Mining gene expression data with pattern structures in formal concept analysis
- Information Sciences, 2011
"... concept analysis ..."
(Show Context)
Development of Two-Stage SVM-RFE gene selection . . .
- IEEE TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS
"... Extracting a subset of informative genes from microarray expression data is a critical data preparation step in cancer classification and other biological function analyses. Though many algorithms have been developed, the Support Vector Machine–Recursive Feature Elimination (SVM-RFE) algorithm is o ..."
- Cited by 21 (0 self)
Extracting a subset of informative genes from microarray expression data is a critical data-preparation step in cancer classification and other biological function analyses. Though many algorithms have been developed, the Support Vector Machine-Recursive Feature Elimination (SVM-RFE) algorithm is one of the best gene feature selection algorithms. It assumes that a smaller "filter-out" factor in the SVM-RFE, which results in a smaller number of gene features being eliminated in each recursion, should lead to the extraction of a better gene subset. Because the SVM-RFE is highly sensitive to the "filter-out" factor, our simulations have shown that this assumption is not always correct and that the SVM-RFE is an unstable algorithm. To select a set of key gene features for reliable prediction of cancer types or subtypes and other applications, a new two-stage SVM-RFE algorithm has been developed. It is designed to effectively eliminate most of the irrelevant, redundant, and noisy genes while keeping information loss small at the first stage; a fine selection of the final gene subset is then performed at the second stage. The two-stage SVM-RFE overcomes the instability of the SVM-RFE and thereby achieves better utility. We have demonstrated that the two-stage SVM-RFE is significantly more accurate and more reliable than the SVM-RFE and three correlation-based methods, based on our analysis of three publicly available microarray expression datasets. Furthermore, the two-stage SVM-RFE is computationally efficient because its time complexity is O(d log2 d), where d is the size of the original gene set. Supplementary material is available at
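As a rough sketch of the recursive elimination loop and the role of the "filter-out" factor, the following uses scikit-learn's LinearSVC as the linear SVM and drops a fixed fraction of the lowest-weighted remaining features per recursion. A two-stage variant in the spirit of the paper would call this twice, first with a coarse fraction and then a fine one on the survivors; the function and parameter names are mine and this is not the authors' code.

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y, n_keep, filter_out=0.1):
    """Plain SVM-RFE sketch: repeatedly train a linear SVM and discard the
    fraction `filter_out` of remaining features with the smallest squared
    weights, until only `n_keep` features remain."""
    remaining = np.arange(X.shape[1])
    while len(remaining) > n_keep:
        clf = LinearSVC(C=1.0, max_iter=5000).fit(X[:, remaining], y)
        scores = (clf.coef_ ** 2).sum(axis=0)          # ranking criterion w_i^2
        n_drop = max(1, int(len(remaining) * filter_out))
        n_drop = min(n_drop, len(remaining) - n_keep)
        drop = np.argsort(scores)[:n_drop]             # lowest-ranked features
        remaining = np.delete(remaining, drop)
    return remaining
```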
A Survey of Evolutionary Algorithms for Clustering
"... Abstract — This paper presents a survey of evolutionary algorithms designed for clustering tasks. It tries to reflect the profile of this area by focusing more on those subjects that have been given more importance in the literature. In this context, most of the paper is devoted to partitional algor ..."
- Cited by 19 (1 self)
This paper presents a survey of evolutionary algorithms designed for clustering tasks. It tries to reflect the profile of this area by focusing on the subjects that have received the most attention in the literature. In this context, most of the paper is devoted to partitional algorithms that look for hard clusterings of data, though overlapping (i.e., soft and fuzzy) approaches are also covered. The paper is original in two main respects. First, it provides an up-to-date overview that is fully devoted to evolutionary algorithms for clustering, is not limited to any particular kind of evolutionary approach, and covers advanced topics such as multi-objective and ensemble-based evolutionary clustering. Second, it provides a taxonomy that highlights some very important aspects of evolutionary data clustering, namely fixed or variable number of clusters; cluster-oriented or non-oriented operators; context-sensitive or context-insensitive operators; guided or unguided operators; binary, integer, or real encodings; and centroid-based, medoid-based, label-based, tree-based, or graph-based representations, among others. A number of references are provided that describe applications of evolutionary algorithms for clustering in different domains, such as image processing, computer security, and bioinformatics. The paper ends by addressing some important issues and open questions that can be the subject of future research.
Index Terms: evolutionary algorithms, clustering, applications
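To ground some of the taxonomy's vocabulary (label-based encoding, context-insensitive mutation, elitist selection), here is a toy evolutionary clustering loop. It is a deliberately minimal sketch written for illustration, with arbitrary parameter choices; it is not an algorithm taken from the survey.

```python
import numpy as np

def sse(X, labels, k):
    """Within-cluster sum of squared errors for a label-based encoding."""
    total = 0.0
    for c in range(k):
        pts = X[labels == c]
        if len(pts):
            total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

def evolutionary_clustering(X, k, pop_size=30, generations=100, seed=0):
    """Each individual assigns a cluster label to every point (label-based,
    integer encoding); fitness is the within-cluster SSE; variation is a simple
    context-insensitive point mutation combined with elitist selection."""
    rng = np.random.default_rng(seed)
    n = len(X)
    pop = [rng.integers(k, size=n) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda ind: sse(X, ind, k))       # elitist ranking
        survivors = pop[: pop_size // 2]
        children = []
        for parent in survivors:
            child = parent.copy()
            flip = rng.random(n) < 0.05                # mutate ~5% of the labels
            child[flip] = rng.integers(k, size=flip.sum())
            children.append(child)
        pop = survivors + children
    return min(pop, key=lambda ind: sse(X, ind, k))
```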