Consensus clustering -- A resampling-based method for class discovery and visualization of gene expression microarray data
- MACHINE LEARNING 52 (2003) 91–118 FUNCTIONAL GENOMICS SPECIAL ISSUE
, 2003
Exploring the Conditional Coregulation of Yeast Gene Expression Through Fuzzy K-Means Clustering
, 2002
Cited by 137 (0 self)
Background: Organisms simplify the orchestration of gene expression by coregulating genes whose products function together in the cell. Many proteins serve different roles depending on the demands of the organism, and therefore the corresponding genes are often coexpressed with different groups of genes under different situations. This poses a challenge in analyzing whole-genome expression data, because many genes will be similarly expressed to multiple, distinct groups of genes. Because most commonly used analytical methods cannot appropriately represent these relationships, the connections between conditionally coregulated genes are often missed.
Bagging to improve the accuracy of a clustering procedure
- Bioinformatics
, 2003
Cited by 132 (0 self)
Motivation: The microarray technology is increasingly being applied in biological and medical research to address a wide range of problems such as the classification of tumors. An important statistical question associated with tumor classification is the identification of new tumor classes using gene expression profiles. Essential aspects of this clustering problem include identifying accurate partitions of the tumor samples into clusters and assessing the confidence of cluster assignments for individual samples.
Results: Two new resampling methods, inspired from bagging in prediction, are proposed to improve and assess the accuracy of a given clustering procedure. In these ensemble methods, a partitioning clustering procedure is applied to bootstrap learning sets and the resulting multiple partitions are combined by voting or the creation of a new dissimilarity matrix. As in prediction, the motivation behind bagging is to reduce variability in the partitioning results via averaging. The performances of the new and existing methods were compared using simulated data and gene expression data from two recently published cancer microarray studies. The bagged clustering procedures were in general at least as accurate and often substantially more accurate than a single application of the partitioning clustering procedure. A valuable by-product of bagged clustering are the cluster votes which can be used to assess the confidence of cluster assignments for individual observations.
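The "new dissimilarity matrix" idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes plain Lloyd's k-means as the partitioning method, assigns every original observation to the nearest centroid fitted on each bootstrap learning set, and defines dissimilarity as one minus the fraction of bootstrap runs in which two observations share a cluster. The function names `kmeans` and `bagged_dissimilarity` are hypothetical.

```python
import numpy as np


def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's k-means (an assumed stand-in partitioning method).

    Returns the final centroids as an array of shape (k, n_features).
    """
    # Initialise from distinct points so duplicated bootstrap rows
    # cannot produce two identical starting centroids.
    uniq = np.unique(X, axis=0)
    centers = uniq[rng.choice(len(uniq), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old centroid if a cluster is empty
                centers[j] = X[labels == j].mean(axis=0)
    return centers


def bagged_dissimilarity(X, k, n_boot=50, seed=0):
    """Dissimilarity = 1 - (fraction of bootstrap runs placing i and j together)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    together = np.zeros((n, n))
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # bootstrap learning set
        centers = kmeans(X[idx], k, rng)
        # Simplification: label all original observations by nearest centroid.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        together += labels[:, None] == labels[None, :]
    return 1.0 - together / n_boot
```

The resulting matrix can be passed to any clustering method that accepts pairwise dissimilarities; entries near 0 or 1 indicate pairs whose co-assignment is stable across resamples.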
Stability-Based Validation of Clustering Solutions
, 2004
Cited by 99 (7 self)
Data clustering describes a set of frequently employed techniques in exploratory data analysis to extract “natural” group structure in data. Such groupings need to be validated to separate the signal in the data from spurious structure. In this context, finding an appropriate number of clusters is a particularly important model selection question. We introduce a measure of cluster stability to assess the validity of a cluster model. This stability measure quantifies the reproducibility of clustering solutions on a second sample, and it can be interpreted as a classification risk with regard to class labels produced by a clustering algorithm. The preferred number of clusters is determined by minimizing this classification risk as a function of the number of clusters. Convincing results are achieved on simulated as well as gene expression data sets. Comparisons to other methods demonstrate the competitive performance of our method and its suitability as a general validation tool for clustering solutions in real-world problems.
Optimal cluster preserving embedding of nonmetric proximity data
- IEEE Trans. Pattern Analysis and Machine Intelligence
, 2003
Cited by 54 (4 self)
For several major applications of data analysis, objects are often not represented as feature vectors in a vector space, but rather by a matrix gathering pairwise proximities. Such pairwise data often violates metricity and, therefore, cannot be naturally embedded in a vector space. Concerning the problem of unsupervised structure detection or clustering, in this paper, a new embedding method for pairwise data into Euclidean vector spaces is introduced. We show that all clustering methods, which are invariant under additive shifts of the pairwise proximities, can be reformulated as grouping problems in Euclidean spaces. The most prominent property of this constant shift embedding framework is the complete preservation of the cluster structure in the embedding space. Restating pairwise clustering problems in vector spaces has several important consequences, such as the statistical description of the clusters by way of cluster prototypes, the generic extension of the grouping procedure to a discriminative prediction rule, and the applicability of standard preprocessing methods like denoising or dimensionality reduction.
Index Terms: Clustering, pairwise proximity data, cost function, embedding, MDS.
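A minimal sketch of the constant shift idea described above, under stated assumptions: the input `D` is a symmetric dissimilarity matrix with zero diagonal, treated as squared distances; it is double-centred, shifted by its smallest eigenvalue so that it becomes positive semidefinite, and then factorised. The function name `constant_shift_embedding` and its return convention are illustrative, not taken from the paper.

```python
import numpy as np


def constant_shift_embedding(D):
    """Embed a symmetric, zero-diagonal dissimilarity matrix in Euclidean space.

    Returns (X, offset): the squared Euclidean distances between rows of X
    equal D plus `offset` on every off-diagonal entry.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n      # centring projection
    Sc = -0.5 * Q @ D @ Q                    # centred similarity matrix
    Sc = (Sc + Sc.T) / 2.0                   # remove numerical asymmetry
    lam_min = np.linalg.eigvalsh(Sc)[0]      # eigenvalues come in ascending order
    shift = max(0.0, -lam_min)               # shift only if Sc is indefinite
    S = Sc + shift * np.eye(n)               # now positive semidefinite
    w, V = np.linalg.eigh(S)
    w = np.clip(w, 0.0, None)                # guard against tiny negative round-off
    X = V * np.sqrt(w)                       # rows are the embedded objects
    return X, 2.0 * shift
```

Because every off-diagonal dissimilarity changes by the same constant, any cost function that is invariant under additive shifts ranks partitions identically before and after the embedding, which is the property the abstract highlights.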
Tight clustering: a resampling-based approach for identifying stable and tight patterns in data
- Biometrics
, 2005
Cited by 54 (5 self)
In this paper we propose a method for clustering that produces tight and stable clusters without forcing all points into clusters. The methodology is general but was initially motivated from cluster analysis of microarray experiments. Most current algorithms aim to assign all genes into clusters. For many biological studies, however, we are mainly interested in identifying the most informative, tight and stable clusters of sizes, say, 20-60 genes for further investigation. We want to avoid the contamination of tightly regulated expression patterns of biologically relevant genes due to other genes whose expressions are only loosely compatible with these patterns. “Tight Clustering” has been developed specifically to address this problem. It applies K-means clustering as an intermediate clustering engine. Early truncation of hierarchical clustering tree is used to overcome the local minimum problem in K-means clustering. The tightest and most stable clusters are identified in a sequential manner through an analysis of the tendency of genes to be grouped together under repeated resampling. We validated this method in a simulated example and applied it to analyze a set of expression profiles in the study of embryonic stem cells.
Stability-based model selection
- In Advances in Neural Information Processing Systems
, 2002
Cited by 43 (7 self)
Model selection is linked to model assessment, which is the problem of comparing different models, or model parameters, for a specific learning task. For supervised learning, the standard practical technique is cross-validation, which is not applicable for semi-supervised and unsupervised settings. In this paper, a new model assessment scheme is introduced which is based on a notion of stability. The stability measure yields an upper bound to cross-validation in the supervised case, but extends to semi-supervised and unsupervised problems. In the experimental part, the performance of the stability measure is studied for model order selection in comparison to standard techniques in this area.
Finding Predictive Gene Groups from Microarray Data
- Journal of Multivariate Analysis
, 2004
Cited by 42 (5 self)
Microarray experiments generate large datasets with expression values for thousands of genes, but not more than a few dozens of samples. A challenging task with these data is to reveal groups of genes which act together and whose collective expression is strongly associated with an outcome variable of interest. To find these groups, we suggest the use of supervised algorithms: these are procedures which use external information about the response variable for grouping the genes. We present Pelora, an algorithm based on penalized logistic regression analysis, that combines gene selection, gene grouping and sample classification in a supervised, simultaneous way. With an empirical study on six different microarray datasets, we show that Pelora identifies gene groups whose expression centroids have very good predictive potential and yield results that can keep up with state-of-the-art classification methods based on single genes. Thus, our gene groups can be beneficial in medical diagnostics and prognostics, but they may also provide more biological insights into gene function and regulation.
Techniques for clustering gene expression data
- COMPUT BIOL MED
, 2007
Cited by 34 (3 self)
Many clustering techniques have been proposed for the analysis of gene expression data obtained from microarray experiments. However, choice of suitable method(s) for a given experimental dataset is not straightforward. Common approaches do not translate well and fail to take account of the data profile. This review paper surveys state-of-the-art applications which recognise these limitations and address them. As such, it provides a framework for the evaluation of clustering in gene expression analyses. The nature of microarray data is discussed briefly. Selected examples are presented for clustering methods considered.
A CLUE for CLUster Ensembles
- Journal of Statistical Software
, 2005
Cited by 34 (7 self)
Cluster ensembles are collections of individual solutions to a given clustering problem which are useful or necessary to consider in a wide range of applications. The R package clue provides an extensible computational environment for creating and analyzing cluster ensembles, with basic data structures for representing partitions and hierarchies, and facilities for computing on these, including methods for measuring proximity and obtaining consensus and “secondary” clusterings.