
A prediction-based resampling method for estimating the number of clusters in a dataset

by S Dudoit, J Fridlyand
Venue: Genome Biol
Results 1 - 10 of 235

Consensus clustering -- A resampling-based method for class discovery and visualization of gene expression microarray data

by Stefano Monti, Pablo Tamayo, Jill Mesirov, Todd Golub - Machine Learning 52 (2003) 91–118, Functional Genomics Special Issue, 2003
Cited by 255 (11 self)

Citation Context

...te perturbations of the original data set, so as to assess the stability of the clustering results with respect to sampling variability (Ben-Hur, Elisseeff, & Guyon, 2002; Bhattacharjee et al., 2001; Dudoit & Fridlyand, 2002; Jain & Moreau, 1988; Levine & Domany, 2001; Tibshirani et al., 2001a). In particular, in Bhattacharjee et al. (2001) the use of bootstrapping to assess clustering stability and to validate the resul...

Exploring the Conditional Coregulation of Yeast Gene Expression Through Fuzzy K-Means Clustering

by A P Gasch, M P Eisen, 2002
Cited by 137 (0 self)
Background: Organisms simplify the orchestration of gene expression by coregulating genes whose products function together in the cell. Many proteins serve different roles depending on the demands of the organism, and therefore the corresponding genes are often coexpressed with different groups of genes under different situations. This poses a challenge in analyzing whole-genome expression data, because many genes will be similarly expressed to multiple, distinct groups of genes. Because most commonly used analytical methods cannot appropriately represent these relationships, the connections between conditionally coregulated genes are often missed.

Bagging to improve the accuracy of a clustering procedure

by Sandrine Dudoit, Jane Fridlyand - Bioinformatics, 2003
Cited by 132 (0 self)
Motivation: The microarray technology is increasingly being applied in biological and medical research to address a wide range of problems such as the classification of tumors. An important statistical question associated with tumor classification is the identification of new tumor classes using gene expression profiles. Essential aspects of this clustering problem include identifying accurate partitions of the tumor samples into clusters and assessing the confidence of cluster assignments for individual samples. Results: Two new resampling methods, inspired from bagging in prediction, are proposed to improve and assess the accuracy of a given clustering procedure. In these ensemble methods, a partitioning clustering procedure is applied to bootstrap learning sets and the resulting multiple partitions are combined by voting or the creation of a new dissimilarity matrix. As in prediction, the motivation behind bagging is to reduce variability in the partitioning results via averaging. The performances of the new and existing methods were compared using simulated data and gene expression data from two recently published cancer microarray studies. The bagged clustering procedures were in general at least as accurate and often substantially more accurate than a single application of the partitioning clustering procedure. A valuable by-product of bagged clustering are the cluster votes which can be used to assess the confidence of cluster assignments for individual observations.

Citation Context

...ing (see Kaufman and Rousseeuw (1990) for a discussion of fuzzy clustering). An interesting feature of the BagClust1 procedure was raised in the application to the NCI 60 dataset using K = 8 clusters (Dudoit and Fridlyand, 2001). Although each application of PAM to a bootstrap learning set produced eight clusters, the plurality voting reduced the number of clusters to 2. This suggests that BagClust1 may be able to correct f...
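
As a rough illustration of the "new dissimilarity matrix" idea in the abstract above, the sketch below clusters bootstrap samples and records how often two observations land in the same cluster. It is not the authors' BagClust procedure: the names `two_means_1d` and `bagged_dissimilarity` are hypothetical, a toy 1-D 2-means stands in for the partitioning method (the paper uses PAM), and the choice to set the dissimilarity to 1 for pairs never drawn together is an assumption.

```python
import random

def two_means_1d(values, iters=20):
    """Stand-in clusterer (assumption): 2-means on 1-D data, seeded at min and max."""
    lo, hi = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign each value to the nearer of the two centres.
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        left = [v for v, lab in zip(values, labels) if lab == 0]
        right = [v for v, lab in zip(values, labels) if lab == 1]
        if left:
            lo = sum(left) / len(left)
        if right:
            hi = sum(right) / len(right)
    return labels

def bagged_dissimilarity(data, n_boot=200, seed=0):
    """d[i][j] = 1 - (times i and j share a cluster) / (times both were drawn)."""
    rng = random.Random(seed)
    n = len(data)
    together = [[0] * n for _ in range(n)]
    both_drawn = [[0] * n for _ in range(n)]
    for _ in range(n_boot):
        # Bootstrap learning set: draw n indices with replacement.
        sample = [rng.randrange(n) for _ in range(n)]
        labels = two_means_1d([data[i] for i in sample])
        label_of = dict(zip(sample, labels))  # copies of one observation share a label
        drawn = list(label_of)
        for i in drawn:
            for j in drawn:
                both_drawn[i][j] += 1
                if label_of[i] == label_of[j]:
                    together[i][j] += 1
    # Pairs never drawn together get dissimilarity 1.0 (an assumption of this sketch).
    return [[1.0 - together[i][j] / both_drawn[i][j] if both_drawn[i][j] else 1.0
             for j in range(n)] for i in range(n)]
```

The resulting matrix could then be passed to any dissimilarity-based clustering method. On two well-separated groups such as `[0.0, 0.1, 0.2, 5.0, 5.1, 5.2]`, within-group entries should be close to 0 and between-group entries close to 1.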

Stability-Based Validation of Clustering Solutions

by Tilman Lange, Volker Roth, Mikio L. Braun, Joachim M. Buhmann, 2004
Cited by 99 (7 self)
Data clustering describes a set of frequently employed techniques in exploratory data analysis to extract “natural” group structure in data. Such groupings need to be validated to separate the signal in the data from spurious structure. In this context, finding an appropriate number of clusters is a particularly important model selection question. We introduce a measure of cluster stability to assess the validity of a cluster model. This stability measure quantifies the reproducibility of clustering solutions on a second sample, and it can be interpreted as a classification risk with regard to class labels produced by a clustering algorithm. The preferred number of clusters is determined by minimizing this classification risk as a function of the number of clusters. Convincing results are achieved on simulated as well as gene expression data sets. Comparisons to other methods demonstrate the competitive performance of our method and its suitability as a general validation tool for clustering solutions in real-world problems.
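
One ingredient of such a stability measure can be sketched briefly. Cluster labels are arbitrary, so two labelings of the same sample have to be compared up to a renaming of the clusters. The helper below (the name `label_disagreement` is hypothetical, and the brute-force search over permutations is only practical for small k) returns the smallest fraction of mismatched points over all relabelings. It is an illustration of the comparison step only, not the full procedure of the paper.

```python
from itertools import permutations

def label_disagreement(a, b, k):
    """Fraction of points labelled differently, minimised over relabelings of b.

    a, b: equal-length sequences of labels in range(k).
    """
    n = len(a)
    best = n
    for perm in permutations(range(k)):
        # perm[y] is the new name given to cluster y of labeling b.
        mismatches = sum(1 for x, y in zip(a, b) if x != perm[y])
        best = min(best, mismatches)
    return best / n
```

For example, `label_disagreement([0, 0, 1, 1], [1, 1, 0, 0], 2)` is 0.0 because the two labelings differ only in the names of the clusters. Averaging this quantity over repeated splits of the data, for each candidate number of clusters, gives a rough instability score in the spirit of the abstract.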

Optimal cluster preserving embedding of nonmetric proximity data

by Volker Roth, Julian Laub, Motoaki Kawanabe, Joachim M. Buhmann - IEEE Trans. Pattern Analysis and Machine Intelligence, 2003
Cited by 54 (4 self)
For several major applications of data analysis, objects are often not represented as feature vectors in a vector space, but rather by a matrix gathering pairwise proximities. Such pairwise data often violates metricity and, therefore, cannot be naturally embedded in a vector space. Concerning the problem of unsupervised structure detection or clustering, in this paper, a new embedding method for pairwise data into Euclidean vector spaces is introduced. We show that all clustering methods, which are invariant under additive shifts of the pairwise proximities, can be reformulated as grouping problems in Euclidean spaces. The most prominent property of this constant shift embedding framework is the complete preservation of the cluster structure in the embedding space. Restating pairwise clustering problems in vector spaces has several important consequences, such as the statistical description of the clusters by way of cluster prototypes, the generic extension of the grouping procedure to a discriminative prediction rule, and the applicability of standard preprocessing methods like denoising or dimensionality reduction. Index Terms: clustering, pairwise proximity data, cost function, embedding, MDS.

Citation Context

... the embedding space, a deterministic annealing method was applied. Concerning the selection of the “correct” number of clusters, we used the concept of cluster stability which has been introduced in [25] and refined in [26]. The main idea is to draw resamples from the data set and then to compare the inferred data-partitions across these resamples. The variations of the partitions are transformed int...

Tight clustering: a resampling-based approach for identifying stable and tight patterns in data

by George C. Tseng, Wing Hung Wong - Biometrics, 2005
Cited by 54 (5 self)
In this paper we propose a method for clustering that produces tight and stable clusters without forcing all points into clusters. The methodology is general but was initially motivated from cluster analysis of microarray experiments. Most current algorithms aim to assign all genes into clusters. For many biological studies, however, we are mainly interested in identifying the most informative, tight and stable clusters of sizes, say, 20-60 genes for further investigation. We want to avoid the contamination of tightly regulated expression patterns of biologically relevant genes due to other genes whose expressions are only loosely compatible with these patterns. “Tight Clustering” has been developed specifically to address this problem. It applies K-means clustering as an intermediate clustering engine. Early truncation of hierarchical clustering tree is used to overcome the local minimum problem in K-means clustering. The tightest and most stable clusters are identified in a sequential manner through an analysis of the tendency of genes to be grouped together under repeated resampling. We validated this method in a simulated example and applied it to analyze a set of expression profiles in the study of embryonic stem cells.

Citation Context

...ter rules. Very recently, Tibshirani et al. (2001) introduced a promising method that utilized resampling techniques. They selected k to maximize the prediction rate estimated by resampling (see also Dudoit and Fridlyand, 2002). In this paper, we have further developed the resampling approach to identify tight and stable clusters. In our approach, the tight clusters are obtained sequentially, usually in the order of decrea...

Stability-based model selection

by Tilman Lange, Mikio L. Braun, Volker Roth, Joachim M. Buhmann - In Advances in Neural Information Processing Systems, 2002
Cited by 43 (7 self)
Model selection is linked to model assessment, which is the problem of comparing different models, or model parameters, for a specific learning task. For supervised learning, the standard practical technique is cross-validation, which is not applicable for semi-supervised and unsupervised settings. In this paper, a new model assessment scheme is introduced which is based on a notion of stability. The stability measure yields an upper bound to cross-validation in the supervised case, but extends to semi-supervised and unsupervised problems. In the experimental part, the performance of the stability measure is studied for model order selection in comparison to standard techniques in this area.

Citation Context

... not for model order selection, his study suggests the usefulness of such an approach for the purpose of validation. Our method can be considered as a refinement of his approach. Fridlyand and Dudoit [6] propose a model order selection procedure, called Clest, that also builds upon Breckenridge’s work. Their method employs the replication analysis idea by repeatedly splitting the available data into ...

Finding Predictive Gene Groups from Microarray Data

by Marcel Dettling, Peter Bühlmann - Journal of Multivariate Analysis, 2004
Cited by 42 (5 self)
Microarray experiments generate large datasets with expression values for thousands of genes, but not more than a few dozens of samples. A challenging task with these data is to reveal groups of genes which act together and whose collective expression is strongly associated with an outcome variable of interest. To find these groups, we suggest the use of supervised algorithms: these are procedures which use external information about the response variable for grouping the genes. We present Pelora, an algorithm based on penalized logistic regression analysis, that combines gene selection, gene grouping and sample classification in a supervised, simultaneous way. With an empirical study on six different microarray datasets, we show that Pelora identifies gene groups whose expression centroids have very good predictive potential and yield results that can keep up with state-of-the-art classification methods based on single genes. Thus, our gene groups can be beneficial in medical diagnostics and prognostics, but they may also provide more biological insights into gene function and regulation.

Techniques for clustering gene expression data

by G. Kerr, H. J. Ruskin, M. Crane, P. Doolan - Comput Biol Med, 2007
Cited by 34 (3 self)
Many clustering techniques have been proposed for the analysis of gene expression data obtained from microarray experiments. However, choice of suitable method(s) for a given experimental dataset is not straightforward. Common approaches do not translate well and fail to take account of the data profile. This review paper surveys state of the art applications which recognise these limitations and address them. As such, it provides a framework for the evaluation of clustering in gene expression analyses. The nature of microarray data is discussed briefly. Selected examples are presented for clustering methods considered.

Citation Context

...nes with common promoter sequence are likely to be expressed together and thus are likely to be placed in the same group. Methods for determining optimal number of groups, K, are discussed in [7] and [8]. Clustering a GE matrix can be achieved in two ways: (i) genes can form a group which show similar expression across conditions, (ii) samples can form a group which show similar expression across all...

A CLUE for CLUster Ensembles

by Kurt Hornik - Journal of Statistical Software, 2005
Cited by 34 (7 self)
Cluster ensembles are collections of individual solutions to a given clustering problem which are useful or necessary to consider in a wide range of applications. The R package clue provides an extensible computational environment for creating and analyzing cluster ensembles, with basic data structures for representing partitions and hierarchies, and facilities for computing on these, including methods for measuring proximity and obtaining consensus and “secondary” clusterings.
