Results 1  10
of
573
Hierarchical Dirichlet processes.
 Journal of the American Statistical Association,
, 2006
"... We consider problems involving groups of data where each observation within a group is a draw from a mixture model and where it is desirable to share mixture components between groups. We assume that the number of mixture components is unknown a priori and is to be inferred from the data. In this s ..."
Abstract

Cited by 942 (78 self)
 Add to MetaCart
(Show Context)
We consider problems involving groups of data where each observation within a group is a draw from a mixture model and where it is desirable to share mixture components between groups. We assume that the number of mixture components is unknown a priori and is to be inferred from the data. In this setting it is natural to consider sets of Dirichlet processes, one for each group, where the wellknown clustering property of the Dirichlet process provides a nonparametric prior for the number of mixture components within each group. Given our desire to tie the mixture models in the various groups, we consider a hierarchical model, specifically one in which the base measure for the child Dirichlet processes is itself distributed according to a Dirichlet process. Such a base measure being discrete, the child Dirichlet processes necessarily share atoms. Thus, as desired, the mixture models in the different groups necessarily share mixture components. We discuss representations of hierarchical Dirichlet processes in terms of a stickbreaking process, and a generalization of the Chinese restaurant process that we refer to as the "Chinese restaurant franchise." We present Markov chain Monte Carlo algorithms for posterior inference in hierarchical Dirichlet process mixtures and describe applications to problems in information retrieval and text modeling.
Consistency of spectral clustering
, 2004
"... Consistency is a key property of statistical algorithms, when the data is drawn from some underlying probability distribution. Surprisingly, despite decades of work, little is known about consistency of most clustering algorithms. In this paper we investigate consistency of a popular family of spe ..."
Abstract

Cited by 572 (15 self)
 Add to MetaCart
Consistency is a key property of statistical algorithms, when the data is drawn from some underlying probability distribution. Surprisingly, despite decades of work, little is known about consistency of most clustering algorithms. In this paper we investigate consistency of a popular family of spectral clustering algorithms, which cluster the data with the help of eigenvectors of graph Laplacian matrices. We show that one of the two of major classes of spectral clustering (normalized clustering) converges under some very general conditions, while the other (unnormalized), is only consistent under strong additional assumptions, which, as we demonstrate, are not always satisfied in real data. We conclude that our analysis provides strong evidence for the superiority of normalized spectral clustering in practical applications. We believe that methods used in our analysis will provide a basis for future exploration of Laplacianbased methods in a statistical setting.
ModelBased Clustering and Data Transformations for Gene Expression Data
, 2001
"... Motivation: Clustering is a useful exploratory technique for the analysis of gene expression data. Many different heuristic clustering algorithms have been proposed in this context. Clustering algorithms based on probability models offer a principled alternative to heuristic algorithms. In particula ..."
Abstract

Cited by 200 (9 self)
 Add to MetaCart
(Show Context)
Motivation: Clustering is a useful exploratory technique for the analysis of gene expression data. Many different heuristic clustering algorithms have been proposed in this context. Clustering algorithms based on probability models offer a principled alternative to heuristic algorithms. In particular, modelbased clustering assumes that the data is generated by a finite mixture of underlying probability distributions such as multivariate normal distributions. The issues of selecting a 'good' clustering method and determining the 'correct' number of clusters are reduced to model selection problems in the probability framework. Gaussian mixture models have been shown to be a powerful tool for clustering in many applications.
Clustering of timecourse gene expression data using a mixedeffects model with splines
 04, 2002, Rowe Program in Human Genetics, UC Davis School of Medicine
, 2002
"... Motivation: Timecourse gene expression data are often measured to study dynamic biological systems and gene regulatory networks. To account for time dependency of the gene expression measurements over time and the noisy nature of the microarray data, the mixedeffects model using Bsplines was intr ..."
Abstract

Cited by 138 (4 self)
 Add to MetaCart
(Show Context)
Motivation: Timecourse gene expression data are often measured to study dynamic biological systems and gene regulatory networks. To account for time dependency of the gene expression measurements over time and the noisy nature of the microarray data, the mixedeffects model using Bsplines was introduced. This paper further explores such mixedeffects model in analyzing the timecourse gene expression data and in performing clustering of genes in a mixture model framework. Results: After fitting the mixture model in the framework of the mixedeffects model using an EM algorithm, we obtained the smooth mean gene expression curve for each cluster. For each gene, we obtained the best linear unbiased smooth estimate of its gene expression trajectory over time, combining data from that gene and other genes in the same cluster. Simulated data indicate that the methods can effectively cluster noisy curves into clusters differing in either the shapes of the curves or the times to the peaks of the curves. We further demonstrate the proposed method by clustering the yeast genes based on their cell cycle gene expression data and the human genes based on the temporal transcriptional response of fibroblasts to serum. Clear periodic patterns and varying times to peaks are observed for different clusters of the cellcycle regulated genes. Results of the analysis of the human fibroblasts data show seven distinct transcriptional response profiles with biological relevance. Availability: Matlab programs are available on request from the authors.
From patterns to pathways: gene expression data analysis comes of age.
 Nature Genetics
, 2002
"... ..."
Variable Selection for ModelBased Clustering
 Journal of the American Statistical Association
, 2006
"... We consider the problem of variable or feature selection for modelbased clustering. We recast the problem of comparing two nested subsets of variables as a model comparison problem, and address it using approximate Bayes factors. We develop a greedy search algorithm for finding a local optimum in m ..."
Abstract

Cited by 98 (7 self)
 Add to MetaCart
We consider the problem of variable or feature selection for modelbased clustering. We recast the problem of comparing two nested subsets of variables as a model comparison problem, and address it using approximate Bayes factors. We develop a greedy search algorithm for finding a local optimum in model space. The resulting method selects variables (or features), the number of clusters, and the clustering model simultaneously. We applied the method to several simulated and real examples, and found that removing irrelevant variables often improved performance. Compared to methods based on all the variables, our variable selection method consistently yielded more accurate estimates of the number of clusters, and lower classification error rates, as well as more parsimonious clustering models and easier visualization of results.
AE: MCLUST Version 3 for R: Normal Mixture Modeling and ModelBased Clustering
 Department of Statistics, University of Washington
, 2006
"... MCLUST is a contributed R package for normal mixture modeling and modelbased clustering. It provides functions for parameter estimation via the EM algorithm for normal mixture models with a variety of covariance structures, and functions for simulation from these models. Also included are functions ..."
Abstract

Cited by 87 (1 self)
 Add to MetaCart
(Show Context)
MCLUST is a contributed R package for normal mixture modeling and modelbased clustering. It provides functions for parameter estimation via the EM algorithm for normal mixture models with a variety of covariance structures, and functions for simulation from these models. Also included are functions that combine modelbased hierarchical clustering, EM for mixture estimation and the Bayesian Information Criterion (BIC) in comprehensive strategies for clustering, density estimation and discriminant analysis. There is additional functionality for displaying and visualizing the models along with clustering and classification results. A number of features of the software have been changed in this version, and the functionality has been expanded to include regularization for normal mixture models via a Bayesian prior. MCLUST is licensed by the University of Washington and distributed through
A bayesian hierarchical topic model for political texts: Measuring expressed agendas in senate press releases
 In Proceedings of the First Workshop on Social Media Analytics, SOMA ’10
"... Political scientists lack methods to efficiently measure the priorities political actors emphasize in statements. To address this limitation, I introduce a statistical model that attends to the structure of political rhetoric when measuring expressed priorities: statements are naturally organized b ..."
Abstract

Cited by 61 (4 self)
 Add to MetaCart
(Show Context)
Political scientists lack methods to efficiently measure the priorities political actors emphasize in statements. To address this limitation, I introduce a statistical model that attends to the structure of political rhetoric when measuring expressed priorities: statements are naturally organized by author. The expressed agenda model exploits this structure to simultaneously estimate the topics in the texts, as well as the attention political actors allocate to the estimated topics. I apply the method to a collection of over 64,000 press releases from senators from 20052007, which I demonstrate is an ideal medium to measure how senators explain their work in Washington to constituents. A set of examples validates the estimated priorities and demonstrates that the additional information included in the model provides better classification than expert human coders or statistical models for clustering that ignore the author of a document. The statistical model and its extensions will be made available in a forthcoming free software package for the R computing language and the press release data will be made available for download. ∗PhD Candidate, Harvard University Department of Government. I thank the Center for American Political Studies
Bayesian regularization for normal mixture estimation and modelbased clustering
, 2005
"... Normal mixture models are widely used for statistical modeling of data, including cluster analysis. However maximum likelihood estimation (MLE) for normal mixtures using the EM algorithm may fail as the result of singularities or degeneracies. To avoid this, we propose replacing the MLE by a maximum ..."
Abstract

Cited by 57 (4 self)
 Add to MetaCart
Normal mixture models are widely used for statistical modeling of data, including cluster analysis. However maximum likelihood estimation (MLE) for normal mixtures using the EM algorithm may fail as the result of singularities or degeneracies. To avoid this, we propose replacing the MLE by a maximum a posteriori (MAP) estimator, also found by the EM algorithm. For choosing the number of components and the model parameterization, we propose a modified version of BIC, where the likelihood is evaluated at the MAP instead of the MLE. We use a highly dispersed proper conjugate prior, containing a small fraction of one observation’s worth of information. The resulting method avoids degeneracies and singularities, but when these are not present it gives similar results to the standard method using MLE, EM and BIC. Key words: BIC; EM algorithm; mixture models; modelbased clustering; conjugate prior; posterior mode. 1