Results 1-10 of 1,720
Dynamic topic models
In ICML, 2006

Cited by 656 (28 self)
Scientists need new tools to explore and browse large collections of scholarly literature. Thanks to organizations such as JSTOR, which scan and index the original bound archives of many journals, modern scientists can search digital libraries spanning hundreds of years. A scientist, suddenly
Mixed membership stochastic block models for relational data with application to protein-protein interactions
In Proceedings of the International Biometrics Society Annual Meeting, 2006

Cited by 366 (51 self)
We develop a model for examining data that consists of pairwise measurements, for example, presence or absence of links between pairs of objects. Examples include protein interactions and gene regulatory networks, collections of author-recipient email, and social networks. Analyzing such data with probabilistic models requires special assumptions, since the usual independence or exchangeability assumptions no longer hold. We introduce a class of latent variable models for pairwise measurements: mixed membership stochastic blockmodels. Models in this class combine a global model of dense patches of connectivity (blockmodel) and a local model to instantiate node-specific variability in the connections (mixed membership). We develop a general variational inference algorithm for fast approximate posterior inference. We demonstrate the advantages of mixed membership stochastic blockmodels with applications to social networks and protein interaction networks.
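The combination of a global blockmodel with per-node mixed membership can be illustrated with a small generative sketch. This is not taken from the abstract: it assumes the commonly described formulation in which each node draws a membership vector from a Dirichlet prior, each ordered pair draws a sender role and a receiver role, and the link is a Bernoulli draw from the block matrix entry for those two roles. The names `sample_mmsb`, `alpha`, and `block_matrix` are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw one index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sample_mmsb(n_nodes, alpha, block_matrix, seed=0):
    """Generate memberships and a directed 0/1 link matrix (assumed model)."""
    rng = random.Random(seed)
    # Local part: every node gets its own mixed-membership vector.
    memberships = [sample_dirichlet(alpha, rng) for _ in range(n_nodes)]
    links = [[0] * n_nodes for _ in range(n_nodes)]
    for p in range(n_nodes):
        for q in range(n_nodes):
            if p == q:
                continue  # no self-links in this sketch
            # Each node picks a role for this particular interaction.
            z_send = sample_categorical(memberships[p], rng)
            z_recv = sample_categorical(memberships[q], rng)
            # Global part: the block matrix gives the link probability.
            prob = block_matrix[z_send][z_recv]
            links[p][q] = 1 if rng.random() < prob else 0
    return memberships, links
```

A block matrix with large diagonal entries, such as `[[0.9, 0.05], [0.05, 0.8]]`, would produce the dense patches of connectivity the abstract mentions, while the per-pair role draws let a single node behave differently across its connections.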
Population structure and eigenanalysis
PLoS Genet 2(12): e190, DOI: 10.1371/journal.pgen.0020190, 2006

Cited by 237 (6 self)
Current methods for inferring population structure from genetic data do not provide formal significance tests for population differentiation. We discuss an approach to studying population structure (principal components analysis) that was first applied to genetic data by Cavalli-Sforza and colleagues. We place the method on a solid statistical footing, using results from modern statistics to develop formal significance tests. We also uncover a general "phase change" phenomenon about the ability to detect structure in genetic data, which emerges from the statistical theory we use, and has an important implication for the ability to discover structure in genetic data: for a fixed but large dataset size, divergence between two populations (as measured, for example, by a statistic like F_ST) below a threshold is essentially undetectable, but a little above threshold, detection will be easy. This means that we can predict the dataset size needed to detect structure.
Probabilistic topic models
IEEE Signal Processing Magazine, 2010

Cited by 221 (6 self)
Probabilistic topic models are a suite of algorithms whose aim is to discover the
Topics in semantic representation
Psychological Review, 2007

Cited by 173 (14 self)
Accounts of language processing have suggested that it requires retrieving concepts from memory in response to an ongoing stream of information. This can be facilitated by inferring the gist of a sentence, conversation, or document, and using that gist to predict related concepts and disambiguate words. We analyze the abstract computational problem underlying the extraction and use of gist, formulating this problem as a rational statistical inference. This leads us to a novel approach to semantic representation in which word meanings are represented in terms of a set of probabilistic topics. The topic model performs well in predicting word association and the effects of semantic association and ambiguity on a variety of language processing and memory tasks. It also provides a foundation for developing more richly structured statistical models of language, as the generative process assumed in the topic model can easily be extended to incorporate other kinds of semantic and syntactic structure. Many aspects of perception and cognition can be understood by considering the computational problem that is addressed by a particular human capacity (Anderson, 1990; Marr, 1982). Perceptual capacities such as identifying shape from shading (Freeman, 1994), motion perception
AJ: Bayesian inference of species trees from multilocus data
Mol Biol Evol

Cited by 168 (5 self)
Until recently, it has been common practice for a phylogenetic analysis to use a single gene sequence from a single individual organism as a proxy for an entire species. With technological advances, it is now becoming more common to collect data sets containing multiple gene loci and multiple individuals per species. These data sets often reveal the need to directly model intraspecies polymorphism and incomplete lineage sorting in phylogenetic estimation procedures. For a single species, coalescent theory is widely used in contemporary population genetics to model intraspecific gene trees. Here, we present a Bayesian Markov chain Monte Carlo method for the multispecies coalescent. Our method coestimates multiple gene trees embedded in a shared species tree along with the effective population size of both extant and ancestral species. The inference is made possible by multilocus data from multiple individuals per species. Using a multiindividual data set and a series of simulations of rapid species radiations, we demonstrate the efficacy of our new method. These simulations give some insight into the behavior of the method as a function of sampled individuals, sampled loci, and sequence length. Finally, we compare our new method to both an existing method (BEST 2.2) with similar goals and the supermatrix (concatenation) method. We demonstrate that both BEST and our method have much better estimation accuracy for species tree topology than concatenation, and our method outperforms BEST in divergence time and population size estimation.
A correlated topic model of science
2007

Cited by 147 (10 self)
Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than X-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [J. Roy. Statist. Soc. Ser. B 44 (1982) 139–177]. We derive a fast variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. We apply the CTM to the articles from Science published from 1990–1999, a data set that comprises 57M words. The CTM gives a better fit of the data than LDA, and we demonstrate its use as an exploratory tool of large document collections.
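To make the logistic normal idea concrete, the sketch below draws one vector of topic proportions by sampling a Gaussian vector with a chosen covariance and mapping it to the simplex with a softmax. It is an illustration under assumptions, not the paper's inference algorithm: the function names, the hand-written Cholesky factorization, and the example covariance are all hypothetical, and the covariance is assumed to be symmetric positive definite.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular factor L with L * L^T = matrix (assumes SPD input)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def sample_topic_proportions(mean, cov, rng):
    """Draw eta ~ N(mean, cov), then return softmax(eta) as proportions."""
    lower = cholesky(cov)
    noise = [rng.gauss(0.0, 1.0) for _ in mean]
    # Correlated Gaussian draw: eta = mean + L * noise.
    eta = [
        m + sum(lower[i][k] * noise[k] for k in range(i + 1))
        for i, m in enumerate(mean)
    ]
    # Softmax, shifted by the maximum for numerical stability.
    peak = max(eta)
    weights = [math.exp(e - peak) for e in eta]
    total = sum(weights)
    return [w / total for w in weights]
```

With a covariance such as `[[1, 0.8, 0], [0.8, 1, 0], [0, 0, 1]]`, the first two coordinates of the Gaussian draw are positively correlated, so those two topics tend to receive large weight in the same documents. A Dirichlet prior has no parameter that expresses this kind of pairwise dependence.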
K.L.: Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664
Information Systems and Data Analysis, 1997

Cited by 127 (4 self)
Population stratification has long been recognized as a confounding factor in genetic association studies. Estimated ancestries, derived from multilocus genotype data, can be used as covariates to correct for population stratification. One popular technique for estimation of ancestry is the model-based approach embodied by the widely applied program structure. Another approach, implemented in the program eigenstrat, relies on principal component analysis rather than model-based estimation and does not directly deliver admixture fractions. eigenstrat has gained in popularity in part due to its remarkable speed in comparison to structure. We present a new algorithm and a program, admixture, for model-based estimation of ancestry in unrelated individuals. admixture adopts the likelihood model embedded in structure. However, admixture runs considerably faster, solving problems in minutes that take structure hours. In many of our experiments we have found that admixture is almost as fast as eigenstrat. The runtime improvements of admixture rely on a fast block relaxation scheme using sequential quadratic programming for block updates, coupled with a novel quasi-Newton acceleration of convergence. Our algorithm also runs faster and with greater accuracy than the implementation of an Expectation-Maximization (EM) algorithm incorporated in the program frappe. Our simulations show that admixture's maximum likelihood estimates of the underlying admixture coefficients and ancestral allele frequencies are as accurate as structure's Bayesian estimates. On real world datasets, admixture's estimates are directly comparable to those from structure and eigenstrat. Taken together, our results show that admixture's computational speed opens up the possibility of using a much larger set of markers in model-based ancestry estimation and that its estimates are suitable for use in correcting for population stratification in association studies.
The nested Chinese restaurant process and Bayesian inference of topic hierarchies
2007

Cited by 123 (15 self)
We present the nested Chinese restaurant process (nCRP), a stochastic process which assigns probability distributions to infinitely-deep, infinitely-branching trees. We show how this stochastic process can be used as a prior distribution in a Bayesian nonparametric model of document collections. Specifically, we present an application to information retrieval in which documents are modeled as paths down a random tree, and the preferential attachment dynamics of the nCRP leads to clustering of documents according to sharing of topics at multiple levels of abstraction. Given a corpus of documents, a posterior inference algorithm finds an approximation to a posterior distribution over trees, topics and allocations of words to levels of the tree. We demonstrate this algorithm on collections of scientific abstracts from several journals. This model exemplifies a recent trend in statistical machine learning: the use of Bayesian nonparametric methods to infer distributions on flexible data structures.
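The idea of documents as paths down a random tree with preferential attachment can be sketched as follows. This is a minimal illustration under assumptions, not the paper's posterior inference algorithm: the tree is truncated to a fixed `depth` (the process in the abstract is infinitely deep), and the names `Node`, `draw_path`, and the concentration parameter `gamma` are hypothetical. At each level a document follows an existing child with probability proportional to the number of earlier documents that took it, or opens a new child with probability proportional to `gamma`.

```python
import random


class Node:
    """A tree node that counts how many document paths pass through it."""

    def __init__(self):
        self.count = 0
        self.children = []


def draw_path(root, depth, gamma, rng):
    """Draw one root-to-leaf path of the given depth and update the counts."""
    root.count += 1
    path = [root]
    node = root
    for _ in range(depth - 1):
        # Existing children are weighted by popularity; the last slot is
        # the option of creating a new child (weight gamma).
        weights = [child.count for child in node.children] + [gamma]
        choice = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if choice == len(node.children):
            node.children.append(Node())
        node = node.children[choice]
        node.count += 1
        path.append(node)
    return path
```

Calling `draw_path` repeatedly on the same root grows the tree one document at a time. Popular branches attract further documents, which is the preferential attachment behavior that leads to clustering at several levels of the hierarchy.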