Results 1  10
of
119
Probabilistic topic models
 IEEE Signal Processing Magazine
, 2010
"... Probabilistic topic models are a suite of algorithms whose aim is to discover the ..."
Abstract

Cited by 235 (6 self)
 Add to MetaCart
(Show Context)
Probabilistic topic models are a suite of algorithms whose aim is to discover the
Online Learning for Latent Dirichlet Allocation
"... We develop an online variational Bayes (VB) algorithm for Latent Dirichlet Allocation (LDA). Online LDA is based on online stochastic optimization with a natural gradient step, which we show converges to a local optimum of the VB objective function. It can handily analyze massive document collection ..."
Abstract

Cited by 209 (21 self)
 Add to MetaCart
(Show Context)
We develop an online variational Bayes (VB) algorithm for Latent Dirichlet Allocation (LDA). Online LDA is based on online stochastic optimization with a natural gradient step, which we show converges to a local optimum of the VB objective function. It can handily analyze massive document collections, including those arriving in a stream. We study the performance of online LDA in several ways, including by fitting a 100topic topic model to 3.3M articles from Wikipedia in a single pass. We demonstrate that online LDA finds topic models as good or better than those found with batch VB, and in a fraction of the time. 1
Stochastic Variational Inference
 JOURNAL OF MACHINE LEARNING RESEARCH (2013, IN PRESS)
, 2013
"... We develop stochastic variational inference, a scalable algorithm for approximating posterior distributions. We develop this technique for a large class of probabilistic models and we demonstrate it with two probabilistic topic models, latent Dirichlet allocation and the hierarchical Dirichlet proce ..."
Abstract

Cited by 131 (27 self)
 Add to MetaCart
(Show Context)
We develop stochastic variational inference, a scalable algorithm for approximating posterior distributions. We develop this technique for a large class of probabilistic models and we demonstrate it with two probabilistic topic models, latent Dirichlet allocation and the hierarchical Dirichlet process topic model. Using stochastic variational inference, we analyze several large collections of documents: 300K articles from Nature, 1.8M articles from The New York Times, and 3.8M articles from Wikipedia. Stochastic inference can easily handle data sets of this size and outperforms traditional variational inference, which can only handle a smaller subset. (We also show that the Bayesian nonparametric topic model outperforms its parametric counterpart.) Stochastic variational inference lets us apply complex Bayesian models to massive data sets.
Rethinking LDA: Why Priors Matter
"... Implementations of topic models typically use symmetric Dirichlet priors with fixed concentration parameters, with the implicit assumption that such “smoothing parameters ” have little practical effect. In this paper, we explore several classes of structured priors for topic models. We find that an ..."
Abstract

Cited by 110 (3 self)
 Add to MetaCart
(Show Context)
Implementations of topic models typically use symmetric Dirichlet priors with fixed concentration parameters, with the implicit assumption that such “smoothing parameters ” have little practical effect. In this paper, we explore several classes of structured priors for topic models. We find that an asymmetric Dirichlet prior over the document–topic distributions has substantial advantages over a symmetric prior, while an asymmetric prior over the topic–word distributions provides no real benefit. Approximation of this prior structure through simple, efficient hyperparameter optimization steps is sufficient to achieve these performance gains. The prior structure we advocate substantially increases the robustness of topic models to variations in the number of topics and to the highly skewed word frequency distributions common in natural language. Since this prior structure can be implemented using efficient algorithms that add negligible cost beyond standard inference techniques, we recommend it as a new standard for topic modeling. 1
Online Variational Inference for the Hierarchical Dirichlet Process
"... The hierarchical Dirichlet process (HDP) is a Bayesian nonparametric model that can be used to model mixedmembership data with a potentially infinite number of components. It has been applied widely in probabilistic topic modeling, where the data are documents and the components are distributions o ..."
Abstract

Cited by 62 (7 self)
 Add to MetaCart
(Show Context)
The hierarchical Dirichlet process (HDP) is a Bayesian nonparametric model that can be used to model mixedmembership data with a potentially infinite number of components. It has been applied widely in probabilistic topic modeling, where the data are documents and the components are distributions of terms that reflect recurring patterns (or “topics”) in the collection. Given a document collection, posterior inference is used to determine the number of topics needed and to characterize their distributions. One limitation of HDP analysis is that existing posterior inference algorithms require multiple passes through all the data—these algorithms are intractable for very large scale applications. We propose an online variational inference algorithm for the HDP, an algorithm that is easily applicable to massive and streaming data. Our algorithm is significantly faster than traditional inference algorithms for the HDP, and lets us analyze much larger data sets. We illustrate the approach on two large collections of text, showing improved performance over online LDA, the finite counterpart to the HDP topic model. 1
Software Traceability with Topic Modeling
"... Software traceability is a fundamentally important task in software engineering. The need for automated traceability increases as projects become more complex and as the number of artifacts increases. We propose an automated technique that combines traceability with a machine learning technique know ..."
Abstract

Cited by 59 (3 self)
 Add to MetaCart
(Show Context)
Software traceability is a fundamentally important task in software engineering. The need for automated traceability increases as projects become more complex and as the number of artifacts increases. We propose an automated technique that combines traceability with a machine learning technique known as topic modeling. Our approach automatically records traceability links during the software development process and learns a probabilistic topic model over artifacts. The learned model allows for the semantic categorization of artifacts and the topical visualization of the software system. To test our approach, we have implemented several tools: an artifact search tool combining keywordbased search and topic modeling, a recording tool that performs prospective traceability, and a visualization tool that allows one to navigate the software architecture and view semantic topics associated with relevant artifacts and architectural components. We apply our approach to several data sets and discuss how topic modeling enhances software traceability, and vice versa. Categories and Subject Descriptors
Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units
"... The recent emergence of Graphics Processing Units (GPUs) as generalpurpose parallel computing devices provides us with new opportunities to develop scalable learning methods for massive data. In this work, we consider the problem of parallelizing two inference methods on GPUs for latent Dirichlet A ..."
Abstract

Cited by 37 (0 self)
 Add to MetaCart
(Show Context)
The recent emergence of Graphics Processing Units (GPUs) as generalpurpose parallel computing devices provides us with new opportunities to develop scalable learning methods for massive data. In this work, we consider the problem of parallelizing two inference methods on GPUs for latent Dirichlet Allocation (LDA) models, collapsed Gibbs sampling (CGS) and collapsed variational Bayesian (CVB). To address limited memory constraints on GPUs, we propose a novel data partitioning scheme that effectively reduces the memory cost. This partitioning scheme also balances the computational cost on each multiprocessor and enables us to easily avoid memory access conflicts. We use data streaming to handle extremely large datasets. Extensive experiments showed that our parallel inference methods consistently produced LDA models with the same predictive power as sequential training methods did but with 26x speedup for CGS and 196x speedup for CVB on a GPU with 30 multiprocessors. The proposed partitioning scheme and data streaming make our approach scalable with more multiprocessors. Furthermore, they can be used as general techniques to parallelize other machine learning models. 1
Streaming variational bayes
 In Neural Information Processing Systems (NIPS
, 2013
"... We present SDABayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a userspecified approximation batch primitive. We demonstrate the usefulness of our framework, with va ..."
Abstract

Cited by 31 (0 self)
 Add to MetaCart
We present SDABayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a userspecified approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two largescale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data—a case where SVI may be applied—and in the streaming setting, where SVI does not apply. 1
Online Multiscale Dynamic Topic Models
"... We propose an online topic model for sequentially analyzing the time evolution of topics in document collections. Topics naturally evolve with multiple timescales. For example, some words may be used consistently over one hundred years, while other words emerge and disappear over periods of a few da ..."
Abstract

Cited by 30 (1 self)
 Add to MetaCart
(Show Context)
We propose an online topic model for sequentially analyzing the time evolution of topics in document collections. Topics naturally evolve with multiple timescales. For example, some words may be used consistently over one hundred years, while other words emerge and disappear over periods of a few days. Thus, in the proposed model, current topicspecific distributions over words are assumed to be generated based on the multiscale word distributions of the previous epoch. Considering both the longtimescale dependency as well as the shorttimescale dependency yields a more robust model. We derive efficient online inference procedures based on a stochastic EM algorithm, in which the model is sequentially updated using newly obtained data; this means that past data are not required to make the inference. We demonstrate the effectiveness of the proposed method in terms of predictive performance and computational efficiency by examining collections of real documents with timestamps.
Partially labeled topic models for interpretable text mining
 In Proceedings of KDD
, 2011
"... Much of the world’s electronic text is annotated with humaninterpretable labels, such as tags on web pages and subject codes on academic publications. Effective text mining in this setting requires models that can flexibly account for the textual patterns that underlie the observed labels while stil ..."
Abstract

Cited by 29 (2 self)
 Add to MetaCart
(Show Context)
Much of the world’s electronic text is annotated with humaninterpretable labels, such as tags on web pages and subject codes on academic publications. Effective text mining in this setting requires models that can flexibly account for the textual patterns that underlie the observed labels while still discovering unlabeled topics. Neither supervised classification, with its focus on label prediction, nor purely unsupervised learning, which does not model the labels explicitly, is appropriate. In this paper, we present two new partially supervised generative models of labeled text, Partially Labeled Dirichlet Allocation (PLDA) and the Partially Labeled Dirichlet Process (PLDP). These models make use of the unsupervised learning machinery of topic models to discover the hidden topics within each label, as well as unlabeled, corpuswide latent topics. We explore applications with qualitative case studies of tagged web pages from del.icio.us and PhD dissertation abstracts, demonstrating improved model interpretability over traditional topic models. We use the many tags present in our del.icio.us dataset to quantitatively demonstrate the new models ’ higher correlation with human relatedness scores over several strong baselines.