Results 1 - 10 of 29
Evaluation methods for topic models
- In ICML, 2009
- Cited by 111 (10 self)
"... A natural evaluation metric for statistical topic models is the probability of held-out documents given a trained model. While exact computation of this probability is intractable, several estimators for this probability have been used in the topic modeling literature, including the harmonic mean ..."
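The held-out probability described in this abstract is usually reported as perplexity: the exponentiated negative average per-word log-likelihood of unseen documents under the trained model. A minimal sketch in plain Python, where the per-word log-probabilities stand in for whatever estimates a trained topic model would produce (the values below are illustrative, not from the paper):

```python
import math

def perplexity(word_logprobs):
    """Perplexity of held-out text from per-word log-probabilities
    (natural log) produced by some trained topic model. Lower is better."""
    n = len(word_logprobs)
    avg_ll = sum(word_logprobs) / n      # average per-word log-likelihood
    return math.exp(-avg_ll)

# Hypothetical log-probabilities for 4 held-out words, each with p = 0.01;
# a uniform p gives perplexity 1/p.
lp = [math.log(0.01)] * 4
print(perplexity(lp))  # approximately 100
```

The intractable part the abstract refers to is obtaining those per-word probabilities in the first place, which requires marginalizing over the latent topic assignments; the cited estimators (harmonic mean and others) approximate that marginal.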
Pachinko allocation: DAG-structured mixture models of topic correlations
- In Proceedings of the 23rd International Conference on Machine Learning, 2006
- Cited by 181 (8 self)
"... Latent Dirichlet allocation (LDA) and other related topic models are increasingly popular tools for summarization and manifold discovery in discrete data. However, LDA does not capture correlations between topics. In this paper, we introduce the pachinko allocation model (PAM), which captures arbitrary ... a flexible alternative to recent work by Blei and Lafferty (2006), which captures correlations only between pairs of topics. Using text data from newsgroups, historic NIPS proceedings and other research paper corpora, we show improved performance of PAM in document classification, likelihood of held-out ..."
Replicated softmax: an undirected topic model
- In Advances in Neural Information Processing Systems
- Cited by 67 (14 self)
"... We introduce a two-layer undirected graphical model, called a “Replicated Softmax”, that can be used to model and automatically extract low-dimensional latent semantic representations from a large unstructured collection of documents. We present efficient learning and inference algorithms for this model ... in terms of both the log-probability of held-out documents and the retrieval accuracy."
Mixtures of hierarchical topics with pachinko allocation
- In Proceedings of the 24th International Conference on Machine Learning, 2007
- Cited by 64 (2 self)
"... The four-level pachinko allocation model (PAM) (Li & McCallum, 2006) represents correlations among topics using a DAG structure. It does not, however, represent a nested hierarchy of topics, with some topical word distributions representing the vocabulary that is shared among several more specific ... improvements in likelihood of held-out documents, as well as mutual information between automatically-discovered topics and human-generated categories such as journals."
Effective Document-Level Features for Chinese Patent Word Segmentation
"... A patent is a property right for an invention granted by the government to the inventor. Patents often have a high concentration of scientific and technical terms that are rare in everyday language. However, some scientific and technical terms usually appear with high frequency only in one specific patent. In this paper, we propose a pragmatic approach to Chinese word segmentation on patents where we train a sequence labeling model based on a group of novel document-level features. Experiments show that the accuracy of our model reached 96.3% (F1 score) on the development set and 95 ..."
Evaluating topic models for information retrieval
- In Proceedings of CIKM, 2008
- Cited by 7 (0 self)
"... We explore the utility of different types of topic models, both probabilistic and not, for retrieval purposes. We show that: (1) topic models are effective for document smoothing; (2) more elaborate topic models that capture topic dependencies provide no additional gains; (3) smoothing documents by using their similar documents is as effective as smoothing them by using topic models; (4) topics discovered on the whole corpus are too coarse-grained to be useful for query expansion. Experiments to measure topic models' ability to predict held-out likelihood confirm past results on small corpora ..."
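The document smoothing in point (1) is typically a linear interpolation of the document's maximum-likelihood language model with a topic-model-derived word distribution. A minimal sketch in plain Python; the mixing weight and the toy distributions are illustrative assumptions, not values from the paper:

```python
def smooth(p_ml, p_topic, lam=0.5):
    """Interpolate a document's ML word distribution with a topic-model
    distribution: p(w|d) = lam * p_ml(w) + (1 - lam) * p_topic(w)."""
    return {w: lam * p_ml.get(w, 0.0) + (1 - lam) * p_topic.get(w, 0.0)
            for w in set(p_ml) | set(p_topic)}

# Toy distributions: "model" never occurs in the document itself,
# but the topic model assigns it mass, so smoothing gives it nonzero p.
p_ml = {"topic": 0.5, "lda": 0.5}
p_topic = {"topic": 0.4, "lda": 0.3, "model": 0.3}
p = smooth(p_ml, p_topic, lam=0.5)
print(round(p["model"], 2))  # -> 0.15
```

Finding (3) above says that an analogous interpolation with the language models of a document's nearest neighbors works about as well as using a topic model for `p_topic`.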
Accounting for Burstiness in Topic Models
- Cited by 21 (0 self)
"... Many different topic models have been used successfully for a variety of applications. However, even state-of-the-art topic models suffer from the important flaw that they do not capture the tendency of words to appear in bursts; it is a fundamental property of language that if a word is used once in a document, it is more likely to be used again. We introduce a topic model that uses Dirichlet compound multinomial (DCM) distributions to model this burstiness phenomenon. On both text and non-text datasets, the new model achieves better held-out likelihood than standard latent Dirichlet allocation ..."
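The burstiness property this abstract describes can be made concrete: under a Dirichlet compound multinomial (Pólya) distribution, a word that has already occurred becomes more probable, so a "bursty" count vector scores higher than an evenly spread one with the same total, whereas a plain multinomial is indifferent. A small sketch in plain Python; the symmetric alpha value is an illustrative assumption, not taken from the paper:

```python
from math import lgamma

def dcm_loglik(counts, alphas):
    """Log-likelihood of a word-count vector under a Dirichlet compound
    multinomial, ignoring the count-ordering (multinomial) coefficient."""
    a = sum(alphas)
    n = sum(counts)
    ll = lgamma(a) - lgamma(a + n)
    for x, al in zip(counts, alphas):
        ll += lgamma(al + x) - lgamma(al)
    return ll

alphas = [0.1, 0.1, 0.1]                # small symmetric prior (assumed)
bursty = dcm_loglik([3, 0, 0], alphas)  # one word repeated three times
spread = dcm_loglik([1, 1, 1], alphas)  # three distinct words, same total
print(bursty > spread)  # True: repetition is rewarded under the DCM
```

With a small concentration parameter, most of the probability mass sits on count vectors dominated by a few words, which is exactly the bursty behavior the model is meant to capture.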
Omnifluent™ English-to-French and Russian-to-English Systems for the 2013 Workshop on Statistical Machine Translation
"... This paper describes Omnifluent™ Translate – a state-of-the-art hybrid MT system capable of high-quality, high-speed translations of text and speech. The system participated in the English-to-French and Russian-to-English WMT evaluation tasks with competitive results. The features which contributed the most to high translation quality were training data sub-sampling methods, document-specific models, as well as rule-based morphological normalization for Russian. The latter improved the baseline Russian-to-English BLEU score from 30.1 to 31.3% on a held-out test set."
The Polylingual Labeled Topic Model
"... In this paper, we present the Polylingual Labeled Topic Model, a model which combines the characteristics of the existing Polylingual Topic Model and Labeled LDA. The model accounts for multiple languages with separate topic distributions for each language while restricting the permitted topics of a document to a set of predefined labels. We explore the properties of the model in a two-language setting on a dataset from the social science domain. Our experiments show that our model outperforms LDA and Labeled LDA in terms of held-out perplexity and that it produces ..."
Vouros: Non-parametric estimation of topic hierarchies from texts with hierarchical Dirichlet processes
- 2011
- Cited by 2 (0 self)
"... This paper presents hHDP, a hierarchical algorithm for representing a document collection as a hierarchy of latent topics, based on Dirichlet process priors. The hierarchical nature of the algorithm refers to the Bayesian hierarchy that it comprises, as well as to the hierarchy of the latent topics. ..."