Results 1–10 of 111
Unsupervised Modeling of Twitter Conversations
, 2010
Abstract

Cited by 90 (4 self)
We propose the first unsupervised approach to the problem of modeling dialogue acts in an open domain. Trained on a corpus of noisy Twitter conversations, our method discovers dialogue acts by clustering raw utterances. Because it accounts for the sequential behaviour of these acts, the learned model can provide insight into the shape of communication in a new medium. We address the challenge of evaluating the emergent model with a qualitative visualization and an intrinsic conversation ordering task. This work is inspired by a corpus of 1.3 million Twitter conversations, which will be made publicly available. This huge amount of data, available only because Twitter blurs the line between chatting and publishing, highlights the need to be able to adapt quickly to a new medium.
Polylingual Topic Models
Abstract

Cited by 89 (2 self)
Topic models are a useful tool for analyzing large text collections, but have previously been applied in only monolingual, or at most bilingual, contexts. Meanwhile, massive collections of interlinked documents in dozens of languages, such as Wikipedia, are now widely available, calling for tools that can characterize content in many languages. We introduce a polylingual topic model that discovers topics aligned across multiple languages. We explore the model’s characteristics using two large corpora, each with over ten different languages, and demonstrate its usefulness in supporting machine translation and tracking topic trends across languages.
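The generative story described above can be sketched with toy data: every language in an aligned document tuple shares one set of topic proportions, while each language keeps its own topic-word distribution. The vocabularies, probabilities, and two-topic setup below are invented purely for illustration, not the paper's actual corpora:

```python
import random

random.seed(0)

K = 2  # shared topics (toy size)
# Per-language topic-word distributions; the topic *index* is what is
# shared across languages. All entries here are hypothetical toy data.
phi = {
    "en": [{"government": 0.7, "election": 0.3}, {"goal": 0.6, "match": 0.4}],
    "de": [{"regierung": 0.7, "wahl": 0.3}, {"tor": 0.6, "spiel": 0.4}],
}

def draw(dist):
    """Sample one word from a {word: prob} distribution by inversion."""
    u, acc = random.random(), 0.0
    for w, p in dist.items():
        acc += p
        if u <= acc:
            return w
    return w  # float-rounding fallback

def generate_tuple(theta, n_words=4):
    """Generate one aligned document tuple: all languages reuse the
    same per-tuple topic proportions theta (here K=2, so theta[0]
    is the probability of topic 0)."""
    doc = {}
    for lang, topics in phi.items():
        words = []
        for _ in range(n_words):
            z = 0 if random.random() < theta[0] else 1
            words.append(draw(topics[z]))
        doc[lang] = words
    return doc

doc = generate_tuple(theta=[0.9, 0.1])
```

Because the topic assignment z is drawn per word but the proportions theta are shared across languages, a politics-heavy tuple stays politics-heavy in every language, which is exactly the alignment the model exploits.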
Automatic evaluation of topic coherence
 In NAACL-HLT
, 2010
Abstract

Cited by 85 (9 self)
This paper introduces the novel task of topic coherence evaluation, whereby a set of words, as generated by a topic model, is rated for coherence or interpretability. We apply a range of topic scoring models to the evaluation task, drawing on WordNet, Wikipedia and the Google search engine, and existing research on lexical similarity/relatedness. In comparison with human scores for a set of learned topics over two distinct datasets, we show a simple co-occurrence measure based on pointwise mutual information over Wikipedia data is able to achieve results for the task at or nearing the level of inter-annotator correlation, and that other Wikipedia-based lexical relatedness methods also achieve strong results. Google produces strong, if less consistent, results, while our results over WordNet are patchy at best.
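The PMI co-occurrence measure at the heart of the abstract can be sketched minimally: score a topic's top words by their average pairwise PMI over a reference corpus. The toy corpus of token sets below stands in for the paper's Wikipedia data; all words and documents are illustrative:

```python
import math
from itertools import combinations

def pmi_coherence(topic_words, docs, eps=1e-12):
    """Average pairwise PMI of a topic's top words over a reference
    corpus given as a list of token sets."""
    n = len(docs)
    def p(*words):
        # fraction of documents containing all the given words
        return sum(all(w in d for w in words) for d in docs) / n
    scores = []
    for w1, w2 in combinations(topic_words, 2):
        joint = p(w1, w2)
        if joint > 0:
            scores.append(math.log(joint / (p(w1) * p(w2) + eps)))
    return sum(scores) / len(scores) if scores else 0.0

# Invented toy corpus: each "document" is a set of tokens.
docs = [{"cell", "dna", "gene"}, {"dna", "gene"}, {"stock", "market"},
        {"market", "trade"}, {"gene", "protein"}]
coherent = pmi_coherence(["dna", "gene"], docs)   # words that co-occur
mixed = pmi_coherence(["dna", "market"], docs)    # words that never do
```

A coherent topic's words appear together more often than chance predicts, so its average PMI is positive, while a mixed topic scores at or near zero.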
Optimizing Semantic Coherence in Topic Models
Abstract

Cited by 80 (5 self)
Latent variable models have the potential to add value to large document collections by discovering interpretable, low-dimensional subspaces. In order for people to use such models, however, they must trust them. Unfortunately, typical dimensionality reduction methods for text, such as latent Dirichlet allocation, often produce low-dimensional subspaces (topics) that are obviously flawed to human domain experts. The contributions of this paper are threefold: (1) an analysis of the ways in which topics can be flawed; (2) an automated evaluation metric for identifying such topics that does not rely on human annotators or reference collections outside the training data; (3) a novel statistical topic model based on this metric that significantly improves topic quality in a large-scale document collection from the National Institutes of Health (NIH).
Replicated softmax: an undirected topic model
 In Advances in Neural Information Processing Systems
Abstract

Cited by 67 (14 self)
We introduce a two-layer undirected graphical model, called a “Replicated Softmax”, that can be used to model and automatically extract low-dimensional latent semantic representations from a large unstructured collection of documents. We present efficient learning and inference algorithms for this model, and show how a Monte Carlo-based method, Annealed Importance Sampling, can be used to produce an accurate estimate of the log-probability the model assigns to test data. This allows us to demonstrate that the proposed model is able to generalize much better compared to Latent Dirichlet Allocation in terms of both the log-probability of held-out documents and the retrieval accuracy.
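One block-Gibbs step of a Replicated Softmax RBM can be sketched as follows. The key detail is that the hidden bias is scaled by the document length D, the "replication" that lets one set of weights serve documents of any size. The sizes and random weights below are toy values, not the paper's training setup:

```python
import numpy as np

rng = np.random.default_rng(0)

K, F, D = 5, 3, 8                 # vocab size, hidden units, doc length
W = rng.normal(0, 0.1, (K, F))    # shared word/feature weights
a = np.zeros(K)                   # visible (word) biases
b = np.zeros(F)                   # hidden biases

v = rng.multinomial(D, np.ones(K) / K)  # word-count vector for one doc

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Sample hidden units given the counts; note the D * b scaling.
p_h = sigmoid(v @ W + D * b)
h = (rng.random(F) < p_h).astype(float)

# Sample the visible counts back: one softmax over the vocabulary,
# drawn D times at once via a multinomial.
logits = W @ h + a
p_v = np.exp(logits - logits.max())
p_v /= p_v.sum()
v_new = rng.multinomial(D, p_v)
```

Because every one of the D word slots shares the same weights W, the conditional over the visible layer collapses to a single multinomial, which is what makes inference tractable for variable-length documents.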
Tree-Structured Stick Breaking for Hierarchical Data
Abstract

Cited by 50 (8 self)
Many data are naturally modeled by an unobserved hierarchical structure. In this paper we propose a flexible nonparametric prior over unknown data hierarchies. The approach uses nested stick-breaking processes to allow for trees of unbounded width and depth, where data can live at any node and are infinitely exchangeable. One can view our model as providing infinite mixtures where the components have a dependency structure corresponding to an evolutionary diffusion down a tree. By using a stick-breaking approach, we can apply Markov chain Monte Carlo methods based on slice sampling to perform Bayesian inference and simulate from the posterior distribution on trees. We apply our method to hierarchical clustering of images and topic modeling of text data.
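The nested stick-breaking construction can be sketched as a walk down the tree: at each node, stop (i.e., assign the data point here) with a Beta(1, α) probability, otherwise pick a child via a second stick-breaking process and recurse. This is a simplified reading of the prior, with illustrative hyperparameters and a depth cap the unbounded model does not need:

```python
import random

random.seed(0)

def sample_path(alpha=1.0, gamma=1.0, max_depth=8):
    """Draw one data point's node under a nested stick-breaking tree
    prior (sketch). Returns the path as a tuple of child indices;
    the empty tuple means the point lives at the root."""
    path = []
    for _ in range(max_depth):
        nu = random.betavariate(1.0, alpha)
        if random.random() < nu:       # data can live at any node
            break
        child = 0
        while True:                    # stick-breaking over children:
            psi = random.betavariate(1.0, gamma)
            if random.random() < psi:  # take child i with prob
                break                  # psi_i * prod_{j<i}(1 - psi_j)
            child += 1
        path.append(child)
    return tuple(path)

paths = [sample_path() for _ in range(20)]
```

Smaller α makes stopping likelier (shallower trees); smaller γ concentrates mass on early children (narrower trees), matching the width/depth trade-off the paper's hyperparameters control.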
Estimating likelihoods for topic models
 In Asian Conference on Machine Learning
, 2009
Abstract

Cited by 19 (2 self)
Topic models are a discrete analogue to principal component analysis and independent component analysis that model topics at the word level within a document. They have many variants such as NMF, PLSI and LDA, and are used in many fields such as genetics, text and the web, image analysis and recommender systems. However, only recently have reasonable methods for estimating the likelihood of unseen documents, for instance to perform testing or model comparison, become available. This paper explores a number of recent methods, and improves their theory, performance, and testing.
Negative Binomial Process Count and Mixture Modeling
, 2013
Abstract

Cited by 17 (10 self)
The seemingly disjoint problems of count and mixture modeling are united under the negative binomial (NB) process. A gamma process is employed to model the rate measure of a Poisson process, whose normalization provides a random probability measure for mixture modeling and whose marginalization leads to an NB process for count modeling. A draw from the NB process consists of a Poisson distributed finite number of distinct atoms, each of which is associated with a logarithmic distributed number of data samples. We reveal relationships between various count- and mixture-modeling distributions and construct a Poisson-logarithmic bivariate distribution that connects the NB and Chinese restaurant table distributions. Fundamental properties of the models are developed, and we derive efficient Bayesian inference. It is shown that with augmentation and normalization, the NB process and gamma-NB process can be reduced to the Dirichlet process and hierarchical Dirichlet process, respectively. These relationships highlight theoretical, structural and computational advantages of the NB process. A variety of NB processes, including the beta-geometric, beta-NB, marked-beta-NB, marked-gamma-NB and zero-inflated-NB processes, with distinct sharing mechanisms, are also constructed. These models are applied to topic modeling, with connections made to existing algorithms under Poisson factor analysis. Example results show the importance of inferring both the NB dispersion and probability parameters.
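The compound construction in the abstract, a Poisson number of atoms each carrying a logarithmic-distributed count, can be checked empirically: the total count should be negative-binomial distributed with mean r·p/(1−p). The samplers below are textbook methods (Knuth's Poisson, CDF inversion for the logarithmic), not the paper's inference code:

```python
import math
import random

random.seed(1)

def sample_poisson(lam):
    """Knuth's method; fine for the small rates used here."""
    L, k, prod = math.exp(-lam), 0, random.random()
    while prod > L:
        k += 1
        prod *= random.random()
    return k

def sample_logarithmic(p):
    """Logarithmic(p) by CDF inversion: P(k) = -p^k / (k * log(1-p))."""
    u, k, cdf = random.random(), 1, 0.0
    norm = -math.log(1.0 - p)
    while True:
        cdf += p ** k / (k * norm)
        if u <= cdf:
            return k
        k += 1

def sample_nb_compound(r, p):
    """NB(r, p) draw built as the abstract describes: a Poisson number
    of atoms, each with a Logarithmic(p) count of data samples."""
    n_atoms = sample_poisson(-r * math.log(1.0 - p))
    return sum(sample_logarithmic(p) for _ in range(n_atoms))

draws = [sample_nb_compound(r=2.0, p=0.5) for _ in range(5000)]
mean = sum(draws) / len(draws)   # should approach r*p/(1-p) = 2.0
```

With r = 2 and p = 0.5 the theoretical mean is 2.0, and the empirical mean over 5000 draws lands close to it, illustrating the Poisson-logarithmic-to-NB equivalence the paper builds on.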
TopicFlow Model: Unsupervised Learning of Topic-specific Influences of Hyperlinked Documents
Abstract

Cited by 13 (1 self)
Popular algorithms for modeling the influence of entities in networked data, such as PageRank, work by analyzing the hyperlink structure, but ignore the contents of documents. However, influence is often topic dependent, e.g., a web page of high influence in politics may be an unknown entity in sports. We design a new model called TopicFlow, which combines ideas from network flow and topic modeling, to learn this notion of topic-specific influences of hyperlinked documents in a completely unsupervised fashion. On the task of citation recommendation, which is an instance of capturing influence, the TopicFlow model, when combined with TF-IDF based cosine similarity, outperforms several competitive baselines by as much as 11.8%. Our empirical study of the model’s output on the ACL corpus demonstrates its ability to identify topically influential documents. The TopicFlow model is also competitive with the state-of-the-art Relational Topic Models in predicting the likelihood of unseen text on two different data sets. Due to its ability to learn topic-specific flows across each hyperlink, the TopicFlow model can be a powerful visualization tool to track the diffusion of topics across a citation network.
Evaluating Topic Coherence Using Distributional Semantics
Abstract

Cited by 12 (3 self)
This paper introduces distributional semantic similarity methods for automatically measuring the coherence of a set of words generated by a topic model. We construct a semantic space to represent each topic word by making use of Wikipedia as a reference corpus to identify context features and collect frequencies. Relatedness between topic words and context features is measured using variants of Pointwise Mutual Information (PMI). Topic coherence is determined by measuring the distance between these vectors computed using a variety of metrics. Evaluation on three data sets shows that the distributional-based measures outperform the state-of-the-art approach for this task.
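The vector construction above can be sketched as: build a PMI-weighted context vector per topic word, then average the pairwise cosine similarities. The co-occurrence counts below are invented toy data standing in for Wikipedia frequencies, and only one of the paper's several metrics (cosine) is shown:

```python
import math

# Invented toy counts: topic word -> {context feature: co-occurrence}.
cooc = {"dna": {"biology": 4, "lab": 2},
        "gene": {"biology": 5, "lab": 1},
        "market": {"finance": 6}}
wc = {"dna": 6, "gene": 6, "market": 6}       # topic-word totals
fc = {"biology": 9, "lab": 3, "finance": 6}   # context-feature totals
total = 30                                    # corpus-wide count

def pmi_vector(word):
    """PMI-weighted context vector for one topic word (positive PMI only)."""
    vec = {}
    for feat, c in cooc.get(word, {}).items():
        pmi = math.log((c * total) / (wc[word] * fc[feat]))
        if pmi > 0:
            vec[feat] = pmi
    return vec

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def topic_coherence(topic_words):
    """Mean pairwise cosine between the words' PMI context vectors."""
    vecs = [pmi_vector(w) for w in topic_words]
    pairs = [(i, j) for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return sum(cosine(vecs[i], vecs[j]) for i, j in pairs) / len(pairs)

bio = topic_coherence(["dna", "gene"])      # shared context features
odd = topic_coherence(["dna", "market"])    # disjoint context features
```

Words from a coherent topic share weighted context features and so score a high cosine, while words from unrelated domains have disjoint vectors and score zero, which is the signal the evaluation exploits.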