Results 1–10 of 10
Improving User Topic Interest Profiles by Behavior Factorization
Abstract

Cited by 1 (0 self)
Many recommenders aim to provide relevant recommendations by building personal topic interest profiles and then using these profiles to find interesting content for the user. In social media, recommender systems build user profiles by directly combining users' topic interest signals from a wide variety of consumption and publishing behaviors, such as social media posts they authored, commented on, +1'd, or liked. Here we propose to separately model users' topical interests that come from these various behavioral signals in order to construct better user profiles. Intuitively, since publishing a post requires more effort, the topic interests coming from publishing signals should be a more accurate reflection of a user's central interests than, say, a simple gesture such as a +1. By separating a single user's interest profile into several behavioral profiles, we obtain better and cleaner topic interest signals, and we enable topic prediction for different types of behavior, such as topics that the user might +1 or comment on but would never write a post about. To do this at large scale in Google+, we employed matrix factorization techniques to model each user's behaviors as a separate example entry in the input user-by-topic matrix. Using this technique, which we call "behavioral factorization", we implemented and built a topic recommender that predicts users' topical interests from their actions within Google+. We experimentally showed that we obtain better and cleaner signals than baseline methods, and that we can more accurately predict topic interests and achieve better coverage.
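The core move — one matrix row per (user, behavior) pair rather than one per user — can be sketched with a small low-rank factorization. The counts, topic labels, and rank below are made up for illustration; this is a sketch of the idea, not the paper's production system.

```python
import numpy as np

# Hypothetical counts: rows are (user, behavior) pairs, columns are topics.
# Instead of one row per user, each behavior type gets its own row, so
# publishing and +1 signals are factorized separately ("behavioral factorization").
rows = [("alice", "post"), ("alice", "+1"), ("bob", "post"), ("bob", "comment")]
X = np.array([
    [5.0, 0.0, 1.0],   # alice's posts concentrate on topic 0
    [1.0, 3.0, 2.0],   # alice's +1s are broader (lighter-weight signal)
    [0.0, 4.0, 0.0],
    [2.0, 1.0, 3.0],
])

# Rank-2 approximation via truncated SVD: the behavior-specific profiles share
# a common low-dimensional topic space but keep separate coefficient rows.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_hat = U[:, :k] * s[:k] @ Vt[:k, :]

# Each row of X_hat is a denoised topic profile for one (user, behavior) pair.
post_profile = dict(zip(["t0", "t1", "t2"], X_hat[rows.index(("alice", "post"))]))
```

Because profiles are kept per behavior, the model can predict, say, what a user would +1 without conflating it with what they would post about.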
Self-disclosure topic model for classifying and analyzing Twitter conversations
Abstract

Cited by 1 (0 self)
Self-disclosure, the act of revealing oneself to others, is an important social behavior that strengthens interpersonal relationships and increases social support. Although there are many social science studies of self-disclosure, they are based on manual coding of small datasets and questionnaires. We conduct a computational analysis of self-disclosure with a large dataset of naturally occurring conversations, a semi-supervised machine learning algorithm, and a computational analysis of the effects of self-disclosure on subsequent conversations. We use a longitudinal dataset of 17 million tweets, all of which occurred in conversations that consist of five or more tweets directly replying to the previous tweet, and from dyads with twenty or more conversations each. We develop the self-disclosure topic model (SDTM), a variant of latent Dirichlet allocation (LDA), for automatically classifying the level of self-disclosure of each tweet. We take the results of SDTM and analyze the effects of self-disclosure on subsequent conversations. Our model significantly outperforms several comparable methods on classifying the level of self-disclosure, and the analysis of the longitudinal data using SDTM uncovers a significant positive correlation between self-disclosure and conversation frequency and length.
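The dataset selection criteria above (conversations of five or more chained tweets, from dyads with twenty or more such conversations) amount to a two-stage filter. A minimal sketch, where the `(dyad, tweets)` layout is an illustrative assumption, not the paper's data format:

```python
from collections import defaultdict

def filter_dyads(conversations, min_tweets=5, min_conversations=20):
    """Keep conversations with at least `min_tweets` tweets, then keep only
    dyads (pairs of users) that have at least `min_conversations` such
    conversations -- the selection criteria described in the abstract.

    `conversations` is a list of (dyad, tweets) pairs, where `dyad` is a
    frozenset of two user ids and `tweets` is the list of tweets in reply
    order. This layout is illustrative, not the paper's.
    """
    long_enough = [(d, t) for d, t in conversations if len(t) >= min_tweets]
    per_dyad = defaultdict(list)
    for dyad, tweets in long_enough:
        per_dyad[dyad].append(tweets)
    return {d: convs for d, convs in per_dyad.items()
            if len(convs) >= min_conversations}
```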
Dual online inference for latent Dirichlet allocation
, 2014
Abstract
Latent Dirichlet allocation (LDA) provides an efficient tool for analyzing very large text collections. In this paper, we discuss three novel contributions: (1) a proof of the tractability of MAP estimation of topic mixtures under certain conditions that fit well with practice, even though the problem is known to be intractable in the worst case; (2) a provably fast algorithm (OFW) for inferring topic mixtures; (3) a dual online algorithm (DOLDA) for learning LDA at large scale. We show that OFW converges to a local optimum, and that under certain conditions it can converge to the global optimum. The discussion of OFW is general and hence can be readily employed to accelerate MAP estimation in a wide class of probabilistic models. From extensive experiments we find that DOLDA achieves significantly better predictive performance and semantic quality, with lower runtime, than stochastic variational inference. Further, DOLDA enables us to easily analyze text streams or millions of documents.
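A minimal sketch of the kind of simplex-constrained inference OFW performs: generic Frank-Wolfe on the MAP objective for one document's topic mixture under fixed topics. The function name, step schedule, and stopping rule are illustrative, not the paper's exact algorithm.

```python
import numpy as np

def infer_mixture_fw(d, B, n_iters=200):
    """Frank-Wolfe sketch for MAP inference of a document's topic mixture.

    Maximizes sum_j d_j * log((B @ theta)_j) over the probability simplex,
    where d (length V) holds the document's word counts and B (V x K) holds
    topic-word probabilities. Each step moves toward the simplex vertex with
    the largest gradient coordinate, so theta stays a valid distribution.
    """
    V, K = B.shape
    theta = np.full(K, 1.0 / K)
    for t in range(n_iters):
        p = B @ theta                          # per-word probability under theta
        grad = B.T @ (d / np.maximum(p, 1e-12))
        s = np.zeros(K)
        s[np.argmax(grad)] = 1.0               # best vertex of the simplex
        gamma = 2.0 / (t + 2.0)                # standard Frank-Wolfe step size
        theta = (1.0 - gamma) * theta + gamma * s
    return theta
```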
Model Selection for Topic Models via Spectral Decomposition
Abstract
Topic models have achieved significant successes in analyzing large-scale text corpora. In practical applications, we are often confronted with the challenge of model selection, i.e., how to appropriately set the number of topics. Following recent advances in topic models via tensor decomposition, we make a first attempt to provide a theoretical analysis of model selection in latent Dirichlet allocation. Under mild conditions, we derive upper and lower bounds on the number of topics given a text collection of finite size. Experimental results demonstrate that our bounds are correct and tight. Furthermore, using the Gaussian mixture model as an example, we show that our methodology can be easily generalized to model selection analysis in other latent models.
Most Large Topic Models are Approximately Separable
Abstract
Separability has recently been leveraged as a key structural condition in topic models to develop asymptotically consistent algorithms with polynomial statistical and computational efficiency guarantees. Separability corresponds to the presence of at least one novel word for each topic. Empirical estimates of topic matrices for latent Dirichlet allocation models have been observed to be approximately separable. Separability may be a convenient structural property, but it appears to be too restrictive a condition. In this paper we explicitly demonstrate that separability is, in fact, an inevitable consequence of high dimensionality. In particular, we prove that when the columns of the topic matrix are independently sampled from a Dirichlet distribution, the resulting topic matrix will be approximately separable with probability tending to one as the number of rows (vocabulary size) scales to infinity sufficiently faster than the number of columns (topics). The proof combines concentration-of-measure results with properties of the Dirichlet distribution and union-bound arguments. Our proof techniques can be extended to other priors for general nonnegative matrices.
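The separability condition itself is easy to state in code: each topic column needs a word (row) whose probability mass is almost entirely confined to that column. A minimal checker, with the tolerance `eps` as an illustrative choice:

```python
import numpy as np

def approx_anchor_words(topic_matrix, eps=0.05):
    """For each topic (column) of a V x K topic matrix, look for an
    approximately novel ("anchor") word: a row whose mass outside that topic
    is at most `eps` of its total mass. Returns a topic -> word-index map,
    with None where no near-anchor exists. This checks the separability
    condition discussed in the paper; it is not the paper's proof machinery.
    """
    V, K = topic_matrix.shape
    anchors = {}
    for k in range(K):
        found = None
        for w in range(V):
            total = topic_matrix[w].sum()
            if total > 0 and (total - topic_matrix[w, k]) / total <= eps:
                found = w
                break
        anchors[k] = found
    return anchors
```

A matrix is approximately separable exactly when no entry of the returned map is None.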
PTE: Predictive Text Embedding through Large-scale Heterogeneous Text Networks
Abstract
Unsupervised text embedding methods, such as Skip-gram and Paragraph Vector, have been attracting increasing attention due to their simplicity, scalability, and effectiveness. However, compared to sophisticated deep learning architectures such as convolutional neural networks, these methods usually yield inferior results when applied to particular machine learning tasks. One possible reason is that these text embedding methods learn the representation of text in a fully unsupervised way, without leveraging the labeled information available for the task. Although the low-dimensional representations they learn are applicable to many different tasks, they are not particularly tuned for any task. In this paper, we fill this gap by proposing a semi-supervised representation learning method for text data, which we call predictive text embedding (PTE). Predictive text embedding utilizes both labeled and unlabeled data to learn the embedding of text. The labeled information and different levels of word co-occurrence information are first represented as a large-scale heterogeneous text network, which is then embedded into a low-dimensional space through a principled and efficient algorithm. This low-dimensional embedding not only preserves the semantic closeness of words and documents but also has strong predictive power for the particular task. Compared to recent supervised approaches based on convolutional neural networks, predictive text embedding is comparably or more effective, much more efficient, and has fewer parameters to tune.
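The first stage described above — representing a labeled corpus as a heterogeneous network — combines three bipartite edge sets: word-word co-occurrence, word-document, and word-label. A sketch of the edge-weight bookkeeping only (the embedding step is omitted, and the input layout of token lists plus one label per document is an assumption):

```python
from collections import Counter

def build_text_networks(docs, labels, window=2):
    """Count the three edge sets PTE embeds jointly: word-word co-occurrence
    (within a sliding window), word-document, and word-label. Returns three
    Counters mapping edges to weights. Illustrative sketch, not the paper's
    implementation.
    """
    ww, wd, wl = Counter(), Counter(), Counter()
    for doc_id, (tokens, label) in enumerate(zip(docs, labels)):
        for i, w in enumerate(tokens):
            wd[(w, doc_id)] += 1
            wl[(w, label)] += 1
            # pair w with the next `window` tokens; sort so edges are undirected
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                ww[tuple(sorted((w, tokens[j])))] += 1
    return ww, wd, wl
```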
Geometric Dirichlet Means algorithm for topic inference
Abstract
We propose a geometric algorithm for topic learning and inference that is built on the convex geometry of topics arising from the latent Dirichlet allocation (LDA) model and its nonparametric extensions. To this end we study the optimization of a geometric loss function, which is a surrogate for the LDA likelihood. Our method involves a fast optimization-based weighted clustering procedure augmented with geometric corrections, which overcomes the computational and statistical inefficiencies encountered by other techniques based on Gibbs sampling and variational inference, while achieving accuracy comparable to that of a Gibbs sampler. The topic estimates produced by our method are shown to be statistically consistent under some conditions. The algorithm is evaluated with extensive experiments on simulated and real data.
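The geometric intuition is that documents' word-frequency vectors live inside a simplex whose vertices are the topics, so cluster centroids pushed outward from the corpus centroid approximate those vertices. A sketch under strong simplifications: plain k-means and a fixed `extension` factor stand in for the paper's weighted clustering and principled geometric corrections.

```python
import numpy as np

def geometric_topic_estimates(W, K, extension=1.5, n_iters=50, seed=0):
    """Estimate K topic vectors from a D x V matrix W of per-document word
    frequencies: cluster documents, then extend each centroid away from the
    corpus centroid toward a simplex vertex. Illustrative sketch only.
    """
    rng = np.random.default_rng(seed)
    centers = W[rng.choice(len(W), K, replace=False)]
    for _ in range(n_iters):
        # assign each document to its nearest centroid, then recompute centroids
        labels = np.argmin(((W[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = W[labels == k].mean(axis=0)
    center = W.mean(axis=0)
    topics = center + extension * (centers - center)   # push toward the vertices
    topics = np.clip(topics, 1e-12, None)
    return topics / topics.sum(axis=1, keepdims=True)  # back onto the simplex
```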
How to Supervise Topic Models
Abstract
Supervised topic models are important machine learning tools which have been widely used in computer vision as well as in other domains. However, there is a gap in the understanding of the impact of supervision on the model. In this paper, we present a thorough analysis of the behaviour of supervised topic models using supervised latent Dirichlet allocation (SLDA) and propose two factorized supervised topic models, which factorize the topics into signal and noise. Experimental results on both synthetic data and real-world data for computer vision tasks show that supervision needs to be boosted to be effective and that factorized topic models are able to enhance performance.
A Survey on the Use of Topic Models when Mining Software Repositories (Empirical Software Engineering)
Abstract
Researchers support software development by mining and analyzing software repositories. Since the majority of software engineering data is unstructured, researchers have applied information retrieval (IR) techniques to help software development. Recent advances in IR, especially statistical topic models, have helped make sense of unstructured data in software repositories even more. However, even though there are hundreds of studies on applying topic models to software repositories, there is no study that shows how the models are used in the software engineering research community, or which software engineering tasks are being supported through topic models. Moreover, since the performance of these topic models is directly related to their parameters and usage, knowing how researchers use the topic models may also help future studies make optimal use of such models. Thus, we surveyed 167 articles from the software engineering literature that make use of topic models. We find that i) most studies centre around a limited number of software engineering tasks; ii) most studies use only basic topic models; and iii) researchers usually treat topic models as black boxes without fully exploring their underlying assumptions and parameter values. Our paper provides a starting point for new researchers who are interested in using topic models, and may help new researchers and practitioners determine how to best apply topic models to a particular software engineering task.
USC
, 2014
Abstract
Correctly choosing the number of topics plays an important role in successfully applying topic models to real-world applications. Following the latest tensor decomposition framework of Anandkumar et al., we make the first attempt to provide a theoretical analysis of the number of topics under the latent Dirichlet allocation model. Under mild conditions, our method provides accessible information on the number of topics, including both upper and lower bounds. Experimental results on synthetic datasets demonstrate that our proposed bounds are correct and tight. Furthermore, using the Gaussian mixture model as an example, we show that our methodology can be easily generalized to analyzing the number of mixture components in other mixture models.