A Spectral Algorithm for Latent Dirichlet Allocation
Cited by 41 (9 self)
Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by multiple latent factors (topics), as opposed to just one. This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic-word distributions when only words are observed, and the topics are hidden. This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of topic models, including Latent Dirichlet Allocation (LDA). For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (i.e., third-order moments, which may be estimated with documents containing just three words). The method, called Excess Correlation Analysis, is based on a spectral decomposition of low-order moments via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVDs are carried out only on k × k matrices, where k is the number of latent factors (topics) and is typically much smaller than the dimension of the observation (word) space.
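To make the phrase "trigram statistics" concrete, the following is a minimal sketch of how an empirical third-order moment could be estimated from three-word documents. It is not Excess Correlation Analysis itself (no SVDs are performed), and the function name `empirical_third_moment` and the integer word-id encoding are assumptions made for illustration.

```python
from collections import Counter


def empirical_third_moment(docs):
    """Estimate Pr[x1 = i, x2 = j, x3 = k] from three-word documents.

    `docs` is an iterable of (w1, w2, w3) word-id triples (hypothetical
    encoding). The result is a sparse dict keyed by (i, j, k), i.e. the
    nonzero entries of the empirical third-order moment tensor.
    """
    triples = []
    for doc in docs:
        triple = tuple(doc)
        if len(triple) != 3:
            raise ValueError("each document must contain exactly three words")
        triples.append(triple)
    counts = Counter(triples)
    total = sum(counts.values())
    # Relative frequency of each ordered word triple.
    return {triple: c / total for triple, c in counts.items()}
```

A spectral method would then work with low-dimensional projections of such moments rather than with the full tensor, which is what keeps the decompositions small.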
Square deal: Lower bounds and improved relaxations for tensor recovery
CoRR
Cited by 21 (0 self)
Recovering a low-rank tensor from incomplete information is a recurring problem in signal processing and machine learning. The most popular convex relaxation of this problem minimizes the sum of the nuclear norms of the unfoldings of the tensor. We show that this approach can be substantially suboptimal: reliably recovering a K-way tensor of length n and Tucker rank r from Gaussian measurements requires $\Omega(r n^{K-1})$ observations. In contrast, a certain (intractable) nonconvex formulation needs only $O(r^K + nrK)$ observations. We introduce a very simple, new convex relaxation, which partially bridges this gap. Our new formulation succeeds with $O(r^{\lfloor K/2 \rfloor} n^{\lceil K/2 \rceil})$ observations. While these results pertain to Gaussian measurements, simulations strongly suggest that the new norm also outperforms the sum of nuclear norms for tensor completion from a random subset of entries. Our lower bound for the sum-of-nuclear-norms model follows from a new result on recovering signals with multiple sparse structures (e.g. sparse, low rank), which perhaps surprisingly demonstrates the significant suboptimality of the commonly used recovery approach via minimizing the sum of individual sparsity-inducing norms (e.g. $\ell_1$, nuclear norm). Our new formulation for low-rank tensor recovery, however, opens the possibility of reducing the sample complexity by exploiting several structures jointly.
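As a rough illustration of the gap between the three regimes, the sketch below evaluates the growth terms r·n^(K−1), r^K + n·r·K, and r^⌊K/2⌋·n^⌈K/2⌉ for given n, K, and r. Constants and any logarithmic factors hidden by the Ω/O notation are dropped, so the outputs are orders of growth rather than usable sample counts; the function names are hypothetical.

```python
def snn_lower_bound(n, K, r):
    """Growth term of the sum-of-nuclear-norms lower bound: r * n^(K-1)."""
    return r * n ** (K - 1)


def nonconvex_bound(n, K, r):
    """Growth term for the (intractable) nonconvex formulation: r^K + n*r*K."""
    return r ** K + n * r * K


def square_bound(n, K, r):
    """Growth term for the new relaxation: r^floor(K/2) * n^ceil(K/2)."""
    # For a positive integer K, ceil(K/2) equals (K + 1) // 2.
    return r ** (K // 2) * n ** ((K + 1) // 2)


if __name__ == "__main__":
    n, K, r = 10, 4, 2
    print(snn_lower_bound(n, K, r), nonconvex_bound(n, K, r), square_bound(n, K, r))
```

With these expressions, the new relaxation's term sits between the other two for K = 4, and coincides with the sum-of-nuclear-norms term when K = 3.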
Experiments with Spectral Learning of Latent-Variable PCFGs
Cited by 20 (8 self)
Latent-variable PCFGs (L-PCFGs) are a highly successful model for natural language parsing. Recent work (Cohen et al., 2012) has introduced a spectral algorithm for parameter estimation of L-PCFGs, which, unlike the EM algorithm, is guaranteed to give consistent parameter estimates (it has PAC-style guarantees of sample complexity). This paper describes experiments using the spectral algorithm. We show that the algorithm provides models with the same accuracy as EM, but is an order of magnitude more efficient. We describe a number of key steps used to obtain this level of performance; these should be relevant to other work on the application of spectral learning algorithms. We view our results as strong empirical evidence for the viability of spectral methods as an alternative to EM.
Spectral Experts for Estimating Mixtures of Linear Regressions
Cited by 15 (1 self)
Discriminative latent-variable models are typically learned using EM or gradient-based optimization, which suffer from local optima. In this paper, we develop a new computationally efficient and provably consistent estimator for a mixture of linear regressions, a simple instance of a discriminative latent-variable model. Our approach relies on a low-rank linear regression to recover a symmetric tensor, which can be factorized into the parameters using a tensor power method. We prove rates of convergence for our estimator and provide an empirical evaluation illustrating its strengths relative to local optimization (EM).
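The factorization step can be sketched with a generic tensor power iteration. The example below assumes a symmetric third-order tensor that is exactly orthogonally decomposable, T = Σ λ_i v_i⊗v_i⊗v_i with orthonormal v_i; it omits the low-rank regression and any whitening or deflation a full estimator would need, and all function names are hypothetical.

```python
import math


def build_tensor(weights, vectors):
    """Return T = sum_i w_i * v_i (x) v_i (x) v_i as a nested list."""
    d = len(vectors[0])
    T = [[[0.0] * d for _ in range(d)] for _ in range(d)]
    for w, v in zip(weights, vectors):
        for a in range(d):
            for b in range(d):
                for c in range(d):
                    T[a][b][c] += w * v[a] * v[b] * v[c]
    return T


def contract(T, u):
    """Return the vector T(I, u, u), contracting the last two modes with u."""
    d = len(u)
    return [
        sum(T[a][b][c] * u[b] * u[c] for b in range(d) for c in range(d))
        for a in range(d)
    ]


def normalize(x):
    norm = math.sqrt(sum(xi * xi for xi in x))
    return [xi / norm for xi in x]


def tensor_power_iteration(T, u0, iters=30):
    """Iterate u <- T(I, u, u) / ||T(I, u, u)|| from the starting vector u0.

    Returns (eigenvalue, eigenvector), where the eigenvalue is T(u, u, u).
    """
    u = normalize(u0)
    for _ in range(iters):
        u = normalize(contract(T, u))
    lam = sum(ui * ti for ui, ti in zip(u, contract(T, u)))
    return lam, u
```

From a given start, the iteration converges to the component v_i that maximizes λ_i·|⟨u0, v_i⟩|, so in practice several random restarts and deflation would be used to recover all components.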
Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. arXiv preprint arXiv:1406.3824, 2014
Cited by 8 (1 self)
The Dawid-Skene estimator has been widely used for inferring the true labels from the noisy labels provided by non-expert crowdsourcing workers. However, since the estimator maximizes a nonconvex log-likelihood function, it is hard to theoretically justify its performance. In this paper, we propose a two-stage efficient algorithm for multiclass crowd labeling problems. The first stage uses the spectral method to obtain an initial estimate of parameters. Then the second stage refines the estimation by optimizing the objective function of the Dawid-Skene estimator via the EM algorithm. We show that our algorithm achieves the optimal convergence rate up to a logarithmic factor. We conduct extensive experiments on synthetic and real datasets. Experimental results demonstrate that the proposed algorithm is comparable to the most accurate empirical approach, while outperforming several other recently proposed methods.
A Spectral Learning Approach to Knowledge Tracing
Cited by 5 (1 self)
Bayesian Knowledge Tracing (BKT) is a common way of determining student knowledge of skills in adaptive educational systems and cognitive tutors. The basic BKT is a Hidden Markov Model (HMM) that models student knowledge based on five parameters: prior, learn rate, forget, guess, and slip. Expectation Maximization (EM) is often used to learn these parameters from training data. However, EM is a time-consuming process, and is prone to converging to erroneous, implausible local optima depending on the initial values of the BKT parameters. In this paper we address these two problems by using spectral learning to learn a Predictive State Representation (PSR) that represents the BKT HMM. We then use a heuristic to extract the BKT parameters from the learned PSR using basic matrix operations. The spectral learning method is based on an approximate factorization of the estimated covariance of windows from students' sequences of correct and incorrect responses; it is fast, local-optimum-free, and statistically consistent. In the past few years, spectral techniques have been used on real-world problems involving latent variables in dynamical systems, computer vision, and natural language processing. Our results suggest that the parameters learned by the spectral algorithm can replace the parameters learned by EM; the results of our study show that the spectral algorithm can improve knowledge tracing parameter-fitting time significantly while maintaining the same prediction accuracy, or help to improve accuracy while still keeping parameter-fitting time equivalent to EM.
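For readers unfamiliar with how the five BKT parameters interact, here is a minimal sketch of the standard two-state HMM belief update. It shows only how prior, learn, forget, guess, and slip are used once they are known; it is not the spectral/PSR learning procedure described in the abstract, and the function names are hypothetical.

```python
def bkt_update(p_known, correct, learn, forget, guess, slip):
    """One BKT step: condition on the response, then apply the transition.

    p_known is the current probability that the skill is known.
    Assumes the response has nonzero probability under the model.
    """
    if correct:
        # A correct answer comes from knowing without slipping, or guessing.
        num = p_known * (1.0 - slip)
        den = num + (1.0 - p_known) * guess
    else:
        # An incorrect answer comes from slipping, or not knowing and not guessing.
        num = p_known * slip
        den = num + (1.0 - p_known) * (1.0 - guess)
    posterior = num / den
    # Transition: a known skill may be forgotten, an unknown one may be learned.
    return posterior * (1.0 - forget) + (1.0 - posterior) * learn


def trace_knowledge(prior, responses, learn, forget, guess, slip):
    """Return the knowledge estimate after each response in the sequence."""
    p = prior
    history = []
    for correct in responses:
        p = bkt_update(p, correct, learn, forget, guess, slip)
        history.append(p)
    return history
```

For example, `trace_knowledge(0.5, [True, False], learn=0.2, forget=0.0, guess=0.25, slip=0.1)` returns the estimate after each of the two responses.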
Statistical guarantees for the EM algorithm: From population to sample-based analysis, 2014
Cited by 5 (0 self)
We develop a general framework for proving rigorous guarantees on the performance of the EM algorithm and a variant known as gradient EM. Our analysis is divided into two parts: a treatment of these algorithms at the population level (in the limit of infinite data), followed by results that apply to updates based on a finite set of samples. First, we characterize the domain of attraction of any global maximizer of the population likelihood. This characterization is based on a novel view of the EM updates as a perturbed form of likelihood ascent, or in parallel, of the gradient EM updates as a perturbed form of standard gradient ascent. Leveraging this characterization, we then provide nonasymptotic guarantees on the EM and gradient EM algorithms when applied to a finite set of samples. We develop consequences of our general theory for three canonical examples of incomplete-data problems: mixture of Gaussians, mixture of regressions, and linear regression with covariates missing completely at random. In each case, our theory guarantees that with a suitable initialization, a relatively small number of EM (or gradient EM) steps will yield (with high probability) an estimate that is within statistical error of the MLE. We provide simulations to confirm this theoretically predicted behavior.
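To illustrate what a sample-based EM update looks like for the mixture-of-Gaussians example, the sketch below assumes a simplified one-dimensional setting chosen here for illustration: a balanced two-component mixture with means +θ and −θ and known standard deviation σ. Under that assumption the E-step and M-step collapse into a single closed-form update; the function names are hypothetical.

```python
import math


def em_step(theta, data, sigma=1.0):
    """One EM update for the mixture 0.5*N(theta, sigma^2) + 0.5*N(-theta, sigma^2).

    E-step: the posterior weight of the +theta component for a sample y is
    w(y) = 1 / (1 + exp(-2 * theta * y / sigma^2)), so 2*w(y) - 1 equals
    tanh(theta * y / sigma^2).
    M-step: the new theta is the sample mean of (2*w(y) - 1) * y.
    """
    return sum(math.tanh(theta * y / sigma ** 2) * y for y in data) / len(data)


def run_em(theta0, data, sigma=1.0, steps=50):
    """Apply a fixed number of EM updates starting from theta0."""
    theta = theta0
    for _ in range(steps):
        theta = em_step(theta, data, sigma)
    return theta
```

Starting from θ = 0 leaves the estimate at 0, since tanh(0) = 0 makes the update vanish; this is a small reminder of why the choice of initialization matters.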
Methods of Moments for Learning Stochastic Languages: Unified Presentation and Empirical Comparison
Cited by 4 (1 self)
Probabilistic latent-variable models are a powerful tool for modelling structured data. However, traditional expectation-maximization methods of learning such models are both computationally expensive and prone to local minima. In contrast to these traditional methods, recently developed learning algorithms based upon the method of moments are both computationally efficient and provide strong statistical guarantees. In this work we provide a unified presentation and empirical comparison of three general moment-based methods in the context of modelling stochastic languages. By rephrasing these methods upon a common theoretical ground, introducing novel theoretical results where necessary, we provide a clear comparison, making explicit the statistical assumptions upon which each method relies. With this theoretical grounding, we then provide an in-depth empirical analysis of the methods on both real and synthetic data with the goal of elucidating performance trends and highlighting important implementation details.
Spectral Methods for Supervised Topic Models
Cited by 4 (1 self)
Supervised topic models simultaneously model the latent topic structure of large collections of documents and a response variable associated with each document. Existing inference methods are based on either variational approximation or Monte Carlo sampling. This paper presents a novel spectral decomposition algorithm to recover the parameters of supervised latent Dirichlet allocation (sLDA) models. The Spectral-sLDA algorithm is provably correct and computationally efficient. We prove a sample complexity bound and subsequently derive a sufficient condition for the identifiability of sLDA. Thorough experiments on a diverse range of synthetic and real-world datasets verify the theory and demonstrate the practical effectiveness of the algorithm.
Contrastive Learning Using Spectral Methods
Cited by 4 (1 self)
In many natural settings, the analysis goal is not to characterize a single data set in isolation, but rather to understand the difference between one set of observations and another. For example, given a background corpus of news articles together with writings of a particular author, one may want a topic model that explains word patterns and themes specific to the author. Another example comes from genomics, in which biological signals may be collected from different regions of a genome, and one wants a model that captures the differential statistics observed in these regions. This paper formalizes this notion of contrastive learning for mixture models, and develops spectral algorithms for inferring mixture components specific to a foreground data set when contrasted with a background data set. The method builds on recent moment-based estimators and tensor decompositions for latent variable models, and has the intuitive feature of using background data statistics to appropriately modify moments estimated from foreground data. A key advantage of the method is that the background data need only be coarsely modeled, which is important when the background is too complex, noisy, or not of interest. The method is demonstrated on applications in contrastive topic modeling and genomic sequence analysis.