Results 1-10 of 15
Tensor decompositions for learning latent variable models
, 2014
Cited by 83 (7 self)
This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models (including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation) which exploits a certain tensor structure in their low-order observable moments (typically, of second and third order). Specifically, parameter estimation is reduced to the problem of extracting a certain (orthogonal) decomposition of a symmetric tensor derived from the moments; this decomposition can be viewed as a natural generalization of the singular value decomposition for matrices. Although tensor decompositions are generally intractable to compute, the decomposition of these specially structured tensors can be efficiently obtained by a variety of approaches, including power iterations and maximization approaches (similar to the case of matrices). A detailed analysis of a robust tensor power method is provided, establishing an analogue of Wedin’s perturbation theorem for the singular vectors of matrices. This implies a robust and computationally tractable estimation approach for several popular latent variable models.
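The power-iteration idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes an exactly orthogonally decomposable symmetric third-order tensor with positive weights, and it uses plain power updates with random restarts and deflation. It is not the robust variant analyzed in the paper, and all function names, the restart count, and the iteration count are hypothetical choices made for illustration.

```python
import math
import random


def build_tensor(weights, vectors):
    """Return T = sum_r w_r * v_r (x) v_r (x) v_r as nested lists."""
    n = len(vectors[0])
    return [[[sum(w * v[i] * v[j] * v[k] for w, v in zip(weights, vectors))
              for k in range(n)] for j in range(n)] for i in range(n)]


def tensor_apply(T, u):
    """Return the vector T(I, u, u)."""
    n = len(u)
    return [sum(T[i][j][k] * u[j] * u[k] for j in range(n) for k in range(n))
            for i in range(n)]


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def power_iteration(T, n_restarts=10, n_iters=50, seed=0):
    """Estimate one (eigenvalue, eigenvector) pair; keep the best restart."""
    rng = random.Random(seed)
    n = len(T)
    best_value, best_vector = -math.inf, None
    for _ in range(n_restarts):
        u = normalize([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(n_iters):
            u = normalize(tensor_apply(T, u))  # power update u <- T(I, u, u)
        value = sum(a * b for a, b in zip(tensor_apply(T, u), u))  # T(u, u, u)
        if value > best_value:
            best_value, best_vector = value, u
    return best_value, best_vector


def deflate(T, value, u):
    """Subtract the recovered rank-one term value * u (x) u (x) u."""
    n = len(u)
    return [[[T[i][j][k] - value * u[i] * u[j] * u[k]
              for k in range(n)] for j in range(n)] for i in range(n)]


def decompose(T, rank):
    """Recover `rank` (eigenvalue, eigenvector) pairs by repeated deflation."""
    pairs = []
    for _ in range(rank):
        value, u = power_iteration(T)
        pairs.append((value, u))
        T = deflate(T, value, u)
    return pairs


# Toy example: three orthonormal components in R^3 (assumed data).
s = 1.0 / math.sqrt(2.0)
basis = [[s, s, 0.0], [s, -s, 0.0], [0.0, 0.0, 1.0]]
weights = [3.0, 2.0, 1.0]
pairs = decompose(build_tensor(weights, basis), rank=3)
for value, vector in pairs:
    print(round(value, 6), [round(x, 6) for x in vector])
```

In a latent variable model the tensor would be estimated from empirical moments and whitened first, so it would only be approximately orthogonally decomposable; the sketch above skips that step entirely.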
Spectral Experts for Estimating Mixtures of Linear Regressions
Cited by 15 (2 self)
Discriminative latent-variable models are typically learned using EM or gradient-based optimization, which suffer from local optima. In this paper, we develop a new computationally efficient and provably consistent estimator for a mixture of linear regressions, a simple instance of a discriminative latent-variable model. Our approach relies on a low-rank linear regression to recover a symmetric tensor, which can be factorized into the parameters using a tensor power method. We prove rates of convergence for our estimator and provide an empirical evaluation illustrating its strengths relative to local optimization (EM).
The algebraic combinatorial approach for low-rank matrix completion
, 2014
Cited by 10 (2 self)
We present a novel algebraic combinatorial view on low-rank matrix completion based on studying relations between a few entries with tools from algebraic geometry and matroid theory. The intrinsic locality of the approach allows for the treatment of single entries in a closed theoretical and practical framework. More specifically, apart from introducing an algebraic combinatorial theory of low-rank matrix completion, we present probability-one algorithms to decide whether a particular entry of the matrix can be completed. We also describe methods to complete that entry from a few others, and to estimate the error which is incurred by any method completing that entry. Furthermore, we show how known results on matrix completion and their sampling assumptions can be related to our new perspective and interpreted in terms of a completability phase transition. On this revision: this revision (version 4) is both abridged and extended in terms of exposition and results, as compared to version 3 (Király et al., 2013). The theoretical foundations are developed in a more ad hoc way, which allows the main statements and algorithmic implications to be reached more quickly. Version 3 contains a more principled derivation of the theory and more related results (e.g., estimation of missing entries and its consistency, representations for the determinantal matroid, detailed examples), but its focus is further away from applications. A reader who is interested in both is invited to read the main parts of version 4 first, then go through version 3 for a more detailed view on the theory.
Unsupervised Learning of Noisy-Or Bayesian Networks
Cited by 5 (3 self)
This paper considers the problem of learning the parameters in Bayesian networks of discrete variables with known structure and hidden variables. Previous approaches in these settings typically use expectation maximization; when the network has high treewidth, the required expectations might be approximated using Monte Carlo or variational methods. We show how to avoid inference altogether during learning by giving a polynomial-time algorithm based on the method of moments, building upon recent work on learning discrete-valued mixture models. In particular, we show how to learn the parameters for a family of bipartite noisy-or Bayesian networks. In our experimental results, we demonstrate an application of our algorithm to learning QMR-DT, a large Bayesian network used for medical diagnosis. We show that it is possible to fully learn the parameters of QMR-DT even when only the findings are observed in the training data (ground truth diseases unknown).
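For readers unfamiliar with the noisy-or parameterization this abstract refers to, here is a minimal sketch of the conditional distribution of a single finding given its parent diseases. It uses the common textbook form with per-parent activation probabilities and an optional leak term, which is an assumption and may not match the paper's notation; the function name and arguments are hypothetical. The sketch only evaluates the distribution and does not implement the paper's learning algorithm.

```python
def noisy_or_probability(parent_states, activation_probs, leak=0.0):
    """Return P(finding = 1 | parents) under a noisy-or conditional distribution.

    parent_states: 0/1 values, one per parent disease.
    activation_probs: probability that each active parent alone turns the finding on.
    leak: probability that the finding is on with no active parent.
    """
    if len(parent_states) != len(activation_probs):
        raise ValueError("parent_states and activation_probs must have equal length")
    prob_off = 1.0 - leak
    for state, p in zip(parent_states, activation_probs):
        if state:
            # Each active parent independently fails to activate the finding.
            prob_off *= 1.0 - p
    return 1.0 - prob_off


# Two parent diseases (assumed values); only the first is present.
print(noisy_or_probability([1, 0], [0.8, 0.5]))
```

The finding stays off only if the leak and every active parent all fail to trigger it, which is why the probabilities of failure multiply.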
Contrastive Learning Using Spectral Methods
Cited by 5 (2 self)
In many natural settings, the analysis goal is not to characterize a single data set in isolation, but rather to understand the difference between one set of observations and another. For example, given a background corpus of news articles together with writings of a particular author, one may want a topic model that explains word patterns and themes specific to the author. Another example comes from genomics, in which biological signals may be collected from different regions of a genome, and one wants a model that captures the differential statistics observed in these regions. This paper formalizes this notion of contrastive learning for mixture models, and develops spectral algorithms for inferring mixture components specific to a foreground data set when contrasted with a background data set. The method builds on recent moment-based estimators and tensor decompositions for latent variable models, and has the intuitive feature of using background data statistics to appropriately modify moments estimated from foreground data. A key advantage of the method is that the background data need only be coarsely modeled, which is important when the background is too complex, noisy, or not of interest. The method is demonstrated on applications in contrastive topic modeling and genomic sequence analysis.
Uniqueness of Tensor Decompositions with Applications to Polynomial Identifiability. ArXiv 1304.8087
, 2013
Cited by 3 (1 self)
We give a robust version of the celebrated result of Kruskal on the uniqueness of tensor decompositions: we prove that given a tensor whose decomposition satisfies a robust form of Kruskal’s rank condition, it is possible to approximately recover the decomposition if the tensor is known up to a sufficiently small (inverse polynomial) error. Kruskal’s theorem has found many applications in proving the identifiability of parameters for various latent variable models and mixture models, such as hidden Markov models and topic models. Our robust version immediately implies identifiability using only polynomially many samples in many of these settings. This polynomial identifiability is an essential first step towards efficient learning algorithms for these models. Recently, algorithms based on tensor decompositions have been used to estimate the parameters of various hidden variable models efficiently in special cases as long as they satisfy certain “non-degeneracy” properties. Our methods give a way to go beyond this non-degeneracy barrier, and establish polynomial identifiability of the parameters under much milder conditions. Given the importance of Kruskal’s theorem in the tensor literature, we expect that this robust version will have several applications beyond the settings we explore in this work.
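As a reminder of the classical (non-robust) statement this abstract builds on, Kruskal's rank condition for a third-order tensor can be written as follows. The symbols $R$, $A$, $B$, $C$ and $k_A$, $k_B$, $k_C$ are notation introduced here for illustration and may differ from the paper's; the robust form proved in the paper is not reproduced.

```latex
% Rank-R decomposition with factor matrices A, B, C
% whose columns are a_r, b_r, c_r respectively.
T = \sum_{r=1}^{R} a_r \otimes b_r \otimes c_r

% k_A is the Kruskal rank of A: the largest k such that
% every set of k columns of A is linearly independent.
% Kruskal's condition:
k_A + k_B + k_C \ge 2R + 2

% If the condition holds, the decomposition is unique up to
% permuting the R terms and rescaling the factors within each term.
```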
Estimating Latent-Variable Graphical Models using Moments and Likelihoods
Cited by 3 (1 self)
Recent work on the method of moments enables consistent parameter estimation, but only for certain types of latent-variable models. On the other hand, pure likelihood objectives, though more universally applicable, are difficult to optimize. In this work, we show that using the method of moments in conjunction with composite likelihood yields consistent parameter estimates for a much broader class of discrete directed and undirected graphical models, including loopy graphs with high treewidth. Specifically, we use tensor factorization to reveal information about the hidden variables. This allows us to construct convex likelihoods which can be globally optimized to recover the parameters.
Provable Algorithms for Machine Learning Problems
, 2013
Cited by 2 (0 self)
Modern machine learning algorithms can extract useful information from text, images and videos. All these applications involve solving NP-hard problems in the average case using heuristics. What properties of the input allow it to be solved efficiently? Theoretically analyzing the heuristics is often very challenging, and few results were known. This thesis takes a different approach: we identify natural properties of the input, then design new algorithms that provably work assuming the input has these properties. We are able to give new, provable and sometimes practical algorithms for learning tasks related to text corpora, images and social networks. The first part of the thesis presents new algorithms for learning thematic structure in documents. We show that, under a reasonable assumption, it is possible to provably learn many topic models, including the famous Latent Dirichlet Allocation. Our algorithm is the first provable algorithm for topic modeling. An implementation runs 50 times faster than the latest MCMC implementation and produces comparable results. The second part of the thesis provides ideas for provably learning deep, sparse representations. We start with sparse linear representations, and give the first algorithm for the dictionary learning problem with provable guarantees. Then we apply similar ideas to deep learning: under reasonable assumptions our algorithms can learn a deep network built by denoising autoencoders. The final part of the thesis develops a framework for learning latent variable models. We demonstrate how various latent variable models can be reduced to orthogonal tensor decomposition, and then be solved using the tensor power method. We give a tight perturbation analysis for the tensor power method, which reduces the number of samples required to learn many latent variable models.
In theory, the assumptions in this thesis help us understand why intractable problems in machine learning can often be solved; in practice, the results suggest inherently new approaches for machine learning. We hope the assumptions and algorithms inspire new research problems and learning algorithms.
Unsupervised Risk Estimation Using Only Conditional Independence Structure
We show how to estimate a model's test error from unlabeled data, on distributions very different from the training distribution, while assuming only that certain conditional independencies are preserved between train and test. We do not need to assume that the optimal predictor is the same between train and test, or that the true distribution lies in any parametric family. We can also efficiently compute gradients of the estimated error and hence perform unsupervised discriminative learning. Our technical tool is the method of moments, which allows us to exploit conditional independencies in the absence of a fully specified model. Our framework encompasses a large family of losses including the log and exponential loss, and extends to structured output settings such as conditional random fields.
A Learnability Analysis of Argument and Modifier Structure
We present a computational learnability analysis of the argument-modifier distinction, asking whether information present in the distribution of constituents in natural language supports the distinction and its learnability. We first develop general models of those aspects of argument structure and the argument-modifier distinction which have effects on the distribution of constituents in sentences, abstracting away many of the implementational details of specific theoretical proposals. Combining these models with a theory of learning based on succinctness, we define two systems, the argument-only (PTSG) model and the argument-modifier (PSAG) model. We first show that the argument-modifier (PSAG) model is able to recover the argument-modifier status of many individual constituents when evaluated against a gold standard. This provides evidence in favor of our general account of argument-modifier structure as well as providing a lower bound on the amount of information that natural language input can provide for appropriately equipped learners attempting to recover the argument-modifier status of individual constituents.