Results 1-10 of 67
Rectified Linear Units Improve Restricted Boltzmann Machines
Vinod Nair
Abstract

Cited by 154 (8 self)
Restricted Boltzmann machines were developed using binary stochastic hidden units. These can be generalized by replacing each binary unit by an infinite number of copies that all have the same weights but have progressively more negative biases. The learning and inference rules for these “Stepped Sigmoid Units” are unchanged. They can be approximated efficiently by noisy, rectified linear units. Compared with binary units, these units learn features that are better for object recognition on the NORB dataset and face verification on the Labeled Faces in the Wild dataset. Unlike binary units, rectified linear units preserve information about relative intensities as information travels through multiple layers of feature detectors.
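The approximation described in this abstract can be illustrated in a few lines: the expected total activity of the infinite stack of tied-weight copies with biases shifted by -0.5, -1.5, -2.5, ... is approximately the softplus log(1 + e^x), and sampling from it can be approximated by a noisy rectified linear unit, max(0, x + N(0, sigmoid(x))). A minimal NumPy sketch (function names are ours, not the paper's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stepped_sigmoid_mean(x, n_copies=1000):
    # Expected total activity of n_copies tied-weight binary units whose
    # biases are offset by -0.5, -1.5, -2.5, ... (the "stepped sigmoid unit").
    offsets = np.arange(n_copies) + 0.5
    return sigmoid(x - offsets).sum()

def softplus(x):
    # Closed-form approximation to the sum above: log(1 + e^x).
    return np.log1p(np.exp(x))

def noisy_relu(x, rng):
    # Fast sampling approximation: max(0, x + noise) with noise
    # variance sigmoid(x), i.e. a noisy rectified linear unit.
    return max(0.0, x + rng.normal(0.0, np.sqrt(sigmoid(x))))

x = 2.0
print(stepped_sigmoid_mean(x))  # ≈ 2.12
print(softplus(x))              # ≈ 2.13
```

The two printed values agree to within a few thousandths, which is why the cheap softplus/rectified-linear form can stand in for the infinite stack of binary units during learning.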
MedLDA: Maximum Margin Supervised Topic Models for Regression and Classification
Abstract

Cited by 93 (27 self)
Supervised topic models utilize a document's side information for discovering predictive low-dimensional representations of documents; existing models apply likelihood-based estimation. In this paper, we present a max-margin supervised topic model for both continuous and categorical response variables. Our approach, the maximum entropy discrimination latent Dirichlet allocation (MedLDA), utilizes the max-margin principle to train supervised topic models and estimate predictive topic representations that are arguably more suitable for prediction. We develop efficient variational methods for posterior inference and demonstrate qualitatively and quantitatively the advantages of MedLDA over likelihood-based topic models on movie review and 20 Newsgroups data sets.
Multimodal learning with deep boltzmann machines
In NIPS, 2012
Abstract

Cited by 77 (2 self)
Data often consists of multiple diverse modalities. For example, images are tagged with textual information and videos are accompanied by audio. Each modality is characterized by having distinct statistical properties. We propose a Deep Boltzmann Machine for learning a generative model of such multimodal data. We show that the model can be used to create fused representations by combining features across modalities. These learned representations are useful for classification and information retrieval. By sampling from the conditional distributions over each data modality, it is possible to create these representations even when some data modalities are missing. We conduct experiments on bimodal image-text and audio-video data. The fused representation achieves good classification results on the MIR Flickr data set, matching or outperforming other deep models as well as SVM-based models that use Multiple Kernel Learning. We further demonstrate that this multimodal model helps classification and retrieval even when only unimodal data is available at test time.
Predictive subspace learning for multiview data: a large margin approach
In NIPS, 2010
Abstract

Cited by 27 (8 self)
Learning from multi-view data is important in many applications, such as image classification and annotation. In this paper, we present a large-margin learning framework to discover a predictive latent subspace representation shared by multiple views. Our approach is based on an undirected latent space Markov network that fulfills a weak conditional independence assumption that multi-view observations and response variables are independent given a set of latent variables. We provide efficient inference and parameter estimation methods for the latent subspace model. Finally, we demonstrate the advantages of large-margin learning on real video and web image data for discovering predictive latent representations and improving the performance on image classification, annotation and retrieval.
Neural Variational Inference and Learning in Belief Networks
Abstract

Cited by 26 (2 self)
Highly expressive directed latent variable models, such as sigmoid belief networks, are difficult to train on large datasets because exact inference in them is intractable and none of the approximate inference methods that have been applied to them scale well. We propose a fast non-iterative approximate inference method that uses a feedforward network to implement efficient exact sampling from the variational posterior. The model and this inference network are trained jointly by maximizing a variational lower bound on the log-likelihood. Although the naive estimator of the inference network gradient is too high-variance to be useful, we make it practical by applying several straightforward model-independent variance reduction techniques. Applying our approach to training sigmoid belief networks and deep autoregressive networks, we show that it outperforms the wake-sleep algorithm on MNIST and achieves state-of-the-art results on the Reuters RCV1 document dataset.
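The high-variance estimator mentioned in this abstract is the score-function (REINFORCE) gradient, and the simplest of the variance-reduction techniques is subtracting a baseline. The following is a toy sketch of that idea on a single Bernoulli latent variable, not the paper's full model (the objective `f` and all names are ours; NVIL's input-dependent baselines and normalization are reduced here to a running average):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def f(h):
    # Stand-in for the term inside the variational lower bound; the learner
    # only sees its value at sampled h, never its gradient.
    return 3.0 if h else 1.0

theta, baseline, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    p = sigmoid(theta)
    h = rng.random() < p                     # sample from q(h) = Bernoulli(p)
    # Score function: d/dtheta log q(h) = h - p for Bernoulli(sigmoid(theta)).
    # Subtracting the baseline leaves the gradient unbiased but lowers variance.
    grad = (f(h) - baseline) * ((1.0 if h else 0.0) - p)
    theta += lr * grad
    # Model-independent baseline: a running average of observed f values.
    baseline += 0.05 * (f(h) - baseline)

print(sigmoid(theta))  # close to 1: the sampler learns to prefer h = 1
```

Without the baseline the update multiplies the score by the raw value of `f`, so even at the optimum every sample produces a large, noisy gradient; centering by the running average is what makes this class of estimator usable at scale.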
Conditional Restricted Boltzmann Machines for Structured Output Prediction
Abstract

Cited by 22 (2 self)
Conditional Restricted Boltzmann Machines (CRBMs) are rich probabilistic models that have recently been applied to a wide range of problems, including collaborative filtering, classification, and modeling motion capture data. While much progress has been made in training non-conditional RBMs, these algorithms are not applicable to conditional models and there has been almost no work on training and generating predictions from conditional RBMs for structured output problems. We first argue that standard Contrastive Divergence-based learning may not be suitable for training CRBMs. We then identify two distinct types of structured output prediction problems and propose an improved learning algorithm for each. The first problem type is one where the output space has arbitrary structure but the set of likely output configurations is relatively small, such as in multi-label classification. The second problem is one where the output space is arbitrarily structured but where the output space variability is much greater, such as in image denoising or pixel labeling. We show that the new learning algorithms can work much better than Contrastive Divergence on both types of problems.
A neural autoregressive topic model
In Advances in Neural Information Processing Systems 25, 2012
Abstract

Cited by 19 (6 self)
We describe a new model for learning meaningful representations of text documents from an unlabeled collection of documents. This model is inspired by the recently proposed Replicated Softmax, an undirected graphical model of word counts that was shown to learn a better generative model and more meaningful document representations. Specifically, we take inspiration from the conditional mean-field recursive equations of the Replicated Softmax in order to define a neural network architecture that estimates the probability of observing a new word in a given document given the previously observed words. This paradigm also allows us to replace the expensive softmax distribution over words with a hierarchical distribution over paths in a binary tree of words. The end result is a model whose training complexity scales logarithmically with the vocabulary size instead of linearly as in the Replicated Softmax. Our experiments show that our model is competitive both as a generative model of documents and as a document representation learning algorithm.
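The hierarchical distribution over tree paths mentioned in this abstract is the source of the logarithmic scaling: a word's probability becomes a product of log2(V) binary decisions instead of one V-way softmax. A toy sketch of that idea with a complete binary tree (the tree layout and all names are ours, not the paper's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy hierarchical softmax: a vocabulary of V words at the leaves of a
# complete binary tree; each internal node holds a logistic classifier over
# the hidden state. P(word) is the product of the left/right decision
# probabilities along the word's path, so scoring one word costs O(log V)
# node evaluations instead of O(V).
V, H = 8, 4
rng = np.random.default_rng(0)
node_w = rng.normal(size=(V - 1, H))   # one weight vector per internal node

def word_prob(word, hidden):
    p, node = 1.0, 0                   # start at the root (internal node 0)
    # The bits of `word` (MSB first) give its left/right path in the tree.
    for bit in format(word, f"0{int(np.log2(V))}b"):
        q = sigmoid(node_w[node] @ hidden)          # P(go right | hidden)
        p *= q if bit == "1" else (1.0 - q)
        node = 2 * node + (2 if bit == "1" else 1)  # heap-style child index
    return p

hidden = rng.normal(size=H)
probs = [word_prob(w, hidden) for w in range(V)]
print(sum(probs))  # ~1.0: the tree defines a valid distribution over all 8 words
```

Because every leaf's path probabilities multiply out from the same per-node Bernoulli splits, the leaf probabilities sum to one by construction; no normalization over the full vocabulary is ever computed.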
An Efficient Learning Procedure for Deep Boltzmann Machines
2010
Abstract

Cited by 18 (1 self)
We present a new learning algorithm for Boltzmann Machines that contain many layers of hidden variables. Data-dependent statistics are estimated using a variational approximation that tends to focus on a single mode, and data-independent statistics are estimated using persistent Markov chains. The use of two quite different techniques for estimating the two types of statistic that enter into the gradient of the log likelihood makes it practical to learn Boltzmann Machines with multiple hidden layers and millions of parameters. The learning can be made more efficient by using a layer-by-layer “pretraining” phase that initializes the weights sensibly. The pretraining also allows the variational inference to be initialized sensibly with a single bottom-up pass. We present results on the MNIST and NORB datasets showing that Deep Boltzmann Machines learn very good generative models of handwritten digits and 3D objects. We also show that the features discovered by Deep Boltzmann Machines are a very effective way to initialize the hidden layers of feedforward neural nets which are then discriminatively fine-tuned.
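The persistent Markov chains mentioned in this abstract keep their state between parameter updates rather than restarting from the data at every step. A simplified sketch of that idea for a single-layer RBM (the full method additionally uses a mean-field approximation for the data-dependent term of a deep model; the toy dimensions, random data, and names below are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Single-layer binary RBM; biases omitted for brevity.
n_vis, n_hid, n_chains, lr = 6, 4, 10, 0.05
W = 0.01 * rng.normal(size=(n_vis, n_hid))
persistent_v = (rng.random((n_chains, n_vis)) < 0.5).astype(float)

def sample_h(v):
    return (rng.random((len(v), n_hid)) < sigmoid(v @ W)).astype(float)

def sample_v(h):
    return (rng.random((len(h), n_vis)) < sigmoid(h @ W.T)).astype(float)

data = (rng.random((20, n_vis)) < 0.5).astype(float)  # placeholder training data

for step in range(100):
    # Positive phase: data-dependent statistics (exact for a single-layer RBM).
    pos = data.T @ sigmoid(data @ W) / len(data)
    # Negative phase: advance the persistent chains by one Gibbs step and
    # carry their state over to the next update instead of restarting.
    persistent_v = sample_v(sample_h(persistent_v))
    neg = persistent_v.T @ sigmoid(persistent_v @ W) / n_chains
    W += lr * (pos - neg)
```

The design choice matters because a chain that survives across updates has time to mix toward the model distribution, giving usable data-independent statistics from a single Gibbs step per update.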
Learning Deep Generative Models
2009
Abstract

Cited by 18 (1 self)
Building intelligent systems that are capable of extracting high-level representations from high-dimensional sensory data lies at the core of solving many AI-related tasks, including object recognition, speech perception, and language understanding. Theoretical and biological arguments strongly suggest that building such systems requires models with deep architectures that involve many layers of nonlinear processing. The aim of the thesis is to demonstrate that deep generative models that contain many layers of latent variables and millions of parameters can be learned efficiently, and that the learned high-level feature representations can be successfully applied in a wide spectrum of application domains, including visual object recognition, information retrieval, and classification and regression tasks. In addition, similar methods can be used for nonlinear dimensionality reduction. The first part of the thesis focuses on analysis and applications of probabilistic generative models called Deep Belief Networks. We show that these deep hierarchical models can learn useful feature representations from a large supply of unlabeled sensory inputs. The learned high-level representations capture a lot of structure in the input data, which is useful for subsequent problem-specific tasks, such as classification, regression or information retrieval, even though these tasks are unknown when the ...