Results 1 - 10 of 32
Representation learning: A review and new perspectives.
- IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2013
"... Abstract-The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can b ..."
Abstract
-
Cited by 173 (4 self)
- Add to MetaCart
(Show Context)
The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation, and manifold learning.
Maxout networks
- In ICML, 2013
"... We consider the problem of designing models to leverage a recently introduced approximate model averaging technique called dropout. We define a simple new model called maxout (so named because its output is the max of a set of inputs, and because it is a natural companion to dropout) designed to bot ..."
Abstract
-
Cited by 68 (17 self)
- Add to MetaCart
(Show Context)
We consider the problem of designing models to leverage a recently introduced approximate model averaging technique called dropout. We define a simple new model called maxout (so named because its output is the max of a set of inputs, and because it is a natural companion to dropout) designed to both facilitate optimization by dropout and improve the accuracy of dropout’s fast approximate model averaging technique. We empirically verify that the model successfully accomplishes both of these tasks. We use maxout and dropout to demonstrate state of the art classification performance on four benchmark datasets: MNIST, CIFAR-10, CIFAR-100, and SVHN.
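The definition is concrete enough to sketch directly: a maxout unit outputs the maximum over k learned affine projections of its input. A minimal NumPy sketch under that reading (shapes, names, and the tiny usage example are illustrative, not the paper's code):

import numpy as np

def maxout(x, W, b):
    """Maxout hidden layer: element-wise max over k affine 'pieces'.

    x : (d,) input vector
    W : (k, m, d) weights for k linear pieces of m units each
    b : (k, m) biases
    returns : (m,) activations, h_j = max_i (W[i, j] . x + b[i, j])
    """
    z = np.einsum('kmd,d->km', W, x) + b  # k x m pre-activations
    return z.max(axis=0)                  # max over the k pieces

# Tiny usage example with random parameters.
rng = np.random.default_rng(0)
k, m, d = 3, 4, 5
h = maxout(rng.normal(size=d), rng.normal(size=(k, m, d)), rng.normal(size=(k, m)))
print(h.shape)  # (4,)

In training, dropout is applied to the inputs of such layers; the max over linear pieces keeps the unit piecewise-linear, which is part of why the model pairs well with dropout's approximate averaging.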
Dropout training as adaptive regularization
- In Advances in Neural Information Processing Systems 26, 2013
"... Abstract Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L 2 regu ..."
Abstract
-
Cited by 32 (3 self)
- Add to MetaCart
(Show Context)
Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learning algorithm, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classification tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset.
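Written out, the equivalence claimed above takes roughly the following shape in the generalized linear model setting, with delta the dropout probability, beta the weights, and I-hat an estimate of the Fisher information; the constants below follow one common statement of the result and are indicative rather than a substitute for the paper's derivation:

\[
R_{\mathrm{drop}}(\beta) \;\approx\; \frac{\delta}{2(1-\delta)} \sum_{j} \hat{I}_{jj}\,\beta_j^{2},
\qquad
\hat{I} \;=\; \sum_{i} A''\!\left(x_i^{\top}\beta\right) x_i x_i^{\top},
\]

where A is the log-partition function of the GLM. Rescaling each feature by \(\hat{I}_{jj}^{-1/2}\) turns this into an ordinary L2 (ridge) penalty, which is the "L2 regularizer applied after scaling the features" statement above.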
Practical recommendations for gradient-based training of deep architectures
- Neural Networks: Tricks of the Trade, 2013
"... ar ..."
(Show Context)
What regularized auto-encoders learn from the data generating distribution
- 2012
"... What do auto-encoders learn about the underlying data generating distribution? Recent work suggests that some auto-encoder variants do a good job of capturing the local manifold structure of data. This paper clarifies some of these previous observations by showing that minimizing a particular form o ..."
Abstract
-
Cited by 17 (7 self)
- Add to MetaCart
What do auto-encoders learn about the underlying data generating distribution? Recent work suggests that some auto-encoder variants do a good job of capturing the local manifold structure of data. This paper clarifies some of these previous observations by showing that minimizing a particular form of regularized reconstruction error yields a reconstruction function that locally characterizes the shape of the data generating density. We show that the auto-encoder captures the score (derivative of the log-density with respect to the input). This contradicts previous interpretations of reconstruction error as an energy function. Unlike previous results, the theorems provided here are completely generic and do not depend on the parametrization of the auto-encoder: they show what the auto-encoder would tend to if given enough capacity and examples. These results are for a contractive training criterion we show to be similar to the denoising auto-encoder training criterion with small corruption noise, but with contraction applied on the whole reconstruction function rather than just the encoder. Similarly to score matching, one can consider the proposed training criterion as a convenient alternative to maximum likelihood because it does not involve a partition function. Finally, we show how an approximate Metropolis-Hastings MCMC can be set up to recover samples from the estimated distribution, and this is confirmed in sampling experiments.
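The key identity is compact enough to state here. With r(.) the learned reconstruction function and sigma the small corruption/contraction scale, the result says the reconstruction residual estimates the score of the data-generating density p; schematically (precise conditions are in the paper):

\[
\frac{r(x) - x}{\sigma^{2}} \;\longrightarrow\; \frac{\partial \log p(x)}{\partial x}
\qquad \text{as } \sigma \to 0,
\]

so stepping from x toward r(x) moves up the estimated log-density, which is what the Metropolis-Hastings sampling scheme mentioned above exploits.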
On the emergence of object-selective features in unsupervised feature learning. Unpublished manuscript
- 2012
"... Recent work in unsupervised feature learning has focused on the goal of discovering high-level features from unlabeled images. Much progress has been made in this direction, but in most cases it is still standard to use a large amount of labeled data in order to construct detectors sensitive to obje ..."
Abstract
-
Cited by 15 (2 self)
- Add to MetaCart
(Show Context)
Recent work in unsupervised feature learning has focused on the goal of discovering high-level features from unlabeled images. Much progress has been made in this direction, but in most cases it is still standard to use a large amount of labeled data in order to construct detectors sensitive to object classes or other complex patterns in the data. In this paper, we aim to test the hypothesis that unsupervised feature learning methods, provided with only unlabeled data, can learn high-level, invariant features that are sensitive to commonly-occurring objects. Though a handful of prior results suggest that this is possible when each object class accounts for a large fraction of the data (as in many labeled datasets), it is unclear whether something similar can be accomplished when dealing with completely unlabeled data. A major obstacle to this test, however, is scale: we cannot expect to succeed with small datasets or with small numbers of learned features. Here, we propose a large-scale feature learning system that enables us to carry out this experiment, learning 150,000 features from tens of millions of unlabeled images. Based on two scalable clustering algorithms (K-means and agglomerative clustering), we find that our simple system can discover features sensitive to a commonly occurring object class (human faces) and can also combine these into detectors invariant to significant global distortions like large translations and scale.
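As a rough illustration of the K-means half of such a system (the full pipeline is far larger and adds agglomerative clustering and invariance training), the sketch below learns a small dictionary from normalized patches with spherical K-means and encodes data against it; all sizes and the normalization are stand-ins, not the paper's settings:

import numpy as np

def spherical_kmeans(X, k, iters=10, seed=0):
    """Spherical K-means: unit-norm centroids, assignment by dot-product similarity.
    X: (n, d) rows of (roughly whitened/normalized) patch vectors."""
    rng = np.random.default_rng(seed)
    D = X[rng.choice(len(X), k, replace=False)]            # initialize from data
    D = D / (np.linalg.norm(D, axis=1, keepdims=True) + 1e-8)
    for _ in range(iters):
        assign = (X @ D.T).argmax(axis=1)                  # nearest centroid by similarity
        for j in range(k):
            members = X[assign == j]
            if len(members):
                c = members.sum(axis=0)
                D[j] = c / (np.linalg.norm(c) + 1e-8)      # re-project onto the unit sphere
    return D

# Toy usage: 6x6 "patches" flattened to 36-dim vectors.
rng = np.random.default_rng(1)
patches = rng.normal(size=(1000, 36))
patches -= patches.mean(axis=1, keepdims=True)             # crude per-patch normalization
D = spherical_kmeans(patches, k=16)
features = np.maximum(0, patches @ D.T)                     # simple rectified encoding
print(features.shape)  # (1000, 16)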
Deep learning using linear support vector machines
- In ICML, 2013
"... Recently, fully-connected and convolutional neural networks have been trained to achieve state-of-the-art performance on a wide vari-ety of tasks such as speech recognition, im-age classification, natural language process-ing, and bioinformatics. For classification tasks, most of these “deep learnin ..."
Abstract
-
Cited by 11 (1 self)
- Add to MetaCart
Recently, fully-connected and convolutional neural networks have been trained to achieve state-of-the-art performance on a wide variety of tasks such as speech recognition, image classification, natural language processing, and bioinformatics. For classification tasks, most of these “deep learning” models employ the softmax activation function for prediction and minimize cross-entropy loss. In this paper, we demonstrate a small but consistent advantage of replacing the softmax layer with a linear support vector machine. Learning minimizes a margin-based loss instead of the cross-entropy loss. While there have been various combinations of neural nets and SVMs in prior art, our results show that simply replacing softmax with linear L2-SVMs gives significant gains on popular deep learning datasets: MNIST, CIFAR-10, and the ICML 2013 Representation Learning Workshop’s face expression recognition challenge.
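The change described above is localized to the output layer and the loss: keep the network's final linear layer, but train it with a margin-based (squared hinge, L2-SVM) objective instead of softmax plus cross-entropy. A minimal PyTorch-style sketch of such a loss; the one-vs-rest encoding and the constant C are illustrative assumptions, not the paper's exact formulation:

import torch

def l2_svm_loss(scores, labels, C=1.0):
    """Squared hinge (L2-SVM) loss on one-vs-rest targets.

    scores : (batch, n_classes) raw outputs of the network's final linear layer
    labels : (batch,) integer class labels
    Drop-in alternative to softmax + cross-entropy on the same scores.
    """
    targets = -torch.ones_like(scores)                         # -1 everywhere...
    targets[torch.arange(scores.size(0)), labels] = 1.0        # ...+1 at the true class
    margins = torch.clamp(1.0 - targets * scores, min=0.0)     # hinge
    return C * (margins ** 2).sum(dim=1).mean()                # squared hinge, batch mean

# Usage: call l2_svm_loss(model(x), y) wherever cross-entropy would have been used.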
Better Mixing via Deep Representations
- 2014
"... It has previously been hypothesized, and supported with some experimental evidence, that deeper representations, when well trained, tend to do a better job at disentangling the underlying factors of variation. We study the following related conjecture: better representations, in the sense of better ..."
Abstract
-
Cited by 9 (5 self)
- Add to MetaCart
(Show Context)
It has previously been hypothesized, and supported with some experimental evidence, that deeper representations, when well trained, tend to do a better job at disentangling the underlying factors of variation. We study the following related conjecture: better representations, in the sense of better disentangling, can be exploited to produce faster-mixing Markov chains. Consequently, mixing would be more efficient at higher levels of representation. To better understand why and how this happens, we propose a secondary conjecture: the higher-level samples fill the space they occupy more uniformly, and the high-density manifolds tend to unfold when represented at higher levels. The paper discusses these hypotheses and tests them experimentally through visualization and measurements of mixing and interpolating between samples.
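One of the experiments mentioned above, interpolating between two samples at a chosen level of representation and comparing against input-space interpolation, can be stated in a few lines. A sketch assuming a trained encoder/decoder pair is available; encode and decode here are placeholders, not the paper's models:

import numpy as np

def interpolate_at_level(x_a, x_b, encode, decode, steps=8):
    """Linearly interpolate between two inputs in a learned representation space
    and map each intermediate point back to input space. With identity maps this
    degenerates to pixel-space interpolation, the baseline to compare against."""
    h_a, h_b = encode(x_a), encode(x_b)
    ts = np.linspace(0.0, 1.0, steps)
    return [decode((1 - t) * h_a + t * h_b) for t in ts]

# Baseline: pixel-space interpolation via identity encoder/decoder.
identity = lambda z: z
path = interpolate_at_level(np.zeros(10), np.ones(10), identity, identity)
print(len(path))  # 8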
Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks
"... We propose the simple and efficient method of semi-supervised learning for deep neural networks. Basically, the proposed network is trained in a supervised fashion with labeled and unlabeled data simultaneously. For un-labeled data, Pseudo-Labels, just picking up the class which has the maximum pred ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
We propose a simple and efficient method of semi-supervised learning for deep neural networks. The proposed network is trained in a supervised fashion with labeled and unlabeled data simultaneously. For unlabeled data, Pseudo-Labels, obtained by simply picking the class with the maximum predicted probability, are used as if they were true labels. This is in effect equivalent to Entropy Regularization. It favors a low-density separation between classes, a commonly assumed prior for semi-supervised learning. With a Denoising Auto-Encoder and Dropout, this simple method outperforms conventional methods for semi-supervised learning with very small amounts of labeled data on the MNIST handwritten digit dataset.
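The procedure described above amounts to one extra loss term: for each unlabeled example, treat the class with the highest predicted probability as its label, and weight that term by a coefficient that is ramped up during training. A minimal PyTorch sketch of a single combined update under those assumptions; the model, optimizer, and the schedule for alpha are placeholders, not the paper's configuration:

import torch
import torch.nn.functional as F

def pseudo_label_step(model, optimizer, x_labeled, y_labeled, x_unlabeled, alpha):
    """One combined supervised + pseudo-label update."""
    model.train()
    optimizer.zero_grad()

    # Supervised term on the small labeled batch.
    loss_sup = F.cross_entropy(model(x_labeled), y_labeled)

    # Pseudo-labels: argmax of the current predictions, treated as if true labels.
    with torch.no_grad():
        pseudo = model(x_unlabeled).argmax(dim=1)
    loss_unsup = F.cross_entropy(model(x_unlabeled), pseudo)

    # alpha is ramped up over training in the paper; here it is just a number passed in.
    loss = loss_sup + alpha * loss_unsup
    loss.backward()
    optimizer.step()
    return loss.item()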
Discriminative unsupervised feature learning with convolutional neural networks
- arXiv:1406.6909
"... Current methods for training convolutional neural networks depend on large amounts of labeled samples for supervised training. In this paper we present an approach for training a convolutional neural network using only unlabeled data. We train the network to discriminate between a set of surrogate c ..."
Abstract
-
Cited by 6 (1 self)
- Add to MetaCart
(Show Context)
Current methods for training convolutional neural networks depend on large amounts of labeled samples for supervised training. In this paper we present an approach for training a convolutional neural network using only unlabeled data. We train the network to discriminate between a set of surrogate classes. Each surrogate class is formed by applying a variety of transformations to a randomly sampled ‘seed’ image patch. We find that this simple feature learning algorithm is surprisingly successful when applied to visual object recognition. The feature representation learned by our algorithm achieves classification results matching or outperforming the current state-of-the-art for unsupervised learning on several popular datasets (STL-10, CIFAR-10, Caltech-101).
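The data construction described above is easy to sketch: sample 'seed' patches, apply many random transformations to each, and let each seed define one surrogate class for the network to discriminate. A toy NumPy version using only crops, flips, and brightness/contrast jitter (the paper uses a much richer transformation family and trains a convolutional network on the result; sizes here are arbitrary):

import numpy as np

rng = np.random.default_rng(0)

def random_transform(patch):
    """Apply a random small crop, optional horizontal flip, and brightness/contrast jitter."""
    h, w = patch.shape
    dy, dx = rng.integers(0, 5, size=2)                       # random 32x32 crop from 36x36
    out = patch[dy:h - 4 + dy, dx:w - 4 + dx].copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]                                    # horizontal flip
    out = out * rng.uniform(0.7, 1.3) + rng.uniform(-0.1, 0.1)  # contrast / brightness jitter
    return out

def make_surrogate_dataset(seeds, n_transforms=32):
    """Each seed patch defines one surrogate class; its transformed copies are the examples."""
    X, y = [], []
    for label, seed in enumerate(seeds):
        for _ in range(n_transforms):
            X.append(random_transform(seed))
            y.append(label)
    return np.stack(X), np.array(y)

seeds = [rng.random((36, 36)) for _ in range(10)]             # stand-ins for sampled image patches
X, y = make_surrogate_dataset(seeds)
print(X.shape, y.shape)  # (320, 32, 32) (320,)

A classifier trained to predict the surrogate label from the transformed patch is then used only as a feature extractor; the surrogate labels themselves are discarded.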