Results 1  10
of
32
Dropout: A simple way to prevent neural networks from overfitting
 Journal of Machine Learning Research
, 1929
"... Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural net ..."
Abstract

Cited by 49 (3 self)
 Add to MetaCart
Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from coadapting too much. During training, dropout samples from an exponential number of different “thinned ” networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining stateoftheart results on many benchmark data sets.
The dropout learning algorithm
 Artificial int lligence
"... Dropout is a recently introduced algorithm for training neural networks by randomly dropping units during training to prevent their coadaptation. A mathematical analysis of some of the static and dynamic properties of dropout is provided using Bernoulli gating variables, general enough to accommoda ..."
Abstract

Cited by 10 (2 self)
 Add to MetaCart
(Show Context)
Dropout is a recently introduced algorithm for training neural networks by randomly dropping units during training to prevent their coadaptation. A mathematical analysis of some of the static and dynamic properties of dropout is provided using Bernoulli gating variables, general enough to accommodate dropout on units or connections, and with variable rates. The framework allows a complete analysis of the ensemble averaging properties of dropout in linear networks, which is useful to understand the nonlinear case. The ensemble averaging properties of dropout in nonlinear logistic networks result from three fundamental equations: (1) the approximation of the expectations of logistic functions by normalized geometric means, for which bounds and estimates are derived; (2) the algebraic equality between normalized geometric means of logistic functions with the logistic of the means, which mathematically characterizes logistic functions; and (3) the linearity of the means with respect to sums, as well as products of independent variables. The results are also extended to other classes of transfer functions, including rectified linear functions. Approximation errors tend to cancel each other and do not accumulate. Dropout can also be connected to stochastic neurons and used to predict firing rates, and to backpropagation by viewing the backward propagation as ensemble averaging in a dropout linear network. Moreover, the convergence properties of dropout can be understood in terms of stochastic gradient descent. Finally, for the regularization properties of dropout, the expectation of the dropout gradient is the gradient of the corresponding approximation ensemble, regularized by an adaptive weight decay term with a propensity for selfconsistent variance minimization and sparse representations.
Discriminative unsupervised feature learning with convolutional neural networks
 arXiv:1406.6909
"... Current methods for training convolutional neural networks depend on large amounts of labeled samples for supervised training. In this paper we present an approach for training a convolutional neural network using only unlabeled data. We train the network to discriminate between a set of surrogate c ..."
Abstract

Cited by 6 (1 self)
 Add to MetaCart
(Show Context)
Current methods for training convolutional neural networks depend on large amounts of labeled samples for supervised training. In this paper we present an approach for training a convolutional neural network using only unlabeled data. We train the network to discriminate between a set of surrogate classes. Each surrogate class is formed by applying a variety of transformations to a randomly sampled ’seed ’ image patch. We find that this simple feature learning algorithm is surprisingly successful when applied to visual object recognition. The feature representation learned by our algorithm achieves classification results matching or outperforming the current stateoftheart for unsupervised learning on several popular datasets (STL10, CIFAR10, Caltech101). 1
Follow the Leader with Dropout Perturbations
 JMLR: WORKSHOP AND CONFERENCE PROCEEDINGS VOL 35:1–26, 2014
, 2014
"... We consider online prediction with expert advice. Over the course of many trials, the goal of the learning algorithm is to achieve small additional loss (i.e. regret) compared to the loss of the best from a set of K experts. The two most popular algorithms are Hedge/Weighted Majority and Follow the ..."
Abstract

Cited by 5 (2 self)
 Add to MetaCart
We consider online prediction with expert advice. Over the course of many trials, the goal of the learning algorithm is to achieve small additional loss (i.e. regret) compared to the loss of the best from a set of K experts. The two most popular algorithms are Hedge/Weighted Majority and Follow the Perturbed Leader (FPL). The latter algorithm first perturbs the loss of each expert by independent additive noise drawn from a fixed distribution, and then predicts with the expert of minimum perturbed loss (“the leader”) where ties are broken uniformly at random. To achieve the optimal worstcase regret as a function of the loss L ∗ of the best expert in hindsight, the two types of algorithms need to tune their learning rate or noise magnitude, respectively, as a function of L∗. Instead of perturbing the losses of the experts with additive noise, we randomly set them to 0 or 1 before selecting the leader. We show that our perturbations are an instance of dropout — because experts may be interpreted as features — although for nonbinary losses the dropout probability needs to be made dependent on the losses to get good regret bounds. We show that this simple, tuningfree version of the FPL algorithm achieves two feats: optimal worstcase O( L ∗ lnK +
Learning by marginalizing corrupted features
 In Proceedings of the International Conference on Machine Learning
, 2013
"... The goal of machine learning is to develop predictors that generalize well to test data. Ideally, this is achieved by training on an almost infinitely large training data set that captures all variations in the data distribution. In practical learning settings, however, we do not have infinite data ..."
Abstract

Cited by 3 (3 self)
 Add to MetaCart
(Show Context)
The goal of machine learning is to develop predictors that generalize well to test data. Ideally, this is achieved by training on an almost infinitely large training data set that captures all variations in the data distribution. In practical learning settings, however, we do not have infinite data and our predictors may overfit. Overfitting may be combatted, for example, by adding a regularizer to the training objective or by defining a prior over the model parameters and performing Bayesian inference. In this paper, we propose a third, alternative approach to combat overfitting: we extend the training set with infinitely many artificial training examples that are obtained by corrupting the original training data. We show that this approach is practical and efficient for a range of predictors and corruption models. Our approach, called marginalized corrupted features (MCF), trains robust predictors by minimizing the expected value of the loss function under the corruption model. We show empirically on a variety of data sets that MCF classifiers can be trained efficiently, may generalize substantially better to test data, and are also more robust to feature deletion at test time. 1.
Big Data Opportunities and Challenges: Discussions from Data Analytics Perspectives
"... Abstract—“Big Data ” as a term has been among the biggest trends of the last three years, leading to an upsurge of research, as well as industry and government applications. Data is deemed a powerful raw material that can impact multidisciplinary research endeavors as well as government and business ..."
Abstract

Cited by 3 (0 self)
 Add to MetaCart
(Show Context)
Abstract—“Big Data ” as a term has been among the biggest trends of the last three years, leading to an upsurge of research, as well as industry and government applications. Data is deemed a powerful raw material that can impact multidisciplinary research endeavors as well as government and business performance. The goal of this discussion paper is to share the data analytics opinions and perspectives of the authors relating to the new opportunities and challenges brought forth by the big data movement. The authors bring together diverse perspectives, coming from different geographical locations with different core research expertise and different affiliations and work experiences. The aim of this paper is to evoke discussion rather than to provide a comprehensive survey of big data research. Index Terms—Big data, data analytics, machine learning, data mining, global optimization, application F 1
Deep Neural Nets as a Method for Quantitative Structure−Activity Relationships
"... ABSTRACT: Neural networks were widely used for quantitative structure−activity relationships (QSAR) in the 1990s. Because of various practical issues (e.g., slow on large problems, difficult to train, prone to overfitting, etc.), they were superseded by more robust methods like support vector machin ..."
Abstract

Cited by 2 (0 self)
 Add to MetaCart
ABSTRACT: Neural networks were widely used for quantitative structure−activity relationships (QSAR) in the 1990s. Because of various practical issues (e.g., slow on large problems, difficult to train, prone to overfitting, etc.), they were superseded by more robust methods like support vector machine (SVM) and random forest (RF), which arose in the early 2000s. The last 10 years has witnessed a revival of neural networks in the machine learning community thanks to new methods for preventing overfitting, more efficient training algorithms, and advancements in computer hardware. In particular, deep neural nets (DNNs), i.e. neural nets with more than one hidden layer, have found great successes in many applications, such as computer vision and natural language processing. Here we show that DNNs can routinely make better prospective predictions than RF on a set of large diverse QSAR data sets that are taken from Merck’s drug discovery effort. The number of adjustable parameters needed for DNNs is fairly large, but our results show that it is not necessary to optimize them for individual data sets, and a single set of recommended parameters can achieve better performance than RF for most of the data sets we studied. The usefulness of the
Collaborative deep learning for recommender systems
 CoRR
"... Collaborative filtering (CF) is a successful approach commonly used by many recommender systems. Conventional CFbased methods use the ratings given to items by users as the sole source of information for learning to make recommendation. However, the ratings are often very sparse in many application ..."
Abstract

Cited by 1 (0 self)
 Add to MetaCart
(Show Context)
Collaborative filtering (CF) is a successful approach commonly used by many recommender systems. Conventional CFbased methods use the ratings given to items by users as the sole source of information for learning to make recommendation. However, the ratings are often very sparse in many applications, causing CFbased methods to degrade significantly in their recommendation performance. To address this sparsity problem, auxiliary information such as item content information may be utilized. Collaborative topic regression (CTR) is an appealing recent method taking this approach which tightly couples the two components that learn from two different sources of information. Nevertheless, the latent representation learned by CTR may not be very effective when the auxiliary information is very sparse. Inspired by recent success demonstrated by deep learning models, we propose in this paper a hierarchical Bayesian model called collaborative deep learning (CDL), which tightly couples a Bayesian formulation of the stacked denoising autoencoder and probabilistic matrix factorization. Extensive experiments on realworld datasets show that CDL can significantly advance the state of the art. 1
Accelerated learning for Restricted Boltzmann Machine with momentum term
"... Abstract. Restricted Boltzmann Machines are generative models which can be used as standalone feature extractors, or as a parameter initialization for deeper models. Typically, these models are trained using Contrastive Divergence algorithm, an approximation of the stochastic gradient descent metho ..."
Abstract

Cited by 1 (0 self)
 Add to MetaCart
(Show Context)
Abstract. Restricted Boltzmann Machines are generative models which can be used as standalone feature extractors, or as a parameter initialization for deeper models. Typically, these models are trained using Contrastive Divergence algorithm, an approximation of the stochastic gradient descent method. In this paper, we aim at speeding up the convergence of the learning procedure by applying the momentum method and the Nesterov’s accelerated gradient technique. We evaluate these two techniques empirically using the image dataset MNIST.