Results 1–10 of 25
Montavon, G., Müller, K.-R.: Deep Boltzmann machines and the centering trick
In Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science, 2012
Cited by 12 (0 self)
Abstract. Deep Boltzmann machines are in theory capable of learning efficient representations of seemingly complex data. Designing an algorithm that effectively learns the data representation can be subject to multiple difficulties. In this chapter, we present the “centering trick”, which consists of rewriting the energy of the system as a function of centered states. The centering trick improves the conditioning of the underlying optimization problem and makes learning more stable, leading to models with better generative and discriminative properties.
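As a minimal illustration of the centering idea (not code from the chapter, and written for a plain binary RBM rather than a deep Boltzmann machine), the energy can be evaluated on states shifted by fixed offsets; the function name and the choice of offsets below are hypothetical:

```python
import numpy as np

def centered_energy(v, h, W, a, b, beta, mu):
    """Energy of a binary RBM written in terms of centered states:
    visible/hidden units are shifted by offsets beta/mu (e.g. their
    data means) before entering the bias and interaction terms."""
    vc, hc = v - beta, h - mu
    return -(a @ vc) - (b @ hc) - (vc @ W @ hc)
```

With zero offsets this reduces to the standard energy; choosing the offsets as the data means is what improves the conditioning of the gradients.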
Revisiting natural gradient for deep networks
In International Conference on Learning Representations, 2014
Cited by 8 (5 self)
Abstract. We evaluate natural gradient, an algorithm originally proposed in Amari (1997), for learning deep models. The contributions of this paper are as follows. We show the connection between natural gradient and three other recently proposed methods: Hessian-Free (Martens, 2010), Krylov Subspace Descent (Vinyals and Povey, 2012) and TONGA (Le Roux et al., 2008). We empirically evaluate the robustness of natural gradient to the ordering of the training set compared to stochastic gradient descent, and show how unlabeled data can be used to improve generalization error. Another contribution is to extend natural gradient to incorporate second-order error information alongside the manifold information. Lastly, we benchmark this new algorithm as well as natural gradient, where both are implemented using a truncated Newton approach for inverting the metric matrix instead of using a diagonal approximation of it.
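The truncated Newton idea mentioned in the abstract can be sketched generically: rather than forming the inverse of the metric (Fisher) matrix explicitly or approximating it by its diagonal, solve F d = g with a few conjugate-gradient iterations. This is an illustrative sketch, not the paper's implementation; the damping term and iteration count are assumptions:

```python
import numpy as np

def natural_gradient_step(fisher, grad, cg_iters=10, damping=1e-4):
    """Approximate the natural-gradient direction F^{-1} g by running
    truncated conjugate gradients on the damped system (F + damping*I) d = g."""
    F = fisher + damping * np.eye(len(grad))
    d = np.zeros_like(grad)
    r = grad - F @ d          # residual of F d = g
    p = r.copy()
    for _ in range(cg_iters):
        Fp = F @ p
        alpha = (r @ r) / (p @ Fp)
        d += alpha * p
        r_new = r - alpha * Fp
        if np.linalg.norm(r_new) < 1e-10:
            break             # converged early
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return d
```

In practice the Fisher-vector product `F @ p` would be computed matrix-free from minibatches rather than from a dense matrix.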
Convergence of the Continuous Time Trajectories of Isotropic Evolution Strategies on Monotonic C²-composite Functions
Cited by 5 (4 self)
Abstract. Information-Geometric Optimization (IGO) has been introduced as a unified framework for stochastic search algorithms. Given a parametrized family of probability distributions on the search space, IGO turns an arbitrary optimization problem on the search space into an optimization problem on the parameter space of the probability distribution family and defines a natural gradient ascent on this space. From the natural gradients defined over the entire parameter space we obtain continuous time trajectories, which are the solutions of an ordinary differential equation (ODE). Via discretization, IGO naturally defines an iterated gradient ascent algorithm. Depending on the chosen distribution family, IGO recovers several known algorithms, such as the pure rank-μ update CMA-ES. Consequently, the continuous time IGO trajectory can be viewed as an idealization of the original algorithm. In this paper we study the continuous time trajectories of IGO given the family of isotropic Gaussian distributions. These trajectories are a deterministic continuous time model of the underlying evolution strategy in the limit of the population size going to infinity and the change rates going to zero. On functions that are the composite of a monotone function and a convex-quadratic function, we prove the global convergence of the solution of the ODE towards the global optimum. We extend this result to composites of monotone and twice continuously differentiable functions and prove local convergence towards local optima.
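A toy discretization of the IGO flow for the isotropic Gaussian family might look like the sketch below. It is not from the paper: the fixed step size, the truncation weights on the best quarter of samples, and the frozen step size sigma (a full evolution strategy would adapt it) are all simplifying assumptions:

```python
import numpy as np

def igo_isotropic_step(m, sigma, f, n_samples=100, eta=0.5, rng=None):
    """One discretized IGO step for the isotropic Gaussian family N(m, sigma^2 I):
    sample, rank by f (minimization), and move the mean along a natural-gradient
    estimate with rank-based truncation weights."""
    if rng is None:
        rng = np.random.default_rng(0)
    X = m + sigma * rng.standard_normal((n_samples, len(m)))
    order = np.argsort([f(x) for x in X])   # best (lowest f) first
    n_best = n_samples // 4
    w = np.zeros(n_samples)
    w[order[:n_best]] = 1.0 / n_best        # uniform weights on the best quarter
    return m + eta * (w @ (X - m))          # weighted recombination of the mean
```

As the population size grows and eta shrinks, iterating this update approaches the deterministic ODE trajectory studied in the paper.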
Policy Improvement Methods: Between Black-Box Optimization and Episodic Reinforcement Learning
Pairwise MRF calibration by perturbation of the Bethe reference point
Research Report 8059, 2012
Cited by 4 (3 self)
Abstract. We investigate different ways of generating approximate solutions to the pairwise Markov random field (MRF) selection problem. We focus mainly on the inverse Ising problem, but also discuss the somewhat related inverse Gaussian problem, because both types of MRF are suitable for inference tasks with the belief propagation algorithm (BP) under certain conditions. Our approach consists of taking a Bethe mean-field solution obtained with a maximum spanning tree (MST) of pairwise mutual information, referred to as the Bethe reference point, as the starting point for further perturbation procedures. We consider three different ways of following this idea: in the first, we iteratively select and calibrate the optimal links to be added, starting from the Bethe reference point; the second is based on the observation that the natural gradient can be computed analytically at the Bethe point; in the third, assuming no local field and using a low temperature expansion, we develop a dual loop joint model based on …
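The maximum spanning tree over pairwise mutual information that underlies the Bethe reference point can be computed with a standard Prim-style pass. This sketch is not from the report and assumes the mutual-information matrix has already been estimated:

```python
import numpy as np

def max_spanning_tree(mi):
    """Maximum spanning tree over a symmetric pairwise mutual-information
    matrix (Prim's algorithm, maximizing edge weight), in the spirit of a
    Chow-Liu tree. Returns the list of tree edges (i, j)."""
    n = mi.shape[0]
    in_tree = [0]
    edges = []
    while len(in_tree) < n:
        best = None
        for i in in_tree:                     # scan the cut between tree and rest
            for j in range(n):
                if j not in in_tree and (best is None or mi[i, j] > best[2]):
                    best = (i, j, mi[i, j])
        edges.append((best[0], best[1]))
        in_tree.append(best[1])
    return edges
```

For large graphs one would use a heap-based Prim or Kruskal implementation instead of this quadratic scan.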
Objective Improvement in Information-Geometric Optimization
Cited by 4 (0 self)
Abstract. Information-Geometric Optimization (IGO) is a unified framework of stochastic algorithms for optimization problems. Given a family of probability distributions, IGO turns the original optimization problem into a new maximization problem on the parameter space of the probability distributions. IGO updates the parameter of the probability distribution along the natural gradient, taken with respect to the Fisher metric on the parameter manifold, aiming at maximizing an adaptive transform of the objective function. IGO recovers several known algorithms as particular instances: for the family of Bernoulli distributions IGO recovers PBIL, and for the family of Gaussian distributions the pure rank-μ update CMA-ES.
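For the Bernoulli family, the update IGO recovers is PBIL-like: nudge the probability vector toward the average of the best-ranked samples. A hedged sketch follows (sample size, learning rate, and elite fraction are arbitrary illustrative choices, not taken from the paper):

```python
import numpy as np

def pbil_step(p, f, n_samples=50, lr=0.1, n_best=10, rng=None):
    """One PBIL-style update for a vector of Bernoulli parameters p:
    draw binary samples, rank them by f (maximization), and move p
    toward the mean of the elite samples."""
    if rng is None:
        rng = np.random.default_rng(0)
    X = (rng.random((n_samples, len(p))) < p).astype(float)  # Bernoulli(p) samples
    order = np.argsort([-f(x) for x in X])                   # best (highest f) first
    elite = X[order[:n_best]].mean(axis=0)
    return (1 - lr) * p + lr * elite
```

On a toy OneMax objective (maximize the number of ones), iterating this update drives every component of p toward 1.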
Adaptive exploration for continual reinforcement learning
In International Conference on Intelligent Robots and Systems (IROS), 2012
Cited by 3 (3 self)
Abstract. Most experiments on policy search for robotics focus on isolated tasks, where the experiment is split into two distinct phases: 1) the learning phase, where the robot learns the task through exploration; 2) the exploitation phase, where exploration is turned off, and the robot demonstrates its performance on the task it has learned. In this paper, we present an algorithm that enables robots to continually and autonomously alternate between these phases. We do so by combining the ‘Policy Improvement with Path Integrals’ direct reinforcement learning algorithm with the covariance matrix adaptation rule from the ‘Cross-Entropy Method’ optimization algorithm. This integration is possible because both algorithms iteratively update parameters with probability-weighted averaging. A practical advantage of the novel algorithm, called PI²-CMA, is that it relieves the user of having to manually tune the degree of exploration. We evaluate PI²-CMA’s ability to continually and autonomously tune exploration on two tasks.
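The probability-weighted averaging shared by the two combined algorithms can be sketched generically: map costs to nonnegative weights (here via an exponentiated, normalized negative cost, as in PI²-style weighting), then form a weighted mean and covariance of the sampled parameters. The temperature h and the min-max normalization are illustrative assumptions, not the exact PI²-CMA formulas:

```python
import numpy as np

def prob_weighted_update(samples, costs, h=10.0):
    """Probability-weighted averaging: low-cost samples get exponentially
    larger weights; the new mean and covariance are the weighted moments."""
    c = np.asarray(costs, dtype=float)
    z = np.exp(-h * (c - c.min()) / (c.max() - c.min() + 1e-12))
    w = z / z.sum()                      # weights sum to one
    mean = w @ samples                   # weighted mean of parameter samples
    centered = samples - mean
    cov = (w[:, None] * centered).T @ centered  # weighted covariance
    return mean, cov, w
```

Updating the covariance this way is what lets the exploration magnitude adapt automatically instead of being hand-tuned.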
Riemannian metrics for neural networks
Cited by 2 (2 self)
Abstract. We describe four algorithms for neural network training, each adapted to different scalability constraints. These algorithms are mathematically principled and invariant under a number of transformations in data and network representation, so that performance is independent of these choices. These algorithms are obtained from the setting of differential geometry, and are based either on the natural gradient using the Fisher information matrix, or on Hessian methods, scaled down in a specific way to allow for scalability while keeping some of their key mathematical properties. The most standard way to train neural networks, backpropagation, has several known shortcomings. Convergence can be quite slow. Backpropagation is sensitive to data representation: for instance, even such a simple operation as exchanging 0’s and 1’s on the input layer will affect performance (Figure 1), because this amounts to changing the parameters (weights and biases) in a nontrivial way, resulting in different gradient directions in parameter space, and better performance with 1’s than with 0’s. (In the related context of restricted Boltzmann machines, it has been found that the standard training technique by gradient ascent favors setting hidden units to 1, for very much the same reason [AAHO11, Section 5].) This specific phenomenon disappears if, instead of the logistic function, the hyperbolic tangent is used as the activation function. Scaling also has an effect on performance: for instance, a common recommendation [LBOM96] is to use 1.7159 tanh(2x/3) as the activation function.
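The recommendation cited from [LBOM96] can be checked numerically: the constants in 1.7159 tanh(2x/3) are chosen so that the activation maps ±1 to approximately ±1, keeping unit-variance inputs at unit variance. A small sketch (the function name is ours):

```python
import numpy as np

def scaled_tanh(x):
    """The scaled hyperbolic tangent f(x) = 1.7159 * tanh(2x/3):
    its constants are picked so that f(1) ~ 1 and f(-1) ~ -1."""
    return 1.7159 * np.tanh(2.0 * x / 3.0)
```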
Analysis of a Natural Gradient Algorithm on Monotonic Convex-Quadratic-Composite Functions
In "Proc. Genetic and Evolutionary Computation Conference (ACM-GECCO)", Philadelphia, USA, 2012
Cited by 1 (0 self)
HAL is a multidisciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.