B. A. Pearlmutter, "Fast exact multiplication by the Hessian", Neural Computation, 6(1):147-160, 1994.
....for training MLPs, but those nonlinear versions attempt to approximate the entire Hessian matrix by generating the solution sequence directly as the outer nonlinear algorithm. Thus, they ignore the special structure of the nonlinear least squares problem; so does Pearlmutter's method [15] to the Newton formula, although its modification may be possible. [Figure: flowchart of the outer iteration, with stopping criteria and the algorithm for the local model check] ....
B. A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147--160, 1994.
....Here, we will consider that all neurons can play all these roles at the same time. This formalism has the advantage of being both simpler and more general than the traditional input-hidden-output layered architecture. This kind of formalism is attributed to Fernando Pineda by Pearlmutter [52]. [Figure A.1: Neuron in a feedforward network: y i = x i # i] A.1.2 The # # Notation. A special notation will be used to distinguish between two forms of partial derivative. For any variable v, the usual notation $\partial/\partial v$ means a derivative with respect to v, with all other ....
Barak A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6:147--160, 1994.
....descent, coupled with adaptation of local step size and/or momentum parameters. Curvature matrix-vector products. The most advanced parameter adaptation methods [4-7] for stochastic gradient descent rely on fast curvature matrix-vector products that can be obtained efficiently and automatically [7, 8]. Their calculation does not require explicit storage of the Hessian, which would be $O(d^2)$; the same goes for other measures of curvature, such as the Gauss-Newton approximation of the Hessian, and the Fisher .... [Page footer: Proc. Intl. Conf. Artificial Neural Networks, LNCS, Springer Verlag, Berlin 2002]
....technique is to first determine a search direction, then look for the optimum in that direction. In a quadratic bowl, the step from w to the minimum along direction v is given by $\Delta w = -\frac{g^\top v}{v^\top H v}\, v$ (4). Hv can be calculated very efficiently [8], and we can use (4) in stochastic settings as well. Line search in the gradient direction, v = g, is called steepest descent. When fully stochastic (b = 1), steepest descent degenerates into the normalized LMS method known in signal processing. Choice of Jacobian. For our experiments we choose J .... [Footnote: See http://www-unix.mcs.anl.gov/autodiff/]
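The line-minimum step quoted in this context, a move of -(g^T v / v^T H v) v from w along v, can be checked numerically. The following is a minimal sketch under stated assumptions: the quadratic bowl (matrix `A`, vector `b`) and the helper names are hypothetical choices for illustration and do not come from the cited paper. At the line minimum the gradient should have no component along v.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(A, v):
    return [dot(row, v) for row in A]

# Hypothetical quadratic bowl E(w) = 0.5 * w^T A w - b^T w, so its Hessian is A.
A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, -1.0]

def gradient(w):
    # Gradient of the bowl: g = A w - b.
    return [aw - bi for aw, bi in zip(matvec(A, w), b)]

def line_minimum_step(g, v, Hv):
    """Step -(g^T v / v^T H v) v to the minimum along direction v."""
    alpha = -dot(g, v) / dot(v, Hv)
    return [alpha * vi for vi in v]

w = [0.5, 0.5]
v = [1.0, 2.0]
step = line_minimum_step(gradient(w), v, matvec(A, v))
w_new = [wi + si for wi, si in zip(w, step)]

# Directional slope at the new point; it vanishes at the line minimum.
slope_along_v = dot(gradient(w_new), v)
```

Only the product Hv enters the step, so in a larger model `matvec(A, v)` could be replaced by any routine that returns a Hessian-vector product without forming H.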
B. A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147-160, 1994.
....number is proportional to a convex combination of the largest and smallest eigenvalues of C. For convenient and efficient implementation of this algorithm, automatic differentiation tools can be used to calculate the gradient $g_t$ and the curvature matrix-vector product $C_t v$ in O(n), as described in [5]. The Gauss-Newton approximation of the Hessian should be used to ensure positive semidefiniteness [6]. Algorithm 1 Stable Adaptive Momentum (SAM). Require: a twice differentiable objective function $f : \mathbb{R}^n \to \mathbb{R}$. Require: instantaneous gradient $g_t = \nabla_w f_t(w)|_{w=w_t}$ and curvature matrix vector ....
B. A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147-160, 1994.
....number is proportional to a convex combination of the largest and smallest eigenvalues of C. For convenient and efficient implementation of this algorithm, automatic differentiation tools can be used to calculate the gradient $g_t$ and the curvature matrix-vector product $C_t v$ in O(n), as described in [5]. The Gauss-Newton approximation of the Hessian should be used to ensure positive semidefiniteness [6]. Algorithm 1 Stable Adaptive Momentum (SAM). Require: a twice differentiable objective function f. Require: instantaneous gradient $g_t = \nabla_w f_t(w)|_{w=w_t}$ and curvature matrix vector product ....
B. A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147-160, 1994.
....descent, coupled with adaptation of local step size and/or momentum parameters. Curvature matrix-vector products. The most advanced parameter adaptation methods [4-7] for stochastic gradient descent rely on fast curvature matrix-vector products that can be obtained efficiently and automatically [7, 8]. Their calculation does not require explicit storage of the Hessian, which would be $O(d^2)$; the same goes for other measures of curvature, such as the Gauss-Newton approximation of the Hessian, and the Fisher ....
....common optimization technique is to first determine a search direction, then look for the optimum in that direction. In a quadratic bowl, the step from w to the minimum along direction v is given by $\Delta w = -\frac{g^\top v}{v^\top H v}\, v$ (4). Hv can be calculated very efficiently [8], and we can use (4) in stochastic settings as well. Line search in the gradient direction, v = g, is called steepest descent. When fully stochastic (b = 1), steepest descent degenerates into the normalized LMS method known in signal processing. Choice of Jacobian. For our experiments we choose J .... [Footnote: See http://www-unix.mcs.anl.gov/autodiff/]
B. A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147-160, 1994.
.... q i = 1 2 log det GB q i = Gamma 1 2 trace(G Gamma1 B G q i ) Gammatrace(G Gamma1 B J J q i ) Gammafi trace(G Gamma1 B J J q i ) (43) When presented in this form, $\nabla$ log (q) can be calculated using a modification of the R{backprop} algorithm of (Pearlmutter, 1994). This algorithm was developed for fast multiplication of an arbitrary vector by the Hessian of a general class of error functions, and it can be shown that the rightmost wing of (43) can be represented as a sum of such vector-Hessian products. The details of the implementation are beyond the ....
.... the Gaussian prior G(0, I). For every generated network, a dataset was created by taking m inputs distributed according to G(0, 1), propagating the inputs through the network and adding Gaussian noise G(0, 0.01). Then the Hessian of the log posterior was evaluated using the R{backprop} algorithm (Pearlmutter, 1994), and its condition number was calculated. As Figure 3 shows, $\log_2$ of the condition number seems to be an affine function of both $\log_2 H$ and $\log_2 m$ (the fluctuations are due to the randomness in this experiment). The slopes of the graphs on the left are close to 1, which implies that, for a fixed m, the condition number is O(H) ....
[Article contains additional citation context not shown here]
Pearlmutter, B. (1994). Fast exact multiplication by the Hessian. Neural Computation, 6(1):147--160.
....on the Hessian matrix of the training error [55]. The time complexity to get such a Hessian matrix is of the order $O(Nn^2)$, where n and N are the total number of weights and the number of training patterns respectively. Although we can get the importance of a weight with an O(Nn) method [86] without requiring the exact calculation of the Hessian matrix, we still need $O(Nn^2)$ time to get the importance of each weight. It should be noticed that to avoid overfitting, the number of training patterns is usually much greater than the total number of weights, i.e. $N \gg n$. Also in ....
B.A. Pearlmutter, Fast exact multiplication by the Hessian, Neural Computation, Vol.6, 147-160, 1994.
....and showing a faster training. The complexity of the methods proposed is accurately compared with that of the corresponding first order algorithms showing an average increase of about 23 times in terms of number of operations per iteration. 2. Product calculation techniques. The first method [11] allows calculating the product Hp, where p is a generic vector and H is the Hessian matrix, by a simple, accurate and numerically robust technique using the following operator: $R_p\{f(w)\} = \frac{\partial}{\partial r} f(w + r p)\big|_{r=0}$ (2.1), which gives $R_p\{\nabla E(w)\} = \frac{\partial}{\partial r} \nabla E(w + r p)\big|_{r=0} = H(w)\, p$ ....
.... differential operator, it obeys the usual rules for differential operators, in particular $R_p\{c f(w)\} = c\, R_p\{f(w)\}$, $R_p\{f(w) + g(w)\} = R_p\{f(w)\} + R_p\{g(w)\}$, $R_p\{f(w)\, g(w)\} = R_p\{f(w)\}\, g(w) + f(w)\, R_p\{g(w)\}$. Moreover we can also write [11] $R_p\{w\} = p$ (2.3). These rules are sufficient to derive a new set of equations on a new set of variables called R variables, from the equations used to compute the gradient. This can be thought of as an adjoint system to the gradient calculation that computes the vector $R_p\{\nabla E(w)\}$, ....
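To make the idea of deriving R variables from the gradient equations concrete, here is a minimal sketch for a single tanh unit. The model, the error function E = 0.5 (tanh(w.x) - y)^2 and all names are assumptions chosen for illustration, not taken from the cited paper. Each line of `r_pass` is obtained by applying linearity, the product rule, the chain rule and R{w} = p (called `v` in the code) to the corresponding line of the gradient computation.

```python
import math

def forward_backward(w, x, y):
    """Return output o and the gradient of E = 0.5 * (tanh(w.x) - y)^2."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    o = math.tanh(a)
    delta = (o - y) * (1.0 - o * o)
    return o, [delta * xi for xi in x]

def r_pass(w, x, y, v):
    """Apply R to each line of forward_backward to obtain H v."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    o = math.tanh(a)
    # R{a}: R{w} = v and the input x does not depend on w.
    r_a = sum(vi * xi for vi, xi in zip(v, x))
    # R{o}: chain rule through tanh.
    r_o = (1.0 - o * o) * r_a
    # R{delta}: product rule on delta = (o - y) * (1 - o^2).
    r_delta = r_o * (1.0 - o * o) + (o - y) * (-2.0 * o * r_o)
    # R{grad E} = R{delta} * x, which equals H v.
    return [r_delta * xi for xi in x]

w, x, y, v = [0.3, -0.8], [1.0, 2.0], 0.5, [0.7, 0.1]
hv = r_pass(w, x, y, v)
```

The R pass costs about as much as the gradient computation itself and never forms the Hessian, which is the property the surrounding contexts rely on.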
Pearlmutter B.A., "Fast exact multiplication by the Hessian", Neural Computation 6:147-160, 1994.
....fact there is an O(n) method for calculating the product of an n × n matrix with an arbitrary vector if the matrix happens to be the Hessian of a system whose gradient can be calculated in O(n), as is the case for most architectures encountered in practice. This fast Hessian-vector product (Pearlmutter, 1994; Werbos, 1988; Møller, 1993) can be used in conjunction with (1) to create an efficient, iterative O(n) implementation of Newton's method. Unfortunately Newton's method has severe stability problems when used in nonlinear systems, stemming from the fact that the Hessian may be ill-conditioned and ....
....it implements by propagating activity (i.e. intermediate results) forward through F. $r_1$. The ordinary backward pass of a neural network, calculating $J_F' u$ by propagating the vector u backwards through F. This pass uses intermediate results computed in the $f_0$ pass. $f_1$. Following Pearlmutter (1994), we define the differential operator $R_v(F(w)) \equiv \frac{\partial}{\partial r} F(w + r v)\big|_{r=0} = J_F v$ (8), which describes the effect on a function F(w) of a weight perturbation in the direction of v. By pushing $R_v$, which obeys the usual rules for differential operators, down into the ....
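The operator defined in this context, the derivative of F(w + r v) with respect to r at r = 0, equals the Jacobian-vector product J_F v and can be propagated forward alongside the ordinary values. Below is a minimal sketch using a hand-rolled dual-number class; the class `Dual`, the toy function `F` and the helper `R` are hypothetical names introduced for illustration only.

```python
import math

class Dual:
    """A value together with its R-derivative: (x, R{x})."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: R{f g} = R{f} g + f R{g}.
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def tanh(x):
    t = math.tanh(x.val)
    return Dual(t, (1.0 - t * t) * x.dot)  # chain rule

def F(w):
    # Toy vector-valued function of two weights (an assumption for the demo).
    return [tanh(w[0] * w[1]), w[0] + w[1] * w[1]]

def R(F, w, v):
    """Return J_F v by seeding each weight w_i with R{w_i} = v_i."""
    seeded = [Dual(wi, vi) for wi, vi in zip(w, v)]
    return [out.dot for out in F(seeded)]

w, v = [0.5, -1.2], [1.0, 2.0]
jv = R(F, w, v)
```

A single forward sweep yields J_F v without ever building the Jacobian, which mirrors how the operator is pushed down into the forward-pass equations.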
[Article contains additional citation context not shown here]
B. A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147--160, 1994.
....t Delta t and vector addition that are both O(n). For M output nodes, the algorithm is then O(N ) where N = NM is the total number of weights in the network. For nonlinear networks the problem is somewhat more complicated. To compute H t Delta t we use the algorithm developed by Pearlmutter [12] for computing the product of the Hessian times an arbitrary vector. The equivalent of one forward back .... [Footnote 3: We refer to (4) as a heuristic since we have no theoretical results on the dynamics of the squared weight error for learning with this matrix of momentum parameters.] [Footnote 4: We actually use a ....]
Barak A. Pearlmutter. Fast exact multiplication by the hessian. Neural Computation, 6:147--160, 1994.
....be applied to compute the principal eigenvector and eigenvalue of H by the power method. By iterating and setting $\Psi(t+1) = \frac{H \Psi(t)}{\|\Psi(t)\|}$ (60), the vector $\Psi(t)$ will converge to the largest eigenvector of H and $\|\Psi(t)\|$ to the corresponding eigenvalue [23, 14, 10]. See also [33] for an even more accurate method that (1) does not use finite differences and (2) has similar complexity. 8 Analysis of the Hessian in multi-layer networks. It is interesting to understand how some of the tricks shown previously influence the Hessian, i.e. how does the Hessian change with ....
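The power iteration quoted in this context only needs a routine that returns Hv. Here is a minimal sketch in which a small explicit symmetric matrix stands in for the Hessian; the matrix, the starting vector and the function names are assumptions for illustration. In practice `hv` would be any Hessian-vector product routine.

```python
import math

# Hypothetical stand-in for a Hessian; only products H v are ever used.
H = [[2.0, 1.0], [1.0, 3.0]]

def hv(v):
    return [sum(hij * vj for hij, vj in zip(row, v)) for row in H]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def power_method(hv, psi, iters=100):
    """Iterate psi <- H psi / ||psi||, as in the quoted update."""
    for _ in range(iters):
        n = norm(psi)
        psi = hv([x / n for x in psi])
    return psi

psi = power_method(hv, [1.0, 0.0])
eigenvalue = norm(psi)                       # ||psi|| approaches the largest eigenvalue
eigenvector = [x / eigenvalue for x in psi]  # unit-length principal eigenvector
```

For this matrix the largest eigenvalue is (5 + sqrt(5)) / 2, which the iteration reaches quickly because the ratio of the two eigenvalues is well below one.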
B.A. Pearlmutter. Fast exact multiplication by the hessian. Neural Computation, 6:147--160, 1994.
....BP. Momentum learning [Rumelhart et al., 1986] and conjugate gradient [Moller, 1993; Boray and Srinath, 1992; Smagt, 1994], which use the second order information indirectly [Smagt, 1994], and the second order learning methods, even with approximations to the Hessian [LeCun et al. 1990, 1991, 1993; Pearlmutter, 1993], were shown to improve the learning speed tremendously. Therefore, it is desirable to use the second order derivative information. But a major difficulty with the use of the second derivative is the storage and computational complexity of the Hessian matrix. Some simplifying approaches such as the ....
Pearlmutter, B. A., "Fast exact multiplication by the Hessian," Neural Computation 6, 147-160, 1993.
....time t. The approximation in (6) assumes that $(\forall i > 0)\ \partial p_t / \partial p_{t-i} = 0$; this signifies a certain dependence on an appropriate choice of meta learning rate. Note that there is an efficient O(n) algorithm to calculate $H_t v_t$ without ever having to compute or store the matrix $H_t$ itself [20]; we shall elaborate on this technique for the case of independent component analysis below. Meta level conditioning. The gradient descent in p at the meta level (2) may of course suffer from ill-conditioning as much as the descent in w at the main level (1); the meta descent in fact squares ....
....$W_t$ is $W_{t+1} = W_t - P_t \cdot D_t$ (8), where the gradient $D_t$ is given by $D_t \equiv \frac{\partial f_{W_t}(x_t)}{\partial W_t} = \left([y_t \pm \tanh(y_t)]\, y_t^T - I\right) W_t$ (9), with the sign for each component of the $\tanh(y_t)$ term depending on its current kurtosis estimate. Following Pearlmutter [20], we now define the differentiation operator $R_{V_t}(g(W_t)) \equiv \frac{\partial}{\partial r} g(W_t + r V_t)\big|_{r=0}$ (10), which describes the effect on g of a perturbation of the weights in the direction of $V_t$. We can use $R_{V_t}$ to efficiently calculate the Hessian-vector product $H_t V_t \equiv \mathrm{vec}^{-1}$ ....
[Article contains additional citation context not shown here]
B. A. Pearlmutter, "Fast exact multiplication by the Hessian", Neural Computation, 6(1):147--160, 1994.
....van Camp's method (1993) our algorithm does not depend on the choice of a good weight prior. It finds a flat minimum by searching for weights that minimize both training error and weight precision. This requires the computation of the Hessian. However, by using an efficient second order method (Pearlmutter 1994; Møller 1993) we obtain conventional backpropagation's order of computational complexity. Automatically, the method effectively reduces numbers of units, weights, and input lines, as well as output sensitivity with respect to remaining weights and units. Unlike simple weight ....
....constant number of output units, the computational complexity of our algorithm is O(L). The long version of this paper (available on the World Wide Web; see our home pages) contains pseudo code of an efficient implementation. It is based on fast multiplication of the Hessian and a vector due to Pearlmutter (1994) and Møller (1993). Acknowledgments. We thank David Wolpert and several anonymous referees for numerous comments that helped to improve previous drafts of this article. This work was supported by DFG grant SCHM 942/3-1 from Deutsche Forschungsgemeinschaft. ....
Pearlmutter, B. A. 1994. Fast exact multiplication by the Hessian. Neural Computation 6(1), 147--160.
....= 41, m = 16, net: 1 5 1] SSE = 0.1; ENC DEC: the family of encoder-decoder problems is very popular and is described in [4]. N = 10, m = 157, net: 10 7 10]. The training algorithm used was the Møller scaled conjugate gradient [11] with the exact calculation of the second order information [14]. The net has hyperbolic tangent as the activation function for the hidden units and linear output units. The desired SSE was 0.1 in every case. In all GAs evolved, we performed 100 generations with a population of 50 individuals, p c = 50 and p m = 10 . 5. Simulation results. Table 1 presents ....
Pearlmutter, B.A. "Fast Exact Multiplication by the Hessian", Neural Computation, vol. 6, pp. 147-160, 1994.
....Algorithm. The scaled conjugate gradient (SCG) algorithm circumvents the multiplication and computes this vector directly, without computing the Hessian. An elegant way to compute the exact product of a Hessian and an arbitrary vector has been independently rediscovered for neural networks in [6], [7] and [9]. This method is used for the numerical inversion proposed in this paper, with the exact values for and . We begin with the first order expansion of the error gradient, (10). Transposing Equation 10 and setting , with scalar and vector , gives . (11) The desired product is obtained by ....
....the error gradient, (10). Transposing Equation 10 and setting , with scalar and vector , gives . (11) The desired product is obtained by dividing Equation 11 by and taking the limit, (12), which coincides with the definition of a derivative. The definition of the differential operator is due to [7]. The usual rules for differential operators apply. In this paper, we obtain the desired product by applying this operator to those equations which compute gradients. In the following we simplify the notation to . 5 Using Feedforward Neural Networks. The above second order optimization methods ....
B.A. Pearlmutter, "Fast Exact Multiplication by the Hessian," Neural Computation, Vol. 6, No. 1, pp. 147-160, January 1994.
.... Section 3 and with respect to any data source from Section 2, reading and writing complete model descriptions to files, analytically solving all linear weights with respect to a DATASET, optimizing the properties of the Jacobian [12], calculating the Jacobian or the complete Hessian [13] (with respect to a single exemplar or an entire DATASET), and chaining the weight gradient calculation to arbitrary objective functions (say for using a cross entropy error metric, or performing backpropagation through time). Moreover, by allowing the user to define new activation and net ....
Barak A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147--160, 1994.
....t ( x t ) w w T w t ln p = v t p t ( ffi t Gamma H t v t ) (4) where $H_t$ denotes the instantaneous Hessian of $f_w(x)$ at time t. Note that there is an efficient O(n) algorithm to calculate $H_t v_t$ without ever having to compute or store the matrix $H_t$ itself [24]. Meta level conditioning. The gradient descent in p at the meta level (2) may of course suffer from ill-conditioning as much as the descent in w at the main level (1); the meta descent in fact squares the condition number when v is defined as the previous gradient, or an exponential average ....
B. A. Pearlmutter, "Fast exact multiplication by the Hessian", Neural Computation, 6(1):147--160, 1994.
....does not depend on explicitly choosing a good prior. Our algorithm finds flat minima by searching for weights that minimize both training error and weight precision. This requires the computation of the Hessian. However, by using Pearlmutter's and Møller's efficient second order method [11, 7], we obtain the same order of complexity as with conventional backprop. Automatically, the method effectively reduces numbers of units, weights, and input lines, as well as the sensitivity of outputs with respect to remaining weights and units. Excellent experimental generalization results will ....
....) $E(w, D_0)$ is minimized by gradient descent. To minimize $B(w, D_0)$ we compute $\frac{\partial B(w, D_0)}{\partial w_{uv}} = \sum_{k,i,j} \frac{\partial B(w, D_0)}{\partial\left(\frac{\partial o^k}{\partial w_{ij}}\right)} \frac{\partial^2 o^k}{\partial w_{ij}\, \partial w_{uv}}$ for all u, v (2). It can be shown (see [4]) that by using Pearlmutter's and Møller's efficient second order method [11, 7], the gradient of $B(w, D_0)$ can be computed in O(W) time (see details in [4]). Therefore, our algorithm has the same order of complexity as standard backprop. 4 EXPERIMENTAL RESULTS (see [4] for details). EXPERIMENT 1: noisy classification. The first experiment is taken from Pearlmutter and ....
B. A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 1994.
....gradient descent is an efficient optimization scheme for the weights of neural networks. This work includes an improvement to conjugate gradient descent that avoids line searches along the conjugate search directions. It makes use of a variant of backprop (Rumelhart et al. 1986) called rbackprop (Pearlmutter, 1993), which can calculate the product of the Hessian of the weights and an arbitrary vector. The calculation is exact and computationally cheap. The report is in the nature of a tutorial. Gradient descent is reviewed and the backpropagation algorithm, used to find the gradients, is derived. Then a ....
....detail for the schemes have been given so that the interested reader might implement the schemes. The schemes have been compared for a variety of small tasks and a variety of networks, with varying sizes, connectivity and non-linear elements. The RBackprop algorithm, reintroduced by Pearlmutter (Pearlmutter, 1993), has been described and used within the scaled conjugate gradient algorithm (Møller, 1993). A neural network optimization scheme makes assumptions about the error surface in weight space. The success of a scheme will depend on the validity of the assumptions. For different problems, the ....
[Article contains additional citation context not shown here]
Pearlmutter, B. A. (1993). Fast exact multiplication by the hessian. to appear in Neural Computation.
....powerful all-purpose minimization algorithms (although it is not clear whether this scales to very large networks). Here second derivatives are used during line search, so the full Hessian is not required, just the product of the Hessian and a given vector. Møller [Mo93b, Mo93a] and Pearlmutter [Pea93] have both independently suggested numerical differentiation and exact calculation to compute this product. The calculation can also be used iteratively in the power method [GV89] to efficiently approximate the principal eigenvectors of the Hessian. The principal eigenvectors are used by Le Cun, ....
....of the same computational order as the exact calculations described previously, but requires little additional algorithm overhead other than the backpropagation algorithm. Møller and Pearlmutter have applied this numerical approach to calculate the second derivatives along a particular arc [Mo93b, Pea93] in 2 backpropagation passes. Let v be a column vector in weight space expressing a direction of interest, and let $\triangle$ be some small value. $\frac{d^2 E}{dw\, dw} \cdot v \approx \frac{1}{\triangle}\left(\frac{dE}{dw}\Big|_{w = w + v\triangle} - \frac{dE}{dw}\right)$. 5 Summary. In general, the calculation of the full Hessian of a network of ....
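The numerical approach in this context, two gradient evaluations differenced along a direction v, can be sketched as follows. The toy error function, its hand-derived gradient and Hessian, and the step size `eps` are assumptions chosen for illustration; they are not from the cited papers.

```python
import math

def grad_E(w):
    """Analytic gradient of the toy error E(w) = w0^2 * w1 + sin(w1)."""
    w0, w1 = w
    return [2.0 * w0 * w1, w0 * w0 + math.cos(w1)]

def hessian_vector_fd(grad, w, v, eps=1e-6):
    """Approximate H v by (grad(w + eps * v) - grad(w)) / eps."""
    shifted = [wi + eps * vi for wi, vi in zip(w, v)]
    g1, g0 = grad(shifted), grad(w)
    return [(a - b) / eps for a, b in zip(g1, g0)]

w = [1.5, -0.7]
v = [0.3, 2.0]
hv_fd = hessian_vector_fd(grad_E, w, v)

# Exact product from the hand-derived Hessian
# [[2*w1, 2*w0], [2*w0, -sin(w1)]], used only as a reference here.
hv_exact = [2.0 * w[1] * v[0] + 2.0 * w[0] * v[1],
            2.0 * w[0] * v[0] - math.sin(w[1]) * v[1]]
```

The result is only approximate: too large a step adds truncation error and too small a step amplifies rounding error, which is the usual motivation for preferring an exact Hessian-vector product when one is available.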
B.A. Pearlmutter. Fast exact multiplication by the Hessian. Submitted to Neural Computation, 1993.
No context found.
Barak A. Pearlmutter. "Fast exact multiplication by the Hessian." In Neural Computation, pp. 147--160, Vol. 6, No. 1, 1994.
....recognition to be insensitive to small affine transformations in the character space. Tangent Prop can also be considered a special case of J-prop. 3 Derivation. We now define a formalism under which J-prop can be easily derived. The method is very similar to a technique introduced by Pearlmutter [8] for calculating the product of the Hessian of an MLP and an arbitrary vector. However, where Pearlmutter used differential operators applied to a model's weight space, we use differential operators defined with respect to a model's input space. Our entire derivation is presented in five steps. ....
Barak A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147--160, 1994.
No context found.
B. A. Pearlmutter, \Fast exact multiplication by the Hessian", Neural Computation, 6(1):147-160, 1994.