Le Cun, Y., Kanter, I., and Solla, S. (1991). Eigenvalues of covariance matrices: application to neural-network learning. Physical Review Letters, 66(18):2396--2399.
.... s remain orthogonal to each other. This can be performed by projecting each $\Psi_k$ onto the space orthogonal to the space subtended by the $\Psi_l$, $l < k$. This is an $NK$ process, which is relatively cheap if the network uses shared weights. A generalization of the acceleration method introduced in (Le Cun, Kanter and Solla, 1991) can be implemented with this technique. The idea is to use a Newton-like weight update formula of the type $W \leftarrow W - \sum_{k=1}^{K} \|\Psi_k\|^{-1} P_k$, where $P_k$, $k = 1 \ldots K-1$, is the projection of $\nabla E(W)$ onto $\Psi_k$, and $P_K$ is the projection of $\nabla E(W)$ on the space orthogonal to the $\Psi_k$, ($k$ ....
Le Cun, Y., Kanter, I., and Solla, S. (1991). Eigenvalues of covariance matrices: application to neural-network learning. Physical Review Letters, 66(18):2396--2399.
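The Newton-like update quoted in the excerpt above can be sketched on a toy quadratic error. This is only an illustration under stated assumptions: the function name `projected_newton_step`, the toy Hessian, and the choice of each $\Psi_k$ as an exact eigenvector scaled by its curvature are all hypothetical, and $P_K$ is read as the part of the gradient orthogonal to the first $K-1$ vectors, since the excerpt is cut off at that point.

```python
import numpy as np

def projected_newton_step(W, grad, Psi):
    """Apply W <- W - sum_k ||Psi_k||^{-1} P_k once.

    Psi: array of K mutually orthogonal rows (hypothetical curvature estimates).
    Assumed reading: P_k (k < K) projects grad onto Psi_k, and P_K is the
    part of grad orthogonal to Psi_1 .. Psi_{K-1}.
    """
    norms = np.linalg.norm(Psi, axis=1)
    step = np.zeros_like(W)
    residual = grad.copy()
    for k in range(len(Psi) - 1):
        u = Psi[k] / norms[k]          # unit vector along Psi_k
        P_k = (grad @ u) * u           # projection of the gradient onto Psi_k
        step += P_k / norms[k]
        residual -= P_k
    step += residual / norms[-1]       # P_K scaled by 1 / ||Psi_K||
    return W - step

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))     # random orthonormal basis
curvatures = np.array([100.0, 10.0, 1.0, 1.0, 1.0])
H = Q @ np.diag(curvatures) @ Q.T                # toy Hessian of E(W) = 0.5 W^T H W
Psi = (Q[:, :3] * curvatures[:3]).T              # Psi_k = curvature_k * eigenvector_k
W0 = rng.normal(size=5)
W1 = projected_newton_step(W0, H @ W0, Psi)      # gradient of E at W0 is H @ W0
```

In this constructed case every direction outside the first two shares the curvature $\|\Psi_3\|$, so the single step coincides with a full Newton step and lands on the minimum at the origin. With a less convenient spectrum the step would only approximate the Newton step.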
....the models are nonlinear, the interpretations are not straightforward. In this particular case the saliency map can be viewed as a tool for visualizing the regions in the brain which are related most strongly to the specific tasks. 2. The Saliency Map It is well known that affine preprocessing [8, 10] can assist training and generalization significantly. Affine preprocessing of an input vector $x_j$ (i.e. an element of the training set of inputs $X = [x_1 : x_J]$) can be expressed as $v_j = B^T (x_j - c)$. In fact, translating by the training set averaged input vector $c = \bar{x}$ and ....
Le Cun, Y., I. Kanter, and S. Solla, "Eigenvalues of covariance matrices: Application to neural-network learning," Physical Review Letters, vol. 66, no. 18, pp. 2396--2399, May 1991.
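As a concrete instance of the affine preprocessing form $B^T(x_j - c)$ in the excerpt above, the following sketch takes $c$ to be the training-set mean and picks $B$ so that the transformed inputs have identity covariance. The whitening choice of $B$, the function name `affine_preprocess`, and the synthetic data are assumptions made for illustration; the excerpt only states the general affine form and the mean translation.

```python
import numpy as np

def affine_preprocess(X):
    """Return V with rows v_j = B^T (x_j - c), plus the B and c used.

    Assumption: c is the training-set mean and B whitens the covariance;
    the excerpt only gives the general affine form.
    """
    c = X.mean(axis=0)                      # training-set averaged input
    cov = np.cov(X - c, rowvar=False)       # sample covariance of the inputs
    eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric eigendecomposition
    B = eigvecs / np.sqrt(eigvals)          # scale each eigenvector column
    V = (X - c) @ B                         # row form of B^T (x_j - c)
    return V, B, c

rng = np.random.default_rng(0)
# Hypothetical correlated, off-centre training inputs (J = 500, 3 features).
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(500, 3)) @ mixing + 5.0
V, B, c = affine_preprocess(X)
```

After the transform, the columns of `V` have zero mean and unit, uncorrelated variance on the training set. Other choices of `B` (for example a plain diagonal rescaling) fit the same affine form.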
....2 e i ) 0 1: 3.31) where $i$ now runs from the largest eigenvalue $\lambda_1$ down to the $k$th largest eigenvalue $\lambda_k$, and the constant is some appropriate value (Le Cun et al. suggest $\lambda_{k+1}/\lambda_1$). Equation (3.31) reduces the component of the gradient along the directions with large curvature. See also [Le Cun et al. 91] for a discussion of this. The learning rate can now be increased by a factor of $\lambda_1/\lambda_{k+1}$, since the components in directions with large curvature have been reduced by the inverse of this factor. Another approach also proposed by Le Cun et al. is to use a small part of the sum in equation ....
....runs from the largest eigenvalue $\lambda_1$ down to the $k$th largest eigenvalue $\lambda_k$. The eigenvalues of the Hessian matrix are the curvatures in the direction of the corresponding eigenvectors. So Equation (C.18) reduces the component of the gradient along the directions with large curvature. See also [Le Cun et al. 91] for a discussion of this. The learning rate can now be increased by a factor of $\lambda_1/\lambda_{k+1}$, since the components in directions with large curvature have been reduced by the inverse of this factor. The largest eigenvalue and the corresponding eigenvector can be estimated by an iterative ....
[Article contains additional citation context not shown here]
Y. Le Cun, I. Kanter, S. Solla (1991), Eigenvalues of Covariance Matrices: Application to Neural Network Learning, Physical Review Letters, Vol. 66, pp. 2396-2399.
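The excerpt above breaks off before naming the iterative scheme used to estimate the largest eigenvalue and its eigenvector. One standard choice is power iteration, sketched here as an assumption rather than as the citing author's method; the function name `largest_eigenpair` and the toy Hessian are hypothetical.

```python
import numpy as np

def largest_eigenpair(H, iters=200, seed=0):
    """Estimate the largest eigenvalue of a symmetric positive definite H
    and its eigenvector by power iteration (assumed scheme)."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=H.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = H @ v                    # amplify the dominant direction
        v = w / np.linalg.norm(w)    # renormalise to unit length
    return v @ H @ v, v              # Rayleigh quotient and direction

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))       # random orthonormal basis
H = Q @ np.diag([9.0, 4.0, 1.0, 0.5]) @ Q.T        # toy Hessian, known spectrum
lam_max, v_max = largest_eigenpair(H)
```

Only Hessian-vector products are needed, which is why such schemes are attractive when the full Hessian is too large to form. Convergence slows as the two largest eigenvalues approach each other.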
....now runs from the largest eigenvalue $\lambda_1$ down to the $k$th largest eigenvalue $\lambda_k$. The eigenvalues of the Hessian matrix are the curvatures in the direction of the corresponding eigenvectors. So Equation (18) reduces the component of the gradient along the directions with large curvature. See also [Le Cun 91] for a discussion of this. The learning rate can now be increased by a factor of $\lambda_1/\lambda_{k+1}$, since the components in directions with large curvature have been reduced by the inverse of this factor. The largest eigenvalue and the corresponding eigenvector can be estimated by an iterative ....
Y. Le Cun, I. Kanter, S. Solla (1991), Eigenvalues of Covariance Matrices: Application to Neural Network Learning, Physical Review Letters, Vol. 66, pp. 2396-2399.
....and variance on the data, and the problem with a lot of correlation in the data set. 2.6.1 Normalization input The problem when the data used to train the network is not normalized (nonzero mean value and non-uniform variance) is that the convergence of the learning algorithm is slow. In the article [Le Cun et al. 1991] it has been shown that a mean value different from zero or a non-uniform variance in the inputs influences the condition number of the Hessian matrix. Further, it has been shown that the convergence rate depends on the condition number. The condition number for the Hessian matrix is defined as = ....
....number. The condition number for the Hessian matrix is defined as the ratio $\sigma_{\max}/\sigma_{\min}$ (2.36), where $\sigma_{\max}$ and $\sigma_{\min}$ are respectively the largest and smallest eigenvalues of the Hessian matrix. For a simple first-order algorithm (the back-prop gradient descent algorithm) it has been shown, e.g. in [Le Cun et al. 1991], that the optimal learning parameter for weight $j$ depends on the eigenvalue $\sigma$ of the Hessian which corresponds to that weight. When only one learning parameter is used for all the weights, it is necessary to tune the learning parameter to the largest eigenvalue, which causes the convergence of weights with ....
[Article contains additional citation context not shown here]
Le Cun, Y., Kanter, I., and Solla, S.A. (1991). Eigenvalues of covariance matrices: Application to neural-network learning. Physical Review Letters, 66(18):2396--2399.
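The link between the condition number and convergence speed described in the excerpts above can be illustrated with plain gradient descent on a diagonal quadratic error, using a single learning rate tuned to the largest eigenvalue. This is a minimal sketch: the function name `gd_steps_to_converge`, the tolerance, and the two toy spectra are assumptions chosen for illustration.

```python
import numpy as np

def gd_steps_to_converge(sigmas, tol=1e-6, max_steps=100_000):
    """Count gradient-descent steps on E(w) = 0.5 * sum(sigma_i * w_i**2),
    using one learning rate 1 / sigma_max for all weights."""
    s = np.asarray(sigmas, dtype=float)
    lr = 1.0 / s.max()               # tuned to the largest eigenvalue
    w = np.ones_like(s)              # start away from the minimum at 0
    for step in range(1, max_steps + 1):
        w = w - lr * s * w           # gradient of E is sigma_i * w_i
        if np.linalg.norm(w) < tol:
            return step
    return max_steps

well_conditioned = [1.0, 1.0]        # condition number 1
ill_conditioned = [1.0, 100.0]       # condition number 100
steps_well = gd_steps_to_converge(well_conditioned)
steps_ill = gd_steps_to_converge(ill_conditioned)
```

In the ill-conditioned case the weight tied to the small eigenvalue shrinks by only a factor of $1 - \sigma_{\min}/\sigma_{\max}$ per step, so the step count grows roughly in proportion to the condition number.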
.... s remain orthogonal to each other. This can be performed by projecting each $\Psi_k$ onto the space orthogonal to the space subtended by the $\Psi_l$, $l < k$. This is an $NK$ process, which is relatively cheap if the network uses shared weights. A generalization of the acceleration method introduced in (Le Cun, Kanter and Solla, 1991) can be implemented with this technique. The idea is to use a Newton-like weight update formula of the type $W \leftarrow W - \sum_{k=1}^{K} \|\Psi_k\|^{-1} P_k$, where $P_k$, $k = 1 \ldots K-1$, is the projection of $\nabla E(W)$ onto $\Psi_k$, and $P_K$ is the projection of $\nabla E(W)$ on the space orthogonal ....
Le Cun, Y., Kanter, I., and Solla, S. 1991. Eigenvalues of covariance matrices: application to neural-network learning. Physical Review Letters, 66(18):2396--2399.
....the transfer function is not in its saturated state, the system has a non optimal condition. We subsequently propose a change in the feed forward network structure which alleviates this problem. We finally demonstrate the positive influence of this approach. 1 Introduction It has long been known [1, 3, 4, 6] that learning in feed forward networks is a difficult problem, and that this is intrinsic to the structure of such networks. The cause of the learning difficulties is reflected in the Hessian matrix of the learning problem, which consists of second derivatives of the error function. When the ....
....(leaving the index $(p)$ out) $H_{jk} = \Pi^{-1} \sum_{p=1}^{\Pi} x_j x_k$, $1 \le j, k \le n = N_i + 1$ (5), where, for notational simplicity, we set $x_{N_i+1} \equiv 1$. In this case $H$ is the covariance matrix of the input patterns. It is instantly clear that $H$ is a positive definite symmetric matrix. Le Cun et al. [1] show that, when the input patterns are uncorrelated, $H$ has a continuous spectrum of eigenvalues Gamma . Furthermore, there is one eigenvalue n of multiplicity order n present only in the case that $\langle x_k \rangle \neq 0$. Therefore, the Hessian for a linear feed forward network is optimally ....
[Article contains additional citation context not shown here]
Y. Le Cun, I. Kanter, and S. A. Solla. Eigenvalues of covariance matrices: Application to neural network learning. Physical Review Letters, 66(18):2396--2399, 1991.
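The effect of a nonzero input mean on the spectrum of the input-pattern matrix $H$ from the excerpt above can be checked numerically. The sketch below builds $H$ from synthetic uncorrelated inputs with a constant input of 1 appended, once with zero-mean inputs and once with every input shifted to mean 1. The function name `input_hessian`, the sample sizes, and the Gaussian inputs are assumptions for illustration only.

```python
import numpy as np

def input_hessian(X):
    """H_jk = (1/P) * sum_p x_j x_k, with a constant input of 1 appended."""
    P = X.shape[0]
    X1 = np.hstack([X, np.ones((P, 1))])   # extra input fixed to 1 (bias)
    return X1.T @ X1 / P

rng = np.random.default_rng(1)
P, N = 20_000, 50                          # patterns, input variables
X_zero_mean = rng.normal(size=(P, N))      # uncorrelated, zero-mean inputs
X_shifted = X_zero_mean + 1.0              # same inputs, each with mean 1

eig_zero = np.linalg.eigvalsh(input_hessian(X_zero_mean))   # ascending order
eig_shift = np.linalg.eigvalsh(input_hessian(X_shifted))
```

With zero-mean inputs all eigenvalues stay close to the input variance. After the shift, a single eigenvalue of the order of the number of inputs splits off from the rest, which is what inflates the condition number.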
....estimation. However, if we assume that our current parameter vector is fairly close to , then min and max at the current position can be used to provide a fairly good estimate for j . These eigenvalues can be estimated fairly efficiently using techniques similar to those used in [15]. Of course, empirical testing would be necessary to check whether the additional computational cost of approximating j is worthwhile in practice. Finally, we note that our convergence theorem and the result on the optimal rate of convergence hold only in a neighborhood of the local maximum. By ....
Y. Le Cun, I. Kanter, and S. A. Solla. Eigenvalues of covariance matrices: Application to neural-network learning. Physical Review Letters, 66(18):2396--2399, 1991.
....precision [3]. For the widely used IEEE 64-bit floating point representation this is equivalent to (H) $6.7 \cdot 10^{7}$. This may seem a large number, but this order of magnitude is not uncommon in the framework of either feedforward networks [8] or recurrent networks, as we shall see. In [2] it was shown that an eigenvalue of the order of the number of input variables could be avoided if the mean was subtracted from each of the input variables $x_k(t)$ and if a symmetric activation function is used. However, these simple countermeasures are not adequate for avoiding ill conditioning ....
Y. Le Cun, I. Kanter, and S. A. Solla. Eigenvalues of covariance matrices: Application to neural-network learning. Physical Review Letters, 66(18):2396--2399, 1991.
.... that most of the hidden neurons start with their outputs in the linear region of the activation functions, where the learning progresses the fastest [51]. Moreover, Le Cun et al. have analytically shown that standardizing the inputs to zero mean improves the convergence properties of BP learning [33]. A comparison of the generalization performance of the best DWN, MFN, and CWN on the diabetes database is presented in Table 9. The three performances are within 1.5 of each other, with CWN being the best and DWN the worst. One expects DWN to be the worst performer, but why is the MFN, with its ....
Y. Le Cun, I. Kanter, and S. A. Solla. Eigenvalues of covariance matrices: Application to neural network learning. Physical Review Letters, 66(18):2396--2399, 1991.