A. Krogh and J. A. Hertz. A simple weight decay can improve generalization. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems, volume 4, pages 950-957, San Mateo, 1992. Morgan Kaufmann Publishers.
....of which (number of layers and units per layer) is defined at creation time, by the client program, for each learning problem. Training is based on the back propagation algorithm. In order to reduce the bias in small training sets and improve the generality of the MLP classifier, weight decay is used [11]. The main principle of weight decay is to limit the growth of the weights to reduce the number of free parameters in the network and so decrease its complexity. Otherwise, we may face the overfitting problem, which is even more serious with small sample sets. Weight decay is done by adding a term to ....
Krogh, A. and J.A. Hertz, "A Simple Weight Decay Can Improve Generalization", In Advances in Neural Information Processing Systems 4, J.E. Moody, S. J. Hanson and R.P. Lippmann, eds. Morgan Kaufmann Publishers, San Mateo CA, pp. 950-957, 1992.
....is a weight decay parameter. For gradient descent learning, weight decay adds a term to the weight update (3.11). Weight decay improves generalisation by suppressing irrelevant components of the weight vector and choosing the smallest vector that solves the learning problem [32]. Furthermore, a good choice may suppress some of the effects of static noise in the training set. However, weight decay does add another parameter to the training process, and no general method exists for determining an optimal value for it. 3.8 Improving the Training Algorithm Many ....
A. Krogh and J. Hertz, "A simple weight decay can improve generalization," in Advances in Neural Information Processing Systems 4 (J. Moody, S. Hanson, and R. Lippmann, eds.), pp. 950--957, San Mateo, CA: Morgan Kaufmann Publishers, 1992.
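The mechanism described in the excerpt above, a gradient step with an extra term that pulls each weight toward zero, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from any of the cited papers: the function names, the toy error function, the learning rate `lr`, and the value of `decay` are all hypothetical.

```python
def decayed_step(weights, grads, lr, decay):
    """One gradient-descent step with an added weight decay term.

    Each weight moves along the negative error gradient and is also
    pulled toward zero in proportion to its own size.
    """
    return [w - lr * (g + decay * w) for w, g in zip(weights, grads)]


def toy_grads(w):
    """Gradient of the toy error (w[0] - 1)^2, which ignores w[1]."""
    return [2.0 * (w[0] - 1.0), 0.0]


# w[1] starts large but is irrelevant to the toy error.
w = [0.0, 5.0]
for _ in range(2000):
    w = decayed_step(w, toy_grads(w), lr=0.1, decay=0.05)
```

In this toy setting the irrelevant component `w[1]` shrinks geometrically toward zero, while `w[0]` settles slightly below its unpenalized optimum of 1, which is the trade-off the decay parameter controls.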
....architectures [2]. Several strategies are suggested which can be applied when using CV-based MLP architecture selection to significantly improve the performance of CV-based architecture selection. Weight Decay: Weight decay adds a penalty term to the error function that favors smaller weights [5, 12]. The rate of weight decay is often chosen by training several different networks with different rates of decay and then using CV to estimate which rate is optimal. Network Pruning: Pruning techniques start with an overly large network and iteratively prune connections that are estimated to be ....
Krogh, Anders, and John Hertz. 1992. A Simple Weight Decay Can Improve Generalization. In Advances in Neural Information Processing Systems, Moody, J; Hanson S; and Lippmann, R eds, vol 4, pp 950-957. San Mateo, CA: Morgan Kaufmann publishers.
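The selection procedure mentioned in the excerpt above (fit one model per candidate decay rate, then let cross-validation pick the rate) might be sketched as follows. To stay self-contained, the sketch swaps the MLP for a one-parameter linear model with a squared-weight penalty; that substitution, the interleaved fold scheme, the function names, and the sample data are all assumptions made for illustration.

```python
def fit_ridge_1d(xs, ys, decay):
    """Least-squares slope for y ~ w*x with an added penalty decay * w**2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + decay)


def cv_error(xs, ys, decay, folds=4):
    """Mean squared validation error over interleaved folds."""
    total = 0.0
    for k in range(folds):
        train = [i for i in range(len(xs)) if i % folds != k]
        valid = [i for i in range(len(xs)) if i % folds == k]
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], decay)
        total += sum((ys[i] - w * xs[i]) ** 2 for i in valid)
    return total / len(xs)


def select_decay(xs, ys, candidates):
    """Return the candidate decay rate with the lowest CV error."""
    return min(candidates, key=lambda d: cv_error(xs, ys, d))


# Hypothetical noisy samples of y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.3, 3.6, 6.4, 7.7, 10.2, 11.9, 14.5, 15.6]
best = select_decay(xs, ys, [0.0, 0.1, 1.0, 10.0])
```

The same loop applies unchanged when `fit_ridge_1d` is replaced by a network training routine, at the cost of one full training run per candidate rate and fold.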
....learning algorithms. The comparison shows that CV and MLPs are capable of performing better than many of the learning algorithms which are frequently employed in the fields of machine learning and neural networks. The other learning methods compared against are c4 [4][12], c4.5 [2], ib1 [3][6], mml [4][12], and cn2 [5][10]. The results for these algorithms are taken from [13]. The average generalization accuracy for CV is better than any of the other learning algorithms compared against (95 confidence level). c4: 84.57, c45: 84.68, ib1: 84.00, mml: 85.85, cn2: 80.74, CV(2,20): 86.07. Table 4. CV vs ....
Krogh, Anders, and John Hertz. "A Simple Weight Decay Can Improve Generalization," In Moody, J.; Hanson, S.; and Lippmann, R., eds., Advances in Neural Information Processing Systems, volume 4, 950-957. San Mateo, CA: Morgan Kaufmann Publishers.
....remaining examples were held back for evaluation. There are certain implementation issues associated with any use of Backpropagation. Here, an offset sigmoid function is used in the threshold units to avoid the stuck unit problem [7] and a weak weight decay term was used to help generalization [8]. The Backpropagation algorithm was applied to learn the output tasks from the training data. Following each training epoch, the network's performance on the cross validation set was measured. In the early phase of training, Backpropagation usually reduces the error rate of the network on ....
A. Krogh and J. A. Hertz. A Simple Weight Decay Can Improve Generalization. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 950--957. Morgan Kaufmann, December 1992.
....of neural network solutions caused by the (possible) non uniqueness of a global minima, and the existence of (possibly) many local minima, leads to a large prediction variance. The large variance of each single network in the ensemble can be tempered with a regularization such as weight decay (Krogh and Hertz, 1992; Ripley, 1996, provide a review). Weight decay regularization imposes a constraint on the minimization of the squared prediction error of the form: $E = \sum_p |t_p - y_p|^2 + \lambda \sum_{i,j} w_{i,j}^2$, (1) where $t_p$ is the target (observation) and $y_p$ the output (prediction) for the p th ....
Krogh, A. and Hertz, J. A., A simple weight decay can improve generalization, in: Moody, J., Hanson, S., and Lippmann, R., editors, Advances in Neural Information Processing Systems, Vol 4 (Morgan Kaufmann, San Mateo, CA, 1992) 950--957.
....data from $y = \sin(x/3)$. Order 20 overfits. Bottom: Small and large MLPs fit to same data. The large MLP does not overfit significantly more than the small MLP. 2 Overfitting Much has been written about overfitting and the bias variance tradeoff in neural nets and other machine learning models [2, 12, 4, 8, 5, 13, 6]. The top of Figure 1 illustrates polynomial overfitting. We created a training dataset by evaluating $y = \sin(x/3) + \nu$ at $0, 1, 2, \ldots, 20$, where $\nu$ is a uniformly distributed random variable between $-0.25$ and $0.25$. We fit polynomial models with orders 2 to 20 to the data. Underfitting occurs with ....
A. Krogh and J.A. Hertz. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, volume 4, pages 950--957. Morgan Kaufmann, 1992.
.... Occam's Razor in some form by penalizing more complex solutions (ridge regression, subset selection) or solutions that are less smooth (non parametric penalty functionals using a differential operator). A variety of penalties are studied in [Fri94]. Regularization can be linked to weight decay [KH92, CDS90] and to the use of model priors in the Bayesian framework [Mac95]. Complexity regularization followed by empirical risk minimization has been studied because of nice convergence properties, as mentioned in Sec. 3.1. Below, we outline how the linear nature of the second stage training allows one ....
A. Krogh and J. A. Hertz. A simple weight decay can improve generalization. In S.J. Hanson J.E. Moody and R.P. Lippmann, editors, Advances in Neural Information Processing Systems-4, pages 950-957. Morgan Kaufmann, San Mateo, CA, 1992.
....to be specified, thus being simple (Hinton and van Camp, 1993). Pearlmutter and Hinton were probably the first to propose weight decay, while Rumelhart was perhaps the first to suggest its use for reducing overfitting. Variants of weight decay were successfully applied by Weigend et al. (1990), Krogh and Hertz (1992), and others. • Soft weight sharing. Nowlan and Hinton (1992) introduce an additional objective function encouraging groups of weights with nearly equal values. The weights are taken to be generated by mixtures of Gaussians. The fewer the number of Gaussians and the closer some weight is to the ....
Krogh, A. and Hertz, J. A. (1992). A simple weight decay can improve generalization. In Lippman, D. S., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 4, pages 950--957. San Mateo, CA: Morgan Kaufmann.
....for an exception). 1. Assumptions about the prior weight distribution. Hinton and van Camp (1993) and Williams (1994) assume that pushing the posterior weight distribution close to the weight prior leads to good generalization (see more details below). Weight decay (e.g. Hanson and Pratt 1989; Krogh and Hertz 1992) can be derived, for example, from Gaussian or Laplace weight priors. Nowlan and Hinton (1992) assume that a distribution of networks with many similar weights generated by Gaussian mixtures is better a priori. MacKay's weight priors (1992b) are implicit in additional penalty terms, which embody ....
Krogh, A., and Hertz, J. A. 1992. A simple weight decay can improve generalization.
....to represent the interaction between two explanatory variables. The focus of such work has been to produce models that improve the fit of the data to the model and not to improve the comprehensibility or acceptance of the learned models. In training artificial neural networks, weight decay (Krogh, Hertz, 1995) (Sill, Abu Mostafa, 1997) have been proposed as techniques for constraining models. However, the focus has been on improving generalization ability and not improving the user acceptance of the learned models. Causal Models (Spirtes, Glymour, and Scheines, 1993) and Belief Networks (Pearl, ....
Krogh, A. & Hertz, J. (1995). A Simple Weight Decay Can Improve Generalization. Advances in Neural Information Processing Systems 4, Morgan Kaufmann Publishers, San Mateo CA, 950-957.
.... developmental mechanism guarantees equality between two corresponding weights. In this paper, we first show that it is possible to relax the equality restriction and consider only the asymptotic case, i.e. for all , We show that besides its other uses (Hinton & Sejnowski, 1986; Hinton, 1987; Krogh and Hertz, 1992; MacKay, 1992; Moody, 1992) weight decay can synchronize each weight pair, as outlined in Figure 2. Weight decay is defined simply as , where is the decay constant. In physical systems, such as the brain or VLSI chips, maintaining stable analog values requires elaborate mechanisms. By adopting ....
Krogh, A., & Hertz, J. A., (1992). A simple weight decay can improve generalization. In J. E. Moody, S. J. Hanson, and R.
....of neural network solutions caused by the (possible) non uniqueness of the global minima, and the existence of (possibly) many local minima, leads to a large prediction variance. The large variance of each single network in the ensemble can be tempered with a regularization such as weight decay (Krogh and Hertz, 1992; Ripley, 1996, for review). Weight decay regularization imposes a constraint on the minimization of the squared prediction error of the form: $E = \sum_p |t_p - y_p|^2 + \lambda \sum_{i,j} w_{i,j}^2$, where $t_p$ is the target and $y_p$ the output for the p th example pattern. $w_{i,j}$ are the ....
Krogh, A. and Hertz, J. A. (1992). A simple weight decay can improve generalization. In Moody, J., Hanson, S., and Lippmann, R., editors, Advances in Neural Information Processing Systems, volume 4, pages 950--957. Morgan Kaufmann, San Mateo, CA.
....approach ignores the uncertainty with respect to the weights and assumes all possible weights to be equally likely. 3.3 Priors for Neural Networks A simple method that is often used to reduce the risk of overfitting is adding a term to the cost function that penalizes (too) large weights [HKP91, KH91]: $E(\mathbf{w}) = D(\mathbf{w}) + \frac{\lambda}{2 n_w} \sum_i w_i^2$, (4) where $n_w$ is the number of weights. This leads to the following update rule for gradient descent: $\Delta w_i \propto -\frac{\partial D(\mathbf{w})}{\partial w_i} - \frac{\lambda}{n_w} w_i$. (5) 1 Note this results in a procedure quite similar to the combination of networks [LT93, TG95] where ....
A. Krogh and J.A. Hertz. A simple weight decay can improve generalization. Technical report, The Niels Bohr Institute, 1991.
....make good predictions on unseen data. One way of improving the generalization performance is to add a regularization term to the error function. A common regularizer is called weight decay and consists of the sum of the squares of the free parameters, i.e. the weights and biases of the network [1]. $E(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} (t_i - o_i)^2 + \lambda \cdot \frac{1}{2} \sum_{j=1}^{W} w_j^2$ (1) The first part of equation (1) is the sum square error over N training patterns and the second part is the weight decay term over W weights and biases. One major problem is to optimize the weight decay parameter ....
Krogh A., Hertz J.: A Simple Weight Decay Can Improve Generalization, in: J. Moody, S. Hanson, R. Lippmann, Advances in Neural Information Processing Systems (NIPS) 4, Morgan Kaufmann Publishers Inc. San Mateo (1992) 950 - 957.
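To make the link between a cost function of this form and the resulting weight update explicit, the gradient step can be worked out as below. The symbols $\lambda$ (weight decay parameter) and $\eta$ (learning rate) are assumptions introduced here for illustration, since the excerpt's own symbols did not survive extraction; $t_i$, $o_i$, $w_j$, $N$, and $W$ follow the excerpt.

```latex
\begin{align}
E(\mathbf{w}) &= \frac{1}{2}\sum_{i=1}^{N}(t_i - o_i)^2
               + \frac{\lambda}{2}\sum_{j=1}^{W} w_j^2 \\
\frac{\partial E}{\partial w_j} &= -\sum_{i=1}^{N}(t_i - o_i)\,
               \frac{\partial o_i}{\partial w_j} + \lambda\, w_j \\
\Delta w_j &= -\eta\,\frac{\partial E}{\partial w_j}
            = \eta\sum_{i=1}^{N}(t_i - o_i)\,\frac{\partial o_i}{\partial w_j}
              - \eta\lambda\, w_j
\end{align}
```

The last term, $-\eta\lambda w_j$, is the "decay": with no error signal, each weight shrinks by the factor $(1 - \eta\lambda)$ per step.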
....space or reducing the effective size of each dimension. Techniques for reducing the number of parameters are greedy constructive learning [7] pruning [5, 12, 14] or weight sharing [18] Techniques for reducing the size of each parameter dimension are regularization, such as weight decay [13] and others [25] or early stopping [17] See also [8, 20] for an overview and [9] for an experimental comparison. Early stopping is widely used because it is simple to understand and implement and has been reported to be superior to regularization methods in many cases, e.g. in [9] 1.2 The ....
Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In [16], pages 950--957, 1992.
....the initialisation to small values mentioned above. Regularisation Sensible training of neural network models will usually not minimise the empirical cost, as this leads to overly optimistic models, a phenomenon known as over fitting. A popular and widespread remedy is the use of regularisation [3]. A number of regularisation techniques have been proposed [2] which involve optimising a regularised cost consisting of the empirical risk and an additional, positive regularisation term. Regularisation terms using the parameters directly rely on the assumption that small weights correspond to ....
A. Krogh and J. A. Hertz, "A simple weight decay can improve generalization", in J. E. Moody, S. J. Hanson and R. P. Lippmann (eds.), Advances in Neural Information Processing Systems, vol. 4, 1992.
....a simple technique for pruning trained recurrent neural networks to significantly improve their generalization performance. To our knowledge, no such technique for recurrent neural networks has been previously published. Good generalization results have also been reported using weight decay ([8, 10]). We will compare our pruning method with weight decay for different decay rates. Published in IEEE Trans. on Neural Networks vol. 5, no. 5, p. 848, 1994. Copyright IEEE. 2 PRUNING A RECURRENT NETWORK To test our pruning heuristic, we incrementally trained discrete time, fully recurrent ....
....size of the maximal training set; NN classification errors on test set; quantization level; size of extracted DFA; DFA classification errors. 3.3 Comparison with Weight Decay It has been observed in simulations that weight decay can improve the generalization performance of feedforward networks ([8, 10]). Weight decay suppresses irrelevant components of weight vectors by choosing a small vector that solves the learning problem. For networks trained using weight decay, the error function is expanded to include an error term which penalizes large weights: The weight update then becomes $\Delta w$ ....
A. Krogh and J. A. Hertz, "A simple weight decay can improve generalization," in Advances in Neural Information Processing Systems 4 (J. Moody, S. Hanson, and R. Lippmann, eds.), (San Mateo, CA), pp. 950--957, Morgan Kaufmann Publishers, 1992.
....between the hidden layer and the output. During training (and testing) before introduction of a new string, the values of all the delays were set to 0. Online back propagation [Rumelhart et al. 1986] with a learning rate of 0.25 and momentum of 0.25 was used for training. Weight decay [Krogh and Hertz, 1992] with a weight decay parameter of 0.0001 was used. A selective updating scheme was applied whereby weights were updated in an online fashion, but only if the absolute error on the current training sample was greater than 0.2. This effectively speeds up the algorithm by avoiding gradient ....
Krogh, A. and Hertz, J. (1992). A simple weight decay can improve generalization. In [Moody et al., 1992], pages 950--957.
.... optimal brain damage improves generalization ability and speed of learning by using second derivative information to remove unimportant weights from the network [3] Weight decay was shown to improve generalization on feed forward networks by suppressing irrelevant components of the weight vector [13]. Still another network simplification method is pruning, which has demonstrated [7] improvement in generalization in recurrent neural networks. 1.1 Previous Work on Training with Noise Previous research has investigated the effects of noise on feedforward neural networks. Training with noise ....
....net inputs $\Theta_i^t$ on the states $S_i^t$ are minimized. Multiplicative noise implements a form of weight decay because the error expansion terms include the weight products $W_{t,ijk}^2$ or $W_{t,ijk} W_{u,ijk}$. Although weight decay has been shown to improve generalization on feedforward networks [13], we hypothesize that in general weight decay will not always improve generalization for recurrent networks that are learning finite state automata (FSA) problems. Large weights are necessary to saturate the state nodes to the upper and lower limits of the sigmoid discriminant function. ....
Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 950--957, San Mateo, CA, 1992. Morgan Kaufmann Publishers.
....method, Bayesian learning, Evidence framework, Laplace prior, Comparison with weight decay. 1 Gauss and Laplace priors Regularization by weight decay, and the associated Bayesian interpretation involving a Gaussian prior, is a well established part of modern neural computation, see e.g. [KH91]. In [HR94] it was shown that if the weight decay parameter is determined by data, either using MacKay's Evidence procedure, or by minimizing generalization error, pruning may result. In a recent communication, [Wil95] elucidated the interesting properties of the Laplacian prior, in particular he ....
A. Krogh and J.A. Hertz. A simple weight decay can improve generalization. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems, volume 4, 1991.
....ratio on a particular model, a non parametric kernel estimator with adaptive metric. 1 Cross validation Most efficient learning procedures require the setting of an extra learning parameter, or hyper parameter . Neural networks typically use a regularisation parameter weighting a weight decay [1], or the extent of pruning [2] Estimating the optimal hyper parameter is the topic of active current research in the statistical learning community [3] Let us consider a typical learning problem: modelling an input output relationship based on some empirical data D = Gamma x (i) y (i) ....
Krogh A, Hertz JA, A simple weight decay can improve generalization. In: Moody et al. (eds), Advances in Neural Inf Proc Systems 4, 1992, pp 950--957
....output for unit i of pattern j. Other free parameters of the network such as weights (number, magnitude), biases (magnitude), units (number) are not assessed by this error measure. It has been shown that these free parameters have a direct influence on the generalization ability of neural nets [KH92]. To include these parameters in the error function we have to add another term to $E_s$, i.e. $E = \alpha_s E_s + \alpha_c E_c$, (1) where $E_c$ is called a Complexity Regularization Term [Hay94]. The constants $\alpha_s$ and $\alpha_c$ emphasize both error measures, accordingly. The above equation can also be seen in ....
A. Krogh and J.A. Hertz. A Simple Weight Decay can Improve Generalization. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems, volume 4, pages 950--957. Morgan Kaufmann, 1992.
....and Peng, 1990; Williams and Zipser, 1990) augmented with a number of heuristics found useful for grammatical inference problems. No batching was done on the training set, i.e. the weights were updated after processing each string (although see comment below on selective updating) Weight decay (Krogh and Hertz, 1992) was used with a weight decay parameter of 0.0001. For sample presentation we used teacher forcing. When target values are available at intermediate points during the processing of a string, these target values are used in the feedback loop instead of the actual node output values. However, this ....
Krogh, A. and Hertz, J. (1992). A simple weight decay can improve generalization. In Moody, J., Hanson, S., and Lippmann, R., editors, Advances in Neural Information Processing Systems 4, pages 950--957.
....of weights in competition with pressures from the error from the actual data set. In addition, modifications to the penalty terms have been suggested that allow certain weight magnitudes to escape decay, giving advantages for generalization performance. For a review of some of these techniques see [12, 19]. 3.2 Cause and Effect Techniques that fall into this category use the causality between individual network parameter variations, whether weights or neurons, and the output error. By removing a neuron for the network, for example, the saliency of that neuron can be estimated by measuring the ....
A. Krogh and J.A. Hertz. A simple weight decay can improve generalization. In D.S. Touretzky, editor, Proc. Neural Information Processing Systems (NIPS) Conference, pages 950 -- 957. Morgan Kaufmann, 1992.
....estimator. wG best generalization estimator. Phi(x) area under the normal curve. 1 Pruning prior 1 Gauss and Laplace priors Regularization by weight decay, and the associated Bayesian interpretation involving a Gaussian prior, is a well established part of modern neural computation, see e.g. (Krogh and Hertz, 1992). (Hansen and Rasmussen, 1994) show that when the weight decay parameter is determined by data, either using MacKay's evidence procedure, or by minimizing generalization error, pruning may result. Recently, (Williams, 1995) elucidated the interesting properties of the Laplace prior; in ....
Krogh, A. and Hertz, J. A. (1992). A simple weight decay can improve generalization. In Moody, J. E., Hanson, S. J., and Lippmann, R. P., editors, Advances in Neural Information Processing Systems, volume 4 of NIPS.
....output for unit i of pattern j. Other free parameters of the network such as weights (number, magnitude), biases (magnitude), units (number) are not assessed by this error measure. It has been shown that these free parameters have a direct influence on the generalization ability of neural nets [KH92]. To include these parameters in the error function we have to add another term to $E_s$, i.e. $E = \alpha_s E_s + \alpha_c E_c$, where $E_c$ is called a complexity or regularization term [Hay94]. The constants $\alpha_s$ and $\alpha_c$ emphasize both error measures, accordingly. The first regularization term examined was ....
A. Krogh and J.A. Hertz. A Simple Weight Decay can Improve Generalization. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems, volume 4, pages 950--957. Morgan Kaufmann, 1992.
....weight update is used, etc. The most common modifications of the algorithm (which can also be applied to many other learning algorithms) include (Hertz et al. [95], Haykin [93], Plaut et al. [188], Krogh and Hertz [124], Solla et al. [226], Fahlman [69], and others): • Weight decay. Modifying the weight updating rule (9 4 ) as $w^l_{kj}(t+1) = \left( w^l_{kj}(t) + \Delta w^l_{kj}(t) \right) (1 - \epsilon_{\mathrm{dec}})$ (10 4 ), where $\epsilon_{\mathrm{dec}}$, a value between 0 and 1, is the weight decay parameter, discourages large weight ....
....updating scheme) is, in principle, just a matter of making the storage capacitor leaky; i.e. placing a resistor from the capacitor to a zero weight reference voltage. The weight decay, however, must be small compared to typical weight changes in order not to prohibit learning. Krogh and Hertz [124], for example, use a very small weight decay parameter to learning rate ratio of $\epsilon_{\mathrm{dec}}/\eta = 1 \cdot 10^{-4}$, which would probably be insignificant compared to typical weight change offsets. If one could accept a weight decay that was large compared to the weight change offsets, the ....
Anders Krogh and John A. Hertz, "A Simple Weight Decay Can Improve Generalization," in Proc. Neural Information Processing Systems Conference '91, Denver, pp. 950--957, 1992.
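The multiplicative form of weight decay quoted above, in which the updated weight is scaled by one minus the decay parameter, can be illustrated with a short Python sketch. This is an assumption-laden toy, not the cited thesis's implementation: the function name `leaky_update`, the starting weight, the step count, and the choice of a zero learning signal are all hypothetical.

```python
def leaky_update(w, delta_w, eps_dec):
    """Apply the learning step, then shrink the result by (1 - eps_dec)."""
    return (w + delta_w) * (1.0 - eps_dec)


# With no learning signal (delta_w = 0) the weight leaks away
# geometrically, like charge draining from a leaky capacitor.
w = 1.0
for _ in range(1000):
    w = leaky_update(w, 0.0, 1e-4)
```

With a decay parameter this small, a thousand updates remove only about a tenth of the weight, which is consistent with the excerpt's remark that such a decay is small next to typical weight changes.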
....connections, and their connectivity. While networks tend to lose generalization ability with increasing complexity, they are less able to memorize input patterns with decreasing complexity. A way to avoid a too complex ANN is to limit the growth of the weights through some kind of weight decay [KH92], which is a special kind of regularization. Various methods have been suggested for the automatic construction of ANN topologies. Among these are Network Pruning (OBS [HSW93], OBD [CDS90], etc.), Network Growing (Cascade Correlation [FL90], etc.), and Evolutionary Design of networks. Adopting ....
....1: Parameter settings for the GA and the ANN used throughout this study. 2.3 Regularization Terms It has been shown that free parameters of the network such as weights (number, magnitude), biases (magnitude), units (number) have a direct influence on the generalization ability of neural nets [KH92]. To include these parameters in the network error function we have to add another term to the classification error term $E_s$, i.e. $E = \alpha_s E_s + \alpha_c E_c$, where $E_c$ is called a complexity or regularization term [Hay94]. The constants $\alpha_s$ and $\alpha_c$ emphasize both error measures, ....
A. Krogh and J.A. Hertz. A Simple Weight Decay can Improve Generalization. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems, volume 4, pages 950--957. Morgan Kaufmann, 1992.
....error as a function of the loading rate $\alpha = P/N$ for different fixed values of the weight decay strength. The curves are the theoretical predictions and the points are simulated results. We can see that the overfitting is already reasonably reduced, but the optimal curve is reached only once for each choice of the weight decay strength, see also [4]. So the weight decay strength should be chosen more accurately. Now we want to determine the optimal value for the weight decay strength for each $\alpha$. As a starting point we look at the generalization error as a function of the weight decay strength for different choices of $\alpha$, see Fig. 2. Again it can be seen that there ....
A. Krogh, and J. Hertz (1992), 'A simple weight decay can improve generalization', in Advances in Neural Information Processing Systems 4, editors J.E. Moody, S.J. Hanson and R.P. Lippmann, Kaufmann, San Mateo CA, p.950--957.
....solutions (e.g. due to local minima). Often, a result of attaining a sub optimal solution is that not all of the network resources are efficiently used. Experiments with a controlled task have indicated that the sub optimal solutions often have smaller weights on average [17]. 3. Weight decay [16] or weight elimination [30] are often used in MLP training and aim to minimize a cost function which penalizes large weights. These techniques tend to result in networks with smaller weights. 4. A commonly recommended technique with MLP classification is to set the training targets away from the ....
A. Krogh and J.A. Hertz. A simple weight decay can improve generalization. In J.E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems, volume 4, pages 950-957. Morgan Kaufmann, San Mateo, CA, 1992.
....use of a regularization term during optimization improves the general accuracy of the model obtained. In the case of neural networks, regularization is most often used through the addition of a weight decay term to the cost function in order to improve the generalization abilities of the solution [5]. Other methods for improving these abilities include pruning, along the lines of OBD [6] These techniques have been applied to a wide variety of problems, including time series and system identification. In this paper, we analyse the use of another regularization term, due to [11] which is ....
A. Krogh and J. A. Hertz. A simple weight decay can improve generalization. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems, volume 4 of NIPS, 1992.
....network is to be the same as the target function. In these situations, there is little need for the network to have any generalization power. More realistically, where examples are few, the error term does not express a complete notion of generalization. Instead, biases such as a weight decay term [9] are used to help with generalization to unseen examples. 3.2 Training with an Initialization Bias Training with an initialization bias is simplified when the previously learned knowledge is available in the form of a neural network identical to that to be used in the training of the target ....
....than sufficient to allow convergence to occur. This process is repeated with different random seeds several times for each method, to reduce variance. In all algorithms, a weak weight decay term was used during training with the weights decayed towards zero. In general, this helps generalization [9] by favoring the development of the simplest solutions. In addition, we shall see that it also enables a learner to overcome the initialization bias of an incorrect domain theory. 4.3 Robustness to Number of Examples We first investigate how generalization power varies at two extremes of ....
Krogh, A. and Hertz, J. A. A Simple Weight Decay Can Improve Generalization. in: Advances in Neural Information Processing Systems 4, edited by J. E. Moody, S. J. Hanson, and R. P. Lippmann. Morgan Kaufmann, 1992, pp. 950--957.
....1989). However, the universal approximation result requires an infinite number of hidden nodes. For a given number of hidden nodes a network may be incapable of representing the required function and instead implement a simpler function which approximates the required function. 3. Weight decay (Krogh & Hertz 1992) or weight elimination (Weigend, Rumelhart, & Huberman 1991) are often used in MLP training and aim to minimize a cost function which penalizes large weights. These techniques tend to result in networks with smaller weights. 4. A commonly recommended technique with MLP classification is to set the ....
Krogh, A., and Hertz, J. 1992. A simple weight decay can improve generalization.
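The context above describes weight decay as minimizing a cost function that penalizes large weights. A minimal sketch of that idea follows, assuming a sum-of-squares error and a quadratic penalty; the function names `penalized_cost` and `decay_step` and the parameter names are hypothetical and do not come from any of the cited papers.

```python
def penalized_cost(errors, weights, decay):
    """Sum-of-squares error plus a quadratic penalty on the weights.

    Assumed form: 0.5 * sum(e^2) + 0.5 * decay * sum(w^2).
    """
    data_term = 0.5 * sum(e * e for e in errors)
    penalty = 0.5 * decay * sum(w * w for w in weights)
    return data_term + penalty


def decay_step(weights, error_grads, learning_rate, decay):
    """One gradient-descent step on the penalized cost.

    The penalty adds decay * w to each gradient component, so every
    weight is pulled toward zero in proportion to its own size.
    """
    return [w - learning_rate * (g + decay * w)
            for w, g in zip(weights, error_grads)]


# With a zero error gradient, each weight shrinks by the factor
# (1 - learning_rate * decay).
shrunk = decay_step([1.0, -2.0], [0.0, 0.0], learning_rate=0.1, decay=0.5)
```

In this sketch the decay rate is a free parameter; as one of the contexts on this page notes, it is often chosen by training several networks with different rates and comparing them with cross-validation.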
....(see [40] however, for an exception). 1) Assumptions about the prior weight distribution. Hinton and van Camp [14] and Williams [49] assume that pushing the posterior distribution (after learning) close to the prior leads to good generalization (see more details below). Weight decay (e.g. [11, 18]) can be derived e.g. from Gaussian or Laplace priors. Nowlan and Hinton [34] assume that networks with many similar weights generated by Gaussian mixtures are better a priori. MacKay's priors [23] are implicit in additional penalty terms, which embody the assumptions made. 2) Prior ....
A. Krogh and J. A. Hertz. A simple weight decay can improve generalization. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 950--957. San Mateo, CA: Morgan Kaufmann, 1992.
....for an exception). 1) Assumptions about the prior weight distribution. Hinton and van Camp (1993) and Williams (1994) assume that pushing the posterior weight distribution close to the weight prior leads to good generalization (see more details below). Weight decay (e.g. Hanson & Pratt, 1989; Krogh & Hertz, 1992) can be derived, e.g. from Gaussian or Laplace weight priors. Nowlan and Hinton (1992) assume that a distribution of networks with many similar weights generated by Gaussian mixtures is better a priori. MacKay's weight priors (1992b) are implicit in additional penalty terms, which embody the ....
Krogh, A. and Hertz, J. A. (1992). A simple weight decay can improve generalization. In Moody, J. E., Hanson, S. J., and Lippmann, R. P., editors, Advances in Neural Information Processing Systems 4, pages 950--957. San Mateo, CA: Morgan Kaufmann.
....model would converge to the optimum solution. In this paper, we propose a pruning based algorithm, the Delay Damage Algorithm, to determine the optimal memory order of NARX and input time delay neural networks. This algorithm can also incorporate several useful heuristics, such as weight decay [31], which are used extensively in static networks to optimize the nonlinear function. For a survey of pruning methods for feedforward neural networks, see [52]. The procedure of the algorithm starts with a NARX network with enough degrees of freedom in both input and output memory or taps, and ....
....use degenerate forms of the NARX network, the NSARs. We also give a brief introduction to the theory of dynamic embedding before discussing the results of time series prediction. In order to also optimize the architecture of the MLP of a NARX network or NSAR, several methods of weight elimination [5, 31, 47, 64, 66] can be incorporated into the training algorithm. In the following experiments, networks are trained using weight decay [31]. All experiments were trained using Back Propagation Through Time (BPTT) [68]. 4.1 Grammatical Inference: Learning A 512 state Finite Memory Machine NARX networks have ....
[Article contains additional citation context not shown here]
A. Krogh and J.A. Hertz. A simple weight decay can improve generalization. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 950--957, 1992.
.... Fahlman & Lebiere, 1990) pruning (e.g. Le Cun, Denker & Solla, 1990; Hassibi & Stork, 1992; Levin, Leen & Moody, 1994) or weight sharing (e.g. Nowlan & Hinton, 1992). The corresponding NN techniques for reducing the size of each parameter dimension are regularization such as weight decay (e.g. Krogh & Hertz, 1992) and others (e.g. Weigend, Rumelhart & Huberman, 1991) or early stopping (Morgan & Bourlard, 1990). See also (Reed, 1993; Fiesler, 1994) for an overview and (Finnoff, Hergert & Zimmermann, 1993) for an experimental comparison. Early stopping is widely used because it is simple to understand and ....
Krogh, A. and Hertz, J. A. (1992). A simple weight decay can improve generalization. In (Moody et al., 1992), pages 950--957.
No context found.
A. Krogh and J. A. Hertz. A simple weight decay can improve generalization. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems, volume 4, pages 950-957, San Mateo, 1992. Morgan Kaufmann Publishers.
No context found.
A. Krogh and J. A. Hertz. A simple weight decay can improve generalization. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems, volume 4, pages 950--957. Morgan Kaufmann Publishers, Inc., 1992.
No context found.
A Krogh and J A Hertz. A simple weight decay can improve generalization. In J E Moody, S J Hanson, and R P Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 950--957, San Mateo, CA, 1992. Morgan Kaufmann Publishers.
No context found.
A. Krogh, J.A. Hertz. A Simple Weight Decay Can Improve Generalization. Advances in Neural Information Processing Systems, 4, J.E. Moody, S.J. Hanson and R.P. Lippmann, eds., Morgan Kaufmann Publishers, San Mateo CA, 950-957, 1992.
No context found.
A. Krogh, J. Hertz, A simple weight decay can improve generalization, Adv. Neural Inform. Process. Syst. 4 (1992) 950--957.
No context found.
Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In #16#, pages 950--957, 1992.
No context found.
Krogh, A. and Hertz, J.A., "A Simple Weight Decay Can Improve Generalization", Niels Bohr Institute, Copenhagen, Denmark, Computer and Information Sciences, Univ. of California Santa Cruz,CA 95064, 1995.
No context found.
A. Krogh and J. A. Hertz. A simple weight decay can improve generalization. In J.E. Moody, S.J Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems, volume 4, pages 950--957. Morgan Kaufmann, San Mateo, CA, 1992.
No context found.
Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 950--957, San Mateo, CA, 1992. Morgan Kaufmann Publishers.
No context found.
Krogh, A. and Hertz, J. A. (1992). A simple weight decay can improve generalization. In Moody, J. E., Hanson, S. J., Lippmann, R. P. (eds.) Advances in Neural Information Processing Systems 4, San Mateo, CA. Morgan Kaufmann Publishers, pp. 950--957.
No context found.
Krogh, A. and Hertz, J. (1992). A Simple Weight Decay Can Improve Generalization. In Advances In Neural Information Processing Systems 4, Moody, J., Hanson, S., Lippmann, R. (eds), Morgan Kaufmann Publishers.