| M. Møller, A scaled conjugate gradient algorithm for fast supervised learning, Neural Networks 6, 525--533, (1993). |
....of adapting weights towards the minimum of E is the steepest descent method in which weight movement is in the direction of the negative error gradient scaled with a learning rate $\eta$: $\Delta w_{ij} = -\eta \nabla_{w_{ij}} E$. Weight adaptation in this paper is performed with a scaled conjugate gradient method [1] which significantly improves speed and convergence of training as compared to standard techniques. Batch mode adaptation in which the weights are modified at the end of each training epoch is utilized. This has the property of using the exact gradient as opposed to stochastic adaptation which ....
M. F. Møller, "A scaled conjugate gradient algorithm for fast supervised learning," Neural Networks, vol. 6, pp. 525--533, 1993.
.... di Roma Tor Vergata, 00133 Rome, Italy (e-mail: fanelli@mat.uniroma2.it). Digital Object Identifier 10.1109/TNN.2003.809425. In order to reach this aim, several iterative schemes exploiting parts of the Hessian approximation were implemented (e.g. the diagonal or the block diagonal part [29]). Among the latter algorithms, the BFGS (Limited memory BFGS) methods [1], [27], [30], [31] have been studied extensively. The BFGS algorithms update continuously a Hessian approximation by using the most recent second order information available in the form of the vectors , The rate of ....
M. Møller, "A scaled conjugate gradient algorithm for fast supervised learning," Neural Networks, vol. 6, no. 4, pp. 525--533, 1993.
....towards the minimum of E is the steepest descent method in which weights are changed in the direction of the negative error gradient scaled with a learning rate $\eta$ according to the equation $\Delta w_{ij} = -\eta \nabla_{w_{ij}} E$. Weight adaptation in this paper is performed with a scaled conjugate gradient method [5] which significantly improves speed and convergence of training as compared to standard techniques. Weights are modified at the end of each training epoch (i.e. in batch mode). This is opposed to stochastic adaptation which performs weight updates after each pattern presentation and thus ....
M. F. Møller, "A scaled conjugate gradient algorithm for fast supervised learning," Neural Networks, vol. 6, pp. 525--533, 1993.
....an error function (often a mean square error) of ANNs. The so called learning problem here is a typical optimization problem in numerical analysis. Many improvements on the ANN learning algorithm are actually improvements over optimization algorithms [12] such as conjugate gradient methods [13] [14]. Learning is different from optimization because we want the learned system to have best generalization, which is different from minimizing an error function. The ANN with the minimum error does not necessarily mean that it has best generalization unless there is an equivalence between ....
M. F. Møller, "A scaled conjugate gradient algorithm for fast supervised learning," Neural Networks, vol. 6, pp. 525--533, 1993.
....training in ANN s is usually formulated as minimization of an error function, such as the mean square error between target and actual outputs averaged over all examples, by iteratively adjusting connection weights. Most training algorithms, such as BP and conjugate gradient algorithms [7] 17] [19], are based on gradient descent. There have been some successful applications of BP in various areas [20] 22] but BP has drawbacks due to its use of gradient descent [23] 24] It often gets trapped in a local minimum of the error function and is incapable of finding a global minimum if the ....
....training, this term does not need to be differentiable or even continuous. Weight sharing and weight decay can also be incorporated into the fitness function easily. Evolutionary training can be slow for some problems in comparison with fast variants of BP [131] and conjugate gradient algorithms [19], 132] However, EA s are generally much less sensitive to initial conditions of training. They always search for a globally optimal solution, while a gradient descent algorithm can only find a local optimum in a neighborhood of the initial solution. For some problems, evolutionary training can ....
M. F. Møller, "A scaled conjugate gradient algorithm for fast supervised learning," Neural Networks, vol. 6, no. 4, pp. 525--533, 1993.
.... be proportional to the inverse of the Lipschitz constant which, in practice, is not easily available [2] 42] 69] A variety of approaches adapted from numerical analysis have been applied, in an attempt to use second derivative related information to accelerate the learning process [6] 44] [46], 53] 68] 72] However, second order training algorithms are, in certain cases, computationally intensive for MLPs with several hundred weights [7] Furthermore, it is not certain that the extra computational cost speeds up the minimization process for nonconvex functions when far from a ....
....within the predetermined limit of error function evaluations, their number of gradient evaluations is smaller than the corresponding number of the other methods. Keeping in mind that for some problems, 48] a gradient evaluation is more costly than an error function evaluation (see, for example, [46], where Møller suggests counting gradient evaluations more than error function evaluations) one can understand that these methods require fewer floating point operations and are actually much faster. From the above discussion, it is clear why, in the tables below, there are two rows for the ....
M. F. Møller, "A scaled conjugate gradient algorithm for fast supervised learning," Neural Networks, vol. 6, pp. 525--533, 1993.
....network because it is used in curve fitting problems, but its mathematical form is much simpler with respect to the metrics proposed in superquadrics fitting literature. Weights update has been performed both with the Levenberg Marquardt algorithm [11] 13] and the Scaled Conjugate Gradient (SCG) [15] approach. These two methods are faster and more efficient than the classical gradient descent algorithm, and they are more suited to face the estimation of a high dimension non linear model like superquadrics. In section 3 a detailed comparison of the two techniques will be presented. The ....
M. Møller. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6(4):525--533, 1993.
....7. age. For comparison, Wahba et al. (1995) using generalized linear regression, found that variables 1, 2, 5 and 6 were the most important. Rather than using the potential and its gradient in a HMC routine, we now simply used them as inputs to a scaled conjugate gradient optimiser (based on [13]) instead, attempting to find a mode of the class posterior, rather than to average over the posterior distribution. We tested the multiple class method on the Forensic Glass dataset described in [18]. This is a dataset of 214 examples with 9 inputs and 6 output classes. Because the dataset is so ....
M. Møller. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6(4):525--533, 1993.
....Unfortunately, each iteration requires a sizable number of function calls (about 30) and gradient calls (about 25) The two variants do not present statistically significant differences. Some references about the use of conjugate gradient for training in neural networks are, for example, 18] [21] and [2] 3.4 One Step Secant with Fast Line Search Computing the exact Hessian requires order O(N ) operations [8] and order O(N ) memory to store the Hessian components, in addition the solution of equation 13 to find the step (or search direction) in Newton s method requires O(N ) ....
M. F. Møller, A scaled conjugate gradient algorithm for fast supervised learning, Comp. Sci. Dept., University of Aarhus, preprint, (Nov. 1990).
.... Current research mostly concentrates on the optimal setting of initial weights [2, 3] optimal learning rates and momentum [4, 5, 6, 7] finding optimal NN architectures using pruning techniques [8, 9, 10, 11, 12, 13] and construction techniques [14, 15, 16] sophisticated optimization techniques [17, 18, 19, 20, 21, 22], and adaptive activation functions [23, 24, 25] This paper presents an alternative approach to improve generalization and training time, i.e. active learning using sensitivity analysis. Standard error back propagating NNs are passive learners. These networks passively receive information about ....
Møller, M.F.: A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning, Neural Networks, 6, 1993, 525-533.
....not easy to locate weights that will allow the networks to detect malignant regions with a success of over 90%. For example, in Figure 2, only 2 networks out of the 1000 trained with the Rprop algorithm [22] achieved recognition success from 90% to 100%. For the Scaled Conjugate Gradient (SCG) [23], the corresponding number is 3 out of 1000, while for the Levenberg-Marquardt (L-M) [24] this number is slightly higher, as 6 out of the 1000 networks exhibited classification success between 90% and 100%. The best result for each training method is: 90% for the Rprop, 92.4% for the L-M and ....
Møller, M., 1993, "A scaled conjugate gradient algorithm for fast supervised learning", Neural Networks, 6, 525--533.
.... (8) Proof: The assertion is evident with the properties of a cg method using (9) or (10) The off line version could often be successfully used with the very simple regulation (9) For using conjugate gradient techniques in that case various formulations are known, see for instance [1] or [3]. As expected the regulation (9) fails for the on line case in general since here r Upsilon l r Gamma1 (W (r Gamma 1) and r Upsilon l r (W (r) appear gradients of different functions. If then the quotient of them will be used the outcoming algorithm must not be stable numerically. More ....
M. Møller, "A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning," Neural Networks, vol. 6, pp. 525--533, 1993.
....The list given above is not exhaustive in the least. The use of the Metropolis algorithm, simulated annealing, Random Walk, error surfaces, Conjugate Gradients, Scaled Conjugate Gradients [4] more complex Gradient Descent methods [23] Boltzmann Learning [9] the Delta Rule [3] Line Search [9] [21], Linear Quadratic Programming, or even the use of Genetic Algorithms or Reinforcement Learning have not been addressed in detail in this section. E. Choose the stopping criterion The decision when to stop the optimization can be taken apart from the optimization routine itself. In general one ....
Møller M. F., "A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning", Neural Networks, Vol. 6, 1993, pp. 525-533.
....but neither too fast nor too slow. So, how should the stepsizes be chosen? In the existing practice, one resorts to using heuristics justified only by extensive experimentation (see [HKP91, p. 124], [THA89] and references therein and in [DeL93]). This is quite unsatisfactory since, as is noted in [Møl93], the stepsizes are often crucial for the success of the algorithm. In this paper, we consider two stepsize rules and provide theoretical justification for them. To facilitate the description of these two rules, we make the mild assumption that there exist scalars j f(x 0 1 ) and ae 0 ....
....with j = 1, the generated sequence fx t 1 g converges to a stationary point of f [Luo91]. Finally, we note that, in the case where X = $\mathbb{R}^n$, other gradient schemes such as quasi-Newton and conjugate gradient have also been applied with success (see [HKP91, pp. 124-127], [KoA89], [KDMT91], [Møl93], [Wer90]). Nonetheless, backpropagation remains the most popular approach to training nonlinear feedforward neural networks. 2 Main Results To establish our main results on the convergence of the algorithm (1.3), (1.4) (Propositions 1 and 2), we first need the following three technical lemmas. ....
M. F. Møller, A scaled conjugate gradient algorithm for fast supervised learning, Neural Networks, 6 (1993), pp. 525-533.
....algorithm combined with the conjugate gradient approach. The problem of the Hessian matrix definition is solved trying to make always positive the quantity in the denominator of (1.2) adding a positive term, which is determined recursively. This algorithm is called Scaled Conjugate Gradient (SCG) [10] and results to be better than CG in terms of convergence properties. With respect to the first problem some methods exist to extract information on the Hessian matrix without calculating or storing it and without making numerical approximations. From (1.2) we note that the algorithm needs to ....
Møller M., "A scaled conjugate gradient algorithm for fast supervised learning", Neural Networks 6(4), 1993.
....where $G_k$ is the measured monthly gas consumption and the available data are divided into a training set and test set, consisting of $n_1$ and $n_2$ data respectively. Many local optimization methods exist in order to solve the nonlinear least squares problem (5) [3, 4, 7, 9, 10, 12, 13], such as backpropagation (e.g. with adaptive learning rate and momentum term), Levenberg-Marquardt, quasi-Newton or conjugate gradient algorithms. 3. Interpretation of identified neural network models The neural network models (4) identified according to (5) are interpreted then as follows: ....
....has been identified in a linear least squares sense. For the neural network model, several numbers of hidden neurons were tested. Best results were obtained by taking $n_h = 5$, with respect to error on the training and test set. The training has been done using a scaled conjugate gradient algorithm [7], each time for 200 random starting points. The best result is shown on Fig.3 4. The corresponding evolution of the error on the training set and test set during optimization is shown on Fig.5. No overfitting occurs in this case. For ....
Møller M.F., "A scaled conjugate gradient algorithm for fast supervised learning," Neural Networks, Vol. 6, pp. 525-533, 1993.
....experiments using different architectures and train-test paradigms were carried out. We used as training methods some variants of the Conjugate Gradient Descent optimization algorithm: the original Powell's algorithm [20], Fletcher-Reeves, Polak-Ribière [19] and the more recent algorithm by Møller [17]. Common features of these algorithms are the high learning speed, good performance for high grade polynomial cost functions, and reasonable storage costs, which increase linearly with the number of variables the cost function depends on: this is a relevant feature for a connectionist training ....
M. Møller. A scaled conjugate gradient algorithm for fast supervised learning. Neural Computation, 6(4):525--533, June 1983.
....that direction. Even if we knew the optimal size of a step to take in that direction (which we do not) we usually would not be at the local minimum. Instead, we would only be in a position to take a new step in a somewhat orthogonal direction, as is done in the conjugate gradient training method (Møller, 1993). Continuous training, on the other hand, takes many small steps in the average direction of the gradient. After a few training patterns have been presented to the network, the weights are in a different position in the weight space, and the gradient will likely be slightly different there. As ....
Møller, Martin F., (1993). "A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning," Neural Networks, vol. 6, pp. 525-533.
No context found.
M. Møller, A scaled conjugate gradient algorithm for fast supervised learning, Neural Networks 6, 525--533, (1993).
No context found.
M. Møller, A scaled conjugate gradient algorithm for fast supervised learning, Neural Networks 6 (1993), 525--533.
No context found.
Møller, M. [1993] "A scaled conjugate gradient algorithm for fast supervised learning", Neural Networks, 6, 525--533.
No context found.
M. F. Møller, "A scaled conjugate gradient algorithm for fast supervised learning," Neural Networks, vol. 6, pp. 525--533, 1993.
No context found.
M. F. Møller, "A scaled conjugate gradient algorithm for fast supervised learning," Neural Networks 6, pp. 525--533, 1993.
No context found.
M. F. Møller. "A scaled conjugate gradient algorithm for fast supervised learning". Neural Networks, 6:pp.525--533, 1993.
No context found.
Møller, M. (1993a). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks. In press.