| C. Darken and J. Moody, "Towards faster stochastic gradient search," Neural Information Processing Systems, vol. 4, pp. 1009--1016, 1992. |
....(2) a controlling factor is put before the gain, j = 1, ..., M. This makes the adaptation rule slightly different from the original Kalman filter approach proposed by Bannour [1]. The step size controlling factor is incorporated so that the search then convergence idea is integrated into the learning [2]; namely, where and are constants. Due to the fact that the eigenvectors of the PCA converge sequentially from the [Figure residue: block diagram with input x(n), tapped delay line, N by N orthogonal transform, power normalization, and weights W_1 ... W_N] ....
Darken, C., and Moody, J., "Towards faster stochastic gradient search," NIPS 4, pp. 1009-1016, San Mateo, CA: Morgan Kaufmann, 1992.
....$\eta$. Similarly, $\lim_{i \to \infty} x_{2i+1} = \frac{1}{2} - \eta$. The two accumulation points are distinct (and different from $x^*$) for any fixed $\eta$. Note, however, that they both tend to $x^*$ in the limit as $\eta \to 0$. Remark 1.3.2 We note that the learning rate rules that satisfy (1.3.4) were used in practice [10], and are known as the search then converge strategy. In the first phase of learning, called the search phase, the learning rate is almost constant, or it decreases slowly. In the second phase of learning, called the converge phase, it decreases to zero. In particular, two possible rules for the ....
....strategy. In the first phase of learning, called the search phase, the learning rate is almost constant, or it decreases slowly. In the second phase of learning, called the converge phase, it decreases to zero. In particular, two possible rules for the learning rate have been suggested [10], [9]: $\eta_i = \eta_0 \frac{1}{1 + i/i_0}$ and $\eta_i = \eta_0 \frac{1 + \frac{c\,i}{\eta_0 i_0}}{1 + \frac{c\,i}{\eta_0 i_0} + i_0 \left(\frac{i}{i_0}\right)^2}$, where $\eta_0 > 0$, $c > 0$, $i_0 \gg 1$ are appropriately chosen parameters. Note that for $i \ll i_0$, the learning rate $\eta_i \approx \eta_0$, and for $i \gg i_0$ the learning rate decreases proportional to $1/i$. ....
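The two rules quoted in the excerpt above can be sketched in a few lines of Python. This is only an illustration under an assumed reading of the garbled formulas: the first rule is taken to be eta_0 / (1 + i/i_0), and the second to be the ratio with the extra i_0 (i/i_0)^2 term in the denominator. The function names `stc_simple` and `stc_damped` and the parameter values are hypothetical and do not come from any of the cited papers.

```python
def stc_simple(i, eta0, i0):
    """Assumed first rule: eta_i = eta0 / (1 + i / i0)."""
    return eta0 / (1.0 + i / i0)


def stc_damped(i, eta0, c, i0):
    """Assumed second rule:
    eta_i = eta0 * (1 + r) / (1 + r + i0 * (i / i0) ** 2),
    with r = (c / eta0) * (i / i0).
    """
    r = (c / eta0) * (i / i0)
    return eta0 * (1.0 + r) / (1.0 + r + i0 * (i / i0) ** 2)


# Illustrative parameter values only (not taken from the source).
ETA0, C, I0 = 0.1, 1.0, 1000.0

for step in (0, 1, 1000, 10_000_000):
    print(step, stc_simple(step, ETA0, I0), stc_damped(step, ETA0, C, I0))
```

Under this reading, both rules stay close to eta_0 while i is much smaller than i_0 (the search phase) and fall off like 1/i once i is much larger than i_0 (the converge phase): the first rule approaches eta_0 * i_0 / i and the second approaches c / i.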
C. Darken and J. Moody. Towards faster stochastic gradient search. In G. Tesauro J.D. Cowan and J. Alspector, editors, Advances in Neural Information Processing Systems 4, pages 1009--1016, San Francisco, CA, 1991. Morgan Kaufmann Publishers.
....extending our earlier work [7] to fully nonlinear problems. We demonstrate the performance on several large backprop networks trained with large datasets. Implementations of stochastic learning typically use a constant learning rate during the early part of training (what Darken and Moody [4] call the search phase) to obtain exponential convergence towards a local optimum, and then switch to annealed learning (called the converge phase). We use Darken and Moody's adaptive search then converge (ASTC) algorithm to determine the point at which to switch to 1/t annealing. ASTC was ....
....adaptive momentum was significantly better than ASTC. In the next section, we examine this problem in more detail. 3.3 Switching on Annealing. A complete algorithm must choose an appropriate point to change from constant search to annealed learning. We use Moody and Darken's ASTC algorithm [4, 14] to accomplish this. ASTC measures the roughness of trajectories, switching to 1/t annealing when the trajectories become very rough, an indication that the noise in the updates is dominating the algorithm's behavior. In an attempt to satisfy [footnote 5: Prediction of a 4 × 4 block of image pixels] ....
Christian Darken and John Moody. Towards faster stochastic gradient search. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4. Morgan Kaufmann Publishers, San Mateo, CA, 1992.
....used for neural network training. Hanson added random noise to the weight adaptation and experimentally showed the convergence to a global minimum with constant [Hanson, 1990]. Using exactly the idea given by Eq. (2) but with a damped Darken and Moody proposed the search then convergence scheme [Darken and Moody, 1991, 1992]. As far as the slow search convergence is concerned, it is well known that the convergence speed can be improved greatly if higher order curvature information is employed in the adaptation [Beck and Le Cun, 1988; Baritompa, 1994; Le Cun et al., 1991, 1993]. Theoretically, in a neural ....
Darken, C., and Moody, J., "Towards Faster Stochastic Gradient Search," NIPS 4, 1991.
....of the system. These techniques include asymptotically optimal methods as derived via the theory of stochastic approximation [22], methods based on a statistical analysis of the particular system [23, 24], and heuristic approximations to these methods, commonly known as search and converge [25] or gearshifting. By contrast, adaptive methods are based on online measurements of the state of the adaptive system, usually as characterized by the outputs or by the parameter updates of the system. Non-adaptive step size methods usually require more information about the adaptive system and ....
C. Darken and J. Moody, "Towards faster stochastic gradient search," Advances in Neural Information Processing Systems, vol. 4, (San Mateo, CA: Morgan Kaufmann, 1991), pp. 1009-1016.
....density function (e.g. BIRCH, DBSCAN [SEKX98], CLARANS, CURE [GRS98], etc.). Three techniques estimating mixture model parameters over large databases are evaluated: scalable EM (SEM), standard or vanilla EM run over random samples of the database (VEM), and an online EM implementation (OEM) [NH99, DM92]. The online EM algorithm is a stochastic gradient descent approach that operates by updating the initial mixture model one record at a time [DM92]. A single record is read and its membership probabilities in each of the k clusters are computed. The cluster parameters are then updated and the ....
.... are evaluated: scalable EM (SEM), standard or vanilla EM run over random samples of the database (VEM), and an online EM implementation (OEM) [NH99, DM92]. The online EM algorithm is a stochastic gradient descent approach that operates by updating the initial mixture model one record at a time [DM92]. A single record is read and its membership probabilities in each of the k clusters are computed. The cluster parameters are then updated and the record is purged from memory. SEM has three major parameters: primary compression factor p (Section 3.3.1), standard tolerance b (Section 3.3.2), and ....
C. Darken and J. Moody. "Towards Faster Stochastic Gradient Search". In Advances in Neural Information Processing Systems 4, Moody, Hanson, and Lippmann, (Eds.), Morgan Kaufmann, Palo Alto, 1992.
....All nodes included a bias input which was part of the optimization process. Weights were initialized as shown in Haykin [4]. Target outputs were -0.8 and 0.8 using the tanh output activation function, and we used the quadratic cost function. A search then converge learning rate schedule was used [1] with an initial learning rate of 0.5. 2.5 Non-stationarity. The approach we use to handle non-stationarity is to build models based on short time periods only. There is a noise vs. non-stationarity tradeoff as the size of the training set is varied. If the training set is too small, noise makes ....
C. Darken and J.E. Moody. Towards faster stochastic gradient search. In Neural Information Processing Systems
....typically used in order to avoid slow convergence and local minima. However, a constant learning rate results in significant parameter and performance fluctuation during the entire training cycle. Moody and Darken have proposed search then converge learning rate schedules (Darken and Moody 1991, Darken and Moody 1992): $\eta(t) = \frac{\eta_0}{1 + t/\tau}$ (5). We have found that the learning rate during the final epoch still results in considerable parameter fluctuation, and hence we have added an additional term to further reduce the learning rate over the final epochs. 3. Alphabetical ordering of the training data. ....
Darken, C. & Moody, J. (1992), Towards faster stochastic gradient search, in `Neural Information Processing Systems 4', Morgan Kaufmann, pp. 1009--1016.
....squashing for both computational layers) improve the percentage success, but there is a clear trend toward worsened performance for the more sophisticated algorithms. Simple online BP (without momentum) performs best; this may be due to the method's stochastic features, as discussed in [3]. Trapping in local minima can also be observed for continuous function learning problems. McInerney et al. [4] have discovered (by exhaustive search of the error-weight surface) local minima in a (1-2-1) network (with a linear output node) learning the sine function. This problem was also ....
C Darken and J M Moody, "Towards faster stochastic gradient search", in: Advances in Neural Information Processing Systems 4, Morgan Kaufmann, San Mateo, CA, 1009-1016 (1991).
....instantaneous cost is the true cost E(w). The latter drives the corresponding deterministic, or batch mode, gradient descent algorithm. The learning rate may be independent of time, or it may follow a specified time dependence, or may change in time in response to the progress of the learning [9, 10, 11]. Constant learning rates are commonly employed during the initial phases (and sometimes through all phases) of stochastic learning algorithms. A constant learning rate allows the system to converge on local optima at rates comparable to the equivalent batch algorithm (e.g. exponential convergence in quadratic ....
....(4.23) relates the minimal required 0 to the unknown cost function curvature, one is not guaranteed to achieve the optimal convergence rate for an arbitrarily chosen 0 . Since the actual convergence rate given in (4. 25) can be much slower than optimal, this situation has led many researchers [9, 10, 4, 12, 11] to devise algorithms that attempt to adaptively set 0 . Alternative Annealing Schedules The situation is rather different for p 1. Referring to (4.22) at late times the right hand side is dominated by the terms in 1=s p , and the 1=s and O(1=s p ) 3=2 terms can be neglected. Then ....
Christian Darken and John Moody. Towards faster stochastic gradient search. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4. Morgan Kaufmann Publishers, San Mateo, CA, 1992.
.... generalizes conventional topological feature maps which suggest the same learning rate for all clusters [30] The question if there exists any faster learning rate schedule c= N P fl T flff p (N) fl ) with c 1 than the schedule proposed by (47) is still open although numerical simulations [12, 13] suggest that it is advisable to choose c 1 to speed up the convergence. Online optimization of the codebook size K relies on a heuristics for cluster merging and cluster creation. We have explored the following heuristics for cluster creation: A data point x i initializes a new reference ....
C. Darken and J. Moody. Towards faster stochastic gradient search. In Neural Information Processing Systems 4, San Mateo, California, 1992. Morgan Kaufmann.
....9 Stochastic update does not generally tolerate as high a learning rate as batch update due to the stochastic nature of the updates. alter significantly from the beginning to the end of the final epoch. Moody and Darken have proposed search then converge learning rate schedules of the form [10, 11]: $\eta(t) = \frac{\eta_0}{1 + t/\tau}$ (1), where $\eta(t)$ is the learning rate at time t, $\eta_0$ is the initial learning rate, and $\tau$ is a constant. We have found that the learning rate during the final epoch still results in considerable parameter fluctuation, and hence we have added an additional term to ....
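The contrast drawn in this excerpt, between a constant learning rate that keeps fluctuating and a search then converge schedule that settles down, can be illustrated with a toy experiment. The sketch below is not taken from the cited paper: it assumes the schedule has the form eta(t) = eta_0 / (1 + t / tau), uses a hypothetical helper `run_sgd`, and minimizes the one-dimensional quadratic 0.5 * w**2 with Gaussian noise added to each gradient. All parameter values are arbitrary choices for illustration.

```python
import random


def run_sgd(schedule, steps=5000, noise=1.0, seed=0):
    """Run noisy SGD on f(w) = 0.5 * w**2 and return the mean of w**2
    over the last 1000 steps (a rough measure of residual fluctuation)."""
    rng = random.Random(seed)
    w = 5.0
    tail = []
    for t in range(steps):
        grad = w + rng.gauss(0.0, noise)  # true gradient w plus noise
        w -= schedule(t) * grad
        if t >= steps - 1000:
            tail.append(w * w)
    return sum(tail) / len(tail)


ETA0, TAU = 0.5, 100.0  # illustrative values, not from the source

# Constant rate: never leaves the search phase.
constant_fluct = run_sgd(lambda t: ETA0)

# Assumed search then converge form: eta(t) = eta0 / (1 + t / tau).
annealed_fluct = run_sgd(lambda t: ETA0 / (1.0 + t / TAU))

print("constant rate, late fluctuation:", constant_fluct)
print("annealed rate, late fluctuation:", annealed_fluct)
```

In this toy setting the annealed run should end with far smaller parameter fluctuation than the constant-rate run, which is the qualitative behaviour the excerpt describes. It says nothing about the additional term the citing authors add for the final epochs.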
C. Darken and J.E. Moody. Towards faster stochastic gradient search. In Neural Information Processing Systems 4, pages 1009--1016. Morgan Kaufmann, 1992.
....results in significant parameter and performance fluctuation during the entire training cycle such that the performance of the network can alter significantly from the beginning to the end of the final epoch. Moody and Darken have proposed search then converge learning rate schedules of the form [10, 11]: $\eta(t) = \frac{\eta_0}{1 + t/\tau}$ (6), where $\eta(t)$ is the learning rate at time t, $\eta_0$ is the initial learning rate, and $\tau$ is a constant. We have found that the learning rate during the final epoch still results in considerable parameter fluctuation, and hence we have added an additional term to further ....
C. Darken and J.E. Moody. Towards faster stochastic gradient search. In Neural Information Processing Systems 4, pages 1009--1016. Morgan Kaufmann, 1992.
....by (24) and (25) and $\eta_i(t)$ are two learning rate schedules defined by $\eta_i(t) = \eta_i \frac{1 + \frac{c_i}{\eta_i}\frac{t}{\tau}}{1 + \frac{c_i}{\eta_i}\frac{t}{\tau} + \tau \frac{t^2}{\tau^2}}$, $i = 0, 1$ (59). Here t is the iteration index. The learning rate function $\eta_i(t)$ is a special form of the following search then converge schedules proposed in [7]: $\eta(t) = \eta \frac{1 + \frac{c}{\eta}\frac{t}{\tau}}{1 + \frac{c}{\eta}\frac{t}{\tau} + \tau \frac{t^2}{\tau^2}}$ (60). $t \ll \tau$ is a search phase and $t \gg \tau$ is a converge phase. The learning rate functions $\eta_i(t)$ do not have the search phase, but they start learning with a weaker converge phase when $\eta_i$ are small. When t is large, each learning rate ....
C. Darken and J. Moody. Towards faster stochastic gradient search. In Advances in Neural Information Processing Systems, 4, eds. Moody, Hanson, and Lippmann, Morgan Kaufmann, San Mateo, pages 1009--1016, 1992.
....in significant parameter and performance fluctuation during the entire training cycle such that the performance of the network can alter significantly from the beginning to the end of the final epoch. Moody and Darken have proposed search then converge learning rate schedules of the form [9], [10]: $\eta(t) = \frac{\eta_0}{1 + t/\tau}$ (4), where $\eta(t)$ is the learning rate at time t, $\eta_0$ is the initial learning rate, and $\tau$ is a constant. We have found that the learning rate during the final epoch still results in considerable parameter fluctuation, and hence we have added an additional term to further ....
Christian Darken and John Moody. Towards faster stochastic gradient search. In Neural Information Processing Systems 4, pages 1009--1016. Morgan Kaufmann, 1992.
....when the training set is very large. In other applications of optimization, one sometimes has access only to a noisy estimate of the gradient, making the use of deterministic gradient impossible. Some proposals of stochastic step size adaptation procedures have appeared in the literature [4], [5], [7]. However, none of them seems to be simple and general enough for widespread use. In this paper we propose a new, simple step size adaptation technique for stochastic gradient optimization, similar in spirit to the deterministic adaptive step sizes technique of [10] (see also [11]). This ....
Christian Darken and John Moody. Towards faster stochastic gradient search. In Moody, Hanson, and Lippmann, editors, Advances in Neural Information Processing Systems 4, Palo Alto, 1992. Morgan Kaufmann.
....LMS, online backpropagation, and adaptive k-means clustering as special cases. The standard choices of the learning rate $\eta$ (both adaptive and fixed functions of time) often perform quite poorly. In contrast, our recently proposed class of search then converge (STC) learning rate schedules (Darken and Moody, 1990b, 1991) display the theoretically optimal asymptotic convergence rate and a superior ability to escape from poor local minima. However, the user is responsible for setting a key parameter. We propose here a new methodology for creating the first automatically adapting learning rates that achieve the ....
.... $\eta(t) = c/t$ (the usual choice in the stochastic approximation literature for the last forty years, beginning with the seminal work of Robbins and Monro (1951)) typically results in slow convergence to bad solutions (high lying local minima) for small c, and parameter blow up for small t if c is too large (Darken and Moody, 1990b, 1991). The available adaptive schedules (i.e. $\eta$'s depending on the time and on previous exemplars) have problems as well. A schedule developed by Urasiev is proven to converge in principle, but in practice it may converge slowly if at all (see fig. 2). The delta-bar-delta learning rule, which was ....
C. Darken and J. Moody. (1991) Towards faster stochastic gradient search. Advances in Neural Information Processing Systems 4, Morgan Kaufmann, San Mateo, California. 1009-1016.
No context found.
C. Darken and J. Moody, "Towards faster stochastic gradient search," Neural Information Processing Systems, vol. 4, pp. 1009--1016, 1992.
No context found.
Christian Darken and John Moody. Towards faster stochastic gradient search. In John Moody, Hanson, and Lippmann, editors, Advances in Neural Information Processing Systems, 4, pages -- , San Mateo, 1992. Morgan Kaufmann. To appear.
No context found.
C. Darken and J. Moody. "Towards Faster Stochastic Gradient Search". In Advances in Neural Information Processing Systems 4, Moody, Hanson, and Lippmann, (Eds.), Morgan Kaufmann, Palo Alto, 1992.
No context found.
Darken, C. and Moody, J., "Towards faster stochastic gradient search", Advances in Neural Information Processing Systems 4, Moody, Hanson and Lippmann, Eds., Morgan Kaufmann, San Mateo, 1992.
No context found.
Christian Darken and John Moody. (1992) Towards Faster Stochastic Gradient Search. In J.E. Moody, S.J. Hanson, and R.P. Lippmann (eds.) Advances in Neural Information Processing Systems, vol. 4. Morgan Kaufmann Publishers, San Mateo, CA, 1009-1016.
No context found.
Christian Darken and John Moody. Towards faster stochastic gradient search. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4. Morgan Kaufmann Publishers, San Mateo, CA, 1992.
No context found.
C. Darken and J. Moody. Towards faster stochastic gradient search. In Advances in Neural Information Processing Systems, 4, eds. Moody, Hanson, and Lippmann, Morgan Kaufmann, San Mateo, pages 1009--1016, 1992.