19 citations found.
R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent connectionist networks. In Y. Chauvin and D. E. Rumelhart, editors, Backpropagation: Theory, Architectures, and Applications. Erlbaum, Hillsdale, NJ, 1990.


This paper is cited in the following contexts:
Simple Learning Algorithm for Recurrent Networks to Realize .. - SHIBATA, OKABE, ITO

....network have to be stored. If the propagation is truncated at T time step, that is called truncated BPTT(T), the neural network cannot memorize the signals before T time steps. In BPTT(T), $O(n^2 T)$ order of calculation time and $O(nT)$ order of memory is required, where n represents number of neurons [5]. On the other hand, in RTRL, the partial differential value of the output of each neuron with respect to each weight, is calculated using simultaneous differential equations from the values at the previous time. Each weight is modified by putting the partial differential value into the steepest ....
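The truncation described in this excerpt can be illustrated with a minimal sketch. It assumes a fully recurrent network with no external input, $y(t+1) = \tanh(W y(t))$, and an error $E = \frac{1}{2}\|y(t) - d\|^2$ injected only at the current step; the network form, the error placement, and the names `run_with_buffer` and `truncated_bptt_grad` are illustrative assumptions, not taken from the cited paper. The buffer holds at most h+1 state vectors (memory of order nh), and each backward step costs one outer product and one matrix-vector product (time of order n^2 per step).

```python
import numpy as np
from collections import deque


def run_with_buffer(W, y0, steps, h):
    """Run y(t+1) = tanh(W @ y(t)), keeping only the last h+1 states."""
    buf = deque([y0], maxlen=h + 1)   # older states are dropped automatically
    for _ in range(steps):
        buf.append(np.tanh(W @ buf[-1]))
    return list(buf)


def truncated_bptt_grad(W, history, d):
    """Gradient of E = 0.5 * ||y(t) - d||^2 w.r.t. W over the stored steps.

    history is [y(t-h), ..., y(t-1), y(t)]; anything earlier is ignored,
    which is the truncation.
    """
    grad = np.zeros_like(W)
    eps = history[-1] - d                      # dE/dy at the current step
    for s in range(len(history) - 1, 0, -1):
        y_now, y_prev = history[s], history[s - 1]
        delta = eps * (1.0 - y_now ** 2)       # back through tanh
        grad += np.outer(delta, y_prev)        # contribution of this layer
        eps = W.T @ delta                      # error passed one step back
    return grad
```

When h is at least the number of steps run so far, the buffer reaches back to the initial state and the result is the full BPTT gradient; with a smaller h the contributions of earlier steps are simply dropped.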

....differential equations from the values at the previous time. Each weight is modified by putting the partial differential value into the steepest descent equation. Therefore, it is not necessary to trace back against time, but $O(n^4)$ order of calculation time and $O(n^3)$ order of memory are required [5]. $O(n^3)$ is more than the order of the number of weights $O(n^2)$, which means that even if the memory is assigned at each weight, the memory size is varied according to the size of the ....

Williams, R. J. and Zipser, D., "Gradient Based Learning Algorithms for Recurrent Connectionist Networks", Northeastern University, College of Computer Science Technical Report, NU-CCS-90-9 (1990)


Modeling Dynamical Systems with Recurrent Neural Networks - Tsung (1994)   (1 citation)

....error) the error is calculated and propagated all the way back to the first layer (initial time t0) of the unrolled (feedforward) network, by regular backpropagation. The error gradient for a weight is just the sum of all the gradients for that weight at every layer. Williams and Zipser [WZ90] discussed generalizations and variations of the original description of BPTT. If BPTT is performed all the way back to t0 every time errors are injected, it is called real time BPTT, or BPTT($\infty$; 1). The 1 in the notation denotes the fact that, in the worst case, error is injected at every step ....

....points, oscillate, or become chaotic. This is the continuous form of the real time recurrent learning (RTRL) algorithm; a slight variation appears in [Ghe89]. Historically, various discrete forms of this algorithm preceded this continuous version [RF88, Bac88]; other variants are referenced in [WZ90] p.17. Having seen the continuous version, however, it is easy to understand the discrete version. We take the derivative on both sides of the discrete equation II.1, with respect to $w_{ij}$ and get: $p^k_{ij}(t+1) = f'(x_k)\left[\delta_{ik}\, y_j + \sum_l w_{kl}\, p^l_{ij}(t)\right]$ (II.13) This is ....
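The discrete sensitivity recurrence quoted above can be sketched in a few lines. The sketch assumes $f = \tanh$, no external input, and the state update $y(t+1) = f(W y(t))$; these choices and the function name `rtrl_step` are illustrative assumptions rather than details from the cited thesis. The array `p[k, i, j]` stores $\partial y_k / \partial w_{ij}$.

```python
import numpy as np


def rtrl_step(W, y, p):
    """One discrete RTRL step for y(t+1) = tanh(W @ y(t)).

    p[k, i, j] holds dy_k/dw_ij at time t; the returned array holds it
    at time t+1.
    """
    n = W.shape[0]
    x = W @ y                          # net input x_k(t)
    y_next = np.tanh(x)
    fprime = 1.0 - y_next ** 2         # f'(x_k) for f = tanh

    # Direct term delta_ik * y_j: nonzero only where k == i.
    direct = np.zeros((n, n, n))
    idx = np.arange(n)
    direct[idx, idx, :] = y            # direct[k, k, j] = y_j

    # Recurrent term sum_l w_kl * p[l, i, j].
    recurrent = np.einsum("kl,lij->kij", W, p)

    p_next = fprime[:, None, None] * (direct + recurrent)
    return y_next, p_next
```

Starting from `p = np.zeros((n, n, n))` and calling `rtrl_step` once per time step carries the sensitivities forward without any backward pass; the price is the n x n x n array and the contraction over `l` at every step.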

[Article contains additional citation context not shown here]

R.J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent connectionist networks. Technical Report NU-CCS-90-9, College of Computer Science, Northeastern University, Boston, MA, 1990.


Bifurcations of Recurrent Neural Networks in Gradient Descent.. - Kenji Doya (1993)   (7 citations)

....of problems. In contrast, since gradient descent learning algorithms for recurrent networks became popular several years ago [19, 5, 18, 25], not many cases have been reported about their successful application to large scale problems. One reason for this is the large cost for gradient computation [26]. However, another critical issue in training recurrent networks is bifurcation of the network dynamics. In general, asymptotic behavior of a nonlinear dynamical system changes qualitatively at certain points in its parameter space [10, 24]. For example, a stable fixed point can change into an ....

....learning equations. Actually, learning rules in which recurrent connections are neglected in error gradient calculation have been successfully applied to sequence generation [5, 14] and sequence prediction tasks [2, 7]. In the algorithm called truncated back propagation through time [26], the adjoint equation (7) is calculated only for some limited steps backward in time. Although this algorithm was developed mainly to reduce the amount of computation, it also avoids the problem of instability of the learning equations. 4 Conclusions Gradient descent learning in recurrent ....

R. J. Williams and D. Zipser. Gradient based learning algorithms for recurrent connectionist networks. College of Computer Science Technical Report NU-CCS-90-9.


Supervised Learning in Recurrent Networks - Doya (1995)

....of units. However, for a small sized network or a network with only local connections, on-line weight update can be an advantage. In order to allow on-line weight update with the efficiency of the backward algorithm, a truncated version of back propagation through time algorithm has been proposed (Williams and Zipser, 1990). 4.2 Teacher Forcing So called teacher forcing technique has been shown to be helpful, especially in training a network into an autonomous dynamical system (Pineda, 1988; Doya and Yoshizawa, 1989; Williams and Zipser, 1989). In this scheme, the desired output $d_i(t)$ is used to drive the ....

....the network is run autonomously after learning. Several heuristics have been proposed for enhancing the stability of the non forced trajectory. Noisy forcing: add some noise to the forcing input. Partial forcing: use a mixed input $z_i(t) = y_i(t) + \alpha\,(d_i(t) - y_i(t))$ with $\alpha$ between 0 and 1 (Williams and Zipser, 1990) and decrease the forcing rate $\alpha$ with the progress of learning. Part time forcing: turn on forcing to synchronize the network to the teacher and then turn off forcing to train the autonomous trajectory. 4.3 Bifurcation Boundaries In many learning tasks, the goal is not only to replicate ....
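The partial-forcing heuristic in this excerpt mixes the actual output with the desired output before feeding it back. A minimal sketch follows; the closed interval for the forcing rate and the exponential decay schedule are assumptions made here for illustration (the excerpt only says the rate is decreased as learning progresses), and both function names are hypothetical.

```python
import numpy as np


def partial_forcing(y, d, alpha):
    """Mixed feedback signal z = y + alpha * (d - y).

    alpha = 0 feeds back the network's own output (no forcing);
    alpha = 1 feeds back the desired output (full teacher forcing).
    The closed range [0, 1] is an assumption of this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    return y + alpha * (d - y)


def forcing_schedule(alpha0, decay, epoch):
    """Hypothetical schedule: shrink the forcing rate geometrically."""
    return alpha0 * decay ** epoch
```

For example, `partial_forcing(y, d, forcing_schedule(1.0, 0.9, epoch))` starts with full forcing and moves gradually toward the autonomous trajectory.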

Williams, R. J. and Zipser, D. 1990. Gradient based learning algorithms for recurrent connectionist networks. Technical Report NU-CCS-90-9, College of Computer Science, Northeastern University.


Synaptic Noise in Dynamically-driven Recurrent Neural Networks.. - Kam Jim (1994)   (4 citations)

....function is computed from $E_p = \frac{1}{2}\epsilon_p^2 = \frac{1}{2}\left(S_O^T - d_p\right)^2$ (4) where $d_p$ is the target output value for pattern p (either 1 or 0 depending on the DFA response of the input string) and $\epsilon_p$ is the raw error. We used the Back Propagation Through Time (BPTT) training algorithm [20]. Incremental learning, as described in [6], is used to descend the error function. Basically, the network is trained on a subset of the training set, called the working set, which gradually increases in size until the network is able to correctly classify the entire training set. Strings from the ....

Ronald J. Williams and David Zipser. Gradient-based learning algorithms for recurrent connectionist networks. Technical Report NU-CCS-90-9, College of Computer Science, Northeastern University, 1990.


Phase-Space Learning for Recurrent Networks - Tsung, Cottrell (1993)   (3 citations)

....a point, the hidden unit activations will differ. Thus the mapping changes during learning. If one uses a smaller time step and teacher force frequently in the hope of maintaining the hidden units in nearby regions, then the gradients may become too small for the system to learn [Tsu91]. See [Pin88, WZ89, WZ90] for other discussions of teacher forcing. With BPTT, the network is unrolled in time (Figure 2). This unrolling reveals another problem: Suppose in the teaching signal, the visible units' next state is a non linearly separable function of their current state. Then hidden units are needed between ....

R.J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent connectionist networks. Technical Report NU-CCS-90-9, College of Computer Science, Northeastern University, Boston, MA, 1990.


On the Applicability of Neural Network and Machine Learning.. - Lawrence, al. (1996)   (6 citations)

....on the test data. 8 Gradient Descent We have used backpropagation through time to train the networks 8. Backpropagation through time extends backpropagation to include temporal aspects by considering an equivalent feedforward network created by unfolding the recurrent network in time (Williams & Zipser 1990). We could not train a recurrent network with a small temporal input window until we implemented various techniques aimed at improving the convergence of the algorithm. Before listing the techniques we used we would like to point out that we have used stochastic update (weights updated after each ....

Williams, R. & Zipser, D. (1990), Gradient-based learning algorithms for recurrent connectionist networks, in Y. Chauvin & D. Rumelhart, eds, `Backpropagation: Theory, Architectures, and Applications', Erlbaum, Hillsdale, NJ.


Learning a Class of Large Finite State Machines with a.. - Giles, Horne, Lin (1995)   (5 citations)

....zero. The networks had 49 and 181 adjustable weights respectively with the initial values of the weights randomly chosen from a uniform distribution in the range $[-0.1, 0.1]$. 4.4 Training Algorithm The network was trained with Backpropagation Through Time Algorithm (Williams and Peng, 1990; Williams and Zipser, 1990), augmented with a number of heuristics found useful for grammatical inference problems. No batching was done on the training set, i.e. the weights were updated after processing each string (although see comment below on selective updating). Weight decay (Krogh and Hertz, 1992) was used with a ....

Williams, R. and Zipser, D. (1990). Gradient-based learning algorithms for recurrent connectionist networks. Technical Report NU-CCS-90-9, College of Computer Science, Northeastern University.


The Emergence of Grammaticality in Connectionist Networks - Allen, al. (1999)   (2 citations)

....This approach significantly improves the ability of networks to reach back in time, that is, to develop sensitivity to longer sequences than is possible in standard discrete time nets. This continuous approach is approximated by dividing the normal time steps of discrete back prop through time (Williams & Zipser, 1990) into ticks of some shorter duration. An infinite number of such ticks would represent truly continuous activation. The number of time steps per tick (called the integration constant) changes the grain at which activation is propagated and error injected into the network. Details of the ....

Williams, R., & Zipser, D. (1990). Gradient based learning algorithms for recurrent connectionist networks (Technical Report NU-CCS-90-9). College of Computer Science, Northeastern University.


An Application of Recurrent Nets to Phone Probability Estimation - Robinson (1994)   (78 citations)

....Management tasks, and it is concluded that recurrent nets are competitive with traditional means for performing phone probability estimation. 1 Introduction The aim of this paper is to describe the application of a recurrent net to phone recognition. There are several forms of recurrent net (e.g. [1, 2, 3]), however this paper is interested in the kind that map one sequence on to another. This form of recurrent net is potentially very powerful as it is capable of emulating any finite state machine [4]. Specifically, the aim of the network is to perform the mapping from a sequence of frames of ....

R. J. Williams and D. Zipser, "Gradient-based learning algorithms for recurrent connectionist networks," Tech. Rep. NU-CCS-90-9, Northeastern University, Apr. 1990.


A Hodgkin-Huxley Type Neuron Model That Learns Slow.. - Doya, Selverston, Rowat

....potential trajectory. We first derive the gradient of E with respect to the model parameters ( i ; g j ; v a j ; s a j ; t a j ; In studies of recurrent neural networks, it has been shown that teacher forcing is very important in training autonomous oscillation patterns [4, 6, 12, 13]. In H H type models, teacher forcing drives the activation and inactivation variables by the target membrane potential $v^*(t)$ instead of $v(t)$ as follows. $\dot{x} = k_x(v^*(t)) \cdot \left(-x + x_\infty(v^*(t))\right)$, $(x = a_j, b_j)$ (6) We use (6) in place of (2) during training. The effect of a small ....

R. J. Williams and D. Zipser. Gradient based learning algorithms for recurrent connectionist networks. Technical Report NU-CCS-90-9, College of Computer Science, Northeastern University, 1990.


Natural Language Grammatical Inference: A Comparison of.. - Lawrence, Fong, Giles (1996)   (4 citations)

....100 100 W&Z 100 92

Table 4. Percentage correct classification for the test data.

TEST              large window   small window
Edit distance     55             N/A
Euclidean         65             55
Decision trees    60             N/A
MLP               63             54
FGS               65             59
BT FIR            64             54
Elman             65             74
W&Z               59             71

8 Gradient Descent We have used backpropagation through time 8 (Williams & Zipser 1990) to train the globally recurrent networks 9 and the gradient descent algorithm described by the authors for the FGS and Back Tsoi FIR networks. The error surface of a multilayer network is generally non convex, non quadratic, and often has large dimensionality. We found the standard gradient ....

Williams, R. & Zipser, D. (1990), Gradient-based learning algorithms for recurrent connectionist networks, in Y. Chauvin & D. Rumelhart, eds, `Backpropagation: Theory, Architectures, and Applications', Erlbaum, Hillsdale, NJ.


On the Applicability of Neural Network and Machine.. - Lawrence, Giles, Fong (1995)   (6 citations)

....non input nodes are connected to all other nodes as described in [55]. We expect the feedforward and locally recurrent architectures to encounter difficulty performing the task and include them primarily as control cases. 7 Gradient Descent Learning We have used backpropagation through time 9 [56] to train the globally recurrent networks 10, standard backpropagation for the multi layer perceptron, and the gradient descent algorithms described by the authors for the locally recurrent networks. The error surface of a multilayer network is non convex, non quadratic, and often has large ....

R.J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent connectionist networks. In Y. Chauvin and D.E. Rumelhart, editors, Backpropagation: Theory, Architectures, and Applications. Erlbaum, Hillsdale, NJ, 1990.


Can Recurrent Neural Networks Learn Natural Language Grammars? - Lawrence, Giles, Fong (1996)   (2 citations)

....) 4 . The data was input to the neural networks with a window which is passed over the sentence in temporal order from the start to the end. The size of the window was variable from one word to the length of the longest sentence. 4. Gradient Descent We have used backpropagation through time [30] to train the globally recurrent networks and the gradient descent algorithm described by the authors for the FGS network. The error surface of a multi layer network is generally non convex, non quadratic, and often has large dimensionality. We found the standard gradient descent algorithms to be ....

R.J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent connectionist networks. In Y. Chauvin and D.E. Rumelhart, editors, Backpropagation: Theory, Architectures, and Applications. Erlbaum, Hillsdale, NJ, 1990.


A Connectionist Model for Bootstrap Learning of Syllabic.. - Jean Vroomen

....in a feedforward back propagation network, trained on recognising and labelling phonemes when presented with a sequence of speech signal samples. A type of topology in which copies of the complete network rather than delayed input time slices are memorised, is back propagation through time (bptt) (Williams & Zipser, 1990). Doutriaux and Zipser (1990) demonstrate its use in a series of simulations in which they train a bptt network on predicting speech spectrogram time slices on the basis of previous speech spectrogram time slices. They demonstrate that sudden changes in hidden layer activity over time correlate with ....

Williams, R. J., & Zipser, D. (1990). Gradient-based learning algorithms for recurrent connectionist networks. Technical Report NU-CCS-90-9, Northeastern University, 1990.


Simple Learning Algorithm for Recurrent Networks to Realize.. - Shibata

....have to be stored. If the propagation is truncated at T time step, that is called truncated BPTT(T), the neural network cannot memorize the signals before T time steps. In BPTT(T), $O(n^2 T)$ order of calculation time and $O(nT)$ order of memory is required, where n represents number of neurons [5]. On the other hand, in RTRL, the partial differential value of the output of each neuron with respect to each weight, is calculated using simultaneous differential equations from the values at the previous time. Each weight is modified by putting the partial differential value into the steepest ....

....equations from the values at the previous time. Each weight is modified by putting the partial differential value into the steepest descent equation. Therefore, it is not necessary to trace back against time, but $O(n^4)$ order of calculation time and $O(n^3)$ order of memory are required [5]. $O(n^3)$ is more than the order of the number of weights $O(n^2)$, which means that even if the memory is assigned at each weight, the memory size is varied according to the size of the This research was supported by The Japan Society for the Promotion of Science as Biologically Inspired ....

Williams, R. J. and Zipser, D., "Gradient Based Learning Algorithms for Recurrent Connectionist Networks", Northeastern University, College of Computer Science Technical Report, NU-CCS-90-9 (1990)


Some Observations on the Use of the Extended Kalman Filter as a.. - Williams (1992)   (2 citations)  Self-citation (Williams)

....streams of input output data. Perhaps the most widely used are real time recurrent learning (RTRL) and backpropagation through time (BPTT). These and several variants, all based on computation of the gradient of an output error measure with respect to network weights, are discussed at length in Williams and Zipser (1990). Another approach, based on computation of the gradient of output error with respect to network activity, is Rohwer's (1990) moving targets method. Recently, several authors have noted that the extended Kalman filter (EKF), well known in engineering circles, can also be used for the purpose of ....

....its computational requirements in large networks. 4 Application of the EKF to Recurrent Networks 4.1 The Underlying Approach We consider here a formulation more or less consistent with the usual approach to recurrent network sequential supervised learning problems, as described, for example, in Williams and Zipser (1990). In particular, we consider an arbitrary recurrent network given target values for specified units at specified times, and the objective is to find weights that allow optimal matching between target and actual values in a least mean square error sense. For simplicity, we assume that all ....

[Article contains additional citation context not shown here]

Williams, R. J. & Zipser, D. (1990). Gradient-based learning algorithms for recurrent connectionist networks, (Technical Report NU-CCS-90-9). Boston: Northeastern University, College of Computer Science.


An Efficient Gradient-Based Algorithm for On-Line Training of .. - Williams, Peng (1990)   (59 citations)  Self-citation (Williams)

....case of BPTT appropriate for situations when both the actual and desired trajectories consist of settling to a constant state. An extensive discussion of a number of gradient based learning algorithms, including BPTT, RTRL, recurrent backpropagation, and other related algorithms can be found in (Williams and Zipser, 1990). Among the algorithms described in detail there is the particular one to be highlighted here. The algorithm to be described here has five important properties. First, it is an on-line algorithm, designed to be used to train a network while it runs; no manual state resets or segmentation of the ....

....at time step $t - h'$. The more familiar BPTT(h) algorithm is just the special case when $h' = 1$. Zipser, 1989a) and elaborated upon in (Williams & Zipser, 1989b). One noteworthy example is the Turing machine task, involving networks having 12-15 units. In earlier experiments, reported in (Williams & Zipser, 1990), it had been found that BPTT(9) gave a factor of 28 speedup in running time over RTRL on this same task, with success rate at least as high. Using BPTT(16;8) gave an additional factor of 2 speedup, 3 making BPTT(16;8) well over 50 times faster than RTRL on this task. 6 Discussion Note that as ....

[Article contains additional citation context not shown here]

Williams, R. J., & Zipser, D. (1990). Gradient-based learning algorithms for recurrent connectionist networks, (Technical Report NU-CCS-90-9). Boston: Northeastern University, College of Computer Science.


A Comparison between Spiking and Differentiable.. - Graves, Beringer.. (2004)

No context found.

R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent connectionist networks. In Y. Chauvin and D. E. Rumelhart, editors, Backpropagation: Theory, Architectures, and Applications. Erlbaum, Hillsdale, NJ, 1990.


CiteSeer.IST - Copyright Penn State and NEC