R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin and D. E. Rumelhart, editors, Back-propagation: Theory, Architectures and Applications. Hillsdale, 1992.
....forward through the entire network and recursive steps and afterwards propagating the error signals backwards through the network and all recursive steps. However, the possibility of online adaptation while a sequence is still being processed is lost in this so-called backpropagation through time [28,44]. There exist combinations of both methods and variations for training continuous systems [33]. The true gradient is sometimes substituted by a truncated gradient in earlier approaches [6]. Since theoretical investigation suggests that pure gradient descent techniques will likely suffer from ....
R. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin and D. Rumelhart, editors, Back-propagation: Theory, Architectures and Applications. Erlbaum, 1992.
....by F. Pineda and L. B. Almeida, e.g. [26]) only in the case when the recurrent network behaviour relaxes to a fixed point. However, if a general temporal processing is needed, two main gradient based learning approaches exist for recurrent networks [15,16,20]: Back Propagation Through Time (BPTT) [6,1,27,15] and Real Time Recurrent Learning (RTRL) [2,18,22,27,15]. BPTT is a family of algorithms which extends the BP paradigm to dynamic networks. There are two major points of view to understand what BPTT is. The first is an intuitive one: time unfolding of the recurrent network, i.e. for single layer ....
....the case when the recurrent network behaviour relaxes to a fixed point. However, if a general temporal processing is needed, two main gradient based learning approaches exist for recurrent networks [15,16,20]: Back Propagation Through Time (BPTT) [6,1,27,15] and Real Time Recurrent Learning (RTRL) [2,18,22,27,15]. BPTT is a family of algorithms which extends the BP paradigm to dynamic networks. There are two major points of view to understand what BPTT is. The first is an intuitive one: time unfolding of the recurrent network, i.e. for single layer single feedback delay fully recurrent networks one can ....
[Article contains additional citation context not shown here]
R.J. Williams, D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Backpropagation: Theory, Architectures and Applications Y. Chauvin and D.E. Rumelhart, Eds. Hillsdale, NJ: Lawrence Erlbaum Associates, 1994.
....i.e. the data cases determine the sufficient subnetwork of DG(B) to calculate the likelihood. On the one hand, this may affect the generalization of the learned program, but on the other hand, this is a similar situation as for unrolling dynamic Bayesian networks [5] or recurrent neural networks [24]. Finally, our learning setting can be used for MLE of the parameters of intensional rules only. Assume that if we observe h(john) then h(john) 172:06 holds. In this case, it is problematic to estimate the ML parameters of h(john). But, we can still estimate the ML parameters of h(X) g(X) ....
R. J. Williams and D. Zipser. Gradient-Based Learning Algorithms for Recurrent Networks and Their Computational Complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum, 1995.
....we will show how to adapt gradient based approaches and EM in the context of Bayesian logic programs. The third point may affect the generalization performance of the learned program, but it is a similar situation as for unrolling dynamic Bayesian networks [DK88] or recurrent neural networks [WZ95]. 5.2. Gradient based approach. Gradient ascent, also known as hill climbing, is a classical method for finding a maximum of an evaluation function. Here, one computes the gradient vector $\nabla$ of partial derivatives with respect to the parameters of the conditional probability distributions at a ....
R. J. Williams and D. Zipser. Gradient-Based Learning Algorithms for Recurrent Networks and Their Computational Complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum, 1995.
....to classification of the structures, denoted as classification subnetwork. The learning task of structured information is to classify the structure. It can be accomplished using the Backpropagation Through Structure (BPTS) algorithm. BPTS is an extension of Backpropagation Through Time (BPTT) [54] described in Section 2.4.4. Analogous to backpropagation neural network learning [53], supervised learning from structures is accomplished in two steps: a forward and a backward step. The forward step takes the topology of the respective structure description, starts from the leaf nodes, calculates each ....
.... (or maximum likelihood from the statistical viewpoint). Searching the optimal parameters and # can be accomplished by gradient descent techniques [53]. In this case, the gradient can be computed using the back propagation through structure algorithm, an extension of back propagation through time [54] that unrolls the recursive network into a larger feed forward network, following the topology of the input graph [17]. An example of an unrolled recursive network obtained from processing the IEEE logo tree is shown in Figure 3.2. See a detailed description in Chapter 2. ....
R. Williams and D. Zipser, "Gradient-based learning algorithms for recurrent networks and their computational complexity," Backpropagation: Theory, Architectures, and Applications, In Y. Chauvin and D.E. Rumelhart (Eds.), Lawrence Erlbaum Associates, 1995.
....fan out 1. They compute a function of the form hf 1 R . They are commonly trained with a gradient descent method such as backpropagation through time and variants, assuming the sequences are given a priori, or real time recurrent learning for on line learning, in robotics for example [14,51]. These approaches provide concrete tools for processing structured data with neural networks. We believe that a thorough theoretical investigation is mandatory in order to establish these methods. Several points are worth considering: each method comes with a specific training method which ....
R. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin and D. Rumelhart, editors, Backpropagation: Theory, Architectures and Applications. Erlbaum, 433-486, 1992.
.... they do not provide clear practical advantages over, say, backprop in feedforward networks with limited time windows (see crossreference Chapters 11 and 12). With conventional algorithms based on the computation of the complete gradient, such as Back Propagation Through Time (BPTT, e.g. [22, 27, 26]) or Real Time Recurrent Learning (RTRL, e.g. [21]), error signals flowing backwards in time tend to either (1) blow up or (2) vanish: the temporal evolution of the backpropagated error exponentially depends on the size of the weights [11, 6]. Case (1) may lead to oscillating weights, while in ....
....going to prove hold regardless of the particular kind of cost function used (as long as it is continuous in the output) and regardless of the particular algorithm which is employed to compute the gradient. Here we briefly explain how gradients are computed by the standard BPTT algorithm (e.g. [27], see also crossreference Chapter 14 for more details) because its analytical form is better suited to the forthcoming analyses. The error at time t is denoted by E(t). Considering only the error at time t, output unit k's error signal is $\delta_k(t) = \partial E(t) / \partial \mathrm{net}_k(t)$ and some non-output unit ....
R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Backpropagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum, 1992.
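The blow-up or vanishing behaviour described in the contexts above can be illustrated numerically. The following is a minimal Python sketch, assuming a hypothetical one-unit recurrent net with no external input; the names `forward_nets` and `backpropagated_error` are illustrative and are not taken from any of the cited works. It applies the recursion delta(tau) = f'(net(tau)) * w * delta(tau + 1), so the error reaching the first step scales like a product of q factors f'(net) * w.

```python
import math


def forward_nets(w, y0, steps, f):
    """Run y(t) = f(w * y(t-1)) without external input; return the net inputs."""
    nets, y = [], y0
    for _ in range(steps):
        net = w * y
        nets.append(net)
        y = f(net)
    return nets


def backpropagated_error(w, nets, fprime, delta_last=1.0):
    """Propagate an error signal backwards through a one-unit recurrent net.

    Starting from delta_last at the final step, applies
    delta(tau) = f'(net(tau)) * w * delta(tau + 1)
    and returns the deltas in forward time order.
    """
    deltas = [delta_last]
    for net in reversed(nets[:-1]):
        deltas.append(fprime(net) * w * deltas[-1])
    deltas.reverse()
    return deltas


if __name__ == "__main__":
    # Assumed demo values: 21 time steps, i.e. 20 backward propagation steps.
    for w in (0.5, 1.5):
        nets = forward_nets(w, 0.1, 21, lambda net: net)         # linear unit
        deltas = backpropagated_error(w, nets, lambda net: 1.0)  # f' = 1
        print(f"w={w}: error reaching the first step = {deltas[0]:.3e}")

    # With a tanh unit, f' <= 1, so a weight of 0.5 still makes the error vanish.
    nets = forward_nets(0.5, 0.1, 21, math.tanh)
    deltas = backpropagated_error(0.5, nets, lambda n: 1.0 - math.tanh(n) ** 2)
    print(f"tanh, w=0.5: error reaching the first step = {deltas[0]:.3e}")
```

For the linear unit the scaling factor is exactly w to the power of the number of backward steps, which is why weights below one in magnitude make the signal vanish and weights above one make it blow up.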
....rather complicated procedure. However, methods from the well developed optimal control theory and dynamic programming were of help in giving some insights into the problem. Several training methods have been developed, that are basically different computational methods to obtain the gradient [10], [20]. Some of these methods compute the gradient very efficiently. However, the main shortcoming of these methods is the excessive number of iterations needed to reach the minimum. Only a few methods in the literature attempt to overcome this shortcoming. To name a few, some are based on approximating ....
....is improved, by considering the recursive equations for the output gradients. A Green's function solution is computed, from which the sought error gradient is obtained by a simple dot product. This method is an on line technique, and the complexity is . 5) The block update approach (BU) [18], [20]: This is an on line approach that updates the weights every data points using some aspects of the FP and BTT methods. Each update has complexity . Since the update is performed every data points, the complexity per data point is . In this paper we present a novel formulation to unify these five ....
R. Williams and D. Zipser, "Gradient-based learning algorithms for recurrent networks and their computational complexity," in Backpropagation: Theory, Architectures, and Applications. Hillsdale, NJ: Lawrence Erlbaum, 1992.
....A gradient descent learning algorithm for RNNs, such as LSTM, computes the gradient of E with respect to each weight $w_{lm}$ to determine the weight changes $\Delta w_{lm}$: $\Delta w_{lm}(t) = -\alpha \, \partial E(t) / \partial w_{lm}$, where $\alpha$ is called the learning rate. For an excellent introduction to gradient learning in RNNs see Williams and Zipser (1992). The connection scheme of a network is called the network architecture or topology. Architectures without loops are called feed forward neural networks (Figure 1.1, left). RNN topologies range from partly recurrent to fully recurrent networks. An example of a partly recurrent network is a ....
....1.2.1 Problem: Exponential decay of gradient information. The extent to which this potential can be exploited is, however, limited by the effectiveness of the training procedure applied. Gradient based methods (see survey: Pearlmutter, 1995), Back Propagation Through Time (Williams & Zipser, 1992; Werbos, 1988) or Real Time Recurrent Learning (Robinson & Fallside, 1987; Williams & Zipser, 1992) and their combination (Schmidhuber, 1992a), share an important limitation. The temporal evolution of the path integral over all error signals flowing back in time exponentially depends on the ....
[Article contains additional citation context not shown here]
Williams, R. J., & Zipser, D. (1992). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin & D. E. Rumelhart (Eds.), Back-propagation: Theory, Architectures and Applications (pp. 433-486). Hillsdale, NJ: Erlbaum.
....in a recursive fashion. The algorithm is especially suitable for online (real time) learning situations, where weights are adjusted in a continuous fashion. With DEKF it should be possible for a RNN to learn optimal weights for many difficult problems. However, RNNs in general are hampered by [2, 11, 17, 14, 9] vanishing gradients [7, 1] that make a network unable to deal correctly with long-term dependencies. A recent novel RNN called Long Short Term Memory [8] overcomes this vanishing gradient problem and learns previously unlearnable solutions to numerous tasks [8, 4, 5], including tasks that require ....
Williams, R. J. and D. Zipser (1992), "Gradient-based learning algorithms for recurrent networks and their computational complexity", in Y. Chauvin and D. E. Rumelhart, eds., Back-propagation: Theory, Architectures and Applications, Hillsdale, NJ: Erlbaum.
.... has been mentioned first in [Berg, 1992] and recently worked out and generalized to DAGs by [Goller and Kuchler, 1996b]. The reader is assumed to be familiar with the standard backpropagation algorithm (BP) [Rumelhart and McClelland, 1986] and its variant backpropagation through time (BPTT) [Williams and Zipser, 1994]. For the sake of brevity only a sketch of the underlying principles of the approach called backpropagation through structure (BPTS) is given. Refer to [Goller and Kuchler, 1996a] for a detailed discussion, formal derivation of the given results and the algorithmic specifications. 3.2.1 ....
Williams, R. J. and Zipser, D. (1994). Gradient-Based Learning Algorithms for Recurrent Networks and Their Computational Complexity. In Chauvin, Y. and Rumelhart, D. E., editors, Backpropagation: Theory, Architectures and Applications, chapter 13, pages 433--486. Lawrence Erlbaum Associates, Hillsdale, NJ.
....special units, without loss of short time lag capabilities. Multiplicative gate units learn to open and close access to the constant error flow. Moreover, LSTM's learning algorithm is more efficient than previous RNN algorithms such as real time recurrent learning (RTRL; Robinson & Fallside, 1987; Williams & Zipser, 1992) and back propagation through time (BPTT; Williams & Peng, 1990; Werbos, 1988). It is local in space and time, with computational complexity O(1) per time step and weight. Recent research on LSTM has concentrated on improving the structure of the adaptive gates surrounding the CECs. Here we will ....
....RNNs already fail to learn in the presence of 10 step time lags, despite requiring more complex update algorithms. Moreover, LSTM is local in space and time (Schmidhuber, 1989) and it is more efficient than other RNN algorithms such as real time recurrent learning (RTRL; Robinson & Fallside, 1987; Williams & Zipser, 1992). LSTM's computational complexity for a memory block j per time step and weight is O(S_j), where S_j is the number of cells in the block. S_j is typically a small constant; in this paper it is in fact always equal to one, so that the computational complexity is O(1). Back propagation through ....
[Article contains additional citation context not shown here]
Williams, R. J., & Zipser, D. (1992). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin & D. E. Rumelhart (Eds.), Back-propagation: Theory, Architectures and Applications (pp. 433-486). Hillsdale, NJ: Erlbaum.
....time dependence. Starting at the beginning of the data, each copy calculates its outputs. At the end, the error is calculated for the last copy, and the weights are updated using the chain rule for ordered derivatives. The copies are then moved forward one time step and the process is repeated [14, 15, 16]. EKF training is a parameter identification technique for an RNN [4] which adapts weights of the network pattern by pattern, accumulating training information in approximate error covariance matrices and providing individually adjusted updates for the network's weights. All weights of the RNN ....
....one-step-ahead predictions the model gets outputs of the actual system from the previous time step as its inputs. The model outputs are predictions of the current system outputs. Such a model-system configuration is known under different names including series parallel model [3] and teacher forcing [15]. We model the ball and beam system via a recurrent neural network henceforth referred to as an ID (identification) network. After this network is trained, its weights are fixed, and it replaces the system in subsequent off line training of the neurocontrollers. We emphasize that the complete ....
Williams, R., and D. Zipser, "Gradient-Based Learning Algorithms for Recurrent Networks and Their Computational Complexity," Ch. 13 in Backpropagation: Theory, Architectures, and Applications (Chauvin and Rumelhart, Eds.), LEA, 1995. pp. 311-340.
....graph) [26]. Therefore BPTT is local in space but not in time, and is computationally simple but noncausal, so it can be implemented only in batch mode. For on line adaptation some approximations are needed, namely causalization and truncation of past history, as explained in [1], [40] and [53] for fully recurrent neural networks. On the other hand, RTRL is local in time but not in space, computationally complex but intrinsically on line. RTRL also implements an approximated calculation of the gradient if the parameters are continually adapted since the true derivative would require ....
....neural networks. On the other hand, RTRL is local in time but not in space, computationally complex but intrinsically on line. RTRL also implements an approximated calculation of the gradient if the parameters are continually adapted since the true derivative would require constant weights [1] [53]. In [53] Williams and Zipser report better performance and convergence rate for truncated BPTT than RTRL and explain this result stating that the history truncation approximation can be better than the approximation implemented in RTRL. Recently Wan and Beaufays [26] proposed a simple method ....
[Article contains additional citation context not shown here]
R. J. Williams and D. Zipser, "Gradient-based learning algorithms for recurrent networks and their computational complexity," Backpropagation: Theory, Architectures and Applications, Y. Chauvin and D. E. Rumelhart, Eds. Hillsdale, NJ: Lawrence Erlbaum, 1994.
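The statement that RTRL is local in time and intrinsically on line can be made concrete with a small sketch. The Python code below is a minimal illustration under stated assumptions: a hypothetical one-unit network y(t) = tanh(w * y(t-1) + u * x(t)) with squared error, and illustrative names `rtrl_gradients` and `sequence_loss` that do not come from the cited chapter. The sensitivities dy/dw and dy/du are carried forward in time, so no past states need to be stored. The weights are held constant here, which is the case in which the computed derivative is the true gradient.

```python
import math


def sequence_loss(w, u, xs, ds, y0=0.0):
    """Sum of 0.5 * (y(t) - d(t))^2 for y(t) = tanh(w * y(t-1) + u * x(t))."""
    y, total = y0, 0.0
    for x, d in zip(xs, ds):
        y = math.tanh(w * y + u * x)
        total += 0.5 * (y - d) ** 2
    return total


def rtrl_gradients(w, u, xs, ds, y0=0.0):
    """Forward-in-time gradient computation for the one-unit network.

    p_w and p_u hold the sensitivities dy(t)/dw and dy(t)/du, updated by
    p(t) = f'(net(t)) * (direct term + w * p(t-1)).
    """
    y = y0
    p_w = p_u = 0.0          # y0 is a constant, so the initial sensitivities are zero
    grad_w = grad_u = 0.0
    for x, d in zip(xs, ds):
        y_new = math.tanh(w * y + u * x)
        fprime = 1.0 - y_new ** 2
        p_w = fprime * (y + w * p_w)   # uses y(t-1) before it is overwritten
        p_u = fprime * (x + w * p_u)
        y = y_new
        err = y - d
        grad_w += err * p_w            # could instead be applied as an on-line update
        grad_u += err * p_u
    return grad_w, grad_u


if __name__ == "__main__":
    xs = [0.5, -0.3, 0.8, 0.1]         # assumed toy inputs
    ds = [0.2, 0.0, 0.4, -0.1]         # assumed toy targets
    print(rtrl_gradients(0.7, -0.4, xs, ds))
```

If the weights were changed inside the loop, the stored sensitivities would refer to earlier weight values, which corresponds to the approximation mentioned in the context above.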
....time step k , given by: k y k y k e h = 8) Several training algorithms have been proposed to adjust the weight values in recurrent networks. Examples of these methods are the dynamic backpropagation from Narendra and Parthasarathy (1991), the real time recurrent algorithm from Williams and Zipser (1995) and the backpropagation through time from Werbos (1990), among others. This last method is considered in the present work. For updating x W , u W and y W a gradient type algorithm is used as follows: y u,x i k W k E k W k W i m m i m i , 1 ( 1 ( D = D r r ....
Williams, R.; Zipser, D., 1995, "Gradient-based learning algorithms for recurrent networks and their computational complexity", Backpropagation, edited by Yves Chauvin and D. Rumelhart, Chap. 13, pp. 433-486.
....k given by (25) k x k x k e h d m = 25) Several training algorithms have been proposed to adjust the weighting values in recurrent networks. Examples of these methods are Narendra's dynamic backpropagation [34], the real time recurrent algorithm from Williams and Zipser [35] and the backpropagation through time algorithm (BTT) from Werbos [36], which is being considered in the present work. All these methods use a gradient based learning algorithm and involve the computation of partial derivatives or sensitivity functions. To update x W and u W ....
Williams, R. and Zipser, D. - Gradient-based learning algorithms for recurrent networks and their computational complexity - Backpropagation: Theory, architectures and applications, edited by Yves Chauvin and D. Rumelhart, Chap. 13, 433-486, 1995.
....where d(k) 2 n denotes the actual plant states at time step k. Several training algorithms have been proposed to adjust the weight values in recurrent networks. Examples of these methods are Narendra's dynamic backpropagation [12], the real time recurrent algorithm of Williams and Zipser [21] and Werbos' backpropagation through time [20], among others. Backpropagation through time is considered in the present work. Being a gradient type algorithm, the updates of the weight values W s ; s = x; u (W y is known and fixed) are given by (13) w s ij = w s ij (1 ) E total ....
Williams R., Zipser D., "Gradient-based learning algorithms for recurrent networks and their computational complexity", Backpropagation: Theory, architectures and applications, Yves Chauvin and D. Rumelhart Editors, Chap.13, 433-486, (1995).
....output errors become sufficiently small after learning. Learning of the network is composed of the following two stages. In the first stage the initial state $x_0$ and the output sequence $\{y_k\}$ are given to the network, and the optimal gain matrix $F_d$ is obtained by backpropagation through time [8]. In the second stage, the criterion matrices R, Q and P, corresponding to $F_d$, are obtained by backpropagation [9] satisfying positive definiteness of R and P and positive semi-definiteness of Q. During learning, we use a method of eigenvalue modification [4] to assure $R > 0$, $Q \geq 0$ and $P > 0$. In ....
R. J. Williams and D. Zipser. Gradient-Based Learning Algorithms for Recurrent Networks and Their Computational Complexity. Y. Chauvin and D. E. Rumelhart Eds., Backpropagation: Theory, Architectures and Applications, LEA, Inc., 1995.
....in each layer as in the case of multilayer feedforward networks. 3.1.2 Recurrent Architecture Training. Fewer learning algorithms have been developed for recurrent neural networks due to their complexity. A popular algorithm used is a modification of BP called Backpropagation Through Time (BPTT) [64, 79]. It forms a feedforward network from the recurrent network by unrolling the recurrent network through a number of time steps. The unrolling process involves copying the network neurons for each time step to form layers of networks. If a connection exists from neuron n i to n j in the recurrent ....
Ronald J. Williams and David Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Yves Chauvin and David E. Rumelhart, editors, Backpropagation: Theory, Architectures and Applications, chapter 13, pages 433--486. Lawrence Erlbaum Associates, 1995.
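The unrolling idea described in the context above can be sketched in a few lines. The following Python code is a minimal illustration, assuming a hypothetical one-unit recurrent network y(t) = tanh(w * y(t-1) + u * x(t)) with squared error; the names `unrolled_forward`, `sequence_loss` and `bptt_gradients` are illustrative and are not the notation of the cited works. Each time step acts as one layer of the unrolled feedforward network, all layers share the same weights, and the backward pass visits the stored layers in reverse order.

```python
import math


def unrolled_forward(w, u, xs, y0=0.0):
    """Forward pass; ys[t] is the output of the t-th unrolled copy, ys[0] = y0."""
    ys = [y0]
    for x in xs:
        ys.append(math.tanh(w * ys[-1] + u * x))
    return ys


def sequence_loss(w, u, xs, ds):
    """Sum over time of 0.5 * (y(t) - d(t))^2."""
    ys = unrolled_forward(w, u, xs)
    return 0.5 * sum((y - d) ** 2 for y, d in zip(ys[1:], ds))


def bptt_gradients(w, u, xs, ds):
    """Backward pass through the unrolled copies, last time step first."""
    ys = unrolled_forward(w, u, xs)       # all states are stored
    grad_w = grad_u = 0.0
    delta_next = 0.0                       # no error arrives from beyond the last step
    for t in range(len(xs), 0, -1):
        # Error reaching y(t): local error plus error flowing back from step t+1.
        dE_dy = (ys[t] - ds[t - 1]) + w * delta_next
        delta = dE_dy * (1.0 - ys[t] ** 2)   # derivative with respect to net(t)
        grad_w += delta * ys[t - 1]          # shared weights: sum over all copies
        grad_u += delta * xs[t - 1]
        delta_next = delta
    return grad_w, grad_u


if __name__ == "__main__":
    xs = [0.5, -0.3, 0.8, 0.1]             # assumed toy inputs
    ds = [0.2, 0.0, 0.4, -0.1]             # assumed toy targets
    print(bptt_gradients(0.7, -0.4, xs, ds))
```

Because the backward pass needs every stored state, the memory cost grows with the sequence length, which is the trade-off the contexts above contrast with forward-in-time methods.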
....the problems faced by the sub grouping strategy. Firstly, the dynamic mode gives excitation to the learning process so that the chance of convergence is higher; secondly, the single output does not need to be duplicated on every sub network. Exact gradient methods. The exact gradient methods [7,8] combine the RTRL and the BPTT algorithms to compute the error gradient. These methods divide the gradient calculations into blocks with h ( O(n) time steps, and the weight changes are performed only at the end of each time block. At the beginning of each block, say t , all the p t ij k ( ....
R. J. Williams and D. Zipser, "Gradient-based learning algorithms for recurrent networks and their computational complexity", In Backpropagation: Theory, Architectures, and Applications, Y. Chauvin and D.E. Rumelhart, Eds., Lawrence Erlbaum Associates Pub., Hillsdale, NJ, 433-486, 1994.
....unit to the k-th normal output unit is $w_{o_k x_j}$. $w_{ij}$'s real valued weight at time t is denoted by $w_{ij}(t)$. Before training, all weights $w_{ij}(1)$ are randomly initialized. The following definitions will look familiar to the reader knowledgeable about conventional recurrent nets (e.g. [19]). The environment determines the activations of a normal input unit $x_k$. The activations of the remaining input units will be specified in section 2.2, which lists the self referential aspects of the architecture. For a non input unit $y_k$ we define $net_{y_k}(1) = 0$; $\forall t \geq 1$: $y_k(t) = f_{y_k}$ ....
....described in sections 2.1 and 2.2 the corresponding learning algorithm for the simpler architecture without explicit weight changing capabilities (footnote 9, section 2.3) is just a modification of conventional gradient based algorithms for recurrent nets (e.g. [10], [17], [6], [7], [19]). To obtain a better overview, let us summarize the system dynamics in compact form: $net_{y_k}(1) = 0$; $\forall t \geq 1$: $x_k(t) \leftarrow$ environment; $y_k(t) = f_{y_k}(net_{y_k}(t))$; $\forall t > 1$: $net_{y_k}(t) = \sum_l w_{y_k l}(t-1)\, l(t-1)$ (8); $\forall t \geq 1$: $w_{ij}(t+1) = w_{ij}(t) + \triangle(t)\, g[\,\|adr(w_{ij}) - {}$ ....
[Article contains additional citation context not shown here]
R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum, 1992, in press.
....behavior. How to learn such behavior from training examples? If there are long time lags between relevant events and later error signals, then most analog gradient based recurrent net learning algorithms, such as Back-Propagation Through Time (BPTT, e.g. Rumelhart et al. 1986; Werbos, 1988; Williams and Zipser, 1992) or Real Time Recurrent Learning (RTRL, e.g. Robinson and Fallside, 1987) (see overviews by Williams, 1989; Pearlmutter, 1995), will not work. Their main problem is that error signals flowing backwards in time tend to decay exponentially, as was shown first by Hochreiter (1991). This insight ....
Williams, R. J. and Zipser, D. (1992). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum.
....a very powerful class of computational models, capable of instantiating almost arbitrary dynamics. The extent to which this potential can be exploited is, however, limited by the effectiveness of the training procedure applied. Gradient based methods (Back Propagation Through Time (BPTT) [8] or Real Time Recurrent Learning (RTRL) [6]) share an important limitation. The magnitude of the error signal propagated back in time depends exponentially on the magnitude of the weights. This implies that the backpropagated error quickly either vanishes or blows up [5, 1]. Hence standard RNNs ....
R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Backpropagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum, 1992.
....a very powerful class of computational models, capable of instantiating almost arbitrary dynamics. The extent to which this potential can be exploited is, however, limited by the effectiveness of the training procedure applied. Gradient based methods (BackPropagation Through Time (BPTT) [10] or Real Time Recurrent Learning (RTRL) [8]) share an important limitation. The magnitude of the error signal propagated back in time depends exponentially on the magnitude of the weights. This implies that the backpropagated error quickly either vanishes or blows up [6, 1]. Hence standard RNNs ....
R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Backpropagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum, 1992.
....unit to the k-th normal output unit is $w_{o_k x_j}$. $w_{ij}$'s real valued weight at time t is denoted by $w_{ij}(t)$. Before training, all weights $w_{ij}(1)$ are randomly initialized. The following definitions will look familiar to the reader knowledgeable about conventional recurrent nets (e.g. [13]). The environment determines the activations of a normal input unit $x_k$. For a non input unit $y_k$ we define $net_{y_k}(1) = 0$; $\forall t \geq 1$: $y_k(t) = f_{y_k}(net_{y_k}(t))$; $\forall t > 1$: $net_{y_k}(t) = \sum_l w_{y_k l}(t-1)\, l(t-1)$ (1), where $f_i$ is the activation function of unit i. The current ....
....2 X k (eval k (t 1) 2 : Note that elements of algorithm space are evaluated solely by a conventional evaluation function 3 . The following algorithm for minimizing $E^{total}$ is partly inspired by (but more complex than) conventional recurrent network algorithms (e.g. [4], [10], [2], [3], [13]). Derivation of the algorithm. We use the chain rule to compute weight increments (to be performed after each training sequence) for all initial weights $w_{ab}(1)$ according to $w_{ab}(1) \leftarrow w_{ab}(1) - \eta \, \partial E^{total}(n_r n_s) / \partial w_{ab}(1)$ (11), where $\eta$ is a constant positive learning rate. ....
[Article contains additional citation context not shown here]
R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum, 1992.
....biological plausibility is an important point of consideration. The idea of neural processing comes from biological models of actual nervous systems. However, many artificial neural network models are not biologically plausible. For example, the back propagation through time method [WZ89] requires storing all states of the network during processing. Such mechanisms are not plausible in actual nervous systems [GA91]. In the works described in this thesis, such plausibility is considered carefully, especially with respect to the following points. • Locality of calculations. Locality ....
....complex temporal sequence processing, especially with long distance dependencies (LDD). One way to avoid the disadvantage is to use the back propagation through time (BPTT) method [WZ89]. [Figure 3.1: Simple Recurrent Network (labels: Input, Output, Context, State).] This method, however, has another disadvantage: it requires recording all states during processing. Such a mechanism is not biologically plausible. In order to process temporal sequences with LDD, a network needs a mechanism to hold information about inputs through time. In this chapter, I ....
[Article contains additional citation context not shown here]
Ronald J. Williams and David Zipser. Gradient-Based Learning Algorithms for Recurrent Networks. In Y. Chauvin and D. E. Rumelhart, editors, Backpropagation: Theory, Architectures and Applications, chapter , pp. . Hillsdale, NJ: Erlbaum, 1989.
....existing non local algorithms. 1.1 A WEAKNESS OF PREVIOUS LEARNING ALGORITHMS. Exact gradient based supervised learning algorithms for minimizing E are back propagation through time (BPTT) (e.g. [14], [24], [10]), the real time recurrent learning algorithm (RTRL) [13], [27], its accelerated versions [26][28][17], and the recent fast weight algorithm [19]. All these approaches are non-local. For a restricted class of recurrent networks, however, there is a local gradient based algorithm [6]. Local (but much weaker) approximations of the general supervised algorithms have been proposed (e.g. [3], [1] ....
R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum, 1992, in press.
.... reduce time complexities Epochwise BPTT Truncated BPTT Truncated RTRL Subgrouping strategy Mode exchange RTRL and Fast RTRL RTRL BPTT Researchers Williams and Peng 1990 [5] Catfolis 1993 [6] Zipser 1989 [7] Lu et al. 1995 [8,9] Lu 1996 [10] Schmidhuber 1992 [11] Williams and Zipser 1994 [12] Time complexity per time step O(n 2 ) O(n 4 ) O(n 2 ) O(n 3 ) O(n 3 ) Table 1 A summary of the gradient based methods and their time complexities. The variable n denotes the number of processing nodes. Stochastic based Methods Approaches that improve convergence and or reduce time ....
....[9] showed that the FRTRL algorithm is about three times faster than the original RTRL algorithm. 2.1.3 Exact gradient methods. The exact gradient methods combine the RTRL and the BPTT algorithms to compute the error gradient, resulting in an average time complexity of O(n^3) per time step [11,12]. The combined algorithm divides the gradient calculations into blocks of h time steps, and the weight changes are performed only at the end of each time block. Therefore, strictly speaking, the algorithm is not an on-line one if each block contains more than one time step. At the beginning of ....
R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin and D. E. Rumelhart, editors, Backpropagation: Theory, Architectures, and Applications, chapter 13, pages 433--486. Lawrence Erlbaum Associates Pub., Hillsdale, NJ, 1994.
....required per time step. Such an algorithm is the RTRL algorithm (Robinson and Fallside, 1987; Williams and Zipser, 1989). It requires only fixed size storage of the order O(n^3) but is computationally expensive: it requires O(n^4) operations per time step (footnote 2). The algorithm described herein .... [Footnotes: Current address: Dept. of Computer Science, University of Colorado, Campus Box 430, Boulder, CO 80309, USA, yirgan cs.colorado.edu. 1. Since the acceptance of this paper for publication it has come to my attention that the same algorithm was derived by Ron Williams (Williams, 1989; Williams and Zipser, 1992).]
Williams, R. J. and Zipser, D. (1992). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum.
....Wiles & Phillips, 1991) attempt to learn to build a representation of the entire list. This representation requires training on a significant proportion of all possible lists in order to ensure generalisation (Wiles & Phillips, 1991). People, however, do not .... [Footnote 4: At least in its original formulation (Williams & Zipser, 1990), although an O(n^3) algorithm has recently been developed (Schmidhuber, 1992).] [Footnote 5: This statement needs some qualification. There are many variations on BPTT. If the SRN is compared against BPTT where error is backpropagated at the end of each sequence rather than after each pattern, the time ....]
Williams, R. J., & Zipser, D. (1990). Gradient-based learning algorithms for recurrent networks. In Chauvin, Y., & Rumelhart, D. E. (Eds.), Backpropagation: Theory, Architectures and Applications, pp. 1--42. Erlbaum, Hillsdale, NJ.
....(e.g. Mozer, 1992). However, previous algorithms for learning what to put in short term memory take too much time or don't work at all, especially when there are long time lags between inputs and corresponding teacher signals. For instance, with conventional backprop through time (BPTT, e.g. Williams and Zipser, 1992) or RTRL (e.g. Robinson and Fallside, 1987), error signals flowing backwards in time tend to either (1) blow up or (2) vanish: the temporal evolution of the backpropagated error exponentially depends e.g. on weights of self connections (leading from some unit to itself). See Hochreiter (1991) ....
....its problems concerning information storage and retrieval. These problems will be solved by the LSTM architecture to be described in section 3. Section 4 will present experimental comparisons with competing methods. LSTM outperforms them. 2 CONSTANT ERROR BACKPROP. Conventional BPTT (e.g. Williams and Zipser, 1992). Output unit k's target at time t is $d_k(t)$. Using mean squared error, k's error signal is $\vartheta_k(t) = f'_k(net_k(t))(d_k(t) - y_k(t))$, where $y_i(t) = f_i(net_i(t))$ is the activation of a non-input unit i with activation function $f_i$, $w_{ij}$ is the weight on the connection from ....
Williams, R. J. and Zipser, D. (1992). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Chauvin, Y. and Rumelhart, D. E., editors, Backpropagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum.
.... simulator shaped by the actual physical characteristics of the individual robot we have (for more about methodological issues see Nolfi, Floreano, Miglino, and Mondada, 1994; Miglino, Lund, and Nolfi, 1995) 11 sequential inputs (Elman, 1993; see also Robinson & Fallside, 1987; Williams, 1989; Williams & Zipser, 1992). [Figure 6: The architecture of the network (sensors, planned action, time delay, 1th level prediction, 2th level prediction, 1th level segmentation).] Single arrows indicate the weights of the first and of the second prediction layer which are taught by back propagation and the ....
Williams, R.J. & Zipser, D. (1992). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum.
.... $d_p)' \nabla_{x(T)} y_p(T) \, \nabla_w x(T)$ (6). We can expand this further by assuming that the weights at different time indices are independent and computing the partial gradient with respect to these weights, which is the methodology used to derive algorithms such as Backpropagation Through Time (BPTT) [27, 38]. The total gradient is then equal to the sum of these partial gradients. Specifically, $\nabla_w C = \sum_p (y_p(T) - d_p)' \nabla_{x(T)} y_p(T) \left[ \sum_{\tau=1}^{T} \nabla_{w(\tau)} x(T) \right]$ (7). Another application of the Chain Rule to Equation 7 gives $\nabla_w C = \sum_p (y_p(T) - d_p)' \nabla_{x(T)} y_p(T)$ ....
R.J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin and D. E. Rumelhart, editors, Back-propagation: Theory, Architectures and Applications, chapter 13, pages 433--486. Lawrence Erlbaum Publishers, Hillsdale, N.J., 1995.
....of the alphabet is assigned its own input neuron. Training is performed by updating the network weights at the end of each sample string presentation. We use a gradient descent optimization algorithm for finding a minimum on the network error surface which is defined by a quadratic cost function [55]. As with any optimization method based on gradient descent, the training algorithm is prone to finding local minima for which the network does not correctly classify the training data. However, other methods such as simulated annealing are computationally prohibitive [31]. We have found that an ....
R.J.Williams and D. Zipser, "Gradient-based learning algorithms for recurrent networks and their computational complexity," in Backpropagation: Theory, Architectures and Applications (Y. Chauvin and D. E. Rumelhart, eds.), ch. 13, pp. 433--486, Hillsdale, N.J.: Lawrence Erlbaum Publishers, 1995.
....time, denoted as BPTT(h). If the truncation depth h is set to zero, BPTT reduces precisely to ordinary static backpropagation. We build our treatment of derivative adaptive critics on BPTT(h) and use the resulting derivatives in the same way. Several slightly different ways of executing BPTT exist [4]. We here present a form closely related to but slightly different from that we have presented previously [5]. We choose to use this particular form here because it leads to a more natural correspondence between the derivatives it produces and those produced by derivative adaptive critics. In ....
R. J. Williams and D. Zipser, "Gradient-based learning algorithms for recurrent networks and their computational complexity," in Backpropagation: Theory, Architectures, and Applications, Y. Chauvin and D. E. Rumelhart, Eds., New Jersey: Lawrence Erlbaum Associates, pp. 433--485, 1995.
....of the MLP of a NARX network or NSAR, several methods of weight elimination [5, 31, 47, 64, 66] can be incorporated into the training algorithm. In the following experiments, networks are trained using weight decay [31]. All experiments were trained using Back Propagation Through Time (BPTT) [68]. 4.1 Grammatical Inference: Learning a 512-state Finite Memory Machine. NARX networks have been shown to be able to simulate and learn a class of finite state machines [8, 21] called respectively definite and finite memory machines. When being trained on strings which are encoded as temporal ....
R.J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin and D. E. Rumelhart, editors, Back-propagation: Theory, Architectures and Applications, chapter 13, pages 433--486. Lawrence Erlbaum Publishers, Hillsdale, N.J., 1995.
....1992). The most widely used algorithms for learning what to put in short term memory, however, take too much time or don't work well at all, especially when minimal time lags between inputs and corresponding teacher signals are long. With conventional Back Propagation Through Time (BPTT, e.g. Williams and Zipser 1992) or Real Time Recurrent Learning (RTRL, e.g. Robinson and Fallside 1987), error signals flowing backwards in time tend to either (1) blow up or (2) vanish: the temporal evolution of the backpropagated error exponentially depends on the size of the weights. Case (1) may lead to oscillating ....
....briefly review previous work. Section 6 will discuss certain limitations and advantages of LSTM. The appendix contains a detailed description of the algorithm (A.1) and explicit formulae for error flow (A.2). 2 CONSTANT ERROR BACKPROP 2.1 EXPONENTIALLY DECAYING ERROR. Conventional BPTT (e.g. Williams and Zipser 1992). Output unit k's target at time t is denoted by $d_k(t)$. Using mean squared error, k's error signal is $\vartheta_k(t) = f'_k(net_k(t))(d_k(t) - y_k(t))$, where $y_i(t) = f_i(net_i(t))$ is the activation of a non-input unit i with differentiable activation function $f_i$, $net_i(t)$ ....
Williams, R. J. and Zipser, D. (1992). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Chauvin, Y. and Rumelhart, D. E., editors, Backpropagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum.
....(action) network to minimize J and, eventually, J. In backpropagation through time (BPTT) we also aim at minimizing (2) but T=k h, where h is the depth of truncation in BPTT (BPTT of a particular truncation depth is denoted as BPTT(h); BPTT(0) corresponds to static backpropagation) [3], [4]. If we assume that h is large enough to satisfy lim(g t k ) 0 in (2), both DHP and BPTT(h) then have the same minimization criterion (2). To illustrate the similarity between DHP and BPTT, we consider the following ordered discrete system x t x t i i ext ( 1 i m net t W t x t W t ....
R. Williams, and D. Zipser. Gradient-Based Learning Algorithms for Recurrent Networks and Their Computational Complexity. In Backpropagation: Theory, Architectures, and Applications, Chauvin and Rumelhart, Eds., LEA, 1995, pp. 433-485.
R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin and D. E. Rumelhart, editors, Back-propagation: Theory, Architectures and Applications. Hillsdale, 1992.
R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin and D. E. Rumelhart, editors, Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum, 1992.
R.J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin and D.E. Rumelhart, editors, Backpropagation: Theory, Architectures and Applications, chapter 13, pages 433--486. Lawrence Erlbaum Publishers, 1995.
Williams, R.J., Zipser, D.: Gradient-based learning algorithms for recurrent networks and their computational complexity. In Chauvin, Y., Rumelhart, D.E., eds.: Back-propagation: Theory, Architectures and Applications. Lawrence Erlbaum Publishers, Hillsdale, N.J. (1995) 433--486
R. J. Williams and D. Zipser. Gradient-Based Learning Algorithms for Recurrent Networks and Their Computational Complexity. In Y. Chauvin and D. E. Rumelhart, editors, Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Lawrence Erlbaum, 1995.
R. J. Williams and D. Zipser, "Gradient-based learning algorithms for recurrent networks and their computational complexity", in Back-propagation: Theory, Architectures and Applications, ed. Y. Chauvin and D. E. Rumelhart (Hillsdale, Erlbaum, New York, 1992).
R. J. Williams and D. Zipser, "Gradient-based learning algorithms for recurrent networks and their computational complexity," in Y. Chauvin and D. E. Rumelhart (eds.), Back-propagation: Theory, Architectures and Applications, Hillsdale, NJ: Erlbaum, chap. 13, pp. 433--486, 1995.
R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin and D. E. Rumelhart, editors, Backpropagation: Theory, Architectures and Applications. Lawrence Erlbaum Associates, Hillsdale, NJ, 1994.
. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum.
Williams, R. J. and Zipser, D. (1992). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Chauvin, Y. and Rumelhart, D. E., editors, Back-propagation: Theory, Architectures and Applications, chapter 13, pages 433-486. Hillsdale, NJ: Erlbaum.
Williams, R.J. & Zipser, D. (1995). Gradient-Based Learning Algorithms for Recurrent Networks. In Y. Chauvin & D.E. Rumelhart (Eds.), Backpropagation: Theory, Architectures, and Applications (pp. 99--136). Hillsdale, NJ: Lawrence Erlbaum Associates.