Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding (1996)
Venue: Advances in Neural Information Processing Systems 8
Citations: 433 (20 self)
Citations
1670 | Learning from delayed rewards - Watkins - 1989
Citation Context: ...pproximation Reinforcement learning is a broad class of optimal control methods based on estimating value functions from experience, simulation, or search (Barto, Bradtke & Singh, 1995; Sutton, 1988; Watkins, 1989). Many of these methods, e.g., dynamic programming and temporal-difference learning, build their estimates in part on the basis of other estimates. This may be worrisome because, in practice, the est...
1520 | Learning to predict by the methods of temporal differences - Sutton - 1988
Citation Context: ...and Function Approximation Reinforcement learning is a broad class of optimal control methods based on estimating value functions from experience, simulation, or search (Barto, Bradtke & Singh, 1995; Sutton, 1988; Watkins, 1989). Many of these methods, e.g., dynamic programming and temporal-difference learning, build their estimates in part on the basis of other estimates. This may be worrisome because, in pr...
557 | Robot Dynamics and Control - Spong, Vidyasagar - 1989
415 | Practical issues in temporal difference learning - Tesauro - 1992
Citation Context: ...iklis & Van Roy, 1994) and in practice (Boyan & Moore, 1995). On the other hand, other methods have been proven stable in theory (Sutton, 1988; Dayan, 1992) and very effective in practice (Lin, 1991; Tesauro, 1992; Zhang & Dietterich, 1995; Crites & Barto, 1996). What are the key requirements of a method or task in order to obtain good performance? The experiments in this paper are part of narrowing the answer...
381 | Online Q-learning using connectionist systems - Rummery, Niranjan - 1994
323 | Improving elevator performance using reinforcement learning - Crites, Barto - 1996
Citation Context: ...an & Moore, 1995). On the other hand, other methods have been proven stable in theory (Sutton, 1988; Dayan, 1992) and very effective in practice (Lin, 1991; Tesauro, 1992; Zhang & Dietterich, 1995; Crites & Barto, 1996). What are the key requirements of a method or task in order to obtain good performance? The experiments in this paper are part of narrowing the answer to this question. The reinforcement learning me...
315 | Self-improving reactive agents based on reinforcement learning, planning and teaching - Lin - 1992
306 | Residual algorithms: Reinforcement learning with function approximation - Baird - 1995
Citation Context: ...sed as the targets for other estimates, it seems possible that the ultimate result might be very poor estimates, or even divergence. Indeed some such methods have been shown to be unstable in theory (Baird, 1995; Gordon, 1995; Tsitsiklis & Van Roy, 1994) and in practice (Boyan & Moore, 1995). On the other hand, other methods have been proven stable in theory (Sutton, 1988; Dayan, 1992) and very effective in ...
305 | Generalization in reinforcement learning: Safely approximating the value function - Boyan, Moore - 1995 |
274 | Temporal Credit Assignment in Reinforcement Learning - Sutton - 1984
263 | Stable function approximation in dynamic programming - Gordon - 1995
Citation Context: ...rgets for other estimates, it seems possible that the ultimate result might be very poor estimates, or even divergence. Indeed some such methods have been shown to be unstable in theory (Baird, 1995; Gordon, 1995; Tsitsiklis & Van Roy, 1994) and in practice (Boyan & Moore, 1995). On the other hand, other methods have been proven stable in theory (Sutton, 1988; Dayan, 1992) and very effective in practice (Lin,...
241 | Reinforcement learning with replacing eligibility traces - Singh, Sutton - 1996
Citation Context: ...erformance? The experiments in this paper are part of narrowing the answer to this question. The reinforcement learning methods we use are variations of the sarsa algorithm (Rummery & Niranjan, 1994; Singh & Sutton, 1996). This method is the same as the TD(λ) algorithm (Sutton, 1988), except applied to state-action pairs instead of states, and where the predictions are used as the basis for selecting actions. The le...
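The excerpt above describes sarsa as TD(λ) applied to state-action pairs, with the predictions also driving action selection. As a rough illustration of that idea (not the paper's exact implementation), a minimal linear sarsa(λ) episode with replacing traces and ε-greedy action selection might look like the sketch below; the `env` and `features` interfaces are hypothetical stand-ins.

```python
import numpy as np

def sarsa_lambda_episode(env, features, num_features, actions,
                         alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.1,
                         weights=None, rng=None):
    """One episode of linear sarsa(lambda) with replacing traces (a sketch).

    Hypothetical interfaces: env.reset() -> state, env.step(a) ->
    (next_state, reward, done); features(state, action) -> list of
    indices of the active binary features (e.g. CMAC tiles).
    """
    rng = np.random.default_rng() if rng is None else rng
    w = np.zeros(num_features) if weights is None else weights
    z = np.zeros(num_features)                 # eligibility traces

    def q(s, a):                               # linear action-value estimate
        return w[features(s, a)].sum()

    def choose(s):                             # epsilon-greedy on current estimates
        if rng.random() < epsilon:
            return actions[rng.integers(len(actions))]
        return max(actions, key=lambda a: q(s, a))

    s = env.reset()
    a = choose(s)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        delta = r - q(s, a)                    # TD error toward the sarsa target
        if not done:
            a_next = choose(s_next)
            delta += gamma * q(s_next, a_next)
        z[features(s, a)] = 1.0                # replacing traces on active features
        w = w + alpha * delta * z              # update all weights
        z *= gamma * lam                       # decay traces
        if not done:
            s, a = s_next, a_next
    return w
```

Here `features(s, a)` would return the indices of the active binary features, for example the CMAC tiles sketched after the Sutton & Whitehead entry below.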
205 | Neuronlike Elements That Can Solve Difficult Learning Control Problems - Barto, Sutton, et al. - 1983
114 | Real-Time Learning and Control using Asynchronous Dynamic Programming - Barto, Bradtke, et al. - 1991
85 | CMAC: An associative neural network alternative to backpropagation - Miller, Glanz, et al. - 1990
62 | The convergence of TD(λ) for general λ - Dayan - 1992
Citation Context: ... be unstable in theory (Baird, 1995; Gordon, 1995; Tsitsiklis & Van Roy, 1994) and in practice (Boyan & Moore, 1995). On the other hand, other methods have been proven stable in theory (Sutton, 1988; Dayan, 1992) and very effective in practice (Lin, 1991; Tesauro, 1992; Zhang & Dietterich, 1995; Crites & Barto, 1996). What are the key requirements of a method or task in order to obtain good performance? The ...
60 | Online learning with random representations - Sutton, Whitehead - 1993
Citation Context: ...all effect is much like a network with fixed radial basis functions, except that it is particularly efficient computationally (in other respects one would expect RBF networks and similar methods (see Sutton & Whitehead, 1993) to work just as well). It is important to note that the tilings need not be simple grids. For example, to avoid the "curse of dimensionality," a common trick is to ignore some dimensions in some til...
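To make the tiling idea in the excerpt concrete, here is a minimal, hypothetical tile-coding feature extractor: several offset grid tilings over a continuous state, each contributing one active binary feature, with an optional `dims` argument to ignore some dimensions in some tilings. The offsets, wrap-around, and indexing scheme are illustrative simplifications, not the CMAC variant used in the paper's experiments.

```python
import numpy as np

def tile_features(state, low, high, tiles_per_dim=8, num_tilings=8, dims=None):
    """Return the indices of the active tiles, one per tiling (a sketch)."""
    state = np.asarray(state, dtype=float)
    low, high = np.asarray(low, dtype=float), np.asarray(high, dtype=float)
    if dims is not None:                          # tile only a subset of dimensions
        state, low, high = state[dims], low[dims], high[dims]
    scaled = (state - low) / (high - low) * tiles_per_dim
    tiles_per_tiling = tiles_per_dim ** len(scaled)
    active = []
    for t in range(num_tilings):
        offset = t / num_tilings                  # each tiling is shifted slightly
        coords = np.floor(scaled + offset).astype(int) % tiles_per_dim
        index = 0                                 # row-major index within this tiling
        for c in coords:                          # (modulo wrap is a simplification)
            index = index * tiles_per_dim + int(c)
        active.append(t * tiles_per_tiling + index)
    return active
```

Each call yields `num_tilings` active indices; a linear learner such as the sarsa(λ) sketch above then keeps one weight per tile, which is what makes the representation sparse and the updates cheap.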
30 | CMAC-based adaptive critic self-learning control - Lin, Kim - 1991
Citation Context: ...1995; Tsitsiklis & Van Roy, 1994) and in practice (Boyan & Moore, 1995). On the other hand, other methods have been proven stable in theory (Sutton, 1988; Dayan, 1992) and very effective in practice (Lin, 1991; Tesauro, 1992; Zhang & Dietterich, 1995; Crites & Barto, 1996). What are the key requirements of a method or task in order to obtain good performance? The experiments in this paper are part of narro...
29 | A counter-example to temporal difference learning - Bertsekas - 1994
Citation Context: ...ods and as in TD(λ) with λ = 1, or to learn on the basis of interim estimates, as in TD(λ) with λ < 1. Theoretically, the former has asymptotic advantages when function approximators are used (Dayan, 1992; Bertsekas, 1995), but empirically the latter is thought to achieve better learning rates (Sutton, 1988). However, hitherto this question has not been put to an empirical test using function approximators. Figures 6 ...
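For reference, the λ = 1 versus λ < 1 distinction in the excerpt can be written in standard TD(λ) notation (roughly following Sutton, 1988, as cited above): the λ-return is a geometric mixture of n-step returns, where λ = 1 recovers the full actual outcome and λ < 1 bootstraps from interim value estimates.

```latex
% n-step return: actual rewards for n steps, then bootstrap from the current estimate
G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^{n}\,\hat{V}(s_{t+n})

% lambda-return: geometric mixture of the n-step returns;
% lambda = 1 gives the full (Monte Carlo) outcome, lambda < 1 relies on interim estimates
G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1}\, G_t^{(n)}
```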
28 | Feature-based methods for large-scale dynamic programming - Tsitsiklis, Roy - 1996 |
20 | Swinging Up the Acrobot: An Example of Intelligent Control - DeJong, Spong - 1994
Citation Context: ...mented with a larger and more difficult task not attempted by Boyan and Moore. The acrobot is a two-link under-actuated robot (Figure 5) roughly analogous to a gymnast swinging on a highbar (DeJong & Spong, 1994; Spong & Vidyasagar, 1989). The first joint (corresponding to the gymnast's hands on the bar) cannot exert torque, but the second joint (corresponding to the gymnast bending at the waist) can. The ob...
16 | Reinforcement Learning for Planning and Control - Dean, Basye, et al. - 1993
15 | Online Function Approximation for Scaling Up Reinforcement Learning - Tham - 1994
Citation Context: ...ntinuous state space, we combined it with a sparse, coarse-coded function approximator known as the CMAC (Albus, 1980; Miller, Gordon & Kraft, 1990; Watkins, 1989; Lin & Kim, 1991; Dean et al., 1992; Tham, 1994). A CMAC uses multiple overlapping tilings of the state space to produce a feature representation for a final linear mapping where all the learning takes place. See Figure 2. The overall effect is mu...