Abstract:
We provide analytical expressions governing changes to the bias and variance of the lookup table estimators provided by various Monte Carlo and temporal difference value estimation algorithms with offline updates over trials in absorbing Markov reward processes. We have used these expressions to develop software that serves as an analysis tool: given a complete description of a Markov reward process, it rapidly yields an exact mean-square-error curve, the curve one would get from averaging together sample mean-square-error curves from an infinite number of learning trials on the given problem. We use our analysis tool to illustrate classes of mean-square-error curve behavior in a variety of example reward processes, and we show that although the various temporal difference algorithms are quite sensitive to the choice of step-size and eligibility-trace parameters, there are values of these parameters that make them similarly competent, and generally good.
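To make the notion of an averaged mean-square-error curve concrete, the following is a minimal sketch, not the authors' tool. The paper's software computes the exact curve analytically; this sketch only approximates such a curve by averaging a finite number of sample runs. The process (a symmetric random walk with reward 1 on the right exit), the use of TD(0), the uniform weighting of states in the error, and all names and parameter values are assumptions made for illustration.

```python
import random


def mean_square_error(estimate, target):
    """Unweighted mean-square error over states (uniform weighting is an assumption)."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


def td0_mse_curve(n_states=5, n_trials=50, n_runs=200, alpha=0.1, seed=0):
    """Sample-averaged MSE curve for TD(0) with offline updates.

    Hypothetical example process: a symmetric random walk over n_states
    non-terminal states, absorbing at both ends, with reward 1 on the
    right exit and 0 otherwise, and no discounting.
    """
    rng = random.Random(seed)
    # True values of this walk: probability of exiting on the right.
    true_v = [(i + 1) / (n_states + 1) for i in range(n_states)]
    curve = [0.0] * (n_trials + 1)
    for _ in range(n_runs):
        v = [0.5] * n_states  # initial lookup-table estimate
        curve[0] += mean_square_error(v, true_v)
        for trial in range(1, n_trials + 1):
            # Offline updating: accumulate increments during the trial
            # and apply them only once the trial has ended.
            increments = [0.0] * n_states
            s = n_states // 2
            while True:
                nxt = s + rng.choice((-1, 1))
                if nxt < 0:
                    reward, v_next, done = 0.0, 0.0, True
                elif nxt >= n_states:
                    reward, v_next, done = 1.0, 0.0, True
                else:
                    reward, v_next, done = 0.0, v[nxt], False
                increments[s] += alpha * (reward + v_next - v[s])
                if done:
                    break
                s = nxt
            v = [vi + di for vi, di in zip(v, increments)]
            curve[trial] += mean_square_error(v, true_v)
    return [c / n_runs for c in curve]


if __name__ == "__main__":
    curve = td0_mse_curve()
    for trial in (0, 10, 25, 50):
        print(trial, round(curve[trial], 4))
```

Increasing `n_runs` reduces the sampling noise in the averaged curve; the exact curve described in the abstract corresponds to the limit of infinitely many runs. Varying `alpha` in this sketch gives a rough feel for the step-size sensitivity the abstract mentions.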