
## Near-optimal reinforcement learning in polynomial time (1998)


### Download Links

- [www.cse.wustl.edu]
- [www.cis.upenn.edu]
- [www.cs.berkeley.edu]
- [158.130.69.163]
- [www.cs.colorado.edu]
- [deas.harvard.edu]
- [ftp.cs.colorado.edu]
- DBLP

### Other Repositories/Bibliography

Venue: Machine Learning

Citations: 301 (5 self)

### Citations

5464 | Reinforcement Learning: An Introduction
- Sutton, R, et al.
- 1998
Citation Context: ...ocesses, exploration versus exploitation 1. Introduction In reinforcement learning, an agent interacts with an unknown environment, and attempts to choose actions that maximize its cumulative payoff (Sutton & Barto, 1998; Barto, Sutton, & Watkins, 1990; Bertsekas & Tsitsiklis, 1996). The environment is typically modeled as a Markov decision process (MDP), and it is assumed that the agent does not know the parameters ...

1849 | Markov Decision Processes: Discrete Stochastic Dynamic Programming
- Puterman
- 1994
Citation Context: ...ergodic (that is, has a well-defined stationary distribution). For the development and exposition, it will be easiest to consider MDPs for which every policy is ergodic, the so-called unichain MDPs (Puterman, 1994). In a unichain MDP, the stationary distribution of any policy does not depend on the start state. Thus, considering the unichain case simply allows us to discuss the stationary distribution of any p...
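The unichain property quoted in this context can be checked numerically: the stationary distribution of the Markov chain a fixed policy induces is the left eigenvector of its transition matrix for eigenvalue 1, and in the ergodic case every row of P^t converges to it regardless of start state. The 3-state chain below is invented purely for illustration.

```python
import numpy as np

# Transition matrix of the chain induced by some fixed policy in a
# hypothetical 3-state unichain MDP (each row sums to 1).
P = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])

# The stationary distribution mu satisfies mu P = mu with sum(mu) = 1:
# it is the left eigenvector of P for eigenvalue 1 (the largest for a
# stochastic matrix), normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(P.T)
mu = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
mu = mu / mu.sum()

# Independence from the start state: every row of P^t approaches mu.
P_t = np.linalg.matrix_power(P, 100)
assert np.allclose(P_t, np.tile(mu, (3, 1)), atol=1e-8)
```

Because this chain is irreducible and aperiodic, all rows of P^100 agree with mu to numerical precision, which is exactly the start-state independence the unichain discussion relies on.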

1667 | Learning from Delayed Rewards
- Watkins
- 1989
Citation Context: ...s Q-learning algorithm guarantees asymptotic convergence to optimal values (from which the optimal actions can be derived) provided every state of the MDP has been visited an infinite number of times (Watkins, 1989; Watkins & Dayan, 1992; Jaakkola et al., 1994; Tsitsiklis, 1994). This asymptotic result does not specify a strategy for achieving this infinite exploration, and as such does not provide a solution to...
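The Watkins update this context cites can be sketched in a few lines of tabular Q-learning. The 2-state MDP, step size, and iteration count below are invented for the example and are not taken from the paper; uniformly random action selection stands in for the "every state visited infinitely often" condition.

```python
import random

# Toy deterministic 2-state, 2-action MDP (hypothetical, for illustration).
# transitions[s][a] = (next_state, reward)
transitions = {
    0: {0: (0, 0.0), 1: (1, 0.0)},
    1: {0: (0, 1.0), 1: (1, 0.5)},
}
gamma, alpha = 0.9, 0.1

Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}

random.seed(0)
s = 0
for _ in range(20000):
    # Uniformly random exploration visits every (state, action) pair
    # infinitely often in the limit, as the convergence result requires.
    a = random.choice((0, 1))
    s2, r = transitions[s][a]
    # Watkins' update: move Q(s, a) toward r + gamma * max_a' Q(s2, a')
    target = r + gamma * max(Q[(s2, 0)], Q[(s2, 1)])
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    s = s2
```

Note, echoing the quoted caveat, that the random policy driving the updates never exploits what it has learned: convergence of the Q-values says nothing about the payoff earned while learning, which is the exploitation-exploration gap the paper addresses.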

1488 | Learning to predict by the methods of temporal differences
- Sutton
- 1988
Citation Context: ...fact that the state space may be so large that we will have to resort to methods such as function approximation. While some results are available on reinforcement learning and function approximation (Sutton, 1988; Singh et al., 1995; Gordon, 1995; Tsitsiklis & Roy, 1996), and for partially observable MDPs (Chrisman, 1992; Littman et al., 1995; Jaakkola et al., 1995), they are all asymptotic in nature. The ext...

1151 | Nonlinear Programming
- Bertsekas
- 1995
Citation Context: ...1 Introduction In reinforcement learning, an agent interacts with an unknown environment, and attempts to choose actions that maximize its cumulative payoff (Sutton & Barto, 1998; Barto et al., 1990; Bertsekas & Tsitsiklis, 1996). The environment is typically modeled as a Markov decision process (MDP), and it is assumed that the agent does not know the parameters of this process, but has to learn how to act directly from exp...

941 | Parallel and Distributed Computation: Numerical Methods
- Bertsekas, Tsitsiklis
- 1989
Citation Context: ...DP with N states in O(N²T) computation steps for both the discounted and the undiscounted cases. For the sake of completeness, we present the undiscounted and discounted value iteration algorithms (Bertsekas & Tsitsiklis, 1989) below. The optimal T-step policy may be nonstationary, and is denoted by a sequence π* = {π*_1, π*_2, π*_3, ..., π*_T}, where π*_t(i) is the optimal action to be taken from state i on the tth step...
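The undiscounted T-step value iteration this context refers to can be sketched as follows. The MDP (sizes, transition probabilities, payoffs) and the horizon `T` are invented for the example; only the backup rule and the O(N²T) cost per fixed number of actions come from the quoted text.

```python
import numpy as np

# Hypothetical MDP with N states and A actions.
# P[a] is the N x N transition matrix for action a; R[i] is the payoff at state i.
N, A, T = 3, 2, 10
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(N), size=(A, N))   # each row P[a, i, :] sums to 1
R = rng.uniform(size=N)

# Undiscounted T-step backup: V_t(i) = R(i) + max_a sum_j P[a, i, j] V_{t-1}(j).
# The optimal T-step policy may be nonstationary, so a separate pi*_t is
# recorded for each number of remaining steps.
V = np.zeros(N)
policy = []  # policy[t][i] = optimal action from state i with t+1 steps to go
for _ in range(T):
    Q = R[None, :] + P @ V          # Q[a, i], shape (A, N)
    policy.append(Q.argmax(axis=0))
    V = Q.max(axis=0)

# Each of the T iterations costs O(A * N^2) arithmetic, i.e. O(N^2 T)
# for a fixed number of actions, matching the bound quoted above.
```

The discounted variant differs only in the backup, which becomes R(i) + γ · max_a Σ_j P[a, i, j] V_{t-1}(j) for a discount factor γ in (0, 1).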

473 |
Dynamic Programming: Deterministic and Stochastic Models
- Bertsekas
- 1987
Citation Context: ...ot know the parameters of this process, but has to learn how to act directly from experience. Thus, the reinforcement learning agent faces a fundamental trade-off between exploitation and exploration (Bertsekas, 1987; Kumar & Varaiya, 1986; Thrun, 1992): that is, should the agent exploit its cumulative experience so far, by executing the action that currently seems best, or should it execute a different action, wi...

430 | Generalization in reinforcement learning: Successful examples using sparse coarse coding
- Sutton
- 1996
Citation Context: ...ion for asymptotic convergence to optimal actions, and asymptotic exploitation, for both the Q-learning and SARSA algorithms (a variant of Q-learning) (Rummery & Niranjan, 1994; Singh & Sutton, 1996; Sutton, 1995). Gullapalli and Barto (1994) and Jalali and Ferguson (1989) presented algorithms that learn a model of the environment from experience, perform value iteration on the estimated model, and with inni...

378 | Online Q-learning using connectionist systems (Tech
- Rummery, Niranjan
- 1994
Citation Context: ...rategies that guarantee both sufficient exploration for asymptotic convergence to optimal actions, and asymptotic exploitation, for both the Q-learning and SARSA algorithms (a variant of Q-learning) (Rummery & Niranjan, 1994; Singh & Sutton, 1996; Sutton, 1995). Gullapalli and Barto (1994) and Jalali and Ferguson (1989) presented algorithms that learn a model of the environment from experience, perform value iteration on...

376 | Prioritized sweeping: reinforcement learning with less data and less real time. Machine Learning 13(1):102–130
- Moore, Atkeson
- 1993
Citation Context: .... It is likely that a practical implementation based on the algorithmic ideas given here would enjoy performance on natural problems that is considerably better than the current bounds indicate. (See Moore and Atkeson, 1993, for a related heuristic algorithm.) 4.6 Eliminating Knowledge of T and opt(Π^{T,ε}_M) In order to simplify our presentation of the main theorem and the E3 algorithm, we made the assumption that the l...

291 | Learning policies for partially observable environments: Scaling up
- Littman, Cassandra, Kaelbling
- 1995
Citation Context: ...vailable on reinforcement learning and function approximation (Sutton, 1988; Singh, Jaakkola, & Jordan, 1995; Gordon, 1995; Tsitsiklis & Roy, 1996), and for partially observable MDPs (Chrisman, 1992; Littman, Cassandra, & Kaelbling, 1995; Jaakkola, Singh, & Jordan, 1995), they are all asymptotic in nature. The extension of our results to such cases is left for future work. The outline of the paper is as follows: in Section 2, we give...

261 | Stable function approximation in dynamic programming
- Gordon
- 1995
Citation Context: ...o large that we will have to resort to methods such as function approximation. While some results are available on reinforcement learning and function approximation (Sutton, 1988; Singh et al., 1995; Gordon, 1995; Tsitsiklis & Roy, 1996), and for partially observable MDPs (Chrisman, 1992; Littman et al., 1995; Jaakkola et al., 1995), they are all asymptotic in nature. The extension of our results to such case...

252 | On the convergence of stochastic iterative dynamic programming algorithms
- Jaakkola, Jordan, et al.
- 1993
Citation Context: ...ptotic in nature, providing no explicit guarantees on either the number of actions or the computation time the agent requires to achieve near-optimal performance (Sutton, 1988; Watkins & Dayan, 1992; Jaakkola et al., 1994; Tsitsiklis, 1994; Gullapalli & Barto, 1994). On the other hand, finite-time results become available if one considers restricted classes of MDPs, if the model of learning is modified from the stand...

234 | Reinforcement learning with replacing eligibility traces
- Singh, Sutton
- 1996
Citation Context: ...cient exploration for asymptotic convergence to optimal actions, and asymptotic exploitation, for both the Q-learning and SARSA algorithms (a variant of Q-learning) (Rummery & Niranjan, 1994; Singh & Sutton, 1996; Sutton, 1995). Gullapalli and Barto (1994) and Jalali and Ferguson (1989) presented algorithms that learn a model of the environment from experience, perform value iteration on the estimated model, ...

216 | Reinforcement learning with perceptual aliasing: The perceptual distinctions approach
- Chrisman
- 1992
Citation Context: ...ion. While some results are available on reinforcement learning and function approximation (Sutton, 1988; Singh et al., 1995; Gordon, 1995; Tsitsiklis & Roy, 1996), and for partially observable MDPs (Chrisman, 1992; Littman et al., 1995; Jaakkola et al., 1995), they are all asymptotic in nature. The extension of our results to such cases is left for future work. The outline of the paper is as follows: in Sect...

207 | Algorithms for Random Generation and Counting: A Markov Chain Approach
- Sinclair
- 1993
Citation Context: ...or bounding this mixing time in terms of the second eigenvalue of the transition matrix P_M, and also in terms of underlying structural properties of the transition graph, such as the conductance (Sinclair, 1993). It turns out that we can state our results for a weaker notion of mixing that only requires the expected return after T steps to approach the asymptotic return. Definition 5 Let M be a Markov decisi...
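The link between the mixing time and the second eigenvalue mentioned in this context can be illustrated numerically: the distance of P^t from the stationary distribution decays roughly like λ₂^t, so the mixing time scales as O(1/(1 − λ₂)) up to logarithmic factors. The reversible 3-state chain below is an invented example.

```python
import numpy as np

# A hypothetical symmetric (hence reversible, doubly stochastic) chain;
# its second-largest eigenvalue modulus governs how fast P^t mixes.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.1, 0.9],
])

eigvals = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
lambda2 = eigvals[1]   # second eigenvalue modulus; eigvals[0] is 1

# Since P is doubly stochastic, the uniform distribution is stationary.
mu = np.full(3, 1.0 / 3.0)

# Distance to stationarity after t steps is bounded (up to a constant)
# by lambda2**t for a reversible chain.
t = 50
gap = np.abs(np.linalg.matrix_power(P, t) - mu).max()
assert gap <= 10 * lambda2 ** t
```

Conductance-based bounds of the kind Sinclair develops control λ₂ itself, which is useful when the chain is too large to diagonalize directly.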

202 | Asynchronous stochastic approximation and Q-learning
- Tsitsiklis
- 1994
Citation Context: ...ptimal values (from which the optimal actions can be derived) provided every state of the MDP has been visited an infinite number of times (Watkins, 1989; Watkins & Dayan, 1992; Jaakkola et al., 1994; Tsitsiklis, 1994). This asymptotic result does not specify a strategy for achieving this infinite exploration, and as such does not provide a solution to the inherent exploitation-exploration trade-off. To address this...

193 | Learning to predict by the methods of temporal differences
- Sutton
- 1988
Citation Context: ...nt learning in general MDPs are asymptotic in nature, providing no explicit guarantees on either the number of actions or the computation time the agent requires to achieve near-optimal performance (Sutton, 1988; Watkins & Dayan, 1992; Jaakkola et al., 1994; Tsitsiklis, 1994; Gullapalli & Barto, 1994). On the other hand, finite-time results become available if one considers restricted classes of MDPs, if th...

177 | Feature-based methods for large scale dynamic programming
- Tsitsiklis, Roy
- 1996
Citation Context: ...o resort to methods such as function approximation. While some results are available on reinforcement learning and function approximation (Sutton, 1988; Singh, Jaakkola, & Jordan, 1995; Gordon, 1995; Tsitsiklis & Roy, 1996), and for partially observable MDPs (Chrisman, 1992; Littman, Cassandra, & Kaelbling, 1995; Jaakkola, Singh, & Jordan, 1995), they are all asymptotic in nature. The extension of our results to such c...

169 | Reinforcement learning algorithm for partially observable Markov decision problems - Jaakkola, Singh, et al. - 1994

155 | Stochastic Systems: Estimation, Identification and Adaptive Control
- Kumar, Varaiya
- 1986
Citation Context: ...ters of this process, but has to learn how to act directly from experience. Thus, the reinforcement learning agent faces a fundamental trade-off between exploitation and exploration (Bertsekas, 1987; Kumar & Varaiya, 1986; Thrun, 1992): that is, should the agent exploit its cumulative experience so far, by executing the action that currently seems best, or should it execute a different action, with the hope of gaining...

153 | Convergence results for single-step on-policy reinforcement-learning algorithms - Singh, Jaakkola, et al.

151 | The role of exploration in learning control
- Thrun
- 1992
Citation Context: ...but has to learn how to act directly from experience. Thus, the reinforcement learning agent faces a fundamental trade-off between exploitation and exploration (Bertsekas, 1987; Kumar & Varaiya, 1986; Thrun, 1992): that is, should the agent exploit its cumulative experience so far, by executing the action that currently seems best, or should it execute a different action, with the hope of gaining information o...

125 | Reinforcement Learning with Soft State Aggregation
- Singh, Jaakkola, et al.
- 1995
Citation Context: ...state space may be so large that we will have to resort to methods such as function approximation. While some results are available on reinforcement learning and function approximation (Sutton, 1988; Singh, Jaakkola, & Jordan, 1995; Gordon, 1995; Tsitsiklis & Roy, 1996), and for partially observable MDPs (Chrisman, 1992; Littman, Cassandra, & Kaelbling, 1995; Jaakkola, Singh, & Jordan, 1995), they are all asymptotic in nature. ...

87 | Efficient reinforcement learning in factored MDPs
- Kearns, Koller
- 1999
Citation Context: ...o study the applicability of recent methods for dealing with large state spaces, such as function approximation, to our algorithm. This has been recently investigated in the context of factored MDPs (Kearns & Koller, 1999). Acknowledgments We give warm thanks to Tom Dean, Tom Dietterich, Tommi Jaakkola, Leslie Kaelbling, Michael Littman, Lawrence Saul, Terry Sejnowski, and Rich Sutton for valuable comments. Satinder S...

34 | Efficient reinforcement learning
- Fiechter
- 1994
Citation Context: ...time results become available if one considers restricted classes of MDPs, if the model of learning is modified from the standard one, or if one changes the criteria for success (Saul & Singh, 1996; Fiechter, 1994; Fiechter, 1997; Schapire & Warmuth, 1994; Singh & Dayan, in press). Fiechter (1994, 1997), whose results are closest in spirit to ours, considers only the discounted case, and makes the learning prot...

30 | Convergence of indirect adaptive asynchronous value iteration algorithms. Advances in Neural Information Processing Systems
- Gullapalli, Barto
- 1994
Citation Context: ...uarantees on either the number of actions or the computation time the agent requires to achieve near-optimal performance (Sutton, 1988; Watkins & Dayan, 1992; Jaakkola et al., 1994; Tsitsiklis, 1994; Gullapalli & Barto, 1994). On the other hand, finite-time results become available if one considers restricted classes of MDPs, if the model of learning is modified from the standard one, or if one changes the criteria for ...

28 | Sequential decision problems and neural networks
- Barto, Sutton, et al.
- 1990
Citation Context: ...oitation trade-off. 1 Introduction In reinforcement learning, an agent interacts with an unknown environment, and attempts to choose actions that maximize its cumulative payoff (Sutton & Barto, 1998; Barto et al., 1990; Bertsekas & Tsitsiklis, 1996). The environment is typically modeled as a Markov decision process (MDP), and it is assumed that the agent does not know the parameters of this process, but has to lear...

25 | Analytical Mean Squared Error Curves for Temporal Difference Learning
- Singh, Dayan
- 1998
Citation Context: ...ton that allows his agent to return to a set of start-states at arbitrary times. Others have provided non-asymptotic results for prediction in uncontrolled Markov processes (Schapire & Warmuth, 1994; Singh & Dayan, 1998). Thus, despite the many interesting previous results in reinforcement learning, the literature has lacked algorithms for learning optimal behavior in general MDPs with provably finite bounds on the ...

13 | Expected mistake bound model for on-line reinforcement learning
- Fiechter
- 1997
Citation Context: ...ome available if one considers restricted classes of MDPs, if the model of learning is modified from the standard one, or if one changes the criteria for success (Saul & Singh, 1996; Fiechter, 1994; Fiechter, 1997; Schapire & Warmuth, 1994; Singh & Dayan, in press). Fiechter (1994, 1997), whose results are closest in spirit to ours, considers only the discounted case, and makes the learning protocol easier by a...

13 | On the worst-case analysis of temporal-difference learning algorithms
- Schapire, Warmuth
- 1996
Citation Context: ...ilability of a “reset” button that allows his agent to return to a set of start-states at arbitrary times. Others have provided non-asymptotic results for prediction in uncontrolled Markov processes (Schapire & Warmuth, 1994; Singh & Dayan, 1998). Thus, despite the many interesting previous results in reinforcement learning, the literature has lacked algorithms for learning optimal behavior in general MDPs with provably ...

5 | Advances in Neural Information Processing Systems - Touretzky, Mozer, Hasselmo (Eds.)

3 | A distributed asynchronous algorithm for expected average cost dynamic programming - Jalali, Ferguson - 1989

2 | Learning curve bounds for markov decision processes with undiscounted rewards
- Saul, Singh
- 1996
Citation Context: ...other hand, finite-time results become available if one considers restricted classes of MDPs, if the model of learning is modified from the standard one, or if one changes the criteria for success (Saul & Singh, 1996; Fiechter, 1994; Fiechter, 1997; Schapire & Warmuth, 1994; Singh & Dayan, in press). Fiechter (1994, 1997), whose results are closest in spirit to ours, considers only the discounted case, and makes t...

2 | On the Worst-case Analysis of Temporal-Difference Learning Algorithms - Schapire, Warmuth - 1996

1 | Feature-based methods for large scale dynamic programming
- Tsitsiklis, Roy
- 1996
Citation Context: ...o resort to methods such as function approximation. While some results are available on reinforcement learning and function approximation (Sutton, 1988; Singh et al., 1995; Gordon, 1995; Tsitsiklis & Roy, 1996), and for partially observable MDPs (Chrisman, 1992; Littman et al., 1995; Jaakkola et al., 1995), they are all asymptotic in nature. The extension of our results to such cases is left for future w...