Generalization in reinforcement learning: Safely approximating the value function (1995)

by J A Boyan, A W Moore
Venue: In Advances in Neural Information Processing Systems 7
Results 1 - 10 of 307

Reinforcement learning: a survey

by Leslie Pack Kaelbling, Michael L. Littman, Andrew W. Moore - Journal of Artificial Intelligence Research, 1996
"... This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem ..."
Abstract - Cited by 1714 (25 self) - Add to MetaCart
This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.

Citation Context

...n approximator is used to represent the value function by mapping a state description to a value. Many researchers have experimented with this approach: Boyan and Moore [18] used local memory-based methods in conjunction with value iteration; Lin [59] used backpropagation networks for Q-learning; Watkins [128] used CMAC for Q-learning; Tesauro [118, 120] used backpropagati...
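The approach described in this context, a learned mapping from state descriptions to values, is easiest to see in tabular Q-learning, where the "approximator" is simply a lookup table indexed by state and action. The sketch below is purely illustrative; the environment interface (env.reset, env.step, env.actions) and all parameter values are assumptions, not anything prescribed by the survey.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning: the value 'approximator' is a lookup table Q[s][a]."""
    Q = defaultdict(lambda: defaultdict(float))

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection (explore vs. exploit)
            if random.random() < epsilon or not Q[s]:
                a = random.choice(env.actions(s))
            else:
                a = max(Q[s], key=Q[s].get)

            s_next, r, done = env.step(a)

            # one-step temporal-difference backup toward r + gamma * max_a' Q(s', a')
            best_next = max(Q[s_next].values(), default=0.0)
            Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
            s = s_next
    return Q
```

Replacing the lookup table with a generalizing function approximator is exactly the step whose safety Boyan and Moore [18] question.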

Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding

by Richard S. Sutton - Advances in Neural Information Processing Systems 8, 1996
"... On large problems, reinforcement learning systems must use parameterized function approximators such as neural networks in order to generalize between similar situations and actions. In these cases there are no strong theoretical results on the accuracy of convergence, and computational results have ..."
Abstract - Cited by 433 (20 self) - Add to MetaCart
On large problems, reinforcement learning systems must use parameterized function approximators such as neural networks in order to generalize between similar situations and actions. In these cases there are no strong theoretical results on the accuracy of convergence, and computational results have been mixed. In particular, Boyan and Moore reported at last year's meeting a series of negative results in attempting to apply dynamic programming together with function approximation to simple control problems with continuous state spaces. In this paper, we present positive results for all the control tasks they attempted, and for one that is significantly larger. The most important differences are that we used sparse-coarse-coded function approximators (CMACs) whereas they used mostly global function approximators, and that we learned online whereas they learned offline. Boyan and Moore and others have suggested that the problems they encountered could be solved by using actual outcomes (...
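Sutton's positive results rely on CMACs (tile coding), which map a continuous state onto a small set of active binary features drawn from several offset grids; the value estimate is the sum of the active features' weights. The one-dimensional sketch below only illustrates that mechanism; the number of tilings, tile widths, state range, and step size are assumptions, not the settings used in the paper.

```python
import numpy as np

class TileCoder:
    """Minimal 1-D tile coding (CMAC-style sparse coarse coding)."""

    def __init__(self, n_tilings=8, tiles_per_tiling=10, low=0.0, high=1.0):
        self.n_tilings = n_tilings
        self.tiles = tiles_per_tiling
        self.low, self.high = low, high
        # one weight per tile; only n_tilings of them are active for any state
        self.w = np.zeros((n_tilings, tiles_per_tiling + 1))

    def active(self, x):
        """Index of the single active tile in each offset tiling."""
        scaled = (x - self.low) / (self.high - self.low) * self.tiles
        return [(t, int(scaled + t / self.n_tilings)) for t in range(self.n_tilings)]

    def value(self, x):
        return sum(self.w[t, i] for t, i in self.active(x))

    def update(self, x, target, alpha=0.1):
        """Move the estimate toward `target`, splitting the step across tilings."""
        error = target - self.value(x)
        for t, i in self.active(x):
            self.w[t, i] += (alpha / self.n_tilings) * error
```

Because only a few weights change per update, generalization stays local to nearby states and learning is cheap enough to run online, the two ingredients the abstract credits for the positive results.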

An analysis of temporal-difference learning with function approximation

by John N. Tsitsiklis, Benjamin Van Roy - IEEE Transactions on Automatic Control, 1997
"... We discuss the temporal-difference learning algorithm, as applied to approximating the cost-to-go function of an infinite-horizon discounted Markov chain. The algorithm weanalyze updates parameters of a linear function approximator on-line, duringasingle endless trajectory of an irreducible aperiodi ..."
Abstract - Cited by 313 (8 self) - Add to MetaCart
We discuss the temporal-difference learning algorithm, as applied to approximating the cost-to-go function of an infinite-horizon discounted Markov chain. The algorithm we analyze updates parameters of a linear function approximator on-line, during a single endless trajectory of an irreducible aperiodic Markov chain with a finite or infinite state space. We present a proof of convergence (with probability 1), a characterization of the limit of convergence, and a bound on the resulting approximation error. Furthermore, our analysis is based on a new line of reasoning that provides new intuition about the dynamics of temporal-difference learning. In addition to proving new and stronger positive results than those previously available, we identify the significance of on-line updating and potential hazards associated with the use of nonlinear function approximators. First, we prove that divergence may occur when updates are not based on trajectories of the Markov chain. This fact reconciles positive and negative results that have been discussed in the literature, regarding the soundness of temporal-difference learning. Second, we present an example illustrating the possibility of divergence when temporal-difference learning is used in the presence of a nonlinear function approximator.
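The algorithm analyzed is on-line TD(λ) with a linear approximator, updated along a single trajectory of the chain. A minimal sketch of that update follows; the feature map, the transition sampler, and the constant step size are assumed interfaces and simplifications for illustration, not part of the cited analysis (which, for instance, requires an appropriately decreasing step-size sequence).

```python
import numpy as np

def td_lambda_linear(features, sample_transition, s0, n_steps=10_000,
                     alpha=0.01, gamma=0.9, lam=0.7):
    """On-line TD(lambda) with a linear cost-to-go estimate V(s) = w . phi(s).

    `features(s)` returns the feature vector phi(s); `sample_transition(s)`
    returns (next_state, cost) drawn from the Markov chain.
    """
    s = s0
    w = np.zeros(len(features(s0)))
    z = np.zeros_like(w)                   # eligibility trace

    for _ in range(n_steps):
        s_next, cost = sample_transition(s)
        phi, phi_next = features(s), features(s_next)

        # temporal-difference error for the discounted cost-to-go
        delta = cost + gamma * (w @ phi_next) - (w @ phi)

        z = gamma * lam * z + phi          # accumulate the trace
        w += alpha * delta * z             # update along the sampled trajectory
        s = s_next
    return w
```

Updating only along sampled trajectories is the on-line ingredient the paper identifies as essential; the same update applied to arbitrarily chosen states can diverge.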

Stable Function Approximation in Dynamic Programming

by Geoffrey J. Gordon - IN MACHINE LEARNING: PROCEEDINGS OF THE TWELFTH INTERNATIONAL CONFERENCE, 1995
"... The success of reinforcement learning in practical problems depends on the ability tocombine function approximation with temporal difference methods such as value iteration. Experiments in this area have produced mixed results; there have been both notable successes and notable disappointments. Theo ..."
Abstract - Cited by 263 (6 self) - Add to MetaCart
The success of reinforcement learning in practical problems depends on the ability to combine function approximation with temporal difference methods such as value iteration. Experiments in this area have produced mixed results; there have been both notable successes and notable disappointments. Theory has been scarce, mostly due to the difficulty of reasoning about function approximators that generalize beyond the observed data. We provide a proof of convergence for a wide class of temporal difference methods involving function approximators such as k-nearest-neighbor, and show experimentally that these methods can be useful. The proof is based on a view of function approximators as expansion or contraction mappings. In addition, we present a novel view of approximate value iteration: an approximate algorithm for one environment turns out to be an exact algorithm for a different environment.
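Gordon's convergence result covers approximators that are non-expansions in the max norm ("averagers"), such as k-nearest-neighbor. The sketch below combines value iteration with a k-NN averager to show the shape of that combination; the sampled state set, the discrete action set, and the reward/transition callables are assumptions of the example, not the paper's experiments.

```python
import numpy as np

def knn_fitted_value_iteration(states, reward, transition, actions=(0, 1),
                               k=3, gamma=0.9, sweeps=100):
    """Approximate value iteration with a k-nearest-neighbor averager.

    `states` is an (n, d) array of sampled representative states;
    `reward(s, a)` returns a scalar and `transition(s, a)` a successor state.
    Because the averager only forms convex combinations of stored values,
    the combined backup remains a contraction and the sweeps converge.
    """
    states = np.asarray(states, dtype=float).reshape(len(states), -1)
    V = np.zeros(len(states))

    def approx_value(x):
        # k-NN averager: mean stored value of the k closest sampled states
        dists = np.linalg.norm(states - np.ravel(x), axis=1)
        return V[np.argsort(dists)[:k]].mean()

    for _ in range(sweeps):
        V_new = np.empty_like(V)
        for i, s in enumerate(states):
            # Bellman backup evaluated through the averager
            V_new[i] = max(reward(s, a) + gamma * approx_value(transition(s, a))
                           for a in actions)
        V = V_new
    return states, V
```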

Tree-based batch mode reinforcement learning

by Damien Ernst, Pierre Geurts, Louis Wehenkel - JOURNAL OF MACHINE LEARNING RESEARCH, 2005
"... Reinforcement learning aims to determine an optimal control policy from interaction with a system or from observations gathered from a system. In batch mode, it can be achieved by approximating the so-called Q-function based on a set of four-tuples (xt,ut,rt,xt+1) where xt denotes the system state a ..."
Abstract - Cited by 224 (42 self) - Add to MetaCart
Reinforcement learning aims to determine an optimal control policy from interaction with a system or from observations gathered from a system. In batch mode, it can be achieved by approximating the so-called Q-function based on a set of four-tuples (x_t, u_t, r_t, x_{t+1}) where x_t denotes the system state at time t, u_t the control action taken, r_t the instantaneous reward obtained and x_{t+1} the successor state of the system, and by determining the control policy from this Q-function. The Q-function approximation may be obtained from the limit of a sequence of (batch mode) supervised learning problems. Within this framework we describe the use of several classical tree-based supervised learning methods (CART, Kd-tree, tree bagging) and two newly proposed ensemble algorithms, namely extremely and totally randomized trees. We study their performances on several examples and find that the ensemble methods based on regression trees perform well in extracting relevant information about the optimal control policy from sets of four-tuples. In particular, the totally randomized trees give good results while ensuring the convergence of the sequence, whereas by relaxing the convergence constraint even better accuracy results are provided by the extremely randomized trees.
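A rough sketch of the batch-mode fitted Q iteration loop described above, using scikit-learn's extremely randomized trees as the supervised learner. The regressor, its hyperparameters, and the discrete action set are illustrative stand-ins, not the paper's exact configuration or code.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(four_tuples, actions, n_iterations=50, gamma=0.95):
    """Fitted Q iteration from a batch of (x_t, u_t, r_t, x_{t+1}) four-tuples.

    Each iteration is a supervised regression problem: inputs are
    (state, action) pairs and targets are r_t + gamma * max_u Q_{N-1}(x_{t+1}, u).
    """
    X = np.array([np.append(x, u) for x, u, r, x_next in four_tuples])
    rewards = np.array([r for x, u, r, x_next in four_tuples])
    successors = np.array([x_next for x, u, r, x_next in four_tuples])

    model = None
    for _ in range(n_iterations):
        if model is None:
            targets = rewards                       # Q_1: expected immediate reward
        else:
            # evaluate Q_{N-1} at every successor state for every candidate action
            q_next = np.column_stack([
                model.predict(np.column_stack([successors,
                                               np.full(len(successors), u)]))
                for u in actions])
            targets = rewards + gamma * q_next.max(axis=1)
        model = ExtraTreesRegressor(n_estimators=50).fit(X, targets)
    return model
```

A greedy policy is then read off the final model by evaluating it at the current state for each candidate action and taking the maximizer.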

Citation Context

...methods, divergence to infinity problems were plaguing the fitted Q iteration algorithm (Section 5.3.3); such problems have already been highlighted in the context of approximate dynamic programming (Boyan and Moore, 1995). 4.5 Computation of max_{u ∈ U} Q̂_N(x,u) when u is continuous: In the case of a single regression tree, Q̂_N(x,u) is a piecewise-constant function of its argument u, when fixing the state value x. Thus, to de...

Algorithms for Sequential Decision Making

by Michael Lederman Littman, 1996
"... Sequential decision making is a fundamental task faced by any intelligent agent in an extended interaction with its environment; it is the act of answering the question "What should I do now?" In this thesis, I show how to answer this question when "now" is one of a finite set of ..."
Abstract - Cited by 213 (8 self) - Add to MetaCart
Sequential decision making is a fundamental task faced by any intelligent agent in an extended interaction with its environment; it is the act of answering the question "What should I do now?" In this thesis, I show how to answer this question when "now" is one of a finite set of states, "do" is one of a finite set of actions, "should" is maximize a long-run measure of reward, and "I" is an automated planning or learning system (agent). In particular,
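For the finite-state, finite-action setting described here, under the discounted long-run reward objective, the canonical dynamic programming answer to "What should I do now?" is value iteration. The sketch below assumes an explicit matrix model of the MDP, which is an assumption of the example rather than something required by the thesis.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Value iteration for a finite MDP with a discounted expected-reward objective.

    P[a] is an |S| x |S| transition matrix for action a; R is an |S| x |A|
    reward array. Returns the optimal value function and a greedy policy.
    """
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_s' P[a][s, s'] * V[s']
        Q = R + gamma * np.stack([P[a] @ V for a in range(n_actions)], axis=1)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```

The citation context below tabulates alternative objective criteria (average reward, minimax, goal probability, and so on), each of which corresponds to a different backup operator in place of the discounted one used here.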

Citation Context

...horizon [62]
minimax expected average reward over the infinite horizon [183]
maximum expected average reward over the infinite horizon [102]
maximum expected undiscounted reward until goal (cost-to-go) [29]
minimax expected undiscounted goal probability [36]
maximum expected undiscounted goal probability [87]
maximum multiagent discounted expected reward [22]
Table 1.3: Several popular objective functio...

Stochastic Dynamic Programming with Factored Representations

by Craig Boutilier, Richard Dearden, Moisés Goldszmidt, 1997
"... Markov decision processes(MDPs) have proven to be popular models for decision-theoretic planning, but standard dynamic programming algorithms for solving MDPs rely on explicit, state-based specifications and computations. To alleviate the combinatorial problems associated with such methods, we pro-p ..."
Abstract - Cited by 189 (10 self) - Add to MetaCart
Markov decision processes (MDPs) have proven to be popular models for decision-theoretic planning, but standard dynamic programming algorithms for solving MDPs rely on explicit, state-based specifications and computations. To alleviate the combinatorial problems associated with such methods, we propose new representational and computational techniques for MDPs that exploit certain types of problem structure. We use dynamic Bayesian networks (with decision trees representing the local families of conditional probability distributions) to represent stochastic actions in an MDP, together with a decision-tree representation of rewards. Based on this representation, we develop versions of standard dynamic programming algorithms that directly manipulate decision-tree representations of policies and value functions. This generally obviates the need for state-by-state computation, aggregating states at the leaves of these trees and requiring computations only for each aggregate state. The key to these algorithms is a decision-theoretic generalization of classic regression analysis, in which we determine the features relevant to predicting expected value. We demonstrate the method empirically on several planning problems,
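The central representational idea, aggregating all states that agree on the value-relevant features so that dynamic programming manipulates tree leaves rather than individual states, can be seen with a hand-built value tree. The example below is only a sketch of that aggregation; the variable names and values are invented, and the paper's algorithms construct and refine such trees automatically via decision-theoretic regression, which is not shown here.

```python
# A value function stored as a decision tree over boolean state variables.
# Every state that reaches the same leaf shares one value, so a backup per
# leaf replaces a backup per state.

ValueTree = dict  # either {"leaf": value} or {"test": var, "true": subtree, "false": subtree}

def tree_value(tree: ValueTree, state: dict) -> float:
    """Evaluate V(state) by descending the tree on the tested variables only."""
    while "leaf" not in tree:
        tree = tree["true"] if state[tree["test"]] else tree["false"]
    return tree["leaf"]

# Hypothetical example: value depends only on 'has_coffee' and 'wet'; all other
# state variables (and hence exponentially many states) are aggregated at the leaves.
v_tree = {"test": "has_coffee",
          "true": {"leaf": 10.0},
          "false": {"test": "wet",
                    "true": {"leaf": -2.0},
                    "false": {"leaf": 0.0}}}

print(tree_value(v_tree, {"has_coffee": False, "wet": True, "holding_umbrella": False}))  # -2.0
```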

Citation Context

...ee with little loss in accuracy, in contrast to pruning for the purpose of preventing overfitting [64]. The approximation is thus careful enough to avoid the problems of approximation described in [18]. ... abstraction, decision-theoretic regression groups together states that have identical value or policy choice at various points in the dynamic programming computations required to solve an MDP. We...

Value-function approximations for partially observable Markov decision processes

by Milos Hauskrecht - Journal of Artificial Intelligence Research, 2000
"... Partially observable Markov decision processes (POMDPs) provide an elegant mathematical framework for modeling complex decision and planning problems in stochastic domains in which states of the system are observable only indirectly, via a set of imperfect or noisy observations. The modeling advanta ..."
Abstract - Cited by 167 (1 self) - Add to MetaCart
Partially observable Markov decision processes (POMDPs) provide an elegant mathematical framework for modeling complex decision and planning problems in stochastic domains in which states of the system are observable only indirectly, via a set of imperfect or noisy observations. The modeling advantage of POMDPs, however, comes at a price — exact methods for solving them are computationally very expensive and thus applicable in practice only to very simple problems. We focus on efficient approximation (heuristic) methods that attempt to alleviate the computational problem and trade off accuracy for speed. We have two objectives here. First, we survey various approximation methods, analyze their properties and relations and provide some new insights into their differences. Second, we present a number of new approximation methods and novel refinements of existing techniques. The theoretical results are supported by experiments on a problem from the agent navigation domain.

Citation Context

...he drawback of the approach is that, when combined with the value-iteration method, it can lead to instability and/or divergence. This has been shown for MDPs by several researchers (Bertsekas, 1994; Boyan & Moore, 1995; Baird, 1995; Tsitsiklis & Roy, 1996). 25. This is similar to the QMDP method, which allows both lookahead and greedy designs. In fact, QMDP can be viewed as a special case of the grid-based method w...
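The QMDP method mentioned in this context is simple enough to sketch directly: solve the underlying MDP as if the state were fully observable, then act greedily with respect to the belief-weighted MDP Q-values. The array shapes below are assumptions of the example.

```python
import numpy as np

def qmdp_action(belief, Q_mdp):
    """QMDP heuristic for a POMDP.

    `belief` is a probability vector over the |S| states and `Q_mdp` an
    |S| x |A| array of Q-values for the fully observable MDP (obtained, for
    example, by value iteration). The action maximizing the belief-weighted
    Q-value is returned; the value of gathering information is ignored,
    which is exactly the approximation QMDP makes.
    """
    return int(np.argmax(belief @ Q_mdp))
```

For instance, with belief [0.7, 0.3] and Q_mdp = [[1, 0], [0, 2]], the belief-weighted scores are [0.7, 0.6], so action 0 is chosen even though action 1 would be better if the second state were certain.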

Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms

by Satinder Singh, Tommi Jaakkola, Michael L. Littman, Csaba Szepesvári - MACHINE LEARNING, 1998
"... An important application of reinforcement learning (RL) is to finite-state control problems and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform e ..."
Abstract - Cited by 154 (7 self) - Add to MetaCart
An important application of reinforcement learning (RL) is to finite-state control problems and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform exploration. In this paper, we examine the convergence of single-step on-policy RL algorithms for control. On-policy algorithms cannot separate exploration from learning and therefore must confront the exploration problem directly. We prove convergence results for several related on-policy algorithms with both decaying exploration and persistent exploration. We also provide examples of exploration strategies that can be followed during learning that result in convergence to both optimal values and optimal policies.
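A minimal sketch of one single-step on-policy algorithm of the kind analyzed here, SARSA(0) with a decaying epsilon-greedy policy. The environment interface and the 1/episode decay schedule are assumptions made for illustration; the paper's actual conditions on exploration and learning rates are more precise than this.

```python
import random
from collections import defaultdict

def sarsa(env, episodes=1000, alpha=0.1, gamma=0.95):
    """SARSA(0): the policy that generates behavior is the one being learned."""
    Q = defaultdict(float)

    def choose(s, eps):
        # epsilon-greedy over the learned Q-values
        if random.random() < eps:
            return random.choice(env.actions(s))
        return max(env.actions(s), key=lambda a: Q[(s, a)])

    for episode in range(1, episodes + 1):
        eps = 1.0 / episode                 # decaying exploration
        s = env.reset()
        a = choose(s, eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = None if done else choose(s_next, eps)
            target = r if done else r + gamma * Q[(s_next, a_next)]
            # on-policy backup: uses the action actually selected next
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```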

Kernel-Based Reinforcement Learning

by Dirk Ormoneit, Saunak Sen - Machine Learning, 1999
"... We present a kernel-based approach to reinforcement learning that overcomes the stability problems of temporal-difference learning in continuous state-spaces. First, our algorithm converges to a unique solution of an approximate Bellman's equation regardless of its initialization values. Second ..."
Abstract - Cited by 153 (1 self) - Add to MetaCart
We present a kernel-based approach to reinforcement learning that overcomes the stability problems of temporal-difference learning in continuous state-spaces. First, our algorithm converges to a unique solution of an approximate Bellman's equation regardless of its initialization values. Second, the method is consistent in the sense that the resulting policy converges asymptotically to the optimal policy. Parametric value function estimates such as neural networks do not possess this property. Our kernel-based approach also allows us to show that the limiting distribution of the value function estimate is a Gaussian process. This information is useful in studying the bias-variance tradeoff in reinforcement learning. We find that all reinforcement learning approaches to estimating the value function, parametric or non-parametric, are subject to a bias. This bias is typically larger in reinforcement learning than in a comparable regression problem.
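A sketch of the kernel-based construction: the Q-value for an action is a kernel-weighted average, over the observed transitions for that action, of reward plus discounted value at the observed successor state. Because the weights form a convex combination, the approximate Bellman update is a contraction and converges to a unique fixed point regardless of initialization, which is the stability property emphasized above. The Gaussian kernel, its bandwidth, the one-dimensional states, and the `transitions` data structure are all assumptions of this example.

```python
import numpy as np

def kernel_based_q(transitions, actions, gamma=0.9, bandwidth=0.2, sweeps=200):
    """Kernel-based reinforcement learning from sample transitions.

    `transitions[a]` is a list of (x, r, x_next) samples collected with
    action a, where x is a scalar state. The estimate is
        Q(x, a) ~= sum_i k(x, x_i) * (r_i + gamma * max_b Q(x'_i, b)),
    iterated to its fixed point; a function q(x, a) is returned.
    """
    X  = {a: np.array([t[0] for t in transitions[a]]) for a in actions}
    R  = {a: np.array([t[1] for t in transitions[a]]) for a in actions}
    Xn = {a: np.array([t[2] for t in transitions[a]]) for a in actions}

    def weights(x, centers):
        # normalized Gaussian kernel weights: a convex combination
        w = np.exp(-0.5 * ((x - centers) / bandwidth) ** 2)
        return w / w.sum()

    # current estimate of max_b Q at each sample's successor state, per action
    V_succ = {a: np.zeros(len(X[a])) for a in actions}

    def q(x, a):
        return weights(x, X[a]) @ (R[a] + gamma * V_succ[a])

    for _ in range(sweeps):
        V_succ = {a: np.array([max(q(x, b) for b in actions) for x in Xn[a]])
                  for a in actions}
    return q
```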

Citation Context

...orks) to represent the value function of the underlying Markov Decision Process (MDP). For a detailed discussion of this problem, as well as a list of exceptions, the interested reader is referred to [5, 31]. By adopting a non-parametric perspective on reinforcement learning, we suggest an algorithm that always converges to a unique solution. This algorithm assigns value function estimates to the states ...
