A. G. Barto, S. J. Bradtke, and S. P. Singh. Real-time learning and control using asynchronous dynamic programming. Artificial Intelligence, 72:81-138, 1995.
....formally, a control learning problem can be described in the following way: Control learning problem: A,W, F: TM A such that F maximizes over time. This formulation of the robot control learning problem has been extensively studied in the field of Reinforcement Learning [Sutton, 1984], [Barto et al. 1991]. Thus far, most approaches that emerged from this field have studied robot learning with a minimal set of assumptions: The robot is able to sense, it is able to execute actions, actions have an effect on future sensations, and there is a pre-given reward function that defines the goals of the ....
....approaches to reinforcement learning is that the robot is able to sense the state of the world reliably. If this is the case, it suffices to learn the policy as a function of the most recent sensation to action: F: S → A, i.e. the control policy is purely reactive. As Barto et al. pointed out [Barto et al. 1991], the problem of learning a control policy can then be attacked by Dynamic Programming techniques [Bellman, 1957]. As it turns out, even if robots have access to complete state descriptions of their environments, learning control in complex robot worlds with large state spaces is practically not ....
Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Real-time learning and control using asynchronous dynamic programming. Technical Report COINS 91-57, Department of Computer Science, University of Massachusetts, MA, August 1991.
....constructs utility functions Q(s, a) that map sensations s and actions a to task-specific utility values. Q values will be positive for final successes and negative for final failures. In between, utilities are calculated recursively using an asynchronous dynamic programming technique [Barto et al. 1991]. More specifically, the utility Q(s, a) at time t is estimated through a mixture of the utilities of subsequent observation-action pairs, up to the final utility at the end of the episode. The exact update equation used in the experiments, combined with methods of temporal differencing [Sutton, ....
Andy G. Barto, Steven J. Bradtke, and Satinder P. Singh. Real-time learning and control using asynchronous dynamic programming. Technical Report COINS 91-57, Department of Computer Science, University of Massachusetts, MA, August 1991.
....Q Learning constructs utility functions Q(s, a) that map sensations s and actions a to task-specific utility values. Q values will be positive for final successes and negative for final failures. In between, utilities are calculated recursively using an asynchronous dynamic programming technique [Barto et al. 1991]. More specifically, the utility Q(s_t, a_t) at time t is estimated through a mixture of the utilities of subsequent observation-action pairs, up to the final utility at the end of the episode. The exact update equation used in the experiments, combined with methods of temporal differencing [Sutton, ....
Andy G. Barto, Steven J. Bradtke, and Satinder P. Singh. Real-time learning and control using asynchronous dynamic programming. Technical Report COINS 91-57, Department of Computer Science, University of Massachusetts, MA, August 1991.
....constructs utility functions Q(s, a) that map sensations s and actions a to task-specific utility values. Q values will be positive for final successes and negative for final failures. In between, utilities are calculated recursively using an asynchronous dynamic programming technique [Barto et al. 1991]. More specifically, the utility Q(s_t, a_t) at time t is estimated through a mixture of the utilities of subsequent observation-action pairs, up to the final utility at the end of the episode. The exact update equation used in the experiments, combined with methods of temporal differencing [Sutton, ....
Andy G. Barto, Steven J. Bradtke, and Satinder P. Singh. Real-time learning and control using asynchronous dynamic programming. Technical Report COINS 91-57, Department of Computer Science, University of Massachusetts, MA, August 1991.
....from reward and/or punishment is known as reinforcement learning [15]. Very recently strong connections between dynamic programming in operations research [6], heuristic evaluation functions in artificial intelligence [10], adaptive control [11], and reinforcement learning have been discovered [3, 27, 31]. This has caused a rapidly growing interest in this field. Although a multitude of algorithms and architectures is available that implement reinforcement learning [21, 29, 34], we only consider reinforcement learning systems that deal with delayed reinforcement [5, 21, 28]. In addition, all ....
....: [equation garbled in extraction] (Note that the discount factor γ, which lies between 0 and 1, assures the convergence of the infinite sum.) In this sense the optimal policy yields the maximal evaluation for all states and action pairs: [equation (2) garbled in extraction] This immediately leads to Bellman's optimality equation in dynamic programming [3]: [equation garbled in extraction] Watkins [27] has designed a learning rule for autonomous systems based on this equation 4. Watkins proved that this learning rule is convergent under specific assumptions: 2 Often it is also called reinforcement, reward, immediate payoff, outcome 3 Also ....
A. G. Barto, S. J. Bradtke, and S. P. Singh. Real-time learning and control using asynchronous dynamic programming. Technical report, University of Massachusetts, Department of Computer Science, Amherst MA 01003, August 1991.
....a, and θ is a positive constant value, often referred to as temperature. Thus, the higher the exploitation utility u(a) of action a, the more likely a is to get selected. Boltzmann distributed exploration is often found in reinforcement learning literature, if the number of actions is finite [5, 27, 47, 57]. In general, undirected exploration techniques select actions stochastically, whereas actions with a larger exploitation utility are at least as likely to get selected as actions with a smaller exploitation utility. Undirected exploration techniques cover a large spectrum of ....
....utility of each state (or state-action pair, respectively). Exploitation will then prefer unexplored actions (and thus states) due to their large utility estimate. The same effect can be achieved by assigning negative reward (costs) to each action, and initializing the exploitation utilities with 0 [5], as it is done in Markov decision problems [14, 22]. In a similar fashion, the interval estimation algorithm by Kaelbling [iS] pp. 55-77, is also based on overestimating the utility values. In this algorithm, exploitation utilities are expressed as upper bound of the expected reward, ....
[Article contains additional citation context not shown here]
Andy G. Barto, Steven J. Bradtke, and Satinder P. Singh. Real-time learning and control using asynchronous dynamic programming. Technical Report COINS 91-57, Department of Computer Science, University of Massachusetts, MA, August 1991.
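The Boltzmann-distributed exploration mentioned in the first excerpt of this entry can be illustrated with a minimal sketch. The excerpt does not show the selection formula itself, so the exponential weighting exp(u(a) / temperature) used here is the commonly used form and is an assumption, as are the function names `boltzmann_probabilities` and `select_action`.

```python
import math
import random


def boltzmann_probabilities(utilities, temperature):
    """Return selection probabilities proportional to exp(u / temperature).

    Assumed standard softmax form; the excerpt only states that a higher
    exploitation utility makes an action more likely to be selected.
    """
    if temperature <= 0:
        raise ValueError("temperature must be a positive constant")
    # Subtract the maximum utility before exponentiating for numerical stability.
    highest = max(utilities)
    weights = [math.exp((u - highest) / temperature) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(utilities, temperature, rng=random):
    """Draw one action index stochastically from the Boltzmann distribution."""
    probabilities = boltzmann_probabilities(utilities, temperature)
    return rng.choices(range(len(utilities)), weights=probabilities, k=1)[0]
```

With a large temperature the probabilities approach a uniform distribution, while a small temperature concentrates nearly all probability on the action with the largest utility.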
....initially unknown environments is the top level goal of COLUMBUS. In order to find low cost paths to the unexplored, the model is discretized yielding a grid representation of the environment, and dynamic programming is employed to propagate exploration utility through this discretized model [Barto et al. 1991], [Sutton, 1990]. More specifically, this is done in the following way: To each grid point x in the discretized model there is a real valued exploration utility U(x) associated. Initially, the exploration utility of x is set to the negative cumulative confidence c(x). Cumulative confidence (c.f. ....
Andy G. Barto, Steven J. Bradtke, and Satinder P. Singh. Real-time learning and control using asynchronous dynamic programming. Technical Report COINS 91-57, Department of Computer Science, University of Massachusetts, MA, August 1991.
....System [11] or specifying a situated automata [21] using the rex [12] or gapps [13] languages. Within AI research, the most common approach to learning reactive policies has been via some style of reinforcement learning; e.g. temporal difference [24], real time dynamic programming [1], or Q learning [25]. Construction of reactive policies via planning is used in, for example, Robosoar [15], Theo [17], and ere [8; 4]. This work's primary objective is to scale up the number of top level goals that a reactive system can solve without a substantial increase in the ....
Barto, A.G., Bradtke, S.J., & Singh, S.P. 1991. Real-Time Learning and Control using Asynchronous Dynamic Programming. Tech. rept. 91-57, Computer and Information Sciences, University of Massachusetts at Amherst.
....and about the information from the world that is available to them, and so it permits a link to be made with more ethological theories of animal behaviour for which optimalities of various sorts are the starting point. In fact the algorithms turn out to be novel contributions to engineering too [28]. Of course, these conditioning theories are all incomplete: Mackintosh [29] and Pearce and Hall [30] point out the importance of attentional effects, and the true relationship between classical and instrumental conditioning is still unclear; however the interplay licenced in the framework of ....
Barto, AG, Bradtke, SJ and Singh, SP. Real-Time Learning and Control using Asynchronous Dynamic Programming. TR 91-57, Department of Computer Science, University of Amherst, MA. 1992.
....[BeT96] and by Sutton and Barto [SuB98]. A more limited textbook discussion is given in the DP textbook by Bertsekas [Ber95]. The 2nd edition of the first volume of this DP text [Ber00] contains a detailed discussion of rollout algorithms. The extensive survey by Barto, Bradtke, and Singh [BBS95], and the overviews by Werbos [Wer92a], [Wer92b] and other papers in the edited volume by White and Sofge [WhS92] point out the connections between the artificial intelligence reinforcement learning viewpoint and the control theory DP viewpoint, and give many references. ....
Barto, A. G., Bradtke, S. J., and Singh, S. P., 1995. "Real-Time Learning and Control Using Asynchronous Dynamic Programming," Artificial Intelligence, Vol. 72, 1995, pp. 81-138.
....for it by non-Markov decision processes. Other algorithms found in the literature [9, 17, 55, 38] suffer a similar fate. For a more complete review of Markov decision processes and Q learning, the reader may wish to consult [11] and [47]. For a review of reinforcement learning in general see [8]. 2.1 Modeling agent-environment interaction. Figure 1 illustrates a model of agent-environment interaction that is widely used in reinforcement learning research. In this model the agent and the environment are represented by two synchronized finite state automatons interacting in a discrete ....
A.G. Barto, S.J. Bradtke, and S.P. Singh. Real-time learning and control using asynchronous dynamic programming. Technical Report 91-57, University of Massachusetts, Amherst, MA, 1991.
....to use an approach that focuses on the states an agent will most likely encounter while using heuristic or worst case bounds to handle states for which the agent does not have a full plan. Methods such as Plexus (Dean, Kaelbling, Kirman, Nicholson, 1995) and real time dynamic programming (RTDP) (Barto, Bradtke, Singh, 1991) use this approach. This type of on line planning or plan refinement is not addressed explicitly with the algorithms presented here, but they can be modified in an obvious way to work with Plexus or RTDP. A more general loosening of the assumptions used in on line planning allows an agent to ....
Barto, A. G., Bradtke, S. J., & Singh, S. P. (1991). Real-time learning and control using asynchronous dynamic programming. Technical report TR-91-57, University of Massachusetts Computer Science Department, Amherst, Massachusetts.
....decision models as MDPs (Kaelbling 1996) as opposed to bandit problems that represent the single-state case. Our technique may be implemented in direct (model free) algorithms such as Watkins' Q learning (1989) and in indirect (model based) algorithms such as adaptive dynamic programming (Barto et al. 1991, 1995). It is based on two principles: 1. to define local measures of the uncertainty in the form of exploration bonuses, using the theory of bandit problems; 2. to add these bonuses to reward and back propagate them with the dynamic programming (DP) or temporal difference (TD) mechanisms. The ....
....value of q k i , for all (i; k). Provided that each action is tried in each state an infinite number of times, the MLE converges to the true value of the unknown parameter, thus ensuring that the policy calculated by successive applications of (8) converges to the optimal policy (cf. e.g. Barto et al. 1991, 1995). ADP is called an indirect algorithm because it uses a model P of the problem, and calculates the estimated optimal policy starting from this estimate in the same way as in the non-adaptive case. On the contrary, RL also proposes direct algorithms that estimate the value of the ....
[Article contains additional citation context not shown here]
Barto, A. G., Bradtke, S. J., & Singh, S. P. (1991). Real-time learning and control using asynchronous dynamic programming. Technical Report 91-57, University of Massachusetts, Dept. of Computer Science.
....to train a neural network to learn the Q function. For Q learning, the following temporal difference error (Sutton, 1988) $e_t = r_{t+1} + \gamma \max_{a_{t+1}} [Q(x_{t+1}, a_{t+1})] - Q(x_t, a_t)$ is derived by using $\max_{a_{t+1}} [Q(x_{t+1}, a_{t+1})]$ as an approximation to $\sum_{k=0}^{\infty} \gamma^k r_{t+k+2}$. See (Barto, Bradtke, and Singh, 1991) for further discussion of the relationships between reinforcement learning and dynamic programming. 4 Q Learning Network. For the inverted pendulum experiments reported here, a neural network with a single hidden layer was used to learn the Q(x, a) function. As shown in Figure , the network ....
A. G. Barto, S. J. Bradtke, and S. P. Singh. (1991). Real-time learning and control using asynchronous dynamic programming. Technical Report 91-57, Department of Computer Science, University of Massachusetts, Amherst, MA, Aug.
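The temporal difference error for Q learning quoted in the excerpt of this entry (the next reward plus the discounted maximum Q value of the next state, minus the current Q value) can be written out as a small sketch. The citing paper represents Q with a neural network; the dictionary lookup table, the default value of 0.0 for unseen pairs, and the name `q_learning_td_error` are assumptions made only for illustration.

```python
def q_learning_td_error(q, x_t, a_t, r_next, x_next, actions, gamma):
    """Compute e_t = r_{t+1} + gamma * max_a Q(x_{t+1}, a) - Q(x_t, a_t).

    q is a dict mapping (state, action) pairs to values; pairs that have
    not been seen yet are treated as 0.0 (an illustrative assumption).
    """
    # Best achievable Q value in the next state, over the finite action set.
    best_next = max(q.get((x_next, a), 0.0) for a in actions)
    return r_next + gamma * best_next - q.get((x_t, a_t), 0.0)
```

For example, with Q(s0, left) = 0.5, next-state values 1.0 and 2.0, a reward of 1.0, and gamma = 0.9, the error is 1.0 + 0.9 * 2.0 - 0.5.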
....when it has gained enough information about the task to stop exploring and to start relying on the domain decision policy it has learned. Much of the research in reinforcement learning is aimed at solving these two problems (Sutton, 1984; Kaelbling, 1990; Barto, Bradtke, & Singh, 1991; Gullapalli, 1992; Whitehead, 1992; Thrun & Moller, 1992, and many more). The remainder of this section describes three approaches to reinforcement learning. First, another of Samuel's checkers players will be explored. Then, two particular reinforcement learning algorithms that have been ....
Barto, A. G., Bradtke, S. J., & Singh, S. P. (1991). Real-time learning and control using asynchronous dynamic programming, (COINS Technical Report 91-57), Amherst, MA: University of Massachusetts, Department of Computer and Information Science.
....that is to be optimized, is not a vector, but is itself a function (possibly even a function with multiple outputs) called a policy. The ranges are constrained to be [-1, 1] again. Advanced versions of this class have also been studied [Sut88] in which stochastic values play a role [BBS91] or the iterative closure of the solution [Wat89, RN95]. These are not considered here. A candidate (policy) has n_in inputs and n_out outputs. The output of the objective function is a scalar, the reinforcement signal. If the reinforcement signal has a positive value, this can be interpreted as ....
A.G. Barto, S.J. Bradtke, and S.P. Singh. Real-time learning and control using asynchronous dynamic programming. Technical Report TR-91-57, University of Massachusetts Computer Science Department, Amherst, Massachusetts, 1991.
....been obtained are for the control of Markov processes with unknown transition probabilities (e.g. [31,34]). Also relevant are formal results showing that optimal controls can often be computed using more asynchronous or incremental forms of dynamic programming than are conventionally used (e.g. [9,39,42]). Empirical (simulation) results using reinforcement learning combined with neural networks or other associative memory structures have shown robust efficient learning on a variety of nonlinear control problems (e.g. [5,13,19,20,24,25,29,32,38,43]). An overview of the role of reinforcement ....
Barto, A.G., Bradtke, S.J., Singh, S.P. (1991) Real-time learning and control using asynchronous dynamic programming. University of Massachusetts at Amherst Technical Report 91-57.
No context found.
Barto, AG, Bradtke, SJ & Singh, SP (1991). Real-Time Learning and Control using Asynchronous Dynamic Programming. COINS technical report 91-57. Amherst: University of Massachusetts.
....Analysis Running Head: Loss from Approximate Optimal Value Functions 1 Introduction Recent progress in reinforcement learning has been made by forming connections to the theory of Markov decision processes and the associated optimization method of dynamic programming (DP) (Barto et al. 1990; Barto et al. 1991; Sutton, 1988; Watkins, 1989; Sutton, 1990). Theoretical results guarantee that many DP based learning methods will find optimal solutions for a wide variety of search, planning, and control problems. Unfortunately, such results often do not sufficiently constrain the computational resources ....
....value functions which assign numeric estimates of utility to task states. A common theoretical assumption is that such functions are implemented as lookup tables, i.e. that all elements of the function's domain are individually represented and updated (e.g. Sutton, 1988; Watkins & Dayan, 1992; Barto et al. 1991; however, see Bertsekas, 1987, and Bradtke, Forthcoming, for approximation results in restricted domains). If practical concerns dictate that value functions be approximated, how might performance be affected? Is it possible that, despite some empirical evidence to the contrary (e.g. Barto et ....
[Article contains additional citation context not shown here]
Barto, A. G., Bradtke, S. J., & Singh, S. P. (1991). Real-time learning and control using asynchronous dynamic programming. Technical Report TR-91-57, Department of Computer Science, University of Massachusetts.
No context found.
A. G. Barto, S. J. Bradtke, and S. P. Singh. Real-time learning and control using asynchronous dynamic programming. Artificial Intelligence, 72:81-138, 1995.
No context found.
Andy G. Barto, Steven J. Bradtke, and Satinder P. Singh. Real-time learning and control using asynchronous dynamic programming. Technical Report COINS 91-57, Department of Computer Science, University of Massachusetts, MA, August 1991.
No context found.
Barto, A. G. and Bradtke, S. H., Real-Time Learning and Control using Asynchronous Dynamic Programming. (1991) TR 91-57, Department of Computer Science, University of Massachusetts, Amherst.
No context found.
A. G. Barto, S. J. Bradtke, and S. P. Singh, Real-Time Learning and Control Using Asynchronous Dynamic Programming. Artificial Intelligence, 72:81-138, 1995.
No context found.
Barto, AG, Bradtke, SJ & Singh, SP (1991). Real-Time Learning and Control using Asynchronous Dynamic Programming. TR 91-57, Department of Computer Science, University of Amherst, MA.
No context found.
Barto, AG, Bradtke, SJ & Singh, SP (1991). Real-Time Learning and Control using Asynchronous Dynamic Programming. TR 91-57, Department of Computer Science, University of Amherst, MA.