R. S. Sutton, "Integrated Architectures for Learning, Planning and Reacting Based on Approximating Dynamic Programming", in Proceedings of the 7th Int. Conf. on Machine Learning, June 1990.
....i = U i t ) In the proposed Greedy function, exploration (undirected and directed) is introduced by a random vector and a counter associated with actions. Many other exploration-exploitation techniques exist, such as undirected methods based on the Boltzmann distribution [14], [39], [31], or directed methods such as recency-based [39], error-based [40], or interval estimation [41]. To tune the w vector, the actor learning rule uses the principles of the DP policy iteration procedure: let $V_t(S_t)$ be a relevant measure to trigger action $U_t$ (the one corresponding to the ....
....exploration (undirected and directed) is introduced by a random vector and a counter associated with actions. Many other exploration-exploitation techniques exist, such as undirected methods based on the Boltzmann distribution [14], [39], [31], or directed methods such as recency-based [39], error-based [40], or interval estimation [41]. To tune the w vector, the actor learning rule uses the principles of the DP policy iteration procedure: let $V_t(S_t)$ be a relevant measure to trigger action $U_t$ (the one corresponding to the optimal policy found at time step t and called ....
R.S. Sutton, "Integrated architectures for learning, planning, and reacting based on approximating dynamic programming," in Proceedings of the Seventh International Conference on Machine Learning, San Mateo, 1990, pp. 216--224.
....environments: the real world, with concrete rewards, and a world model, which is learned. The agent learns from its experiences in both worlds. Experience in the real world is also used to update the agent's world model. Experience in the world model is analogous to trying things in your head [8]. Several methods have been proposed to expand the Dyna architecture in such a way as to motivate the agent to explore useful parts of the state space. Sutton proposed an exploration bonus which causes unexplored parts of the state space to be estimated with high values. This encourages the agent ....
....a way as to motivate the agent to explore useful parts of the state space. Sutton proposed an exploration bonus which causes unexplored parts of the state space to be estimated with high values. This encourages the agent to seek out novel situations and events (i.e., to be curious or optimistic) [8]. Peng and Williams [9] and Moore and Atkeson [10] both developed systems which prioritize states for exploration based on prediction difference. When the agent's value estimate is shown to be incorrect for a particular state, the predecessors of that state become candidates for Dyna's ....
Richard S. Sutton. "Integrated Architectures for Learning, Planning and Reacting based on Approximating Dynamic Programming." Proceedings of the Seventh International Conference on Machine Learning, 1990.
....Dynamic Expectancy Model (DEM) which combines an unsupervised learning component with a mechanism to create Policy Maps for action selection dynamically from this learned information as tasks arise or task priorities change. Action selection and learning models based on Reinforcement Learning ([18], [20], [21]; [5] for a broader survey) propagate the effects of occasional reward backward to rank sense-act pairings in the form of a policy map according to iteratively developed estimates of future reward. Classifier Systems ([3]) also adopt the notion of propagating the effects of ....
....cost from that Sign to the top goal. In use it has the form of a reactive look-up table: "if this is the current situation, then select this action". In this respect, it is similar to the policy map constructed by some classes of reinforcement learning algorithms, notably those using Q-learning ([18], [20], [21]). The DEM is profoundly different in that the policy is computed (and re-computed) frequently and quickly, relative to a specific goal, rather than as an iteratively formed policy estimating future discounted and anonymous rewards. While not a plan in the conventional sense, it is ....
Sutton, R.S. (1990) "Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming", in Proc. 7th Int. Conf. on Machine Learning, pp. 216-224.
.... each control decision: the updates of the values can take place incrementally in a series of time intervals interleaved with real-time control decisions (note that this implies a change in the overall operation of the method of this section). Techniques like this are described in [2], [21], and [26]. Employing the minimax algorithm [29], i.e., searching the tree of all paths leading to the goal, is an alternative approach that could have been used to solve the minimax shortest path problem. The value of can be if, when we are at cell , our adversary can permanently prevent us from reaching ....
R. S. Sutton, "Integrated architectures for learning, planning, and reacting based on approximating dynamic programming," in Proc. 7th Int. Conf. Machine Learning (ML'90). San Mateo, CA: Morgan Kaufmann, 1990, pp. 216--224.
.... network implementations that naturally provide some kind of input generalisation [2], attempts to explicitly define a convenient coding have been made using, for instance, the CMAC algorithm [1], [4]. Another approach is the use of additional models on which experimental actions are performed [5]. We investigated a conceptually simpler technique based on the spreading of cost predictions towards a neighbourhood of the visited state in the state space, with an influence weighted by a similarity function. The approach is particularly appealing in environments where this similarity is ....
....of the agent in any of these worlds is to avoid any of the trapping states, from which transitions to other states are impossible. At each time step t, one of two possible actions, $-1$ or $1$, must be chosen by the agent and, as a consequence, it moves to a neighbouring state $x_{t+1}$ with probability $P(x_{t+1} \mid x_t, a_t)$. [Fig. 2: Distribution of action values for process W1] For every state x defined by its coordinates (a, b), a neighbouring state is any state $z = (a', b')$ such that $d(x, z) \le 1.0$, where d is the Euclidean distance. The reinforcement ....
R.S. Sutton, "Integrated Architectures for Learning, Planning and Reacting Based on Approximating Dynamic Programming", Procs. of 7th ICML, pp. 216-224, 1990.
....external reinforcement is taken. In our system the action or steering signal is generated by the Associative Search Element or ASE (see Fig. 2) and is a probabilistic function of the state. If the system is in state j, an action $u_i$ is selected according to a Boltzmann distribution, as used by [12]: $P(u_i \mid x_j) = e^{w_{ij}} / \sum_i e^{w_{ij}}$ (2), where $u_i$ is drawn from the set $U = \{\text{left}, \text{right}\}$ and $w_{ij}$ denotes the weight between state $x_j$ and the ASE. The external reinforcement signal r is in our application set to $-1$ upon collision and is 0 otherwise. However, the collision signal ....
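The Boltzmann action selection described in this excerpt can be illustrated with a minimal Python sketch. It is not taken from the citing paper: the function names `boltzmann_probabilities` and `select_action` are hypothetical, the weights are made-up values, and the maximum weight is subtracted before exponentiation purely for numerical stability (this does not change the resulting probabilities).

```python
import math
import random


def boltzmann_probabilities(weights):
    """Return P(u_i | x_j) for one state x_j, given the weights w_ij of its actions."""
    # Subtracting the maximum avoids overflow in exp(); the ratio is unchanged.
    top = max(weights)
    exps = [math.exp(w - top) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(actions, weights, rng=random):
    """Draw one action at random according to the Boltzmann probabilities."""
    probs = boltzmann_probabilities(weights)
    return rng.choices(actions, weights=probs, k=1)[0]


# Hypothetical weights for the two steering actions in a single state.
actions = ["left", "right"]
probs = boltzmann_probabilities([1.0, 0.0])
choice = select_action(actions, [1.0, 0.0], random.Random(0))
```

With equal weights both actions are equally likely; raising one weight shifts probability towards that action without ever ruling the other out, which is what makes the scheme an undirected exploration method.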
Sutton, R.S. "Integrated architectures for learning, planning and reacting based on approximating dynamic programming", Proc. of the Seventh Int. Conf. on Machine Learning, 1990
....model f is required, one can make more powerful use of it. According to the certainty equivalence principle, f can substitute for the real world, and planning can be run in mental simulations instead of interaction with the real world. In reinforcement learning, this idea was originally pursued by Sutton's (1990) DYNA algorithms for discrete state-action spaces. Here we will explore to what extent a continuous version of DYNA, DYNA-CTD, can help in learning from demonstration. The only difference compared to CTD in Section 2.1.1 is that after every real trial, DYNA-CTD performs five mental trials in which ....
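The idea of interleaving real experience with simulated experience from a learned model can be sketched as follows. This is a tabular Dyna-style illustration under stated assumptions, not the continuous method of the excerpt: the chain environment, the function name `dyna_q`, and all parameter values are hypothetical, and the sketch runs five planning updates after every real step rather than five mental trials after every real trial.

```python
import random


def dyna_q(n_states=5, episodes=30, planning_steps=5,
           alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Dyna-style Q-learning on a hypothetical chain with a goal at the right end."""
    rng = random.Random(seed)
    actions = (-1, 1)
    q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    model = {}  # learned world model: (state, action) -> (reward, next_state, done)

    def step(s, a):
        # Real environment: deterministic move, reward 1 on reaching the last state.
        s2 = min(max(s + a, 0), n_states - 1)
        done = s2 == n_states - 1
        return (1.0 if done else 0.0), s2, done

    def update(s, a, r, s2, done):
        # One-step Q-learning backup, used for real and simulated experience alike.
        target = r if done else r + gamma * max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

    def greedy(s):
        best = max(q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if q[(s, a)] == best])

    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s2, done = step(s, a)          # act in the real world
            update(s, a, r, s2, done)         # learn from real experience
            model[(s, a)] = (r, s2, done)     # update the world model
            for _ in range(planning_steps):   # "mental" experience from the model
                ps, pa = rng.choice(list(model))
                pr, ps2, pdone = model[(ps, pa)]
                update(ps, pa, pr, ps2, pdone)
            s = s2
    return q


q = dyna_q()
```

Setting `planning_steps=0` reduces the sketch to plain Q-learning, which makes it easy to compare how much the simulated updates speed up value propagation along the chain.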
Sutton, R. S. (1990). "Integrated architectures for learning, planning, and reacting based on approximating dynamic programming." In: Proceedings of the International Machine Learning Conference.
....of an exploration, and then choose the exploration with the best expected influence on the current decision. Their work is complementary to ours, in the sense that they have focussed on more effective exploration, and we have focussed on better decision making. Sutton, Barto and others [Sut91, Sut90b, Sut90a, BBS93] have proposed a dynamic programming approach to incrementally generating plans in situations where the same or similar problems recur. This is a very interesting approach that can be easily extended to stochastic problem domains [BBS93]. Their work focusses on learning to solve a ....
Richard S. Sutton. "Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming." In Proceedings of the Seventh International Workshop on Machine Learning (ML90), pp. 216--224, 1990.
....and are, therefore, wasted. Thus, by optimizing the cost function, the problem solver learns to avoid barriers automatically. # 6) Managing a History of Recent Decisions Available solutions to this problem use either the episode structure of problem instances [43, 132] or the Markovian property [5, 177] to limit the amount of information stored. When the problem variables do not vary with time, the state of the external environment does not change outside the control of the problem solver; in this case, the episode ends when the final feedback signal related to this instance is received. This ....
....techniques require a large number of problem solving episodes. The amount of information extracted from each example is small relative to Dietterich and Buchanan's model. That causes slowness of learning, which is an important factor behind several recent proposals for hybrid architectures [177, 196]. Accurate and efficient estimation of strategic knowledge using statistical techniques requires resolution of the exploration-convergence dilemma. Systems based on Minsky's model work only with problem solvers employing stochastic strategies: the randomness of such strategies is a natural ....
R. S. Sutton, "Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming," Proc. 7th Int'l. Conf. Machine Learning, pp. 216-224, Morgan Kaufmann, Palo Alto, CA, 1990.
....$\left( \sum_{t' \ge t} \gamma^{t'-t}\, r(t'+1) \right)$, with $\gamma$ between 0 and 1 (1). This evaluation function is approximated by learning values $v_i$ for each discrete region i. Comparing evaluations at successive time steps in r(t) gives a useful means of evaluating the system's performance. Similar to [2], the function $u = F(x)$ is a stochastic function where, for a state $x_j$, a steering signal $u(t)$ is selected according to a Boltzmann distribution: $P(u_i \mid x_j) = e^{w_{ij}} / \sum_i e^{w_{ij}}$ (2), where $w_{ij}$ is the weight of the connection between state $x_j$ and action $u_i$. The correct mapping is obtained ....
....actions taken at time steps $t'' \le t$. A negative value of r(t) punishes the steering signals selected at previous time steps by reducing the probability of their selection $P(u_i \mid x_j)$. A positive value for r increases this probability. This procedure for updating the weights is adapted from [2]. A problem with reinforcement learning is the a priori selection of the resolution and region boundaries of the state space. Instead of a fixed quantisation of state space, an adaptive vector quantisation scheme can be used. When the $x_i$'s are considered as locally tuned neurons, a ....
....In fact, we envision a technique for creating a more intelligent, adaptive player. 6. We would like to create an imagination in the form of a limited, downloadable simulator, which will help the robot to speculate on the future. A related technique has been described for the DYNA system [S1] [S2]. ....
R. S. Sutton, "Integrated architectures for learning, planning, and reacting based on approximating dynamic programming," Proceedings of the Seventh International Conference on Machine Learning, 216--224, Morgan Kaufmann, 1990.
.... and generalization of the work of Singh (1992), Dayan (1993), and Sutton & Pinette (1985). 1 Multi-Scale Planning and Modeling Model-based reinforcement learning offers a potentially elegant solution to the problem of integrating planning into a real-time learning and decision-making agent (Sutton, 1990; Barto et al., 1995; Peng & Williams, 1993; Moore & Atkeson, 1994; Dean et al., in prep). However, most current reinforcement-learning systems assume a single, fixed time step: actions take one step to complete, and their immediate consequences become available after one step. This makes it ....
Sutton, R. S. (1990) "Integrated architectures for learning, planning, and reacting based on approximating dynamic programming," Proceedings of the Seventh International Conference on Machine Learning, 216--224.
R. S. Sutton, "Integrated Architectures for Learning, Planning and Reacting Based on Approximating Dynamic Programming", in Proceedings of the 7th Int. Conf. on Machine Learning, June 1990.
R. S. Sutton, "Integrated Architectures For Learning, Planning, and Reacting Based on Approximating Dynamic Programming," Proceedings of the Seventh Int. Conf. on Machine Learning, pp. 216-224, Morgan Kaufmann, 1990.
R. S. Sutton, "Integrated Architectures For Learning, Planning, and Reacting Based on Approximating Dynamic Programming," Proceedings of the 7th International Conference on Machine Learning, pp. 216-224, Morgan Kaufmann, 1990.
R. S. Sutton, "Integrated Architectures for Learning, Planning, and Reacting based on Approximating Dynamic Programming," in Proceedings of the Seventh International Conference on Machine Learning, 1990.
R. Sutton, "Integrated architectures for learning, planning, and reacting based on approximating dynamic programming," in Proceedings of the International Conference on Machine Learning (ICML), 1990.
Sutton, R., "Integrated Architectures for Learning, Planning, and Reacting based on Approximating Dynamic Programming," Machine Learning: Proceedings of the Seventh International Workshop, pp. 216-224, San Francisco: Morgan Kaufmann, 1990.
Rich Sutton, "Integrated Architectures for Learning, Planning and Reacting based on Approximating Dynamic Programming," in Proceedings of the Seventh International Conference on Machine Learning, Austin, Texas, June 1990.
Rich S. Sutton, "Integrated Architectures for Learning, Planning and Reacting based on Approximating Dynamic Programming," in Proceedings of the Seventh International Conference on Machine Learning, Austin, TX, June 1990.
[Sutton 1990] Sutton, R. S., "Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming," Seventh International Conference on Machine Learning, Morgan Kaufmann, San Mateo, CA, June 1990.
R.S. Sutton, "Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming," Proceedings of the Seventh International Conference on Machine Learning, Morgan Kaufmann, San Mateo, CA, 216-224, 1990.