D. Chapman and L. P. Kaelbling, "Learning from Delayed Reinforcement in a Complex Domain", Proc. of the IJCAI, 1991.
....approach works well for 2d problems like the Car on the Hill. However, for more complex problems, these local methods fail to perform better than uniform grids. Local value based splitting is an efficient, model based, relative of the Q learning based tree splitting criteria used, for example, by (Chapman Kaelbling, 1991; Simons, Van Brussel, De Schutter, Verhaert, 1982; McCallum, 1995) But it is only when combined with new non local measures that we are able to get truly effective, near optimal performance on difficult control problems. The tree based, state space partitions in (Moore, 1991; Moore Atkeson, ....
Chapman, D., & Kaelbling, L. P. (1991). Learning from Delayed Reinforcement In a Complex Domain. In IJCAI-91.
.... utility of actions; these utilities provide the system with an optimal reactive strategy (in every situation, the system should choose the action with highest expected utility) Current research in this area does not adequately address the problem of generalizing these utility models (but see [Chapman and Kaelbling, 1990] for preliminary work on this problem) Also, the convergence of this method can be extremely slow, and convergence is only guaranteed if every possible situation is observed an unbounded number of times. Dyna Q [Sutton, 1990] uses a similar technique to learn a policy. The planning method is ....
David Chapman and Leslie Pack Kaelbling. Learning from delayed reinforcement in a complex domain. Technical Report TR-90-11, Teleos Research, December 1990.
....use a neural network [12] to learn to predict the context dependent effects of arbitrary actions. The network could then serve as the static action model and could be used to find base actions for path following behaviors. Reinforcement learning. It may be possible to use reinforcement learning [3, 23, 36, 42, 45] to learn homing and path following behaviors without the need for the primitive actions or explicit action models. An advantage of such an approach is that it does not presume that a particular model of the sensorimotor apparatus has been learned. A disadvantage is that it is difficult to train ....
David Chapman and Leslie Pack Kaelbling. Learning from delayed reinforcement in a complex domain. Tech. Report TR-90-11, Teleos Research, Palo Alto, California, December 1990.
....promising method for control systems to program and improve themselves. This paper addresses its biggest stumbling block: the curse of dimensionality [ Bellman, 1957 ] in which costs increase exponentially with the number of state variables. Some earlier work [ Simons et al. 1982, Moore, 1991, Chapman and Kaelbling, 1991, Dayan and Hinton, 1993 ] has considered recursively partitioning state space while learning from delayed rewards. The new ideas in the parti game algorithm include (i) a game theoretic splitting criterion to robustly choose spatial resolution (ii) real time incremental maintenance and planning ....
D. Chapman and L. P. Kaelbling. Learning from Delayed Reinforcement In a Complex Domain. Technical Report, Teleos Research, 1991.
....approach to solve continuous time and space control problems. We described several local splitting criteria, based on the VF or the policy approximation. Local value based splitting is an efficient, model based, relative of the Q learning based tree splitting criteria used, for example, by [ Chapman and Kaelbling, 1991; Simons et al. 1982; McCallum, 1995 ] But it is only when combined with new measures based on policy and on global considerations (influence and variance) that we are able to get truly effective, near optimal performance on our control problems. The tree based state space partitions in [ ....
D. Chapman and L. P. Kaelbling. Learning from Delayed Reinforcement In a Complex Domain. In IJCAI-91, 1991.
....and observer algorithms to IDP problems with hidden state information. Werbos [94] also discusses incorporating an observer as part of an IDP architecture. Other approaches to the issue of hidden state information in IDP are described by McCallum [59] and Chrisman [17,18] Chapman and Kaelbling [13,14] discuss the related problem of how to extract the relevant state information from a perception $y_t$ that contains much irrelevant information. We address the problem of hidden state in Chapter 5, where we show how we can use a tapped delay line state representation as input to the adaptive policy ....
Chapman, D. and Kaelbling, L. P. Learning from delayed reinforcement in a complex domain. Technical Report TR-90-11, Teleos Research, 576 Middlefield Road, Palo Alto, CA 94301, December 1990.
....by learning from delayed rewards in multidimensional, continuous state spaces with unknown system dynamics and control laws. Parti game learns a controller from a start partition to a goal partition (Moore and Atkeson, 1995) Opposite to other approaches that recursively partition state space (e.g. Chapman and Kaelbling, 1991; Dayan and Hinton, 1993) Parti game uses ideas from game theory and a database DB of all previous experiences. Parti game rests upon two prerequisites. First, the system is always able to compute its configuration (or state) In manipulator environments this is easily possible using a sensor in ....
Chapman, D. and Kaelbling, L. P. (1991). Learning from delayed reinforcement in a complex domain.
....but we can advance that the environment complexity depends on the relationship between the animat and the tasks it must solve in the environment. The first one considers the adaptation of the animat to the environment as a learning process. This idea has been exposed by several authors [15] [7] [13] For them, the adaptation process consists in learning (usually by means of a reinforcement signal) a mapping between the set of actions and the set of perceptions that reveals an adequate behavior to the environment. These methods frequently begin the learning process without any previous ....
D. Chapman and L. Kaelbling. Learning from delayed reinforcement in a complex domain. Technical Report 90-11, Teleos Research, 1990.
....task structure, the method learns quickly, creates only as much memory as needed for the task at hand, and handles noise well. Utile Suffix Memory uses a tree structured representation, and is related to work on Prediction Suffix Trees [Ron et al. 1994] Parti game [Moore, 1993] G algorithm [Chapman and Kaelbling, 1991] , and Variable Resolution Dynamic Programming [Moore, 1991] 1 INTRODUCTION The sensory systems of embedded agents are inherently limited. When a reinforcement learning agent's sensory limitations hide features of the environment from the agent, we say that the agent suffers from hidden state. ....
....technique and desired features from UDM and NSM, but many of its ideas come from the combination of other algorithms too. Ideas from four algorithms in particular inspired the workings of USM; these are: Probabilistic Suffix Tree Learning [Ron et al. 1994] Parti game [Moore, 1993] G algorithm [Chapman and Kaelbling, 1991] and Variable Resolution Dynamic Programming [Moore, 1991] All four of the algorithms use trees to represent distinctions and grow the trees in order to learn finer distinctions. Probabilistic Suffix Tree. USM has more in common with Probabilistic Suffix Tree Learning (PSTL) than any other ....
David Chapman and Leslie Pack Kaelbling. Learning from delayed reinforcement in a complex domain. In Twelfth International Joint Conference on Artificial Intelligence, 1991.
....highway. The task involves hidden state, time pressure, stochasticity, a large world state space, and a large perceptual state space. U Tree uses a tree structured representation, and is related to work on Prediction Suffix Trees (Ron, Singer, Tishby 1994) Parti game (Moore 1993) G algorithm (Chapman Kaelbling 1991), and Variable Resolution Dynamic Programming (Moore 1991) U Tree is a direct descendant of Utile Suffix Memory (McCallum 1995c) which used short term memory, but not selective perception. Unlike Whitehead's Lion algorithm, the algorithm handles noise, large state spaces, and uses short term ....
Chapman, D., and Kaelbling, L. P. 1991. Learning from delayed reinforcement in a complex domain. In Twelfth International Joint Conference on Artificial Intelligence.
....dimension of internal state space determined by inputs other than the current perception. [ Brooks, 1991a; Wixson and Ballard, 1991 ] However, recent research in learning for robotics has strongly emphasized purely reactive agents [ Maes and Brooks, 1990; Sutton, 1991; Mahadevan and Connell, 1991; Chapman and Kaelbling, 1991 ] Even past research on learning with selective perception has used a purely reactive approach. In [ Whitehead, 1992 ] the agent uses a selective perception system based on two markers [ Agre and Chapman, 1987 ] to perform manipulations in a blocks world. Whitehead's Lion algorithm conquers the ....
....splitting this state would help the agent more consistently predict reward. One split may allow further distinctions because it will create new separate transitions into other states. In that UDM can recursively build a tree of distinctions it is similar to Chapman and Kaelbling's G algorithm [ Chapman and Kaelbling, 1991 ] except that UDM builds distinctions in memory space instead of perception space. On the incoming transitions to a state, UDM stores statistics about the return received after leaving the state in question. If the state is Markovian with respect to return, then the return values on all the ....
[Article contains additional citation context not shown here]
David Chapman and Leslie Pack Kaelbling. Learning from delayed reinforcement in a complex domain. In Proceedings of IJCAI, 1991.
....approach to solve continuous time and space control problems. We described several local splitting criteria, based on the VF or the policy approximation. Local value based splitting is an efficient, model based, relative of the Q learning based tree splitting criteria used, for example, by [ Chapman and Kaelbling, 1991; Simons et al. 1982; McCallum, 1995 ] But it is only when combined with new measures based on policy and influence that we are able to get truly effective, near optimal performance on our control problems. The tree based state space partitions in [ Moore, 1991; Moore and Atkeson, 1995 ] were ....
D. Chapman and L. P. Kaelbling. Learning from Delayed Reinforcement In a Complex Domain. In IJCAI-91, 1991.
....for separating noise from task structure, the method learns quickly, creates only task relevant state distinctions, and handles noise well. U Tree uses a tree structured representation, and is related to work on Prediction Suffix Trees [Ron et al. 1994] Parti game [Moore, 1993] G algorithm [Chapman and Kaelbling, 1991] , and Variable Resolution Dynamic Programming [Moore, 1991] It builds on Utile Suffix Memory [McCallum, 1995c] which only used short term memory, not selective perception. The algorithm is demonstrated solving a highway driving task in which the agent weaves around slower and faster traffic. ....
....Nearest Sequence Memory [McCallum, 1995b] but many of its ideas come from the combination of other algorithms too. Ideas from four algorithms in particular inspired the workings of U Tree; these are: Probabilistic Suffix Tree Learning [Ron et al. 1994] Parti game [Moore, 1993] G algorithm [Chapman and Kaelbling, 1991] and Variable Resolution Dynamic Programming [Moore, 1991] All four of the algorithms use trees to represent distinctions and grow the trees in order to learn finer distinctions. Probabilistic Suffix Tree. U Tree has more in common with Probabilistic Suffix Tree Learning (PSTL) than any other ....
David Chapman and Leslie Pack Kaelbling. Learning from delayed reinforcement in a complex domain. In Twelfth International Joint Conference on Artificial Intelligence, 1991.
....state spaces (i.e. hidden state) by augmenting the provided features with short term memory of past features. The algorithm, called U Tree, uses a tree structured representation, and is related to work on Prediction Suffix Trees (Ron, Singer, Tishby 1994) Parti game (Moore 1993) G algorithm (Chapman Kaelbling 1991), and Variable Resolution Dynamic Programming (Moore 1991) U Tree is a direct descendant of Utile Suffix Memory (McCallum 1995c) which used short term memory, but not selective perception. U Tree is demonstrated solving a highway driving task in which the agent weaves around slower and faster ....
Chapman, D., and Kaelbling, L. P. 1991. Learning from delayed reinforcement in a complex domain. In Twelfth International Joint Conference on Artificial Intelligence.
.... by combining instance based state identification with the structure noise separation method from Utile Distinction Memory [31] The algorithm, called Utile Suffix Memory, uses a tree structured representation, and is related to work on Prediction Suffix Trees [43] Parti game [36] G algorithm [11], and Variable Resolution Dynamic Programming [34] Preliminary results are very promising. See [32, 33] for more details. B.3 Separating Instance Based Learning and Dynamic Programming As pointed out in the section on details of the algorithm, Nearest Sequence Memory performs two kinds of ....
D. Chapman and L. P. Kaelbling, "Learning from delayed reinforcement in a complex domain," in Twelfth International Joint Conference on Artificial Intelligence, 1991.
.... a local linear model, LQR was able to create a stable controller based on only 31 state transitions Other current investigations which attempt to perform generalization in conjunction with reinforcement learning are Mahadevan and Connell (1990) which investigates clustering parts of the policy, Chapman and Kaelbling (1990) which investigates automatic detection of locally relevant state variables, and Singh (1991) which considers how to automatically discover the structure in tasks such as the multiple flags example of Figure 16. 7.1 Related work The Dyna Q queue algorithm of Peng and Williams Peng and Williams ....
Chapman, D. and Kaelbling, L. P. (1990). Learning from Delayed Reinforcement In a Complex Domain. Technical Report No. TR-90-11, Teleos Research.
....and splitting this state would help the agent more consistently predict reward. One split may allow further distinctions because it will create new separate transitions into other states. In that UDM can recursively build a tree of distinctions it is similar to Chapman and Kaelbling's G algorithm [ Chapman and Kaelbling, 1991 ] except that it builds distinctions in memory space instead of perception space. 3 DETAILS OF THE ALGORITHM A slightly longer description of UDM can be found in [ McCallum, 1992 ] A hidden Markov model is comprised of a finite set of states, $S = \{s_1, s_2, \ldots, s_N\}$ and a finite number ....
....distinctions that predict future reward. For this reason UDM begins with one state per percept, a strategy that will obviously not work for large perception spaces. Some method for making perceptual distinctions will be necessary, and it seems plausible that Chapman and Kaelbling's G algorithm [ Chapman and Kaelbling, 1991 ] or even some technique based on confidence intervals for the unused perception bits in a state should work in conjunction with UDM. Although UDM can build memory chains of arbitrary length, it does require that some statistically significant benefit be detectable for each split individually in ....
[Article contains additional citation context not shown here]
David Chapman and Leslie Pack Kaelbling. Learning from delayed reinforcement in a complex domain. In Proceedings of IJCAI, 1991.
....of [19], [4] the naive grid approach has a number of dangers which will be detailed in this paper. This paper studies the pitfalls of discretization during reinforcement learning and then introduces the parti game algorithm. Some earlier work [26], [20], [8], [10] considered recursively partitioning state space while learning from delayed rewards. The new ideas in the parti game algorithm include (i) a game theoretic splitting criterion to robustly choose spatial resolution, (ii) real time incremental maintenance and planning with a database of ....
....in real valued multivariate state spaces where straightforward discretization would fall prey to the curse of dimensionality. This is another approach to partitioning state space but has the drawback that, unlike parti game, it requires a guess at an initially valid trajectory through state space. [8] proposed an interesting algorithm, which used more sophisticated statistics to decide which attributes to split. Their objectives were very hard because they wished to avoid remembering transitions between cells and they did not assume continuous paths through state space, and so they obtained ....
[Article contains additional citation context not shown here]
D. Chapman and L. P. Kaelbling. Learning from Delayed Reinforcement In a Complex Domain. Technical Report, Teleos Research, 1991.
....approach works well for 2d problems like the Car on the Hill. However, for more complex problems, these local methods fail to perform better than uniform grids. Local value based splitting is an efficient, model based, relative of the Q learning based tree splitting criteria used, for example, by (Chapman Kaelbling, 1991; Simons, Van Brussel, De Schutter, Verhaert, 1982; McCallum, 1995) But it is only when combined with new non local measures that we are able to get truly effective, near optimal performance on our control problems. The tree based state space partitions in (Moore, 1991; Moore Atkeson, 1995) ....
Chapman, D., & Kaelbling, L. P. (1991). Learning from Delayed Reinforcement In a Complex Domain.
No context found.
David Chapman and Leslie Pack Kaelbling. Learning from delayed reinforcement in a complex domain. Technical Report No. TR-90-11, Teleos, 1990.
....Airport Hierarchy. It differs from HDG and is more similar to Feudal Learning [ Dayan and Hinton, 1993 ] in that each state is a member of many partitions: some large and abstract, others small and specific. It differs from Feudal Learning and other partitioning structures (such as G learning [ Chapman and Kaelbling, 1991 ] and PartiGame [ Moore, 1994 ] in that junior partitions are not necessarily subsets of their seniors, and many partitions are overlapping. We will begin by defining the properties that we do require from these partitions, before proceeding to execution and generation algorithms that exploit ....
D. Chapman and L. P. Kaelbling. Learning from Delayed Reinforcement In a Complex Domain. In IJCAI-91, 1991.
....Airport Hierarchy. It differs from HDG and is more similar to Feudal Learning [3] in that each state is a member of many partitions: some large and abstract, others small and specific. It differs from Feudal Learning and other multi level hierarchical partitioning structures (such as G learning [2] and PartiGame [6] in that junior partitions are not necessarily subsets of their seniors, and many partitions are overlapping. We will begin by defining the properties that we do require from these partitions, before proceeding to execution and generation algorithms that exploit these ....
D. Chapman and L. P. Kaelbling. Learning from Delayed Reinforcement In a Complex Domain. Technical Report, Teleos Research, 1991.
No context found.
D. Chapman and L. P. Kaelbling, "Learning from Delayed Reinforcement in a Complex Domain", Proc. of the IJCAI, 1991.
No context found.
Chapman, D. and Kaelbling, L. P. (1990). Learning from delayed reinforcement in a complex domain. Tech. Report TR-90-11, Teleos Research, Palo Alto, California.
No context found.
Chapman D., Kaelbling L. (1991): "Learning from delayed reinforcement in a complex domain", Procs. of the IJCAI.