
## Recent advances in hierarchical reinforcement learning (2003)


### Download Links

- [www.fias.uni-frankfurt.de]
- [www-anw.cs.umass.edu]
- [fias.uni-frankfurt.de]
- [www-all.cs.umass.edu]
- [www.cs.iastate.edu]
- [www.cs.umass.edu]
- [www.cs.utexas.edu]
- [people.cs.umass.edu]
- DBLP

### Other Repositories/Bibliography

Citations: 227 (24 self)

### Citations

5613 | Reinforcement Learning: An Introduction
- Sutton, Barto
- 1999
Citation Context ...sing partial observability. Concluding remarks address open challenges facing the further development of reinforcement learning in a hierarchical setting. 1. Introduction Reinforcement learning (RL) [5, 72] is an active area of machine learning research that is also receiving attention from the fields of decision theory, operations research, and control engineering. RL algorithms address the problem of ...

1714 | Reinforcement learning: A survey
- Kaelbling, Littman, et al.
- 1996
Citation Context ...ent interest is attributable to Werbos [85, 86, 87], Watkins [82], and Tesauro's backgammon-playing system TD-Gammon [75, 76]. Additional information about RL can be found in several references (e.g., [2, 5, 32, 72]). Despite the utility of RL methods in many applications, the amount of time they can take to form acceptable approximate solutions can still be unacceptable. As a result, RL researchers are investig...

1672 | Learning from Delayed Rewards
- Watkins
- 1989
Citation Context ...he adjustable parameters (e.g., multilayer neural networks) can be effective for difficult problems (e.g., refs. [11, 40, 64, 75]). Of the many RL algorithms, perhaps the most widely used are Q-learning [82, 83] and Sarsa [59, 70]. Q-learning is based on the DP backup (5) but with the expected immediate reward and the expected maximum action-value of the successor state on the right-hand side of (5) respectiv...
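The Q-learning and Sarsa backups this context contrasts can be sketched in a few lines of tabular code. This is an illustration only, assuming a step size `alpha`, discount `gamma`, and generic state/action labels; none of these names come from the paper:

```python
from collections import defaultdict

# Tabular Q-learning vs. Sarsa updates. alpha (step size) and gamma
# (discount factor) are assumed example values, not the paper's.
alpha, gamma = 0.1, 0.95
Q = defaultdict(float)  # Q[(state, action)] -> estimated action-value

def q_learning_update(s, a, r, s_next, actions):
    """Off-policy backup: bootstrap on the maximum successor action-value."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy backup: bootstrap on the action actually taken next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

The only difference is the bootstrap term: the max over successor actions (off-policy) versus the value of the action the agent actually selects next (on-policy).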

1180 | Nonlinear Programming
- Bertsekas
- 1999
Citation Context ...sing partial observability. Concluding remarks address open challenges facing the further development of reinforcement learning in a hierarchical setting. 1. Introduction Reinforcement learning (RL) (Bertsekas and Tsitsiklis, 1996; Sutton and Barto 1998) is an active area of machine learning research that is also receiving attention from the fields of decision theory, operations research, and control engineering. RL algorithms ...

1095 | Planning and acting in partially observable stochastic domains.
- Kaelbling, Littman, et al.
- 1998
Citation Context ...proach is formalized in terms of partially observable Markov decision processes (POMDPs), where agents learn policies over belief states, i.e., probability distributions over the underlying state set [31]. It can be shown that belief states satisfy the Markov property and consequently yield a new (and more complex) MDP over information states. Belief states can be recursively updated using the transit...
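The recursive belief-state update mentioned at the end of this context can be sketched as a predict-then-correct step: the new belief is proportional to the observation likelihood times the one-step predicted state distribution. The dictionaries `P` (transitions) and `O` (observation probabilities) and the two-state example in the test are made-up illustrations, not the paper's notation:

```python
# Recursive POMDP belief update: b'(s') ∝ O(y|a,s') * sum_s P(s'|s,a) b(s).
# P[(s, a, s2)] and O[(a, s2, y)] are hypothetical model containers.

def belief_update(b, a, y, P, O, states):
    """b: dict state -> probability; returns the updated, normalized belief."""
    new_b = {}
    for s2 in states:
        pred = sum(P[(s, a, s2)] * b[s] for s in states)  # prediction step
        new_b[s2] = O[(a, s2, y)] * pred                  # observation correction
    z = sum(new_b.values())                               # normalizing constant
    return {s2: p / z for s2, p in new_b.items()}
```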

822 | Multiagent systems: a modern approach to distributed artificial intelligence
- Weiss
- 1999
Citation Context ... where the two agents, A1 and A2, will maximize their performance at the task if they learn to coordinate with each other. Here, we want to design learning algorithms for cooperative multiagent tasks [84], where the agents learn the coordination skills by trial and error. The key idea here is that coordination skills are learned more efficiently if agents learn to synchronize using a hierarchical repres...

780 | Some studies in machine learning using the game of checkers II: Recent progress.
- Samuel
- 1967
Citation Context ...state of understanding rather than the intuition underlying the origination of these methods. Indeed, DP-based learning originated at least as far back as Samuel's famous checkers player of the 1950s [61, 60], which, however, made no reference to the DP literature existing at that time. Other early RL research was explicitly motivated by animal behavior and its neural basis [45, 33, 34, 71]. Much of the c...

632 | Learning to act using real-time dynamic programming
- Barto, Bradtke, et al.
- 1995
Citation Context ...ent interest is attributable to Werbos [85, 86, 87], Watkins [82], and Tesauro's backgammon-playing system TD-Gammon [75, 76]. Additional information about RL can be found in several references (e.g., [2, 5, 32, 72]). Despite the utility of RL methods in many applications, the amount of time they can take to form acceptable approximate solutions can still be unacceptable. As a result, RL researchers are investig...

601 | Markov games as a framework for multi-agent reinforcement learning
- Littman
- 1994
Citation Context ...s the utility of A2 picking up trash from T1 if A1 is also picking up from the same bin, and so on). The proposed approach differs significantly from previous work in multiagent reinforcement learning [38, 74] in using hierarchical task structure to accelerate learning, and as well in its use of concurrent activities. To illustrate the use of this decomposition in learning multiagent coordination, for the ...

568 | Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning
- Sutton, Precup, et al.
- 1999
Citation Context ...erally defined for a subset of the state set. The partial policies must also have well-defined termination conditions. These partial policies are sometimes called temporally-extended actions, options [73], skills [80], behaviors [9, 27], or the more control-theoretic modes [22]. When not discussing a specific formalism, we will use the term activity, as suggested by Harel [23]. For MDPs, this extensio...

549 | A Model for Reasoning About Persistence and Causation
- Dean, Kanazawa
- 1989
Citation Context ...nal structure [17]. Much work in artificial intelligence has focused on exploiting this structure to develop compact representations of single-step actions (e.g., the Dynamic Bayes Net representation [13]). A natural question to consider is how to extend these single-step compact models into compact models of temporally-extended activities, such as options. The problem is a bit subtle since even if al...

520 | Learning and executing generalized robot plans
- Fikes, Hart, et al.
- 1972
Citation Context ...ficial intelligence researchers have addressed the need for large-scale planning and problem solving by introducing various forms of abstraction into problem solving and planning systems, e.g., refs. [18, 37]. Abstraction allows a system to ignore details that are irrelevant for the task at hand. One of the simplest types of abstraction is the idea of a "macro-operator," or just a "macro," which is a sequ...

483 | Dynamic Programming: Deterministic and Stochastic Models
- Bertsekas
- 1987
Citation Context ...heir properties. Here we briefly describe this well-known framework, with a few twists characteristic of how it is used in RL research; additional details can be found in many references (e.g., refs. [4, 5, 55, 58, 72]). A finite MDP models the following type of problem. At each stage in a sequence of stages, an agent (the controller) observes a system's state s, contained in a finite set S, and executes an action ...

442 | Hierarchical reinforcement learning with the MAXQ value function decomposition
- Dietterich
Citation Context ...es to hierarchical RL: the options formalism of Sutton, Precup, and Singh [73], the hierarchies of abstract machines (HAMs) approach of Parr and Russell [48, 49], and the MAXQ framework of Dietterich [14]. Although these approaches were developed relatively independently, they have many elements in common. In particular, they all rely on the theory of semi-Markov decision processes to provide a formal...

433 | Generalization in reinforcement learning: Successful examples using sparse coarse coding
- Sutton
- 1996
Citation Context ...eters (e.g., multilayer neural networks) can be effective for difficult problems (e.g., refs. [11, 40, 64, 75]). Of the many RL algorithms, perhaps the most widely used are Q-learning [82, 83] and Sarsa [59, 70]. Q-learning is based on the DP backup (5) but with the expected immediate reward and the expected maximum action-value of the successor state on the right-hand side of (5) respectively replaced by a s...

415 | Practical issues in temporal difference learning.
- Tesauro
- 1992
Citation Context ...ing function approximation methods for accumulating value function information, RL algorithms have produced good results on problems that pose significant challenges for standard methods (e.g., refs. [11, 75]). However, current RL methods by no means completely circumvent the curse of dimensionality: the exponential growth of the number of parameters to be learned with the size of any compact encoding of ...

411 | The complexity of decentralized control of Markov decision processes.
- Bernstein, Zilberstein, et al.
- 2000
Citation Context ...f observing y if action a was performed and resulted in state s. However, mapping belief states to optimal actions is known to be intractable, particularly in the decentralized multiagent formulation [3]. Also, learning a perfect model of the underlying POMDP is a challenging task. An empirically more effective (but theoretically less powerful) approach is to use finite memory models as linear chains ...

381 | Online Q-learning using connectionist systems.
- Rummery, Niranjan
- 1994
Citation Context ...eters (e.g., multilayer neural networks) can be effective for difficult problems (e.g., refs. [11, 40, 64, 75]). Of the many RL algorithms, perhaps the most widely used are Q-learning [82, 83] and Sarsa [59, 70]. Q-learning is based on the DP backup (5) but with the expected immediate reward and the expected maximum action-value of the successor state on the right-hand side of (5) respectively replaced by a s...

354 | Introduction to Stochastic Dynamic Programming
- Ross
- 1983
Citation Context ...heir properties. Here we briefly describe this well-known framework, with a few twists characteristic of how it is used in RL research; additional details can be found in many references (e.g., refs. [4, 5, 55, 58, 72]). A finite MDP models the following type of problem. At each stage in a sequence of stages, an agent (the controller) observes a system's state s, contained in a finite set S, and executes an action ...

328 | Transition network grammars for natural language analysis
- Woods
- 1970
Citation Context ... do not need to be treated as part of the program state, a point we gloss over in our discussion. This kind of machine hierarchy is an instance of a Recursive Transition Network as discussed by Woods [88]. ...can be applied to reduce(H ∘ M) to approximate optimal policies for H ∘ M. The important strength of an RL method like SMDP Q-learning in this context is that it can be applied to reduce(H ∘ M) w...

326 | The hierarchical hidden markov model: Analysis and applications. Machine Learning 32(1):41–62
- Fine, Singer, et al.
- 1998
Citation Context ...lanning algorithms scale poorly with model size. Theocharous et al. [79] developed a hierarchical POMDP formalism, termed H-POMDPs (Figure 7), by extending the hierarchical hidden Markov model (HHMM) [19] to include rewards and temporally-extended ...

322 | Reinforcement learning with selective perception and hidden state
- McCallum
- 1996
Citation Context ...f the underlying POMDP is a challenging task. An empirically more effective (but theoretically less powerful) approach is to use finite memory models as linear chains or nonlinear trees over histories [42]. However, such finite memory structures can be defeated by long sequences of mostly irrelevant observations and actions that conceal a critical past observation. We briefly summarize three multiscale...

313 | An analysis of temporal-difference learning with function approximation.
- Tsitsiklis, Roy
- 1997
Citation Context ... of the agent. The agent's policy does not need high precision in states that are rarely visited. Feature 3 is the least understood aspect of RL, but results exist for the linear case (notably ref. [81]) and numerous examples illustrate how function approximation schemes that are nonlinear in the adjustable parameters (e.g., multilayer neural networks) can be effective for difficult problems (e.g., ...

310 | Multiagent reinforcement learning: Independent vs. cooperative agents
- Tan
- 1993
Citation Context ...s the utility of A2 picking up trash from T1 if A1 is also picking up from the same bin, and so on). The proposed approach differs significantly from previous work in multiagent reinforcement learning [38, 74] in using hierarchical task structure to accelerate learning, and as well in its use of concurrent activities. To illustrate the use of this decomposition in learning multiagent coordination, for the ...

305 | A unified framework for hybrid control: model and optimal control theory
- Branicky, Borkar, et al.
- 1998
Citation Context ...avior of the plant and intervenes when its state enters a set of boundary states. Intervention takes the form of switching to a new low-level regulator. This is not unlike many hybrid control methods [8] except that the low-level process is formalized as a finite MDP and the supervisor's task as a finite SMDP. The supervisor's decisions occur whenever the plant reaches a boundary state, which effectiv...

300 | Tractable inference for complex stochastic processes
- Boyen, Koller
- 1998
Citation Context ...generally does not hold over an extended activity. One approach that Rohanimanesh and Mahadevan [56] have been studying is how to exploit results from approximation of structured stochastic processes [6] to develop structured ways of approximating the next-state predictions of temporally-extended activities. The key idea is that by clustering the state variables into disjoint subsets, and keeping tra...

286 | TD-Gammon, a self-teaching backgammon program, achieves master-level play
- Tesauro
- 1994
Citation Context ...motivated by animal behavior and its neural basis [45, 33, 34, 71]. Much of the current interest is attributable to Werbos [85, 86, 87], Watkins [82], and Tesauro's backgammon-playing system TD-Gammon [75, 76]. Additional information about RL can be found in several references (e.g., [2, 5, 32, 72]). Despite the utility of RL methods in many applications, the amount of time they can take to form acceptable...

285 | Reinforcement Learning with Hierarchies of Machines
- Parr, Russell
Citation Context ... particularly that of Iba [28], who proposed a method for discovering macro-operators in problem solving. Related ideas have been studied by Digney [15, 16]. 4.2 Hierarchies of Abstract Machines Parr [48, 49] developed an approach to hierarchically structuring MDP policies called Hierarchies of Abstract Machines or HAMs. Like the options formalism, HAMs exploit the theory of SMDPs, but the emphasis is on ...

282 | Toward a modern theory of adaptive networks: Expectation and prediction
- Sutton, Barto
- 1981
Citation Context ...rs player of the 1950s [61, 60], which, however, made no reference to the DP literature existing at that time. Other early RL research was explicitly motivated by animal behavior and its neural basis [45, 33, 34, 71]. Much of the current interest is attributable to Werbos [85, 86, 87], Watkins [82], and Tesauro's backgammon-playing system TD-Gammon [75, 76]. Additional information about RL can be found in several ...

255 | On the convergence of stochastic iterative dynamic programming.
- Jaakkola, Jordan, et al.
- 1994
Citation Context ...missible state-action pairs are updated infinitely often, and α_k decays with increasing k while obeying the usual stochastic approximation conditions, then {Q_k} converges to Q* with probability 1 [29, 5]. As long as these conditions are satisfied, the policy followed by the agent during learning is irrelevant. Of course, when Q-learning is being used, the agent's policy does matter since one is usual...

192 | Singular Perturbation Methods in Control: Analysis and Design
- Kokotovic, Khalil, et al.
- 1986
Citation Context ...roblems. Further work is required for understanding how to build task hierarchies in such cases, and how to integrate this approach to related systems approaches such as singular perturbation methods [36, 46]. 6.3 Dynamic Abstraction Systems such as those outlined in this article naturally provide opportunities for using different state representations depending on the activity that is currently executing....

177 | Dervish: An Office-Navigating Robot
- Nourbakhsh
- 1998
Citation Context ...tes to actions provide good performance in robot navigation (e.g., the most-likely-state (MLS) heuristic assumes the agent is in the state corresponding to the "peak" of the belief state distribution) [35, 63, 47]. Such heuristics work much better in H-POMDPs because they can be applied at multiple levels, and belief states over abstract states usually have lower entropy (Figure 8). For a detailed study of the...
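The most-likely-state heuristic described in this context is simple enough to state in code: act as if the agent were in the peak of the belief distribution, then apply an ordinary state-to-action policy. The `policy` dictionary here is a hypothetical stand-in for any such mapping:

```python
# MLS heuristic: pick the action assigned to the most probable state
# under the current belief. `belief` and `policy` are illustrative dicts.

def mls_action(belief, policy):
    """Return policy's action for the 'peak' state of the belief distribution."""
    most_likely = max(belief, key=belief.get)  # state with highest probability
    return policy[most_likely]
```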

154 | Convergence results for single-step on-policy reinforcement-learning algorithms
- Singh, Jaakkola, et al.
- 2000
Citation Context ...called this algorithm Sarsa due to its dependence on s, a, r, s′, and a′. Eq. (9) is actually a special case called Sarsa(0).) Unlike Q-learning, here the agent's policy does matter. Singh et al. [65] show that if the policy has the property that each action is executed infinitely often in every state that is visited infinitely often, and it is greedy with respect to the current action-value funct...

147 | Automatic discovery of subgoals in reinforcement learning using diverse density.
- McGovern, Barto
- 2001
Citation Context ...lues, and Dietterich [14], whose approach we discuss in Section 4.3, proposes a similar scheme using pseudo-reward functions. A natural question, then, is how are useful subgoals determined? McGovern [43, 44] developed a method for automatically identifying potentially useful subgoals by detecting regions that the agent visits frequently on successful trajectories but not on unsuccessful trajectories. An ...

138 | Reinforcement learning for dynamic channel allocation in cellular telephone systems
- Singh, Bertsekas
- 1997
Citation Context ...nd numerous examples illustrate how function approximation schemes that are nonlinear in the adjustable parameters (e.g., multilayer neural networks) can be effective for difficult problems (e.g., refs. [11, 40, 64, 75]). Of the many RL algorithms, perhaps the most widely used are Q-learning [82, 83] and Sarsa [59, 70]. Q-learning is based on the DP backup (5) but with the expected immediate reward and the expected m...

136 | Learning topological maps with weak local odometric information
- Shatkay, Kaelbling
- 1997
Citation Context ...tion (e.g., the most-likely-state (MLS) heuristic assumes the agent is in the state corresponding to the "peak" of the belief state distribution) (Koenig and Simmons, 1997; Nourbakhsh et al., 1995; Shatkay and Kaelbling, 1997). Such heuristics work much better in H-POMDPs because they can be applied at multiple levels, and belief states over abstract states usually have lower entropy (Figure 8). For a detailed study of th...

135 | Reinforcement learning methods for continuous-time Markov decision problems
- Bradtke
- 1995
Citation Context ... r_{t+i} is the immediate reward received at time step t + i. The return accumulated during the waiting time must be bounded, and it can be computed recursively during the waiting time. Bradtke and Duff [7] showed how to do this for continuous-time SMDPs, Parr [48] proved that it converges under essentially the same conditions required for Q-learning convergence, and Das et al. [12] developed the averag...
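The recursive accumulation this context refers to can be sketched as follows: during an activity's waiting time, the discounted return is built up one reward at a time while tracking the overall discount factor needed for the SMDP backup. The variable names and the value of `gamma` are illustrative assumptions:

```python
# Accumulate the discounted return during an option's waiting time.
# After tau steps, `discount` equals gamma**tau, the factor applied to
# the value of the state where the activity terminates.
gamma = 0.9

def accumulate_option_return(rewards):
    """Return (discounted return R, total discount gamma**tau) for one run."""
    R, discount = 0.0, 1.0
    for r in rewards:        # rewards r_t, r_{t+1}, ... observed en route
        R += discount * r    # adds gamma**i * r_{t+i}
        discount *= gamma
    return R, discount
```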

130 | Average reward reinforcement learning: Foundations, algorithms, and empirical results.
- Mahadevan
- 1996
Citation Context ...e simplest class of MDPs, and here we restrict attention to discounted problems. However, RL algorithms have also been developed for MDPs with other definitions of return, such as average reward MDPs [39, 62]. Playing important roles in many RL algorithms are action-value functions, which assign values to admissible state-action pairs. Given a policy π, the value of (s, a), a ∈ A_s, denoted Q^π(s, a), i...

127 | Approximate dynamic programming for real-time control and neural modeling
- Werbos
- 1992
Citation Context ... DP literature existing at that time. Other early RL research was explicitly motivated by animal behavior and its neural basis [45, 33, 34, 71]. Much of the current interest is attributable to Werbos [85, 86, 87], Watkins [82], and Tesauro's backgammon-playing system TD-Gammon [75, 76]. Additional information about RL can be found in several references (e.g., [2, 5, 32, 72]). Despite the utility of RL methods ...

123 | A heuristic approach to the discovery of macro-operators
- Iba
- 1989
Citation Context ... knowledge transfer as previously-discovered options are reused in related tasks. This approach builds on previous work in artificial intelligence that addresses abstraction, particularly that of Iba [28], who proposed a method for discovering macro-operators in problem solving. Related ideas have been studied by Digney [15, 16]. 4.2 Hierarchies of Abstract Machines Parr [48, 49] developed an approach...

122 | Hierarchical Control and Learning for Markov Decision Processes,
- Parr
- 1998
Citation Context ... ideas of RL, and then we review three approaches to hierarchical RL: the options formalism of Sutton, Precup, and Singh [73], the hierarchies of abstract machines (HAMs) approach of Parr and Russell [48, 49], and the MAXQ framework of Dietterich [14]. Although these approaches were developed relatively independently, they have many elements in common. In particular, they all rely on the theory of semi-Ma...

120 | A reinforcement learning method for maximizing undiscounted rewards
- Schwartz
- 1993
Citation Context ...e simplest class of MDPs, and here we restrict attention to discounted problems. However, RL algorithms have also been developed for MDPs with other definitions of return, such as average reward MDPs [39, 62]. Playing important roles in many RL algorithms are action-value functions, which assign values to admissible state-action pairs. Given a policy π, the value of (s, a), a ∈ A_s, denoted Q^π(s, a), i...

120 | Scaling Reinforcement Learning toward RoboCup Soccer
- Stone, Sutton
- 2001
Citation Context ...g problem. They demonstrated that the policies learned for this problem were better than standard heuristics used in industry, such as the "go to the nearest free machine" heuristic. Stone and Sutton [68] applied the framework of options to a "keep away" task in robot soccer. This task involves a set of players from one team passing the ball between them and keeping the ball in their possession agains...

116 | Q-decomposition for reinforcement learning agents.
- Russell, Zimdars
- 2003
Citation Context ...d the state transition probabilities, P(s′|s, a), s, s′ ∈ S, together comprise what RL researchers often call the one-step model of action a. A (stationary, stochastic) policy π : S × ⋃_{s∈S} A_s → [0, 1], with π(s, a) = 0 for a ∉ A_s, specifies that the agent executes action a ∈ A_s with probability π(s, a) whenever it observes state s. For any policy π and s ∈ S, V^π(s) denotes the expected infin...

115 | Xavier: A robot navigation architecture based on partially observable markov decision process models
- Koenig, Simmons
- 1998
Citation Context ...tes to actions provide good performance in robot navigation (e.g., the most-likely-state (MLS) heuristic assumes the agent is in the state corresponding to the "peak" of the belief state distribution) [35, 63, 47]. Such heuristics work much better in H-POMDPs because they can be applied at multiple levels, and belief states over abstract states usually have lower entropy (Figure 8). For a detailed study of the...

112 | Finding structure in reinforcement learning
- Thrun, Schwartz
- 1995
Citation Context ...d for a subset of the state set. The partial policies must also have well-defined termination conditions. These partial policies are sometimes called temporally-extended actions, options [73], skills [80], behaviors [9, 27], or the more control-theoretic modes [22]. When not discussing a specific formalism, we will use the term activity, as suggested by Harel [23]. For MDPs, this extension adds to the...

106 | Achieving artificial intelligence through building robots. Memo 899
- Brooks
- 1986
Citation Context ...f the state set. The partial policies must also have well-defined termination conditions. These partial policies are sometimes called temporally-extended actions, options [73], skills [80], behaviors [9, 27], or the more control-theoretic modes [22]. When not discussing a specific formalism, we will use the term activity, as suggested by Harel [23]. For MDPs, this extension adds to the sets of admissible...

98 | Elevator group control using multiple reinforcement learning agents
- Crites, Barto
- 1998
Citation Context ...ing function approximation methods for accumulating value function information, RL algorithms have produced good results on problems that pose significant challenges for standard methods (e.g., refs. [11, 75]). However, current RL methods by no means completely circumvent the curse of dimensionality: the exponential growth of the number of parameters to be learned with the size of any compact encoding of ...

94 | Discovering hierarchy in reinforcement learning with HEXQ
- Hengst
- 2002
Citation Context ...ly discussed in Section 4.1 automated methods for identifying useful subgoals [15, 16, 43, 44] which address some aspects of this problem. Another approach called HEXQ was recently proposed by Hengst [24]. It exploits a factored state representation and sorts state variables into an ordered list, beginning with the variable that changes most rapidly. HEXQ builds a task hierarchy, consisting of one lev...

85 | Multi-time models for temporally abstract planning
- Precup, Sutton
- 1998

69 | Reinforcement learning with a hierarchy of abstract models.
- Singh
- 1992
Citation Context ...btasks, M_a, and its transition probabilities, P_i(s′, N|s, a) (cf. Eqs. 6 and 7), are well-defined given the policies of the lower-level subtasks. A key observation, which follows that of Singh [66, 67], is that this SMDP's expected immediate reward, R_i(s, a), for executing action (subtask) a is the projected value of π on subtask M_a. That is, for all s ∈ S_i and all child subtasks M_a of M_i, ...

67 | Practical issues in temporal difference learning
- Tesauro
- 1992
Citation Context ...ing function approximation methods for accumulating value function information, RL algorithms have produced good results on problems that pose significant challenges for standard methods (e.g., refs. [11, 75]). However, current RL methods by no means completely circumvent the curse of dimensionality: the exponential growth of the number of parameters to be learned with the size of any compact encoding of ...

66 | A feedback control structure for on-line learning tasks
- Huber, Grupen
- 1997
Citation Context ...f the state set. The partial policies must also have well-defined termination conditions. These partial policies are sometimes called temporally-extended actions, options [73], skills [80], behaviors [9, 27], or the more control-theoretic modes [22]. When not discussing a specific formalism, we will use the term activity, as suggested by Harel [23]. For MDPs, this extension adds to the sets of admissible...

66 | Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research
- Werbos
- 1987
Citation Context ... DP literature existing at that time. Other early RL research was explicitly motivated by animal behavior and its neural basis [45, 33, 34, 71]. Much of the current interest is attributable to Werbos [85, 86, 87], Watkins [82], and Tesauro's backgammon-playing system TD-Gammon [75, 76]. Additional information about RL can be found in several references (e.g., [2, 5, 32, 72]). Despite the utility of RL methods ...

65 | Temporal abstraction in reinforcement learning.
- Precup
- 2000
Citation Context ...ly involved in the hierarchical specification. This dependence is made more explicit in the work of Parr [48] and Dietterich [14], which we discuss below. Using this machinery (made precise by Precup [52]), one can define hierarchical options as triples ⟨I, π, β⟩, where I and β are the same as for Markov options but π is a semi-Markov policy over options. Value functions for option policies can be define...

62 | Emergent hierarchical control structures: Learning reactive/hierarchical relationships in reinforcement environments
- Digney
- 1996
Citation Context ...nce. A key open question is how to form task hierarchies automatically, such as those used in the MAXQ framework. We briefly discussed in Section 4.1 automated methods for identifying useful subgoals [15, 16, 43, 44] which address some aspects of this problem. Another approach called HEXQ was recently proposed by Hengst [24]. It exploits a factored state representation and sorts state variables into an ordered li...

60 | Learning to Solve Problems by Searching for Macro-Operators
- Korf
- 1985
Citation Context ...ficial intelligence researchers have addressed the need for large-scale planning and problem solving by introducing various forms of abstraction into problem solving and planning systems, e.g., refs. [18, 37]. Abstraction allows a system to ignore details that are irrelevant for the task at hand. One of the simplest types of abstraction is the idea of a "macro-operator," or just a "macro," which is a sequ...

58 | Theoretical results on reinforcement learning with temporally abstract behaviors. 10th Eur. Conf.
- Precup, Sutton, et al.
- 1998
Citation Context ...r all s ∈ S and o ∈ O_s. The system of equations (11) and (12) can be solved respectively for V*_O and Q*_O, exactly or approximately, using methods that generalize the usual DP and RL algorithms [54]. For example, the DP backup analogous to (5) for computing option-values is: Q_{k+1}(s, o) = R(s, o) + Σ_{s′∈S} P(s′|s, o) max_{o′∈O_{s′}} Q_k(s′, o′), and the corresponding Q-learning update a...
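The DP backup over option values quoted in this context can be sketched as a synchronous sweep over states and options. The containers `R` (expected option reward) and `P` (option transition model) and the tiny example in the test are hypothetical inputs, not the paper's notation; note that in the options framework the discount is folded into `R` and `P`, so none appears explicitly here:

```python
# One synchronous sweep of the option-value backup:
#   Q_{k+1}(s, o) = R(s, o) + sum_{s'} P(s'|s, o) * max_{o'} Q_k(s', o').
# Q maps (s, o) -> value; `options` maps each state to its admissible options.

def option_value_backup(Q, S, options, R, P):
    """Return the next iterate Q_{k+1} given the current iterate Q_k."""
    Q_next = {}
    for s in S:
        for o in options[s]:
            expect = sum(P[(s, o, s2)] *
                         max(Q.get((s2, o2), 0.0) for o2 in options[s2])
                         for s2 in S)
            Q_next[(s, o)] = R[(s, o)] + expect
    return Q_next
```

Iterating this sweep to a fixed point generalizes ordinary value iteration from primitive actions to options.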

54 | Self-improving factory simulation using continuous-time average-reward reinforcement learning
- Mahadevan, Marchalleck, et al.
- 1997
(Show Context)
Citation Context ...hich the amount of time between one decision and the next is a random variable, either real- or integer-valued. In the real-valued case, SMDPs model continuous-time discrete-event systems (e.g., refs. [40, 55]). In a discrete-time SMDP [26] decisions can be made only at (positive) integer multiples of an underlying time step. In either case, it is usual to treat the system as remaining in each state for a ... |

53 |
Advanced Forecasting Methods for Global Crisis Warning and Models of Intelligence,”
- Werbos
- 1977
(Show Context)
Citation Context ... DP literature existing at that time. Other early RL research was explicitly motivated by animal behavior and its neural basis [45, 33, 34, 71]. Much of the current interest is attributable to Werbos [85, 86, 87], Watkins [82], and Tesauro's backgammon-playing system TD-Gammon [75, 76]. Additional information about RL can be found in several references (e.g., [2, 5, 32, 72]). Despite the utility of RL methods ... |

51 | Autonomous discovery of temporal abstractions from interaction with an environment,”
- McGovern
- 2002
(Show Context)
Citation Context ...lues, and Dietterich [14], whose approach we discuss in Section 4.3, proposes a similar scheme using pseudo-reward functions. A natural question, then, is how are useful subgoals determined? McGovern [43, 44] developed a method for automatically identifying potentially useful subgoals by detecting regions that the agent visits frequently on successful trajectories but not on unsuccessful trajectories. An ... |

47 | Learning hierarchical partially observable Markov decision processes for robot navigation
- Theocharous, Rohanimanesh, et al.
- 2001
(Show Context)
Citation Context ...s and actions that conceal a critical past observation. We briefly summarize three multiscale memory models that have been explored recently by Hernandez and Mahadevan [25], Theocharous and Mahadevan [79], and Jonsson and Barto [30]. These models combine temporal abstraction with previous methods for dealing with hidden state. Hierarchical Suffix Memory (HSM) [25] generalizes the suffix tree model [42], t... |

46 | Solving semi-Markov decision problems using average reward reinforcement learning,
- Das, Gosavi, et al.
- 1999
(Show Context)
Citation Context ...time. Bradtke and Duff [7] showed how to do this for continuous-time SMDPs, Parr [48] proved that it converges under essentially the same conditions required for Q-learning convergence, and Das et al. [12] developed the average reward case. Crites [10, 11] used SMDP Q-learning in a continuous-time discrete-event formulation of an elevator dispatching problem, an application that illustrates two useful ... |

46 | Learning to Improve Coordinated Actions in Cooperative Distributed Problem-Solving Environments,”
- Sugawara, Lesser
- 1998
(Show Context)
Citation Context ...ordination skills by trial and error. The key idea here is that coordination skills are learned more efficiently if agents learn to synchronize using a hierarchical representation of the task structure [69]. In particular, rather than each robot learning its response to low-level primitive actions of the other robots (for instance, if A1 goes forward, what should A2 do), they learn high-level coordinati... |

42 | Automated State Abstraction for Options using the U-Tree Algorithm,
- Jonsson, Barto
- 2001
(Show Context)
Citation Context ... critical past observation. We briefly summarize three multiscale memory models that have been explored recently by Hernandez and Mahadevan [25], Theocharous and Mahadevan [79], and Jonsson and Barto [30]. These models combine temporal abstraction with previous methods for dealing with hidden state. Hierarchical Suffix Memory (HSM) [25] generalizes the suffix tree model [42], to SMDP-based temporally-exte... |

41 |
Theory of Neural-Analog Reinforcement Systems and its Application to the Brain-Model Problem,
- Minsky
- 1954
(Show Context)
Citation Context ...rs player of the 1950s [61, 60], which, however, made no reference to the DP literature existing at that time. Other early RL research was explicitly motivated by animal behavior and its neural basis [45, 33, 34, 71]. Much of the current interest is attributable to Werbos [85, 86, 87], Watkins [82], and Tesauro's backgammon-playing system TD-Gammon [75, 76]. Additional information about RL can be found in several ... |

41 |
Markov Decision Processes: Discrete Stochastic Dynamic Programming.
- Puterman
- 1994
(Show Context)
Citation Context ...heir properties. Here we briefly describe this well-known framework, with a few twists characteristic of how it is used in RL research; additional details can be found in many references (e.g., refs. [4, 5, 55, 58, 72]). A finite MDP models the following type of problem. At each stage in a sequence of stages, an agent (the controller) observes a system's state s, contained in a finite set S, and executes an action ... |

39 | Hierarchical multi-agent reinforcement learning
- Makar, Mahadevan, et al.
(Show Context)
Citation Context ...ified policy. This approach requires very minimal modification of the approach described in the previous section. In contrast, we now describe an extension of this approach due to Makar and Mahadevan [41] in which agents learn both coordination skills and the base-level policies using a multiagent MAXQ-like task graph. However, convergence to (hierarchically) optimal policies is no longer assured sinc... |

29 |
Multilayer control of large Markov chains.
- Forestier, Varaiya
- 1978
(Show Context)
Citation Context ...evel activities that comprise a. To the best of our knowledge (and as pointed out by Parr [48]), the approach most closely related to this in the control literature is that of Forestier and Varaiya [20], which we discuss briefly in Section 4.2. 4.1 Options Sutton, Precup, and Singh [73] formalize this approach to including activities in RL with their notion of an option. Starting from a finite MDP, ... |

29 | Scaling reinforcement learning algorithms by learning variable temporal resolution models
- Singh
- 1992
(Show Context)
Citation Context ...btasks, M_a, and its transition probabilities, P_i(s', N|s, a) (cf. Eqs. 6 and 7), are well-defined given the policies of the lower-level subtasks. A key observation, which follows that of Singh [66, 67], is that this SMDP's expected immediate reward, R_i(s, a), for executing action (subtask) a is the projected value of π on subtask M_a. That is, for all s ∈ S_i and all child subtasks M_a of M_i, ... |

28 | Integrating experimentation and guidance in relational reinforcement learning
- Driessens, Dzeroski
- 2002
(Show Context)
Citation Context ...l structure. For example, states are very often represented as vectors of state variables (usually called factored states by machine learning researchers), or even possess richer relational structure [17]. Much work in artificial intelligence has focused on exploiting this structure to develop compact representations of single-step actions (e.g., the Dynamic Bayes Net representation [13]). A natural q... |

24 | Decision-theoretic planning with concurrent temporally extended actions
- Rohanimanesh, Mahadevan
- 2001
(Show Context)
Citation Context ...the SMDP framework can be extended to concurrent activities, multiagent domains, and partially observable states. 5.1 Concurrent Activities Here we summarize recent work by Rohanimanesh and Mahadevan [57] towards a general framework for modeling concurrent activities. This framework is motivated by situations in which a single agent can execute multiple parallel processes, as well as by the multiagent... |

23 | Hierarchical multi-agent reinforcement learning.
- Ghavamzadeh, Mahadevan, et al.
- 2006
(Show Context)
Citation Context ...ified policy. This approach requires very minimal modification of the approach described in the previous section. In contrast, we now describe an extension of this approach due to Makar and Mahadevan [41] in which agents learn both coordination skills and the base-level policies using a multiagent MAXQ-like task graph. However, convergence to (hierarchically) optimal policies is no longer assured sinc... |

22 | Continuous-time Hierarchical Reinforcement Learning
- Ghavamzadeh, Mahadevan
- 2001
(Show Context)
Citation Context ...s restricted to options over discrete-time SMDPs and having deterministic policies, but the main ideas extend readily to the other variants (HAMs, MAXQ), as well as to continuous-time SMDPs. (See ref. [21] for a treatment of hierarchical RL for continuous-time SMDPs.) The sequential option model is generalised to a multi-option, which is a set of options that can be executed in parallel. Here we discus... |

21 | Large-scale dynamic optimization using teams of reinforcement learning agents,
- Crites
- 1996
(Show Context)
Citation Context ... for continuous-time SMDPs, Parr [48] proved that it converges under essentially the same conditions required for Q-learning convergence, and Das et al. [12] developed the average reward case. Crites [10, 11] used SMDP Q-learning in a continuous-time discrete-event formulation of an elevator dispatching problem, an application that illustrates two useful features of RL methods for discrete-event systems. ... |

21 |
Dynamic Probabilistic Systems: Semi-Markov and Decision Processes
- Howard
- 1971
(Show Context)
Citation Context ... decision and the next is a random variable, either real- or integer-valued. In the real-valued case, SMDPs model continuous-time discrete-event systems (e.g., refs. [40, 55]). In a discrete-time SMDP [26] decisions can be made only at (positive) integer multiples of an underlying time step. In either case, it is usual to treat the system as remaining in each state for a random waiting time [26], at th... |

21 |
The Hedonistic Neuron: A Theory of
- Klopf
- 1982
(Show Context)
Citation Context ...rs player of the 1950s [61, 60], which, however, made no reference to the DP literature existing at that time. Other early RL research was explicitly motivated by animal behavior and its neural basis [45, 33, 34, 71]. Much of the current interest is attributable to Werbos [85, 86, 87], Watkins [82], and Tesauro's backgammon-playing system TD-Gammon [75, 76]. Additional information about RL can be found in several ... |

20 | Lyapunov Design for Safe Reinforcement Learning.
- Perkins, Barto
- 2003
(Show Context)
Citation Context ...heir policies a priori is an opportunity to use background knowledge about the task to try to accelerate learning and/or provide guarantees about system performance during learning. Perkins and Barto [51, 50], for example, consider collections of options each of which descends on a Lyapunov function. Not only is learning accelerated, but the goal state is reached on every learning trial while the agent le... |

20 | Approximate Planning with Hierarchical Partially Observable Markov Decision Process Models for Robot Navigation,"
- Theocharous, Mahadevan
- 2002
(Show Context)
Citation Context ... parameters of an H-POMDP model from sequences of observations and actions. Extensive tests on a robot navigation domain show learning and planning performance is much improved over flat POMDP models [79, 78]. The hierarchical EM-based parameter estimation algorithm scales more gracefully to large models because previously learned sub-models can be reused when learning at higher levels. In addition, the e... |

19 |
Hierarchical learning and planning in partially observable Markov decision processes,
- Theocharous
- 2002
(Show Context)
Citation Context ...pplied at multiple levels, and belief states over abstract states usually have lower entropy (Figure 8). For a detailed study of the H-POMDP model, as well as its application to robot navigation, see [77]. Jonsson and Barto [30] also addressed partial observability by adapting suffix tree methods to hierarchical RL systems. Their approach focused on automating the process of constructing activity-specif... |

18 |
Dervish: An office-navigating robot
- Nourbakhsh, Powers, et al.
- 1995
(Show Context)
Citation Context ...tes to actions provide good performance in robot navigation (e.g., the most-likely-state (MLS) heuristic assumes the agent is in the state corresponding to the "peak" of the belief state distribution) [35, 63, 47]. Such heuristics work much better in H-POMDPs because they can be applied at multiple levels, and belief states over abstract states usually have lower entropy (Figure 8). For a detailed study of the... |

17 |
An analysis of temporal-difference learning with function approximation
- Tsitsiklis, Roy
- 1997
(Show Context)
Citation Context ... of the agent. The agent's policy does not need high precision in states that are rarely visited. Feature 3 is the least understood aspect of RL, but results exist for the linear case (notably ref. [81]) and numerous examples illustrate how function approximation schemes that are nonlinear in the adjustable parameters (e.g., multilayer neural networks) can be effective for difficult problems (e.g., ref... |

13 |
Hierarchical memory-based reinforcement learning
- Hernandez, Mahadevan
- 2000
(Show Context)
Citation Context ...of mostly irrelevant observations and actions that conceal a critical past observation. We briefly summarize three multiscale memory models that have been explored recently by Hernandez and Mahadevan [25], Theocharous and Mahadevan [79], and Jonsson and Barto [30]. These models combine temporal abstraction with previous methods for dealing with hidden state. Hierarchical Suffix Memory (HSM) [25] general... |

13 | Lyapunov-constrained action sets for reinforcement learning
- Perkins, Barto
- 2001
(Show Context)
Citation Context ...heir policies a priori is an opportunity to use background knowledge about the task to try to accelerate learning and/or provide guarantees about system performance during learning. Perkins and Barto [51, 50], for example, consider collections of options each of which descends on a Lyapunov function. Not only is learning accelerated, but the goal state is reached on every learning trial while the agent le... |

13 | A time aggregation approach to Markov decision processes - Cao, Ren, et al. - 2002 |

12 | Localizing search in reinforcement learning
- Grudic, Ungar
(Show Context)
Citation Context ...so have well-defined termination conditions. These partial policies are sometimes called temporally-extended actions, options [73], skills [80], behaviors [9, 27], or the more control-theoretic modes [22]. When not discussing a specific formalism, we will use the term activity, as suggested by Harel [23]. For MDPs, this extension adds to the sets of admissible actions, A_s, s ∈ S, sets of activities,... |

11 | Hierarchically optimal average reward reinforcement learning
- Ghavamzadeh, Mahadevan
- 2002
(Show Context)
Citation Context ...d for average reward MDPs (Mahadevan, 1996; Schwartz, 1993), and some research has been done on extending aspects of the hierarchical approaches we discuss in this article to the average reward case (Ghavamzadeh and Mahadevan, 2002). Playing important roles in many RL algorithms are action-value functions, which assign values to admissible state-action pairs. Given a policy π, the value of (s, a), a ∈ A_s, denoted Q^π(s, a), i... |
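The action-value function described in this context is what the tabular Q-learning rule, mentioned among the most widely used RL algorithms on this page, estimates. A minimal sketch follows; the learning rate and discount values are illustrative assumptions, not values from the paper.

```python
def q_learning_update(Q, s, a, r, s_next, actions_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning backup: move the estimate Q(s, a) toward
    the sampled target r + gamma * max_{a'} Q(s', a')."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions_next)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q
```

Sarsa, also cited on this page as one of the most widely used algorithms, differs only in replacing the max over next actions with the value of the action actually taken next.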

4 |
Learning hierarchical control structure from multiple tasks and changing environments. From Animals to Animats 5
- Digney
- 1998
(Show Context)
Citation Context ... artificial intelligence that addresses abstraction, particularly that of Iba [28], who proposed a method for discovering macro-operators in problem solving. Related ideas have been studied by Digney [15, 16]. 4.2 Hierarchies of Abstract Machines Parr [48, 49] developed an approach to hierarchically structuring MDP policies called Hierarchies of Abstract Machines or HAMs. Like the options formalism, HAMs ... |

4 |
Singular Perturbation Methodology in Control Systems, Peter Peregrinus,
- Naidu
- 1988
(Show Context)
Citation Context ...required for understanding how to build task hierarchies in such cases, and how to integrate this approach to related systems approaches such as singular perturbation methods (Kokotovic et al., 1986; Naidu, 1988). |

3 |
Brain function and adaptive systems---A heterostatic theory. Technical Report AFCRL-72-0164, Air Force Cambridge Research Laboratories
- Klopf
- 1972
(Show Context)
Citation Context |

3 |
Singular Perturbation Methodology
- Naidu
- 1988
(Show Context)
Citation Context ...roblems. Further work is required for understanding how to build task hierarchies in such cases, and how to integrate this approach to related systems approaches such as singular perturbation methods [36, 46]. 6.3 Dynamic Abstraction Systems such as those outlined in this article naturally provide opportunities for using different state representations depending on the activity that is currently executing.... |

3 |
Achieving artificial intelligence through building robots
- Brooks
- 1986
(Show Context)
Citation Context ...ust also have well-defined termination conditions. These partial policies are sometimes called temporally-extended actions, options (Sutton et al., 1999), skills (Thrun and Schwartz, 1995), behaviors (Brooks, 1986; Huber and Grupen, 1997), or the more control-theoretic modes (Grudic and Ungar, 2000). When not discussing a specific formalism, we will use the term activity, as suggested by Harel (1987). For MDPs,... |

1 |
Statecharts: A visual formalism for complex systems
- Harel
- 1987
(Show Context)
Citation Context ...xtended actions, options [73], skills [80], behaviors [9, 27], or the more control-theoretic modes [22]. When not discussing a specific formalism, we will use the term activity, as suggested by Harel [23]. For MDPs, this extension adds to the sets of admissible actions, A_s, s ∈ S, sets of activities, each of which can itself invoke other activities, thus allowing a hierarchical specification of an o... |

1 |
Structured approximation of stochastic temporally extended actions
- Rohanimanesh, Mahadevan
(Show Context)
Citation Context ...t subtle since even if all actions have limited single-step influence on state variables, this property generally does not hold over an extended activity. One approach that Rohanimanesh and Mahadevan [56] have been studying is how to exploit results from approximation of structured stochastic processes [6] to develop structured ways of approximating the next-state predictions of temporally-extended ac... |

1 | Statecharts: A visual formalism for complex systems - Harel - 1987 |

1 |
Brain function and adaptive systems---A heterostatic theory. Technical Report AFCRL-72-0164, Air Force Cambridge Research Laboratories
- Klopf
- 1974
(Show Context)
Citation Context ...muel, 1963, 1967), which, however, made no reference to the DP literature existing at that time. Other early RL research was explicitly motivated by animal behavior and its neural basis (Minsky, 1954; Klopf, 1972, 1982; Sutton and Barto, 1981). Much of the current interest is attributable to Werbos (1977, 1987, 1992), Watkins (1989), and Tesauro's backgammon-playing system TD-Gammon (Tesauro, 1992, 1994). Add... |

1 | Dervish: An office-navigating robot - Nourbakhsh, Powers, et al. - 1995 |

1 | Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning 38 - unknown authors - 2000 |

1 | Goldszmidt (eds - Sammut, M |