The key idea in this research work is that challenging planning and control problems in stochastic domains can be solved using a general optimization technique combined with carefully constructed representations. An objective of artificial intelligence (AI) is to model the behavior of an intelligent agent's interaction with its environment. Traditionally, this interaction is captured by Markov decision processes (MDPs). MDPs are typically solved by value-search methods yielding memoryless controllers. However, an agent's interactions with the real world are better modeled by partially observable MDPs (POMDPs). In POMDPs, value is an ill-defined concept, and controllers with memory are necessary for optimal behavior. Methods that search directly in the space of policies provide a viable alternative to value-based methods. The performance of policy-search methods depends crucially on the architecture of the controller. We present and analyze various architectures for controllers in POMDPs. We propose future work that will provide useful insights into building controllers able to take