| N. Meuleau, L. Peshkin, K.-E. Kim, and L. P. Kaelbling. Learning finite-state controllers for partially observable environments. In Proceedings of Uncertainty in Artificial Intelligence, pages 427--436, Stockholm, 1999. |
....that is a distribution over world states will be intractable. Also, in many cases optimal control can be achieved with a far smaller set of internal states (in the simplest of cases, such as the load/unload problem, only two states, or a one-bit memory, are required). In a similar vein to [6], this paper extends GPOMDP to classes of parameterized policies whose actions depend on an internal state. We consider parameterized classes of internal state machines, and compute the gradient of the long-term average reward with respect to both policy and internal state transition parameters. ....
....an automatic quantization of the belief state space to provide a locally optimal policy representable by $n_b$ states. As $n_b \to \infty$, it is at least possible in principle to represent the optimal policy arbitrarily accurately. Another way to view this process is as the direct learning of a policy graph [6]. Specifically, our goal is to find policy parameters $\theta \in \mathbb{R}^{n_p}$ and internal state transition parameters $\phi \in \mathbb{R}^{n_i}$ maximizing the long-term average reward: $\lim_{T \to \infty} \frac{1}{T} E_{\theta,\phi}\left[ \sum_{t=1}^{T} r(i_t) \right]$, where $E_{\theta,\phi}$ denotes the expectation over all sequences $(i_0, b_0), i_1,$ ....
[Article contains additional citation context not shown here]
N. Meuleau, L. Peshkin, K.-E. Kim, and L. P. Kaelbling. Learning finite-state controllers for partially observable environments. In Proceedings of the Fifteenth International Conference on Uncertainty in Artificial Intelligence, 1999.
.... Learning by Williams (1992) with his Reinforce algorithm (similar algorithms were described earlier in the Monte Carlo literature by Glynn (1986); Reiman and Weiss (1986)). A practical difficulty with Reinforce, and a number of subsequent algorithms (Marbach & Tsitsiklis, 1998; Baird & Moore, 1999; Meuleau et al., 1999; Meuleau et al., 2000), is that they estimate the gradient from a specified recurrent state, which requires perfect, rather than partial, observability (at least in that state). In addition, the variance of the gradient estimate is related to the time to return to this recurrent state, which can be ....
Meuleau, N., Peshkin, L., Kim, K.-E., & Kaelbling, L. P. (1999). Learning Finite-State Controllers for Partially Observable Environments. Proceedings of the Fifteenth International Conference on Uncertainty in Artificial Intelligence.
....The representation of the history of observations and actions does not have to literally correspond to the most recent observations and actions. It may instead correspond to memory bits that the controller has learned to switch on and off (Littman, 1994; Lanzi, 2000; Cliff & Ross, 1994; Peshkin, Meuleau, & Kaelbling, 1999). Alternatively, it may correspond to recurrent activations from hidden units in recurrent neural networks (Lin & Mitchell, 1993; Bakker & van der Voort van der Kleij, 2000; Gomez & Miikkulainen, 1999). Finally, the policy may be implemented as a finite state automaton (FSA) where the state of ....
.... from hidden units in recurrent neural networks (Lin & Mitchell, 1993; Bakker & van der Voort van der Kleij, 2000; Gomez & Miikkulainen, 1999). Finally, the policy may be implemented as a finite state automaton (FSA) where the state of the FSA represents the history of observations and actions (Meuleau, Peshkin, Kim, & Kaelbling, 1999; McCallum, 1993). 1.2 Long term dependencies Most of the approaches described above have problems if there are long-term dependencies between relevant events. An example of a long-term dependency problem is a maze navigation task where the only way to distinguish between two T-junctions that ....
Meuleau, N., Peshkin, L., Kim, K. E., & Kaelbling, L. P. (1999). Learning finite-state controllers for partially observable environments. In Proceedings of the fifteenth conference on uncertainty in artificial intelligence.
....efficient evolution. In several robot control benchmark tasks, ESP was compared to other neuro-evolution methods such as SANE, GENITOR [28], and Cellular Encoding [9, 29], as well as to other reinforcement learning methods such as Adaptive Heuristic Critic [3, 1], Q-learning [25, 20], and VAPS [14]. ESP turns out to be consistently the most powerful, solving problems faster, and solving harder problems [8]. It therefore forms a solid foundation for an extension to multi-agent systems evolution. In this paper, ESP is adapted to allow for the simultaneous evolution of multiple agents. We ....
Meuleau, N., Peshkin, L., Kim, K.-E., and Kaelbling, L. P. (1999). Learning finite state controllers for partially observable environments. In Proceedings of the 15th International Conference on Uncertainty in Artificial Intelligence.
....many variants of POMDP problems for which it had not been proved that finding tractable exact solutions or provably good approximations is hard. There is a growing literature on heuristic solutions for POMDPs (see [Cassandra 1998; Hansen 1998b; Hauskrecht 1997; Lovejoy 1991; Lusena et al. 1999; Meuleau et al. 1999; Peshkin et al. 1999; Platzman 1977; Smallwood and Sondik 1973], for instance). Since these algorithms do not yield guaranteed optimal or near-optimal solutions, we leave a discussion of them to other sources. In this paper, we address the computational complexity, given a process and ....
Meuleau, N., Peshkin, L., Kim, K.-E., and Kaelbling, L. P. 1999. Learning finite-state controllers for partially observable environments. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (1999), pp. 427--436.
....in POMDPs, we can restrict attention to policies that depend only on the observables. This restriction results in a subclass of stochastic memory-free policies. By introducing artificial memory variables into the process state, we can also define stochastic limited-memory policies [9] (which certainly permits some belief state tracking). Footnote 1: Although we have not explicitly addressed stochastic policies so far, they are a straightforward generalization (e.g. using the transformation to deterministic policies given in [7]). Since we are interested in the planning problem, we ....
N. Meuleau, L. Peshkin, K-E. Kim, and L.P. Kaelbling. Learning finite-state controllers for partially observable environments. In Uncertainty in Artificial Intelligence, Proceedings of the Fifteenth conference, 1999.
....In particular, in POMDPs, we can restrict attention to policies that depend only on the observables. This restriction results in a subclass of stochastic memory-free policies. By introducing artificial memory bits into the process state, we can also define stochastic limited-memory policies [6]. Each has a value V [ V [ as specified above. To find the best policy in , we can search for the that maximizes V [ If we can compute or approximate V [ there are many algorithms that can be used to find a local maximum. Some, such as Nelder-Mead simplex search (not to be ....
N. Meuleau, L. Peshkin, K-E. Kim, and L.P. Kaelbling. Learning finite-state controllers for partially observable environments. In Proc. UAI 15, 1999.
....policy for a non-Markov problem is NP-hard. Results can be improved if stochastic policies are considered (Jaakkola, Singh, and Jordan 1995) or if a complete model of the underlying non-Markov problem is available. In this case approximate solutions can be searched (Hansen 1998; Hauskrecht 1998; Meuleau, Peshkin, Kim, and Kaelbling 1999) as well as optimal ones (Kaelbling, Littman, and Cassandra 1998). All other methods use some sort of memory of past observations. One way to include memory in a reinforcement learning algorithm like Q-learning is to use a recurrent neural network to represent the Q-values. The resulting network ....
....added to ZCS (Wilson 1994), although optimal performance was not obtained. Furthermore, Cliff and Ross (1994) suggested that the approach would not scale up. Optimal performance in non-Markov environments was reported by Lanzi (1998a) when he added the memory register to XCS. Recently Peshkin, Meuleau, Kim, and Kaelbling (1999) applied the memory register idea to tabular SARSA(λ), reporting interesting results for simple non-Markov problems. 4 The XCS Classifier System and Internal Memory Classifiers in XCS have three main parameters: i) the prediction p, which estimates the payoff that the system expects if the ....
Meuleau, N., L. Peshkin, K. Kim, and L. Kaelbling (1999). Learning finite-state controllers for partially observable environments. In Fifteenth Conference on Uncertainty in Artificial Intelligence. AAAI. (to appear).
No context found.
Meuleau, N.; Peshkin, L.; Kim, K.-E.; and Kaelbling, L. 1999. Learning finite-state controllers for partially observable environments. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence.
No context found.
N. Meuleau, L. Peshkin, K.-E. Kim, and L. P. Kaelbling. Learning finite-state controllers for partially observable environments. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 427--436. Morgan Kaufmann, 1999.
No context found.
Nicolas Meuleau, Leonid Peshkin, Kee-Eung Kim, and Leslie P. Kaelbling. Learning finite-state controllers for partially observable environments. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 427--436. Morgan Kaufmann, 1999.
....Bellman's equation does not transfer to non-Markovian environments. Policy search, on the other hand, accommodates partial observability and non-Markovianism very well. It can be used to find (locally) optimal controllers under many kinds of constraints, with many different forms of memory [6]. Hence, policy search becomes particularly interesting in partially observable settings. The basic policy search algorithm is Williams' (episodic) REINFORCE [11]. It performs .... Footnote 1: Unless we use (Bayesian) belief states instead of the original states, which increases the complexity of the ....
....an environment. 2 Gradient Based Policy Search (REINFORCE) To simplify the presentation, we suppose that the environment is a finite MDP and that REINFORCE is used to optimize the parameters (action probabilities) of a stochastic memoryless policy, which is sufficient in such an environment (see [6] for more complex settings). We also suppose that the problem is a goal achievement task, i.e. there is an absorbing goal state that must be reached as fast as possible. We will ensure that all policies are proper, i.e. that we finally reach the goal with probability 1 under any policy, by ....
N. Meuleau, L. Peshkin, K. Kim, and L. Kaelbling. Learning finite-state controllers for partially observable environments. Proceedings of UAI-99, pages 427--436, 1999.
....a Monte Carlo estimation of the gradient instead of an exact calculation, and that they limit themselves to RPs, which is much less general than our approach. Moreover, Jaakkola et al. do not use the exponentially discounted criterion (1) but the average reward per time step. In a companion paper [17], we propose a stochastic gradient descent approach for learning finite policy graphs during a trial-based interaction with the process. 3.3 OTHER APPROACHES A Monte Carlo approach based on Watkins' Q-learning [25, 24] is also applicable to our problem. For instance, we can use Q-learning based ....
....of the problem to accelerate the computation. If this is not sufficient, bigger leverage can be gained by imposing structure on the policy. However, our algorithms are limited by the necessity to enumerate, at least once per iteration, the complete state space of the POMDP. In a companion paper [17], we propose an indirect learning algorithm that avoids this bottleneck. ....
N. Meuleau, L. Peshkin, K.E. Kim, and L.P. Kaelbling. Learning finite-state controllers for partially observable environments. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, To appear, 1999.
....but have incomplete, unreliable, and generally different perceptions of the world state. In such environments, value search methods are generally inappropriate, causing us to turn to policy search methods [19, 2, 3], which we have applied previously to single-agent partially observable domains [10, 13]. In this paper we describe a gradient descent policy search algorithm for cooperative multi-agent domains. In this setting, after each agent performs its action given its observation according to some individual strategy, they all receive the same payoff. Our objective is to find a learning ....
....with FSCs in a POIPSG. It has been shown that in partially observable environments, the best reactive policy can be arbitrarily worse than the best policy using memory [16]. This statement can also be easily extended to POIPSGs. There are many possibilities for constructing policies with memory [13, 10]. In this work we use a finite state controller (FSC) for each agent. A more detailed description of FSCs and derivation of algorithms for learning them may be found in a previous paper [10]; we simply state the definition here. A finite state controller (FSC) for an agent with action space A ....
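The excerpt above is cut off before the definition is given, so the following is only a minimal sketch of what a stochastic finite state controller can look like, not the definition used by the cited authors. It assumes an action distribution per internal node and a next-node distribution per (node, observation) pair. Every name here (`FiniteStateController`, `make_load_unload_controller`, the node, action, and observation labels) is hypothetical and chosen for illustration; the two-node example is loosely inspired by the load/unload problem mentioned in an earlier excerpt.

```python
import random
from dataclasses import dataclass


@dataclass
class FiniteStateController:
    """Illustrative stochastic FSC; the layout is an assumption, not the paper's."""
    # action_probs[node][action] -> probability of emitting `action` in `node`
    action_probs: dict
    # transition_probs[(node, observation)][next_node] -> probability
    transition_probs: dict
    node: str  # current internal (memory) node

    def act(self, rng):
        """Sample an action from the current node's action distribution."""
        dist = self.action_probs[self.node]
        actions = list(dist)
        return rng.choices(actions, weights=[dist[a] for a in actions])[0]

    def update(self, observation, rng):
        """Sample the next internal node given the latest observation."""
        dist = self.transition_probs[(self.node, observation)]
        nodes = list(dist)
        self.node = rng.choices(nodes, weights=[dist[n] for n in nodes])[0]


def make_load_unload_controller():
    """Hypothetical two-node (one-bit memory) controller for a corridor task."""
    action_probs = {
        "empty": {"right": 1.0, "left": 0.0},   # head toward the load site
        "loaded": {"right": 0.0, "left": 1.0},  # head back to the unload site
    }
    transition_probs = {}
    for node in ("empty", "loaded"):
        # In the corridor the memory bit is kept; at either end it is set.
        stay = {"empty": float(node == "empty"), "loaded": float(node == "loaded")}
        transition_probs[(node, "corridor")] = stay
        transition_probs[(node, "load_site")] = {"empty": 0.0, "loaded": 1.0}
        transition_probs[(node, "unload_site")] = {"empty": 1.0, "loaded": 0.0}
    return FiniteStateController(action_probs, transition_probs, node="empty")
```

In a learning setting the probability tables would be parameterized and adjusted by a policy search method; here they are fixed and deterministic so that the behaviour of the one-bit memory is easy to follow.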
[Article contains additional citation context not shown here]
N. Meuleau, L. Peshkin, K.-E. Kim, and L. P. Kaelbling. Learning finite-state controllers for partially observable environments. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 427--436. Morgan Kaufmann, 1999.
No context found.
N. Meuleau, L. Peshkin, K.-E. Kim, and L. P. Kaelbling. Learning finite-state controllers for partially observable environments. In Proceedings of Uncertainty in Artificial Intelligence, pages 427--436, Stockholm, 1999.
No context found.
Meuleau, N., Peshkin, L., Kim, K.-E., & Kaelbling, L. P. (1999). Learning finite-state controllers for partially observable environments. In Proceedings of the Fifteenth International Conference on Uncertainty in Artificial Intelligence.
No context found.
N. Meuleau, L. Peshkin, K-E. Kim, and L.P. Kaelbling. Learning finite-state controllers for partially observable environments. In Uncertainty in Artificial Intelligence, Proceedings of the Fifteenth conference, 1999.
No context found.
N. Meuleau, L. Peshkin, K. Kim, and L. P. Kaelbling. Learning finite-state controllers for partially observable environments. Proc. UAI-99, pp.427--436, Stockholm, 1999.
No context found.
N. Meuleau, L. Peshkin, K.-E. Kim, and L. P. Kaelbling. Learning finite-state controllers for partially observable environments. In UAI, pages 427--436, Stockholm, 1999.
No context found.
Nicolas Meuleau, Leonid Peshkin, Kee-Eung Kim, and Leslie Pack Kaelbling, `Learning finite-state controllers for partially observable environments', in Proc. Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI '99). Morgan Kaufmann, (1999).
No context found.
Nicolas Meuleau, Leonid Peshkin, Kee-Eung Kim, and Leslie Pack Kaelbling, `Learning finite-state controllers for partially observable environments', in Proc. Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI '99). Morgan Kaufmann, (1999).
No context found.
N. Meuleau, L. Peshkin, K-E. Kim, and L. Kaelbling. Learning finite-state controllers for partially observable environments. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 427--436, 1999.
No context found.
N. Meuleau, L. Peshkin, K.-E. Kim, and L. P. Kaelbling. Learning finite-state controllers for partially observable environments. Proc. UAI-99, pp.427--436, Stockholm, 1999.
No context found.
Nicolas Meuleau, Leonid Peshkin, Kee-Eung Kim, and Leslie Pack Kaelbling. Learning finite-state controllers for partially observable environments. In Proc. of UAI-99, pages 427--436, Stockholm, 1999.