| Marbach, P. (1998). Simulation-Based Methods for Markov Decision Processes. Ph.D. thesis, Laboratory for Information and Decision Systems, MIT. |
.... in which the gradient estimates were used to optimize the performance of a variety of different MDPs and POMDPs, including a simple three-state Markov chain controlled by a linear function, a two-dimensional puck controlled by a neural network, and the call admission problem treated in [19]. 1.1 Related Work The approach we take in this paper is closely related to certain direct adaptive control schemes that are used to tune (deterministic) controllers for discrete-time systems. A number of authors [14, 12, 15] have presented algorithms for the approximate computation in closed ....
....position, making the recurrence time to i very large. It may be possible to alleviate these problems by judiciously altering the recurrent state as training proceeds, but no algorithms along those lines have been presented. Approximate algorithms for computing the gradient were also given in [20, 19], one that sought to solve the aforementioned recurrence problem by demanding only recurrence to one of a set of recurrent states, and another that abandoned recurrence and used discounting, which is closer in spirit to our algorithm. However, the latter algorithm still used estimates of the ....
P. Marbach. Simulation-Based Methods for Markov Decision Processes. PhD thesis, Laboratory for Information and Decision Systems, MIT, 1998.
.... GPOMDP, a new algorithm for computing arbitrarily accurate approximations to the performance gradient of parameterized partially observable Markov decision processes (POMDPs). Our algorithm is essentially an extension of Williams' REINFORCE algorithm [17] and similar more recent algorithms [7, 5, 9, 8]. More specifically, suppose θ ∈ R^K are the parameters controlling the POMDP. For example, θ could be the parameters of an approximate neural network value function that generates a stochastic policy by some form of randomized look ahead, or θ could be the parameters of an approximate Q ....
....is to reliably navigate the puck from any starting configuration to an arbitrary target location in the minimum time, while only applying discrete forces in the x and y directions. In the third experiment, we use CONJPOMDP to train a controller for the call admission queueing problem treated in [8]. In this case CONJPOMDP finds near-optimal solutions within about 2000 iterations of the underlying queue. In the fourth and final experiment, CONJPOMDP is used to train a switched neural-network controller for a two-dimensional variation on the classical mountain car task [13, Example 8.2] ....
[Article contains additional citation context not shown here]
P. Marbach. Simulation-Based Methods for Markov Decision Processes. PhD thesis, Laboratory for Information and Decision Systems, MIT, 1998.
....preferring direct policy approaches is that it is often far easier to construct a reasonable class of parameterized policies than it is to construct a class of value functions; we often know how to act without being able to compute the value of acting. Building on a large body of earlier work [7, 36, 22, 15, 16, 20, 30, 23, 26, 25, 2], in [8, 9] we introduced GPOMDP, an algorithm for estimating the gradient of the average reward in general Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. The chief advantages of GPOMDP are that it requires only a single sample path of the ....
Peter Marbach. Simulation-Based Methods for Markov Decision Processes. PhD thesis, Laboratory for Information and Decision Systems, MIT, 1998.
....encountering that state, the algorithm returns an estimate of the gradient based on the trace of the reinforcement signal accumulated during the episode. Williams' REINFORCE was generalized to optimize the average reward criterion in MDPs based on a single sample path by Marbach and Tsitsiklis [32, 34]. Parameters can be updated either during visits to a certain recurrent state or on every time step. Convergence of the performance metric with probability one has been established for their algorithm. Baird and Moore [5] presented an algorithm called VAPS, which combines value function with policy ....
Peter Marbach. Simulation-Based Methods for Markov Decision Processes. PhD thesis, MIT, 1998.
.... GPOMDP, a new algorithm for computing arbitrarily accurate approximations to the performance gradient of parameterized partially observable Markov decision processes (POMDPs). Our algorithm is essentially an extension of Williams' REINFORCE algorithm [17] and similar more recent algorithms [7, 5, 9, 8]. See [2, Section 1.1] for a more comprehensive discussion of this related work. More specifically, suppose θ ∈ R^K are the parameters controlling the POMDP. For example, θ could be the parameters of an approximate neural network value function that generates a stochastic policy by some form of ....
....is to reliably navigate the puck from any starting configuration to an arbitrary target location in the minimum time, while only applying discrete forces in the x and y directions. In the third experiment, we use CONJPOMDP to train a controller for the call admission queueing problem treated in [8]. In this case CONJPOMDP finds near-optimal solutions within about 2000 iterations of the underlying queue. In the fourth and final experiment, CONJPOMDP is used to train a switched neural-network controller for a two-dimensional variation on the classical mountain car task [13, Example 8.2] ....
[Article contains additional citation context not shown here]
P. Marbach. Simulation-Based Methods for Markov Decision Processes. PhD thesis, Laboratory for Information and Decision Systems, MIT, 1998.
....states. Instead, it is bounded by the mixing time of the POMDP (loosely, the time needed to approach stationarity), which is always shorter than the recurrence time and often substantially so. Approximate algorithms for computing the gradient were also given in (Marbach & Tsitsiklis, 1998; Marbach, 1998), one that sought to solve the aforementioned recurrence problem by demanding only recurrence to one of a set of recurrent states, and another that abandoned recurrence and used discounting, which is closer in spirit to our algorithm. 2. The Mathematical Framework Our setting is that of an agent ....
Marbach, P. (1998). Simulation-Based Methods for Markov Decision Processes. Doctoral dissertation, Laboratory for Information and Decision Systems, MIT.
....visits to i is a constant. It is easy to prove the stronger result that the expected value of the estimate is the gradient, even when the number of steps is a random variable (see Section 3). Other researchers have investigated algorithms that estimate the gradient of the expected reward [6, 4, 9, 8, 2, 10, 7]. With the exception of [6], these algorithms are all restricted to episodic tasks, or to tasks where the long-term average reward is accurately known. The weakness of approaches that are restricted to episodic tasks arises from the reliance on the identifiable recurrent state i. Although ....
P. Marbach. Simulation-Based Methods for Markov Decision Processes. PhD thesis, Laboratory for Information and Decision Systems, MIT, 1998.
.... in which the gradient estimates were used to optimize the performance of a variety of different MDPs and POMDPs, including a simple three-state Markov chain controlled by a linear function, a two-dimensional puck controlled by a neural network, and the call admission problem treated in [19]. 1.1 Related Work The approach we take in this paper is closely related to certain direct adaptive control schemes that are used to tune (deterministic) controllers for discrete-time systems. A number of authors [14, 12, 15] have presented algorithms for the approximate computation in closed ....
....position, making the recurrence time to i very large. It may be possible to alleviate these problems by judiciously altering the recurrent state as training proceeds, but no algorithms along those lines have been presented. Approximate algorithms for computing the gradient were also given in [20, 19], one that sought to solve the aforementioned recurrence problem by demanding only recurrence to one of a set of recurrent states, and another that abandoned recurrence and used discounting, which is closer in spirit to our algorithm. However, the latter algorithm still used estimates of the ....
P. Marbach. Simulation-Based Methods for Markov Decision Processes. PhD thesis, Laboratory for Information and Decision Systems, MIT, 1998.
.... GPOMDP, a new algorithm for computing arbitrarily accurate approximations to the performance gradient of parameterized partially observable Markov decision processes (POMDPs). Our algorithm is essentially an extension of Williams' REINFORCE algorithm [17] and similar more recent algorithms [7, 5, 9, 8]. More specifically, suppose θ ∈ R^K are the parameters controlling the POMDP. For example, θ could be the parameters of an approximate neural network value function that generates a stochastic policy by some form of randomized look ahead, or θ could be the parameters of an approximate Q ....
....is to reliably navigate the puck from any starting configuration to an arbitrary target location in the minimum time, while only applying discrete forces in the x and y directions. In the third experiment, we use CONJPOMDP to train a controller for the call admission queueing problem treated in [8]. In this case CONJPOMDP finds near-optimal solutions within about 2000 iterations of the underlying queue. In the fourth and final experiment, CONJPOMDP is used to train a switched neural-network controller for a two-dimensional variation on the classical mountain car task [13, Example 8.2] ....
[Article contains additional citation context not shown here]
P. Marbach. Simulation-Based Methods for Markov Decision Processes. PhD thesis, Laboratory for Information and Decision Systems, MIT, 1998.
No context found.
Marbach, P. (1998). Simulation-Based Methods for Markov Decision Processes. Ph.D. thesis, Laboratory for Information and Decision Systems, MIT.
No context found.
Peter Marbach, Simulation-based methods for Markov decision processes, Ph.D. thesis, MIT, Cambridge, MA, 1998.