Results 1 - 5 of 5
Reinforcement Learning in POMDP's via Direct Gradient Ascent
- In Proc. 17th International Conf. on Machine Learning, 2000
Cited by 61 (2 self)
This paper discusses theoretical and experimental aspects of gradient-based approaches to the direct optimization of policy performance in controlled POMDPs. We introduce GPOMDP, a REINFORCE-like algorithm for estimating an approximation to the gradient of the average reward as a function of the parameters of a stochastic policy. The algorithm's chief advantages are that it requires only a single sample path of the underlying Markov chain, it uses only one free parameter β ∈ [0, 1), which has a natural interpretation in terms of bias-variance trade-off, and it requires no knowledge of the underlying state. We prove convergence of GPOMDP and show how the gradient estimates produced by GPOMDP can be used in a conjugate-gradient procedure to find local optima of the average reward. 1. Introduction "Reinforcement learning" is used to describe the general problem of training an agent to choose its actions so as to increase its long-term average reward. The structure of th...
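As a rough illustration of the estimator this abstract describes, the sketch below assumes the commonly stated GPOMDP recursion: an eligibility trace z, discounted by β, accumulates the score function of the policy, and the gradient estimate is the running average of reward times trace. The toy task (a stateless two-action problem with a one-parameter logistic policy), the function name gpomdp_estimate, and all constants are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def gpomdp_estimate(theta, beta, steps, seed=0):
    """Sketch of a GPOMDP-style gradient estimate for a one-parameter policy.

    Hypothetical toy task: action 1 yields reward 1, action 0 yields reward 0,
    and the policy picks action 1 with probability sigmoid(theta).
    """
    rng = random.Random(seed)
    z = 0.0      # eligibility trace, discounted by beta
    delta = 0.0  # running average of reward * trace (the gradient estimate)
    for t in range(steps):
        p1 = 1.0 / (1.0 + math.exp(-theta))  # probability of action 1
        u = 1 if rng.random() < p1 else 0    # sample an action from the policy
        # Score function: d/dtheta log mu(u | theta) for the logistic policy.
        score = (1.0 - p1) if u == 1 else -p1
        r = 1.0 if u == 1 else 0.0           # toy reward (assumption)
        z = beta * z + score
        delta += (r * z - delta) / (t + 1)
    return delta


if __name__ == "__main__":
    print(gpomdp_estimate(theta=0.0, beta=0.0, steps=20000))
```

In this toy task the average reward equals sigmoid(theta), so at theta = 0 the estimate with beta = 0 should settle near the true derivative 0.25. Because the task has no hidden state through which delayed credit must flow, raising beta here only adds variance; the bias-variance trade-off mentioned in the abstract appears in problems where rewards depend on earlier actions.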
Direct gradient-based reinforcement learning: I. gradient estimation algorithms
- National University, 1999
Cited by 51 (3 self)
In [2] we introduced GPOMDP, an algorithm for computing arbitrarily accurate approximations to the performance gradient of parameterized partially observable Markov decision processes (POMDPs). The algorithm’s chief advantages are that it requires only a single sample path of the underlying Markov chain, it uses only one free parameter β ∈ [0, 1), which has a natural interpretation in terms of bias-variance trade-off, and it requires no knowledge of the underlying state. In addition, the algorithm can be applied to infinite state, control and observation spaces.
Experiments with Infinite-Horizon, Policy-Gradient Estimation
- Journal of Artificial Intelligence Research, 2001
Cited by 49 (3 self)
In this paper, we present algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). These algorithms are based on GPOMDP, an algorithm introduced in a companion paper (Baxter & Bartlett, 2001), which computes biased estimates of the performance gradient in POMDPs. The algorithm's chief advantages are that it uses only one free parameter β ∈ [0, 1), which has a natural interpretation in terms of bias-variance trade-off, it requires no knowledge of the underlying state, and it can be applied to infinite state, control and observation spaces. We show how the gradient estimates produced by GPOMDP can be used to perform gradient ascent, both with a traditional stochastic-gradient algorithm, and with an algorithm based on conjugate-gradients that utilizes gradient information to bracket maxima in line searches. Experimental results are presented illustrating both the theoretical results of Baxter and Bartlett (2001) on a toy problem, and practical aspects of the algorithms on a number of more realistic problems.
BASIS CONSTRUCTION AND UTILIZATION FOR MARKOV DECISION PROCESSES USING GRAPHS
- 2010
Cited by 1 (0 self)
The ease or difficulty in solving a problem strongly depends on the way it is represented.
For example, consider the task of multiplying the numbers 12 and 24. Now imagine multiplying
XII and XXIV. Both tasks can be solved, but it is clearly more difficult to use
the Roman numeral representations of twelve and twenty-four. Humans excel at finding
appropriate representations for solving complex problems. This is not true for artificial
systems, which have largely relied on humans to provide appropriate representations. The
ability to autonomously construct useful representations and to efficiently exploit them is
an important challenge for artificial intelligence.
This dissertation builds on a recently introduced graph-based approach to learning representations
for sequential decision-making problems modeled as Markov decision processes
(MDPs). Representations, or basis functions, for MDPs are abstractions of the problem’s
state space and are used to approximate value functions, which quantify the expected
long-term utility obtained by following a policy. The graph-based approach generates basis
functions capturing the structure of the environment. Handling large environments requires
efficiently constructing and utilizing these functions. We address two issues with
this approach: (1) scaling basis construction and value function approximation to large
graphs/data sets, and (2) tailoring the approximation to a specific policy’s value function.
We introduce two algorithms for computing basis functions from large graphs. Both
algorithms work by decomposing the basis construction problem into smaller, more manageable
subproblems. One method determines the subproblems by enforcing block structure,
or groupings of states. The other method uses recursion to solve subproblems which
are then used for approximating the original problem. Both algorithms result in a set of basis
functions from which we employ basis selection algorithms. The selection algorithms
represent the value function with as few basis functions as possible, thereby reducing the
computational complexity of value function approximation and preventing overfitting.
The use of basis selection algorithms not only addresses the scaling problem but also
allows for tailoring the approximation to a specific policy. This results in a more accurate
representation than obtained when using the same subset of basis functions irrespective of
the policy being evaluated. To make effective use of the data, we develop a hybrid least-squares
algorithm for setting basis function coefficients. This algorithm is a parametric
combination of two common least-squares methods used for MDPs. We provide a geometric
and analytical interpretation of these methods and demonstrate the hybrid algorithm’s
ability to discover improved policies. We also show how the algorithm can include graph-based
regularization to help with sparse samples from stochastic environments.
This work investigates all aspects of linear value function approximation: constructing
a dictionary of basis functions, selecting a subset of basis functions from the dictionary,
and setting the coefficients on the selected basis functions. We empirically evaluate each
of these contributions in isolation and in one combined architecture.
Analysis and Improvement of Policy Gradient Estimation
Policy gradient is a useful model-free reinforcement learning approach, but it tends to suffer from instability of gradient estimates. In this paper, we analyze and improve the stability of policy gradient methods. We first prove that the variance of gradient estimates in the PGPE (policy gradients with parameter-based exploration) method is smaller than that of the classical REINFORCE method under a mild assumption. We then derive the optimal baseline for PGPE, which contributes to further reducing the variance. We also show theoretically that PGPE with the optimal baseline is preferable to REINFORCE with the optimal baseline in terms of the variance of gradient estimates. Finally, we demonstrate the usefulness of the improved PGPE method through experiments.
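A minimal sketch of the parameter-based exploration idea named in this abstract, assuming a Gaussian distribution over a single policy parameter with mean eta and fixed standard deviation tau. The return function toy_return, the function name pgpe_gradient, and all constants are hypothetical. The baseline used here is the plain sample mean of the returns, chosen for simplicity; it is not the optimal baseline derived in the paper.

```python
import random


def toy_return(theta):
    # Hypothetical deterministic return, maximized at theta = 1 (assumption).
    return -(theta - 1.0) ** 2


def pgpe_gradient(eta, tau, samples, seed=0):
    """Estimate dJ/d(eta) where J(eta) = E[toy_return(theta)], theta ~ N(eta, tau^2)."""
    rng = random.Random(seed)
    # Exploration happens in parameter space: draw whole policy parameters.
    thetas = [rng.gauss(eta, tau) for _ in range(samples)]
    returns = [toy_return(th) for th in thetas]
    # Simple mean-return baseline to reduce variance (not the optimal baseline).
    baseline = sum(returns) / samples
    grad = 0.0
    for th, ret in zip(thetas, returns):
        # (th - eta) / tau^2 is d/d(eta) of log N(th; eta, tau^2).
        grad += (ret - baseline) * (th - eta) / (tau ** 2)
    return grad / samples


if __name__ == "__main__":
    print(pgpe_gradient(eta=0.0, tau=1.0, samples=20000))
```

For this toy return, J(eta) = -((eta - 1)^2 + tau^2), so the exact gradient at eta = 0 is 2 and the estimate should fall close to that value.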

