Results 1-10 of 113
Fast gradient-descent methods for temporal-difference learning with linear function approximation
In Danyluk et al., 2009
Toward Off-Policy Learning Control with Function Approximation
Cited by 50 (7 self)
We present the first temporal-difference learning algorithm for off-policy control with unrestricted linear function approximation whose per-time-step complexity is linear in the number of features. Our algorithm, Greedy-GQ, is an extension of recent work on gradient temporal-difference learning, which has hitherto been restricted to a prediction (policy evaluation) setting, to a control setting in which the target policy is greedy with respect to a linear approximation to the optimal action-value function. A limitation of our control setting is that we require the behavior policy to be stationary. We call this setting latent learning because the optimal policy, though learned, is not manifest in behavior. Popular off-policy algorithms such as Q-learning are known to be unstable in this setting when used with linear function approximation. In reinforcement learning, the term “off-policy learning” refers to learning about one way of behaving, called the target policy, from data generated by another way of selecting actions, called the behavior policy. The target policy is often an approximation to the optimal policy, which is typically deterministic, whereas the behavior policy is often stochastic, exploring all possible actions in each state as part of finding the optimal policy. Freeing the behavior policy from the target policy enables a greater variety of exploration strategies to be used. It also enables learning from training data generated by unrelated controllers, including manual human control, and from previously collected data. A third reason for interest in off-policy learning is that it permits learning about multiple target policies (e.g., optimal policies for multiple subgoals) from a single stream of data generated by a ...
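The two-timescale update the abstract alludes to can be sketched in simplified form. Below is a minimal, illustrative Greedy-GQ-style step: the main weights theta follow the TD error with a gradient-correction term, while a secondary weight vector w tracks the projected TD error. The two-state, two-action MDP, the one-hot features, and the step sizes are all invented for the example; this is a sketch of the idea, not the paper's exact algorithm.

```python
import random

def phi(s, a):
    # One-hot features over (state, action) pairs; an illustrative choice.
    f = [0.0] * 4
    f[2 * s + a] = 1.0
    return f

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def greedy_gq_step(theta, w, s, a, r, s2, alpha=0.05, beta=0.1, gamma=0.9):
    # Greedy action at the next state under the current theta.
    a2 = max((0, 1), key=lambda b: dot(theta, phi(s2, b)))
    f, f2 = phi(s, a), phi(s2, a2)
    delta = r + gamma * dot(theta, f2) - dot(theta, f)   # TD error
    corr = gamma * dot(w, f)                             # gradient correction
    theta = [t + alpha * (delta * fi - corr * f2i)
             for t, fi, f2i in zip(theta, f, f2)]
    # Secondary weights on a faster timescale track the projected TD error.
    w = [wi + beta * (delta - dot(w, f)) * fi
         for wi, fi in zip(w, f)]
    return theta, w

random.seed(0)
theta, w = [0.0] * 4, [0.0] * 4
s = 0
for _ in range(2000):
    a = random.randrange(2)                  # stationary behavior policy
    s2 = random.randrange(2)                 # toy random transitions
    r = 1.0 if (s == 1 and a == 1) else 0.0  # toy reward
    theta, w = greedy_gq_step(theta, w, s, a, r, s2)
    s = s2
print(all(abs(t) < 100 for t in theta))  # weights stay bounded
```

With rewards in [0, 1] and gamma = 0.9 the learned action values are bounded, so the weights remain small throughout the run.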
Regularized Policy Iteration
Cited by 46 (8 self)
In this paper we consider approximate policy-iteration-based reinforcement learning algorithms. In order to implement a flexible function approximation scheme we propose the use of nonparametric methods with regularization, providing a convenient way to control the complexity of the function approximator. We propose two novel regularized policy iteration algorithms by adding L2-regularization to two widely used policy evaluation methods: Bellman residual minimization (BRM) and least-squares temporal difference learning (LSTD). We derive efficient implementations for our algorithms when the approximate value functions belong to a reproducing kernel Hilbert space. We also provide finite-sample performance bounds for our algorithms and show that they are able to achieve optimal rates of convergence under the studied conditions.
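In the plain linear (rather than RKHS) case, L2-regularized LSTD has a simple closed form: theta = (A + lam*I)^(-1) b with A = sum of phi (phi - gamma*phi')^T and b = sum of phi * r. The sketch below shows this on a made-up two-state chain with one-hot features; the chain, features, and regularization weight are illustrative assumptions, not the paper's kernel formulation.

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting for a small linear system.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def regularized_lstd(phis, next_phis, rewards, gamma=0.9, lam=0.1):
    # Build A = lam*I + sum phi (phi - gamma*phi')^T and b = sum phi*r.
    d = len(phis[0])
    A = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for f, f2, r in zip(phis, next_phis, rewards):
        for i in range(d):
            b[i] += f[i] * r
            for j in range(d):
                A[i][j] += f[i] * (f[j] - gamma * f2[j])
    return solve(A, b)

# Transitions from a toy 2-state chain with one-hot features:
# s0 -> s1 (reward 0) and s1 -> s0 (reward 1), each seen twice.
phis      = [[1, 0], [0, 1], [1, 0], [0, 1]]
next_phis = [[0, 1], [1, 0], [0, 1], [1, 0]]
rewards   = [0.0, 1.0, 0.0, 1.0]
theta = regularized_lstd(phis, next_phis, rewards)
print([round(t, 3) for t in theta])
```

The regularizer shrinks the solution slightly toward zero relative to the unregularized LSTD values, which is exactly the complexity control the abstract describes.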
Should one compute the Temporal Difference fix point or minimize the Bellman Residual? The unified oblique projection view
Cited by 31 (5 self)
We investigate projection methods for evaluating a linear approximation of the value function of a policy in a Markov Decision Process context. We consider two popular approaches, the one-step Temporal Difference fix-point computation (TD(0)) and Bellman Residual (BR) minimization. We describe examples where each method outperforms the other. We highlight a simple relation between the objective functions they minimize, and show that while BR enjoys a performance guarantee, TD(0) does not in general. We then propose a unified view in terms of oblique projections of the Bellman equation, which substantially simplifies and extends the characterization of Schoknecht (2002) and the recent analysis of Yu & Bertsekas (2008). Finally, we describe some simulations suggesting that although the TD(0) solution is usually slightly better than the BR solution, its inherent numerical instability makes it very bad in some cases, and thus worse on average.
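The contrast between the two solutions is easy to exhibit numerically. The sketch below compares the TD(0) fixed point with the BR minimizer on a made-up two-state chain with a single shared feature (phi(s0) = 1, phi(s1) = 2); the chain, feature, and rewards are invented here, not taken from the paper, but they show the two answers can differ sharply.

```python
gamma = 0.9
# Transitions sampled uniformly: (phi(s), phi(s'), reward)
data = [(1.0, 2.0, 0.0),   # s0 -> s1
        (2.0, 2.0, 1.0)]   # s1 -> s1 (self-loop, reward 1)

# TD(0) fixed point: solve  sum phi*(phi - gamma*phi') * theta = sum phi*r
a_td = sum(f * (f - gamma * f2) for f, f2, _ in data)
b_td = sum(f * r for f, _, r in data)
theta_td = b_td / a_td

# BR minimizer: least squares on  (phi - gamma*phi') * theta = r
a_br = sum((f - gamma * f2) ** 2 for f, f2, _ in data)
b_br = sum((f - gamma * f2) * r for f, f2, r in data)
theta_br = b_br / a_br

print(round(theta_td, 3), round(theta_br, 3))  # the two disagree
```

Here the true values are positive, yet the TD(0) fixed point comes out negative because its linear system is nearly singular, while the BR solution stays positive: a one-line illustration of the numerical instability the abstract mentions.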
Analysis of a classification-based policy iteration algorithm
In: Proceedings of the 27th International Conference on Machine Learning, 2010
Cited by 31 (9 self)
We present a classification-based policy iteration algorithm, called Direct Policy Iteration, and provide its finite-sample analysis. Our results state a performance bound in terms of the number of policy improvement steps, the number of rollouts used in each iteration, the capacity of the considered policy space, and a new capacity measure which indicates how well the policy space can approximate policies that are greedy w.r.t. any of its members. The analysis reveals a trade-off between the estimation and approximation errors in this classification-based policy iteration setting. We also study the consistency of the method when there exists a sequence of policy spaces with increasing capacity.
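One classification-based improvement step, in the spirit of the algorithm above, can be sketched as: estimate action values at sampled states by Monte Carlo rollouts, label each state with its greedy action, and fit a policy from a restricted class to those labels. The chain MDP, rollout counts, and the trivial "threshold classifier" policy class below are all illustrative assumptions.

```python
N_STATES, GAMMA = 5, 0.9

def step(s, a):
    # Toy chain: action 1 moves right, action 0 moves left;
    # reward only on reaching the rightmost state.
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

def rollout(s, a, policy, horizon=20):
    # Discounted return of taking a at s, then following the policy.
    # (Dynamics here are deterministic, so repeated rollouts coincide.)
    ret, disc = 0.0, 1.0
    s, r = step(s, a)
    ret += r
    for _ in range(horizon):
        disc *= GAMMA
        s, r = step(s, policy(s))
        ret += disc * r
    return ret

def improve(policy, n_rollouts=10):
    labels = {}
    for s in range(N_STATES):
        q = [sum(rollout(s, a, policy) for _ in range(n_rollouts)) / n_rollouts
             for a in (0, 1)]
        labels[s] = max((0, 1), key=lambda a: q[a])
    # "Fit" the simplest classifier: the threshold policy that agrees
    # with the greedy labels on the most sampled states.
    best_t = max(range(N_STATES + 1),
                 key=lambda t: sum(labels[s] == (1 if s >= t else 0)
                                   for s in range(N_STATES)))
    return lambda s, t=best_t: 1 if s >= t else 0

pi = lambda s: 0        # start from "always go left"
pi = improve(pi)
print([pi(s) for s in range(N_STATES)])  # -> [0, 0, 0, 1, 1]
```

With the myopic starting policy, rollouts only reveal the reward from states adjacent to the goal, so one improvement step learns to go right near the goal; repeating the step would propagate that behavior back through the chain.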
Finite-Time Bounds for Fitted Value Iteration
Cited by 30 (2 self)
In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) for solving infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaved in the sense that, by using a sufficiently large number of samples for a large class of MDPs, arbitrarily good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted Lp-norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is “aligned” with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings.
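The basic loop of sampling-based FVI is: evaluate Bellman backups at sampled states using the generative model, then fit the function approximator to the backed-up values. The sketch below uses a made-up deterministic chain and the simplest possible regressor (a 1-D least-squares line) standing in for the rich function classes the paper analyzes.

```python
N_STATES, GAMMA = 10, 0.9

def model(s, a):
    # Generative model of a toy chain: action 1 moves right, action 0
    # moves left; reward on reaching the right end.
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

def fit_linear(xs, ys):
    # Least-squares line y = w0 + w1*x: the "function approximator".
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - w1 * mx, w1

w0, w1 = 0.0, 0.0
for _ in range(50):
    xs = list(range(N_STATES))                    # "sampled" states
    ys = [max(r + GAMMA * (w0 + w1 * s2)          # Bellman backup per state
              for s2, r in (model(s, a) for a in (0, 1)))
          for s in xs]
    w0, w1 = fit_linear(xs, ys)                   # fit to backed-up values
print(w1 > 0)  # fitted value increases toward the rewarding end
```

In this well-aligned case the backup-then-fit map is a contraction, so the iterates converge to a fixed line with positive slope; the paper's bounds quantify how the approximation error of the fitter and the number of samples control the quality of this fixed point in general.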
Rollout sampling approximate policy iteration
Machine Learning, 2008
Cited by 30 (4 self)
Several researchers have recently investigated the connection between reinforcement learning and classification. We are motivated by proposals of approximate policy iteration schemes without value functions, which focus on policy representation using classifiers and address policy learning as a supervised learning problem. This paper proposes variants of an improved policy iteration scheme which addresses the core sampling problem in evaluating a policy through simulation as a multi-armed bandit machine. The resulting algorithm offers performance comparable to the previous algorithm, achieved, however, with significantly less computational effort. An order-of-magnitude improvement is demonstrated experimentally in two standard reinforcement learning domains: inverted pendulum and mountain-car.
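The bandit view of rollout allocation can be sketched with a plain UCB1 rule: simulation effort concentrates on the action whose value estimate is still uncertain, instead of splitting rollouts evenly. The two "arms" below stand in for two actions at one rollout state; their means, the noise level, and the pull budget are invented for the illustration, and the paper's actual allocation scheme is more refined.

```python
import math
import random

random.seed(0)

def pull(a):
    # Stand-in for one noisy rollout return of action a at a fixed state.
    return random.gauss([0.4, 0.6][a], 0.3)

counts, sums = [0, 0], [0.0, 0.0]
for t in range(1, 501):
    if 0 in counts:
        a = counts.index(0)                  # pull each arm once first
    else:
        a = max((0, 1), key=lambda i: sums[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]))  # UCB1 score
    r = pull(a)
    counts[a] += 1
    sums[a] += r

greedy = max((0, 1), key=lambda i: sums[i] / counts[i])
print(greedy, counts[1] > counts[0])
```

Most of the budget ends up on the better arm while the weaker arm still gets enough pulls to rule it out, which is precisely the saving over uniform rollout allocation that the abstract claims.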
Finite-Sample Analysis of LSTD
Cited by 30 (13 self)
In this paper we consider the problem of policy evaluation in reinforcement learning, i.e., learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning algorithm. We report a finite-sample analysis of LSTD. We first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing.
Managing power consumption and performance of computing systems using reinforcement learning
In: Advances in Neural Information Processing Systems 20, 2008
Cited by 28 (1 self)
Electrical power management in large-scale IT systems such as commercial datacenters is an application area of rapidly growing interest from both an economic and ecological perspective, with billions of dollars and millions of metric tons of CO2 emissions at stake annually. Businesses want to save power without sacrificing performance. This paper presents a reinforcement learning approach to simultaneous online management of both performance and power consumption. We apply RL in a realistic laboratory testbed using a Blade cluster and dynamically varying HTTP workload running on a commercial web applications middleware platform. We embed a CPU frequency controller in the Blade servers’ firmware, and we train policies for this controller using a multicriteria reward signal depending on both application performance and CPU power consumption. Our testbed scenario posed a number of challenges to the successful use of RL, including multiple disparate reward functions, limited decision sampling rates, and pathologies arising when using multiple sensor readings as state variables. We describe innovative practical solutions to these challenges, and demonstrate clear performance improvements over both hand-designed policies and obvious “cookbook” RL implementations.