Results 1–10 of 38
Toward Off-Policy Learning Control with Function Approximation
Abstract

Cited by 50 (7 self)
We present the first temporal-difference learning algorithm for off-policy control with unrestricted linear function approximation whose per-time-step complexity is linear in the number of features. Our algorithm, Greedy-GQ, is an extension of recent work on gradient temporal-difference learning, which has hitherto been restricted to a prediction (policy evaluation) setting, to a control setting in which the target policy is greedy with respect to a linear approximation to the optimal action-value function. A limitation of our control setting is that we require the behavior policy to be stationary. We call this setting latent learning because the optimal policy, though learned, is not manifest in behavior. Popular off-policy algorithms such as Q-learning are known to be unstable in this setting when used with linear function approximation. In reinforcement learning, the term “off-policy learning” refers to learning about one way of behaving, called the target policy, from data generated by another way of selecting actions, called the behavior policy. The target policy is often an approximation to the optimal policy, which is typically deterministic, whereas the behavior policy is often stochastic, exploring all possible actions in each state as part of finding the optimal policy. Freeing the behavior policy from the target policy enables a greater variety of exploration strategies to be used. It also enables learning from training data generated by unrelated controllers, including manual human control, and from previously collected data. A third reason for interest in off-policy learning is that it permits learning about multiple target policies (e.g., optimal policies for multiple subgoals) from a single stream of data generated by a …
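The split between a stochastic behavior policy and a greedy target policy described above can be illustrated with a minimal sketch. This is plain off-policy Q-learning with linear features, exactly the kind of update the abstract notes can diverge with general linear function approximation; it is not Greedy-GQ itself, whose gradient-correction term is omitted. The 4-state chain, one-hot features, and all constants are illustrative assumptions; note the per-time-step update cost is linear in the number of features.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_actions = 4, 2
w = np.zeros((n_actions, n_features))  # one linear weight vector per action
alpha, gamma, eps = 0.1, 0.9, 0.3

def features(s):
    # hypothetical one-hot features for a 4-state toy chain
    phi = np.zeros(n_features)
    phi[s] = 1.0
    return phi

def q(s, a):
    return w[a] @ features(s)

def behavior_policy(s):
    # stochastic behavior policy: epsilon-greedy exploration
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax([q(s, a) for a in range(n_actions)]))

def step(s, a):
    # toy deterministic chain: action 1 moves right, reward at the right end
    s2 = min(s + 1, n_features - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == n_features - 1 else 0.0
    return s2, r

s = 0
for _ in range(10000):
    a = behavior_policy(s)          # actions come from the behavior policy
    s2, r = step(s, a)
    # target policy is greedy: bootstrap on max_b Q(s', b)
    td_error = r + gamma * max(q(s2, b) for b in range(n_actions)) - q(s, a)
    w[a] += alpha * td_error * features(s)   # O(n_features) per step
    s = 0 if s2 == n_features - 1 else s2    # reset at the terminal state
```

After training, moving right from the start state should be valued above moving left, even though the behavior policy frequently acted otherwise.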
A unifying framework for computational reinforcement learning theory
, 2009
Abstract

Cited by 23 (7 self)
Computational learning theory studies mathematical models that allow one to formally analyze and compare the performance of supervised-learning algorithms, for example in terms of their sample complexity. While existing models such as PAC (Probably Approximately Correct) learning have played an influential role in understanding the nature of supervised learning, they have not been as successful in reinforcement learning (RL). Here, the fundamental barrier is the need for active exploration in sequential decision problems. An RL agent tries to maximize long-term utility by exploiting its knowledge about the problem, but this knowledge has to be acquired by the agent itself through exploration, which may reduce short-term utility. The need for active exploration is common in many problems in daily life, engineering, and science. For example, a Backgammon program strives to make good moves to maximize the probability of winning a game, but it may sometimes try novel and possibly harmful moves to discover how the opponent reacts, in the hope of discovering a better game-playing strategy. It has been known since the early days of RL that a good trade-off between exploration and exploitation is critical for the agent to learn fast (i.e., to reach near-optimal strategies …
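The exploration-exploitation trade-off the abstract describes can be sketched on a hypothetical two-armed bandit with an ε-greedy agent. The arm means, ε, and step count are all illustrative assumptions: exploring occasionally sacrifices short-term reward but lets the agent discover which arm is actually better.

```python
import numpy as np

rng = np.random.default_rng(1)

# two-armed bandit: arm 1 is better on average, but the agent
# must explore to find that out
true_means = [0.3, 0.7]
estimates = [0.0, 0.0]
counts = [0, 0]
eps = 0.1

for t in range(5000):
    if rng.random() < eps:
        a = int(rng.integers(2))       # explore: may reduce short-term utility
    else:
        a = int(np.argmax(estimates))  # exploit current knowledge
    r = float(rng.random() < true_means[a])  # Bernoulli reward
    counts[a] += 1
    estimates[a] += (r - estimates[a]) / counts[a]  # incremental sample mean
```

With even a small ε, the agent ends up pulling the better arm most of the time and its estimate of that arm's mean converges.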
Model-Free Reinforcement Learning as Mixture Learning
Abstract

Cited by 19 (2 self)
We cast model-free reinforcement learning as the problem of maximizing the likelihood of a probabilistic mixture model via sampling, addressing both the infinite- and finite-horizon cases. We describe a Stochastic Approximation EM algorithm for likelihood maximization that, in the tabular case, is equivalent to a non-bootstrapping optimistic policy iteration algorithm such as Sarsa(1) and can be applied in both MDPs and POMDPs. On the theoretical side, by relating the proposed stochastic EM algorithm to the family of optimistic policy iteration algorithms, we provide new tools that permit the design and analysis of algorithms in that family. On the practical side, preliminary experiments on a POMDP problem demonstrated encouraging results.
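As a rough illustration of what a non-bootstrapping, Sarsa(1)-style learner looks like in the tabular case, here is textbook Sarsa(λ) with λ = 1 (accumulating traces) on a toy chain. This is not the paper's EM algorithm; the environment, the large ε (chosen to keep the toy episodes short), and all constants are illustrative assumptions. With λ = 1 every visited state-action pair is updated toward the full observed return, rather than a bootstrapped one-step target.

```python
import numpy as np

rng = np.random.default_rng(2)

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, lam, eps = 0.1, 0.95, 1.0, 0.5  # lam=1: Monte-Carlo-like

def policy(s):
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def step(s, a):
    # toy chain: action 1 moves right, reward on reaching the right end
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == n_states - 1 else 0.0
    return s2, r, s2 == n_states - 1

for episode in range(500):
    e = np.zeros_like(Q)  # eligibility traces
    s, a = 0, policy(0)
    done = False
    while not done:
        s2, r, done = step(s, a)
        a2 = policy(s2)
        delta = r + (0.0 if done else gamma * Q[s2, a2]) - Q[s, a]
        e[s, a] += 1.0            # accumulating trace
        Q += alpha * delta * e    # every visited pair moves toward the return
        e *= gamma * lam          # with lam=1 traces decay only by gamma
        s, a = s2, a2
```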
Q-learning and Pontryagin’s Minimum Principle
Abstract

Cited by 12 (5 self)
Q-learning is a technique used to compute an optimal policy for a controlled Markov chain based on observations of the system controlled using a non-optimal policy. It has proven to be effective for models with finite state and action spaces. This paper establishes connections between Q-learning and nonlinear control of continuous-time models with general state space and general action space. The main contributions are summarized as follows. (i) The starting point is the observation that the “Q-function” appearing in Q-learning algorithms is an extension of the Hamiltonian that appears in the Minimum Principle. Based on this observation we introduce the steepest-descent Q-learning (SDQ-learning) algorithm to obtain the optimal approximation of the Hamiltonian within a prescribed finite-dimensional function class. (ii) A transformation of the optimality equations is performed based on the adjoint of a resolvent operator. This is used to construct a consistent algorithm, based on stochastic approximation, that requires only causal filtering of the time-series data. (iii) Several examples are presented to illustrate the application of these techniques, including application to distributed control of multi-agent systems.
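Observation (i) can be made concrete in standard notation. The following is a sketch under the usual assumptions of a continuous-time system with dynamics f, running cost c, and optimal value function J*; the exact definitions in the paper itself may differ in detail.

```latex
% Assumed setting: \dot{x} = f(x,u), running cost c(x,u),
% optimal value function J^*(x).
H(x, u, p) = c(x, u) + p^{\top} f(x, u)
% is the Minimum Principle Hamiltonian; evaluating the costate at
% p = \nabla J^*(x) gives a continuous-time ``Q-function'':
Q(x, u) = c(x, u) + \nabla J^*(x)^{\top} f(x, u)
% and the HJB equation (undiscounted form) says its minimum vanishes:
\min_{u} Q(x, u) = 0
```

In this reading, Q-learning's minimization over actions and the Minimum Principle's minimization of the Hamiltonian are the same operation.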
Reinforcement learning algorithms for MDPs
, 2009
Abstract

Cited by 11 (0 self)
This article presents a survey of reinforcement learning algorithms for Markov Decision Processes (MDPs). In the first half of the article, the problem of value estimation is considered. Here we start by describing the idea of bootstrapping and temporal-difference learning. Next, we compare incremental and batch algorithmic variants and discuss the impact of the choice of function approximation method on the success of learning. In the second half, we describe methods that target the problem of learning to control an MDP. Here, online and active learning are discussed first, followed by a description of direct and actor-critic methods.
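The bootstrapping idea mentioned above, updating a value estimate toward a target built from the current estimate of the successor state's value, can be sketched with tabular TD(0) on a hypothetical three-state chain; the environment and step size are illustrative assumptions.

```python
import numpy as np

# 3-state Markov reward process: 0 -> 1 -> 2 (terminal),
# reward 1 on entering the terminal state
n_states = 3
V = np.zeros(n_states)
alpha, gamma = 0.1, 1.0

for _ in range(1000):
    s = 0
    while s != 2:
        s2 = s + 1
        r = 1.0 if s2 == 2 else 0.0
        # bootstrap: the target uses the current estimate V[s2],
        # not the actual long-term return
        target = r + (0.0 if s2 == 2 else gamma * V[s2])
        V[s] += alpha * (target - V[s])
        s = s2
```

Here both non-terminal states should converge to value 1, the return eventually collected from each of them.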
Parametric Value Function Approximation: a Unified View
Abstract

Cited by 10 (6 self)
Reinforcement learning (RL) is a machine learning answer to the optimal control problem. It consists of learning an optimal control policy through interactions with the system to be controlled, the quality of this policy being quantified by the so-called value function. An important RL subtopic is to approximate this function when the system is too large for an exact representation. This survey reviews and unifies state-of-the-art methods for parametric value function approximation by grouping them into three main categories: bootstrapping, residual, and projected fixed-point approaches. Related algorithms are derived by considering one of the associated cost functions and a specific way to minimize it, almost always a stochastic gradient descent or a recursive least-squares approach. Index Terms: Reinforcement learning, value function approximation, survey.
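The three categories can be summarized by the cost function each one minimizes. The following is a schematic sketch in generic notation, not the survey's exact formulation: T̂ denotes a (sampled) Bellman operator, Π the projection onto the span of the features, and V_θ the parametric value estimate.

```latex
% Bootstrapping: chase a fixed target built from the previous parameters
J_{\mathrm{boot}}(\theta) = \big\| V_\theta - \hat{T} V_{\theta_{\mathrm{old}}} \big\|^2
% Residual: minimize the Bellman residual directly
J_{\mathrm{res}}(\theta) = \big\| V_\theta - \hat{T} V_\theta \big\|^2
% Projected fixed point: seek V_\theta = \Pi \hat{T} V_\theta
J_{\mathrm{proj}}(\theta) = \big\| V_\theta - \Pi \hat{T} V_\theta \big\|^2
```

Pairing one of these cost functions with a minimization scheme (stochastic gradient descent or recursive least squares) then recovers the individual algorithms.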
Dynamic Policy Programming
 Journal of Machine Learning Research
Abstract

Cited by 7 (1 self)
The following full text is a preprint version which may differ from the publisher's version.
A Brief Survey of Parametric Value Function Approximation
Abstract

Cited by 6 (2 self)
Reinforcement learning is a machine learning answer to the optimal control problem. It consists of learning an optimal control policy through interactions with the system to be controlled, the quality of this policy being quantified by the so-called value function. An important subtopic of reinforcement learning is to compute an approximation of this value function when the system is too large for an exact representation. This survey reviews state-of-the-art methods for (parametric) value function approximation by grouping them into three main categories: bootstrapping, residual, and projected fixed-point approaches. Related algorithms are derived by considering one of the associated cost functions and a specific way to minimize it, almost always a stochastic gradient descent or a recursive …
Convergence analysis of on-policy LSPI for multi-dimensional continuous state- and action-space MDPs and extension with orthogonal polynomial approximation. Working paper
, 2010
Abstract

Cited by 5 (1 self)
We propose an online, on-policy least-squares policy iteration (LSPI) algorithm which can be applied to infinite-horizon problems where states and controls are vector-valued and continuous. We do not use special structure such as linear, additive noise, and we assume that the expectation cannot be computed exactly. We use the concept of the post-decision state variable to eliminate the expectation inside the optimization problem. We provide a formal convergence analysis of the algorithm under the assumption that value functions are spanned by finitely many known basis functions. Furthermore, the convergence result extends to the … Central to the solution of Markov decision processes is Bellman’s equation, which is often written in the standard form (Puterman, 1994): V_t(x_t) = max_{u_t ∈ U} {C(x_t, u_t) + γ Σ …
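The quotation breaks off mid-equation; what it begins to state is the standard finite-horizon Bellman recursion, which in the usual notation of Puterman (1994) presumably continues as follows (the transition kernel P is an assumption about the elided part):

```latex
V_t(x_t) = \max_{u_t \in \mathcal{U}} \Big\{ C(x_t, u_t)
    + \gamma \sum_{x'} P(x' \mid x_t, u_t) \, V_{t+1}(x') \Big\}
```

The sum over successor states x' is the expectation that the post-decision state variable mentioned in the abstract is designed to move outside the maximization.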