
## Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path (2008)


Venue: Machine Learning Journal (2008) 71:89-129

Citations: 113 (21 self)

### Citations

5603 | Reinforcement Learning: An Introduction
- Sutton, Barto
- 1998
Citation Context ...nction approximation goes back to the early days of dynamic programming [11, 12]. With the recent growth of interest in reinforcement learning, work on value function approximation methods flourished [13, 14]. Recent theoretical results mostly concern supremum-norm approximation errors [15, 16], where the main condition on the way intermediate iterates are mapped (projected) to the function space is that ... |

956 | Markov chains and stochastic stability
- Meyn, Tweedie
- 2009
Citation Context ...m → ∞. Note that there exist many other definitions of mixing. The weakest among those most commonly used is called α-mixing. Another commonly used one is φ-mixing, which is stronger than β-mixing (see [6]). A β-mixing process is said to mix at an exponential rate with parameters b, κ > 0 if β_m = O(exp(−b m^κ)). Assumption 2 (Sample Path Properties). Assume that {(X... |
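For reference, a common (though not the only) definition of the β-mixing coefficients for a stationary process {Z_t}, sketched here from standard usage rather than quoted from [6]:

```latex
\beta_m = \sup_{t \ge 1}\, \mathbb{E}\left[ \sup_{B \in \sigma(Z_{t+m}, Z_{t+m+1}, \ldots)} \left| P\big(B \mid \sigma(Z_1, \ldots, Z_t)\big) - P(B) \right| \right]
```

The process mixes at an exponential rate with parameters b, κ > 0 when β_m = O(exp(−b m^κ)), which is the condition the snippet states.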

828 | Convergence of Stochastic Processes - Pollard - 1984 |

778 | Some Studies in Machine Learning using the Game of Checkers, reprinted
- Samuel
- 1990
Citation Context ...2, for all 0 ≤ k < K < δ. Applying Lemma 9 with η = ε/2 ends the proof. ⊓⊔ 5 Discussion and Related Work The idea of using value function approximation goes back to the early days of dynamic programming [11, 12]. With the recent growth of interest in reinforcement learning, work on value function approximation methods flourished [13, 14]. Recent theoretical results mostly concern supremum-norm approximation ... |

461 | Least-squares policy iteration
- Lagoudakis, Parr
Citation Context ...alued. The algorithm considered is an iterative procedure where each iteration involves solving a least-squares problem, similar to the Least-Squares Policy Iteration algorithm of Lagoudakis and Parr [1]. However, whilst Lagoudakis and Parr considered the so-called least-squares fixed-point approximation to avoid problems with Bellman-residual minimization in the case of correlated samples, we ... |
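The contrast the snippet draws can be illustrated on expected (noise-free) features with linear function approximation; a minimal sketch, assuming a hypothetical 3-state chain under a fixed policy with tabular features (the example MDP and all names are illustrative, not from the paper):

```python
import numpy as np

def brm_weights(phi, phi_next, r, gamma):
    # Bellman-residual minimization: min_w || phi w - (r + gamma * phi_next w) ||^2
    A = phi - gamma * phi_next
    return np.linalg.lstsq(A, r, rcond=None)[0]

def lstd_weights(phi, phi_next, r, gamma):
    # Least-squares fixed-point (LSTD): solve phi^T (phi - gamma * phi_next) w = phi^T r
    A = phi.T @ (phi - gamma * phi_next)
    return np.linalg.solve(A, phi.T @ r)

# Hypothetical 3-state chain under a fixed policy.
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.9, 0.1],
              [0.1, 0.0, 0.9]])
r = np.array([0.0, 0.0, 1.0])
gamma = 0.95
phi = np.eye(3)           # one indicator feature per state (tabular case)
phi_next = P @ phi        # expected next-state features under the policy

v_true = np.linalg.solve(np.eye(3) - gamma * P, r)
w_brm = brm_weights(phi, phi_next, r, gamma)
w_lstd = lstd_weights(phi, phi_next, r, gamma)
```

With expected features both estimators recover the true value function. With sampled (noisy) next states they no longer coincide: the squared Bellman residual picks up a variance term, which is what motivates the fixed-point variant and, in the paper above, the modified residual.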

440 | Introduction to approximation theory - Cheney - 1966 |

416 | Neural Network Learning – Theoretical Foundations
- Anthony, Bartlett
- 1999
Citation Context ...e continuous w.r.t. ν. 2 During the course of the proof, we will need several capacity concepts of function sets. Here we assume that the reader is familiar with the concept of VC-dimension (see, e.g., [7]), but we introduce covering numbers because slightly different definitions of it exist in the literature: for a semi-metric space (M, d) and for each ε > 0, define the covering number N(ε, M, d) as the smalle... |
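The covering number N(ε, M, d) from the snippet (the smallest number of ε-balls needed to cover M) can be bounded empirically by a greedy construction; a toy sketch for a finite point set, not taken from the paper:

```python
import numpy as np

def greedy_cover_size(points, eps, dist=lambda a, b: abs(a - b)):
    """Greedily pick centers so every point lies within eps of some center.
    The selected centers are pairwise > eps apart (an eps-packing), so the
    result lies between the covering number N(eps, M, d) and the eps-packing
    number of the point set."""
    centers = []
    for p in points:
        if all(dist(p, c) > eps for c in centers):
            centers.append(p)
    return len(centers)

# The unit interval sampled on a fine grid; the true N(0.1, [0,1], |.|) is 5,
# and the greedy bound can be at most the 0.1-packing number (about 10).
grid = np.linspace(0.0, 1.0, 1001)
size = greedy_cover_size(grid, 0.1)
```

The gap between the greedy bound and the true covering number is exactly the covering/packing slack that capacity arguments like the ones in the proof routinely absorb into constants.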

404 | Mixing - Properties and Examples - Doukhan - 1994 |

324 | A Distribution-Free Theory of Nonparametric Regression - Györfi, Kohler, et al. - 2002 |

281 | Towards a modern theory of adaptive networks: expectation and prediction
- Sutton, Barto
- 1981
Citation Context ...N(f; π̂). At first sight, the choice of L̂N seems logical, as for any given Xt, At and f, Rt + γf(Xt+1, π̂(Xt+1)) is an unbiased estimate of (T^π̂ f)(Xt, At). However, as is well known (see, e.g., [4, pp. 220], [5, 1]), L̂N is not a “proper” approximation to the corresponding L2 Bellman-error: E[L̂N(f; π̂)] ≠ L(f; π̂). In fact, an elementary calculation shows that for Y ∼ P(·|x, a), R ∼ S(·|x, a), E[(f(x, a) − R − ... |
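The identity the snippet truncates is the usual bias-variance decomposition: writing Z = R + γ f(Y, π̂(Y)), so that E[Z] = (T^π̂ f)(x, a), one gets

```latex
\mathbb{E}\big[(f(x,a) - R - \gamma f(Y, \hat\pi(Y)))^2\big]
  = \big(f(x,a) - (T^{\hat\pi} f)(x,a)\big)^2
  + \operatorname{Var}\big[R + \gamma f(Y, \hat\pi(Y))\big]
```

so the expectation of the empirical squared residual overshoots the true L2 Bellman-error by a variance term that itself depends on f, biasing the minimizer.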

263 | Stable function approximation in dynamic programming.
- Gordon
- 1995
Citation Context ...the recent growth of interest in reinforcement learning, work on value function approximation methods flourished [13, 14]. Recent theoretical results mostly concern supremum-norm approximation errors [15, 16], where the main condition on the way intermediate iterates are mapped (projected) to the function space is that the corresponding operator, Π, must be a non-expansion. Practical examples when Π satisfie... |

260 | Linear least-squares algorithms for temporal difference learning. - Bradtke, Barto - 1996 |

258 | Stochastic optimal control: The discrete time case
- Bertsekas, Shreve
- 1978
Citation Context ...ction Q ∈ B(X × A) if, for all x ∈ X, a ∈ A, π(x) ∈ argmax_{a∈A} Q(x, a). Since A is finite, such a greedy policy always exists. It is known that under mild conditions the greedy policy w.r.t. Q* is optimal [3]. For a deterministic stationary policy π, define the operator T^π : B(X × A) → B(X × A) by (T^π Q)(x, a) = r(x, a) + γ ∫ Q(y, π(y)) P(dy|x, a). For any deterministic stationary policy π : X → A, let the operator E^π : B... |
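For a finite MDP the operator T^π defined in the snippet (with the integral replaced by a sum over next states) can be sketched as follows; the 2-state, 2-action example is hypothetical:

```python
import numpy as np

def apply_T_pi(Q, r, P, pi, gamma):
    """One application of (T^pi Q)(x, a) = r(x, a) + gamma * sum_y P(y|x,a) Q(y, pi(y)).
    Q, r: arrays of shape (n_states, n_actions);
    P: transition kernel of shape (n_states, n_actions, n_states);
    pi: array of shape (n_states,) holding the deterministic action in each state."""
    next_vals = Q[np.arange(Q.shape[0]), pi]   # Q(y, pi(y)) for each state y
    return r + gamma * P @ next_vals           # matmul contracts the last axis of P

# Hypothetical 2-state, 2-action MDP.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])
pi = np.array([0, 1])    # the policy being evaluated
gamma = 0.9

Q = np.zeros_like(r)
for _ in range(500):     # T^pi is a gamma-contraction, so iteration converges to Q^pi
    Q = apply_T_pi(Q, r, P, pi, gamma)
```

At convergence Q = T^π Q, i.e. Q is the action-value function Q^π of the evaluated policy, which is the fixed-point property the paper's policy-evaluation step targets.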

224 | Tree-based batch mode reinforcement learning.
- Ernst, Geurts, et al.
- 2005
Citation Context ...pped (projected) to the function space is that the corresponding operator, Π, must be a non-expansion. Practical examples when Π satisfies the said property include certain kernel-based methods, see e.g. [15, 16, 17, 18]. However, the growth-restriction imposed on Π rules out many popular algorithms, such as regression-based approaches that were found, however, to behave well in practice (e.g. [19, 20, 1]). The need fo... |

204 | Empirical Processes: Theory and Applications - Pollard |

177 | Feature-based methods for large scale dynamic programming.
- Tsitsiklis, Van Roy
- 1996
Citation Context ...the recent growth of interest in reinforcement learning, work on value function approximation methods flourished [13, 14]. Recent theoretical results mostly concern supremum-norm approximation errors [15, 16], where the main condition on the way intermediate iterates are mapped (projected) to the function space is that the corresponding operator, Π, must be a non-expansion. Practical examples when Π satisfie... |

152 | Kernel-based reinforcement learning - Ormoneit, Sen |

137 | Generalized polynomial approximations in Markovian decision processes. - Schweitzer, Seidmann - 1985 |

133 | Mixing and moment properties of various Garch and stochastic volatility models, - Carrasco, Chen - 2002 |

111 | Sphere packing numbers for subsets of the boolean n-cube with bounded Vapnik-Chervonenkis dimension
- Haussler
- 1995
Citation Context ... Q̃′) ∈ F^L × F^L in the left-hand-side covering number is defined in the unusual way l_{x_{1:N}}((f, Q′), (g, Q̃′)) = (1/N) Σ_{t=1}^N |f(x_t, π̂(x_t; Q′)) − g(x_t, π̂(x_t; Q̃′))|. Finally, see Haussler [10] (and Anthony and Bartlett [7, Theorem 18.4]) for Proposition 7 (Haussler [10], Corollary 3). For any set X, any points x_{1:N} ∈ X^N, any class F of functions on X taking values in [0, K] with pseudo-dimension ... |
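For context, the bound being quoted (Anthony and Bartlett's Theorem 18.4, restating Haussler's Corollary 3) has, up to its stated constants, the form

```latex
\mathcal{N}_1\big(\varepsilon, \mathcal{F}, x_{1:N}\big) \;\le\; e\,(d+1) \left( \frac{2 e K}{\varepsilon} \right)^{d}
```

for any class F of [0, K]-valued functions with pseudo-dimension d. Crucially, the bound does not depend on N, which is what makes it useful inside uniform deviation arguments like the one in the paper.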

87 | Error bounds for approximate policy iteration.
- Munos
- 2003
Citation Context ...first sight, the choice of L̂N seems logical, as for any given Xt, At and f, Rt + γf(Xt+1, π̂(Xt+1)) is an unbiased estimate of (T^π̂ f)(Xt, At). However, as is well known (see, e.g., [4, pp. 220], [5, 1]), L̂N is not a “proper” approximation to the corresponding L2 Bellman-error: E[L̂N(f; π̂)] ≠ L(f; π̂). In fact, an elementary calculation shows that for Y ∼ P(·|x, a), R ∼ S(·|x, a), E[(f(x, a) − R − ... |

77 | Max-norm Projections for Factored MDPs.
- Guestrin, Koller, et al.
- 2001
Citation Context ...pped (projected) to the function space is that the corresponding operator, Π, must be a non-expansion. Practical examples when Π satisfies the said property include certain kernel-based methods, see e.g. [15, 16, 17, 18]. However, the growth-restriction imposed on Π rules out many popular algorithms, such as regression-based approaches that were found, however, to behave well in practice (e.g. [19, 20, 1]). The need fo... |

73 | Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability
- Yu
- 1994
Citation Context ... Lemma 2. Suppose that Z_0, ..., Z_N ∈ Z is a stationary β-mixing process with mixing coefficients {β_m}, Z′_t ∈ Z (t ∈ H) are the block-independent “ghost” samples as in [8], and H = {2i k_N + j : 0 ≤ i < m_N, 0 ≤ j < k_N}, and that F is a permissible class of Z → [−K, K] functions. Then P(sup_{f∈F} |(1/N) Σ_{t=1}^N f(Z_t) − E[f(Z_0)]| > ε) ≤ 16 E[N_1(ε/8, F, (Z′_t; t ∈ H)... |

57 | Nonparametric time series prediction through adaptive model selection
- Meir
Citation Context ...rked with dependent samples. The technique used to deal with dependent samples was to introduce (strong) mixing conditions on the trajectory and to extend Pollard’s inequality along the lines of Meir [22]. Also, the bounds developed in Section 4.2 are closely related to those developed in [5]. However, there only the case C(ν) < ∞ was considered, whilst in this paper the analysis was extended to the s... |

43 | Finite time bounds for sampling based fitted value iteration
- Szepesvári, Munos
- 2005
Citation Context ...f our paper is that we introduced a modified Bellman-residual that guarantees asymptotic consistency even with a single sample path. The closest to the present work is the paper of Szepesvári and Munos [21]. However, as opposed to [21], here we dealt with a fitted policy iteration algorithm and, unlike previously, we worked with dependent samples. The technique used to deal with dependent samples was ... |

41 | Functional approximations and dynamic programming
- Bellman, Dreyfus
- 1959
Citation Context ...2, for all 0 ≤ k < K < δ. Applying Lemma 9 with η = ε/2 ends the proof. ⊓⊔ 5 Discussion and Related Work The idea of using value function approximation goes back to the early days of dynamic programming [11, 12]. With the recent growth of interest in reinforcement learning, work on value function approximation methods flourished [13, 14]. Recent theoretical results mostly concern supremum-norm approximation ... |

38 | Histogram regression estimation using data-dependent partitions.
- Nobel
- 1996
Citation Context ...partition family Π, define G ◦ Π = { f = Σ_{A_j ∈ π} g_j I{A_j} : π = {A_j} ∈ Π, g_j ∈ G }. We quote here a result of Nobel (with any domain X instead of R^s and with minimised premise): Proposition 4 (Nobel [9], Proposition 1). Let Π be any partition family with m(Π) < ∞, G be a class of functions on X, x_{1:N} ∈ X^N. Let φ_N(·) be such that for any ε > 0, the empirical ε-covering numbers of G on all subsets of the... |

38 | Batch value function approximation via support vectors.
- Dietterich, Wang
- 2002
Citation Context ..., see e.g. [15, 16, 17, 18]. However, the growth-restriction imposed on Π rules out many popular algorithms, such as regression-based approaches that were found, however, to behave well in practice (e.g. [19, 20, 1]). The need for analysing the behaviour of such algorithms provided the basic motivation for this work. One of the main novelties of our paper is that we introduced a modified Bellman-residual that guar... |

32 | Adaptive estimation in autoregression or β-mixing regression via model selection - Baraud, Comte, et al. - 2001 |

32 | Interpolation-based Q-learning - Szepesvári, Smart - 2004 |

26 | A generalization error for Q-learning. - Murphy - 2005 |

23 | Efficient Value Function Approximation Using Regression Trees
- Dietterich, et al.
- 1999
Citation Context ..., see e.g. [15, 16, 17, 18]. However, the growth-restriction imposed on Π rules out many popular algorithms, such as regression-based approaches that were found, however, to behave well in practice (e.g. [19, 20, 1]). The need for analysing the behaviour of such algorithms provided the basic motivation for this work. One of the main novelties of our paper is that we introduced a modified Bellman-residual that guar... |

16 | Off-policy temporal difference learning with function approximation. - Precup, Sutton, et al. - 2001 |

14 | A Probabilistic Theory of Pattern Recognition - Devroye, Györfi, et al. - 1996 |

6 | Mixing conditions for Markov chains. - Davydov - 1973 |

4 | Tight Performance Bounds on Greedy Policies Based on Imperfect Value Functions - Williams, Baird III - 1994 |

2 | Learning near-optimal policies with fitted policy iteration and a single sample path: approximate iterative policy evaluation
- Szepesvári, Munos
- 2006
Citation Context ... modify the original Bellman-residual objective. In a forthcoming paper we study policy iteration with approximate iterative policy evaluation [2]. The main conditions of our results can be grouped into three parts: conditions on the system, conditions on the trajectory (and the behaviour policy used to generate the trajectory) and conditions o... |
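The modification referred to here (roughly, as developed in the line of work by Antos, Szepesvári and Munos) couples the squared residual with an auxiliary function h whose best fit cancels the variance bias:

```latex
\hat L_N(f, h; \hat\pi) = \frac{1}{N} \sum_{t=1}^{N}
  \Big[ \big(f(X_t, A_t) - R_t - \gamma f(X_{t+1}, \hat\pi(X_{t+1}))\big)^2
      - \big(h(X_t, A_t) - R_t - \gamma f(X_{t+1}, \hat\pi(X_{t+1}))\big)^2 \Big]
```

One minimises over f ∈ F after an inner minimisation over h ∈ F; when h can approximate T^π̂ f well, the expectation of the objective approaches the true L2 Bellman residual, restoring consistency even with a single dependent sample path.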
