John N. Tsitsiklis and Benjamin Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59--94, 1996.
....even for simple factored systems can grow exponentially in size. Many researchers have proposed the use of a linear approximation, where an approximate value function is represented as a linear combination of basis functions. This approach was first proposed for a variety of unfactored MDPs [Tsitsiklis and Van Roy, 1996] and applied to factored MDPs in [Koller and Parr, 2000; Guestrin et al., 2001]. They show that even a small set of basis functions can provide a high quality approximation to a high dimensional value function. In this paper, we apply this idea to POMDPs, by using the same approximation for the ....
....an |S| × k matrix A whose columns are the k basis functions, viewed as vectors. Our approximate value function is then represented by Aw. The idea of using linear value functions for dynamic programming was proposed, initially, by [Bellman et al., 1963] and has been further explored recently [Tsitsiklis and Van Roy, 1996; Koller and Parr, 1999; 2000; Guestrin et al., 2001]. The basic idea is as follows: in the solution algorithms, whether value iteration or policy iteration, we use only value functions within H. Whenever the algorithm takes a step that results in a value function V that is outside this space, we ....
J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59--94, 1996.
....this might look like a natural way to generalize value iteration to function approximators, this method does not work well. Baird showed that this approach to value iteration may diverge for a very small decision process [7]. Tsitsiklis and Van Roy gave another very simple two-state counterexample [72]: [Figure: two-state process; surviving labels: 2w, 0] In this process, each state has only one possible action and all transitions give a reward of zero. Thus, when a discount factor γ < 1 is used, the value function is equal to zero everywhere. A linear function approximator is used to approximate this value function. The value gradient ....
John N. Tsitsiklis and Benjamin Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59--94, 1996.
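The divergence described in the excerpt above can be checked numerically. The following is a minimal sketch, assuming the usual form of the two-state example: state 1 has approximate value w and moves to state 2, state 2 has approximate value 2w and loops on itself, and all rewards are zero. The function name, the starting weight, and the use of an unweighted least-squares fit are illustrative assumptions, not details taken from the excerpt.

```python
def fitted_value_iteration(gamma, w0=1.0, steps=50):
    """Value iteration with a least-squares fit onto V(s) = phi[s] * w.

    Assumed setup (hypothetical names): two states, zero rewards,
    both states transition deterministically to the second state.
    """
    phi = [1.0, 2.0]   # feature of each state: V(1) = w, V(2) = 2w
    nxt = [1, 1]       # successor index of each state (0-based)
    w = w0
    for _ in range(steps):
        # Bellman backup of the current approximation (rewards are zero).
        targets = [gamma * phi[nxt[s]] * w for s in range(2)]
        # One-parameter least squares: w = <phi, targets> / <phi, phi>.
        w = sum(p * t for p, t in zip(phi, targets)) / sum(p * p for p in phi)
    return w


if __name__ == "__main__":
    for gamma in (0.5, 0.9):
        print(gamma, fitted_value_iteration(gamma))
```

Under these assumptions each step multiplies w by 6γ/5, so the weight shrinks toward the true value of zero when γ < 5/6 and grows without bound when γ > 5/6, even though the exact value function is representable with w = 0.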
....probabilities. If the state variables are continuous or high dimensional, the TD learning rule is typically combined with some sort of function approximator (e.g. a linear combination of feature vectors or a neural network), which may well lead to numerical instabilities (see, for example, [BM95, TR96]). Specifically, the algorithm may fail to converge under several circumstances which, in the authors' opinion, is one of the main obstacles to a more widespread use of reinforcement learning (RL) in industrial applications. As a remedy, we adopt a nonparametric perspective on reinforcement ....
J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large-scale dynamic programming. Machine Learning, 22:59--94, 1996.
....control (observe that the robot does not move along perfectly orthogonal paths) and loss of Markovian characteristics caused by discretisation of the state space. 5. Related Literature. Experience generalisation is related to state aggregation methods analysed by some authors [11, 12]. In particular, it can be shown [13] that Q-learning acting on a set of aggregate states converges, provided a persistently exciting action policy is used. However, the set of action values asymptotically reached will depend on the limit distribution P(x) defined by this policy. The use of a ....
J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59--94, 1996.
....(local) methods. Decision tree based methods were used e.g. in (Chapman & Kaelbling, 1991) and more recently in (Wang & Dietterich, 1999). Discretization using triangularization has been advocated in (Munos, 1997); (sometimes adaptive) state aggregation was employed in e.g. (Moore & Atkeson, 1995; Tsitsiklis & Van Roy, 1996; Gordon, 1995). Recently, locally weighted regression models have been used by (Smart & Kaelbling, 2000). Regression literature teaches us that the advantage of constructive or local methods is that the behavior of the method can be understood much better than that of the gradient based ....
....do not carry over seamlessly to the combinations of local methods and reinforcement learning. This has been observed by many researchers (e.g. Boyan & Moore, 1995; Baird, 1995) and then researchers were urged to come up with convergent algorithms. In (Gordon, 1995) and independently in (Tsitsiklis & Van Roy, 1996) convergence results were derived for approximate dynamic programming when the value fitting operator was chosen to be a non-expansion. [Footnote 1] More precisely, in (Gordon, 1995) an algorithm of the form $V_{t+1} = G T V_t$ was considered, where T is the value backup operator and G is a nonexpansion w.r.t. ....
Tsitsiklis, J. N., & Van Roy, B. (1996). Feature-based methods for large scale dynamic programming. Machine Learning, 22, 59-94.
....Problems (MDPs) were proposed as the model for the analysis of RL [37] and since then a mathematically well-founded theory has been constructed for a large class of RL algorithms. These algorithms are based on two basic dynamic programming methods, namely the value and policy iteration algorithms [51, 14, 50, 38, 43]. The basic properties of most of the theoretical results are that they assume finite state and action spaces and discrete time models in which the full description of the state was available. In a real life problem, however, the state and action spaces are infinite, usually nondiscrete, time is ....
J. Tsitsiklis and B. Van Roy, Feature-based methods for large scale dynamic programming, Machine Learning 22 (1996), 59-94.
....A whose columns are the k basis functions, viewed as vectors. Our approximate value function is then represented by Aw. Linear value functions: The idea of using linear value functions for dynamic programming was proposed, initially, by [Bellman et al., 1963] and has been further explored recently [Tsitsiklis and Van Roy, 1996; Koller and Parr, 1999; 2000]. The basic idea is as follows: in the solution algorithms, whether value or policy iteration, we use only value functions within H. Whenever the algorithm takes a step that results in a value function V that is outside this space, we project the result back into the ....
J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59--94, 1996.
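The projection step described in the excerpt above can be illustrated with a small example. This is a minimal sketch, assuming an ordinary (unweighted) Euclidean least-squares projection onto the span of the columns of A; the excerpt does not say which norm the citing paper uses, and the helper names `project` and `combine` are hypothetical.

```python
def project(A, v):
    """Return weights w minimizing ||A w - v||_2 (assumes independent columns).

    A is a list of |S| rows with k entries each (columns = basis functions);
    v is a value function given as a list of |S| numbers.
    """
    n, k = len(A), len(A[0])
    # Augmented normal equations [A^T A | A^T v].
    M = [[sum(A[s][i] * A[s][j] for s in range(n)) for j in range(k)]
         + [sum(A[s][i] * v[s] for s in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][k] / M[i][i] for i in range(k)]


def combine(A, w):
    """Evaluate the approximate value function A w at every state."""
    return [sum(a * x for a, x in zip(row, w)) for row in A]


if __name__ == "__main__":
    A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]   # constant and linear basis functions
    v = [1.0, 0.0, 1.0]                        # a value function outside span(A)
    w = project(A, v)
    print(w, combine(A, w))
```

A value function that already lies in the span is returned unchanged, while one outside the span is replaced by its closest representable neighbour under the assumed norm.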
.... function approximators (for example, linear combinations of feature vectors or neural networks) to represent the value function of the underlying Markov Decision Process (MDP). For a detailed discussion of this problem, as well as a list of exceptions, the interested reader is referred to [5, 31]. By adopting a nonparametric perspective on reinforcement learning, we suggest an algorithm that always converges to a unique solution. This algorithm assigns value function estimates to the states in a sample trajectory and updates these estimates iteratively. Each update is based on ....
J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large-scale dynamic programming. Machine Learning, 22:59-94, 1996.
....iteration. Hence there is a need to interface them with a suitable approximation architecture. The two standard paradigms for approximation are state aggregation and the use of parametrized families of functions (with a low dimensional parameter space) for approximating the value function [3], [24], [25]. The latter, though it makes eminent sense as an approximation scheme, is not much use from a data compression perspective, whereas state aggregation fits it naturally. This motivates our analysis of a learning algorithm with state aggregation. To be specific, we consider an actor critic ....
....fits it naturally. This motivates our analysis of a learning algorithm with state aggregation. To be specific, we consider an actor critic type learning algorithm. These were introduced in [1] and studied extensively in [14]. We interface this algorithm with state aggregation as proposed in [24], except that this is an on-line learning algorithm based on a single simulation run, eschewing in particular the independent sampling over each aggregated state as in [24], which simplifies the analysis of [24] somewhat. Our analysis rigorously justifies aggregating states in terms of the ....
TSITSIKLIS, J., VAN ROY, B., Feature-based methods for large scale dynamic programming, Machine Learning 22 (1996), 59-94.
.... is best known under the name value function approximation, which is used frequently in the context of reinforcement learning [Tadepalli and Ok, 1996; Van Roy, 1998]. We use this idea in the context of maintaining full value functions and propagating them through the DP equation (1) [Gordon, 1995; Tsitsiklis and Van Roy, 1996]. However, unlike other methods, which deal with large state spaces by considering only a restricted set of representative states, our method efficiently finds a least squares approximation for the entire state space. More precisely, let $V \subseteq \mathbb{R}^{S}$ be a restricted set of value functions. We ....
....determination, we need a similar contraction property. We also need an effective algorithm for projecting into V , relative to this distance. Finally, we need this projection operation to be a non expansion under the same distance, otherwise we are not guaranteed the desired convergence property. Tsitsiklis and Van Roy [1996] underscore the importance of this point by demonstrating a two state MDP for which this type of approximate value determination diverges if we use standard least squares approximation. Unfortunately, it turns out to be nontrivial to find a distance metric that satisfies all criteria. For example, ....
J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22(1):59--94, January 1996.
....spanned by the basis functions H. It is useful to define an |S| × k matrix A whose columns are the k basis functions, viewed as vectors. Our approximate value function is then represented by Aw. Linear value functions. The idea of using linear value functions has been explored previously [Tsitsiklis and Van Roy 1996; Koller and Parr 1999; 2000]. The basic idea is as follows: in the solution algorithms, whether value iteration or policy iteration, we use only value functions within H. Whenever the algorithm takes a step that results in a value function V that is outside this space, we project the result back ....
J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59--94, 1996.
.... local linear models where the models can be learned with high accuracy (Atkeson, Moore, & Schaal 1996). The switch from table lookup approaches to those based on function approximators has been found to be a significant one for model free reinforcement learning (Boyan & Moore 1995; Sutton 1996; Tsitsiklis & Van Roy 1994). While a wide class of methods have been proven convergent for the table lookup case, many of these, including Q-learning and dynamic programming methods, appear to be unstable when simple function approximators are used (Baird 1995; Gordon 1995). Other methods, such as the Sarsa algorithm we use ....
Tsitsiklis, J., and Van Roy, B. 1994. Feature-based methods for large-scale dynamic programming. Technical Report LIDS-P2277, MIT, Cambridge, MA.
....really large problems require the use of generalizing function approximators to represent the model. The switch from table lookup approaches to those based on function approximators has been found to be a significant one for model free reinforcement learning (Boyan & Moore 1995; Sutton 1996; Tsitsiklis & Van Roy 1994). While a wide class of methods have been proven convergent for the table lookup case, many of these, including Q-learning and dynamic programming methods, appear to be unstable when simple function approximators are used (Baird 1995; Gordon 1995). Other methods, such as the Sarsa algorithm we ....
Tsitsiklis, J., and Van Roy, B. 1994. Feature-based methods for large-scale dynamic programming. Technical Report LIDS-P2277, MIT, Cambridge, MA.
....experiments. Thrun and Schwartz (1993) theorize that function approximation of value functions is also dangerous because the errors in value functions due to generalization can become compounded by the max operator in the definition of the value function. Several recent results (Gordon, 1995; Tsitsiklis & Van Roy, 1996) show how the appropriate choice of function approximator can guarantee convergence, though not necessarily to the optimal values. Baird's residual gradient technique (Baird, 1995) provides guaranteed convergence to locally optimal solutions. Perhaps the gloominess of these counter-examples is ....
Tsitsiklis, J. N., & Van Roy, B. (1996). Feature-based methods for large scale dynamic programming. Machine Learning, 22(1).
....1994). Second, an arbitrarily small change in the estimated value of an action can cause it to be, or not be, selected. Such discontinuous changes have been identified as a key obstacle to establishing convergence assurances for algorithms following the value function approach (Bertsekas and Tsitsiklis, 1996). For example, Q-learning, Sarsa, and dynamic programming methods have all been shown unable to converge to any policy for simple MDPs and simple function approximators (Gordon, 1995, 1996; Baird, 1995; Tsitsiklis and van Roy, 1996; Bertsekas and Tsitsiklis, 1996). This can occur even if the best ....
.... assurances for algorithms following the value function approach (Bertsekas and Tsitsiklis, 1996). For example, Q-learning, Sarsa, and dynamic programming methods have all been shown unable to converge to any policy for simple MDPs and simple function approximators (Gordon, 1995, 1996; Baird, 1995; Tsitsiklis and van Roy, 1996; Bertsekas and Tsitsiklis, 1996). This can occur even if the best approximation is found at each step before changing the policy, and whether the notion of best is in the mean squared error sense or the slightly different senses of residual gradient, temporal difference, and dynamic programming ....
Tsitsiklis, J. N., & Van Roy, B. (1996). Feature-based methods for large scale dynamic programming.
....us to compute updates only for the finite set of belief states. The drawback of the approach is that, when combined with the value iteration method, it can lead to instability and/or divergence. This has been shown for MDPs by several researchers (Bertsekas, 1994; Boyan & Moore, 1995; Baird, 1995; Tsitsiklis & Van Roy, 1996). 25. This is similar to the QMDP method, which allows both lookahead and greedy designs. In fact, QMDP can be viewed as a special case of the grid based method with Q function approximations, where grid points correspond to extremes of the belief simplex. ....
Tsitsiklis, J. N., & Van Roy, B. (1996). Feature-based methods for large-scale dynamic programming. Machine Learning, 22, 59-94.
....optimal value function of dynamic programming, and then uses it to construct policies that are close to optimal. The understanding of such methods is still somewhat incomplete: convergence results or performance guarantees are available only for a few special cases such as state space aggregation [TR96], optimal stopping problems [TR97] and an idealized form of Policy Iteration [BT96]. However, there have been some notable practical successes (see [SB98, BT96] for an overview), including the world class backgammon player by Tesauro [Tes92]. (b) In an alternative approach, the tuning of a ....
J. N. Tsitsiklis and B. Van Roy. Feature-Based Methods for Large Scale Dynamic Programming. Machine Learning, 22:59--94, 1996.
....splitting operation based on their rectangular partition representation followed by the application of a standard RL or MDP algorithm to the reduced model. We suspect that the rest of their algorithms as well as other RL and MDP algorithms for handling multidimensional state spaces (Moore 1993; Tsitsiklis & Van Roy 1996) can be profitably analyzed in terms of model reduction. Partially Observable MDPs. The simplest way of using model reduction techniques to solve partially observable MDPs (POMDPs) is to apply the model minimization algorithm to an initial partition that distinguishes on the basis of both reward ....
Tsitsiklis, J. N., and Van Roy, B. 1996. Feature-based methods for large scale dynamic programming. Machine Learning 22:59--94.
....Aggregation and Function Approximation. The approach we take to solving large MDPs is a specific state aggregation method. Other types of state aggregation techniques have been proposed, in which states with similar characteristics are grouped together. Such methods are reported in, for instance, [4, 68, 81], and can vary as to whether states are statically or dynamically aggregated (that is, do the groupings of states stay fixed or can they change during computation). Other compact representations of value functions have also been proposed, such as linear function representations or neural networks ....
.... and can vary as to whether states are statically or dynamically aggregated (that is, do the groupings of states stay fixed or can they change during computation). Other compact representations of value functions have also been proposed, such as linear function representations or neural networks [1, 6, 80, 81]. These techniques do not seek to exploit regions of uniformity in value functions, but rather compact functions of state features that reflect value. As such they are distinguished from strict aggregation methods. In much of this previous work, the goal is the approximate solution of large MDPs. ....
John N. Tsitsiklis and Benjamin Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59--94, 1996.
....where i (x) is monotonically increasing with i (x) 0. This method of Q value generalization corresponds to fuzzy state generalization, which will be discussed in the next section. It can also be viewed as a special case of value function approximation using a linear combination of features [Tsitsiklis and Van Roy, 1996]. We have applied the above function approximation approach to the original two state learning problem for the special case when 0 (x) 1 (x) 1 and 0 (x 0 ) 1 (x 1 ) In this case we have 1 (x 0 ) 0 (x 1 ) and 0 (x 0 ) 1 (x 1 ) 1 . We found that for any 0:5, all ....
....states by using a function approximation architecture Q(x; a; r) for approximating Q(x; a) where r is the set of all learned parameters arranged in a single vector. The basic parameter updating rule used by discounted Qlearning or Monte Carlo learning for such an architecture is [Bertsekas and Tsitsiklis, 1996]: r t r t t r r t Q(x t ; a; r t ) 3) where is the learning rate and t is the Bellman error used in the corresponding learning rule for the look up table case: Q(x t ; a) Q(x t ; a) t : 4) For example, in the look up table version of discounted Monte Carlo learning, t = R T ....
Tsitsiklis, J. N. and Van Roy, B. 1996. "Feature-Based Methods for Large-Scale Dynamic Programming," Machine Learning, Vol. 22, pp. 59-94.
....MDPs and POMDPs is a technique called feature selection. In such reductions, a domain expert chooses certain features of the state space that are expected to be the primary contributors to the transition probabilities and simply ignores any additional information. Tsitsiklis and Van Roy (Tsitsiklis & Van Roy 1996) give a careful analysis of dynamic programming algorithms based on feature selection. They do not give any complexity analysis, but they show that such policy finding algorithms will converge, and will find as good a policy as is possible, given the choice of features. However, this means that ....
Tsitsiklis, J. N., and Van Roy, B. 1996. Feature-based methods for large scale dynamic programming. Machine Learning 22:59--94.
....we mention three important directions for future work in this area: (i) Typically, the state space can be very large. This calls for approximations, such as state aggregation or considering a parametrized family of candidate Q-factor functions with a low dimensional parameter space. See, e.g., [21]. The algorithms presented need to be interlaced with such approximation architectures and analysed as such. (ii) Simulation based algorithms are slow. An analysis of rate of convergence and good speed-up procedures are needed. (iii) Extension to the case where the state space is not finite is an ....
J. N. TSITSIKLIS, B. VAN ROY, Feature-based methods for large scale dynamic programming, Machine Learning 22 (1996), pp. 59-94.
....the analysis of RL [32, 23] and since then a mathematically well-founded theory has been constructed for a large class of RL algorithms. These algorithms are based on modifications of the two basic dynamic programming algorithms used to solve MDPs, namely the value and policy iteration algorithms [31, 7, 11, 28, 24]. The RL algorithms learn via experiencing, gradually building an estimate of the optimal value function, which is known to encompass all the knowledge needed to behave in an optimal way according to a fixed criterion, usually the expected total discounted cost criterion. The basic limitations ....
.... the output of features can well make the problem partially observable, so one should not expect that RL algorithms which involve optimization will work in general (the theoretical results of Tsitsiklis and Van Roy concern only estimation types of algorithms, such as TD(λ) or non-adaptive algorithms [28]). Issues of learning in partially observable environments have been discussed by Singh et al. [21]. The second track is related to the use of local controllers together with a switching function which selects the controller to be activated at any arbitrary time. In connection with this topic, ....
J. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59--94, 1996.
....regression, local weighted regression, and neural net fitting. In later chapters we will talk about ways to use more general function approximators. Most of the material in this chapter is drawn from [Gor95a] and [Gor95b]. Some of this material was discovered simultaneously and independently in [TV94]. A related algorithm which learns online (that is, by following trajectories in the MDP and updating states only as they are visited, in contrast to the way fitted value iteration can update states in any order) is described in [SJJ95]. 2.1 Discounted processes. In this section, we will consider ....
....defined in [Gor95a]; the definition there is slightly less general than the one given here, but the theorems given there still hold for the more general definition. A similar class of function approximators (called interpolative representations) was defined simultaneously and independently in [TV94]. More precisely, if M has n states, then specifying an averager is equivalent to picking $n$ real numbers $k_i$ and $n^2$ nonnegative real numbers $\beta_{ij}$ such that for each $i$ we have $\sum_{j=1}^{n} \beta_{ij} \le 1$. With these numbers, the fitted value at the $i$th state is defined to be $k_i + \sum_{j=1}^{n} \beta_{ij} f$ ....
J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large-scale dynamic programming. Technical Report P-2277, Laboratory for Information and Decision Systems, 1994.
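The averager definition in the excerpt above lends itself to a short numerical check. This is a minimal sketch, assuming the truncated formula continues as the weighted sum of the target values f_j, so that the fitted value at state i is k_i plus the sum over j of β_ij f_j; the function names and the example numbers are hypothetical.

```python
def averager(k, beta, f):
    """Fitted values k[i] + sum_j beta[i][j] * f[j] under the stated constraints."""
    for row in beta:
        # Weights must be nonnegative and each row must sum to at most 1.
        assert all(b >= 0 for b in row) and sum(row) <= 1 + 1e-12
    return [k[i] + sum(b * fj for b, fj in zip(beta[i], f))
            for i in range(len(k))]


def max_norm_diff(u, v):
    """Max-norm distance between two value functions."""
    return max(abs(a - b) for a, b in zip(u, v))


if __name__ == "__main__":
    k = [0.0, 1.0]
    beta = [[0.5, 0.5], [0.0, 0.9]]
    f, g = [1.0, 3.0], [2.0, -1.0]
    # The distance after fitting never exceeds the distance before fitting.
    print(max_norm_diff(averager(k, beta, f), averager(k, beta, g)),
          max_norm_diff(f, g))
```

Because every row of weights is nonnegative and sums to at most one, the mapping cannot increase max-norm distances, which is the non-expansion property that several of the excerpts on this page rely on for convergence.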
....another part. The state of an object distant from the robot does not (usually) change as a result of some action performed by the robot. Factored representations can be exploited e.g. to reduce the memory requirement of the storage of a representation of the dynamics. Function approximators (e.g. [15, 19]). Central to RL algorithms are value functions. A value function is a function that renders real numbers (values) to states. The number associated with a state represents the long term value of a policy or the best possible long term value that can be achieved from the given state. In this latter ....
....(and thus imprecise) estimate of the value function. This brings in new complications to the estimation procedure. As of today, linear function approximators of the form $\theta^T \phi(x)$, where $\phi : X \to \mathbb{R}^n$ is a fixed function ($\phi$ is called the feature function), are known to work with RL algorithms [19]. However, often, memory based neural networks are used in RL algorithms with pretty good success [10]. Observers (e.g. [5]). Control theorists use filters or observers to estimate the state if it is unobservable. These observers usually require knowledge of the dynamics and exploit the ....
J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59--94, 1996.
....et al. 1995, 1996] which assembled the basic ideas for using a factored representation together with a stochastic analog of goal regression to reason about aggregate behaviors in stochastic systems. We explore the connections between model minimization and goal regression in [Givan and Dean, 1997]. Tsitsiklis and Van Roy [1996] also describe algorithms for solving Markov decision processes with factorial models. Our basic treatment of Markov decision processes borrows from Puterman [1994]. The primary contributions of this paper consist of (a) noting that we can factor the dynamics of a planning domain along the lines of ....
Tsitsiklis, John N. and Van Roy, Benjamin, 1996, Feature-based methods for large scale dynamic programming. Machine Learning 22:59--94.
....additional errors with its extra approximation step. We will be concerned here only with direct algorithms. Watkins' (1989) Q-learning algorithm can find the Q function for small MDPs, either online or offline. Convergence with probability 1 in the online case was proven in (Jaakkola et al. 1994, Tsitsiklis, 1994). For large MDPs, exact Q-learning is too expensive: representing the Q function requires too much space. To overcome this difficulty, we may look for an inexpensive approximation to the Q function. In the offline case, several algorithms for this purpose have been proven to converge (Gordon, ....
....large MDPs, exact Q-learning is too expensive: representing the Q function requires too much space. To overcome this difficulty, we may look for an inexpensive approximation to the Q function. In the offline case, several algorithms for this purpose have been proven to converge (Gordon, 1995a, Tsitsiklis and Van Roy, 1994, Baird, 1995). For the online case, there are many fewer provably convergent algorithms. As Baird (1995) points out, we cannot even rely on gradient descent for large, stochastic problems, since we must observe two independent transitions from a given state before we can compute an unbiased ....
J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large-scale dynamic programming. Technical Report P-2277, Laboratory for Information and Decision Systems, 1994.
....for the analysis of RL [17] and since then a mathematically well founded theory has been constructed for a large class of RL algorithms. These algorithms are based on modifications of the two basic dynamic programming algorithms used to solve MDPs, namely the value and policy iteration algorithms [25, 5, 10, 23, 18]. The RL algorithms learn via experience, gradually building an estimate of the optimal value function, which is known to encompass all the knowledge needed to behave in an optimal way according to a fixed criterion, usually the expected total discounted cost criterion. The basic limitations of ....
J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59--94, 1996.
....a) and the expected time between two successive visits is bounded for any state action pair then $Q_t$, as defined by the appropriately modified version of Equation (4.6), converges to the true optimal Q function, Q* [52]. .... new, although Gordon [24] and Tsitsiklis and Van Roy [80] commented on these questions in the context of using function approximators together with value iteration. Proposition 4.3.2 is due to Gordon [24, Theorem 6.2]. The idea of state aggregation is discussed in the context of MDPs in [59, 8] and the context of RL in [61]. The basic model of ....
J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59--94, 1996.
.... programming techniques; in particular, neural network approximators may not converge, but certain linear interpolation approximators will [12], as will some feature based methods (including radial basis function networks) satisfying certain properties, under a modified dynamic programming algorithm [30]. It is our hope that these results can be extended for our radial basis function networks with reinforcement learning algorithms similar to the Q-learning algorithm described above. 5 Conclusions. The hybrid, hierarchical, learning control structure biological motor control systems use to deal ....
J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59--94, 1996.
....[9, 1] and state aggregation, whereby various states are grouped together and each aggregate state or cluster is treated as a single state. Recently, methods for automatic aggregation have been developed in which certain problem features are ignored, making certain states indistinguishable [8, 3, 11, 5, 16]. In some of these aggregation techniques, the use of standard AI representations like STRIPS or Bayesian networks to represent actions in an MDP can be exploited to help construct the aggregations. In particular, they can be used to help identify which variables are relevant, at any point in the ....
John N. Tsitsiklis and Benjamin Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59--94, 1996.
....really large problems require the use of generalizing function approximators to represent the model. The switch from table lookup approaches to those based on function approximators has been found to be a significant one for model free reinforcement learning (Boyan & Moore 1995; Sutton 1996; Tsitsiklis & Van Roy 1994). While a wide class of methods have been proven convergent for the table lookup case, many of these, including Q learning and dynamic programming methods, appear to be unstable when simple function approximators are used (Baird 1995; Gordon 1995). Other methods, such as the Sarsa algorithm we use ....
Tsitsiklis, J., and Van Roy, B. 1994. Feature-based methods for large-scale dynamic programming. Technical Report LIDS-P-2277, MIT, Cambridge, MA.
....of the sequential decision problems that arise in practice, the state space is huge. The most sensible way of dealing with this difficulty is to generate compact parametric representations that approximate the Q function. One form of compact representation, as described by Tsitsiklis and Van Roy [54], is based on the use of feature extraction to map the set of states into a much smaller set of feature vectors. By storing a value of the optimal Q function for each possible feature vector, the number of values that need to be computed and stored can be drastically reduced and, if meaningful ....
John N. Tsitsiklis and Benjamin Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22(1/2/3):59--94, 1996.
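The feature-extraction idea in the context above can be sketched in code. This is a minimal illustration, assuming a small tabular MDP given as arrays and a hypothetical mapping `feature_of` from each state to a feature index. It stores one value per feature and averages the backed-up values of the states sharing it, which is one simple aggregation variant and not necessarily the algorithm of the cited paper.

```python
import numpy as np

def aggregated_value_iteration(P, R, feature_of, n_features, gamma=0.9, iters=200):
    """Value iteration storing one value per feature instead of one per state.

    P: (A, S, S) transition probabilities, R: (A, S) rewards,
    feature_of: (S,) integer feature index of each state (hypothetical input).
    """
    v_feat = np.zeros(n_features)
    counts = np.bincount(feature_of, minlength=n_features)
    for _ in range(iters):
        v_state = v_feat[feature_of]            # expand feature values to states
        q = R + gamma * (P @ v_state)           # (A, S) one-step backups
        backed_up = q.max(axis=0)               # greedy backup per state
        # Average the backed-up values over the states sharing each feature.
        sums = np.bincount(feature_of, weights=backed_up, minlength=n_features)
        v_feat = sums / np.maximum(counts, 1)
    return v_feat
```

With S states and far fewer features, only `n_features` numbers are stored and updated, which is the saving the passage describes. Averaging within a group does not expand the maximum-norm distance, so this particular variant does not suffer from the divergence seen with least-squares fitting.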
....all of which only deal with cases where the number of tunable parameters is the same as the cardinality of the state space. Such cases are not practical when state spaces are large or infinite. The more general case, involving the use of function approximation, is addressed by results in (Dayan, 1992), (Tsitsiklis and Van Roy, 1996), (Gordon, 1995), and (Singh et al. 1995). The latter three establish convergence with probability 1. However, their results only apply to a very limited class of function approximators and involve variants of a constrained version of temporal difference learning, known as TD(0). Dayan (1992) ....
....directly lead to approximation error bounds or interpretable characterizations of the limit of convergence. In addition to the positive results, counterexamples to variants of the algorithm have been offered in several papers. These include (Baird, 1995), (Boyan and Moore, 1995), (Gordon, 1995), and (Tsitsiklis and Van Roy, 1996). As suggested by Sutton (1995), the key feature that distinguishes these negative results from their positive counterparts is that the variants of temporal difference learning used do not employ on line state sampling. In particular, sampling is done by a mechanism that samples states with frequencies ....
[Article contains additional citation context not shown here]
Tsitsiklis, J.N. & Van Roy, B. (1996) "Feature-Based Methods for Large Scale Dynamic Programming," Machine Learning, Vol. 22, pp. 59-94.
....which only deal with cases where the number of tunable parameters is the same as the cardinality of the state space. Such cases are not practical when state spaces are large or infinite. The more general case, involving the use of function approximation, is addressed by results in (Dayan, 1992), (Tsitsiklis and Van Roy, 1996), (Gordon, 1995), and (Singh et al. 1995). The latter three establish convergence with probability 1. However, their results only apply to a very limited class of function approximators and involve variants of a constrained version of temporal difference learning, known as TD(0). Dayan (1992) ....
....lead to approximation error bounds or interpretable characterizations of the limit of convergence. In addition to the positive results, counterexamples to variants of the algorithm have been offered in several papers. These include (Baird, 1995), (Boyan and Moore, 1995), (Gordon, 1995), and (Tsitsiklis and Van Roy, 1996). As suggested by Sutton (1995), the key feature that distinguishes these negative results from their positive counterparts is that the variants of temporal difference learning used do not employ on line state sampling. In particular, sampling is done by a mechanism that samples states with ....
[Article contains additional citation context not shown here]
Tsitsiklis, J.N. & Van Roy, B. (1996) "Feature-Based Methods for Large Scale Dynamic Programming," Machine Learning, Vol. 22, pp. 59-94.
No context found.
John N. Tsitsiklis and Benjamin Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59--94, 1996.
No context found.
J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59--94, 1996.
No context found.
J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59--94, 1996.
No context found.
Tsitsiklis, J. N., and van Roy, B. 1996. Feature-based methods for large scale dynamic programming. Machine Learning 22(1/2/3):59--94.
No context found.
J. N. Tsitsiklis and B. Van Roy, "Feature-based methods for large-scale dynamic programming, " Machine Learning, vol. 22, pp. 59--94, 1996.
No context found.
J. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59--94, 1996.
No context found.
J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59--94, 1996.
No context found.
John N. Tsitsiklis and Benjamin Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59-94, 1996.
No context found.
J. N. Tsitsiklis and B. Van Roy, Feature--Based Methods for Large Scale Dynamic Programming. Machine Learning, 22:59--94, 1996.
No context found.
J. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59--94, 1996.
No context found.
J. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Mach. Learn., 22:59--94, 1996.
No context found.
J. N. Tsitsiklis and B. Van Roy, 1994, Feature-based methods for large scale dynamic programming. Technical Report, LIDS-P-2277, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
No context found.
Tsitsiklis, John N. and Van Roy, Benjamin (1996a). Feature-Based Methods for Large Scale Dynamic Programming, Machine Learning, 22, pp. 59-94.