BERTSEKAS, D. P. Differential training of rollout policies. In Proc. of the 35th Allerton Conference on Communication, Control, and Computing (Allerton Park, Ill., October 1997).
....performance of the greedy policy is the relative ordering the approximate value function assigns to the successor states in each state. This motivates an alternative approach: instead of seeking to minimize (1) or an 2 variant, one should minimize some form of relative error between state values [2, 7, 32]. [Footnotes: See Section 2 for definitions. For a proof of (2) see [8, Proposition 6.1] or [34] or [23].] While this idea is promising, the approach we take in this paper is even more direct: search for a policy minimizing the expected discounted reward directly. We can view the average reward (2) as a ....
D. P. Bertsekas. Differential Training of Rollout Policies. In Proceedings of the 35th Allerton Conference on Communication, Control, and Computing, Allerton Park, Ill., 1997.
....approximators are trained with respect to relative state values on binary decision problems. We characterise the limiting behaviour of STD(λ) and provide a theorem guaranteeing optimality of the limiting parameter vector for two-state BMDPs. Section 4.2 compares STD(λ) with differential training [Bertsekas, 1997], highlighting the different distributions of state pairs used in training. Experimental results in section 5 demonstrate the success of STD(λ) in the two-state system and in a variant of the well-known acrobot problem. 1.1 Previous work on State Differences. Several researchers have previously ....
....difference learning to develop an instruction scheduler for an optimising compiler. Their approach uses table lookup rather than function approximation, and combines possible successor states into a single feature vector which is mapped to a preference indicator. Bertsekas' differential training [Bertsekas, 1997] is the most closely related previous work. We defer discussion of it until section 4.2. 2 Generating Suboptimal Policies. Convergence by TD(λ) to suboptimal policies can be found in even the simplest non-trivial system: a Markov Decision Process with only two states. Consider the transition ....
Dimitri P. Bertsekas. Differential Training of Rollout Policies. In Proceedings of the 35th Allerton Conference on Communication, Control, and Computing, Allerton Park, Ill., 1997.
....actions allows us to tolerate less accurate estimates when the inaccuracy is systematic across all actions. So, our technique allows savings in simulation time by focusing on the relative utility, similar to the benefits of common random numbers simulation in perturbation analysis [9] (see also [3]). State Estimation: Stochastic traffic models implicitly describe probability distributions over possible future traffic sequences. With many traffic models, these distributions change over time depending on the actual traffic. For example, if the model indicates we are expecting a burst of ....
....Work: As mentioned in Section 1, the recent decision-theoretic work of [15] and [10] provides an impractical alternative way of finding optimal policies using simulation. Recent work by Bertsekas et al. on rollout algorithms also provides a means of using simulations to find good control actions [3, 4]. However, that work was formulated for the simpler fully observable problem domain, rather than the partially observable domain common in network control. Also, the rollout approach relies on starting with a good heuristic policy for the control problem being considered, in contrast with our ....
D. P. Bertsekas, "Differential Training of Rollout Policies," in Proc. 35th Allerton Conference on Communication, Control, and Computing, October 1997.
....latter case, it would be very difficult to analyze the performance bound of the estimate because it would require essentially that we know the distribution of the Q value for each action at a given state over the action space. 3.1.2 Roll out algorithm. Sampling is also used by Bertsekas and Castanon [12, 13] to design a heuristic method of policy improvement called roll out. This method uses sampling to improve a given heuristic policy in an online fashion: at each time step, the given policy is simulated using sampling over a finite horizon n, and the results of the simulation are used to select ....
.... value of H, the nonstationary rollout policy defined as $\pi^{ro}_i(x_i) \in \arg\max_{a \in U(x)} Q^{V_{H-i-1}}_{H-i}(x_i, a)$ for each $i = 0, \ldots, H-1$ can be shown to outperform the given base policy in terms of the total reward over the horizon H by a backwards induction argument [12]. However, the rollout policy defined as in Equation 3.2 with a finite horizon H does not necessarily outperform the given base policy in terms of the infinite horizon discounted average reward in theory. We first establish here a convergence result for the infinite horizon ....
D. P. Bertsekas, "Differential training of rollout policies," in Proc. 35th Allerton Conference on Communication, Control, and Computing, Allerton Park, IL, Oct. 1997.
....are trained with respect to relative state values on binary decision problems. A theoretical analysis, including a proof of monotonic policy improvement for STD(λ) in the context of the two-state system, is presented, along with a comparison with Bertsekas' differential training method [1]. This is followed by successful demonstrations of STD(λ) on the two-state system and a variation on the well-known acrobot problem. 1 Introduction. For complex reinforcement learning problems, TD(λ) with function approximation [2] has proved empirically successful. Its origins go back as far as ....
....difference learning to develop an instruction scheduler for an optimising compiler. Their approach uses table lookup rather than function approximation, and combines possible successor states into a single feature vector which is mapped to a preference indicator. Bertsekas' differential training [1] is the most closely related previous work. We defer discussion of it until section 4.3. 2 Generating Sub-optimal Policies. Convergence by TD(λ) to sub-optimal policies can be found in even the simplest non-trivial system: a Markov Decision Process with only two states. Consider the transition ....
Dimitri P. Bertsekas. Differential Training of Rollout Policies. In Proceedings of the 35th Allerton Conference on Communication, Control, and Computing, Allerton Park, Ill., 1997.
....user's coding characteristics to build schedules better tuned for that user [UMass Amherst Tech Report Number 99 23]. With these motivations in mind, we formulated and tested two autonomous methods of building an instruction scheduler. The first method used rollouts (Tesauro and Galperin, 1996; Bertsekas et al., 1997a,b) and the second focused on reinforcement learning (RL) (Sutton & Barto, 1998). Both methods were implemented for the Digital Alpha 21064. The next section gives a domain overview and discusses results using supervised learning on the same task. 2 Domain overview. We focused on scheduling ....
....our algorithms to schedule blocks whose size is greater than 10, we focus on scheduling the longer running blocks. We present timing results using the simulator. 3 Rollouts. Rollouts are a form of Monte Carlo search, first introduced by Tesauro and Galperin (1996) for use in backgammon. Bertsekas et al. (1997a,b) explored rollouts in other domains and proved important theoretical results. In the instruction scheduling domain, rollouts work as follows: suppose the scheduler comes to a point where it has a partial schedule and a set of (more than one) candidate instructions to add to the schedule. For ....
Bertsekas, D. P. (1997). Differential training of rollout policies. In Proc. of the 35th Allerton Conference on Communication, Control, and Computing. Allerton Park, Ill.
....of the compiler while potentially sacrificing some quality of the final schedule. With these motivations in mind, we formulated and tested two methods of building an instruction scheduler. The first method used rollouts (Woolsey, 1991; Abramson, 1990; Galperin, 1994; Tesauro and Galperin, 1996; Bertsekas et al., 1997a,b) and the second focused on reinforcement learning (RL) (Sutton and Barto, 1998). We also investigated the effect of combining the two methods. All methods were implemented for the Compaq Alpha 21064. These methods address the time tradeoff directly. Rollouts evaluate schedules online during ....
....up the simulator considerably. 3 Rollouts. Rollouts are a form of Monte Carlo search, first introduced in the backgammon literature (Woolsey, 1991; Galperin, 1994; Tesauro and Galperin, 1996). In other domains, Abramson (1990) studied what we call RANDOM-π (below) in a game playing context, and Bertsekas et al. (1997a,b) proved important theoretical results for rollouts. In the instruction scheduling domain, rollouts work as follows: suppose the scheduler comes to a point where it has a partial schedule and a set of (more than one) candidate instructions to add to the schedule. The scheduler appends each ....
Bertsekas, D. P. (1997). Differential training of rollout policies. In Proc. of the 35th Allerton Conference on Communication, Control, and Computing. Allerton Park, Ill.
.... empirical successes in learning to play games, including checkers (Samuel, 1959), backgammon (Tesauro, 1992; Tesauro, 1994) and chess (Baxter et al., 1999a). Successes outside of the games domain include job shop scheduling (Zhang & Dietterich, 1995) and dynamic channel allocation (Singh & Bertsekas, 1997). While there are many algorithms for training approximate value functions (see (Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 1998) for comprehensive treatments) with varying degrees of convergence guarantees, all these algorithms and indeed the approximate value function approach ....
....the greedy policy is the relative ordering the approximate value function assigns to the successor states in each state. This motivates an alternative approach: instead of seeking to minimize (1) or an 2 variant, one should minimize some form of relative error between state values (Baird, 1993; Bertsekas, 1997; Weaver & Baxter, 1999). While this idea is promising, the approach we take in this paper is even more direct: search for a policy maximizing the expected discounted reward directly. We can view the average reward (2) as a function $\eta(\theta)$ of $\theta \in \mathbb{R}^K$, where $\theta$ are the parameters of V. Provided ....
Bertsekas, D. P. (1997). Differential Training of Rollout Policies.
....performance of the greedy policy is the relative ordering the approximate value function assigns to the successor states in each state. This motivates an alternative approach: instead of seeking to minimize (1) or an 2 variant, one should minimize some form of relative error between state values [2, 7, 32]. [Footnotes: 1. See Section 2 for definitions. 2. For a proof of (2) see [8, Proposition 6.1] or [34] or [23].] While this idea is promising, the approach we take in this paper is even more direct: search for a policy minimizing the expected discounted reward directly. We can view the average reward (2) ....
D. P. Bertsekas. Differential Training of Rollout Policies. In Proceedings of the 35th Allerton Conference on Communication, Control, and Computing, Allerton Park, Ill., 1997.
....at the highest level of optimization. We call the schedules output by the compiler ORIG. This collection has 447,127 basic blocks, containing 2,205,466 instructions. 3 Rollouts. Rollouts are a form of Monte Carlo search, first introduced by Tesauro and Galperin (1996) for use in backgammon. Bertsekas et al. (1997a,b) have explored rollouts in other domains and proved important theoretical results. In the instruction scheduling domain, rollouts work as follows: suppose the scheduler comes to a point where it has a partial schedule and a set of (more than one) candidate instructions to add to the schedule. ....
....many times for each instruction to achieve a measure of the average expected outcome. After rolling out each candidate, the scheduler picks the one with the best average running time. Our first set of rollout experiments compared three different rollout policies π. Although the theory developed by Bertsekas et al. (1997a,b) proved that if we used the DEC scheduler as π, we would perform no worse than DEC, an architect proposing a new machine might not have a good heuristic available to use as π, so we also considered policies more likely to be available. The first was the random policy, RANDOM-π, which is a ....
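The rollout procedure described in the excerpts above (append each candidate, complete the schedule with a base policy, average the outcomes, keep the best candidate) can be sketched roughly as follows. This is only an illustrative sketch, not code from any of the cited papers: the function `rollout_choose`, the toy job durations in `DURATIONS`, the cost function, and both base policies are hypothetical names and values chosen for the example, and total completion time stands in for the simulated running time of a schedule.

```python
import random
from statistics import mean

# Hypothetical toy problem: order three jobs to minimise total completion time.
DURATIONS = {"a": 2, "b": 5, "c": 3}


def step(state, job):
    """Append one job to the partial schedule; state is (done, remaining)."""
    done, remaining = state
    return done + (job,), remaining - {job}


def total_completion_time(state):
    """Cost of a schedule: sum of the finishing times of all scheduled jobs."""
    done, _ = state
    elapsed = total = 0
    for job in done:
        elapsed += DURATIONS[job]
        total += elapsed
    return total


def random_policy(state, rng):
    """Base policy that picks any remaining job uniformly at random."""
    _, remaining = state
    return rng.choice(sorted(remaining)) if remaining else None


def shortest_first_policy(state, rng):
    """Deterministic base policy: always schedule the shortest remaining job."""
    _, remaining = state
    return min(remaining, key=DURATIONS.get) if remaining else None


def rollout_choose(state, candidates, step, base_policy, cost,
                   n_rollouts=20, seed=0):
    """Return the candidate whose rolled-out completions have the lowest mean cost."""
    rng = random.Random(seed)

    def simulate(s):
        # Complete the partial solution by following the base policy to the end.
        action = base_policy(s, rng)
        while action is not None:
            s = step(s, action)
            action = base_policy(s, rng)
        return cost(s)

    averages = {
        c: mean(simulate(step(state, c)) for _ in range(n_rollouts))
        for c in candidates
    }
    return min(averages, key=averages.get)


start = ((), frozenset(DURATIONS))
first_job = rollout_choose(start, sorted(DURATIONS), step,
                           random_policy, total_completion_time)
```

With a deterministic base policy a single rollout per candidate would suffice; the repeated rollouts only matter for a stochastic base policy such as the random one.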
Bertsekas, D. P. (1997). Differential training of rollout policies. In Proc. of the 35th Allerton Conference on Communication, Control, and Computing. Allerton Park, Ill.
No context found.
BERTSEKAS, D. P. Differential training of rollout policies. In Proc. of the 35th Allerton Conference on Communication, Control, and Computing (Allerton Park, Ill., October 1997).
No context found.
Dimitri P. Bertsekas. Differential training of rollout policies. In Proc. of the 35th Allerton Conference on Communication, Control, and Computing, Allerton Park, Ill., 1997.