G. Tesauro and G. R. Galperin. On-line policy improvement using Monte-Carlo search. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in NIPS, volume 9. MIT Press, 1997.
....number of updates. If RFDS does not construct a path to a goal state by the time that a sufficient number of updates are performed, then Theorem 3 guarantees the completion of the path afterward. A second method for generating a CLF that meets the conditions of Theorem 3 is to perform roll-outs [Tesauro and Galperin, 1996; Bertsekas et al., 1997]. In the present context, this means defining L_0(s) as the cost of the solution path generated by repeated application of O_1 starting from s until it reaches G. The function L_0 is evaluated by actually constructing the path. The reader may verify that L_0 ....
....satisfies the requirements of Theorem 3. Performing roll-outs can be an expensive method for evaluating leaves because an entire path to a goal state needs to be constructed for each evaluation. However, roll-outs have been found to be quite effective in both game playing and sequential control [Tesauro and Galperin, 1996; Bertsekas et al., 1997]. 6 Robot Arm Example. We briefly illustrate the theory presented above by applying it to a problem requiring the control of the simulated 3-link robot arm, depicted in Figure 1. The state space of the arm is R^6, corresponding to three angular joint positions and three ....
G. Tesauro and G. R. Galperin. On-line policy improvement using monte-carlo search. In Advances in Neural Information Processing: Proceedings of the Ninth Conference. MIT Press, 1996.
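The roll-out evaluation described in the excerpt above (L_0(s) as the cost of the path obtained by repeatedly applying a single operator until a goal is reached) can be sketched roughly as follows. This is a minimal illustration, not the cited authors' code: the names `rollout_cost`, `apply_operator`, `is_goal`, and `max_steps` are assumptions introduced here, and the toy problem at the end is hypothetical.

```python
from typing import Callable, Hashable, Tuple

State = Hashable


def rollout_cost(
    start: State,
    apply_operator: Callable[[State], Tuple[State, float]],
    is_goal: Callable[[State], bool],
    max_steps: int = 10_000,
) -> float:
    """Return the cost of the path built by applying one operator until a goal.

    `apply_operator` (hypothetical) maps a state to (next_state, step_cost).
    `max_steps` is an assumed safety cap; the excerpt does not mention one.
    """
    state, total = start, 0.0
    for _ in range(max_steps):
        if is_goal(state):
            return total
        state, step_cost = apply_operator(state)
        total += step_cost
    raise RuntimeError("no goal state reached within max_steps")


# Toy usage (hypothetical): walk an integer down to 0 at unit cost per step.
cost = rollout_cost(5, lambda s: (s - 1, 1.0), lambda s: s == 0)
```

Because the whole path is constructed for every evaluation, each call costs as much as one full solution attempt, which is the expense the excerpt points out.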
....search space, RoxyBot usually found the optimal completion in less than 1 second of search. Therefore, during the competition, RoxyBot used beam search rather than provably optimal A* search. Our heuristic f(x) is inspired by the rollout methods that have been used in game tree search (e.g. [1, 11]). It works as follows: it runs a greedy algorithm to complete the assignment from x down to the bottom of the tree: for a client that has thus far been assigned neither a travel package nor an entertainment package, s/he is assigned the travel and entertainment packages that jointly maximize ....
G. Tesauro and G. R. Galperin. On-line policy improvement using Monte-Carlo search. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in NIPS, volume 9. MIT Press, 1997.
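The greedy-completion heuristic in the excerpt above can be sketched in a simplified form. The excerpt is truncated before it states what is maximized, so the `score` function below is an assumption, as are the names `greedy_completion_value`, `clients`, and `packages`. The sketch also ignores any constraints on which packages remain available, which the real agent presumably has to respect.

```python
from typing import Callable, Dict, Hashable, Sequence

Client = Hashable
Package = Hashable


def greedy_completion_value(
    partial: Dict[Client, Package],
    clients: Sequence[Client],
    packages: Sequence[Package],
    score: Callable[[Client, Package], float],
) -> float:
    """Heuristic value of a partial assignment: complete it greedily, then score it."""
    assignment = dict(partial)  # copy so the caller's search node is not mutated
    for client in clients:
        if client not in assignment:
            # Greedy step: give the unassigned client its best-scoring package.
            assignment[client] = max(packages, key=lambda p: score(client, p))
    return sum(score(c, p) for c, p in assignment.items())


# Toy usage (hypothetical scores): client "a" is already assigned package 2.
value = greedy_completion_value(
    {"a": 2}, ["a", "b"], [1, 2, 3], lambda c, p: p if c == "a" else -p
)
```

A search procedure such as beam search could then rank partial assignments by this value.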
....a useful performance guarantee does not depend on the size of the state space, it depends exponentially on the value of the sampling horizon and requires a huge number of sampled next states, which makes applying this approach very impractical. On the other hand, motivated by Tesauro's work in backgammon [161], Bertsekas and Castanon proposed a practically much more viable approach: rolling out a given heuristic base policy. The true Q-value for each action is estimated by simulating the given base policy many times over sampled traces of the system after taking the action. With the infinite number ....
....to expect a good performance by using rollout, we need a good base policy. But the process of designing a good heuristic base policy is often very difficult. The straightforward method of estimating the Q^{V_{H-1}}_H value is to use a Monte Carlo simulation, which was suggested by Tesauro [161] in the context of backgammon. We generate many sequences of i.i.d. random numbers w_0, ..., w_{H-1} in [0, 1]^H and then simulate a given base policy after taking an action a starting from the state x. Note that each random number w independently sampled from [0, 1] can be mapped to a ....
G. Tesauro and G. R. Galperin, "On-line policy improvement using monte carlo search," in Proc. of NIPS, 1997, pp. 1068.
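The Monte Carlo estimate described in the two excerpts above (take action a in state x, then follow the base policy for the rest of a horizon of H steps, averaging over sampled traces) might look like the sketch below. It assumes a reward-maximizing formulation and a simulator of the hypothetical form `simulate(state, action, w) -> (next_state, reward)` driven by a uniform random number w in [0, 1]. The function names and signatures are illustrative and do not come from the cited work.

```python
import random
from typing import Callable, Sequence, Tuple, TypeVar

S = TypeVar("S")
A = TypeVar("A")


def rollout_q(
    state: S,
    action: A,
    base_policy: Callable[[S], A],
    simulate: Callable[[S, A, float], Tuple[S, float]],
    horizon: int,
    num_traces: int,
    rng: random.Random,
) -> float:
    """Average H-step return of taking `action` first, then following `base_policy`."""
    total = 0.0
    for _ in range(num_traces):
        x, a, ret = state, action, 0.0
        for _ in range(horizon):
            # One uniform random number per step drives the (assumed) simulator.
            x, reward = simulate(x, a, rng.random())
            ret += reward
            a = base_policy(x)  # every later step uses the base policy
        total += ret
    return total / num_traces


def rollout_action(state, actions: Sequence, base_policy, simulate,
                   horizon: int, num_traces: int, rng: random.Random):
    """Pick the action with the highest estimated Q-value."""
    return max(actions, key=lambda a: rollout_q(
        state, a, base_policy, simulate, horizon, num_traces, rng))


# Toy usage (hypothetical): reward equals the action; the base policy always plays 0.
toy_sim = lambda x, a, w: (x + 1, float(a))
q1 = rollout_q(0, 1, lambda x: 0, toy_sim, horizon=3, num_traces=10,
               rng=random.Random(0))
best = rollout_action(0, [0, 1], lambda x: 0, toy_sim, 3, 10, random.Random(0))
```

The quality of the resulting choice depends on the base policy and on the number of traces, which matches the excerpt's remark that a good base policy is needed.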
.... 1995) and has also been used in single-agent systems on discrete domains (e.g. Korf 1990). In game-playing scenarios it has also been used in conjunction with automatically learned value functions, such as in Samuel's celebrated checkers program (Samuel 1959) and Tesauro's backgammon player (Tesauro and Galperin, 1997). CLS: Constrained Local Search. To make deeper searches computationally cheaper, we might consider only a subset of all possible trajectories of depth d. Especially for dynamic control, often an optimal trajectory repeatedly selects and then holds a certain action for some time, such as ....
Tesauro, G., and Galperin, G. R. 1997. On-line Policy Improvement using Monte-Carlo Search. In Mozer, M. C.; Jordan, M. I.; and Petsche, T., eds., Advances in Neural Information Processing Systems 9. Morgan Kaufmann.
....scheduler would exploit the user's coding characteristics to build schedules better tuned for that user. [UMass Amherst Tech Report Number 99-23] With these motivations in mind, we formulated and tested two autonomous methods of building an instruction scheduler. The first method used rollouts (Tesauro and Galperin, 1996; Bertsekas et al., 1997a,b) and the second focused on reinforcement learning (RL) (Sutton & Barto, 1998). Both methods were implemented for the Digital Alpha 21064. The next section gives a domain overview and discusses results using supervised learning on the same task. 2 Domain overview. We ....
....compiler unrolled into extremely long blocks. By allowing our algorithms to schedule blocks whose size is greater than 10, we focus on scheduling the longer running blocks. We present timing results using the simulator. 3 Rollouts. Rollouts are a form of Monte Carlo search, first introduced by Tesauro and Galperin (1996) for use in backgammon. Bertsekas et al. (1997a,b) explored rollouts in other domains and proved important theoretical results. In the instruction scheduling domain, rollouts work as follows: suppose the scheduler comes to a point where it has a partial schedule and a set of (more than one) ....
Tesauro, G. & Galperin, G. R. (1996). On-line policy improvement using monte-carlo search. In Advances in Neural Information Processing: Proceedings of the Ninth Conference. MIT Press.
....decrease the running time of the compiler while potentially sacrificing some quality of the final schedule. With these motivations in mind, we formulated and tested two methods of building an instruction scheduler. The first method used rollouts (Woolsey, 1991; Abramson, 1990; Galperin, 1994; Tesauro and Galperin, 1996; Bertsekas et al., 1997a,b) and the second focused on reinforcement learning (RL) (Sutton and Barto, 1998). We also investigated the effect of combining the two methods. All methods were implemented for the Compaq Alpha 21064. These methods address the time tradeoff directly. Rollouts evaluate ....
....these blocks were then concatenated together for actual execution. This constraint did not affect the results significantly but sped up the simulator considerably. 3 Rollouts. Rollouts are a form of Monte Carlo search, first introduced in the backgammon literature (Woolsey, 1991; Galperin, 1994; Tesauro and Galperin, 1996). In other domains, Abramson (1990) studied what we call RANDOM p (below) in a game-playing context, and Bertsekas et al. (1997a,b) proved important theoretical results for rollouts. In the instruction scheduling domain, rollouts work as follows: suppose the scheduler comes to a point where it has ....
Tesauro, G. & Galperin, G. R. (1996). On-line policy improvement using Monte-Carlo search. In Advances in Neural Information Processing: Proceedings of the Ninth Conference. MIT Press.
....search, can significantly enhance local search performance for combinatorial optimization problems. Other DARP Case Studies. We also investigated two other learning-based enhancements to combinatorial optimization algorithms, again using DARP as our test problem. We considered the rollout method [109, 108, 106], and we used it to extend a very effective constructive DARP algorithm developed by Kubo and Kasugai [110]. Although our rollout extension is extremely long running, it significantly outperforms the best algorithm reported in [110]. Indeed even a drastically truncated rollout algorithm outperforms ....
Tesauro, G., and Galperin, G. R. (1996). On-line Policy Improvement using Monte-Carlo Search. In Advances in Neural Information Processing: Proceedings of the Ninth Conference. MIT Press.
....result. The idea of starting with some algorithm, and using it to construct another, hopefully improved, algorithm is implicit in the policy iteration method of DP and in the use of a rollout policy, which is a form of policy iteration; see [BeT96] (the name rollout policy was used by Tesauro [TeG96] in connection with one of his simulation-based computer backgammon algorithms). This connection will be shown to be particularly relevant to our context, and for this reason we call the sequential version of H the rollout algorithm based on H, and we denote it by RH. We note that the idea ....
Tesauro, G., and Galperin, G. R., 1996. "On-Line Policy Improvement Using Monte Carlo Search," unpublished report.
....the operation μ_k(x) = arg min_{u ∈ U(x)} E{ g(x, u, w) + J_{k+1}(f(x, u, w)) }, for all x, k = 0, 1, ... (4). Thus the rollout policy is a one-step lookahead policy, with the optimal cost-to-go approximated by the cost-to-go of the base policy. The name rollout policy was introduced by Tesauro [TeG96] in connection with one of his simulation-based computer backgammon algorithms. The book by Bertsekas and Tsitsiklis [BeT96] discusses in some detail various aspects of rollout policies in a stochastic context, and also in a deterministic combinatorial optimization context, as a device for ....
....is a pair of states (x, x') of the original system. 4.2. EVALUATING Q-FACTORS BY SIMULATION. A conceptually straightforward approach to compute the rollout control at a given state x and time k is to use Monte Carlo simulation (this was Tesauro's original proposal in the context of backgammon [TeG96]). To implement this, we consider all possible controls u ∈ U(x) and we generate a large number of simulated trajectories of the system starting from x, using u as the first control, and using the policy π thereafter. Thus a simulated trajectory has the form x_{i+1} = f(x_i, μ_i(x_i), w ....
Tesauro, G., and Galperin, G. R., 1996. "On-Line Policy Improvement Using Monte Carlo Search," unpublished report, presented at the 1996 Neural Information Processing Systems Conference, Denver, CO.
....making our assumptions explicit, we better understand the approximations we introduce, and how to select areas of future work that will improve performance further. [Footnote 4:] This is the off-line version of our algorithm; the on-line version would be a form of policy improvement using roll-outs, as in (Tesauro & Galperin, 1997). [Running head: McCallum, Nigam, Rennie and Seymore] Figure 4. A representation of spidering space where arrows are hyperlinks and nodes are web documents. The hexagonal node represents an already explored node; the circular nodes are unexplored. Filled in circles ....
Tesauro, G., & Galperin, G. R. (1997). On-line policy improvement using monte-carlo search. In Advances in Neural Information Processing Systems 9, pp. 1068--1074.
....work is required to answer that question. As the size of the scheduling problem increases, it becomes increasingly expensive to compute the value function accurately. However, even an inexact value function can be useful as the basis for a quasi-greedy search or rollout search performed online [12]. We intend to test such methods in future work on larger scheduling problems. ....
G. Tesauro and G. R. Galperin. On-line policy improvement using Monte-Carlo search. In M. Mozer, M. Jordan, and T. Petsche, editors, NIPS-9, 1997.
....work is required to answer that question. As the size of the scheduling problem increases, it becomes increasingly expensive to compute the value function accurately. However, even an inexact value function can be useful as the basis for a quasi-greedy search or rollout search performed online [Tesauro and Galperin, 1997]. We intend to test such methods in future work on larger scheduling problems. Acknowledgements. The second author acknowledges the support of a NASA GSRP fellowship. The third author acknowledges the support of an NSF Career Award. ....
G. Tesauro and G. R. Galperin. On-line policy improvement using Monte-Carlo search. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems, volume 9. MIT Press, 1997.
....was compiled using the commercial Digital compiler at the highest level of optimization. We call the schedules output by the compiler ORIG. This collection has 447,127 basic blocks, containing 2,205,466 instructions. 3 Rollouts. Rollouts are a form of Monte Carlo search, first introduced by Tesauro and Galperin (1996) for use in backgammon. Bertsekas et al. (1997a,b) have explored rollouts in other domains and proved important theoretical results. In the instruction scheduling domain, rollouts work as follows: suppose the scheduler comes to a point where it has a partial schedule and a set of (more than one) ....
Tesauro, G. & Galperin, G. R. (1996). On-line policy improvement using monte-carlo search. In Advances in Neural Information Processing: Proceedings of the Ninth Conference. MIT Press.
....obtain by applying standard search algorithms. But for highly stochastic Markov decision problems, such values are harder to find. In domains where a simulator is available, Monte Carlo methods (e.g. roll-outs or more sophisticated methods, see (Boyan & Moore, 1996; Bertsekas & Tsitsiklis, 1996; Tesauro & Galperin, 1997)) can provide reasonable estimates. But in situations where learning is entirely online, our approach will not be applicable. Second, the method currently requires internal cross-validation to set the various parameters (learning rate, relative weighting of the error terms). More experience is ....
Tesauro, G., & Galperin, G. R. (1997). On-line policy improvement using Monte-Carlo search. In Mozer, M. C., Jordan, M. I., & Petsche, T. (Eds.), Advances in Neural Information Processing Systems, Vol. 9, p. 1068. The MIT Press.
G. Tesauro and G. Galperin. On-line policy improvement using monte carlo search. In Advances in Neural Information Processing Systems, pages 1068--1074, Cambridge MA, 1996. MIT Press.
G. Tesauro, G.R. Galperin, On-line policy improvement using Monte Carlo search, in: Neural Information Processing Systems Conference, Denver, CO, 1996, Unpublished Report.