## Policy Iteration for Factored MDPs (2000)

### Download Links

- [robotics.stanford.edu]
- [ai.stanford.edu]
- DBLP

### Other Repositories/Bibliography

Venue: Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI-00)

Citations: 78 (6 self)

### BibTeX

```bibtex
@inproceedings{Koller00policyiteration,
  author    = {Daphne Koller and Ronald Parr},
  title     = {Policy Iteration for Factored MDPs},
  booktitle = {Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI-00)},
  year      = {2000},
  pages     = {326--334},
  publisher = {Morgan Kaufmann}
}
```

### Abstract

Many large MDPs can be represented compactly using a dynamic Bayesian network. Although the structure of the value function does not retain the structure of the process, recent work has suggested that value functions in factored MDPs can often be approximated well using a factored value function: a linear combination of restricted basis functions, each of which refers only to a small subset of variables. An approximate factored value function for a particular policy can be computed using approximate dynamic programming, but this approach (and others) can only produce an approximation relative to a distance metric which is weighted by the stationary distribution of the current policy. This type of weighted projection is ill-suited to policy improvement.
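The abstract's central object, a factored linear value function, can be sketched concretely: a weighted sum of basis functions, each of which reads only a small subset of the state variables. The sketch below is illustrative only; all names (`factored_value`, `scopes`, the example basis functions) are assumptions, not the paper's API.

```python
def factored_value(state, weights, basis_fns, scopes):
    """V(x) = sum_i w_i * h_i(x restricted to scope C_i)."""
    total = 0.0
    for w, h, scope in zip(weights, basis_fns, scopes):
        restricted = tuple(state[v] for v in scope)  # x_{C_i}: only the vars in scope
        total += w * h(restricted)
    return total

# Example: 3 binary state variables, two basis functions over small scopes.
state = {"x0": 1, "x1": 0, "x2": 1}
basis_fns = [lambda xs: float(xs[0]),            # depends on x0 only
             lambda xs: float(xs[0] and xs[1])]  # depends on x1, x2 only
scopes = [("x0",), ("x1", "x2")]
weights = [2.0, -1.0]
print(factored_value(state, weights, basis_fns, scopes))  # 2*1 + (-1)*0 = 2.0
```

The point of the restriction is that each basis function never touches the full (exponentially large) state, which is what makes the dynamic-programming operations in the paper tractable.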

### Citations

485 | A model for reasoning about persistence and causation
- Dean, Kanazawa
- 1989
Citation Context: ...tly if that structure is exploited by the representation. In factored MDPs, a state is described implicitly as an assignment of values to some set of state variables. A dynamic Bayesian network (DBN) [7] can then allow a compact representation of the transition model, by exploiting the fact that the transition of a variable often depends only on a small number of other variables. The momentary reward...

274 | Bucket elimination: A unifying framework for reasoning
- Dechter
- 1999
Citation Context: ... functions. Hence each inner maximization is over a linear combination of functions, each of which is restricted to some small subset of variables. This type of optimization problem is a cost network [8], and can be solved using standard variable elimination algorithms. The computational cost, as for other related structures, is exponential in the induced width of the graph induced by the hyper-edges...
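The cost-network maximization mentioned in this excerpt can be sketched with a toy variable-elimination routine: maximize a sum of local functions by eliminating one variable at a time, so the work scales with the induced width rather than the full joint assignment space. This is an illustrative implementation under assumed data structures (binary variables, factors as `(scope, table)` pairs), not the paper's algorithm.

```python
from itertools import product

def eliminate_max(factors, variables):
    """factors: list of (scope_tuple, {assignment_tuple: value}).
    Returns max over all binary assignments of the summed factors."""
    factors = list(factors)
    for var in variables:
        touching = [f for f in factors if var in f[0]]
        rest = [f for f in factors if var not in f[0]]
        # Scope of the combined factor after maxing out `var`.
        merged_scope = tuple(sorted({v for scope, _ in touching
                                     for v in scope if v != var}))
        new_table = {}
        for assign in product([0, 1], repeat=len(merged_scope)):
            ctx = dict(zip(merged_scope, assign))
            # Max over `var` of the sum of all factors mentioning it.
            new_table[assign] = max(
                sum(tbl[tuple(dict(ctx, **{var: val})[v] for v in scope)]
                    for scope, tbl in touching)
                for val in (0, 1))
        rest.append((merged_scope, new_table))
        factors = rest
    # Every variable eliminated: remaining factors are constants.
    return sum(tbl[()] for _, tbl in factors)

# max over x,y,z of f1(x,y) + f2(y,z): never enumerates all 8 joint states.
f1 = (("x", "y"), {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 2.0, (1, 1): 1.0})
f2 = (("y", "z"), {(0, 0): 0.0, (0, 1): 3.0, (1, 0): 1.0, (1, 1): 0.0})
print(eliminate_max([f1, f2], ["x", "y", "z"]))  # f1(1,0) + f2(0,1) = 5.0
```

Each elimination step builds a table over the neighbors of the eliminated variable, which is why the cost is exponential only in the induced width of the graph formed by the factor scopes.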

259 | Residual algorithms: Reinforcement learning with function approximation
- Baird
- 1995
Citation Context: ...o provide 1 We note that there are two interpretations of the least squares solution to the Bellman equations. The first is as the direct minimization of the mean-squared Bellman residual error as in [1], while the second is as the fixed point of a Bellman iteration with a least-squares projection of the value function, i.e., the standard linear temporal difference approximation method. We adopt the ...

231 | Exploiting structure in policy construction
- Boutilier, Dearden, et al.
- 1995
Citation Context: ..., structure in a factored MDP rarely induces any type of structure in the value function. An obvious solution is to restrict attention to approximate value functions that can be represented compactly [3]. One very useful approach is to use linear value functions --- functions that are weighted linear combinations of some small number of basis functions. In recent work, there has been some success in ...

191 | Linear least-squares algorithms for temporal difference learning
- Bradtke, Barto
- 1996
Citation Context: ...sing this approach to address the policy evaluation problem --- determining the value function for a fixed policy. Generally, sampling is used to avoid explicit manipulation of the entire state space [5, 10]. In [9] (KP hereafter), we presented an approach based on approximate dynamic programming. The key to our approach was the use of factored linear value functions, where each basis function is restric...

112 | Model minimization in markov decision processes
- Dean, Givan
- 1997
Citation Context: ...proach to exploit various other types of structure in the model, including structured action spaces, where at each stage several actions are taken in parallel, and the context-sensitivity utilized by [3, 6]. As a more ambitious goal, we would also like to extend it to deal with the much harder problem of planning in Partially Observable MDPs. Acknowledgments We thank Xavier Boyen for an insightful discu...

97 | Computing factored value functions for policies in structured MDPs
- Koller, Parr
- 1999
Citation Context: ...proach to address the policy evaluation problem --- determining the value function for a fixed policy. Generally, sampling is used to avoid explicit manipulation of the entire state space [5, 10]. In [9] (KP hereafter), we presented an approach based on approximate dynamic programming. The key to our approach was the use of factored linear value functions, where each basis function is restricted to s...

87 | Tight performance bounds on greedy policies based on imperfect value functions
- Williams, Baird
- 1993
Citation Context: ...he computational cost, as for other related structures, is exponential in the induced width of the graph induced by the hyper-edges consisting of the function domains. For $\mathrm{BellmanErr}(\hat{V}) \le \epsilon$, we get [11]: $\|V - V'\|_\infty \le \frac{2\epsilon}{1 - \gamma}$. Thus, the true value of following the greedy policy is bounded by a function of the maximum Bellman error of $\hat{V}$. The above computation gives us a method of computing the worst-case...

38 | Learning and value function approximation in complex decision processes. Doctoral dissertation
- Roy
- 1998
Citation Context: ...sing this approach to address the policy evaluation problem --- determining the value function for a fixed policy. Generally, sampling is used to avoid explicit manipulation of the entire state space [5, 10]. In [9] (KP hereafter), we presented an approach based on approximate dynamic programming. The key to our approach was the use of factored linear value functions, where each basis function is restric...

20 | The Frame Problem and Bayesian Network Action Representation
- Boutilier, Goldszmidt
- 1996
Citation Context: ...ion dynamics, only differing in their effect on some small set of variables. In particular, in many cases a variable has a default evolution model, which only changes if an action affects it directly [4]. We therefore use the notion of a default transition model $\tau_d = \langle G_d, P_d \rangle$. For each action $a$, we define $\mathrm{Effects}[a] \subseteq \mathbf{X}'$ to be the variables in the next state whose local probability model is diffe...