Results 1–10 of 92
Simulation-Based Approach to General Game Playing
Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, 2008
"... The aim of General Game Playing (GGP) is to create intelligent agents that automatically learn how to play many different games at an expert level without any human intervention. The most successful GGP agents in the past have used traditional gametree search combined with an automatically learned ..."
Abstract

Cited by 84 (6 self)
 Add to MetaCart
(Show Context)
The aim of General Game Playing (GGP) is to create intelligent agents that automatically learn how to play many different games at an expert level without any human intervention. The most successful GGP agents in the past have used traditional game-tree search combined with an automatically learned heuristic function for evaluating game states. In this paper we describe a GGP agent that instead uses a Monte Carlo/UCT simulation technique for action selection, an approach recently popularized in computer Go. Our GGP agent has proven its effectiveness by winning last year’s AAAI GGP Competition. Furthermore, we introduce and empirically evaluate a new scheme for automatically learning search-control knowledge for guiding the simulation playouts, showing that it offers significant benefits for a variety of games.
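The UCT loop the abstract refers to is short enough to sketch. The following is a minimal, illustrative version, not the competition agent: the `game` object (`legal_moves`, `apply`, `is_terminal`, `reward`) is a hypothetical interface invented here, rewards are assumed to lie in [0, 1] from the root player's perspective, and a two-player implementation would additionally negate rewards at alternating plies.

```python
import math
import random

class Node:
    """One search-tree node; `value` accumulates playout rewards in [0, 1]."""
    def __init__(self, state, moves, parent=None, move=None):
        self.state, self.parent, self.move = state, parent, move
        self.untried = list(moves)   # moves not yet expanded from this node
        self.children = []
        self.visits = 0
        self.value = 0.0

def ucb1(parent, child, c=1.4):
    # UCB1 score: empirical mean plus an exploration bonus
    return (child.value / child.visits
            + c * math.sqrt(math.log(parent.visits) / child.visits))

def uct(game, root_state, n_playouts=1000):
    root = Node(root_state, game.legal_moves(root_state))
    for _ in range(n_playouts):
        node = root
        # 1. Selection: descend through fully expanded nodes via UCB1.
        while not node.untried and node.children:
            node = max(node.children, key=lambda ch: ucb1(node, ch))
        # 2. Expansion: attach one untried move as a new child.
        if node.untried:
            m = node.untried.pop(random.randrange(len(node.untried)))
            s = game.apply(node.state, m)
            child = Node(s, game.legal_moves(s), parent=node, move=m)
            node.children.append(child)
            node = child
        # 3. Simulation: uniformly random playout to a terminal state.
        s = node.state
        while not game.is_terminal(s):
            s = game.apply(s, random.choice(game.legal_moves(s)))
        reward = game.reward(s)
        # 4. Backpropagation: update statistics along the selected path.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Play the most-visited move at the root.
    return max(root.children, key=lambda ch: ch.visits).move
```

The paper's learned search-control knowledge would replace the uniformly random playout policy in step 3 with an informed one; the surrounding loop is unchanged.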
Pure exploration in multi-armed bandits problems
In Proceedings of the Twentieth International Conference on Algorithmic Learning Theory (ALT 2009), 2009
"... We consider the framework of stochastic multiarmed bandit problems and study the possibilities and limitations of strategies that explore sequentially the arms. The strategies are assessed not in terms of their cumulative regrets, as is usually the case, but through quantities referred to as simpl ..."
Abstract

Cited by 80 (13 self)
 Add to MetaCart
(Show Context)
We consider the framework of stochastic multi-armed bandit problems and study the possibilities and limitations of strategies that sequentially explore the arms. The strategies are assessed not in terms of their cumulative regrets, as is usually the case, but through quantities referred to as simple regrets. The latter are related to the (expected) gains of the decisions that the strategies would recommend for a new one-shot instance of the same multi-armed bandit problem. Here, exploration is only constrained by the number of available rounds (not necessarily known in advance), in contrast to the case when cumulative regrets are considered and exploitation needs to be performed at the same time. We start by indicating the links between simple and cumulative regrets: a small cumulative regret entails a small simple regret, but too small a cumulative regret prevents the simple regret from decreasing exponentially towards zero, its optimal distribution-dependent rate. We therefore introduce specific strategies, for which we prove both distribution-dependent and distribution-free bounds. A concluding experimental study puts these theoretical bounds in perspective and shows the interest of non-uniform exploration of the arms.
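As a concrete illustration of simple regret, here is a small simulation of the baseline pure-exploration strategy in this setting: explore the arms uniformly, then recommend the empirically best one for a single one-shot play. The arm means and pull budget are invented for the example.

```python
import random

def pull(p):
    """Draw one Bernoulli reward from an arm with mean p."""
    return 1.0 if random.random() < p else 0.0

def uniform_then_recommend(means, n):
    """Allocate pulls uniformly, then recommend the empirically best arm.
    Returns the simple regret of that single recommendation."""
    k = len(means)
    counts, sums = [0] * k, [0.0] * k
    for t in range(n):
        i = t % k                       # round-robin exploration
        sums[i] += pull(means[i])
        counts[i] += 1
    best = max(range(k), key=lambda i: sums[i] / counts[i])
    return max(means) - means[best]

random.seed(0)
means = [0.5, 0.45, 0.6]                # three hypothetical Bernoulli arms
trials = 2000
avg = sum(uniform_then_recommend(means, 300) for _ in range(trials)) / trials
print(f"average simple regret after 300 pulls: {avg:.4f}")
```

Unlike a cumulative-regret strategy, this explorer pays no penalty for sampling bad arms during the 300 rounds; only the quality of the final recommendation counts.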
Computing Elo Ratings of Move Patterns in the Game of Go
"... Move patterns are an essential method to incorporate domain knowledge into Goplaying programs. This paper presents a new Bayesian technique for supervised learning of such patterns from game records, based on a generalization of Elo ratings. Each sample move in the training data is considered as a ..."
Abstract

Cited by 73 (0 self)
 Add to MetaCart
(Show Context)
Move patterns are an essential method to incorporate domain knowledge into Go-playing programs. This paper presents a new Bayesian technique for supervised learning of such patterns from game records, based on a generalization of Elo ratings. Each sample move in the training data is considered as a victory of a team of pattern features. Elo ratings of individual pattern features are computed from these victories, and can be used in previously unseen positions to compute a probability distribution over legal moves. In this approach, several pattern features may be combined, without an exponential cost in the number of features. Despite a very small number of training games (652), this algorithm outperforms most previous pattern-learning algorithms, both in terms of mean log-evidence (−2.69) and prediction rate (34.9%). A 19 × 19 Monte-Carlo program improved with these patterns reached the level of the strongest classical programs.
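The prediction side of this model is easy to sketch: a move is a team of features, the team's log-strength is the sum of its members' ratings (the generalized Bradley-Terry model behind Elo), and move probabilities follow by softmax. The sketch below swaps the paper's minorization-maximization fitting for plain gradient ascent to stay short, and the feature names are invented.

```python
import math
from collections import defaultdict

# ratings[f] is the log-strength ("Elo-like" rating) of pattern feature f;
# a move is a team of features whose log-strength is the sum of its members'.
ratings = defaultdict(float)

def move_probs(candidates):
    """Softmax over candidate moves, each given as a tuple of feature ids."""
    scores = [sum(ratings[f] for f in team) for team in candidates]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def learn_from_position(candidates, chosen, lr=0.1):
    """One gradient-ascent step on the log-likelihood of the expert's move.
    (The paper fits the same model by minorization-maximization instead.)"""
    probs = move_probs(candidates)
    for i, (team, p) in enumerate(zip(candidates, probs)):
        g = (1.0 if i == chosen else 0.0) - p
        for f in team:
            ratings[f] += lr * g

# Toy training data (feature names are invented): the expert always plays the
# move carrying the "hane" feature, so its rating should rise above the others.
for _ in range(200):
    learn_from_position([("hane", "edge"), ("atari",), ("cut",)], chosen=0)
print(ratings["hane"] > ratings["atari"])   # True
```

Note that the cost of combining features is additive in the team size, which is what lets the full model avoid an exponential blow-up in the number of features.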
Bandit Algorithms for Tree Search
2007
"... apport de recherche ISSN 02496399 ISRN INRIA/RR6141FR+ENG ..."
Abstract

Cited by 71 (14 self)
 Add to MetaCart
(Show Context)
apport de recherche ISSN 02496399 ISRN INRIA/RR6141FR+ENG
Exploration-exploitation tradeoff using variance estimates in multi-armed bandits
2009
"... ..."
(Show Context)
Online optimization in X-armed bandits
In Advances in Neural Information Processing Systems 22, 2008
"... We consider a generalization of stochastic bandit problems where the set of arms, X, is allowed to be a generic topological space and the meanpayoff function is “locally Lipschitz” with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm ..."
Abstract

Cited by 49 (11 self)
 Add to MetaCart
(Show Context)
We consider a generalization of stochastic bandit problems where the set of arms, X, is allowed to be a generic topological space and the mean-payoff function is “locally Lipschitz” with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy whose regret improves upon previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally Hölder with a known exponent, then the expected regret is bounded up to a logarithmic factor by √n, i.e., the rate of growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm for the class of problems considered.
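A minimal rendering of the hierarchical optimistic strategy on X = [0, 1] looks as follows. The parameters ν and ρ, the midpoint sampling rule, and the final recommendation rule are pragmatic choices for this sketch rather than the paper's exact algorithm, which is analysed through its cumulative regret.

```python
import math
import random

class Cell:
    """Node (h, i) of a binary tree over X = [0, 1]; covers [i*2^-h, (i+1)*2^-h]."""
    def __init__(self, h, i):
        self.h, self.i = h, i
        self.count, self.mean = 0, 0.0
        self.children = None
        self.B = float("inf")            # optimistic bound; +inf until sampled

def hoo(f, n, rho=0.5, nu=1.0):
    root = Cell(0, 0)
    for t in range(1, n + 1):
        # Selection: follow the child with the larger B-value to an unexpanded cell.
        path, node = [root], root
        while node.children is not None:
            node = max(node.children, key=lambda c: c.B)
            path.append(node)
        node.children = [Cell(node.h + 1, 2 * node.i),
                         Cell(node.h + 1, 2 * node.i + 1)]
        # Pull the midpoint of the selected cell and observe a noisy reward.
        x = (node.i + 0.5) * 2.0 ** (-node.h)
        r = f(x)
        for p in path:                   # update counts and running means on the path
            p.count += 1
            p.mean += (r - p.mean) / p.count
        for p in reversed(path):         # recompute B-values bottom-up
            U = p.mean + math.sqrt(2.0 * math.log(t) / p.count) + nu * rho ** p.h
            p.B = min(U, max(c.B for c in p.children))
    # Recommend the midpoint of the most-played path (a pragmatic choice here).
    node = root
    while node.children is not None and any(c.count for c in node.children):
        node = max(node.children, key=lambda c: c.count)
    return (node.i + 0.5) * 2.0 ** (-node.h)

random.seed(1)
noisy = lambda x: 0.5 * (1.0 - abs(x - 0.3)) + 0.5 * random.random()
print(hoo(noisy, 2000))                  # should land near the peak at x = 0.3
```

The term ν·ρ^h is the bound on how much the mean payoff can vary inside a depth-h cell; it is what lets the tree refine only around the maxima and keeps the regret rate dimension-independent.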
Automatic generation and evaluation of recombination games
2008
"... Many new board games are designed each year, ranging from the unplayable to the truly exceptional. For each successful design there are untold numbers of failures; game design is something of an art. Players generally agree on some basic properties that indicate the quality and viability of a game, ..."
Abstract

Cited by 39 (5 self)
 Add to MetaCart
(Show Context)
Many new board games are designed each year, ranging from the unplayable to the truly exceptional. For each successful design there are untold numbers of failures; game design is something of an art. Players generally agree on some basic properties that indicate the quality and viability of a game; however, these properties have remained subjective and open to interpretation. The aims of this thesis are to determine whether such quality criteria may be precisely defined and automatically measured through self-play, in order to estimate the likelihood that a given game will be of interest to human players, and whether this information may be used to direct an automated search for new games of high quality. Combinatorial games provide an excellent test bed for this purpose, as they are typically deep yet described by simple, well-defined rule sets. To test these ideas, a game description language was devised to express such games, and a general game system was implemented to play, measure and explore them. Key features of the system include modules for measuring statistical aspects of self-play and synthesising new games through the evolution of existing rule sets.
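The core measurement idea reduces to a few lines: play a candidate game against itself many times and turn the outcome statistics into candidate quality scores. The sketch below is a toy version under an invented game interface; the thesis defines a much richer set of criteria and uses stronger players than random ones.

```python
import random

def measure_quality(game, n_games=1000):
    """Estimate a few quality criteria from uniformly random self-play.
    `game` is a hypothetical interface (initial_state/legal_moves/apply/
    is_terminal/winner) invented for this sketch; winner() returns 1 or 2
    for a win and 0 for a draw."""
    wins = {0: 0, 1: 0, 2: 0}
    total_moves = 0
    for _ in range(n_games):
        s = game.initial_state()
        while not game.is_terminal(s):
            s = game.apply(s, random.choice(game.legal_moves(s)))
            total_moves += 1
        wins[game.winner(s)] += 1
    decided = wins[1] + wins[2]
    balance = 1.0 - abs(wins[1] - wins[2]) / max(decided, 1)  # 1.0 = fair to both sides
    drawishness = wins[0] / n_games                           # high values suggest dull games
    avg_length = total_moves / n_games                        # extreme lengths score poorly
    return balance, drawishness, avg_length
```

Scores like these can then serve as a fitness function when evolving rule sets, which is how the search for new games is directed.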
A Monte-Carlo AIXI Approximation
2009
"... This paper describes a computationally feasible approximation to the AIXI agent, a universal reinforcement learning agent for arbitrary environments. AIXI is scaled down in two key ways: First, the class of environment models is restricted to all prediction suffix trees of a fixed maximum depth. Thi ..."
Abstract

Cited by 33 (11 self)
 Add to MetaCart
This paper describes a computationally feasible approximation to the AIXI agent, a universal reinforcement learning agent for arbitrary environments. AIXI is scaled down in two key ways. First, the class of environment models is restricted to all prediction suffix trees of a fixed maximum depth; this allows a Bayesian mixture of environment models to be computed in time proportional to the logarithm of the size of the model class. Second, the finite-horizon expectimax search is approximated by an asymptotically convergent Monte Carlo Tree Search technique. This scaled-down AIXI agent is empirically shown to be effective on a wide class of toy problem domains, ranging from simple fully observable games to small POMDPs. We explore the limits of this approximate agent and propose a general heuristic framework for scaling this technique to much larger problems.
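The first reduction, a Bayesian mixture over all bounded-depth prediction suffix trees, is what context-tree weighting computes, with per-symbol cost linear in the depth and hence logarithmic in the size of the model class. A minimal binary CTW sketch, written independently of the paper's code:

```python
import math

class CTWNode:
    __slots__ = ("a", "b", "log_kt", "log_w", "children")
    def __init__(self):
        self.a = self.b = 0          # counts of 0s and 1s seen in this context
        self.log_kt = 0.0            # log prob. of the data under the KT estimator
        self.log_w = 0.0             # log context-tree-weighted probability
        self.children = [None, None]

def _logaddexp(x, y):
    m = max(x, y)
    return m + math.log(math.exp(x - m) + math.exp(y - m))

def ctw_update(node, ctx, bit):
    """Update the tree along the context path with one observed bit."""
    count = node.b if bit else node.a
    node.log_kt += math.log((count + 0.5) / (node.a + node.b + 1.0))  # KT update
    if bit:
        node.b += 1
    else:
        node.a += 1
    if not ctx:                       # maximum depth: weighted prob = KT prob
        node.log_w = node.log_kt
        return
    branch = ctx[-1]                  # branch on the most recent context bit
    if node.children[branch] is None:
        node.children[branch] = CTWNode()
    ctw_update(node.children[branch], ctx[:-1], bit)
    # P_w = 1/2 * P_kt + 1/2 * P_w(child0) * P_w(child1); a missing child contributes 1.
    lw_kids = sum(c.log_w for c in node.children if c is not None)
    node.log_w = math.log(0.5) + _logaddexp(node.log_kt, lw_kids)

# Sequential prediction: the increase in the root's log_w after an update is
# exactly the log probability the mixture assigned to that bit.
D = 4
root = CTWNode()
history = [0] * D                     # zero padding for the first few contexts
seq = [1, 1, 0] * 20                  # a deterministic repeating pattern
code_len = 0.0
for bit in seq:
    before = root.log_w
    ctw_update(root, history[-D:], bit)
    code_len -= (root.log_w - before) / math.log(2.0)
    history.append(bit)
print(f"{code_len:.1f} bits to code {len(seq)} symbols")  # far fewer than 60
```

The full agent wraps a predictor of this kind (extended to actions, observations, and rewards) inside the Monte Carlo Tree Search mentioned in the second reduction.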
X-Armed Bandits
2010
"... We consider a generalization of stochastic bandits where the set of arms, ..."
Abstract

Cited by 28 (7 self)
 Add to MetaCart
We consider a generalization of stochastic bandits where the set of arms, …
Optimistic planning of deterministic systems
European Workshop on Reinforcement Learning, France, 2008
"... If one possesses a model of a controlled deterministic system, then from any state, one may consider the set of all possible reachable states starting from that state and using any sequence of actions. This forms a tree whose size is exponential in the planning time horizon. Here we ask the questio ..."
Abstract

Cited by 28 (11 self)
 Add to MetaCart
(Show Context)
If one possesses a model of a controlled deterministic system, then from any state, one may consider the set of all possible reachable states starting from that state and using any sequence of actions. This forms a tree whose size is exponential in the planning time horizon. Here we ask the question: given finite computational resources (e.g. CPU time), which may not be known ahead of time, what is the best way to explore this tree, such that once all resources have been used, the algorithm would be able to propose an action (or a sequence of actions) whose performance is as close as possible to optimality? The performance with respect to optimality is assessed in terms of the regret (with respect to the sum of discounted future rewards) resulting from choosing the action returned by the algorithm instead of an optimal action. In this paper we investigate an optimistic exploration of the tree, where the most promising states are explored first, and compare this approach to a naive uniform exploration. Bounds on the regret are derived both for uniform and optimistic exploration strategies. Numerical simulations illustrate the benefit of optimistic planning.
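The optimistic exploration described here is compact: keep a frontier of leaves ordered by the upper bound u + γ^d/(1 − γ) on the value of any continuation, where u is the discounted reward already collected along the path, and always expand the most promising leaf. The sketch below assumes a hypothetical `step(state, action) -> (next_state, reward)` model with rewards in [0, 1], and recommends the first action of the best explored path, one reasonable recommendation rule among those the paper analyses.

```python
import heapq
import itertools

def optimistic_plan(s0, actions, step, gamma, budget):
    """Optimistic planning in a known deterministic model: always expand the
    leaf with the largest upper bound b = u + gamma^d / (1 - gamma), where u
    is the discounted reward along the path and rewards lie in [0, 1]."""
    tick = itertools.count()             # tie-breaker so the heap never compares states
    # frontier entries: (-b, tick, u, depth, state, first action on the path)
    leaves = [(-(1.0 / (1.0 - gamma)), next(tick), 0.0, 0, s0, None)]
    best_u, best_action = -1.0, None
    for _ in range(budget):              # budget = number of node expansions
        if not leaves:
            break
        _, _, u, d, s, first = heapq.heappop(leaves)
        for a in actions:
            s2, r = step(s, a)           # deterministic model: next state and reward
            u2 = u + (gamma ** d) * r
            first2 = a if first is None else first
            if u2 > best_u:              # remember the best explored path so far
                best_u, best_action = u2, first2
            b2 = u2 + (gamma ** (d + 1)) / (1.0 - gamma)
            heapq.heappush(leaves, (-b2, next(tick), u2, d + 1, s2, first2))
    return best_action
```

Uniform exploration would expand the shallowest leaf instead of the one with the largest b-value; the paper's regret bounds quantify how much the optimistic order gains when the budget is unknown in advance.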