Computing robust counter-strategies
- In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2007
Cited by 17 (5 self)
Adaptation to other initially unknown agents often requires computing an effective counter-strategy. In the Bayesian paradigm, one must find a good counter-strategy to the inferred posterior of the other agents’ behavior. In the experts paradigm, one may want to choose experts that are good counter-strategies to the other agents’ expected behavior. In this paper we introduce a technique for computing robust counter-strategies for adaptation in multiagent scenarios under a variety of paradigms. The strategies can take advantage of a suspected tendency in the decisions of the other agents, while bounding the worst-case performance when the tendency is not observed. The technique involves solving a modified game, and therefore can make use of recently developed algorithms for solving very large extensive games. We demonstrate the effectiveness of the technique in two-player Texas Hold’em. We show that the computed poker strategies are substantially more robust than best response counter-strategies, while still exploiting a suspected tendency. We also compose the generated strategies in an experts algorithm showing a dramatic improvement in performance over using simple best responses.
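The trade-off described in this abstract can be illustrated on a toy game. The sketch below is not the paper's poker implementation: it assumes one simple reading of "modified game", in which the opponent follows a suspected strategy with probability `p` and is free to play adversarially otherwise, and it uses rock-paper-scissors with a brute-force grid search in place of a large-game solver. The payoff matrix, the `suspected` strategy, and all function names are hypothetical.

```python
from itertools import product

# Row player's payoffs in rock-paper-scissors: A[i][j] is our payoff when
# we play i and the opponent plays j (0 = rock, 1 = paper, 2 = scissors).
A = [[0, -1, 1],
     [1, 0, -1],
     [-1, 1, 0]]

def payoff_vs(x, y):
    """Expected payoff of mixed strategy x against mixed strategy y."""
    return sum(x[i] * y[j] * A[i][j] for i in range(3) for j in range(3))

def worst_case(x):
    """Payoff of x against the opponent's most damaging pure action."""
    return min(sum(x[i] * A[i][j] for i in range(3)) for j in range(3))

def simplex_grid(steps):
    """All mixed strategies whose probabilities are multiples of 1/steps."""
    for a, b in product(range(steps + 1), repeat=2):
        if a + b <= steps:
            yield (a / steps, b / steps, (steps - a - b) / steps)

def robust_counter_strategy(suspected, p, steps=30):
    """Grid-search maximiser of p * exploitation + (1 - p) * worst case.

    Assumption: the opponent plays `suspected` with probability p and an
    adversarial strategy with probability 1 - p.
    """
    return max(
        simplex_grid(steps),
        key=lambda x: p * payoff_vs(x, suspected) + (1 - p) * worst_case(x),
    )

suspected = (0.5, 0.25, 0.25)  # hypothetical rock-heavy opponent
for p in (0.0, 0.5, 1.0):
    x = robust_counter_strategy(suspected, p)
    print(p, x, round(payoff_vs(x, suspected), 3), round(worst_case(x), 3))
```

With `p = 0` the search falls back to the uniform equilibrium strategy, and with `p = 1` it returns the pure best response (always paper), which earns the most against the suspected tendency but can lose a full unit per round if the tendency is absent. Intermediate values of `p` sit between those two extremes.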
A general criterion and an algorithmic framework for learning in multi-agent systems
- Machine Learning, 2007
Data Biased Robust Counter Strategies
- Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS-09)
Cited by 6 (0 self)
The problem of exploiting information about the environment while still being robust to inaccurate or incomplete information arises in many domains. Competitive imperfect information games where the goal is to maximally exploit an unknown opponent’s weaknesses are an example of this problem. Agents for these games must balance two objectives. First, they should aim to exploit data from past interactions with the opponent, seeking a best-response counter strategy. Second, they should aim to minimize losses since the limited data may be misleading or the opponent’s strategy may have changed, suggesting an opponent-agnostic Nash equilibrium strategy. In this paper, we show how to partially satisfy both of these objectives at the same time, producing strategies with favourable tradeoffs between the ability to exploit an opponent and the capacity to be exploited. Like a recently published technique, our approach involves solving a modified game; however, the result is more generally applicable and even performs well in situations with very limited data. We evaluate our technique in the game of two-player, Limit Texas Hold’em.
Safe Opponent Exploitation
Cited by 1 (1 self)
We consider the problem of playing a finitely-repeated two-player zero-sum game safely, that is, guaranteeing at least the value of the game per period in expectation regardless of the strategy used by the opponent. Playing a stage-game equilibrium strategy at each time step clearly guarantees safety, and prior work has conjectured that it is impossible to simultaneously deviate from a stage-game equilibrium (in hope of exploiting a suboptimal opponent) and to guarantee safety. We show that such profitable deviations are indeed possible, specifically in games where certain types of ‘gift’ strategies exist, which we define formally. We show that the set of strategies constituting such gifts can be strictly larger than the set of iteratively weakly-dominated strategies; this disproves another recent conjecture which states that all non-iteratively-weakly-dominated strategies are best responses to each equilibrium strategy of the other player. We present a full characterization of safe strategies, and develop efficient algorithms for exploiting suboptimal opponents while guaranteeing safety. We also provide analogous results for sequential perfect- and imperfect-information games, and present safe exploitation algorithms and full characterizations of safe strategies for those settings as well. We present experimental results in Kuhn poker, a canonical test problem for game-theoretic algorithms. Our experiments show that 1) aggressive safe exploitation strategies significantly outperform adjusting the exploitation within equilibrium strategies and 2) all the safe exploitation strategies significantly outperform a (non-safe) best response strategy against strong dynamic opponents.
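The idea of deviating from equilibrium without giving up safety can be sketched with a simple budget rule. Everything below is an illustrative assumption rather than the paper's algorithm or its formal definition of a gift: the game is rock-paper-scissors with a hypothetical fourth opponent action that hands the row player 0.5, and the rule only risks, in the worst case, what has already been gained in expectation above the game value.

```python
# Row player's payoffs. Columns 0-2 are rock, paper, scissors; column 3 is a
# hypothetical dominated opponent action that gives us 0.5 whatever we play.
A = [[0, -1, 1, 0.5],
     [1, 0, -1, 0.5],
     [-1, 1, 0, 0.5]]
VALUE = 0.0                          # value of this game for the row player
EQUILIBRIUM = (1 / 3, 1 / 3, 1 / 3)  # a row strategy that guarantees VALUE

def payoff_vs_action(x, j):
    """Expected payoff of mixed strategy x against opponent action j."""
    return sum(x[i] * A[i][j] for i in range(3))

def worst_case(x):
    """Lowest expected payoff of x over all opponent actions."""
    return min(payoff_vs_action(x, j) for j in range(4))

def best_response(counts):
    """Pure best response to the opponent's empirical action counts."""
    scores = [sum(counts[j] * A[i][j] for j in range(4)) for i in range(3)]
    best = scores.index(max(scores))
    return tuple(1.0 if i == best else 0.0 for i in range(3))

def safe_mix(budget, target):
    """Shift from EQUILIBRIUM toward target, risking at most `budget`."""
    risk = VALUE - worst_case(target)  # worst-case loss of a full deviation
    weight = 1.0 if risk <= 0 else min(1.0, max(0.0, budget) / risk)
    return tuple((1 - weight) * e + weight * t
                 for e, t in zip(EQUILIBRIUM, target))

def play(opponent_actions):
    """Return the budget after each round against a fixed action sequence."""
    budget, counts, budgets = 0.0, [0, 0, 0, 0], []
    for j in opponent_actions:
        x = safe_mix(budget, best_response(counts))
        budget += payoff_vs_action(x, j) - VALUE  # expected gain this round
        counts[j] += 1
        budgets.append(budget)
    return budgets

print(play([3, 0, 0, 2, 0, 1]))
```

Because the worst-case payoff of a mixture is at least the mixture of the worst cases, each round can lose at most the current budget, so the budget never drops below zero and the cumulative expected payoff never falls below the game value. Against an opponent that never plays column 3 the budget stays at zero and the sketch simply plays the equilibrium strategy.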
On the Usefulness of Opponent Modeling: the Kuhn Poker case study (Short Paper)
The application of reinforcement learning algorithms to Partially Observable Stochastic Games (POSG) is challenging since each agent does not have access to the whole state information and, in the case of concurrent learners, the environment has non-stationary dynamics. These problems could be partially overcome if the policies followed by the other agents were known, and, for this reason, many approaches try to estimate them through so-called opponent modeling techniques. Although much research has been devoted to the accuracy of the estimation of opponents’ policies, little attention has been paid to understanding in which situations these model estimates can actually be useful to improve the agent’s performance. This paper presents a preliminary study about the impact
Reinforcement Learning Without Rewards
- 2010
Machine learning can be broadly defined as the study and design of algorithms that improve with experience. Reinforcement learning is a variety of machine learning that makes minimal assumptions about the information available for learning, and, in a sense, defines the problem of learning in the broadest possible terms. Reinforcement learning algorithms are usually applied to “interactive” problems, such as learning to drive a car, operate a robotic arm, or play a game. In reinforcement learning, an autonomous agent must learn how to behave in an unknown, uncertain, and possibly hostile environment, using only the sensory feedback that it receives from the environment. As the agent moves from one state of the environment to another, it receives only a reward signal; there is no human “in the loop” to tell the algorithm exactly what to do. The goal in reinforcement learning is to learn an optimal behavior that maximizes the total reward that the agent collects. Despite its generality, the reinforcement learning framework does make one strong assumption: that the reward signal can always be directly and unambiguously observed. In other words, the feedback a reinforcement learning algorithm receives is
Balancing Safety and Exploitability in Opponent Modeling
Opponent modeling is a critical mechanism in repeated games. It allows a player to adapt its strategy in order to better respond to the presumed preferences of its opponents. We introduce a new modeling technique that adaptively balances exploitability and risk reduction. An opponent’s strategy is modeled with a set of possible strategies that contains the actual strategy with high probability. The algorithm is safe, as the expected payoff is above the minimax payoff with high probability, and it can exploit the opponents’ preferences when sufficient observations have been obtained. We apply the technique to normal-form games and stochastic games with a finite number of stages. The performance of the proposed approach is first demonstrated on repeated rock-paper-scissors games. Subsequently, the approach is evaluated in a human-robot table-tennis setting where the robot player learns to prepare to return a served ball. By modeling the human players, the robot chooses a forehand, backhand or middle preparation pose before they serve. The learned strategies can exploit the opponent’s preferences, leading to a higher rate of successful returns.
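A minimal sketch of the set-based modeling idea on rock-paper-scissors follows. The abstract does not say how the set of possible strategies is built, so the construction here is an assumption: a box of per-action Hoeffding intervals around the observed frequencies (without a union-bound correction), with the player choosing, by grid search, the mixed strategy with the best worst case inside that box. All names are hypothetical.

```python
import math
from itertools import product

# Row player's payoffs in rock-paper-scissors (0 = rock, 1 = paper, 2 = scissors).
A = [[0, -1, 1],
     [1, 0, -1],
     [-1, 1, 0]]

def confidence_box(counts, delta=0.05):
    """Per-action Hoeffding intervals around the empirical frequencies."""
    n = sum(counts)
    eps = math.sqrt(math.log(2 / delta) / (2 * n))
    freqs = [c / n for c in counts]
    lo = [max(0.0, f - eps) for f in freqs]
    hi = [min(1.0, f + eps) for f in freqs]
    return lo, hi

def worst_case_in_box(x, lo, hi):
    """Lowest expected payoff of x over opponent strategies inside the box."""
    col = [sum(x[i] * A[i][j] for i in range(3)) for j in range(3)]
    q, spare = list(lo), 1.0 - sum(lo)
    # Greedily push the remaining probability mass onto our worst columns.
    for j in sorted(range(3), key=lambda k: col[k]):
        add = min(spare, hi[j] - lo[j])
        q[j] += add
        spare -= add
    return sum(q[j] * col[j] for j in range(3))

def simplex_grid(steps):
    """All mixed strategies whose probabilities are multiples of 1/steps."""
    for a, b in product(range(steps + 1), repeat=2):
        if a + b <= steps:
            yield (a / steps, b / steps, (steps - a - b) / steps)

def robust_strategy(counts, steps=30):
    """Strategy with the best guaranteed payoff over the confidence box."""
    lo, hi = confidence_box(counts)
    best = max(simplex_grid(steps), key=lambda x: worst_case_in_box(x, lo, hi))
    return best, worst_case_in_box(best, lo, hi)

print(robust_strategy([3, 0, 0]))    # few observations: wide box, cautious play
print(robust_strategy([300, 0, 0]))  # many observations: narrow box, exploit
```

With only three observed rocks the box is wide and the guaranteed payoff stays close to the minimax value of zero. With three hundred the box shrinks around "always rock" and the search commits to paper. The guarantee is never below zero, because the uniform strategy is in the grid and earns zero against every opponent strategy.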

