The nonstochastic multiarmed bandit problem
- SIAM Journal on Computing
, 2002
Cited by 204 (16 self)
In the multi-armed bandit problem, a gambler must decide which arm of K non-identical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of the slot machines. In this work, we make no statistical assumptions whatsoever about the nature of the process generating the payoffs of the slot machines. We give a solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs. In a sequence of T plays, we prove that the per-round payoff of our algorithm approaches that of the best arm at the rate O(T^{-1/2}). We show by a matching lower bound that this is best possible. We also prove that our algorithm approaches the per-round payoff of any set of strategies at a similar rate: if the best strategy is chosen from a pool of N strategies then our algorithm approaches the per-round payoff of the strategy at the rate O((log N)^{1/2} T^{-1/2}). Finally, we apply our results to the problem of playing an unknown repeated matrix game. We show that our algorithm approaches the minimax payoff of the unknown game at the rate O(T^{-1/2}).
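The abstract above does not spell out the algorithm itself. As a rough illustration of how a gambler can cope with adversarially chosen payoffs, the following is a minimal sketch of an exponential-weighting scheme in the style usually called Exp3. The function name `exp3`, the callback `reward_fn`, and the fixed exploration rate `gamma` are assumptions made for this example, and rewards are assumed to lie in [0, 1].

```python
import math
import random


def exp3(reward_fn, num_arms, num_rounds, gamma=0.1, seed=0):
    """Sketch of an Exp3-style bandit player (illustrative, not the paper's code).

    reward_fn(t, arm) is a hypothetical callback returning the payoff in [0, 1]
    of the chosen arm at round t; payoffs of other arms are never observed.
    """
    rng = random.Random(seed)
    weights = [1.0] * num_arms
    probs = [1.0 / num_arms] * num_arms
    total_reward = 0.0
    for t in range(num_rounds):
        weight_sum = sum(weights)
        # Mix the weight-proportional distribution with uniform exploration.
        probs = [(1.0 - gamma) * w / weight_sum + gamma / num_arms
                 for w in weights]
        arm = rng.choices(range(num_arms), weights=probs)[0]
        reward = reward_fn(t, arm)
        total_reward += reward
        # Importance-weighted estimate: unbiased although only one arm is seen.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / num_arms)
        # Rescale so the largest weight is 1; this leaves probs unchanged
        # and avoids floating-point overflow on long runs.
        largest = max(weights)
        weights = [w / largest for w in weights]
    return total_reward, probs
```

Calling `exp3(lambda t, arm: 1.0 if arm == 2 else 0.0, num_arms=3, num_rounds=2000)` runs the sketch against a fixed payoff sequence in which only arm 2 pays. The uniform mixing term keeps every arm's probability at least `gamma / num_arms`, which is what keeps the importance-weighted estimate bounded.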
Gambling in a rigged casino: The adversarial multi-armed bandit problem
, 1995
Cited by 144 (6 self)
In the multi-armed bandit problem, a gambler must decide which arm of K non-identical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of the slot machines. In this work, we make no statistical assumptions whatsoever about the nature of the process generating the payoffs of the slot machines. We give a solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs. In a sequence of T plays, we prove that the expected per-round payoff of our algorithm approaches that of the best arm at the rate O(T^{-1/2}), and we give an improved rate of conver...
An Active Testing Model for Tracking Roads in Satellite Images
- IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE
, 1995
Cited by 133 (4 self)
We present a new approach for tracking roads from satellite images, and thereby illustrate a general computational strategy ("active testing") for tracking 1D structures and other recognition tasks in computer vision. Our approach is related to recent work in active vision on "where to look next" and motivated by the "divide-and-conquer" strategy of parlor games such as "Twenty Questions." We choose "tests" (matched filters for short road segments) one at a time in order to remove as much uncertainty as possible about the "true hypothesis" (road position) given the results of the previous tests. The tests are chosen on-line based on a statistical model for the joint distribution of tests and hypotheses. The problem of minimizing uncertainty (measured by entropy) is formulated in simple and explicit analytical terms. To execute this entropy testing rule we then alternate between data collection and optimization: at each iteration new image data are examined and a new entropy minimizat...
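The entropy testing rule described in this abstract can be illustrated on a toy problem. The sketch below assumes a finite set of hypotheses with a prior, and hypothetical binary tests described by `likelihoods[t][h]`, the probability that test `t` returns 1 when hypothesis `h` is true. It picks the test with the smallest expected posterior entropy. The paper's actual tests are matched filters governed by a statistical image model, which this sketch does not attempt to reproduce.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def expected_posterior_entropy(prior, likelihood_row):
    """Expected entropy over hypotheses after observing one binary test.

    likelihood_row[h] is the (assumed) probability of outcome 1 under h.
    """
    expected = 0.0
    for outcome in (0, 1):
        # Unnormalised posterior: prior times likelihood of this outcome.
        joint = [p * (l if outcome == 1 else 1.0 - l)
                 for p, l in zip(prior, likelihood_row)]
        outcome_prob = sum(joint)
        if outcome_prob > 0:
            posterior = [j / outcome_prob for j in joint]
            expected += outcome_prob * entropy(posterior)
    return expected


def choose_test(prior, likelihoods):
    """Index of the test that minimises expected posterior entropy."""
    return min(range(len(likelihoods)),
               key=lambda t: expected_posterior_entropy(prior, likelihoods[t]))
```

With a uniform prior over four hypotheses, a noiseless test that splits them two against two leaves one bit of expected uncertainty, whereas a test that isolates a single hypothesis leaves more. The greedy rule therefore prefers the even split, in the spirit of "Twenty Questions".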
A Survey of Computational Complexity Results in Systems and Control
, 2000
Cited by 82 (18 self)
The purpose of this paper is twofold: (a) to provide a tutorial introduction to some key concepts from the theory of computational complexity, highlighting their relevance to systems and control theory, and (b) to survey the relatively recent research activity lying at the interface between these fields. We begin with a brief introduction to models of computation, the concepts of undecidability, polynomial time algorithms, NP-completeness, and the implications of intractability results. We then survey a number of problems that arise in systems and control theory, some of them classical, some of them related to current research. We discuss them from the point of view of computational complexity and also point out many open problems. In particular, we consider problems related to stability or stabilizability of linear systems with parametric uncertainty, robust control, time-varying linear systems, nonlinear and hybrid systems, and stochastic optimal control.
The Complexity Of Optimal Queueing Network Control
- Mathematics of Operations Research
, 1994
Cited by 53 (2 self)
We show that several well-known optimization problems related to the optimal control of queues are provably intractable, independently of any unproven conjecture such as P ≠ NP. In particular, we show that several versions of the problem of optimally controlling a simple network of queues with simple arrival and service distributions and multiple customer classes are complete for exponential time. This is perhaps the first such intractability result for a well-known optimization problem. We also show that the restless bandit problem (the generalization of the multi-armed bandit problem to the case in which the unselected processes are not quiescent) is complete for polynomial space. 1. INTRODUCTION The optimal control of a network of queues is a well-known, much studied, and notoriously difficult problem. We are given several servers, a set of customer classes, and class-dependent probability distributions for the service times. For each customer class, there is only one server tha...
Exploration of Multi-State Environments: Local Measures and Back-Propagation of Uncertainty
, 1998
Cited by 39 (1 self)
This paper presents an action selection technique for reinforcement learning in stationary Markovian environments. This technique may be used in direct algorithms such as Q-learning, or in indirect algorithms such as adaptive dynamic programming. It is based on two principles. The first is to define a local measure of the uncertainty using the theory of bandit problems. We show that such a measure suffers from several drawbacks. In particular, a direct application of it leads to algorithms of low quality that can be easily misled by particular configurations of the environment. The second basic principle was introduced to eliminate this drawback. It consists of assimilating the local measures of uncertainty to rewards, and back-propagating them with the dynamic programming or temporal difference mechanisms. This allows reproducing global-scale reasoning about the uncertainty, using only local measures of it. Numerical simulations clearly show the efficiency of these propositions. Keywords: ...
Restless bandits, linear programming relaxations, and a primal-dual index heuristic
- Operations Research
, 2000
On Optimal Allocation of Indivisibles under Uncertainty
, 1995
Cited by 25 (4 self)
this paper is to develop a stochastic version of the branch and bound method for optimization problems involving discrete decision variables and uncertainties. The proposed procedure can be applied in cases when conventional deterministic techniques run into difficulties in calculating exact bounds. Such situations are typical for optimization of stochastic systems with indivisible resources. To illustrate the complexity encountered, let us recall two classical decision models. At first, consider the following well-known hypothesis testing problem. Suppose that there are two actions i = 1, 2 with random outcomes α_i, i = 1, 2. The distribution of α_i depends on i, but is unknown. By using random observations of α_i, we want to find the action with the smallest expected outcome E α_i. Obviously, this problem is equivalent to the verification of the inequality: E α_1 < E α_2. Even this problem with only two alternatives is a nontrivial problem of mathematical statistics. A more general problem which is often referred to as the automaton learning or the multi-armed bandit problem is the following (see [Git89]). Let {1, ..., N} be the set of possible actions of the automaton and let α_i be the response of the "environment" to action
Bayesian Sparse Sampling for On-line Reward Optimization
- In ICML 2005
, 2005
Cited by 23 (3 self)
We present an efficient “sparse sampling” technique for approximating Bayes optimal decision making in reinforcement learning, addressing the well known exploration versus exploitation tradeoff. Our approach combines sparse sampling with Bayesian exploration to achieve improved decision making while controlling computational cost. The idea is to grow a sparse lookahead tree, intelligently, by exploiting information in a Bayesian posterior, rather than enumerate action branches (standard sparse sampling) or compensate myopically (value of perfect information). The outcome is a flexible, practical technique for improving action selection in simple reinforcement learning scenarios.
Optimal coordinated planning amongst self-interested agents with private state
- In Proceedings of the Twenty-second Annual Conference on Uncertainty in Artificial Intelligence (UAI’06)
, 2006
Cited by 20 (12 self)
Consider a multi-agent system in a dynamic and uncertain environment. Each agent’s local decision problem is modeled as a Markov decision process (MDP) and agents must coordinate on a joint action in each period, which provides a reward to each agent and causes local state transitions. A social planner knows the model of every agent’s MDP and wants to implement the optimal joint policy, but agents are self-interested and have private local state. We provide an incentive-compatible mechanism for eliciting state information that achieves the optimal joint plan in a Markov perfect equilibrium of the induced stochastic game. In the special case in which local problems are Markov chains and agents compete to take a single action in each period, we leverage Gittins allocation indices to provide an efficient factored algorithm and distribute computation of the optimal policy among the agents. Distributed, optimal coordinated learning in a multiagent variant of the multi-armed bandit problem is obtained as a special case.

