N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427-485, May 1997.
....a clear qualitative separation between unpredictability and randomness, and hence between prediction and gambling. However, the precise quantitative relationship between these processes has not been elucidated. Given the obvious significance of prediction and gambling for computational learning [2, 3, 25] and information theory [7, 8], this situation should be remedied. Recently, Lutz [13, 14] has defined computation effectivizations of classical Hausdorff dimension ("fractal dimension") and used these to investigate questions in computational complexity and algorithmic information theory. These ....
....a single sequence, and we show that deterministic feasible predictability is stable on computably presentable sets, i.e. that $\mathrm{dpred}_p(X \cup Y) = \min\{\mathrm{dpred}_p(X), \mathrm{dpred}_p(Y)\}$ whenever the sets X and Y are computably presentable. Feasible predictability is known to be stable on arbitrary sets [2]. To describe our main theorem precisely, we need to define two information-theoretic functions, namely, the k-adic segmented self-information function $I_k$ and the k-adic maximum entropy function $H_k$. The k-adic self-information of a real number in (0, 1] is $I_k($ ....
N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427-485, May 1997.
.... [Hut01a]. The convergence Theorem 1 is a known result [Sol78, LV97, Hut01a]. Introductory references: There are good review papers and books of Solomonoff sequence prediction [LV97], inductive inference [AS83, Sol97] in general, MDL and reasoning under uncertainty [Grü98], worst-case (WM) approaches [Ces97], Bayesian prediction approaches [HB92], and competitive online statistics [Vov99], which contain many further references. 2 Setup and Convergence. Notation: We denote strings over a finite alphabet X by $x_1 x_2 \ldots x_n$ with $x_t \in X$. We abbreviate $x_{n:m} := x_n x_{n+1} \ldots x_{m-1} x_m$ and x n : x 1 : x n ....
N. Cesa-Bianchi et al. How to use expert advice. Journal of the ACM, 44(3):427-485, 1997.
....[21] and references therein), coding, and estimation (e.g. [24, 6] and references thereto and therein). The latter question is the motivation for this work. It should be emphasized that in the prediction problem in the literature, especially that pertaining to computational learning theory (e.g. [27, 28, 12, 11, 10, 13, 19] and references therein), where the problem is formulated in terms of learning with expert advice, the class of experts is always assumed given and the questions typically asked concern optimal strategies per the given class. In such problems, one is not concerned with the question of how to ....
N. Cesa-Bianchi, Y. Freund, D.P. Helmbold, D. Haussler, R. Schapire, and M.K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427--485, 1997.
....prior is optimal (Section 6); generalize the bound in [CB90] on the relative entropy between and for continuous i.i.d. probability classes M to the non-i.i.d. case (Section 7); compare the universal prediction scheme and its loss bounds to the weighted majority scheme and its loss bounds [Ces97] (Section 9). Section 2 explains notation and defines the universal or mixture distribution as the w-weighted sum of probability distributions of a set M, which includes the true distribution . No structural assumptions are made on the . multiplicatively dominates all ∈ M, and the ....
....where not every symbol needs to be predicted, are described. Performing and predicting a sequence of independent experiments and online learning of classification tasks are special cases. Section 9 compares the universal prediction scheme studied here to the weighted majority (WM) algorithm(s) [LW89, Vov92, LW94, Ces97, HKW98, KW99]. WM combines forecasts of experts e ∈ E to form its own prediction. The number of prediction errors of WM is compared to that of the best expert in E. No assumption is made on the distribution of the strings; the bounds are worst-case bounds. Although the algorithms, the settings, and the proofs are ....
N. Cesa-Bianchi et al. How to use expert advice. Journal of the ACM, 44(3):427-485, 1997.
....universal property of [1] and [18] promise much improved performance for nice processes. The algorithms build on a methodology worked out in recent years for prediction of individual sequences; see Vovk [33], Feder, Merhav, and Gutman [8], Littlestone and Warmuth [16], Cesa-Bianchi et al. [5], Kivinen and Warmuth [15], Singer and Feder [23], and Merhav and Feder [17] for a survey. An approach similar to the one of this paper was adopted by Györfi, Lugosi, and Morvai [14], where prediction of stationary binary sequences was addressed. There we introduced a simple randomized predictor ....
N. Cesa-Bianchi, Y. Freund, D.P. Helmbold, D. Haussler, R. Schapire, and M.K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427-485, 1997.
....respect to fp(xjs)g and f t g) that (s t ; b t ) s; b) Again, the expected loss in (4) is minimized over by the deterministic FS strategy t (x t 1 ; b t 1 ) g(s t ) where g is as in (2) and the expectation is with respect to the FS source. The theory of learning with expert advice [6, 7, 8, 9] is a natural extension of the above framework, where the class of reference competing strategies is viewed as a set of experts. In this setting, an on-line strategy is expected to combine the (possibly random) advice of the experts, incurring a loss that approaches that of the best-performing ....
....parallels the case of memoryless loss functions, for which the normalized excess loss for learning schemes with expert advice (as well as for classical schemes by Hannan [1], Blackwell [14], and others) is upper bounded by an $O(n^{-1/2})$ term. While the learning algorithm of [6] (see also [8] and [9]) suggests to (randomly) select an expert at each time t based on its performance on $x^{t-1}$, here the main problem is to overcome the effect of $b_{t-1}$ in the instantaneous loss at time t, as this action may not agree with the expert selected at that time. Next, we consider FSM experts. In the ....
N. Cesa-Bianchi, Y. Freund, D. P. Helmbold, D. Haussler, R. E. Schapire, and M. K. Warmuth, "How to use expert advice," Journal of the ACM, vol. 44, no. 3, pp. 427-485, 1997.
....using Winnow. These uses require exponentially many inputs, so we define Markov chains over the inputs to approximate the weighted sums. We state performance guarantees for our algorithms and present preliminary empirical results. 1 Introduction. Multiplicative weight update algorithms (e.g. [11, 13, 3]) have been studied extensively due to their on-line mistake bounds' logarithmic dependence on N, the total number of inputs. This attribute efficiency allows them to be applied to problems where N is exponential in the input size, which is the case in the problems we study here: using the Weighted ....
....error than the best single pruning. Since another motivation for pruning an ensemble is to reduce its size, we also explore methods for choosing a single good pruning. The typical approach to predicting nearly as well as the best pruning of classifiers (e.g. [18, 20]) uses recent results from [3] on predicting with expert advice, where each possible pruning is an expert. If there is an efficient way to make predictions, then the expert-based algorithm's mistake bound yields an efficient algorithm. [Footnote 2: It is unlikely that an efficient distribution-free DNF learning algorithm exists [2, 1].] ....
N. Cesa-Bianchi, Y. Freund, D. Helmbold, D. Haussler, R. Schapire, and M. Warmuth. How to use expert advice. J. of the ACM, 44(3):427--485, 1997.
....in the family for every random process. Aggregating methods, and corresponding bounds on the difference between the loss of the aggregate scheme and that of the best scheme in the family, have been established in a variety of settings. Representative work and further references can be found in [28, 12, 22, 6, 5, 23, 20, 14]. Here we describe a simple aggregate decision scheme that is based on weighted majority methods [28, 22] for predicting individual binary sequences. Let F be a fixed, countable family of decision schemes and let $x = x_1, x_2, \ldots$ be a sequence with values $x_i \in X$. Fix 2 (0; 1) and let fF j ....
N. Cesa-Bianchi, Y. Freund, D. Haussler, D.P. Helmbold, R.E. Schapire, and M.K. Warmuth. How to use expert advice, J. Assoc. Comp. Mach., vol.44, pp.427-485, 1997.
.... and target are permitted to change a little between observations, as in [23, 53, 54]; models of weak learning in which the learner only has to do slightly better than random guessing [48, 90, 55]; and variants in which the learning algorithm has access to the predictions of experts [36]. It is hoped that the reader has gained a flavour of this subject. There are many theoretical problems still to be solved within this framework. Furthermore, there is still much work to be done in modifying the various models of machine learning discussed here so that they become more realistic ....
N. Cesa-Bianchi, Y. Freund, D. P. Helmbold, D. Haussler, R. E. Schapire, and M. K. Warmuth. How to use expert advice. In Proc. 25th Annu. ACM Sympos. Theory Comput., pages 382--391. ACM Press, New York, NY, 1993.
....if there was no departure or arrival expected before the alarm. 8.1.1 Weighted Majority. Our first class of predictors is based on the weighted majority algorithm (WM) [64]; such predictors are also called expert-based algorithms. These algorithms have recently undergone significant theoretical analyses [18, 44, 64, 82, 31] and have been applied to problems such as calendar scheduling [14] and deciding when to spin down a disk in a mobile computer to conserve battery life [46]. Figure 8.1 shows how we applied WM to the problem of predicting packet inter-arrival times, which is similar to the approach taken by ....
N. Cesa-Bianchi, Y. Freund, D. Helmbold, D. Haussler, R. Schapire, and M. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427--485. http://www.research.att.com/~yoav/.
....the examples is not understood well enough to cast it into a mathematical model. In a series of papers some fundamental algorithms for the prediction of binary data have been developed. Upper and lower bounds on the possible performance of on-line learning algorithms have been established [9, 10, 19]. Following these basic results, a number of important variants of the on-line learning model have been considered: the algorithm might be allowed to ask additional queries [2], it might receive only limited feedback from the environment [1], or it might have to keep track with a changing ....
N. Cesa-Bianchi, Y. Freund, D.P. Helmbold, D. Haussler, R. Schapire, and M.K. Warmuth (1997) How to Use Expert Advice, Journal of the ACM, 44(3), pp. 427--485.
....range of loss functions even when the regression problem is highly nonlinear and the data are generated with no statistical assumption. As a further motivation for the study of this prediction model, we point out the fact that any good sequential prediction algorithm can be efficiently transformed [2, 12, 15] into an algorithm that performs well in the more traditional statistical (or "batch") frameworks, like those studied in [5, 9]. We use the sequential prediction model to analyze two types of on-line regression problems. In the linear regression problem the master algorithm predicts, in each trial ....
N. Cesa-Bianchi, Y. Freund, D.P. Helmbold, D. Haussler, R. Schapire, and M.K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427--485, 1997.
....to this assignment. Then the regret for the best strategy is the difference between the return of this best strategy and the actual gambler's return. Using a randomized player that combines the choices of the N strategies (in the same vein as the algorithms for prediction with expert advice from [3]), we show that the expected regret for the best strategy is $O(\sqrt{KT \ln N})$; see Theorem 7.1. Note that the dependence on the number of strategies is only logarithmic, and therefore the bound is quite reasonable even when the player is combining a very large number of strategies. The ....
....$O(\ln T)$ for the classical bandit model [14]. There, the distribution over the rewards is fixed as $T \to \infty$. Note that our lower bound has a considerably stronger dependence on the number K of actions than the lower bound $\Theta(\sqrt{T \ln K})$ which could have been proven directly from the results in [3, 6]. Specifically, our lower bound implies that no upper bound is possible of the form $O(T^{\alpha} (\ln K)^{\beta})$ where $0 \le \alpha < 1$, $\beta > 0$. Theorem 5.1. For any number of actions $K \ge 2$ and for any time horizon T, there exists a distribution over the assignment of rewards such that the expected weak ....
Nicolo Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the Association for Computing Machinery, 44(3):427--485, May 1997.
No context found.
N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427-485, 1997.
....state of the problem is the drifted point $R_t = R_{t-1} + r_t(y_t, \hat{y}_t)$. The goal of the decision maker is to minimize the potential $\Phi(R_t)$ for a given t (which might be known or unknown to the decision maker). Example 1. Consider an on-line prediction problem in the experts framework of [4]. Here, the decision maker is a predictor whose goal is to forecast a hidden sequence $y_1, y_2, \ldots$ of elements in the outcome space Y. At each time t, the predictor computes its guess $\hat{y}_t \in X$ for the next outcome $y_t$. This guess is based on the advice $f_{1,t}, \ldots, f_{N,t} \in X$ of N ....
.... chooses the pure strategy that is best given the past distribution of the adversary's plays; smoothing this choice amounts to introducing randomization). In learning theory, algorithms based on the exponential potential have been intensively studied and applied to a variety of problems (see, e.g. [4, 8, 28, 29]). If $r_t \in [-1, 1]^N$ for all t, then the choice $p = 2 \ln N$ for the polynomial potential yields the bound $\max_{1 \le i \le N} R_{i,t} \le \sqrt{(2 \ln N - 1) \sum_{s=1}^{t} \left( \sum_{i=1}^{N} |u_i|^{2 \ln N} \right)^{1/\ln N}} \le \sqrt{(2 \ln N - 1) N^{1/\ln N} t} = \sqrt{(2 \ln N - 1) e t}$. This choice of p was also suggested in [13] ....
N. Cesa-Bianchi, Y. Freund, D.P. Helmbold, D. Haussler, R. Schapire, and M.K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427-485, 1997.
....the pool of m experts and then log m bits per new section. In the bounds we also pay twice for encoding the boundaries of the sections. 1 Introduction. We consider the following standard on-line learning model in which a master algorithm has to combine the predictions from a set of experts [12, 15, 3, 11]. Learning proceeds in trials. In each trial the master receives the predictions from n experts and uses them to form its own prediction. At the end of the trial both the master and the experts receive the true outcome and incur a loss measuring the discrepancy between their predictions and the ....
N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427-485, 1997.
....can be achieved. This is sometimes referred to as the experts setting, since the models can be viewed as experts providing advice to the algorithm. Variants and extensions of this experts setting have been extensively studied by Littlestone and Warmuth [8], Vovk [11], Cesa-Bianchi et al. [2, 3], Haussler, Kivinen, and Warmuth [6], and others in the area of computational learning theory. A crucial aspect of this setting is that although the sequence is generated adversarially, good performance bounds can be proven. In this experts setting, a master algorithm attempts to predict, one by ....
....a parameter estimating the loss of the best expert on the sequence, and this parameter is used to tune the update factor. When tuned optimally, algorithms based on such multiplicative weighting schemes are, in some sense, asymptotically optimal in both the 0-1 loss [3, 8] and the absolute loss [2, 7] settings. In these settings, Vovk [12] shows that multiplicative weighting schemes are optimal in a different sense. He shows that if any master algorithm can achieve the relative loss bound $aL + b \log N$ (where L is the loss of the best expert, N is the number of experts, and a and b are ....
N. Cesa-Bianchi, Y. Freund, D.P. Helmbold, D. Haussler, R. Schapire, and M.K. Warmuth. How to use expert advice. Technical Report UCSC-CRL-95-19, University of California at Santa Cruz, 1995. An extended abstract appeared in the Proceedings of the 25th ACM Symposium on the Theory of Computation.
No context found.
N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth. How to use expert advice. Technical Report UCSC-CRL-94-33, University of California, Santa Cruz, 1994.
....data. In an on-line setting this information is typically not available. Tuning is one of the most critical aspects of an on-line learning algorithm and might affect performance in a substantial way. In this introductory section we begin by using the (randomized) Weighted Majority algorithm of [24, 26, 29, 31, 7] as a motivating example to illustrate the tuning problem we are interested in. We then introduce our tuning techniques and compare them to those already available. In later sections we will apply our techniques to the much more general class of quasi-additive algorithms [16, 22] ....
....B is constant. If during the learning process the loss of the best component exceeds bound B, then this bound is increased, the learning algorithm is restarted, and a new round begins. A sophisticated analysis of the doubling trick for the Weighted Majority algorithm can be found in [7]. We may say that the doubling trick makes an on-line algorithm coarsely adaptive, as the learning rate is constant within a round and makes big jumps between rounds. However, a major disadvantage is that the on-line algorithm is restarted from scratch at the beginning of each round, hence losing ....
Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D., Schapire, R., and Warmuth, M. K. (1997), How to use expert advice, Journal of the ACM, 44(3), 427-485.
....though the analyses presented in these papers do not appear generalizable to our setting, our results build upon the algorithmic ideas developed there. The worst-case analysis of problems classically studied in Statistics and Information theory has often proved unexpectedly fruitful (see [2] and references therein). We believe the results of our research strengthen the validity of this approach. 1.1 Notation and terminology. We formalize the bandit problem as a game between a player and an adversary. The game is parameterized by the number K of possible actions, where each action is ....
....x i;t G ln ff Gamma ln K ff Gamma 1 which is the desired bound. By carefully choosing the parameter α as a function of K and G, and using standard techniques to guess this latter quantity, we get a bound of $O(\sqrt{G \ln K})$ for the expected regret in the full information game. In [2] this bound was shown to be optimal for that game. In the next section we substantially extend this analysis and provide a bound for the regret in the partial information game. 3 Approximating full information. In this section we move to the analysis of the partial information game. We present an ....
N. Cesa-Bianchi, Y. Freund, D.P. Helmbold, D. Haussler, R. Schapire, and M.K. Warmuth. How to use expert advice. In Proceedings of the 25th ACM Symposium on the Theory of Computation, pages 382--391. ACM Press, 1993.
No context found.
N. Cesa-Bianchi, Y. Freund, D.P. Helmbold, D. Haussler, R.E. Schapire, and M.K. Warmuth. How to use expert advice. Proceedings of the 25th ACM Symposium on the Theory of Computation, 1993.
No context found.
N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth. How to use expert advice. Technical Report UCSC-CRL-94-33, Univ. of Calif. Computer Research Lab, Santa Cruz, CA, 1994. An extended abstract appeared in STOC '93.
No context found.
N. Cesa-Bianchi, Y. Freund, D. Haussler, D.P. Helmbold, R.E. Schapire, and M.K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427--485, May 1997.
No context found.
Nicolo Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the Association for Computing Machinery, 44(3):427--485, 1997.
No context found.
Nicolo Cesa-Bianchi, Yoav Freund, David P. Helmbold, David Haussler, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice (extended abstract). In Proceedings of the Twenty-Fifth Annual ACM Symposium on the Theory of Computing, pages 382--391, San Diego, California, May 1993.