A. DeSantis, G. Markowsky, and M.N. Wegman. Learning probabilistic prediction functions. In FOCS, pages 110--119, 1988.
....and real data. Our empirical results with respect to a real-life prediction problem indicate that smooth algorithms for expert advice, and in particular the proposed algorithms, have advantage over other expert advice algorithms. 1 Introduction. In the online prediction using expert advice problem [DMW88, Vov90, LW94, Blu96], a learning algorithm is required at the start of each trial to predict an unknown binary outcome. The algorithm's input for generating this prediction is the advice of each of N experts, where the expert advice are binary or real predictions of the outcome. By the end of the ....
A. DeSantis, G. Markowsky, and M.N. Wegman. Learning probabilistic prediction functions. In FOCS, pages 110--119, 1988.
....proofs of special cases of Theorem 2. All have involved the use of an algorithm that chooses to predict 0 or 1 in proportion to their payoffs with exponential weights. The exponential weighted algorithm just alluded to was introduced by Littlestone and Warmuth [25], DeSantis, Markowsky and Wegman [8], Feder, Merhav and Gutman [10], and Vovk [28] at about the same time. Vovk [28] shows how the exponential weighted algorithm can be used to prove Theorem 2 for any bounded loss function (but the states of the world are either 0 or 1). Cesa-Bianchi, Freund, Helmbold, Haussler, Schapire and Warmuth ....
DeSantis, A., G. Markowsky and M. Wegman, `Learning Probabilistic Prediction Functions', Proceedings of the 1988.
....real data. Our empirical results with respect to a real-life prediction problem indicate that smooth algorithms for expert advice, and in particular the proposed algorithms, have advantage over other expert advice algorithms. 1 Introduction. In the online prediction using expert advice problem [DMW88, Vov90, LW94, Blu96], a learning algorithm is required at the start of each trial to predict an unknown binary outcome. The algorithm's input for generating this prediction is the advice of each of N experts, where the expert advice are binary or real predictions of the outcome. By the end of the ....
A. DeSantis, G. Markowsky, and M.N. Wegman. Learning probabilistic prediction functions. In FOCS, pages 110--119, 1988.
....context, the subject of our study is the smallest achievable worst case redundancy of a sequential lossless code, with respect to a general class of reference codes. The study of the worst case regret was pioneered by Shtarkov [15] and later studied from various points of view by De Santis et al. [14], Vovk [17, 18], Haussler and Barron [8], Weinberger, Merhav and Feder [19], Yamanishi [20], Rissanen [13], Haussler, Kivinen, and Warmuth [9], and others. Merhav and Feder summarize the relevant history in their recent survey [10]. The notion of minimax regret has natural applications in gambling ....
....a finite class and w is the uniform distribution over F. In this case, the conditionals of the mixture strategy take the simple form $p(y \mid y^{t-1}) = \frac{\sum_{f \in F} f(y \mid y^{t-1})\, f(y^{t-1})}{\sum_{g \in F} g(y^{t-1})}$ (3). This is just the weighted average (WA) algorithm of De Santis et al. [14]; see also [8, 9, 18, 20]. Besides being computationally easier to handle than p , mixture strategies are (in general) universal, that is, their conditionals can be computed without knowing the sequence length n in advance. On the other hand, there are simple finite classes F on ....
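The mixture conditional in (3) can be illustrated with a minimal Python sketch for binary outcomes. It assumes each expert reports, at every trial, a probability strictly between 0 and 1 that the next outcome is 1; the function names (`mixture_predict`, `update_weights`, `run_mixture`) are hypothetical and do not come from any of the cited papers.

```python
import math

def mixture_predict(expert_probs, weights):
    """Probability of outcome 1 under the weighted mixture, as in (3).

    weights[i] plays the role of f(y^{t-1}): the probability expert i
    assigned to the outcomes seen so far (times a uniform prior).
    """
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, expert_probs)) / total

def update_weights(expert_probs, weights, outcome):
    """Multiply each weight by the probability the expert gave the observed outcome."""
    return [w * (p if outcome == 1 else 1.0 - p)
            for w, p in zip(weights, expert_probs)]

def run_mixture(expert_prob_seqs, outcomes):
    """Run the mixture over a sequence; return its log loss and each expert's log loss.

    expert_prob_seqs[i][t] is expert i's probability (assumed in (0, 1))
    that the outcome at trial t equals 1.
    """
    n_experts = len(expert_prob_seqs)
    weights = [1.0 / n_experts] * n_experts  # uniform prior over experts
    mix_loss = 0.0
    expert_loss = [0.0] * n_experts
    for t, y in enumerate(outcomes):
        probs = [seq[t] for seq in expert_prob_seqs]
        p1 = mixture_predict(probs, weights)
        mix_loss += -math.log(p1 if y == 1 else 1.0 - p1)
        for i, p in enumerate(probs):
            expert_loss[i] += -math.log(p if y == 1 else 1.0 - p)
        weights = update_weights(probs, weights, y)
    return mix_loss, expert_loss
```

Because the per-trial conditionals telescope, the mixture's total log loss equals the negative log of the average joint probability the experts assigned to the sequence, so it should exceed the best expert's loss by at most ln N. The sketch needs no knowledge of the sequence length, which matches the universality remark in the excerpt.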
A. De Santis, G. Markowsky, and M.N. Wegman. Learning probabilistic prediction functions. In Proceedings of the 1st Annual Workshop on Computational Learning Theory, pages 312--328. Morgan Kaufmann, 1988.
....into account. Both Cover's game and the long-short game are perfectly mixable. Taking into account transaction costs leads to a plethora of new games (cf. Blum and Kalai [7] and Vovk and Watkins [58]). The constants $c(\eta)$ and $a(\eta)$ for the games mentioned above were found in [22, 43, 33, 18, 52, 55, 58]. We have already noticed that a game is perfectly mixable if its loss function is strictly convex in some sense. The exact statement in the binary case (i.e., where the outcome space $\Omega$ consists of only 2 elements) can be found in ([52], Lemma 2) and [33]; it is an open problem to find a ....
....$\sum_{t=1}^{T} y_t x_t' = 0$; we obtain the RR formula (22). To prove Theorem 4, it suffices to take $\eta = \frac{1}{8Y^2}$ instead of $\eta = \frac{1}{2Y^2}$ in the proof in Subsection 4.2 (cf. Remark 1). 5 Review of literature. The first paper on competitive on-line statistics was, probably, DeSantis et al. [22], which performed a competitive analysis of the Bayesian mixing scheme for the log loss prediction game. Later Littlestone and Warmuth [43] and Vovk [53] introduced an on-line algorithm (called the Weighted Majority Algorithm by the former authors) for the simple prediction game. These two ....
A DeSantis, G Markowsky, and M N Wegman. Learning probabilistic prediction functions. In Proceedings of the 29th Annual IEEE Symposium on Foundations of Computer Science, pages 110--119, Los Alamitos, CA, 1988. IEEE Comput Soc.
....of a policy for discarding predictors in Weighted Majority that allows it to speed up as it learns. Keywords: Winnow, Weighted Majority, Multiplicative algorithms. 1. Introduction. Multiplicative weight updating algorithms such as Winnow (Littlestone, 1988) and Weighted Majority variants (DeSantis et al., 1988; Littlestone and Warmuth, 1994; Cesa-Bianchi et al., 1993) have been studied extensively in the theoretical machine learning literature, in which a collection of strong properties have been proven. These algorithms could be said to fall into the category of learning simple things really well. ....
DeSantis, A., Markowsky, G., and Wegman, M. (1988). Learning probabilistic prediction functions. In Proceedings of the 29th IEEE Symposium on Foundations of Computer Science, pages 110--119.
.... a hypothesis at random from among those that are consistent with all the training examples, as in [Maa91]. Here we apply similar methods from statistical physics to study the Bayes optimal classification algorithm, a special case of the weighted majority algorithm [Lit89, LW89, Vov90]; see also [DMW88]. Further investigation of the Bayes and Gibbs algorithms appears in [HKS91] from both an information theory and a Vapnik-Chervonenkis theory perspective. The performance of any learning algorithm will depend on the target function, i.e. the input-output mapping to be learned. In the Bayesian ....
Alfredo DeSantis, George Markowsky, and Mark N. Wegman. Learning probabilistic prediction functions. In Proceedings of the 1988 Workshop on Computational Learning Theory, pages 312--328, San Mateo, CA, 1988. Published by Morgan Kaufmann.
....Bayes Algorithm. In this case, $P_t$ as given in (6.1) is the posterior distribution after seeing the first $t-1$ examples. If the algorithm in trial t predicts with the predictive distribution $P(x \mid x_1, \ldots, x_{t-1})$, i.e. $L_A(t) = -\ln P(x_t \mid x_1, \ldots, x_{t-1})$, then (6.2) is an equality [DMW88]. In the more general setting (when $\eta \neq 1$ and the losses are not necessarily negative log-likelihoods) the prediction of the algorithm is chosen so that inequality (6.2) holds no matter what the t-th example will be. Also, the larger the learning rate $\eta$, the better the resulting relative ....
A. DeSantis, G. Markowsky, and M. N. Wegman. Learning probabilistic prediction functions. In Proc. 29th Annu. IEEE Sympos. Found. Comput. Sci., pages 110-119. IEEE Computer Society Press, Los Alamitos, CA, 1988.
....of short selling into account. Both Cover's game and the long-short game are perfectly mixable. Taking into account transaction costs leads to a plethora of new games (cf. Blum and Kalai [6] and Vovk and Watkins [56]). The constants $c(\eta)$ and $a(\eta)$ for the games mentioned above were found in [21, 41, 31, 17, 50, 53, 56]. We have already noticed that a game is perfectly mixable if its loss function is strictly convex in some sense. The exact statement in the binary case (i.e., where the outcome space consists of only 2 elements) can be found in ([50], Lemma 2) and [31]; it is an open problem to find a simple ....
....0 t 2 $\sum_{t=1}^{T} y_t x_t' = 0$; we obtain the RR formula (22). To prove Theorem 4, it suffices to take $\eta = \frac{1}{8Y^2}$ instead of $\eta = \frac{1}{2Y^2}$ in the proof in Subsection 4.2. 5 Review of literature. The first paper on competitive on-line statistics was, probably, DeSantis et al. [21], which performed a competitive analysis of the Bayesian mixing scheme for the log loss prediction game. Later Littlestone and Warmuth [41] and Vovk [51] introduced an on-line algorithm (called the Weighted Majority Algorithm by the former authors) for the simple binary prediction game. These two ....
A DeSantis, G Markowsky, and M N Wegman. Learning probabilistic prediction functions. In Proceedings of the 29th Annual IEEE Symposium on Foundations of Computer Science, pages 110-119, Los Alamitos, CA, 1988. IEEE Comput Soc.
....will be shown shortly, estimating ff i ; f i ; 1 i N 1 does not require explicit randomization; that is, we do not require to simulate f k g explicitly in order to learn ff i s and f i s. A number of researchers have previously considered the problem of learning the conditional distribution [5, 9]. But in these approaches, the true conditional distribution is assumed to come from a known countable class of distributions. For example, DeSantis et al. [5] consider the problem of learning the conditional distribution from a countable class of distributions, which minimizes the entropy of the ....
....to learn ff i s and f i s. A number of researchers have previously considered the problem of learning the conditional distribution [5, 9]. But in these approaches, the true conditional distribution is assumed to come from a known countable class of distributions. For example, DeSantis et al. [5] consider the problem of learning the conditional distribution from a countable class of distributions, which minimizes the entropy of the observed data. In the approach taken here, we do not make any such assumptions. Another approach which is closely related to the issues addressed here, is the ....
A. DeSantis, G. Markowsky, and M. Wegman, "Learning Probabilistic Prediction Functions," Proc. of the 1988 Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 1988.
....we introduce and analyze an alternative approach based on a mixture model of a new subclass of probabilistic transducers which we call suffix tree transducers. Mixture models, often referred to as mixtures of experts, have shown to be a powerful approach both theoretically and experimentally. See (DeSantis et al., 1988; Jacobs et al., 1991; Haussler and Barron, 1993; Littlestone and Warmuth, 1994; Cesa-Bianchi et al., 1993; Helmbold and Schapire, 1995) for analyses and applications of mixture models, from different perspectives such as connectionism, Bayesian inference and computational learning theory. We ....
....$\min_{T' \in Sub(T)} \left\{ Loss_n(T') - \log_2(P_0(T')) \right\}$. The running time of the algorithm is $Dn$ where D is the maximal depth of T or 1 2 (n 1) 2 when T is of an unbounded depth. Proof: The proof of the first part of the theorem is based on a technique introduced by DeSantis et al. (1988). Based on the definition of $A_\epsilon(n)$ and $P_{n-1}(T')$ from Thm. 1 we can rewrite $Loss^{mix}_n$ as $Loss^{mix}_n = -\sum_{i=1}^{n} \log_2 \left( \sum_{T' \in Sub(T)} P(y_i \mid T')\, P_{i-1}(T') \right) = -\sum_{i=1}^{n} \log_2 \frac{A_\epsilon(i)}{A_\epsilon(i-1)} = -\log_2 \prod^{n}$ ....
A. DeSantis, G. Markowsky, and M.N. Wegman. Learning probabilistic prediction functions. In Proceedings of the First Annual Workshop on Computational Learning Theory, pages 312--328, 1988.
....loss functions bounds of the form $Loss_L(A, S) - \min_{1 \le i \le N} Loss_L(E_i, S) \le c_L \ln N$, (1.1) where $c_L$ is a positive constant determined by the loss function L. For instance, for the square loss Vovk's algorithm achieves the bound with $c_L = 1/2$ [16] and for logarithmic loss with $c_L = 1$ [8, 16]. Note that the bound (1.1) for the additional loss is independent of the length of the trial sequence S. On the other hand, for the absolute loss $L_{abs}$ given by $L_{abs}(y_t, \hat{y}_t) = |y_t - \hat{y}_t|$, Cesa-Bianchi et al. [2] have shown that bounds of the form (1.1) are not obtainable, but the ....
....$V_{L,A}(N, \ell)$ can have an upper bound that is independent of $\ell$. Such bounds have previously been proven for square loss and logarithmic loss when the outcomes are binary. For these loss functions there are algorithms that satisfy $V_{L,A}(N, \ell) \le \frac{1}{2} \ln N$ and $V_{L,A}(N, \ell) \le \ln N$, respectively [16, 8]. On the other hand, for the absolute loss it is known that no upper bound of this form exists, but the algorithm A that minimizes $V_{L,A}(N, \ell)$ has $V_{L,A}(N, \ell) = \Omega\left(\sqrt{\ell \ln N}\right)$ [2]. One of our results provides a formula from which the best possible upper bound for $V_{L,A}(N, \ell)$ can be ....
A. DeSantis, G. Markowsky, and M. N. Wegman. Learning probabilistic prediction functions. In Proc. 29th Annu. IEEE Sympos. Found. Comput. Sci., pages 110--119. IEEE Computer Society Press, Los Alamitos, CA, 1988.
....the probability of mistake (known as the 0-1 loss in decision theory) for an optimal learning algorithm, and the Shannon information gain from the labels of the instance sequence. In doing so, we borrow from and contribute to the work on weighted majority and aggregating learning strategies [18,20,36,11,2,19], as well as to the VC dimension and statistical physics work. This study leads to a new understanding of the sample complexity of learning in several existing models. 1 More general Bayesian approaches to learning in neural networks are described in the recent papers [21,6]. One of our main ....
A. DeSantis, G. Markowsky, and M. N. Wegman. Learning probabilistic prediction functions. In Proceedings of the 1988 Workshop on Computational Learning Theory, pages 312--328. Morgan Kaufmann, 1988.
....input alphabet. After prediction, an outcome $y_t$ is observed that results in a loss $l_t = L(y_t, \hat{y}_t)$. We derive a weight update rule for bounded loss predictors using the framework introduced by Freund & Schapire (1997), which generalizes former on-line weight allocation algorithms (DeSantis, Markowsky, & Wegman, 1988; Vovk, 1990; Littlestone & Warmuth, 1994; Cesa-Bianchi et al., 1997) and can be applied to a wide variety of learning problems. This derivation does not depend on the precise form of the loss function, requiring only that the appropriate loss value be provided to the learning algorithm after each ....
.... $\sum_{t=1}^{T} l_t^P$ be the loss of the pruning P of T on the sequence, where $l_t^P$ is given by (9). Then the log-loss of the weight allocation algorithm of Figure 3 is at most $-\ln(w_1^P) + L_P = \ln(2)|P| + L_P$. Proof: The proof is a direct application of the proof technique of DeSantis et al. (1988). From (10) we get that the log loss of the mixture on the sequence $x_1, \ldots, x_T$ is Gamma T X t=1 ln(y t ) Gamma ln T Y t=1 y t = Gamma ln T Y t=1 w t 1 ( w t ( Gamma ln(w T 1 ( Gamma ln(w 1 ( Gamma ln(w T 1 ( 11) where for the ....
DeSantis, A., Markowsky, G., & Wegman, M. N. (1988). Learning probabilistic prediction functions. In Proceedings of the 1988 Workshop on Computational Learning Theory (pp. 312--328). San Francisco, California: Morgan Kaufmann.
.... for the log loss, $L_A(y) = -\log P_A(y)$. It is well known that for the log loss, for any set E of N experts (i.e. distributions) there is a prediction strategy A such that for any sequence y, $L_A(y) - L_E(y) \le \log N$, where $L_E(y)$ is the total log loss of the best expert for y [Ris86, DMW88, Vov92, HBar, Yam91, KW93]. The strategy is just the Bayes algorithm with uniform prior on the distributions represented by the experts. An exact min-max analysis of this case is quite simple. Theorem 32: For each $y \in \{0, 1\}$ and each expert $E_i \in E$, let $P_i(y)$ denote the ....
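The log N bound quoted above follows in one line from the uniform-prior Bayes mixture. The derivation below is a sketch under an assumed notation: $P_i(y)$ is taken to be the probability that expert $E_i$ assigns to the whole sequence y (the excerpt is truncated before it defines this), and $P_A$ is the uniform mixture of the $P_i$.

```latex
% Assumption: P_i(y) is expert E_i's probability of the full sequence y,
% and P_A is the Bayes mixture with uniform prior 1/N.
\begin{align*}
P_A(y) &= \frac{1}{N} \sum_{i=1}^{N} P_i(y) \;\ge\; \frac{1}{N} \max_{1 \le i \le N} P_i(y), \\
L_A(y) &= -\log P_A(y) \;\le\; \log N + \min_{1 \le i \le N} \bigl(-\log P_i(y)\bigr) \;=\; \log N + L_E(y).
\end{align*}
```

The inequality only drops non-negative terms from the sum, so the bound holds for every sequence y and does not depend on its length.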
Alfredo DeSantis, George Markowsky, and Mark N. Wegman. Learning probabilistic prediction functions. In Proceedings of the 1988 Workshop on Computational Learning Theory, pages 312--328. Morgan Kaufmann, 1988.
....with a mixture of PSTs. Here we adopted two important ideas from machine learning and information theory. The first is the fact that a mixture over an ensemble of experts (models), when the mixture weights are properly selected, performs better than almost any individual member of that ensemble (DeSantis et al., 1988; Cesa-Bianchi et al., 1993). The second idea is that within a Bayesian framework the sum over exponentially many trees can be computed efficiently using a recursive structure of the tree, as was recently shown by Willems et al. (1994). Here we apply these ideas and demonstrate that the mixture, ....
....the weights of the sub-trees starting at the pointed node. These weights are used for tracking a mixture of PSTs. The special string represents a wild card that can be matched with any observed word. 3 The Learning Algorithm. Within the framework of online learning, it is provably (see e.g. DeSantis et al., 1988; Cesa-Bianchi et al., 1993) and experimentally known that the performance of a weighted ensemble of models, each model weighted according to its performance (the posterior probability of the model), is not worse and generally much better than any single model in the ensemble. Although there ....
A. DeSantis, G. Markowsky, M.N. Wegman. 1988. Learning Probabilistic Prediction Functions.
....of science is not just to predict, but also to posit a model that helps to understand the phenomenon being studied. Thus, any rigorous justification of the methodology of science must include an understanding of the identification problem. Such an interest in predictive models was also proposed in [4]. In the case that machines in M are all deterministic, there is a well known principle, Occam's Razor [1, 2], that states that any hypothesis that is consistent with the output, and whose description is short, is a good approximation to the true machine; thus the criterion is to minimize $|M|$ ....
A. DeSantis, G. Markowsky, M. Wegman, Learning Probabilistic Prediction Functions, Proc. 1988 Workshop on Computational Learning Theory, pages 312--328, 1988.
....by Haussler, Kivinen, and Warmuth ([15], Example 4.4) this is also true when $\Omega = [0, 1]$. Example 5 (logarithmic game). Here $\Omega = \{0, 1\}$, $\Gamma = [0, 1]$, $\lambda(\omega, \gamma) = \omega \ln \frac{\omega}{\gamma} + (1 - \omega) \ln \frac{1 - \omega}{1 - \gamma}$. Now c(fi) 1 for fi e Gamma1 (DeSantis, Markowsky, Wegman [9]). Haussler, Kivinen, and Warmuth ([15], Example 4.3) prove that this is true for $\Omega = [0, 1]$ as well. Example 6. This example is rather artificial; it demonstrates that it is possible that c(fi) 1, for some $\beta \in\, ]0, 1[$. Let $\epsilon_1, \epsilon_2, \ldots$ be a decreasing sequence of positive numbers ....
....cannot win the game. It remains an open problem to give an explicit formula for the value inf fa j G(1; a) Lg 2 [0; a(0). 8 CONNECTIONS WITH LITERATURE. The Aggregating Algorithm was proposed in [31] as a common generalization of the Bayesian merging scheme (Dawid [7], Section 4; DeSantis et al. [9]) and the Weighted Majority Algorithm (Vovk [32], Theorem 5, and Littlestone and Warmuth [23]; I am using the name coined by Littlestone and Warmuth). Earlier, algorithms with similar properties were proposed by Foster [11] (for the case of the Brier loss function; see Example 4 above) and Foster ....
A. DeSantis, G. Markowsky, and M. N. Wegman, Learning probabilistic prediction functions, in "Proceedings, 29th Annual IEEE Symposium on Foundations of Computer Science," pp. 110--119, IEEE Comput. Soc., Los Alamitos, CA, 1988.
....of this paper is to show how to build models based on mixtures of PSTs. We use two results from machine learning and information theory. The first is that a mixture of an ensemble of experts (models) with suitably selected weights performs better than almost any individual member of the ensemble (DeSantis et al., 1988; Cesa-Bianchi et al., 1993). The second result is that within a Bayesian framework the sum over exponentially many trees can be computed efficiently using the recursive structure of the tree, as was recently shown by Willems et al. (1995). Our experiments with algorithms based on those ....
....sub-trees starting at the pointed node. These weights are used for tracking a mixture of PSTs. The special string represents a wild card that can be matched with any observed word. 3. The Learning Algorithm. Within the framework of online learning, it can be proved (DeSantis et al., 1988; Cesa-Bianchi et al., 1993) and demonstrated experimentally that the performance of a weighted ensemble of models in which each model is weighted according to its performance (the posterior probability of the model) is not worse and generally much better than any single model in the ensemble. ....
A. DeSantis, G. Markowsky, M. N. Wegman. 1988. Learning Probabilistic Prediction Functions. Proceedings of the 1988 Workshop on Computational Learning Theory, pp. 312--328.
....In this paper we introduce and analyze an alternative approach based on a mixture model of a new subclass of probabilistic transducers, which we call suffix tree transducers. The mixture of experts architecture has been proved to be a powerful approach both theoretically and experimentally. See [4, 8, 6, 10, 2, 7] for analyses and applications of mixture models, from different perspectives such as connectionism, Bayesian inference and computational learning theory. By combining techniques used for compression [13] and unsupervised learning [12] we devise an online algorithm that efficiently updates the ....
....sequence of input-output pairs. The loss of the mixture is at most $Loss_n(T') - \log(P_0(T'))$ for each possible subtree $T'$. The running time of the algorithm is $Dn$ where D is the maximal depth of T, or $n^2$ when T is infinite. The proof is based on a technique introduced in [4]. Note that the additional loss is constant, hence the normalized loss per observation pair is, P 0 (T 0 ) n, which decreases like $O(1/n)$. Given a long sequence of input-output pairs or many short sequences, the structure of the suffix tree transducer is inferred as well. This is done by ....
A. DeSantis, G. Markowsky, and M.N. Wegman. Learning probabilistic prediction functions. In Proc. of the 1st Wksp. on Comp. Learning Theory, pages 312--328, 1988.
....a fixed set of experts we can associate with it a distribution $P_A$. It is well known that for the log loss, for any set E of N experts there is a prediction strategy A such that for any sequence y, $L_A(y) - L_E(y) \le \log N$, where $L_E(y)$ is the total log loss of the best expert for y [Ris86, DMW88, Vov92, HB92, Yam95, KW94]. The strategy is just the Bayes algorithm with uniform prior on the distributions represented by the experts. A min-max optimal prediction algorithm is known for the case where the experts are simulatable and the number of iterations is known in advance. This ....
Alfredo DeSantis, George Markowsky, and Mark N. Wegman. Learning probabilistic prediction functions. In Proceedings of the 1988 Workshop on Computational Learning Theory, pages 312--328. Morgan Kaufmann, 1988.
....is provably a small constant factor from optimal. We make a similar comparison for the randomized algorithm WMR. The concluding section, Section 9 gives an overview of the various algorithms introduced here and mentions a number of directions for future research. DeSantis, Markowsky and Wegman [DMW88] applied an algorithm similar to WMC to a countably infinite pool (as in WMI 2 ) in a completely different setting. For a countably infinite indexed pool of conditional probability distributions the goal is to iteratively construct a master conditional probability distribution which assigns a ....
Alfredo DeSantis, George Markowsky, and Mark N. Wegman. Learning probabilistic prediction functions. In Proceedings of the 1988 Workshop on Computational Learning Theory, pages 312--328. Morgan Kaufmann, San Mateo, CA, 1988.
A. DeSantis, G. Markowsky, and M.N. Wegman. Learning probabilistic prediction functions. In FOCS, pages 110--119, 1988.
A. DeSantis, G. Markowsky, and M. Wegman, "Learning probabilistic prediction functions," Proc. 29th IEEE Symp. Foundations of Computer Science, pp. 110--119, 1988.
A. DeSantis, G. Markowsky, and M. N. Wegman. Learning probabilistic prediction functions. In Proc. 29th Annu. IEEE Sympos. Found. Comput. Sci., pages 110--119. IEEE Computer Society Press, Los Alamitos, CA, 1988.