D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Worst-case loss bounds for sigmoided linear neurons. In Proc. 1995 Neural Information Processing Conference, pages 309--315. MIT Press, Cambridge, MA, November 1995.
....with square loss we have $g(\hat{y}_t, y_t) = \hat{y}_t - y_t$. The Widrow-Hoff rule is then obtained when f is the identity mapping, while the EGU algorithm [21] is given by the componentwise logarithm $f(w_t) = (\ln w_{t,1}, \ldots, \ln w_{t,n})$. The standard way to analyze these algorithms (e.g. [23, 24, 26, 17, 2, 8, 7, 16, 21, 6, 12, 3, 13]) is to define a measure of progress related to the mapping f. The measure of progress we use here is the so-called Bregman divergence [5, 10] associated with f. We denote the divergence by $d_f(u, w)$, where u and w are weight vectors. Informally, we can define $d_f(u, w)$ as follows. Assume that f ....
.... 1) the square loss setting; 2) the absolute loss setting with binary labels (i.e. the binary classification problem where the algorithm makes randomized predictions). Our results for square loss are easily extended to more general regression frameworks, such as Helmbold, Kivinen and Warmuth's [17, 22] generalized linear regression model. We first need to recall some preliminaries about the dual norms technology we will be using in this section. Given a vector $w = (w_1, \ldots, w_n) \in \mathbb{R}^n$ and $p \geq 1$ we denote by $\|w\|_p$ the p-norm of w, i.e. $\|w\|_p =$ ....
Helmbold, D., Kivinen, J., and Warmuth, M. K. (1999), Worst-case loss bounds for sigmoided linear neurons. IEEE Transactions on Neural Networks, 10(6), 1291-1304.
....we have $g(\hat{y}_t, y_t) = \hat{y}_t - y_t$. The Widrow-Hoff rule is then obtained when f is the identity mapping, while the EGU algorithm [KW97] is given by the componentwise logarithm $f(w_t) = (\ln w_{t,1}, \ldots, \ln w_{t,n})$. The standard way to analyze these algorithms (e.g. [Lit88, Lit89, LW94, HKW95, AW98, CBLW96, CBFH 97, GLS97, KW97, Byl97, GW98, AW99, GL99]) is to define a measure of progress related to the mapping f. The measure of progress we use here is the so-called Bregman divergence [Bre67, CL81] associated with f. We denote the divergence by $d_f(u, w)$ where u and w are ....
.... 1) the square loss setting; 2) the absolute loss setting with binary labels (i.e. the binary classification problem where the algorithm makes randomized predictions). Our results for square loss are easily extended to more general regression frameworks, such as Helmbold, Kivinen and Warmuth's [HKW95, KW98b] generalized linear regression model. We first need to recall some preliminaries about the dual norms technology we will be using in this section. Given a vector $w = (w_1, \ldots, w_n) \in \mathbb{R}^n$ and $p \geq 1$ we denote by $\|w\|_p$ the p-norm of w, i.e. $\|w\|_p = \left(\sum_{i=1}^{n} |w_i|^p\right)^{1/p}$ ....
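The p-norm defined above can be sketched in a few lines. This is a minimal illustration, not code from the cited paper: the function names `p_norm` and `dual_exponent` are hypothetical, and the dual exponent q with 1/p + 1/q = 1 is the standard convention, assumed here because the excerpt breaks off before stating it.

```python
import math

def p_norm(w, p):
    """Return (sum_i |w_i|^p)^(1/p) for p >= 1; p = inf gives the max norm."""
    if p < 1:
        raise ValueError("p must be at least 1")
    if math.isinf(p):
        return max(abs(x) for x in w)
    return sum(abs(x) ** p for x in w) ** (1.0 / p)

def dual_exponent(p):
    """Return q with 1/p + 1/q = 1 (assumed standard convention)."""
    if p == 1:
        return math.inf
    if math.isinf(p):
        return 1.0
    return p / (p - 1.0)
```

Under this convention, Hoelder's inequality bounds the inner product of two vectors by the product of a p-norm and the matching q-norm, which is the usual reason one norm is applied to the instances and its dual to the comparison vector.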
D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Worst-case loss bounds for sigmoided linear neurons. IEEE Transactions on Neural Networks, 10(6): 1291--1304, November 1999.
....the number of membership queries needed to learn a linear threshold function in the Boolean domain with positive integer weights bounded by t requires O(n t ) membership queries. Also, the worst-case mistake bounds of Littlestone's Winnow on-line algorithm [Lit88] and its variants [CBLW95, KW94, HKW96] for learning linear threshold functions are linear in the total number of bits needed to encode the weights. Golea et al. [GBLM98] showed that the sample complexity of a neural network is determined more by the magnitude of the weights of the network than by its size. This suggests that the precision ....
D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Worst-case loss bounds for sigmoided linear neurons. In Proc. 1996 Neural Information Processing Conference, 1996. To appear.
.... d 2 . Now apply the same argument as in Case 1, obtaining the contradiction jfif j d Gamma 2 Theta 2 d Gamma . Chapter 3: Learning Depth-Two Neural Networks with Constant Fan-in at the Hidden Nodes. 3.1 Introduction. Recently, many on-line algorithms were found [Lit88, CBLW95, KW94, HKW96] for learning single neurons whose total loss bounds scale only logarithmically with the input dimension. All of them rely on a multiplicative update scheme of the weights, and these update schemes are motivated [KW94] by the minimum relative entropy principle of Kullback [KK92, Jum90]. However, ....
....other transfer and loss functions is very similar and is only sketched in Sections 3.3.4 and 3.3.5. We start with an algorithm which learns single neurons with the logistic transfer function $\phi(z) = 1/(1 + e^{-z})$ and where the loss of the algorithm is measured by the entropic loss function. In [HKW96] an algorithm $A^{(1)}_{\log}$ (a version of EG$^{\pm}$) for learning such a neuron was developed (see Figure 3.2). In each trial each weight is updated by a positive factor. Since such multiplicative updates do not change the sign of a weight, two weights have to be maintained for each input, one ....
D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Worst-case loss bounds for sigmoided linear neurons. In Proc. 1996 Neural Information Processing Conference, 1996.
....for single neurons to learning algorithms for depth-two neural networks. This technique works for on-line learning algorithms for single neurons whose total loss bounds scale only logarithmically with the input dimension. Quite a number of such algorithms were found recently [Lit88, CBLW95, KW94, HKW96]. All of them rely on a multiplicative update scheme of the weights, and these update schemes are motivated [KW94] by the minimum relative entropy principle of Kullback [KK92, Jum90]. The way we get a depth-two neural network from a single neuron is the following. We expand a single neuron by ....
....for other transfer and loss functions is very similar and is only sketched in Sections 3.4 and 3.5. We start with an algorithm which learns single neurons with the logistic transfer function $\phi(z) = 1/(1 + e^{-z})$ and where the loss of the algorithm is measured by the entropic loss function. In [HKW96] an algorithm $A^{(1)}_{\log}$ (a version of EG$^{\pm}$) for learning such a neuron was developed (see Figure 2). In each trial each weight is updated by a positive factor. Since such multiplicative updates do not change the sign of a weight, two weights have to be maintained for each input, one ....
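The two-weights-per-input idea can be sketched as follows. This is an illustration under stated assumptions, not the algorithm from [HKW96]: the learning rate `eta`, the scale `U`, the function name `eg_pm_step`, and the exact form of the exponent are all hypothetical choices, since the excerpt does not show them. The sketch only demonstrates that every weight is multiplied by a positive factor, that the 2k weights are renormalized, and that the signed effective weight is the difference of the two copies.

```python
import math

def logistic(z):
    """Logistic transfer function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def eg_pm_step(w_plus, w_minus, x, y, eta, U=1.0):
    """One multiplicative update on paired positive weights (assumed form)."""
    # Signed effective weights: difference of the two positive copies.
    w = [U * (p - m) for p, m in zip(w_plus, w_minus)]
    y_hat = logistic(sum(wi * xi for wi, xi in zip(w, x)))
    # Each weight is multiplied by a strictly positive factor.
    v_plus = [p * math.exp(-eta * (y_hat - y) * xi) for p, xi in zip(w_plus, x)]
    v_minus = [m * math.exp(eta * (y_hat - y) * xi) for m, xi in zip(w_minus, x)]
    # Normalize so that all 2k weights sum to one.
    total = sum(v_plus) + sum(v_minus)
    return [v / total for v in v_plus], [v / total for v in v_minus], y_hat
```

Because the factors are positive, neither copy can change sign; a negative effective weight arises only when the "minus" copy outgrows the "plus" copy.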
D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Worst-case loss bounds for sigmoided linear neurons. In Proc. 1996 Neural Information Processing Conference, 1996. To appear.
....$(\hat{y} - y) \cdot x_{t,i}\}$ and $w^{+}_{t+1,i} = v^{+}_i / \sum_{j=1}^{k} (v^{+}_j + v^{-}_j)$, $w^{-}_{t+1,i} = v^{-}_i / \sum_{j=1}^{k} (v^{+}_j + v^{-}_j)$. Figure 2: Algorithm $A^{(1)}_{\log}$ for learning single neurons with the logistic transfer function and the entropic loss function. [HKW96] for algorithm $A^{(1)}_{\log}$ is $L_{A^{(1)}_{\log}}(S) \leq \frac{4}{3} \cdot L_u(S) + \frac{U^2}{3} \cdot \ln(2k)$ for any sequence S of examples, where $L_u(S)$ denotes the loss of the optimal neuron with weight vector u for which $\|u\|_1 \leq U$. Using the technique sketched in Section 2 the transformation of algorithm ....
....is measured by $D((u^{+}, u^{-}), (w^{+}, w^{-})) = \sum_{i=1}^{k} \left[ u^{+}_i \ln \frac{u^{+}_i}{w^{+}_i} + u^{-}_i \ln \frac{u^{-}_i}{w^{-}_i} \right]$ with the convention that 0 ln 0 = 0. Note that the 2N weights of algorithm $A^{(1)}_{\log}$ are always normalized. The loss bound obtained [HKW96] for algorithm $A^{(1)}_{\log}$ is $L_{A^{(1)}_{\log}}(S) \leq \frac{4}{3} \cdot L_u(S) + \frac{U^2}{3} \cdot [D(u, w_1) - D(u, w_{T+1})]$. Note that this bound is equivalent to equation (2) with $b = 4/U^2$ and $a = 3/U^2$. Now we consider algorithm $A^{(2)}_{\log}$. Recall that the weights of the optimal neural ....
D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Worst-case loss bounds for sigmoided linear neurons. In Proc. 1996 Neural Information Processing Conference, 1996. To appear.
D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Worst-case loss bounds for sigmoided linear neurons. In Proc. 1995 Neural Information Processing Conference, pages 309--315. MIT Press, Cambridge, MA, November 1995.
....are good for different purposes and, generally speaking, incomparable (see [KWA97] for a discussion). In [KW97] a framework was introduced for deriving simple on-line learning updates. This framework has been applied to a variety of different learning algorithms and differentiable loss functions [HKW95, KW98]. The updates are always derived by approximately solving the following minimization problem: $w_{t+1} := \operatorname{argmin}_w U(w)$, where $U(w) = d(w, w_t) + \eta\, \mathrm{loss}(y_t, \sigma_r(w \cdot x_t))$ (1). Here loss denotes the chosen loss function. In our setting this would be the discrete loss. What is different now is ....
....use $d(w, w_t) = \|w - w_t\|^2/2$ as the divergence. This can be used as a potential function for the proof of the Perceptron convergence theorem. Multiplicative update algorithms such as Winnow and various exponentiated gradient algorithms use entropy-based divergences as potential functions [HKW95, KW98]. The function U in (1) is minimized by differentiating w.r.t. w. This works very well when the loss function is convex and differentiable. For example for linear regression, when the loss function is the square loss $(w_t \cdot x_t - y_t)^2/2$, then minimizing U(w) with the divergence $\|w - w_t\|^2$ ....
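For the linear regression example, the derivation can be sketched concretely. The code below is a minimal illustration under assumptions: the learning rate `eta` weighting the loss term and the names `objective` and `widrow_hoff_step` are hypothetical, and the gradient of the loss is evaluated at the old weight vector, which is the usual approximation leading to the Widrow-Hoff (gradient descent) update.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def objective(w, w_t, x, y, eta):
    """U(w) = ||w - w_t||^2 / 2 + eta * (w . x - y)^2 / 2 (assumed form)."""
    dist = sum((wi - wti) ** 2 for wi, wti in zip(w, w_t)) / 2.0
    return dist + eta * (dot(w, x) - y) ** 2 / 2.0

def widrow_hoff_step(w_t, x, y, eta):
    """Approximate minimizer of U: set the gradient to zero, with the
    loss gradient evaluated at the old weights w_t."""
    y_hat = dot(w_t, x)
    return [wi - eta * (y_hat - y) * xi for wi, xi in zip(w_t, x)]
```

Setting the gradient of U to zero gives w = w_t - eta (w . x - y) x; replacing w by w_t on the right-hand side yields the explicit update computed by `widrow_hoff_step`.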
D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Worst-case loss bounds for sigmoided linear neurons. In NIPS
....in the behavior of the GD and EG$^{\pm}$ algorithms for a linear neuron carry over to feed-forward neural networks, but it seems unlikely that one could prove worst-case bounds in this more complicated setting. For single sigmoided neurons, worst-case bounds have been obtained recently [HKW95]. We define the basic notation in Section 2. Our main algorithms are introduced in Section 3, and their derivations using the various distance measures are given in Section 4. In Section 5 we prove our worst-case upper bounds for the losses of the algorithms. Both Section 4 and ....
....a new bias, which favors sparse weight vectors. We have observed that in the case of linear regression, this leads to improved performance in high-dimensional problems if the target weight vector is sparse. We also expect to see similar behavior in more general settings. Recently Helmbold et al. [HKW95] were able to prove worst-case loss bounds for single sigmoided linear neurons when the tanh function is used as the sigmoid function and the loss function is the relative entropy loss. In this case, worst-case loss bounds can be obtained for the algorithms from the gradient descent and ....
D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Worst-case loss bounds for sigmoided linear neurons. In Advances in Neural Information Processing Systems 8. MIT Press, 1995. To appear.
....in the behavior of the GD and EG$^{\pm}$ algorithms for a linear neuron carry over to feed-forward neural networks, but it seems unlikely that one could prove worst-case bounds in this more complicated setting. For single sigmoided neurons, worst-case bounds have been obtained recently [HKW95]. We define the basic notation in Section 2. Our main algorithms are introduced in Section 3, and their derivations using the various distance measures are given in Section 4. In Section 5 we prove our worst-case upper bounds for the losses of the algorithms. Both Section 4 and Section 5 begin ....
....a new bias, which favors sparse weight vectors. We have observed that in the case of linear regression, this leads to improved performance in high-dimensional problems if the target weight vector is sparse. We also expect to see similar behavior in more general settings. Recently Helmbold et al. [HKW95] were able to prove worst-case loss bounds for single sigmoided linear neurons when the tanh function is used as the sigmoid function and the loss function is the relative entropy loss. In this case, worst-case loss bounds can be obtained for the algorithms from the gradient descent and ....
D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Worst-case loss bounds for sigmoided linear neurons. In Advances in Neural Information Processing Systems 8. MIT Press, 1995. To appear.
Helmbold, D. P., Kivinen, J., and Warmuth, M. K. (1996a), Worst-case loss bounds for sigmoided linear neurons, in "Advances in Neural Information Processing Systems 8," MIT Press, Cambridge, MA (to appear).
D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Worst-case loss bounds for sigmoided linear neurons. In Proc. 1995 Neural Information Processing Conference, pages 309-315. MIT Press, Cambridge, MA, November 1995.
.... algorithms were developed for the on-line linear least squares regression, i.e. when the comparison class consists of linear neurons (linear combination of experts) [LLW95, CBLW96, KW97]. This work has been generalized to the case where the comparison class is the set of sigmoided linear neurons [HKW95, KW98]. Also starting with Littlestone's work, relative loss bounds for the comparison class of linear threshold functions have been investigated [Lit88, GLS97]. All the on-line algorithms cited in the previous paragraph use fixed learning rates. In the simple settings the relative loss bound do ....
.... is differentiable, the Bregman divergence can also be written as a path integral: $\Delta_G(\tilde{\theta}, \theta) = \int_{\theta}^{\tilde{\theta}} (g(r) - g(\theta)) \cdot dr$. This integral version of the divergence has been used to define a notion of a convex loss matching the increasing transfer function $g(\cdot)$ of an artificial neuron [AHW95, HKW95, KW98]. 4 The Incremental Off-line Algorithm. In this section we give our most basic algorithm and show how to prove relative loss bounds in a general setting. Learning proceeds in trials $t = 1, \ldots, T$. In each trial t an example is processed. For density estimation, the examples are data ....
D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Worst-case loss bounds for sigmoided linear neurons. In Proc. 1995 Neural Information Processing Conference, pages 309-315. MIT Press, Cambridge, MA, November 1995.
....feature, with bounds of the form (3) and proofs analogous to ours, was already pointed out in [FSSW97]. Our second use for relative entropy is as a regularizing term in setting up a minimization problem that gives Vovk's rule for updating the weights. The basic idea in such a derivation (see [KW97, HKW95] for other examples) is to see the update as an act of balancing the need to maintain old information by staying close to the old weight vector and the need to learn by moving the weights in the direction of small loss on the last example. In Sect. 2 we review the basic expert framework and ....
....U t (v) c d re (v; v t ) L(y t ; v Delta x t ) and again v is constrained to be a probability vector. If the loss function is convex then $L(y_t, v \cdot x_t) \leq v \cdot L_t$ and U t (v) bounds U t (v) from above. The bounds that can be obtained for algorithms based on minimizing U t [KW97, HKW95] differ significantly from the style of bounds we have here. When the loss $L(y_t, \hat{y}_t)$ of the algorithm is compared to $L(y_t, u \cdot x_t)$ it is usually impossible to bound the additional loss by a constant (such as ec L ln n here). However, bounds where the comparison is to L(y t ; u ....
D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Worst-case loss bounds for sigmoided linear neurons. In Proc. 1995 Neural Information Processing Conference, pages 309--315. MIT Press, Cambridge, MA, November 1995.
.... After that, algorithms were developed for the on-line linear least squares regression, i.e. when the comparison class consists of linear neurons (linear combination of experts) [LLW95, CBLW96, KW97]. This work has been generalized to the case when sigmoided linear neurons are the comparison class [HKW95, KW98]. General frameworks of on-line learning algorithms were developed in [GLS97, KW97, KW98]. We follow the philosophy of Kivinen and Warmuth [KW97] of starting with a divergence function. From the divergence function we derive the on-line update and we then use the same divergence as a ....
....Pythagorean Theorem can be proven for Bregman divergences [Bre67, CL81, Csi91, JB90, HW98]. The latter theorem contradicts the triangular inequality, which is the reason why we use the term "divergence" instead of "distance". In the context of learning these divergences were rediscovered in [KW97, HKW95, KW98]. Related divergences are used in [GLS97]. Projections have recently been applied in [HW98] for the case when the underlying model shifts over time and the projections w.r.t. the divergences are used to keep the parameters of the algorithm in reasonable convex regions. This aids the ....
D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Worst-case loss bounds for sigmoided linear neurons. In Proc. 1995 Neural Information Processing Conference, pages 309--315. MIT Press, Cambridge, MA, November 1995.
....transfer function, the matching loss function is the squared Euclidean distance. The first result we get from this observation connecting matching losses to a general notion of distance is that certain previous results on generalized linear regression with matching loss on one-dimensional outputs [HKW95] directly generalize to multidimensional outputs. From a more general point of view, a much more interesting feature of these distance functions is how they allow us to view certain previously known learning algorithms, and introduce new ones, in a simple unified framework. To briefly explain this ....
....here is updated by the gradient with respect to , so this is not just a gradient descent with reparameterization [JW98]. However, we obtain the usual on-line gradient descent when is the identity function. When is the softmax function, we get the so-called exponentiated gradient (EG) algorithm [KW97, HKW95]. The connection of the distance function D to the update (1) is two-fold. First, (1) can be motivated as an approximate solution to a minimization problem in which the distance D ( t ; t 1 ) is used as a kind of penalty term to prevent too drastic an update based on a single example. Second, ....
D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Worst-case loss bounds for sigmoided linear neurons. In Proc. Neural Information Processing Systems 1995, pages 309--315. MIT Press, Cambridge, MA, November 1995.
....a measurement of the size of u. We call such bounds static bounds, because the comparison vector u is any member from the comparison class but it does not change with time. Surprisingly, such bounds are achievable even when there are no probabilistic assumptions made on the sequence of examples [Lit88, Vov90, HKW97, CBLW96, KW94, Byl97, HKW95]. In this paper we allow the comparison vector u to shift with time. For a sequence S of examples of length and a schedule of predictors hu 1 ; u i from the comparison class, we seek an upper bound of the form L(A; S) cL(hu 1 ; u i; S) c size(hu 1 ; u i) ....
.... link and the unnormalized relative entropy as the divergence function (see Figure 1): $D_{\mathrm{ne}} = \sum_{i=1}^{n} \left( u_i \ln \frac{u_i}{w_i} + w_i - u_i \right)$. The divergence functions $D_F$ are used as potential functions for proving static loss bounds for the corresponding generalized gradient descent algorithms [CBLW96, KW97a, HKW95, Byl97, KW97b]. As in the previous work in convex programming, we use projections based on these divergences. We now outline how worst-case loss bounds are obtained in the static case. At the center of all the static proofs [KW94, HKW95, Byl97, KW97b] for the generalized gradient descent algorithms lies the ....
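The unnormalized relative entropy can be checked numerically against the general Bregman form. This is a sketch under assumptions: the generic formula F(u) - F(w) - grad F(w) . (u - w) is the standard definition of a Bregman divergence rather than something quoted from the excerpt, the choice F(w) = sum_i (w_i ln w_i - w_i) is the potential assumed to generate this divergence, and all function names are hypothetical. Entries are assumed strictly positive.

```python
import math

def bregman(F, grad_F, u, w):
    """Generic Bregman divergence F(u) - F(w) - grad_F(w) . (u - w)."""
    g = grad_F(w)
    return F(u) - F(w) - sum(gi * (ui - wi) for gi, ui, wi in zip(g, u, w))

def potential(w):
    """Assumed potential F(w) = sum_i (w_i ln w_i - w_i)."""
    return sum(wi * math.log(wi) - wi for wi in w)

def grad_potential(w):
    """Gradient of the potential: the componentwise logarithm."""
    return [math.log(wi) for wi in w]

def unnormalized_relative_entropy(u, w):
    """sum_i (u_i ln(u_i / w_i) + w_i - u_i) for positive vectors."""
    return sum(ui * math.log(ui / wi) + wi - ui for ui, wi in zip(u, w))
```

The two computations agree up to rounding, and the gradient of this potential is the componentwise logarithm that appears as the link function of the multiplicative updates.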
D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Worst-case loss bounds for sigmoided linear neurons. In Proc. 1995 Neural Information Processing Conference, pages 309--315. MIT Press, Cambridge, MA, November 1995.
....functions lead to different update rules. One of the main contributions of this line of work is the use of the relative entropy as a distance function for motivating updates: $D_{RE}(u \| v) \stackrel{\mathrm{def}}{=} \sum_{i=1}^{N} u_i \log \frac{u_i}{v_i}$. Many other on-line algorithms with multiplicative weight updates [18, 3, 17, 12] are also motivated by this distance function and are thus rooted in the minimum relative entropy principle of Kullback [15, 11]. We also use a second-order Taylor approximation (at u = v) of the relative entropy called the $\chi^2$ distance, since it leads to updates that are computationally ....
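The relationship between the relative entropy and its second-order approximation can be illustrated numerically. This is a sketch, not the cited paper's code: the function names are hypothetical, and the factor 1/2 in the quadratic term comes from the Taylor expansion itself, so the normalization the paper uses for its chi-square distance may differ. Both arguments are assumed to be probability vectors, so the first-order term of the expansion vanishes.

```python
import math

def relative_entropy(u, v):
    """D_RE(u || v) = sum_i u_i log(u_i / v_i), with 0 log 0 taken as 0."""
    return sum(ui * math.log(ui / vi) for ui, vi in zip(u, v) if ui > 0)

def second_order_approx(u, v):
    """Second-order Taylor term at u = v: (1/2) sum_i (u_i - v_i)^2 / v_i."""
    return 0.5 * sum((ui - vi) ** 2 / vi for ui, vi in zip(u, v))
```

For vectors that are close together the two quantities nearly coincide, while the quadratic form avoids logarithms and exponentials in the resulting update.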
D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Worst-case loss bounds for sigmoided linear neurons. In Advances in Neural Information Processing Systems 8, 1996.
....of link functions. The key tool for analyzing an update is a distance function associated with an update [Lit89, KW97b]. Here we use a general form that is based on an arbitrary link function. This form was introduced in [JK97] and was inspired by the definition of matching loss function given in [HKW95, AHW95, JK97]. Delta f ( X i Z [i] i] f i (r) Gamma f i ( i] dr (5) This distance function is usually asymmetric. In [KW97b] discretized updates are derived from the distance functions. In Section 3 we extend this method to derive the continuous-time updates also from the ....
....norms characterizes the radically different behavior of the two algorithms that shows up experimentally [KW97b] One of the norms measures the instances x t and the corresponding dual norm measures the off line parameter vector . So far this has only been done for two pairs of dual norms [KW97b, HKW95], one pair characterizing the GD algorithm and another one for the EG algorithm. However for the case of linear threshold functions the updates for more general pairs of dual norms have been developed and analyzed [GLS97] This paper helps to explain why the dual norms disappear in the CT case, as ....
D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Worst-case loss bounds for sigmoided linear neurons. In Proc. 1995 Neural Information Processing Conference, pages 309--315. MIT Press, Cambridge, MA, November 1995.
....functions lead to different update rules. One of the main contributions of this line of work is the use of the relative entropy as a distance function for motivating updates: $D_{RE}(u \| v) \stackrel{\mathrm{def}}{=} \sum_{i=1}^{N} u_i \log \frac{u_i}{v_i}$. Many other on-line algorithms with multiplicative weight updates [21, 3, 20, 14] are also motivated by this distance function and are thus rooted in the minimum relative entropy principle of Kullback [18, 13]. To derive learning rules using relative entropy, we set $d(w_{t+1}, w_t) = D_{RE}(w_{t+1} \| w_t)$. It is hard to maximize F since both terms depend non-linearly on ....
D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Worst-case loss bounds for sigmoided linear neurons. In Advances in Neural Information Processing Systems 8, 1996.
....might be more convenient. In the one-dimensional case, for a strictly increasing continuous $\phi$ it is possible to define a matching loss function $L_\phi$ which has the property that the total loss is a convex function of the weight vector and thus, in particular, has no spurious local minima [AHW95, HKW95]. For example, the matching loss function for the logistic transfer function is the relative entropy (a generalization of the logarithmic loss for continuous-valued outcomes). The main theme of this paper is the generalization of the notion of the matching loss function for multidimensional ....
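The claim about the logistic transfer function can be checked numerically. The sketch below assumes the usual integral definition of the matching loss, namely the integral of (phi(z) - y) for z running from phi^{-1}(y) to phi^{-1}(y_hat); the excerpt does not spell this definition out, and the function names and the midpoint integration rule are illustrative choices. Outcomes and predictions are assumed to lie strictly between 0 and 1.

```python
import math

def logistic(z):
    """Logistic transfer function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the logistic function, for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def matching_loss_numeric(y, y_hat, steps=10000):
    """Midpoint-rule integral of (logistic(z) - y) from logit(y) to logit(y_hat)."""
    a, b = logit(y), logit(y_hat)
    h = (b - a) / steps
    return sum((logistic(a + (i + 0.5) * h) - y) * h for i in range(steps))

def binary_relative_entropy(y, y_hat):
    """y ln(y / y_hat) + (1 - y) ln((1 - y) / (1 - y_hat))."""
    return (y * math.log(y / y_hat)
            + (1.0 - y) * math.log((1.0 - y) / (1.0 - y_hat)))
```

The antiderivative of logistic(z) - y is ln(1 + e^z) - y z, and evaluating it at the two endpoints gives exactly the binary relative entropy, which is what the numerical integral reproduces.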
....[MN89, FT91]. These statistical interpretations of the loss function and relative loss bounds are discussed in a future paper [AW97]; here we need only some very basic properties of the matching loss. The most obvious new result we obtain comes from observing that the results of Helmbold et al. [HKW95] for matching loss functions in the one-dimensional case generalize almost directly into the multidimensional case. In particular, we generalize the result for the logistic transfer function to obtain a similar result for multiclass classification through the softmax function. The more ....
D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Worst-case loss bounds for sigmoided linear neurons. In Proc. 1995 Neural Information Processing Conference, pages 309--315. MIT Press, Cambridge, MA, November 1995.
....been developed. Examples are algorithms for learning linear threshold functions (Littlestone, 1988; Littlestone, 1989) and algorithms whose additional loss bound over the loss of the best linear combination of experts or sigmoided linear combination of experts is bounded (Kivinen & Warmuth, 1997; Helmbold, Kivinen & Warmuth, 1995). Significant progress has recently been achieved for other non-stationary settings building on the techniques developed in this paper (see discussion in the Conclusion Section). The paper is outlined as follows. After some preliminaries (Section 2) we present the algorithms (Section 3) and give ....
Helmbold, D. P., Kivinen, J., & Warmuth, M. K. (1995). Worst-case loss bounds for sigmoided linear neurons. In Proceedings of the 1995 Neural Information Processing Conference, (pp.