73 citations found.
Kivinen J. & Warmuth M.K. (1997) Additive versus exponentiated gradient updates for linear prediction. Information and Computation, Vol.132(1): 1--64.

This paper is cited in the following contexts:

Sample Complexity of Classification - Langford

....2 presents the formal model. Section 3 presents the test set bound. Section 4 presents the train set bound. 2. Formal Model. There are many somewhat arbitrary choices of learning model. The one we use can (at best) be motivated by its simplicity. Other models such as the online learning model [5], PAC learning [11] and the uniform convergence model [12] differ in formulation, generality, and in the scope of addressable questions. The strongest motivation for studying the sample complexity model here is simplicity and corresponding generality of results. Appendix section 5 discusses the ....

....learning will be achieved. Both of these models can support stronger statements than the basic sample complexity model presented here. Results from both of these models can apply to the sample complexity model presented here after appropriate massaging of results. The online learning model [5] makes no assumptions. Typical theorems have the form "This learning algorithm's performance will be nearly as good as any one of a set of classifiers." The online learning model has very general results and no ability to answer questions about future performance as we address here. The sample ....

J. Kivinen and M. Warmuth, "Additive Versus Exponentiated Gradient Updates for Linear Prediction," in Journal of Information and Computation, vol. 132, no. 1, pp. 1-64, January 1997.


Online Oblivious Routing - Bansal, Blum, Chawla, Meyerson (2003)   (1 citation)

....presents the algorithm with a convex function c . The cost incurred by the algorithm is c ) and the objective of the algorithm is to minimize the sum of costs over all time steps. This setting is closely related to the problem of prediction and regression under linear loss functions (see [8, 19, 20] and references therein) and the previously mentioned problem of designing nearly optimal strategies for repeated games [17, 7]. We review the algorithm of Zinkevich [21] in detail in Section 3 and adapt it to our setting. The framework of online convex programming is general enough that it ....

M. Warmuth and J. Kivinen. Additive versus exponentiated gradient updates for linear prediction. Journal of Information and Computation, 132(1):1--64, 1997.


Multiclass Learning by Probabilistic Embeddings - Dekel, Singer (2002)   (5 citations)

....proportional to the distance between C and C 0 to the loss defined in Eq. (4). This penalty on C can be viewed as a form of regularization (see for instance [10]). Similar paradigms have been used extensively in the pioneering work of Warmuth and his colleagues on online learning (see for instance [7] and the references therein) and more recently for incorporating prior knowledge into boosting [11]. The regularization factor we employ is the KL divergence between the images of C and C 0 under the logistic transformation, R(S C, C 0 ) D[#(Cy j ) #(C 0 y j ) The influence of this ....

Jyrki Kivinen and Manfred K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132(1):1--64, January 1997.


Text Chunking based on a Generalization of Winnow - Zhang, Damerau, Johnson (2001)   (2 citations)

....solution to this problem, we shall first examine a derivation of the Winnow algorithm in [4] which motivates a more general solution to be presented later. Following [4] we consider a loss function max( w ; 0) often called hinge loss. Using the general on line learning framework in [7], for each data point (x ) we consider an online update rule such that the weight w after seeing the i th example is given by the solution to min j ln ew max( w ; 0) (5) Setting the gradient of the above formula to zero, we obtain ln r w i 1 = 0: (6) In ....

J. Kivinen and M.K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Journal of Information and Computation, 132:1-64, 1997.


On the Dual Formulation of Regularized Linear Systems With Convex.. - Zhang

.... To better understand this, observe that the Perceptron update corresponds to the square norm regularization and the Winnow update corresponds to the entropy regularization (for example, see the proofs of both methods in [5] or the comparison of exponentiated gradient versus gradient descent in [8]). The optimal margin SVM for linearly separable problems modifies a Perceptron as minimizing the 2 norm under a margin constraint, which corresponds to the minimization of entropy under the same margin constraint for the Winnow (or exponentiated gradient) family of algorithms. The soft margin SVM ....

....large k: w k Gamma1 Gamma x k ) 1, hence the growth rate of log(n) is achieved by simply summing the above equality over k. Note that this logarithmic factor indicates that the correct batch learning rate of O(1/n) cannot be obtained by the typical randomization technique (for example, see [8]) for modifying online algorithms (and mistake bounds) as batch algorithms. It is also important to note that the matching loss function concept (cf. [6]) is not important in our analysis, which allows us to analyze problems with any loss function. The role of the matching loss function ....

J. Kivinen and M.K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Journal of Information and Computation, 132:1--64, 1997.


Convergence of Large Margin Separable Linear Classification - Zhang (2001)   (1 citation)

....with a small online mistake bounds described in [3] would imply an expected classification error of O( log ) which can be slightly improved (by a log n factor) if we adopt a slightly better covering number estimate such as the bounds in [12, 14] bound. The readers are referred to [6] and references therein for this type of analysis. The technique may lead to a bound with an expected generalization performance of O( Besides the above mentioned approaches, generalization ability can also be studied in the statistical mechanical learning framework. It was shown that for ....

J. Kivinen and M.K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Journal of Information and Computation, 132:1--64, 1997.


Regularized Winnow Methods - Zhang (2001)

....j of the weight vector w as: $w_j \leftarrow w_j \exp(\eta x^i_j y^i)$, where $\eta > 0$ is the learning rate parameter, and the initial weight vector can be taken as $w_j = \mu_j > 0$. The Winnow algorithm belongs to a general family of algorithms called exponentiated gradient descent with unnormalized weights (EGU) [9]. There can be several variants. One is called balanced Winnow, which is equivalent to an embedding of the input space into a higher dimensional space as: $\tilde{x} = [x, -x]$. This modification allows the positive weight Winnow algorithm for the augmented input $\tilde{x}$ to have the effect of both ....
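The multiplicative update described in this excerpt can be illustrated with a minimal Python sketch. It assumes labels in {-1, +1}, mistake-driven updates, a zero decision threshold, and the balanced embedding [x, -x]. The function names, the learning rate of 0.5, the uniform initial weight, and the toy data are illustrative assumptions, not taken from the cited paper.

```python
import math

def balanced_embed(x):
    """Map x to [x, -x] so that strictly positive weights can act like signed weights."""
    return list(x) + [-v for v in x]

def egu_update(w, z, y, eta):
    """Unnormalized exponentiated-gradient step: w_j <- w_j * exp(eta * z_j * y)."""
    return [wj * math.exp(eta * zj * y) for wj, zj in zip(w, z)]

def predict(w, z):
    """Return +1 if the linear score is positive, otherwise -1."""
    score = sum(wj * zj for wj, zj in zip(w, z))
    return 1 if score > 0 else -1

def train(samples, eta=0.5, epochs=20, init=1.0):
    """Mistake-driven training: weights change only when a sample is misclassified."""
    w = [init] * (2 * len(samples[0][0]))
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            z = balanced_embed(x)
            if predict(w, z) != y:
                w = egu_update(w, z, y, eta)
                mistakes += 1
        if mistakes == 0:  # a full pass without mistakes: stop early
            break
    return w

# Hypothetical linearly separable toy data: (features, label).
samples = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([2.0, 1.0], 1), ([1.0, 3.0], -1)]
w = train(samples)
```

Because every update multiplies a weight by a positive factor, the weights stay strictly positive; the sign information is carried by the embedding rather than by the weights themselves.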

....problems, both Perceptron and Winnow are able to find a weight that separates the in class vectors from the out of class vectors in the training set within a finite number of steps. However, the number of mistakes (updates) before finding a separating hyperplane can be very different [10, 9]. This difference suggests that the two algorithms serve different purposes. For linearly separable problems, Vapnik proposed a method that optimizes the Perceptron mistake bound which he calls optimal hyperplane (see [15]). The same method has also appeared in the statistical mechanical ....

[Article contains additional citation context not shown here]

J. Kivinen and M.K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Journal of Information and Computation, 132:1--64, 1997.


Algorithms for Non-negative Matrix Factorization - Lee, Seung (2001)   (54 citations)

....to see that this multiplicative factor is unity when V = WH, so that perfect reconstruction is necessarily a fixed point of the update rules. 5 Multiplicative versus additive update rules. It is useful to contrast these multiplicative updates with those arising from gradient descent [14]. In particular, a simple additive update for H that reduces the squared distance can be written as $H_{a\mu} \leftarrow H_{a\mu} + \eta_{a\mu} [(W^T V)_{a\mu} - (W^T W H)_{a\mu}]$ (6). If $\eta_{a\mu}$ are all set equal to some small positive number, this is equivalent to conventional gradient descent. As long as this number is sufficiently small, the update ....
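The contrast drawn in this excerpt can be sketched in plain Python. The multiplicative rule below is the standard update for H under squared distance, H <- H * (W^T V) / (W^T W H), and the additive rule is a gradient step with a single scalar step size. The helper names, the small `eps` guard against division by zero, and the 2x2 example matrices are assumptions made for illustration.

```python
def transpose(a):
    return [list(row) for row in zip(*a)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def _numerator_denominator(v, w, h):
    """Return (W^T V, W^T W H), the two terms shared by both update rules."""
    wt = transpose(w)
    return matmul(wt, v), matmul(matmul(wt, w), h)

def multiplicative_update_h(v, w, h, eps=1e-12):
    """H <- H * (W^T V) / (W^T W H), elementwise; eps is an added safeguard."""
    num, den = _numerator_denominator(v, w, h)
    return [[h[a][m] * num[a][m] / (den[a][m] + eps)
             for m in range(len(h[0]))] for a in range(len(h))]

def additive_update_h(v, w, h, step):
    """H <- H + step * (W^T V - W^T W H), i.e. plain gradient descent on H."""
    num, den = _numerator_denominator(v, w, h)
    return [[h[a][m] + step * (num[a][m] - den[a][m])
             for m in range(len(h[0]))] for a in range(len(h))]

def squared_distance(v, w, h):
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

# Hypothetical example: V is built as an exact product W * H_true.
w = [[1.0, 2.0], [3.0, 4.0]]
h_true = [[1.0, 0.5], [2.0, 1.0]]
v = matmul(w, h_true)
h0 = [[1.0, 1.0], [1.0, 1.0]]
h1 = multiplicative_update_h(v, w, h0)
```

When V equals WH exactly, the numerator and denominator coincide, so the multiplicative factor is one and the additive correction is zero; both rules leave H unchanged, which matches the fixed-point remark in the excerpt.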

Kivinen, J & Warmuth, M (1997). Additive versus exponentiated gradient updates for linear prediction. Journal of Information and Computation 132, 1--64.


Multiplicative Updates for Classification by Mixture Models - Saul, Lee (2002)

....It is worth comparing these multiplicative updates to others in the literature. Jebara and Pentland [6] derived similar updates for mixture weights, but without emphasizing the special form of eq. (13). Others have investigated multiplicative updates by the method of exponentiated gradients (EG) [7]. Our updates do not have the same form as EG updates: in particular, note that the gradients in eqs. (13-14) are not exponentiated. If we use one basis function per class and an identity matrix for the mixture weights, then the updates reduce to the method of generalized iterative scaling [2] for ....

J. Kivinen and M. Warmuth (1997). Additive versus exponentiated gradient updates for linear prediction. Journal of Information and Computation 132: 1--64.


Additive Models, Boosting, and Inference for Generalized.. - Lafferty (1999)   (2 citations)

.... results of information geometry and the maximum entropy method generalize to Bregman divergences [9, 11]. They have recently been used in the machine learning literature in the work of Warmuth and his colleagues, as a means of obtaining loss bounds for a broad class of on line learning algorithms [17, 19]. As we indicate in this paper, many of the bounds and techniques one can obtain for maximum likelihood estimation, based upon the Kullback-Leibler divergence for exponential families, have analogues for general Bregman divergences. The use of statistical inference techniques based on the Bregman ....

....of very useful qualities, and this recent addition to the machine learning literature may have many further applications. As exploited by Warmuth et al. and Della Pietra et al., these similarity measures have convexity properties that allow bounds and auxiliary functions to be easily derived [19, 17, 14]. Their use can often be given an interpretation in terms of a generalized maximum entropy principle [11], and the projection operators that are defined for Bregman divergences can be useful for proving convergence of various learning algorithms and constrained optimization procedures. 13 0.1 ....

J. Kivinen and M. Warmuth, "Additive versus exponentiated gradient updates for linear prediction," Information and Computation, 132(1), pp. 1--64, 1997.


Barrier Boosting - Rätsch, Warmuth, Mika, Onoda, Müller (2000)   Self-citation (Warmuth)

No context found.

J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132(1):1--64, 1997.


Boosting as Entropy Projection - Kivinen, Warmuth (1999)   (18 citations)  Self-citation (Kivinen Warmuth)

No context found.

J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132(1):1--64, 1997.


Barrier Boosting - Rätsch, Warmuth, Mika, Onoda, Lemm..   Self-citation (Warmuth)

No context found.

J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132(1):1--64, 1997.


Adaptive Caching by Refetching - Gramacy, Warmuth, Brandt, Ari (2002)   (2 citations)  Self-citation (Warmuth)

No context found.

J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132(1):1--64, January 1997.


Averaging Expert Predictions - Kivinen, Warmuth (1999)   (10 citations)  Self-citation (Kivinen Warmuth)

No context found.

J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132(1):1--64, January 1997.


Adaptive Caching by Refetching (View in Color) - Gramacy, Warmuth, Brandt, Ari   Self-citation (Warmuth)

....give a precise definition of adaptive when the data stream is continually changing. We use the term adaptive only informally and when we want to be precise we use off line comparators to judge the performance of our on line algorithms, as is commonly done in on line learning [LW94, CBFH 97, KW97]. A good adaptive on line policy must do well compared to off line comparators. In this paper we use two off line comparators: BestFixed and BestShifting(## #). BestFixed is the a posteriori selected policy with the lowest miss rate on the entire request stream for our twelve policies. ....

J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132(1):1--64, January 1997.


Adaptive Caching by Experts - Gramacy (2003)   Self-citation (Warmuth)

....Many caching policies claim to be adaptive, but seldom is adaptive clearly defined. Thus, we use the term adaptive only informally. When we want to be precise, we use off line comparators to judge the performance of our on line algorithms, as is commonly done in the online learning community [20, 10, 18]. In this paper, we use three off line comparators: BestFixed, BestShifting###, and BestRefetching###. These are described below. In addition to these off line comparators we also compare to LRU (our easiest comparator). A successfully adaptive policy should do at least as well as LRU. We also ....

J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132(1):1--64, January 1997.


Adaptive Caching by Refetching - Gramacy, Warmuth, Brandt, Ari (2002)   (2 citations)  Self-citation (Warmuth)

....give a precise definition of adaptive when the data stream is continually changing. We use the term adaptive only informally and when we want to be precise we use off line comparators to judge the performance of our on line algorithms, as is commonly done in on line learning [LW94, CBFH 97, KW97]. An on line algorithm is called adaptive if it performs well when measured up against off line comparators. In this paper we use two off line comparators: BestFixed and BestShifting(#). BestFixed is the a posteriori selected policy with the lowest miss rate on the entire request stream for ....

J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132(1):1--64, January 1997.


Linear Hinge Loss and Average Margin - Gentile, Warmuth (1998)   (10 citations)  Self-citation (Warmuth)

....appears in the exponents of factors that multiply the old weights. The factors are now used to correct the weights in the right direction when the algorithm under or overshot. The algorithms are good for different purposes and, generally speaking, incomparable (see [KWA97] for a discussion). In [KW97] a framework was introduced for deriving simple on line learning updates. This framework has been applied to a variety of different learning algorithms and differentiable loss functions [HKW95, KW98]. The updates are always derived by approximately solving the following minimization problem w t 1 ....

....and it becomes the potential function in the amortized analysis used to prove loss bounds for the corresponding algorithm. The use of an amortized analysis in the context of learning essentially goes back to [Lit89] and the method for deriving updates based on the divergence was introduced in [KW97]. The divergence may be seen as a regularization term and may also serve as a barrier function in the optimization problem (1) for the purpose of keeping the weights in a particular region. The additive algorithms, such as gradient descent and the Perceptron algorithm, use $d(w, w_t) = \|w - w_t\|$ ....
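As a rough illustration of the additive and multiplicative families contrasted in this excerpt, the sketch below shows one gradient descent (GD) step and one exponentiated gradient (EG) step for squared loss on a linear predictor. The update forms follow the standard statements of GD and normalized EG; the function names, the learning rate of 0.1, and the example values are assumptions chosen for illustration.

```python
import math

def predict(w, x):
    """Linear prediction y_hat = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def gd_step(w, x, y, eta):
    """Additive update: w <- w - 2 * eta * (y_hat - y) * x."""
    err = predict(w, x) - y
    return [wi - 2.0 * eta * err * xi for wi, xi in zip(w, x)]

def eg_step(w, x, y, eta):
    """Multiplicative update: w_i <- w_i * exp(-2 * eta * (y_hat - y) * x_i),
    followed by renormalization so the weights sum to 1."""
    err = predict(w, x) - y
    scaled = [wi * math.exp(-2.0 * eta * err * xi) for wi, xi in zip(w, x)]
    total = sum(scaled)
    return [s / total for s in scaled]

# Hypothetical example: uniform start on the probability simplex.
w0 = [0.5, 0.5]
x, y = [1.0, 0.0], 1.0
w_gd = gd_step(w0, x, y, eta=0.1)
w_eg = eg_step(w0, x, y, eta=0.1)
```

In GD the gradient is subtracted from the weights, which corresponds to a squared Euclidean distance as the divergence. In EG the gradient appears in the exponent of a factor that multiplies the old weight, which corresponds to a relative entropy divergence and keeps the weights positive.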

[Article contains additional citation context not shown here]

J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Inform. and Comput., 132(1):1--64, 1997.


Tracking a Small Set of Experts by Mixing Past Posteriors - Bousquet, Warmuth (2002)   (2 citations)  Self-citation (Warmuth)

No context found.

J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. In Journal of Information and Computation, 132(1):1-64, January 1997.


Generalization Error Bounds for Aggregation by Mirror.. - Anatoli Juditsky..

No context found.

Kivinen J. & Warmuth M.K. (1997) Additive versus exponentiated gradient updates for linear prediction. Information and Computation, Vol.132(1): 1--64.


Online Independent Component Analysis with Local.. - Schraudolph.. (2000)

No context found.

J. Kivinen and M. K. Warmuth, "Additive versus exponentiated gradient updates for linear prediction", in Proc. 27th Annual ACM Symposium on Theory of Computing, New York, NY, May 1995, pp. 209-218, The Association for Computing Machinery.


Online Oblivious Routing - Bansal, Blum, Chawla, Meyerson (2003)   (1 citation)

No context found.

M. Warmuth and J. Kivinen. Additive versus exponentiated gradient updates for linear prediction. Journal of Information and Computation, 132(1):1--64, 1997.


Tutorial on Practical Prediction Theory for Classification - Langford (2005)   (1 citation)

No context found.

J. Kivinen and M. Warmuth, "Additive Versus Exponentiated Gradient Updates for Linear Prediction," in Journal of Information and Computation, vol. 132, no. 1, pp. 1-64, January 1997. http://www.cse.ucsc.edu/~manfred/pubs/lin.ps

CiteSeer.IST - Copyright Penn State and NEC