| J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1--64, January 1997. |
....17] Gradient-based methods are the simplest possible approach, but their convergence depends on careful selection of the learning rate, as well as constant attention to the nonnegativity constraints, which may not be naturally enforced. Multiplicative updates based on exponentiated gradients (EG) [5, 10] have been investigated as an alternative to traditional gradient-based methods. Multiplicative updates are naturally suited to sparse nonnegative optimizations, but EG updates, like their additive counterparts, suffer the drawback of having to choose a learning rate. Subset selection methods ....
....as in eq. 2). We will refer to the learning algorithm for hard margin SVMs based on these updates as Multiplicative Margin Maximization (M³). It is worth comparing the properties of these updates to those of other approaches. Like multiplicative updates based on exponentiated gradients (EG) [5, 10], the M³ updates are well suited to sparse nonnegative optimizations; unlike EG updates, however, they do not involve a learning rate, and they come with a guarantee of monotonic improvement. Like the updates for Sequential Minimal Optimization (SMO) [15], the M³ updates have a simple ....
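The contrast drawn in the two contexts above (additive gradient steps must watch the nonnegativity constraints, while EG steps keep weights nonnegative but still need a learning rate) can be illustrated with a minimal sketch. It assumes the normalized form of the EG update, in which each weight is multiplied by `exp(-eta * gradient)` and the vector is rescaled to sum to 1. The function names, the squared loss, the example vector, and the value of `eta` are illustrative choices, not taken from any of the citing papers.

```python
import math

def squared_loss_grad(w, x, y):
    """Gradient of (w.x - y)^2 with respect to w."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [2.0 * err * xi for xi in x]

def gd_step(w, grad, eta):
    """Additive update: w_i <- w_i - eta * grad_i (sign is unconstrained)."""
    return [wi - eta * gi for wi, gi in zip(w, grad)]

def eg_step(w, grad, eta):
    """Multiplicative update: w_i <- w_i * exp(-eta * grad_i), then renormalize."""
    unnorm = [wi * math.exp(-eta * gi) for wi, gi in zip(w, grad)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Hypothetical single example; both methods start from uniform weights.
w0 = [0.25, 0.25, 0.25, 0.25]
x, y = [1.0, 0.0, 0.0, 1.0], 0.0
eta = 0.5  # assumed learning rate; both updates require one

grad = squared_loss_grad(w0, x, y)
w_gd = gd_step(w0, grad, eta)
w_eg = eg_step(w0, grad, eta)

print("GD:", w_gd)  # some coordinates become negative
print("EG:", w_eg)  # all coordinates stay positive and sum to 1
```

With the same learning rate, the additive step pushes two weights below zero, whereas the multiplicative step only shrinks them toward zero, which is the sense in which EG is described as suited to sparse nonnegative problems.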
J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1--63, 1997.
....text categorisation. The first group, which includes David Lewis and colleagues, is using supervised learning algorithms for training neural networks, particularly Widrow-Hoff's error-correction learning paradigm and its recent variant, the exponentiated gradient method, due to Kivinen and Warmuth [6]. Lewis [7] used the TREC data set, containing Associated Press (AP) newswire texts and a collection of medical abstracts, to compare Widrow-Hoff and exponentiated gradient methods with a conventional information retrieval technique (the Rocchio learning algorithm). Lewis has shown that both the ....
Kivinen, J & Warmuth, MK. Exponentiated gradient versus gradient descent for linear predictors. Technical Report No. UCSC-CRL-94-16, 1994. Santa Cruz, Baskin Center for Computer Engineering and Information Sciences.
.... theorems will unify results from classical statistics (inference in exponential families and generalized linear models) with those from computational learning theory (weighted majority, aggregating algorithm, exponentiated gradient). This regret bound framework has been studied before in [LW92, KW97, KW96, Vov90, CBFH 95], among others. Also, some of our results are similar to results from classical statistics such as the Cramér-Rao variance bound [SO91]. Our theorems are more general than each of these previous results in at least one of the following ways. First, they apply to more ....
....Figure 4: Some examples of link functions (logarithmic, normalized exponential). Some examples of GGD algorithms are ordinary gradient descent, the perceptron learning rule, and the Exponentiated Gradient algorithm of [KW97]. We will examine some of these algorithms in more detail below. But first, we will prove regret bounds for a class of algorithms that includes GGD. 6 General regret bounds. 6.1 Preliminaries. In many common MAP algorithms, each individual loss function can be written as a Bregman divergence. ....
[Article contains additional citation context not shown here]
Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1--63, 1997. Preliminary version appeared as tech report UCSC-CRL-94-16; extended abstract appeared in 27th STOC.
....trials, the difference between its cumulative loss (i.e. the sum of the losses incurred in each trial) and the corresponding cumulative loss of a reference predictor, whose predictions are kept hidden from the master. Using this sequential prediction model, we will show (extending results from [3, 10, 12]) that a well known algorithm for linear regression, Gradient Descent, and a recently proposed variant, Exponentiated Gradient, have a reasonably good performance for a wide range of loss functions even when the regression problem is highly nonlinear and the data are generated with no statistical ....
....range of loss functions even when the regression problem is highly nonlinear and the data are generated with no statistical assumption. As a further motivation for the study of this prediction model, we point out the fact that any good sequential prediction algorithm can be efficiently transformed [2, 12, 15] into an algorithm that performs well in the more traditional statistical (or batch) frameworks, like those studied in [5, 9]. We use the sequential prediction model to analyze two types of on-line regression problems. In the linear regression problem the master algorithm predicts, in each trial ....
[Article contains additional citation context not shown here]
J. Kivinen and M.K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1--63, 1997.
....from being easily generalizable beyond SAT. 3 SDF uses a different penalty for its primal search. Multiplicative versus additive updates: The SDF procedure updates multiplicatively rather than additively, in an analogy to the work on multiplicative updates in machine learning theory [Kivinen and Warmuth, 1997]. A multiplicative update is naturally interpreted as following an exponentiated version of the subgradient; that is, instead of using the traditional additive update L=1mV e one uses L=L given the vector of penalized violation values e . Below we compare additive and multiplicative ....
J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Infor. Comput., 132:1--63, 1997.
....newly crawled pages. A positively classified page is viewed as a good page for the topic. In this sense this measure may be viewed as similar to precision where content based relevance is decided by the classifier. We use Widrow-Hoff (WH), Exponentiated Gradient (EG), and Rocchio classifiers [23, 12, 11] with feature selection using Correlation Coefficient [16] to select the best 50 features for each topic. The optimal threshold is set by maximizing the F1 score [22] on the training set. Due to limited space we refer the reader to [14, 25] for details on the classifiers. It may be observed that ....
J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Technical Report UCSC-CRL-94-16, Baskin Center for Computer Engineering & Information Sciences; University of California, Santa Cruz, CA, 1994.
....representations seems to provide the best performance in their experiments. On a related note, Lewis et al. [111] compare a number of learning methods for linear text classifiers including Rocchio's algorithm, the Widrow-Hoff (WH) update rule [170] and the exponentiated gradient (EG) algorithm [89]. They find that both WH and EG yield consistently superior results to Rocchio. Furthermore, they point out that since EG seems to drive many of the linear discriminant coefficients to zero, effectively reducing the number of features used in making ....
Kivinen, J., and Warmuth, M. Exponentiated gradient versus gradient descent for linear predictors. Tech. Rep. UCSC-CRL-94-16, Baskin Center for Computer Engineering and Information Sciences; University of California, Santa Cruz, 1994.
....optimal nature of the dual updates, the hinge penalty appears to retain an advantage over the linear penalty. Multiplicative versus additive updates: The SDF procedure updates y multiplicatively rather than additively, in an analogy to the work on multiplicative updates in machine learning theory [Kivinen and Warmuth, 1997]. A multiplicative update is naturally interpreted as following an exponentiated version of the subgradient; that is, instead of using the traditional additive update y 0 = y (v) one uses y 0 = y (v) 1, given the vector of penalized violation values (v) Below we compare both ....
J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Infor. Comput., 132:1--63, 1997.
....1. Plot of error vs. total number of features for an optimal classifier that knows which feature is relevant to a classification decision (solid) and Voting Gibbs algorithms using different priors (dash and dash dot). Details are provided in Section 5. setting (Ng, 1998; Littlestone, 1988; Kivinen and Warmuth, 1994). That this result is not merely of theoretical interest is demonstrated by the empirical results shown in Figure 1. These results, which are described in more detail in Section 5, show classification error rates in an experiment in which one feature is relevant to a classification decision. The ....
....selection, and since it is only logarithmic in f, the total number of features, it means that Bayesian feature selection using the particular prior described earlier is very insensitive to the presence of irrelevant features. This result also recovers the best known such rates (Littlestone, 1988; Kivinen & Warmuth, 1994; Ng, 1998) and has sample complexity that beats that of the common wrapper model (Kohavi & John, 1997) feature selection algorithm (see the analysis in Ng, 1998). Indeed, the logarithmic dependence suggests that we can, for instance, square the total number of features, and need only twice ....
Kivinen, J., & Warmuth, M. K. (1994). Exponentiated gradient versus gradient descent for linear predictors (Technical Report UCSC-CRL-94-16). Univ. of California Santa Cruz, Computer Research Laboratory.
....model. Lewis et al. [33] studied the Adaline (see section 2.5.3) as the classifier model for text categorization. Three different training methods were compared, namely the Rocchio algorithm [50], the Widrow-Hoff algorithm or the delta rule [61], and Kivinen and Warmuth's EG algorithm [24], which is an extension to the delta rule. Batch training was used for the Rocchio Algorithm to update the weights of the Adaline by taking into account the whole training set at once, while online training was used for the other two algorithms to perform weight update by running through the ....
J. Kivinen and M. K. Warmuth, "Exponentiated gradient versus gradient descent for linear predictors." Technical Report UCSC-CRL-94-16, Baskin Center for Computer Engineering and Information Sciences, University of California, Santa Cruz, 1994.
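As a rough illustration of the online training described in the context above, the sketch below runs the Widrow-Hoff (delta rule) update and a normalized EG update over the same small stream of examples, one example at a time. It is a sketch under stated assumptions, not the procedure used by Lewis et al.: the four binary examples, the target that depends only on the first feature, the learning rate, and the epoch count are all hypothetical.

```python
import math

def predict(w, x):
    """Linear prediction w.x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def widrow_hoff_epoch(w, examples, eta):
    """One online pass of the delta rule: w <- w - 2*eta*(w.x - y)*x."""
    for x, y in examples:
        err = predict(w, x) - y
        w = [wi - 2.0 * eta * err * xi for wi, xi in zip(w, x)]
    return w

def eg_epoch(w, examples, eta):
    """One online pass of normalized EG: w_i <- w_i*exp(-2*eta*(w.x - y)*x_i)/Z."""
    for x, y in examples:
        err = predict(w, x) - y
        unnorm = [wi * math.exp(-2.0 * eta * err * xi) for wi, xi in zip(w, x)]
        z = sum(unnorm)
        w = [u / z for u in unnorm]
    return w

def total_loss(w, examples):
    """Sum of squared errors over the example set."""
    return sum((predict(w, x) - y) ** 2 for x, y in examples)

# Hypothetical data: the label equals the first feature; the rest are irrelevant.
examples = [
    ([1.0, 0.0, 1.0, 0.0], 1.0),
    ([0.0, 1.0, 0.0, 1.0], 0.0),
    ([1.0, 1.0, 0.0, 0.0], 1.0),
    ([0.0, 0.0, 1.0, 1.0], 0.0),
]
eta = 0.1  # assumed learning rate

w_wh_init = [0.0, 0.0, 0.0, 0.0]      # Widrow-Hoff commonly starts at zero
w_eg_init = [0.25, 0.25, 0.25, 0.25]  # EG starts from uniform weights
w_wh, w_eg = w_wh_init, w_eg_init
for _ in range(50):
    w_wh = widrow_hoff_epoch(w_wh, examples, eta)
    w_eg = eg_epoch(w_eg, examples, eta)

print("Widrow-Hoff:", [round(v, 3) for v in w_wh])
print("EG:         ", [round(v, 3) for v in w_eg])
```

In this toy setting the EG weights remain a probability vector, and the weight on the single relevant feature stays the largest throughout training.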
No context found.
J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1--64, January 1997.
No context found.
J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. In ACM Symp. on the Theory of Computing, 1995.
No context found.
J. Kivinen and M. K. Warmuth, "Exponentiated gradient versus gradient descent for linear predictors," Information and Computation, vol. 132, no. 1, pp. 1-- 64, Jan. 1997.
No context found.
J. Kivinen and M. K. Warmuth, "Exponentiated gradient versus gradient descent for linear predictors," Inform. Comput., vol. 132, no. 1, pp. 1--64, Jan. 1997.
No context found.
J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1--63, 1997.
No context found.
J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Technical Report UCSC-CRL-94-16, Baskin Center for Computer Engineering & Information Sciences; University of California, Santa Cruz, CA, 1994.
No context found.
J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1--63, 1997.
No context found.
J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1--64, January 1997.
No context found.
J. Kivinen and M. Warmuth, Exponentiated Gradient Versus Gradient Descent for Linear Predictors, Journal of Information and Computation, vol. 132, no. 1, pp. 1-64, 1997
No context found.
J. Kivinen and M. K. Warmuth, "Exponentiated gradient versus gradient descent for linear predictors," Tech. Rep. UCSC-CRL-94-16, University of California, Santa Cruz, June 1994.
No context found.
J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1--63, 1997.
No context found.
J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Technical Report UCSC-CRL-94-16, Baskin Center for Computer Engineering & Information Sciences; University of California, Santa Cruz, CA, 1994.
No context found.
J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132:1-64, 1997.
No context found.
J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1--64, January 1997.
No context found.
J. Kivinen and M. Warmuth 1997, "Exponentiated gradient versus gradient descent for linear predictors," Information and Computation 132, 1--64.
CiteSeer.IST - Copyright Penn State and NEC