A scaled conjugate gradient algorithm for fast supervised learning (1993)

by M. Møller
Venue: Neural Networks

Results 1 - 10 of 451

Evolving Artificial Neural Networks

by Xin Yao, 1999
"... This paper: 1) reviews different combinations between ANN's and evolutionary algorithms (EA's), including using EA's to evolve ANN connection weights, architectures, learning rules, and input features; 2) discusses different search operators which have been used in various EA's; ..."
Abstract - Cited by 574 (6 self)
This paper: 1) reviews different combinations between ANN's and evolutionary algorithms (EA's), including using EA's to evolve ANN connection weights, architectures, learning rules, and input features; 2) discusses different search operators which have been used in various EA's; and 3) points out possible future research directions. It is shown, through a considerably large literature review, that combinations between ANN's and EA's can lead to significantly better intelligent systems than relying on ANN's or EA's alone.

Probabilistic non-linear principal component analysis with Gaussian process latent variable models

by Neil Lawrence, Aapo Hyvärinen - Journal of Machine Learning Research, 2005
"... Summarising a high dimensional data set with a low dimensional embedding is a standard approach for exploring its structure. In this paper we provide an overview of some existing techniques for discovering such embeddings. We then introduce a novel probabilistic interpretation of principal component ..."
Abstract - Cited by 229 (24 self)
Summarising a high dimensional data set with a low dimensional embedding is a standard approach for exploring its structure. In this paper we provide an overview of some existing techniques for discovering such embeddings. We then introduce a novel probabilistic interpretation of principal component analysis (PCA) that we term dual probabilistic PCA (DPPCA). The DPPCA model has the additional advantage that the linear mappings from the embedded space can easily be nonlinearised through Gaussian processes. We refer to this model as a Gaussian process latent variable model (GP-LVM). Through analysis of the GP-LVM objective function, we relate the model to popular spectral techniques such as kernel PCA and multidimensional scaling. We then review a practical algorithm for GP-LVMs in the context of large data sets and develop it to also handle discrete valued data and missing attributes. We demonstrate the model on a range of real-world and artificially generated data sets.

Citation Context

...log-likelihood is a highly non-linear function of the embeddings and the parameters. We are therefore forced to turn to gradient based optimisation of the objective function. Scaled conjugate gradient (Møller, 1993) is an approach to optimisation which implicitly considers second order information while using a scale parameter to regulate the positive definiteness of the Hessian at each point. We made use of ...
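
As an illustration of the scale-parameter idea described in this context, here is a minimal sketch in the spirit of Møller's SCG (our own code, not Lawrence's implementation; the function names and the toy quadratic are assumptions for illustration): the curvature along a search direction d is obtained from a Hessian-vector product, and the scale lambda is raised whenever that curvature is not positive, so the step size from the local quadratic model stays well defined. The full algorithm additionally adapts lambda between iterations from a comparison parameter, which is omitted here.

import jax
import jax.numpy as jnp

def scg_like_step(f, w, d, lam=1e-4):
    # One curvature-regulated step along direction d for objective f.
    # Second-order information enters only through the Hessian-vector
    # product H d; the scale lam keeps the effective curvature positive.
    g = jax.grad(f)(w)
    _, Hd = jax.jvp(jax.grad(f), (w,), (d,))   # exact H d, never forming H
    delta = d @ Hd + lam * (d @ d)             # regulated curvature d^T (H + lam I) d
    if delta <= 0:                             # indefinite Hessian: raise the scale
        lam_bar = 2.0 * (lam - delta / (d @ d))
        delta = -delta + lam * (d @ d)
        lam = lam_bar
    alpha = -(d @ g) / delta                   # step size from the local quadratic model
    return w + alpha * d, lam

# Toy usage (ours, not from the cited paper): an indefinite quadratic
# where the regulation branch actually fires.
A = jnp.diag(jnp.array([2.0, -1.0]))
f = lambda w: 0.5 * w @ A @ w
w = jnp.array([0.1, 1.0])
d = -jax.grad(f)(w)                            # steepest-descent direction
w_new, lam = scg_like_step(f, w, d)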

First and Second-Order Methods for Learning: between Steepest Descent and Newton's Method

by Roberto Battiti - Neural Computation, 1992
"... On-line first order backpropagation is sufficiently fast and effective for many large-scale classification problems but for very high precision mappings, batch processing may be the method of choice. This paper reviews first- and second-order optimization methods for learning in feedforward neura ..."
Abstract - Cited by 177 (7 self)
On-line first order backpropagation is sufficiently fast and effective for many large-scale classification problems but for very high precision mappings, batch processing may be the method of choice. This paper reviews first- and second-order optimization methods for learning in feedforward neural networks. The viewpoint is that of optimization: many methods can be cast in the language of optimization techniques, allowing the transfer to neural nets of detailed results about computational complexity and safety procedures to ensure convergence and to avoid numerical problems. The review is not intended to deliver detailed prescriptions for the most appropriate methods in specific applications, but to illustrate the main characteristics of the different methods and their mutual relations.

An Analysis of Particle Swarm Optimizers

by Frans Van Den Bergh, 2001
"... ..."
Abstract - Cited by 170 (2 self)
Abstract not found

Neural networks for the prediction and forecasting of water . . .

by Holger R. Maier, Graeme C. Dandy, 2000
"... ..."
Abstract - Cited by 134 (7 self)
Abstract not found

Fast Exact Multiplication by the Hessian

by Barak A. Pearlmutter - Neural Computation, 1994
"... Just storing the Hessian H (the matrix of second derivatives d^2 E/dw_i dw_j of the error E with respect to each pair of weights) of a large neural network is difficult. Since a common use of a large matrix like H is to compute its product with various vectors, we derive a technique that directly ca ..."
Abstract - Cited by 93 (5 self)
Just storing the Hessian H (the matrix of second derivatives d^2 E/dw_i dw_j of the error E with respect to each pair of weights) of a large neural network is difficult. Since a common use of a large matrix like H is to compute its product with various vectors, we derive a technique that directly calculates Hv, where v is an arbitrary vector. This allows H to be treated as a generalized sparse matrix. To calculate Hv, we first define a differential operator R{f(w)} = (d/dr)f(w + rv)|_{r=0}, note that R{grad_w} = Hv and R{w} = v, and then apply R{} to the equations used to compute grad_w. The result is an exact and numerically stable procedure for computing Hv, which takes about as much computation, and is about as local, as a gradient evaluation. We then apply the technique to backpropagation networks, recurrent backpropagation, and stochastic Boltzmann Machines. Finally, we show that this technique can be used at the heart of many iterative techniques for computing various properties of H, obviating the need for direct methods.

Citation Context

...backwards pass of the ∇E algorithm becoming a forward pass in the R{} algorithm, while here the direction of the equations is unchanged. The same algorithm was also discovered, with yet another derivation, by (Møller, 1993a). For convenience, we will now change our notation for indexing the weights w. Let w be the weights, now doubly indexed by their source and destination units' indices, as in w_ij, the weight from unit...
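
For readers who want to try the R{} operator described in the abstract above, the sketch below (our own example; the toy loss and names are assumptions, not Pearlmutter's code) realizes R{grad_w E} = Hv by forward-mode differentiation of the gradient function, which is one standard way to obtain the same exact Hessian-vector product in an autodiff framework.

import jax
import jax.numpy as jnp

def loss(w, x, t):
    # Squared error of a tiny linear model; stands in for the error E(w).
    return 0.5 * jnp.sum((x @ w - t) ** 2)

def hessian_vector_product(f, w, v, *args):
    # R{grad_w f} = (d/dr) grad_w f(w + r v) at r = 0, i.e. H v,
    # computed by pushing the tangent v through the gradient function.
    grad_f = lambda w_: jax.grad(f)(w_, *args)
    _, Hv = jax.jvp(grad_f, (w,), (v,))
    return Hv

# Toy usage (ours): for this quadratic loss H = x^T x, so the result can be checked.
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (5, 3))
t = jnp.ones(5)
w = jnp.zeros(3)
v = jnp.array([1.0, 0.0, -1.0])
Hv = hessian_vector_product(loss, w, v, x, t)
assert jnp.allclose(Hv, x.T @ x @ v, atol=1e-5)

As the abstract notes, the cost is roughly that of one extra gradient-like pass, and H itself is never stored.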

Combinatorial codes in ventral temporal lobe for object recognition:

by Stephen José Hanson, Toshihiko Matsuka, James V. Haxby - Neuroimage, 2001
"... Haxby et al. [Science 293 (2001) 2425] recently argued that categoryrelated responses in the ventral temporal (VT) lobe during visual object identification were overlapping and distributed in topography. This observation contrasts with prevailing views that object codes are focal and localized to ..."
Abstract - Cited by 93 (11 self)
Haxby et al. [Science 293 (2001) 2425] recently argued that category-related responses in the ventral temporal (VT) lobe during visual object identification were overlapping and distributed in topography. This observation contrasts with prevailing views that object codes are focal and localized to specific areas such as the fusiform and parahippocampal gyri. We provide a critical test of Haxby's hypothesis using a neural network (NN) classifier that can detect more general topographic representations and achieves 83% correct generalization performance on patterns of voxel responses in out-of-sample tests. Using voxel-wise sensitivity analysis we show that substantially the same VT lobe voxels contribute to the classification of all object categories, suggesting the code is combinatorial. Moreover, we found no evidence for local single category representations. The neural network representations of the voxel codes were sensitive to both category and superordinate level features that were only available implicitly in the object categories. © 2004 Elsevier Inc. All rights reserved.

Citation Context

... nodes, where we used the softmax function (also known as a smooth version of the winner-take-all activation function) for obtaining their activations: O_k = exp(a_{O_k}) / sum_m exp(a_{O_m}), where a_{O_k} = sum_j h_j w_{jk}. This softmax function normalizes the outputs (i.e., each output lies between 0 and 1 and they sum to unity). The error function for our NN classifier was the cross-entropy function: E = -sum_{n=1}^{N} sum_{k=1}^{K} t_{nk} ln(O_{nk} / t_{nk}). Scaled conjugate gradient. The scaled conjugate gradient (SCG) method is a variant of the conjugate gradient method that uses a Levenberg–Marquardt approach for finding an appropriate step size (Møller, 1993). Instead of using a computation-intensive line search procedure, SCG uses the approximated Hessian matrix (multiplied by the direction vector) to scale the step size a_j. To find the appropriate step size, only the Hessian matrix multiplied by a conjugate direction vector d_j is needed. This Hessian matrix product can be computed approximately and rather efficiently for multilayer perceptrons by using central differences (Bishop, 1995). However, to maintain positive definiteness of the Hessian, a scalar lambda_j is included in the computation: s_j = (f'(w_j + e_j d_j) - f'(w_j)) / e_j + lambda_j d_j, where e_j is a small number. The step size...
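
The formulas in this context translate directly into code. The sketch below (our own illustration with assumed names and toy data, not the authors' classifier) implements the softmax output layer O_k = exp(a_k)/sum_m exp(a_m), the cross-entropy error, and the finite-difference term s_j = (f'(w_j + e_j d_j) - f'(w_j))/e_j + lambda_j d_j from which SCG derives its step size without a line search.

import jax
import jax.numpy as jnp

def softmax_outputs(h, W):
    # O_k = exp(a_k) / sum_m exp(a_m), with a_k = sum_j h_j W_jk.
    a = h @ W
    a = a - jnp.max(a)                        # subtract the max for numerical stability
    e = jnp.exp(a)
    return e / jnp.sum(e)

def cross_entropy(W, acts, T):
    # E = -sum_n sum_k t_nk ln O_nk (the constant t ln t term is dropped).
    O = jax.vmap(lambda h: softmax_outputs(h, W))(acts)
    return -jnp.sum(T * jnp.log(O + 1e-12))

def scg_hessian_direction(grad_fn, w, d, eps=1e-4, lam=1e-3):
    # s = (f'(w + eps d) - f'(w)) / eps + lam d: a finite-difference
    # approximation of (Hessian + lam I) d, with no explicit Hessian.
    return (grad_fn(w + eps * d) - grad_fn(w)) / eps + lam * d

# Toy usage (ours): 8 patterns of 4 hidden activations, 3 output classes.
key = jax.random.PRNGKey(0)
acts = jax.random.normal(key, (8, 4))
T = jnp.eye(3)[jnp.array([0, 1, 2, 0, 1, 2, 0, 1])]
grad_fn = jax.grad(lambda w_flat: cross_entropy(w_flat.reshape(4, 3), acts, T))
w = jnp.zeros(4 * 3)
d = -grad_fn(w)                               # current search direction
s = scg_hessian_direction(grad_fn, w, d)
alpha = -(d @ grad_fn(w)) / (d @ s)           # SCG step size from the quadratic model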

Making Use of Population Information in Evolutionary Artificial Neural Networks

by Xin Yao, Yong Liu, 1998
"... This paper is concerned with the simultaneous evolution of artificial neural network (ANN) architectures and weights. The current practice in evolving ANNs is to choose the best ANN in the last generation as the final result. This paper proposes a different approach to form the final result by combi ..."
Abstract - Cited by 87 (25 self)
This paper is concerned with the simultaneous evolution of artificial neural network (ANN) architectures and weights. The current practice in evolving ANNs is to choose the best ANN in the last generation as the final result. This paper proposes a different approach to form the final result by combining all the individuals in the last generation in order to make best use of all the information contained in the whole population. This approach regards a population of ANNs as an ensemble and uses a combination method to integrate them. Although there has been some work on integrating ANN modules [2], [3], little has been done in evolutionary learning to make best use of its population information. Four linear combination methods have been investigated in this paper to illustrate our ideas. Three real world data sets have been used in our experimental studies, which show that the recursive least square (RLS) algorithm always produces an integrated system that outperforms the best individua...

Efficient weight learning for Markov logic networks

by Daniel Lowd, Pedro Domingos - In Proceedings of the Eleventh European Conference on Principles and Practice of Knowledge Discovery in Databases, 2007
"... Abstract. Markov logic networks (MLNs) combine Markov networks and first-order logic, and are a powerful and increasingly popular representation for statistical relational learning. The state-of-the-art method for discriminative learning of MLN weights is the voted perceptron algorithm, which is ess ..."
Abstract - Cited by 87 (7 self)
Markov logic networks (MLNs) combine Markov networks and first-order logic, and are a powerful and increasingly popular representation for statistical relational learning. The state-of-the-art method for discriminative learning of MLN weights is the voted perceptron algorithm, which is essentially gradient descent with an MPE approximation to the expected sufficient statistics (true clause counts). Unfortunately, these can vary widely between clauses, causing the learning problem to be highly ill-conditioned, and making gradient descent very slow. In this paper, we explore several alternatives, from per-weight learning rates to second-order methods. In particular, we focus on two approaches that avoid computing the partition function: diagonal Newton and scaled conjugate gradient. In experiments on standard SRL datasets, we obtain order-of-magnitude speedups, or more accurate models given comparable learning times.

Citation Context

...ches, which involve computing the function as well as its gradient. These include most conjugate gradient and quasi-Newton methods (e.g., L-BFGS). Two exceptions to this are scaled conjugate gradient [12] and Newton's method with a diagonalized Hessian [1]. In this paper we show how they can be applied to MLN learning, and verify empirically that they greatly speed up convergence. We also obtain good ...
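
For the second of the two methods named here, a generic diagonalized-Newton update is easy to state. The sketch below is a hedged, generic illustration (our own toy objective and names, not the MLN-specific estimator developed in the paper): each gradient component is rescaled by the corresponding diagonal entry of the Hessian, giving a per-weight learning rate that counteracts ill-conditioning.

import jax
import jax.numpy as jnp

def diagonal_newton_step(f, w, alpha=1.0, floor=1e-4):
    # w <- w - alpha * g / max(diag(H), floor): a per-weight learning rate
    # taken from the diagonal of the Hessian of the objective f.
    g = jax.grad(f)(w)
    diag_H = jnp.diag(jax.hessian(f)(w))      # fine for small w; use HVPs at scale
    return w - alpha * g / jnp.maximum(diag_H, floor)

# Toy usage (ours): a badly scaled quadratic, where each weight needs its own step size.
scales = jnp.array([1.0, 100.0, 10000.0])
f = lambda w: 0.5 * jnp.sum(scales * w ** 2)
w = diagonal_newton_step(f, jnp.ones(3))       # reaches the optimum in one step here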

SNNS: Stuttgart Neural Network Simulator - Manual Extensions of Version 4.0

by Andreas Zell, Gunter Mamier, Michael Vogt, Niels Mache, Ralf Hubner, Sven Doring, Kai-uwe Herrmann, Tobias Soyez, Michael Schmalzl, Tilman Sommer, Artemis Hatzigeorgiou, Dietmar Posselt, Tobias Schreiner, Bernward Kett, Gianfranco Clemente, Martin Reczko, Martin Riedmiller, Mark Seemann, Marcus Ritt, Jamie Decoster, Jochen Biedermann, Joachim Danz, Christian Wehrfritz, Olf Werner, Michael Berthold
"... This document is prepared for all those users that already have an older manual and wish to receive only an update for the new functionality. We do like to point out, however, that this document can only contain those parts that are entirely new. All those little improvements, cross references, and ..."
Abstract - Cited by 68 (0 self)
This document is prepared for all those users who already have an older manual and wish to receive only an update for the new functionality. We would like to point out, however, that this document can only contain those parts that are entirely new. All the little improvements, cross references, and additions in chapters that are mainly unchanged cannot be included here. So we encourage users either to use the complete manual online or at least to print out every other release of the manual in order to become aware of these changes. Beginners should in any case start with the full documentation and ignore these excerpts.