## Kernel Methods for Deep Learning

Citations: | 26 - 2 self |

### Citations

13219 | An Overview of Statistical Learning Theory - Vapnik - 1999 |

6495 | LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm - Chang, Lin - 2001 |

3701 | Support-vector networks - Cortes, Vapnik - 1995 |

2379 | An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods - Cristianini, Shawe-Taylor - 2000 |

2308 | Independent Component Analysis - Hyvärinen, Karhunen, et al. - 2001 |

1863 | A training algorithm for optimal margin classifiers - Boser, Guyon, et al. - 1992 |

1572 | Nonlinear Component Analysis as a Kernel Eigenvalue Problem - Schölkopf, Smola, et al. - 1998
Citation Context ...(MKMs) perform very competitively on multiclass data sets designed to foil shallow architectures [10]. 3.1 Multilayer kernel machines. We explored how to train MKMs in stages that involve kernel PCA [12] and feature selection [13] at intermediate hidden layers and large-margin nearest neighbor classification [14] at the final output layer. Specifically, for ℓ-layer MKMs, we considered the following t... |

1529 | Gradient-based learning applied to document recognition - LeCun, Bottou, et al. - 1998 |

1415 | LIBLINEAR: A library for large linear classification - Fan, Chang, et al. - 2008 |

1352 | An introduction to variable and feature selection - Guyon, Elisseeff
Citation Context ...tively on multiclass data sets designed to foil shallow architectures [10]. 3.1 Multilayer kernel machines. We explored how to train MKMs in stages that involve kernel PCA [12] and feature selection [13] at intermediate hidden layers and large-margin nearest neighbor classification [14] at the final output layer. Specifically, for ℓ-layer MKMs, we considered the following training procedure: 1. Prune... |

1215 | Learning Representations by Back-propagating Errors - Rumelhart, Hinton, et al. - 1986 |

1081 | Learning with kernels: support vector machines, regularization, optimization, and beyond - Schölkopf, Smola - 2002 |

970 | A fast learning algorithm for deep belief nets - Hinton, Osindero, et al.
Citation Context ...al nets, over shallow architectures, such as support vector machines (SVMs) [1]. Deep architectures learn complex mappings by transforming their inputs through multiple layers of nonlinear processing [2]. Researchers have advanced several motivations for deep architectures: the wide range of functions that can be parameterized by composing weakly nonlinear transformations, the appeal of hierarchical ... |

805 | Bayesian Learning for Neural Networks - Neal - 1996
Citation Context ...$f(x) \cdot f(y) = \sum_{i=1}^{m} \Theta(w_i \cdot x)\,\Theta(w_i \cdot y)\,(w_i \cdot x)^n (w_i \cdot y)^n$ (8), where m is the number of output units. The connection with the arc-cosine kernel function emerges in the limit of very large networks [9, 7]. Imagine that the network has an infinite number of output units, and that the weights $W_{ij}$ are Gaussian distributed with zero mean and unit variance. In this limit, we see that eq. (8) reduces to eq.... |
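
The large-network limit described in this citation context can be illustrated numerically. The sketch below is not taken from the paper's code: the function names are hypothetical, the `2/m` rescaling of eq. (8) is an assumption about how the sum is normalized in the limit, and the closed form `1 - theta/pi` for degree n = 0 is assumed from the relation J0(θ) = π − θ quoted elsewhere on this page together with a 1/π prefactor recalled from the paper.

```python
import numpy as np

def threshold_features_kernel(x, y, n=0, m=200_000, seed=0):
    """Monte Carlo version of eq. (8), rescaled by 2/m (the rescaling is an assumption)."""
    rng = np.random.default_rng(seed)
    # Weights drawn i.i.d. Gaussian with zero mean and unit variance, as in the context.
    W = rng.standard_normal((m, x.shape[0]))
    wx, wy = W @ x, W @ y
    # Theta(w.x) * Theta(w.y): a unit contributes only if both projections are positive.
    active = (wx > 0) & (wy > 0)
    # On active units wx and wy are positive, so abs() leaves the values unchanged.
    terms = np.where(active, np.abs(wx) ** n * np.abs(wy) ** n, 0.0)
    return 2.0 * terms.mean()

def arccos_kernel_degree0(x, y):
    """Assumed closed form for n = 0: k_0(x, y) = 1 - theta/pi."""
    cos_theta = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    return 1.0 - theta / np.pi

x = np.array([1.0, 0.0, 0.0])
y = np.array([1.0, 1.0, 0.0])
print(threshold_features_kernel(x, y), arccos_kernel_degree0(x, y))
```

With a fixed seed the two printed numbers should agree to within Monte Carlo error, which shrinks roughly as one over the square root of `m`.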

796 | Reducing the Dimensionality of Data with Neural Networks - Hinton, Salakhutdinov - 2006 |

775 | Learning the kernel matrix with semidefinite programming - Lanckriet, Cristianini, et al.
Citation Context ...rently experimenting with arc-cosine kernel functions of fractional and (even negative) degree n. For MKMs, we are hoping to explore better schemes for feature selection [21, 22] and kernel selection [23]. Also, it would be desirable to incorporate prior knowledge, such as the invariances modeled by convolutional neural nets [24, 4], though it is not obvious how to do so. These issues and others are l... |

699 | Adaptive Control Processes: A Guided Tour - Bellman - 1961 |

695 | Distance metric learning for large margin nearest neighbor classification - Weinberger, Saul
Citation Context ...ltilayer kernel machines. We explored how to train MKMs in stages that involve kernel PCA [12] and feature selection [13] at intermediate hidden layers and large-margin nearest neighbor classification [14] at the final output layer. Specifically, for ℓ-layer MKMs, we considered the following training procedure: 1. Prune uninformative features from the input space. 2. Repeat ℓ times: (a) Compute princip... |

607 | Extensions of Lipschitz mappings into a Hilbert space - Johnson, Lindenstrauss - 1984 |

561 | NewsWeeder: learning to filter netnews - Lang - 1995 |

474 | Backpropagation applied to handwritten zip code recognition - LeCun, Boser - 1989
Citation Context ...plore better schemes for feature selection [21, 22] and kernel selection [23]. Also, it would be desirable to incorporate prior knowledge, such as the invariances modeled by convolutional neural nets [24, 4], though it is not obvious how to do so. These issues and others are left for future work. A. Derivation of kernel function. In this appendix, we show how to evaluate the multidimensional integral in eq... |

464 | Kernel independent component analysis - Bach, Jordan - 2003 |

461 | Sequential minimal optimization: A fast algorithm for training support vector machines - Platt - 1998 |

409 | UCI machine learning repository - Frank, Asuncion - 2010 |

398 | Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control - Aizerman, Braverman, et al. - 1964 |

394 | Greedy layer-wise training of deep networks - Bengio, Lamblin, et al. - 2007
Citation Context ...Mahalanobis distance metric for these outputs, though other methods are equally viable [16]. The use of LMNN is inspired by the supervised fine-tuning of weights in the training of deep architectures [17]. In MKMs, however, this supervised training only occurs at the final layer (which underscores the importance of feature selection in earlier layers). LMNN learns a distance metric by solving a proble... |

383 | Some results on Tchebycheffian spline functions - Kimeldorf, Wahba - 1971 |

346 | Neighbourhood components analysis - Goldberger, Roweis, et al. - 2004
Citation Context ...uts of the final layer. Specifically, we use large margin nearest neighbor (LMNN) classification [14] to learn a Mahalanobis distance metric for these outputs, though other methods are equally viable [16]. The use of LMNN is inspired by the supervised fine-tuning of weights in the training of deep architectures [17]. In MKMs, however, this supervised training only occurs at the final layer (which unde... |

340 | A unified architecture for natural language processing: Deep neural networks with multitask learning - Collobert, Weston - 2008
Citation Context ...rchical distributed representations, and the potential for combining unsupervised and supervised methods. Experiments have also shown the benefits of deep learning in several interesting applications [3, 4, 5]. Many issues surround the ongoing debate over deep versus shallow architectures [1, 6]. Deep architectures are generally more difficult to train than shallow ones. They involve difficult nonlinear op... |

336 | Learning deep architectures for AI. Foundations and Trends - Bengio - 2009
Citation Context ...rvised methods. Experiments have also shown the benefits of deep learning in several interesting applications [3, 4, 5]. Many issues surround the ongoing debate over deep versus shallow architectures [1, 6]. Deep architectures are generally more difficult to train than shallow ones. They involve difficult nonlinear optimizations and many heuristics. The challenges of deep learning explain the early and ... |

258 | Random features for large-scale kernel machines - Rahimi, Recht - 2007 |

252 | What is the best multistage architecture for object recognition - Jarrett, Kavukcuoglu, et al. - 2009 |

251 | Extracting and composing robust features with denoising autoencoders - Vincent, Larochelle, et al. - 2008 |

204 | Feature selection, L1 vs. L2 regularization, and rotational invariance - Ng - 2004 |

194 | Unsupervised learning of invariant feature hierarchies with applications to object recognition - Ranzato, Huang, et al. - 2007
Citation Context ...rchical distributed representations, and the potential for combining unsupervised and supervised methods. Experiments have also shown the benefits of deep learning in several interesting applications [3, 4, 5]. Many issues surround the ongoing debate over deep versus shallow architectures [1, 6]. Deep architectures are generally more difficult to train than shallow ones. They involve difficult nonlinear op... |

186 | Training invariant support vector machines - Decoste, Schölkopf - 2002 |

183 | Learning Deep Architectures for AI - Bengio - 2009 |

155 | Why does unsupervised pre-training help deep learning? - Erhan - 2010 |

154 | Learning a similarity metric discriminatively, with application to face verification - Chopra, Hadsell, et al. - 2005
Citation Context ... of LMNN is that the required optimization is convex. Test examples are classified by the energy-based decision rule for LMNN [14], which was itself inspired by earlier work on multilayer neural nets [18]. 3.2 Experiments on multiway classification. We evaluated MKMs on the two multiclass data sets from previous benchmarks [10] that exhibited the largest performance gap between deep and shallow archite... |

141 | An empirical evaluation of deep architectures on problems with many factors of variation - Larochelle, Erhan, et al. - 2007
Citation Context ...(ℓ). The best previous results are 24.04% for SVMs with RBF kernels and 22.50% for deep belief nets [10]. See text for details. ...n training examples as a validation set to choose the margin penalty parameter; after choosing this parameter by cross-validation, we then retrained each SVM using all the training exam... reference, we also report the best results obtained previously from three layer deep belief ne... 3) and S... |

120 | Scaling learning algorithms towards AI - Bengio, LeCun
Citation Context ... work in machine learning has highlighted the circumstances that appear to favor deep architectures, such as multilayer neural nets, over shallow architectures, such as support vector machines (SVMs) [1]. Deep architectures learn complex mappings by transforming their inputs through multiple layers of nonlinear processing [2]. Researchers have advanced several motivations for deep architectures: the ... |

110 | Exploring large feature spaces with hierarchical multiple kernel learning - Bach - 2008 |

92 | The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ - LeCun, Cortes, et al. - 1998
Citation Context ...KMs on the two multiclass data sets from previous benchmarks [10] that exhibited the largest performance gap between deep and shallow architectures. The data sets were created from the MNIST data set [19] of 28 × 28 grayscale handwritten digits. The mnist-back-rand data set was generated by filling the image background by random pixel values, while the mnist-back-image data set was generated by fillin... |

89 | Improving support vector machine classifiers by modifying kernel functions. Neural Networks - Amari, Wu - 1999 |

78 | Efficient agnostic learning of neural networks with bounded fan-in - Lee, Bartlett, et al. - 1996 |

77 | Fast support vector machine training and classification on graphics processors - Catanzaro, Sundaram, et al. - 2008 |

74 | Geometry and Invariance in Kernel Based Methods - Burges - 1999 |

74 | ICA using spacings estimates of entropy - Learned-Miller, Fisher - 2003 |

72 | Result analysis of the NIPS 2003 feature selection challenge - Guyon, Hur, et al. - 2004 |

68 | Functions of a Complex Variable: Theory and Technique. Hod Books - Carrier, Krook, et al. - 1983
Citation Context ...$n!\,(\sin\theta)^{2n+1} \int_0^{\pi/2} d\psi\, \frac{\cos^n\psi}{(1 - \cos\theta\cos\psi)^{n+1}}$ (16). To evaluate eq. (16), we first consider the special case n=0. The following result can be derived by contour integration in the complex plane [25]: $\int_0^{\pi/2} \frac{d\psi}{1 - \cos\theta\cos\psi} = \frac{\pi - \theta}{\sin\theta}$ (17). Substituting eq. (17) into our expression for the angular part of the kernel function in eq. (16), we recover our earlier claim that $J_0(\theta) = \pi - \theta$. Rel... |
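
The integral identity in eq. (17) of this citation context can be sanity-checked numerically. The following is only an illustrative sketch: the function names and the choice of a midpoint rule are assumptions made here, not part of the cited derivation, which uses contour integration.

```python
import numpy as np

def angular_integral(theta, num=200_000):
    """Midpoint-rule estimate of the left-hand side of eq. (17)."""
    h = (np.pi / 2) / num
    psi = (np.arange(num) + 0.5) * h  # midpoints of num equal subintervals of [0, pi/2]
    return np.sum(1.0 / (1.0 - np.cos(theta) * np.cos(psi))) * h

def closed_form(theta):
    """Right-hand side of eq. (17): (pi - theta) / sin(theta)."""
    return (np.pi - theta) / np.sin(theta)

for theta in (0.5, 1.0, 2.0):
    print(theta, angular_integral(theta), closed_form(theta))
```

Multiplying either side by sin θ gives π − θ, which matches the n = 0 case of eq. (16) quoted in the context.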

67 | Deep learning via semi-supervised embedding - Weston, Ratle, et al. - 2008
Citation Context ...deep architectures yet drawn to the elegance of kernel methods. In this paper, we explore the possibility of deep learning in kernel machines. Though we share a similar motivation as previous authors [20], our approach is very different. Our paper makes two main contributions. First, we develop a new family of kernel functions that mimic the computation in large neural nets. Second, using these kernel... |

65 | Convex neural networks - Bengio, Roux, et al. - 2005 |

57 | Iterative kernel principal component analysis for image modeling - Kim, Franz, et al. - 2005 |

55 | A useful theorem for nonlinear devices having Gaussian inputs - Price - 1958 |

47 | Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning - Rahimi, Recht - 2009 |

42 | On random weights and unsupervised feature learning - Saxe, Koh, et al. |

41 | Computation with Infinite Neural Networks - Williams - 1998
Citation Context ...) n The integral representation makes it straightforward to show that these kernel functions are positive semidefinite. The kernel function in eq. (1) has interesting connections to neural computation [7] that we explore further in sections 2.2–2.3. However, we begin by elucidating its basic properties. 2.1 Basic properties. We show how to evaluate the integral in eq. (1) analytically in the appendix. ... |

41 | Sparse Kernel Principal Component Analysis - Tipping - 2000
Citation Context ...ure work. For SVMs, we are currently experimenting with arc-cosine kernel functions of fractional and (even negative) degree n. For MKMs, we are hoping to explore better schemes for feature selection [21, 22] and kernel selection [23]. Also, it would be desirable to incorporate prior knowledge, such as the invariances modeled by convolutional neural nets [24, 4], though it is not obvious how to do so. The... |

35 | Two-stage learning kernel algorithms - Cortes, Mohri, et al. - 2010 |

35 | The statistical mechanics of learning a rule - Watkin, Rau, et al. - 1993 |

32 | Advances in Blind Source Separation (BSS) and Independent Component Analysis (ICA) for Nonlinear Mixtures - Jutten, Karhunen - 2004 |

31 | Sparse kernel feature analysis - Smola, Schölkopf - 1999
Citation Context ...ure work. For SVMs, we are currently experimenting with arc-cosine kernel functions of fractional and (even negative) degree n. For MKMs, we are hoping to explore better schemes for feature selection [21, 22] and kernel selection [23]. Also, it would be desirable to incorporate prior knowledge, such as the invariances modeled by convolutional neural nets [24, 4], though it is not obvious how to do so. The... |

27 | Nonlinear image representation using divisive normalization - Lyu, Simoncelli - 2008 |

26 | Permitted and forbidden sets in symmetric threshold-linear networks - Hahnloser, Seung, et al. - 2003
Citation Context ...ation functions. For n = 0, the activation function is a step function, and the network is an array of perceptrons. For n = 1, the activation function is a ramp function (or rectification nonlinearity [8]), and the mapping f(x) is piecewise linear. More generally, the nonlinear (non-polynomial) behavior of these networks is induced by thresholding on weighted sums. We refer to networks with these acti... |
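
The family of activation functions described in this citation context (a step function for n = 0, a ramp for n = 1) can be written compactly as Θ(z) zⁿ. The sketch below is illustrative only: the function name is hypothetical, and setting the output to 0 at exactly z = 0 is a convention assumed here rather than one stated in the context.

```python
import numpy as np

def threshold_activation(z, n):
    """Theta(z) * z**n: step function for n = 0, ramp (rectification) for n = 1."""
    z = np.asarray(z, dtype=float)
    # Assumed convention: inputs with z <= 0 map to 0.
    # abs() avoids 0**0 and negative-base edge cases; on z > 0 it equals z.
    return np.where(z > 0, np.abs(z) ** n, 0.0)

z = np.array([-2.0, -0.5, 0.5, 3.0])
for degree in (0, 1, 2):
    print(degree, threshold_activation(z, degree))
```

For n = 1 the output is piecewise linear in z, which is why the corresponding network mapping f(x) is described as piecewise linear.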

26 | Sampling methods for the Nyström method - Kumar, Mohri, et al. - 2012 |

9 | Symplectic nonlinear component analysis - Parra - 1996 |

5 | Large margin classification in infinite neural networks - Hermans, Cho, et al. - 2010 |

1 | Analysis and extension of arc-cosine kernels for large margin classification - Cho, Saul - 2012 |

1 | Nonlinear mixtures - Jutten, Babaie-Zadeh, et al. - 2010 |

1 | Unsupervised learning of invariant feature hierarchies with applications to object recognition - Ranzato, Huang, et al. - 2007 |