Results 1-10 of 24
Slow, decorrelated features for pretraining complex cell-like networks
Advances in Neural Information Processing Systems 22, 2009
Cited by 9 (2 self)
We introduce a new type of neural network activation function based on recent physiological rate models for complex cells in visual area V1. A single-hidden-layer neural network of this kind of model achieves 1.50% error on MNIST. We also introduce an existing criterion for learning slow, decorrelated features as a pretraining strategy for image models. This pretraining strategy results in orientation-selective features, similar to the receptive fields of complex cells. With this pretraining, the same single-hidden-layer model achieves 1.34% error, even though the pretraining sample distribution is very different from the fine-tuning distribution. To implement this pretraining strategy, we derive a fast algorithm for online learning of decorrelated features such that each iteration of the algorithm runs in linear time with respect to the number of features.
Scalable Kernel Methods via Doubly Stochastic Gradients
2014
Cited by 9 (1 self)
The general perception is that kernel methods are not scalable, and neural nets are the methods of choice for large-scale nonlinear learning problems. Or have we simply not tried hard enough for kernel methods? Here we propose an approach that scales up kernel methods using a novel concept called “doubly stochastic functional gradients”. Our approach relies on the fact that many kernel methods can be expressed as convex optimization problems, and we solve the problems by making two unbiased stochastic approximations to the functional gradient, one using random training points and another using random features associated with the kernel, and then descending using this noisy functional gradient. Our algorithm is simple, does not need to commit to a preset number of random features, and allows the flexibility of the function class to grow as we see more incoming data in the streaming setting. We show that a function learned by this procedure after t iterations converges to the optimal function in the reproducing kernel Hilbert space at rate O(1/t), and achieves a generalization performance of O(1/√t). Our approach can readily scale kernel methods up to the regimes which are dominated by neural nets. We show that our method can achieve performance competitive with neural nets on datasets such as 2.3 million energy materials from MolecularSpace, 8 million handwritten digits from MNIST, and 1 million photos from ImageNet using convolution features.
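The two stochastic approximations described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a Gaussian RBF kernel approximated by random Fourier features, a squared loss, and a step size of theta / t. The names `fit_doubly_stochastic`, `predict`, `theta`, and `reg` are hypothetical. Each iteration draws one random training point and one fresh random feature, then adds one coefficient to the function.

```python
import numpy as np


def fit_doubly_stochastic(X, y, n_iters=200, sigma=1.0, theta=1.0, reg=0.0, seed=0):
    """Sketch: SGD in function space with one new random feature per step.

    Assumptions (not from the abstract): RBF kernel, squared loss,
    step size theta / t, optional L2 regularization `reg`.
    Returns (omegas, offsets, alphas) with f(x) = sum_t alphas[t] * phi_t(x).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    omegas = np.zeros((n_iters, d))
    offsets = np.zeros(n_iters)
    alphas = np.zeros(n_iters)
    for t in range(n_iters):
        gamma = theta / (t + 1)
        i = rng.integers(n)                                  # random training point
        omegas[t] = rng.normal(scale=1.0 / sigma, size=d)    # random feature
        offsets[t] = rng.uniform(0.0, 2.0 * np.pi)
        # Current prediction at x_i, using the features drawn so far.
        past = np.sqrt(2.0) * np.cos(omegas[:t] @ X[i] + offsets[:t])
        f_xi = alphas[:t] @ past
        phi_t = np.sqrt(2.0) * np.cos(omegas[t] @ X[i] + offsets[t])
        alphas[:t] *= 1.0 - gamma * reg                      # shrink old coefficients
        alphas[t] = -gamma * (f_xi - y[i]) * phi_t           # noisy functional gradient step
    return omegas, offsets, alphas


def predict(model, X):
    """Evaluate f on the rows of X."""
    omegas, offsets, alphas = model
    feats = np.sqrt(2.0) * np.cos(X @ omegas.T + offsets)
    return feats @ alphas
```

In this sketch the number of stored coefficients grows by one per iteration, which mirrors the abstract's remark that the function class can grow as more data arrive.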
Online Incremental Feature Learning with Denoising Autoencoders
Cited by 8 (0 self)
While determining model complexity is an important problem in machine learning, many feature learning algorithms rely on cross-validation to choose an optimal number of features, which is usually challenging for online learning from a massive stream of data. In this paper, we propose an incremental feature learning algorithm to determine the optimal model complexity for large-scale, online datasets based on the denoising autoencoder. This algorithm is composed of two processes: adding features and merging features. Specifically, it adds new features to minimize the objective function’s residual and merges similar features to obtain a compact feature representation and prevent overfitting. Our experiments show that the proposed model quickly converges to the optimal number of features in a large-scale online setting. In classification tasks, our model outperforms the (non-incremental) denoising autoencoder, and deep networks constructed from our algorithm perform favorably compared to deep belief networks and stacked denoising autoencoders. Further, the algorithm is effective in recognizing new patterns when the data distribution changes over time in the massive online data stream.
Normalized online learning
Cited by 5 (3 self)
We introduce online learning algorithms which are independent of feature scales, proving regret bounds dependent on the ratio of scales existent in the data rather than the absolute scale. This has several useful effects: there is no need to pre-normalize data, the test-time and test-space complexity are reduced, and the algorithms are more robust.
Transformation pursuit for image classification
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014
Cited by 5 (0 self)
A simple approach to learning invariances in image classification consists in augmenting the training set with transformed versions of the original images. However, given a large set of possible transformations, selecting a compact subset is challenging. Indeed, not all transformations are equally informative, and adding uninformative transformations increases training time with no gain in accuracy. We propose a principled algorithm, Image Transformation Pursuit (ITP), for the automatic selection of a compact set of transformations. ITP works in a greedy fashion, selecting at each iteration the transformation that yields the highest accuracy gain. ITP also allows efficient exploration of complex transformations that combine basic transformations. We report results on two public benchmarks: the CUB dataset of bird images and the ImageNet 2010 challenge. Using Fisher Vector representations, we achieve an improvement from 28.2% to 45.2% in top-1 accuracy on CUB, and an improvement from 70.1% to 74.9% in top-5 accuracy on ImageNet. We also show significant improvements for deep convnet features: from 47.3% to 55.4% on CUB and from 77.9% to 81.4% on ImageNet.
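The greedy selection described in this abstract can be sketched generically as forward selection. This is not the ITP implementation: `greedy_select` and `score` are hypothetical names, and stopping when no candidate gives a positive gain is an assumption. In ITP the score would presumably be a validation accuracy obtained with the chosen transformations; here it is a toy function.

```python
def greedy_select(candidates, score, max_size):
    """Greedy forward selection: repeatedly add the candidate with the largest gain.

    `score(subset)` is a user-supplied (hypothetical) function returning a number
    for a list of candidates. Stops at `max_size` or when no candidate has a
    positive gain (the stopping rule is an assumption of this sketch).
    """
    selected = []
    best = score(selected)
    while len(selected) < max_size:
        gains = [
            (score(selected + [c]) - best, c)
            for c in candidates
            if c not in selected
        ]
        if not gains:
            break
        gain, choice = max(gains, key=lambda g: g[0])
        if gain <= 0:
            break
        selected.append(choice)
        best += gain
    return selected, best


# Toy usage: each transformation contributes a fixed, made-up accuracy gain.
toy_gain = {"rotate": 0, "flip": 3, "crop": 2, "blur": -1}
chosen, total = greedy_select(list(toy_gain), lambda s: sum(toy_gain[c] for c in s), 3)
```

With the toy gains above, the uninformative and harmful transformations are never selected, which is the behaviour the abstract motivates.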
Online Training on a Budget of Support Vector Machines Using Twin Prototypes
2010
Cited by 4 (0 self)
This paper proposes the twin prototype support vector machine (TVM), a constant-space and sublinear-time support vector machine (SVM) algorithm for online learning. TVM achieves its favorable scaling by memorizing only a fixed-size data summary in the form of example prototypes and their associated information during training. In addition, TVM guarantees that the optimal SVM solution is maintained on all prototypes at any time. To maximize the accuracy of TVM, prototypes are constructed to approximate the data distribution near the decision boundary. Given a new training example, TVM is updated in three steps. First, the new example is added as a new prototype if it is near the decision boundary. If this happens, to maintain the budget, either the prototype farthest away from the decision boundary is removed or two near prototypes are selected and merged into a single one. Finally, TVM is updated by incremental and decremental techniques to account for the change. Several methods for prototype merging were proposed and experimentally evaluated. TVM algorithms with hinge loss and ramp loss were implemented and thoroughly tested on 12 large datasets. In most cases, the accuracy of low-budget TVMs was comparable with that of resource-unconstrained SVMs. Additionally, the TVM accuracy was substantially larger than that of an SVM trained on a random sample of the same size. An even larger difference in accuracy was observed when comparing with Forgetron, a popular budgeted kernel perceptron algorithm. As expected, the difference in accuracy between hinge loss and ramp loss TVM was negligible, and the hinge loss version is preferable due to its lower computational cost. The results illustrate that highly accurate online SVMs could be trained from arbitrarily large data streams using devices with severely limited memory budgets.
Parallel support vector machines in practice
arXiv preprint arXiv:1404.1066, 2014
Cited by 4 (1 self)
In this paper, we evaluate the performance of various parallel optimization methods for Kernel Support Vector Machines on multicore CPUs and GPUs. In particular, we provide the first comparison of algorithms with explicit and implicit parallelization. Most existing parallel implementations for multicore or GPU architectures are based on explicit parallelization of Sequential Minimal Optimization (SMO): the programmers identified parallelizable components and hand-parallelized them, specifically tuned for a particular architecture. We compare these approaches with each other and with implicitly parallelized algorithms, where the algorithm is expressed such that most of the work is done within a few iterations with large dense linear algebra operations. These can be computed with highly optimized libraries that are carefully parallelized for a large variety of parallel platforms. We highlight the advantages and disadvantages of both approaches and compare them on various benchmark data sets. We find an approximate implicitly parallel algorithm which is surprisingly efficient, permits a much simpler implementation, and leads to unprecedented speedups in SVM training.
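As a small illustration of what "implicit parallelization" means here (not the algorithm evaluated in the paper), the sketch below computes an RBF Gram matrix two ways: with explicit per-entry loops, and as a few dense linear algebra operations that a tuned BLAS library can parallelize on its own. The function names and the `gamma` parameter are hypothetical.

```python
import numpy as np


def rbf_gram_loops(X, Z, gamma):
    """Entry-by-entry RBF Gram matrix; any parallelism would have to be hand-written."""
    K = np.empty((X.shape[0], Z.shape[0]))
    for i in range(X.shape[0]):
        for j in range(Z.shape[0]):
            diff = X[i] - Z[j]
            K[i, j] = np.exp(-gamma * (diff @ diff))
    return K


def rbf_gram_dense(X, Z, gamma):
    """Same matrix via one matrix product plus elementwise operations."""
    sq = (X**2).sum(axis=1)[:, None] + (Z**2).sum(axis=1)[None, :] - 2.0 * (X @ Z.T)
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-gamma * np.maximum(sq, 0.0))
```

Both functions return the same matrix up to rounding; the dense version delegates almost all of the work to a single matrix product.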
Invariant Time-Series Classification
Cited by 3 (3 self)
Time-series classification is a field of machine learning that has attracted considerable attention in recent decades. Time-series applications range from medical diagnosis to financial econometrics. Support Vector Machines (SVMs) are reported to perform non-optimally in the domain of time series, because they struggle to detect similarities when abundant training instances are lacking. In this study we present a novel time-series transformation method which significantly improves the performance of SVMs. Our transformation method enlarges the training set by creating new transformed instances from the support vector instances. The new transformed instances encapsulate the intraclass variations required to redefine the maximum margin decision boundary. The proposed transformation method utilizes the variance distributions from the intraclass warping maps to build transformation fields, which are applied to series instances using the Moving Least Squares algorithm. Extensive experiments on 35 time-series datasets demonstrate the superiority of the proposed method compared to both the Dynamic Time Warping version of the Nearest Neighbor classifier and the SVM classifier, outperforming them in the majority of the experiments.
Locally Linear Landmarks for Large-Scale Manifold Learning
Cited by 2 (1 self)
Spectral methods for manifold learning and clustering typically construct a graph weighted with affinities from a dataset and compute eigenvectors of a graph Laplacian. With large datasets, the eigendecomposition is too expensive, and is usually approximated by solving for a smaller graph defined on a subset of the points (landmarks) and then applying the Nyström formula to estimate the eigenvectors over all points. This has the problem that the affinities between landmarks do not benefit from the remaining points and may poorly represent the data if using few landmarks. We introduce a modified spectral problem that uses all data points by constraining the latent projection of each point to be a local linear function of the landmarks' latent projections. This constructs a new affinity matrix between landmarks that preserves manifold structure even with few landmarks, allows one to reduce the eigenproblem size, and defines a fast, nonlinear out-of-sample mapping.
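The landmark-plus-Nyström baseline that this abstract contrasts itself with can be sketched as follows. This shows the baseline only, not the Locally Linear Landmarks method. It assumes a Gaussian affinity and, for brevity, works with the raw affinity matrix rather than a normalized graph Laplacian. The function names and the `sigma` and `k` parameters are hypothetical.

```python
import numpy as np


def gaussian_affinity(A, B, sigma=0.5):
    """Gaussian affinities between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def nystrom_eigenvectors(X, landmark_idx, k=3, sigma=0.5):
    """Solve the eigenproblem on the landmarks, then extend to all points.

    Uses the Nystrom formula U_all = W_XL @ U_L / lam, where (lam, U_L) are the
    top-k eigenpairs of the landmark affinity matrix W_LL.
    """
    L = X[landmark_idx]
    W_LL = gaussian_affinity(L, L, sigma)
    evals, evecs = np.linalg.eigh(W_LL)          # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:k]            # indices of the k largest
    lam, U_L = evals[top], evecs[:, top]
    W_XL = gaussian_affinity(X, L, sigma)        # affinities of all points to landmarks
    U_all = W_XL @ U_L / lam
    return U_all, U_L, lam
```

Note that only landmark-to-landmark and point-to-landmark affinities enter this computation, which is the limitation the abstract points out.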
Automatically enhancing locality for tree traversals with traversal splicing
In Proceedings of the 2012 ACM international, 2012
Cited by 2 (2 self)
Generally applicable techniques for improving temporal locality in irregular programs, which operate over pointer-based data structures such as trees and graphs, are scarce. Focusing on a subset of irregular programs, namely, tree traversal algorithms like Barnes-Hut and nearest neighbor, previous work has proposed point blocking, a technique analogous to loop tiling in regular programs, to improve locality. However, point blocking is highly dependent on point sorting, a technique to reorder points so that consecutive points will have similar traversals. Performing this a priori sort requires an understanding of the semantics of the algorithm and hence highly application-specific techniques. In this work, we propose traversal splicing, a new, general, automatic locality optimization for irregular tree traversal codes that is less sensitive to point order, and hence can deliver substantially better performance, even in the absence of semantic information. For six benchmark algorithms, we show that traversal splicing can deliver single-thread speedups of up to 9.147× (geometric mean: 3.095×) over baseline implementations, and up to 4.752× (geometric mean: 2.079×) over point-blocked implementations. Further, we show that in many cases, automatically applying traversal splicing to a baseline implementation yields performance that is better than carefully hand-optimized implementations.