Results 1–10 of 17
KERNEL METHODS MATCH DEEP NEURAL NETWORKS ON TIMIT
Abstract

Cited by 5 (2 self)
Despite their theoretical appeal and grounding in tractable convex optimization techniques, kernel methods are often not the first choice for large-scale speech applications due to their significant memory requirements and computational expense. In recent years, randomized approximate feature maps have emerged as an elegant mechanism to scale up kernel methods. Still, in practice, a large number of random features is required to obtain acceptable accuracy in predictive tasks. In this paper, we develop two algorithmic schemes to address this computational bottleneck in the context of kernel ridge regression. The first scheme is a specialized distributed block coordinate descent procedure that avoids the explicit materialization of the feature-space data matrix, while the second scheme gains efficiency by combining multiple weak random feature models in an ensemble learning framework. We demonstrate that these schemes enable kernel methods to match the performance of state-of-the-art deep neural networks on TIMIT for speech recognition and classification tasks. In particular, we obtain the best classification error rates reported on TIMIT using kernel methods.
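The random-feature approach the abstract builds on can be illustrated with a minimal sketch (this is not the paper's distributed block coordinate descent or its ensemble scheme; the toy data, feature count, bandwidth, and regularizer below are all invented for illustration): approximate the Gaussian kernel with random Fourier features, then solve ridge regression in the explicit feature space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = sin(3x) + noise
n, d, D = 200, 1, 300              # samples, input dim, number of random features
X = rng.uniform(-1, 1, (n, d))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(n)

# Random Fourier features approximating the Gaussian kernel exp(-gamma ||x-y||^2):
# frequencies drawn from N(0, 2*gamma), phases uniform on [0, 2*pi)
gamma = 2.0                        # assumed bandwidth, chosen by hand
W = rng.normal(0, np.sqrt(2 * gamma), (d, D))
b = rng.uniform(0, 2 * np.pi, D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)   # n x D feature matrix

# Kernel ridge regression solved directly in the random-feature space
lam = 1e-3
w = np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)
train_mse = np.mean((Z @ w - y) ** 2)
```

The point of the paper's two schemes is precisely to avoid materializing `Z` (or to ensemble several small ones) when `n` and `D` are large; this sketch materializes it for clarity.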
Faster Ridge Regression via the Subsampled Randomized Hadamard Transform
Abstract

Cited by 5 (1 self)
We propose a fast algorithm for ridge regression when the number of features is much larger than the number of observations (p ≫ n). The standard way to solve ridge regression in this setting works in the dual space and gives a running time of O(n²p). Our algorithm, Subsampled Randomized Hadamard Transform Dual Ridge Regression (SRHT-DRR), runs in time O(np log(n)) and works by preconditioning the design matrix with a randomized Walsh-Hadamard transform followed by subsampling of features. We provide risk bounds for our SRHT-DRR algorithm in the fixed design setting and show experimental results on synthetic and real datasets.
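The two ingredients named in the abstract, a randomized Walsh-Hadamard mix of the features followed by subsampling, can be sketched as follows (only the transform step, not the full SRHT-DRR algorithm or its risk analysis; the sizes, seed, and regularizer are arbitrary):

```python
import numpy as np

def fwht(a):
    """Fast Walsh-Hadamard transform along axis 0 (length must be a power of 2)."""
    a = a.copy()
    h = 1
    while h < a.shape[0]:
        for i in range(0, a.shape[0], 2 * h):
            x, z = a[i:i + h].copy(), a[i + h:i + 2 * h].copy()
            a[i:i + h], a[i + h:i + 2 * h] = x + z, x - z
        h *= 2
    return a

rng = np.random.default_rng(1)
n, p, k = 8, 1024, 256             # few observations, many features, sketch size
X = rng.standard_normal((n, p))

# SRHT on the feature dimension: random signs, Hadamard mix, column subsample
signs = rng.choice([-1.0, 1.0], p)
Xh = fwht((X * signs).T).T / np.sqrt(p)    # mix features; rows keep their norms
cols = rng.choice(p, k, replace=False)
Xs = Xh[:, cols] * np.sqrt(p / k)          # rescaled subsample, n x k

# Dual ridge regression on the sketched design: only an n x n system remains
y = rng.standard_normal(n)
lam = 1e-2
alpha = np.linalg.solve(Xs @ Xs.T + lam * np.eye(n), y)
```

The Hadamard mix spreads each row's mass across all columns, which is what makes uniform column subsampling safe; the sketched Gram matrix `Xs @ Xs.T` approximates the full `X @ X.T` that the O(n²p) dual solver would compute exactly.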
Memory Efficient Kernel Approximation
2014
Abstract

Cited by 3 (1 self)
The scalability of kernel machines is a big challenge when facing millions of samples, due to the storage and computation costs of large kernel matrices, which are usually dense. Recently, many papers have suggested tackling this problem by using a low-rank approximation of the kernel matrix. In this paper, we first make the observation that the structure of shift-invariant kernels changes from low-rank to block-diagonal (without any low-rank structure) as the scale parameter varies. Based on this observation, we propose a new kernel approximation algorithm, Memory Efficient Kernel Approximation (MEKA), which considers both the low-rank and the clustering structure of the kernel matrix. We show that the resulting algorithm outperforms state-of-the-art low-rank kernel approximation methods in terms of speed, approximation error, and memory usage. As an example, on the mnist2m dataset with two million samples, our method takes 550 seconds on a single machine using less than 500 MB of memory to achieve 0.2313 test RMSE for kernel ridge regression, while standard Nyström approximation takes more than 2700 seconds and uses more than 2 GB of memory on the same problem to achieve 0.2318 test RMSE.
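The block plus low-rank observation can be seen on a toy example (a drastically simplified sketch of the idea, not the MEKA algorithm, which also models between-cluster blocks through shared low-rank factors; cluster assignments are given rather than computed, and all sizes are made up): with a small bandwidth and well-separated clusters, the RBF kernel matrix is nearly block-diagonal, and each diagonal block is itself nearly low-rank.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two well-separated clusters in 5-d
X = np.vstack([rng.normal(0, 0.3, (100, 5)), rng.normal(4, 0.3, (100, 5))])
gamma = 1.0

def rbf(A, B):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

K = rbf(X, X)

# Keep only the within-cluster blocks, each compressed to a rank-r factor
r = 10
blocks = [slice(0, 100), slice(100, 200)]
K_approx = np.zeros_like(K)
for s in blocks:
    vals, vecs = np.linalg.eigh(K[s, s])
    U = vecs[:, -r:] * np.sqrt(np.maximum(vals[-r:], 0))  # top-r eigenfactor
    K_approx[s, s] = U @ U.T

rel_err = np.linalg.norm(K - K_approx) / np.linalg.norm(K)
```

Storage drops from one dense 200x200 matrix to two 100x10 factors, which is the memory saving the abstract quantifies at scale on mnist2m.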
A Divide-and-Conquer Solver for Kernel Support Vector Machines
Abstract

Cited by 2 (2 self)
The kernel support vector machine (SVM) is one of the most widely used classification methods; however, the amount of computation required becomes the bottleneck when facing millions of samples. In this paper, we propose and analyze a novel divide-and-conquer solver for kernel SVMs (DC-SVM). In the division step, we partition the kernel SVM problem into smaller subproblems by clustering the data, so that each subproblem can be solved independently and efficiently. We show theoretically that the support vectors identified by the subproblem solutions are likely to be support vectors of the entire kernel SVM problem, provided that the problem …
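The division step can be sketched as follows (a toy stand-in, not DC-SVM itself: the clustering is given rather than learned, the dual solver is plain projected gradient on the bias-free SVM dual rather than a production solver, and all constants are invented):

```python
import numpy as np

rng = np.random.default_rng(6)

def rbf(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def dual_svm(K, y, C=1.0, lr=0.02, iters=1000):
    """Projected gradient ascent on the (bias-free) SVM dual."""
    Q = K * np.outer(y, y)
    a = np.zeros(len(y))
    for _ in range(iters):
        a = np.clip(a + lr * (1.0 - Q @ a), 0.0, C)
    return a

# Two far-apart clusters, each containing its own separable binary problem
centers = [np.zeros(2), np.array([10.0, 0.0])]
Xs, ys, models = [], [], []
for c in centers:
    Xc = np.vstack([c + [1, 0] + 0.2 * rng.standard_normal((25, 2)),
                    c - [1, 0] + 0.2 * rng.standard_normal((25, 2))])
    yc = np.hstack([np.ones(25), -np.ones(25)])
    models.append(dual_svm(rbf(Xc, Xc), yc))   # subproblems solved independently
    Xs.append(Xc)
    ys.append(yc)

def predict(x):
    # Route the query to the nearest cluster, use that cluster's model only
    i = int(np.argmin([np.linalg.norm(x - c) for c in centers]))
    return np.sign(models[i] * ys[i] @ rbf(Xs[i], x[None, :])[:, 0])

correct = [predict(x) == t for Xc, yc in zip(Xs, ys) for x, t in zip(Xc, yc)]
acc = float(np.mean(correct))
```

Because cross-cluster RBF values are negligible at this separation, solving each cluster in isolation changes the solution very little, which is the intuition behind the support-vector claim in the abstract.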
The randomized dependence coefficient
in Advances in Neural Information Processing Systems, 2013
Abstract

Cited by 2 (0 self)
We introduce the Randomized Dependence Coefficient (RDC), a measure of nonlinear dependence between random variables of arbitrary dimension, based on the Hirschfeld-Gebelein-Rényi Maximum Correlation Coefficient. RDC is defined in terms of correlation of random nonlinear copula projections; it is invariant with respect to marginal distribution transformations, has low computational cost, and is easy to implement: just five lines of R code, included at the end of the paper.
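A compact version of the construction (the original is five lines of R; this Python sketch handles only one-dimensional variables, and the feature count `k` and the scale of the random weights are arbitrary choices, not the paper's defaults):

```python
import numpy as np

def rdc(x, y, k=10, scale=3.0, seed=0):
    """Randomized Dependence Coefficient for two 1-d samples (simplified sketch)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Copula transform: replace each value by its empirical CDF value
    cx = np.argsort(np.argsort(x)) / n
    cy = np.argsort(np.argsort(y)) / n
    # Random nonlinear (sinusoidal) projections of the copula values
    fx = np.sin(np.outer(cx, rng.normal(0, scale, k)) + rng.uniform(0, 2 * np.pi, k))
    fy = np.sin(np.outer(cy, rng.normal(0, scale, k)) + rng.uniform(0, 2 * np.pi, k))
    # RDC = largest canonical correlation between the two projected feature sets
    qx, _ = np.linalg.qr(fx - fx.mean(0))
    qy, _ = np.linalg.qr(fy - fy.mean(0))
    return np.linalg.svd(qx.T @ qy, compute_uv=False)[0]
```

Because the copula step depends only on ranks, `rdc` is unchanged under any monotone transform of either marginal, which is the invariance the abstract claims.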
Subspace embeddings for the polynomial kernel
in NIPS, 2014
Abstract

Cited by 2 (2 self)
Sketching is a powerful dimensionality reduction tool for accelerating statistical learning algorithms. However, its applicability has been limited to a certain extent, since the crucial ingredient, the so-called oblivious subspace embedding, can only be applied to data spaces with an explicit representation as the column span or row span of a matrix, while in many settings learning is done in a high-dimensional space implicitly defined by the data matrix via a kernel transformation. We propose the first fast oblivious subspace embeddings that are able to embed a space induced by a nonlinear kernel without explicitly mapping the data to the high-dimensional space. In particular, we propose an embedding for mappings induced by the polynomial kernel. Using the subspace embeddings, we obtain the fastest known algorithms for computing an implicit low-rank approximation of the higher-dimensional mapping of the data matrix, for computing an approximate kernel PCA of the data, and for performing approximate kernel principal component regression.
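The standard way to embed the polynomial-kernel feature space without materializing it is TensorSketch (CountSketch composed via FFT); the degree-2 version below is a simplified illustration with arbitrary sizes, not the paper's full subspace-embedding construction:

```python
import numpy as np

def count_sketch(x, h, s, m):
    """CountSketch: accumulate the signed coordinates of x into m hash buckets."""
    out = np.zeros(m)
    np.add.at(out, h, s * x)
    return out

def tensor_sketch2(x, hashes, signs, m):
    """Degree-2 TensorSketch: <ts(x), ts(y)> estimates (x . y)^2."""
    c1 = np.fft.fft(count_sketch(x, hashes[0], signs[0], m))
    c2 = np.fft.fft(count_sketch(x, hashes[1], signs[1], m))
    # Pointwise product in Fourier domain = circular convolution of the sketches,
    # which is exactly a CountSketch of the implicit tensor product x (x) x
    return np.real(np.fft.ifft(c1 * c2))

rng = np.random.default_rng(7)
d, m = 50, 4096                    # input dimension, sketch size
hashes = rng.integers(0, m, size=(2, d))
signs = rng.choice([-1.0, 1.0], size=(2, d))

x = rng.standard_normal(d)
sx = tensor_sketch2(x, hashes, signs, m)
rel_err = abs(sx @ sx - (x @ x) ** 2) / (x @ x) ** 2
```

The sketch never forms the d²-dimensional feature vector, yet inner products in the sketch space approximate the degree-2 polynomial kernel.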
Improving Multi-step Prediction of Learned Time Series Models
Abstract

Cited by 1 (1 self)
Most statistical and machine learning approaches to time series modeling optimize a single-step prediction error. In multi-step simulation, the learned model is iteratively applied, feeding its previous output back in as its new input. Any such predictor, however, inevitably introduces errors, and these compounding errors change the input distribution for future prediction steps, breaking the train-test i.i.d. assumption common in supervised learning. We present an approach that reuses training data to make a no-regret learner robust to errors made during multi-step prediction. Our insight is to formulate the problem as imitation learning; the training data serves as a "demonstrator" by providing corrections for the errors made during multi-step prediction. By this reduction of multi-step time series prediction to imitation learning, we establish theoretically a strong performance guarantee relating training error to multi-step prediction error. We present experimental results for our method, DAD, and show significant improvement over the traditional approach in two notably different domains: dynamic system modeling and video texture prediction. Determining models for time series data is important in applications ranging from market prediction to the simulation of chemical processes and robotic systems. Many supervised learning approaches have been proposed for this task, …
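The data-as-demonstrator loop can be sketched in a few lines (a toy scalar version under invented dynamics; the real method wraps this aggregation around a no-regret learner and iterates, which is where the performance guarantee comes from):

```python
import numpy as np

# Toy nonlinear system that a one-parameter linear model can only approximate
def step(x):
    return x - 0.1 * x ** 3

T = 30
traj = [2.0]
for _ in range(T - 1):
    traj.append(step(traj[-1]))
traj = np.array(traj)

def fit(inputs, targets):
    """Least-squares fit of the linear model x_{t+1} = a * x_t."""
    return (inputs @ targets) / (inputs @ inputs)

def rollout(a, x0, T):
    xs = [x0]
    for _ in range(T - 1):
        xs.append(a * xs[-1])
    return np.array(xs)

# Baseline: fit on the observed one-step pairs only, then simulate
a0 = fit(traj[:-1], traj[1:])
pred = rollout(a0, traj[0], T)

# DAD-style correction: pair each PREDICTED state with the TRUE next state
# (the training data "demonstrates" the correction), aggregate, and refit
inputs = np.concatenate([traj[:-1], pred[:-1]])
targets = np.concatenate([traj[1:], traj[1:]])
a1 = fit(inputs, targets)

base_err = np.mean((rollout(a0, traj[0], T) - traj) ** 2)
dad_err = np.mean((rollout(a1, traj[0], T) - traj) ** 2)
```

The aggregated dataset contains both the original pairs and the correction pairs, so the refit model is trained on the input distribution its own rollouts actually induce.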
How to Scale Up Kernel Methods to Be As Good As Deep Neural Nets
Abstract

Cited by 1 (0 self)
† and ‡: shared first and second co-authorships, respectively. ¶: to whom questions and comments should be sent.
Just-In-Time Kernel Regression for Expectation Propagation
Abstract
We propose an efficient nonparametric strategy for learning a message operator in expectation propagation (EP), which takes as input the set of incoming messages to a factor node and produces an outgoing message as output. This learned operator replaces the multivariate integral required in classical EP, which may not have an analytic expression. We use kernel-based regression, which is trained on a set of probability distributions representing the incoming messages and the associated outgoing messages. The kernel approach has two main advantages: first, it is fast, as it is implemented using a novel two-layer random feature representation of the input message distributions; second, it has principled uncertainty estimates and can be cheaply updated online, meaning it can request and incorporate new training data when it encounters inputs on which it is uncertain. In experiments, our approach is able to solve learning problems where a single message operator is required for multiple, substantially different data sets (logistic regression for a variety of classification problems), where it is essential to accurately assess uncertainty and to efficiently and robustly update the message operator.
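The two-layer random feature idea can be sketched as follows (a heavily simplified stand-in: messages are 1-d Gaussians represented by samples, layer one is a random Fourier mean embedding of each message, layer two is another random feature map on the embeddings, and the regression target, the message mean, is invented for illustration, not an EP update):

```python
import numpy as np

rng = np.random.default_rng(8)

# Each incoming "message" is a 1-d Gaussian, represented here by samples
n_msgs, n_samp = 300, 100
mu = rng.uniform(-2, 2, n_msgs)
sd = rng.uniform(0.5, 1.5, n_msgs)
samples = mu[:, None] + sd[:, None] * rng.standard_normal((n_msgs, n_samp))

# Layer 1: random Fourier features averaged over samples (a mean embedding,
# one fixed-length vector per input distribution)
D1, D2 = 100, 200
w1 = rng.normal(0, 1.0, D1)
b1 = rng.uniform(0, 2 * np.pi, D1)
emb = np.sqrt(2 / D1) * np.cos(samples[:, :, None] * w1 + b1).mean(axis=1)

# Layer 2: random features on the embeddings themselves
w2 = rng.normal(0, 1.0, (D1, D2))
b2 = rng.uniform(0, 2 * np.pi, D2)
Z = np.sqrt(2 / D2) * np.cos(emb @ w2 + b2)

# Ridge regression from message representations to a target functional
lam = 1e-3
w = np.linalg.solve(Z.T @ Z + lam * np.eye(D2), Z.T @ mu)
mse = np.mean((Z @ w - mu) ** 2)
```

The key property is that layer one turns a whole distribution into a fixed-length vector, so ordinary regression machinery applies to distribution-valued inputs.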
DOI 10.1007/s11263-013-0683-3
Max-Margin Early Event Detectors
Abstract
The need for early detection of temporal events from sequential data arises in a wide spectrum of applications ranging from human-robot interaction to video security. While temporal event detection has been extensively studied, early detection is a relatively unexplored problem. This paper proposes a maximum-margin framework for training temporal event detectors to recognize partial events, enabling early detection. Our method is based on Structured Output SVM, but extends it to accommodate sequential data. Experiments on datasets of varying complexity, for detecting facial expressions, hand gestures, and human activities, demonstrate the benefits of our approach.
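The core training constraint, that every partial prefix of an event must already outscore background, can be sketched on synthetic data (a toy margin-perceptron stand-in, not the paper's Structured Output SVM; frame distributions, feature map, and step sizes are all invented):

```python
import numpy as np

rng = np.random.default_rng(9)
d, T_ev, T_bg = 5, 20, 40

def seg_feat(frames):
    """Feature of a (possibly partial) segment: the mean frame."""
    return frames.mean(axis=0)

# Training clips: event frames ~ N(1,1), background frames ~ N(0,1)
events = [rng.normal(1, 1, (T_ev, d)) for _ in range(20)]
backgrounds = [rng.normal(0, 1, (T_ev, d)) for _ in range(20)]

# Margin-perceptron updates: every PARTIAL event prefix must score >= +1,
# every background segment must score <= -1 (the early-detection constraint)
w, b = np.zeros(d), 0.0
for _ in range(50):
    for ev, bg in zip(events, backgrounds):
        for t in range(1, T_ev + 1):
            if w @ seg_feat(ev[:t]) + b < 1:
                w += 0.1 * seg_feat(ev[:t])
                b += 0.1
            if w @ seg_feat(bg[:t]) + b > -1:
                w -= 0.1 * seg_feat(bg[:t])
                b -= 0.1

# Online detection: fire as soon as the growing candidate segment scores > 0
stream = np.vstack([rng.normal(0, 1, (T_bg, d)), rng.normal(1, 1, (T_ev, d))])
fired = None
for t in range(T_bg, T_bg + T_ev):          # the event starts at frame T_bg
    if w @ seg_feat(stream[T_bg:t + 1]) + b > 0:
        fired = t
        break
```

Because the detector was trained on prefixes, it can fire before the event completes, which is the behavior the paper's framework formalizes with structured-output margins.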