Results 1-10 of 26
Revisiting the Nyström method for improved large-scale machine learning
Cited by 34 (5 self)
We reconsider randomized algorithms for the low-rank approximation of symmetric positive semidefinite (SPSD) matrices such as Laplacian and kernel matrices that arise in data analysis and machine learning applications. Our main results consist of an empirical evaluation of the performance quality and running time of sampling and projection methods on a diverse suite of SPSD matrices. Our results highlight complementary aspects of sampling versus projection methods, and they point to differences between uniform and nonuniform sampling methods based on leverage scores. We complement our empirical results with a suite of worst-case theoretical bounds for both random sampling and random projection methods. These bounds are qualitatively superior to existing bounds, e.g., improved additive-error bounds for spectral and Frobenius norm error and relative-error bounds for trace norm error.
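To make the contrast between uniform and leverage-based column sampling concrete, here is a minimal NumPy sketch. It assumes the usual definition of leverage scores relative to the top-k eigenspace (squared row norms of the top-k eigenvectors), computes them exactly with a full eigendecomposition, and samples without replacement or rescaling. The function names and the toy matrix are illustrative and are not taken from the paper.

```python
import numpy as np

def leverage_scores(K, k):
    """Leverage scores of an SPSD matrix K relative to its top-k eigenspace."""
    # eigh returns eigenvalues in ascending order, so the last k columns
    # of the eigenvector matrix span the dominant rank-k subspace.
    _, vecs = np.linalg.eigh(K)
    U_k = vecs[:, -k:]
    # Squared row norms: each score lies in [0, 1] and they sum to k.
    return np.sum(U_k ** 2, axis=1)

def sample_columns_by_leverage(K, k, c, rng):
    """Draw c distinct column indices with probability proportional to leverage."""
    scores = leverage_scores(K, k)
    probs = scores / scores.sum()
    return rng.choice(K.shape[0], size=c, replace=False, p=probs)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
K = X @ X.T                                   # toy SPSD matrix of rank 3

scores = leverage_scores(K, k=3)
idx_uniform = rng.choice(100, size=10, replace=False)          # uniform sampling
idx_leverage = sample_columns_by_leverage(K, k=3, c=10, rng=rng)  # nonuniform sampling
```

Computing exact leverage scores this way costs as much as a full eigendecomposition, so in practice they would have to be approximated; the sketch only illustrates what the sampling distribution looks like.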
Kernel Methods for Deep Learning
Cited by 26 (2 self)
We introduce a new family of positive-definite kernel functions that mimic the computation in large, multilayer neural nets. These kernel functions can be used in shallow architectures, such as support vector machines (SVMs), or in deep kernel-based architectures that we call multilayer kernel machines (MKMs). We evaluate SVMs and MKMs with these kernel functions on problems designed to illustrate the advantages of deep architectures. On several problems, we obtain better results than previous, leading benchmarks from both SVMs with Gaussian kernels and deep belief nets.
Improving CUR Matrix Decomposition and the Nyström Approximation via Adaptive Sampling
Cited by 17 (4 self)
The CUR matrix decomposition and the Nyström approximation are two important low-rank matrix approximation techniques. The Nyström method approximates a symmetric positive semidefinite matrix in terms of a small number of its columns, while CUR approximates an arbitrary data matrix by a small number of its columns and rows. Thus, CUR decomposition can be regarded as an extension of the Nyström approximation. In this paper we establish a more general error bound for the adaptive column/row sampling algorithm, based on which we propose more accurate CUR and Nyström algorithms with expected relative-error bounds. The proposed CUR and Nyström algorithms also have low time complexity and can avoid maintaining the whole data matrix in RAM. In addition, we give theoretical analysis for the lower error bounds of the standard Nyström method and the ensemble Nyström method. The main theoretical results established in this paper are novel, and our analysis makes no special assumption on the data matrices.
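The two constructions described in this abstract can be sketched in a few lines of NumPy. The sketch below uses plain uniform sampling, not the adaptive sampling the paper proposes, and the textbook forms K ≈ C W⁺ Cᵀ (Nyström) and A ≈ C U R with U = C⁺ A R⁺ (CUR). The function names, the `rcond` cutoff, and the toy matrices are assumptions made for illustration.

```python
import numpy as np

def nystrom(K, idx, rcond=1e-10):
    """Standard Nystrom approximation K ~= C W^+ C^T from the columns in idx."""
    C = K[:, idx]                       # sampled columns
    W = K[np.ix_(idx, idx)]             # intersection of sampled rows and columns
    return C @ np.linalg.pinv(W, rcond=rcond) @ C.T

def cur(A, cols, rows, rcond=1e-10):
    """CUR approximation A ~= C U R with U = C^+ A R^+."""
    C = A[:, cols]                      # sampled columns
    R = A[rows, :]                      # sampled rows
    U = np.linalg.pinv(C, rcond=rcond) @ A @ np.linalg.pinv(R, rcond=rcond)
    return C @ U @ R

def rel_error(A, A_hat):
    """Relative Frobenius-norm approximation error."""
    return np.linalg.norm(A - A_hat, "fro") / np.linalg.norm(A, "fro")

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
K = X @ X.T                             # SPSD matrix of rank 5
A = X @ rng.standard_normal((5, 80))    # general 200 x 80 matrix of rank 5

idx = rng.choice(200, size=20, replace=False)
cols = rng.choice(80, size=20, replace=False)
rows = rng.choice(200, size=20, replace=False)

nystrom_err = rel_error(K, nystrom(K, idx))
cur_err = rel_error(A, cur(A, cols, rows))
print(nystrom_err, cur_err)
```

Because both toy matrices have exact rank 5 and 20 columns (and rows) are sampled, both approximations recover the input up to rounding error. On matrices with slowly decaying spectra the error is nonzero, and the choice of sampling scheme starts to matter.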
Sharp analysis of low-rank kernel matrix approximations
JMLR: Workshop and Conference Proceedings, vol. 30 (2013), pp. 1–25
Cited by 13 (1 self)
We consider supervised learning problems within the positive-definite kernel framework, such as kernel ridge regression, kernel logistic regression or the support vector machine. With kernels leading to infinite-dimensional feature spaces, a common practical limiting difficulty is the necessity of computing the kernel matrix, which most frequently leads to algorithms with running time at least quadratic in the number of observations n, i.e., $O(n^2)$. Low-rank approximations of the kernel matrix are often considered as they allow the reduction of running time complexities to $O(p^2 n)$, where p is the rank of the approximation. The practicality of such methods thus depends on the required rank p. In this paper, we show that in the context of kernel ridge regression, for approximations based on a random subset of columns of the original kernel matrix, the rank p may be chosen to be linear in the degrees of freedom associated with the problem, a quantity which is classically used in the statistical analysis of such methods, and is often seen as the implicit number of parameters of nonparametric estimators. This result enables simple algorithms that have subquadratic running time complexity, but provably exhibit the same predictive performance as existing algorithms, for any given problem instance, and not only for worst-case situations.
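As a rough illustration of the $O(p^2 n)$ route described in this abstract, the following NumPy sketch fits kernel ridge regression through a column-sampled approximation. It assumes the common definition of the degrees of freedom, $d = \mathrm{tr}\,K(K + n\lambda I)^{-1}$, a Gaussian kernel, and a hypothetical constant 4 in the rule "p proportional to d". None of these choices come from the paper, and the full kernel matrix is formed here only so that d can be computed for the demonstration.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """Gaussian kernel matrix between the rows of X and the rows of Y."""
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def degrees_of_freedom(K, lam):
    """d = tr(K (K + n*lam*I)^{-1}), computed from the eigenvalues of K."""
    n = K.shape[0]
    ev = np.clip(np.linalg.eigvalsh(K), 0.0, None)
    return float(np.sum(ev / (ev + n * lam)))

def nystrom_features(C, idx, tol=1e-8):
    """Return Phi with Phi @ Phi.T equal to C W^+ C^T (small eigenvalues of W dropped)."""
    W = C[idx, :]                       # p x p block K[idx, idx]
    s, V = np.linalg.eigh(W)
    keep = s > tol * s.max()
    return C @ (V[:, keep] / np.sqrt(s[keep]))

rng = np.random.default_rng(0)
n, lam, gamma = 300, 1e-3, 1.0
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

K = rbf_kernel(X, X, gamma)             # demo only: needed to evaluate d
dof = degrees_of_freedom(K, lam)
p = min(n, int(np.ceil(4 * dof)))       # hypothetical constant 4
idx = rng.choice(n, size=p, replace=False)

C = rbf_kernel(X, X[idx], gamma)        # only p columns of the kernel matrix
Phi = nystrom_features(C, idx)          # n x q features with q <= p
q = Phi.shape[1]
# Ridge regression in the q-dimensional feature space: O(n q^2 + q^3) time.
w = np.linalg.solve(Phi.T @ Phi + n * lam * np.eye(q), Phi.T @ y)
pred_fast = Phi @ w
```

The fitted values `pred_fast` equal those of kernel ridge regression run on the approximate kernel matrix `Phi @ Phi.T`, but they are obtained without ever forming or inverting an n × n matrix.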
Global image denoising
IEEE Trans. on Image Proc., 2014
Cited by 7 (2 self)
Most existing state-of-the-art image denoising algorithms are based on exploiting similarity between a relatively modest number of patches. These patch-based methods are strictly dependent on patch matching, and their performance is hamstrung by the ability to reliably find sufficiently similar patches. As the number of patches grows, a point of diminishing returns is reached where the performance improvement due to more patches is offset by the lower likelihood of finding sufficiently close matches. The net effect is that while patch-based methods, such as BM3D, are excellent overall, they are ultimately limited in how well they can do on (larger) images with increasing complexity. In this paper, we address these shortcomings by developing a paradigm for truly global filtering where each pixel is estimated from all pixels in the image. Our objectives in this paper are twofold. First, we give a statistical analysis of our proposed global filter, based on a spectral decomposition of its corresponding operator, and we study the effect of truncation of this spectral decomposition. Second, we derive an approximation to the spectral (principal) components using the Nyström extension. Using these, we demonstrate that this global filter can be implemented efficiently by sampling a fairly small percentage of the pixels in the image. Experiments illustrate that our strategy can effectively globalize any existing denoising filters to estimate each pixel using all pixels in the image, hence improving upon the best patch-based methods. Index Terms: image denoising, nonlocal filters, Nyström extension, spatial domain filter, risk estimator.
Efficient Algorithms and Error Analysis for the Modified Nyström Method
Cited by 5 (2 self)
Many kernel methods suffer from high time and space complexities and are thus prohibitive in big-data applications. To tackle the computational challenge, the Nyström method has been extensively used to reduce time and space complexities by sacrificing some accuracy. The Nyström method speeds up computation by constructing an approximation of the kernel matrix using only a few columns of the matrix. Recently, a variant of the Nyström method called the modified Nyström method has demonstrated significant improvement over the standard Nyström method in approximation accuracy, both theoretically and empirically. In this paper, we propose two algorithms that make the modified Nyström method practical. First, we devise a simple column selection algorithm with a provable error bound. Our algorithm is more efficient and easier to implement than the state-of-the-art algorithm, and nearly as accurate. Second, with the selected columns at hand, we propose an algorithm that computes the approximation in lower time complexity than the approach in the previous work. Furthermore, we prove that the modified Nyström method is exact under certain conditions, and we establish a lower error bound for the modified Nyström method.
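The difference between the two variants can be sketched as follows. The abstract does not define the modified method, so the sketch assumes the form commonly given in the literature: the same sampled columns C, but with the intersection matrix chosen as U = C⁺ K (C⁺)ᵀ, which minimizes the Frobenius error over all U. It uses uniform sampling and a naive dense computation. It does not reproduce the column selection or fast algorithms proposed in the paper, and all names are illustrative.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gaussian kernel matrix on the rows of X."""
    sq = (X ** 2).sum(1)[:, None] + (X ** 2).sum(1)[None, :] - 2 * X @ X.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def standard_nystrom(K, idx):
    """K ~= C W^+ C^T, where W is the block of K indexed by idx."""
    C = K[:, idx]
    W = K[np.ix_(idx, idx)]
    return C @ np.linalg.pinv(W) @ C.T

def modified_nystrom(K, idx):
    """K ~= C U C^T with U = C^+ K (C^+)^T (assumed definition, see lead-in)."""
    C = K[:, idx]
    C_pinv = np.linalg.pinv(C)
    U = C_pinv @ K @ C_pinv.T           # needs the full K: O(n^2 c) when done naively
    return C @ U @ C.T

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 10))
K = rbf_kernel(X, gamma=0.05)
idx = rng.choice(300, size=20, replace=False)

err_std = np.linalg.norm(K - standard_nystrom(K, idx), "fro")
err_mod = np.linalg.norm(K - modified_nystrom(K, idx), "fro")
print(err_std, err_mod)
```

With the same columns, the modified variant can never have a larger Frobenius error than the standard one, at the price of touching every entry of K. That cost is what motivates faster algorithms for computing the approximation.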
On Compact Codes for Spatially Pooled Features
Cited by 4 (1 self)
Feature encoding with an overcomplete dictionary has demonstrated good performance in many applications, especially computer vision. In this paper we analyze the classification accuracy with respect to dictionary size by linking the encoding stage to kernel methods and Nyström sampling, and obtain useful bounds on accuracy as a function of size. The Nyström method also inspires us to revisit dictionary learning from local patches, and we propose to learn the dictionary in an end-to-end fashion taking into account pooling, a common computational layer in vision. We validate our contribution by showing how the derived bounds are able to explain the observed behavior of multiple datasets, and show that the pooling-aware method efficiently reduces the dictionary size by a factor of two for a given accuracy.
Large-scale SVD and Manifold Learning
Cited by 3 (0 self)
This paper examines the efficacy of sampling-based low-rank approximation techniques when applied to large dense kernel matrices. We analyze two common approximate singular value decomposition techniques, namely the Nyström and Column sampling methods. We present a theoretical comparison between these two methods, provide novel insights regarding their suitability for various tasks and present experimental results that support our theory. Our results illustrate the relative strengths of each method. We next examine the performance of these two techniques on the large-scale task of extracting low-dimensional manifold structure given millions of high-dimensional face images. We address the computational challenges of nonlinear dimensionality reduction via Isomap and Laplacian Eigenmaps, using a graph containing about 18 million nodes and 65 million edges. We present extensive experiments on learning low-dimensional embeddings for two large face data sets: CMU-PIE (35 thousand faces) and a web data set (18 million faces). Our comparisons show that the Nyström approximation is superior to the Column sampling method for this task. Furthermore, approximate Isomap tends to perform better than Laplacian Eigenmaps on both clustering and classification with the labeled CMU-PIE data set.
Memory Efficient Kernel Approximation
2014
Cited by 3 (1 self)
The scalability of kernel machines is a big challenge when facing millions of samples due to storage and computation issues for large kernel matrices, which are usually dense. Recently, many papers have suggested tackling this problem by using a low-rank approximation of the kernel matrix. In this paper, we first make the observation that the structure of shift-invariant kernels changes from low-rank to block-diagonal (without any low-rank structure) when varying the scale parameter. Based on this observation, we propose a new kernel approximation algorithm, Memory Efficient Kernel Approximation (MEKA), which considers both low-rank and clustering structure of the kernel matrix. We show that the resulting algorithm outperforms state-of-the-art low-rank kernel approximation methods in terms of speed, approximation error, and memory usage. As an example, on the mnist2m dataset with two million samples, our method takes 550 seconds on a single machine using less than 500 MBytes memory to achieve 0.2313 test RMSE for kernel ridge regression, while standard Nyström approximation takes more than 2700 seconds and uses more than 2 GBytes memory on the same problem to achieve 0.2318 test RMSE.
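The observation about the scale parameter can be checked numerically on a toy example. The sketch below is not MEKA itself. It only measures how much of the trace of a Gaussian kernel matrix is captured by its top five eigenvalues for a wide and a narrow kernel. The parameterization exp(-gamma * ||x - y||^2), the two gamma values, and the data are assumptions chosen for illustration.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gaussian (shift-invariant) kernel matrix on the rows of X."""
    sq = (X ** 2).sum(1)[:, None] + (X ** 2).sum(1)[None, :] - 2 * X @ X.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def top_k_energy(K, k):
    """Fraction of the trace of K carried by its k largest eigenvalues."""
    ev = np.clip(np.linalg.eigvalsh(K), 0.0, None)   # ascending order
    return float(ev[-k:].sum() / ev.sum())

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))

# Small gamma (wide kernel): entries are all close to 1, so K is nearly low-rank.
energy_wide = top_k_energy(rbf_kernel(X, gamma=0.01), k=5)
# Large gamma (narrow kernel): K is close to the identity, with no low-rank structure.
energy_narrow = top_k_energy(rbf_kernel(X, gamma=100.0), k=5)
print(energy_wide, energy_narrow)
```

In the narrow-kernel regime only nearby points interact, so after grouping the points into clusters the large entries concentrate in diagonal blocks. That is the structure a purely low-rank method cannot exploit.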
Large-Scale Machine Learning for Classification and Search
2012
Cited by 2 (1 self)
With the rapid development of the Internet, nowadays tremendous amounts of data including images and videos, up to millions or billions, can be collected for training machine learning models. Inspired by this trend, this thesis is dedicated to developing large-scale machine learning techniques for the purpose of making classification and nearest neighbor search practical on gigantic databases. Our first approach is to explore data graphs to aid classification and nearest neighbor search. A graph offers an attractive way of representing data and discovering the essential information such as the neighborhood structure. However, both the graph construction process and graph-based learning techniques become computationally prohibitive at a large scale. To this end, we present an efficient large graph construction approach and subsequently apply it to develop scalable semi-supervised learning and unsupervised hashing algorithms. Our unique contributions on the graph-related topics include: 1. Large Graph Construction: Conventional neighborhood graphs such as kNN graphs require a quadratic time complexity, which is inadequate for the large-scale applications mentioned above. To overcome this bottleneck, we present a novel graph construction approach,