Results 1–10 of 22
Revisiting the Nyström method for improved large-scale machine learning
Cited by 34 (5 self)
We reconsider randomized algorithms for the low-rank approximation of SPSD matrices, such as Laplacian and kernel matrices, that arise in data analysis and machine learning applications. Our main results consist of an empirical evaluation of the approximation quality and running time of sampling and projection methods on a diverse suite of SPSD matrices. Our results highlight complementary aspects of sampling versus projection methods, and they point to differences between uniform and nonuniform sampling methods based on leverage scores. We complement our empirical results with a suite of worst-case theoretical bounds for both random sampling and random projection methods. These bounds are qualitatively superior to existing bounds: e.g., improved additive-error bounds for spectral and Frobenius norm error, and relative-error bounds for trace norm error.
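The column-sampling approximation that this line of work studies can be sketched in a few lines of NumPy. This is an illustrative minimal version (uniform sampling, Gaussian kernel, made-up sizes), not the paper's evaluated implementation:

```python
import numpy as np

def nystrom(K, idx):
    """Nystrom approximation of an SPSD matrix K from sampled columns idx:
    K_hat = C @ pinv(W) @ C.T, with C = K[:, idx] and W the sampled block."""
    C = K[:, idx]                      # n x m sampled columns
    W = K[np.ix_(idx, idx)]            # m x m intersection block
    return C @ np.linalg.pinv(W) @ C.T

# Toy usage: Gaussian kernel on random 2-D points, 20 uniform landmarks.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)
idx = rng.choice(100, size=20, replace=False)
K_hat = nystrom(K, idx)
```

The approximation is exact on the sampled block (since W @ pinv(W) @ W = W), which is why the quality of the column-selection rule matters so much in the papers below.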
Nyström method vs random Fourier features: A theoretical and empirical comparison
In Advances in NIPS’12, 2012
Cited by 29 (3 self)
Both random Fourier features and the Nyström method have been successfully applied to efficient kernel learning. In this work, we investigate the fundamental difference between these two approaches, and how the difference can affect their generalization performance. Unlike approaches based on random Fourier features, where the basis functions (i.e., cosine and sine functions) are sampled from a distribution independent of the training data, the basis functions used by the Nyström method are randomly sampled from the training examples and are therefore data dependent. By exploring this difference, we show that when there is a large gap in the eigenspectrum of the kernel matrix, approaches based on the Nyström method can yield a markedly better generalization error bound than approaches based on random Fourier features. We empirically verify our theoretical findings on a wide range of large data sets.
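The data-independent side of this comparison, random Fourier features, can be sketched as follows. This is a minimal illustration for the Gaussian kernel (names and parameter choices are ours, not the paper's):

```python
import numpy as np

def rff_features(X, D, gamma=0.5, seed=0):
    """Random Fourier features z(x) with E[z(x) @ z(y)] = exp(-gamma*||x-y||^2).
    Frequencies are drawn from the kernel's spectral density, N(0, 2*gamma*I),
    independently of the data -- the key contrast with Nystrom landmarks."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(X.shape[1], D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Compare Z @ Z.T against the exact Gaussian kernel matrix.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
Z = rff_features(X, D=4000)
K_exact = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))
max_err = np.abs(Z @ Z.T - K_exact).max()
```

Because the frequencies never look at the data, the approximation error decays only at the Monte Carlo rate in D, which is what gives the Nyström method its edge when the kernel eigenspectrum has a large gap.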
Clustered Nyström method for large scale manifold learning and dimension reduction
IEEE Transactions on Neural Networks, 2010
Cited by 24 (6 self)
The kernel (or similarity) matrix plays a key role in many machine learning algorithms such as kernel methods, manifold learning, and dimension reduction. However, the cost of storing and manipulating the complete kernel matrix makes it infeasible for large problems. The Nyström method is a popular sampling-based low-rank approximation scheme for reducing the computational burden of handling large kernel matrices. In this paper, we analyze how the approximation quality of the Nyström method depends on the choice of landmark points, and in particular on the power of the landmark points to summarize the data. Our (non-probabilistic) error analysis justifies a “clustered Nyström method” that uses the k-means cluster centers as landmark points. Our algorithm can be applied to scale up a wide variety of algorithms that depend on the eigenvalue decomposition of a kernel matrix (or its variant), such as kernel principal component analysis, Laplacian eigenmaps, and spectral clustering, as well as those involving the kernel matrix inverse, such as least-squares support vector machines and Gaussian process regression. Extensive experiments demonstrate the competitive performance of our algorithm in both accuracy and efficiency. Index Terms: dimension reduction, eigenvalue decomposition, kernel matrix, low-rank approximation, manifold learning,
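The "k-means centers as landmarks" idea sketches directly: run Lloyd's algorithm, then build the Nyström factors from kernel evaluations against the centers rather than sampled training points. A toy version (plain k-means, Gaussian kernel, parameters chosen for illustration only):

```python
import numpy as np

def kmeans_centers(X, k, iters=25, seed=0):
    """Plain Lloyd's k-means; returns the k cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return centers

def clustered_nystrom(X, k, gamma=0.5):
    """Nystrom approximation with k-means centers as the landmark points."""
    Z = kmeans_centers(X, k)
    gauss = lambda A, B: np.exp(-gamma * ((A[:, None] - B[None]) ** 2).sum(-1))
    C, W = gauss(X, Z), gauss(Z, Z)    # n x k and k x k kernel blocks
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(2)
X = rng.standard_normal((120, 2))
K = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))
K_hat = clustered_nystrom(X, k=15)
rel_err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
```

Note that the landmarks here need not be training points at all, which is exactly what the paper's error analysis exploits: centers that summarize the data well give a better sampled block W.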
Improving CUR Matrix Decomposition and the Nyström Approximation via Adaptive Sampling
Cited by 17 (4 self)
The CUR matrix decomposition and the Nyström approximation are two important low-rank matrix approximation techniques. The Nyström method approximates a symmetric positive semidefinite matrix in terms of a small number of its columns, while CUR approximates an arbitrary data matrix by a small number of its columns and rows. Thus, the CUR decomposition can be regarded as an extension of the Nyström approximation. In this paper we establish a more general error bound for the adaptive column/row sampling algorithm, based on which we propose more accurate CUR and Nyström algorithms with expected relative-error bounds. The proposed CUR and Nyström algorithms also have low time complexity and can avoid maintaining the whole data matrix in RAM. In addition, we provide a theoretical analysis of lower error bounds for the standard Nyström method and the ensemble Nyström method. The main theoretical results established in this paper are novel, and our analysis makes no special assumption on the data matrices.
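A two-round adaptive sampling scheme in the spirit of this line of work can be sketched as follows (an illustrative simplification, not the paper's exact algorithm or bounds): sample some columns uniformly, then sample additional columns with probability proportional to the squared column norms of the residual.

```python
import numpy as np

def adaptive_columns(A, c1, c2, seed=0):
    """Two-round adaptive column sampling: round 1 is uniform; round 2 samples
    with probability proportional to the squared column norms of the residual
    A - (projection of A onto the span of the round-1 columns)."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    S1 = rng.choice(n, size=c1, replace=False)
    C1 = A[:, S1]
    R = A - C1 @ (np.linalg.pinv(C1) @ A)   # residual after projecting on span(C1)
    p = (R ** 2).sum(axis=0)
    S2 = rng.choice(n, size=c2, replace=False, p=p / p.sum())
    return np.concatenate([S1, S2])

rng = np.random.default_rng(3)
A = rng.standard_normal((30, 20)) @ rng.standard_normal((20, 40))
idx = adaptive_columns(A, c1=4, c2=4)
C = A[:, idx]
err = np.linalg.norm(A - C @ np.linalg.pinv(C) @ A)
```

Since the second round only ever enlarges the column span, the projection error can never increase, and the residual-weighted probabilities steer it toward the directions the first round missed.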
CUR from a sparse optimization viewpoint
In Advances in Neural Information Processing Systems, 2010
Cited by 10 (1 self)
The CUR decomposition provides an approximation of a matrix X that has low reconstruction error and that is sparse in the sense that the resulting approximation lies in the span of only a few columns of X. In this regard, it appears to be similar to many sparse PCA methods. However, CUR takes a randomized algorithmic approach, whereas most sparse PCA methods are framed as convex optimization problems. In this paper, we try to understand CUR from a sparse optimization viewpoint. We show that CUR is implicitly optimizing a sparse regression objective and, furthermore, cannot be directly cast as a sparse PCA method. We also observe that the sparsity attained by CUR possesses an interesting structure, which leads us to formulate a sparse PCA method that achieves a CUR-like sparsity.
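The basic CUR factorization referred to above can be sketched in a few lines (with the standard optimal middle factor; the column/row indices here are arbitrary for illustration):

```python
import numpy as np

def cur(A, col_idx, row_idx):
    """CUR approximation A ~= C @ U @ R, where C and R are actual columns and
    rows of A and U = pinv(C) @ A @ pinv(R). The result lies in the span of a
    few columns of A -- the 'sparse' structure discussed above."""
    C = A[:, col_idx]
    R = A[row_idx, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R

# If A has exact rank r and the chosen columns/rows each span that rank,
# the CUR reconstruction is exact.
rng = np.random.default_rng(4)
A = rng.standard_normal((30, 3)) @ rng.standard_normal((3, 20))  # rank 3
C, U, R = cur(A, col_idx=[0, 5, 9, 14], row_idx=[1, 7, 13, 22])
err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
```

Unlike a truncated SVD, whose factors are dense mixtures of all columns, C and R are interpretable slices of the original data, which is what makes the sparse-PCA comparison natural.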
A novel greedy algorithm for Nyström approximation
Cited by 7 (0 self)
The Nyström method is an efficient technique for obtaining a low-rank approximation of a large kernel matrix based on a subset of its columns. The quality of the Nyström approximation depends heavily on the subset of columns used, which is usually selected by random sampling. This paper presents a novel recursive algorithm for calculating the Nyström approximation, and an effective greedy criterion for column selection. Further, a very efficient variant is proposed for greedy sampling, which works on random partitions of data instances. Experiments on benchmark data sets show that the proposed greedy algorithms achieve significant improvements in approximating kernel matrices, with minimal run-time overhead.
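A simple greedy column-selection rule of this flavor can be sketched as follows. This uses an illustrative residual-norm criterion with deflation; it is not the paper's exact recursive algorithm:

```python
import numpy as np

def greedy_landmarks(K, m):
    """Greedy landmark selection (illustrative criterion): repeatedly pick the
    column whose residual, after projecting out the already-chosen columns,
    has the largest norm; then deflate that component from the residual."""
    R = K.astype(float).copy()
    chosen = []
    for _ in range(m):
        scores = (R ** 2).sum(axis=0)
        j = int(scores.argmax())
        c = R[:, [j]]
        nc = float(c.T @ c)
        if nc < 1e-12:        # residual numerically zero: stop early
            break
        chosen.append(j)
        R = R - c @ (c.T @ R) / nc   # remove the component along column j
    return chosen

rng = np.random.default_rng(5)
X = rng.standard_normal((80, 2))
K = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))
idx = greedy_landmarks(K, m=10)
```

Each deflation zeroes the chosen column's residual, so the criterion never re-selects a column; the cost per step is one rank-one update rather than a fresh factorization.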
KERNEL METHODS MATCH DEEP NEURAL NETWORKS ON TIMIT
Cited by 7 (4 self)
Despite their theoretical appeal and grounding in tractable convex optimization techniques, kernel methods are often not the first choice for large-scale speech applications due to their significant memory requirements and computational expense. In recent years, randomized approximate feature maps have emerged as an elegant mechanism to scale up kernel methods. Still, in practice, a large number of random features is required to obtain acceptable accuracy in predictive tasks. In this paper, we develop two algorithmic schemes to address this computational bottleneck in the context of kernel ridge regression. The first scheme is a specialized distributed block coordinate descent procedure that avoids the explicit materialization of the feature-space data matrix, while the second scheme gains efficiency by combining multiple weak random feature models in an ensemble learning framework. We demonstrate that these schemes enable kernel methods to match the performance of state-of-the-art deep neural networks on TIMIT for speech recognition and classification tasks. In particular, we obtain the best classification error rates reported on TIMIT using kernel methods.
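The second scheme mentioned above, averaging several weak random-feature models, can be sketched on a toy regression problem. This is our own minimal illustration (random Fourier features plus ridge regression; the sizes, kernel, and hyperparameters are arbitrary), not the paper's distributed implementation:

```python
import numpy as np

def fit_rff_ridge(X, y, D=100, gamma=1.0, lam=1e-3, seed=0):
    """Kernel ridge regression in a random-feature space: solve the small
    D x D system (Z.T @ Z + lam*I) w = Z.T @ y instead of the n x n kernel
    system."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(X.shape[1], D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)
    return W, b, w

def predict(model, X):
    W, b, w = model
    Z = np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)
    return Z @ w

# Ensemble of weak (small-D) models with independent random features, averaged.
rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(2 * X[:, 0])
models = [fit_rff_ridge(X, y, D=100, seed=s) for s in range(5)]
y_hat = np.mean([predict(m, X) for m in models], axis=0)
mse = np.mean((y - y_hat) ** 2)
```

Each member is cheap (a 100 x 100 solve here), and averaging reduces the variance introduced by the random features, which is the ensemble scheme's source of efficiency.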
Local Low-Rank Matrix Approximation
Cited by 5 (1 self)
Matrix approximation is a common tool in recommendation systems, text mining, and computer vision. A prevalent assumption in constructing matrix approximations is that the partially observed matrix is of low rank. We propose a new matrix approximation model where we assume instead that the matrix is locally of low rank, leading to a representation of the observed matrix as a weighted sum of low-rank matrices. We analyze the accuracy of the proposed local low-rank modeling. Our experiments show improvements in prediction accuracy over classical approaches for recommendation tasks.
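The "weighted sum of low-rank matrices" idea can be illustrated with a toy blend of local models. This is our own simplified sketch (local models via truncated SVD of a row-weighted matrix, Epanechnikov-style weights over row distance), not the paper's weighted-factorization estimator for partially observed matrices:

```python
import numpy as np

def local_lowrank(M, anchors, rank, h):
    """Toy local low-rank reconstruction: each anchor row t gets a local model
    (a truncated SVD of the row-weighted matrix), and the local estimates are
    blended with kernel weights w_t(row) = max(1 - ((row - t)/h)^2, 0)."""
    n, m = M.shape
    rows = np.arange(n)
    est = np.zeros((n, m))
    wsum = np.zeros((n, 1))
    for t in anchors:
        w = np.maximum(1.0 - ((rows - t) / h) ** 2, 0.0)[:, None]
        U, s, Vt = np.linalg.svd(w * M, full_matrices=False)
        est += (U[:, :rank] * s[:rank]) @ Vt[:rank]   # local estimate of w * M
        wsum += w
    return est / np.maximum(wsum, 1e-12)

# Sanity check: if M is globally rank-2, every local model is exact and the
# weighted blend recovers M.
rng = np.random.default_rng(7)
M = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 10))
M_hat = local_lowrank(M, anchors=[0, 5, 10, 15, 19], rank=2, h=6.0)
```

The point of the model is that M only needs to be low rank near each anchor; the blend can then represent matrices whose global rank is much higher than any single local model's.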
Comparing data-dependent and data-independent embeddings for classification and ranking of internet images
In CVPR, 2011
Cited by 5 (0 self)
This paper presents a comparative evaluation of feature embeddings for classification and ranking in large-scale internet image datasets. We follow a popular framework for scalable visual learning, in which the data is first transformed by a nonlinear embedding and then an efficient linear classifier is trained in the resulting space. Our study includes data-dependent embeddings inspired by the semi-supervised learning literature, and data-independent ones based on approximating specific kernels (such as the Gaussian kernel for GIST features and the histogram intersection kernel for bags of words). Perhaps surprisingly, we find that data-dependent embeddings, despite being computed from large amounts of unlabeled data, do not have any advantage over data-independent ones in the regime of scarce labeled data. On the other hand, we find that several data-dependent embeddings are competitive with popular data-independent choices for large-scale classification.
RANDOM FEATURES FOR KERNEL DEEP CONVEX NETWORK
Cited by 3 (2 self)
The recently developed deep learning architecture, a kernel version of the deep convex network (KDCN), is improved to address the scalability problem that arises when the training and testing samples become very large. We have developed a solution based on the use of random Fourier features, which possess the strong theoretical property of approximating the Gaussian kernel while enabling efficient computation in both training and evaluation of the KDCN with large training samples. We empirically demonstrate that, just like the conventional KDCN with exact Gaussian kernels, the use of random Fourier features also enables successful stacking of kernel modules to form a deep architecture. Our evaluation experiments on phone recognition and speech understanding tasks both show the computational efficiency of the KDCN that makes use of random features. With sufficient depth in the KDCN, the phone recognition accuracy and slot-filling accuracy are shown to be comparable to or slightly higher than those of the KDCN with Gaussian kernels, while achieving significant computational savings. Index Terms: kernel regression, deep learning, spoken language understanding, random features