Results 1-10 of 53
Revisiting the Nyström method for improved large-scale machine learning
Cited by 34 (5 self)
We reconsider randomized algorithms for the low-rank approximation of symmetric positive semidefinite (SPSD) matrices such as Laplacian and kernel matrices that arise in data analysis and machine learning applications. Our main results consist of an empirical evaluation of the performance quality and running time of sampling and projection methods on a diverse suite of SPSD matrices. Our results highlight complementary aspects of sampling versus projection methods, and they point to differences between uniform and nonuniform sampling methods based on leverage scores. We complement our empirical results with a suite of worst-case theoretical bounds for both random sampling and random projection methods. These bounds are qualitatively superior to existing bounds, e.g., improved additive-error bounds for spectral and Frobenius norm error and relative-error bounds for trace norm error.
OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings
2012
Cited by 32 (7 self)
An oblivious subspace embedding (OSE) given some parameters $\varepsilon, d$ is a distribution $\mathcal{D}$ over matrices $\Pi \in \mathbb{R}^{m \times n}$ such that for any linear subspace $W \subseteq \mathbb{R}^n$ with $\dim(W) = d$ it holds that $\mathbb{P}_{\Pi \sim \mathcal{D}}(\forall x \in W,\ \|\Pi x\|_2 \in (1 \pm \varepsilon)\|x\|_2) > 2/3$. We show an OSE exists with $m = O(d^2/\varepsilon^2)$ and where every $\Pi$ in the support of $\mathcal{D}$ has exactly $s = 1$ nonzero entries per column. This improves the previously best known bound in [Clarkson-Woodruff, arXiv abs/1207.6365]. Our quadratic dependence on $d$ is optimal for any OSE with $s = 1$ [Nelson-Nguyễn, 2012]. We also give two OSE's, which we call Oblivious Sparse Norm-Approximating Projections (OSNAPs), that both allow the parameter settings $m = \tilde{O}(d/\varepsilon^2)$ and $s = \mathrm{polylog}(d)/\varepsilon$, or $m = O(d^{1+\gamma}/\varepsilon^2)$ and $s = O(1/\varepsilon)$ for any constant $\gamma > 0$. This $m$ is nearly optimal since $m \ge d$ is required simply to ensure no nonzero vector of $W$ lands in the kernel of $\Pi$. These are the first constructions with $m = o(d^2)$ to have $s = o(d)$. In fact, our OSNAPs are nothing more than the sparse Johnson-Lindenstrauss matrices of [Kane-Nelson, SODA 2012]. Our analyses all yield OSE's that are sampled using either $O(1)$-wise or $O(\log d)$-wise ...
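To make the $s = 1$ case concrete, here is a minimal sketch of applying a random matrix $\Pi$ with exactly one nonzero entry ($\pm 1$) per column to a matrix $A$. The function name `sparse_embed` and its parameters are illustrative assumptions, and the sketch draws fully random buckets and signs from Python's `random` module rather than the limited-independence hash functions the abstract refers to.

```python
import random

def sparse_embed(rows, m, seed=0):
    """Return Pi * A, where A is given as a list of n rows (each of length d)
    and Pi is a random m x n matrix with exactly one nonzero entry, +1 or -1,
    in every column. Illustrative only: buckets and signs are fully random."""
    rng = random.Random(seed)
    d = len(rows[0])
    sketch = [[0.0] * d for _ in range(m)]
    for row in rows:
        bucket = rng.randrange(m)        # row index of the single nonzero in this column of Pi
        sign = rng.choice((-1.0, 1.0))   # value of that nonzero
        for j, value in enumerate(row):
            sketch[bucket][j] += sign * value
    return sketch
```

Each input row is touched once, so with a sparse row representation the cost would be proportional to the number of nonzeros of $A$; the dense lists used here cost $O(nd)$.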
Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression
2012
Cited by 26 (4 self)
Low-distortion embeddings are critical building blocks for developing random sampling and random projection algorithms for common linear algebra problems. We show that, given a matrix $A \in \mathbb{R}^{n \times d}$ with $n \gg d$ and a $p \in [1, 2)$, with a constant probability, we can construct a low-distortion embedding matrix $\Pi \in \mathbb{R}^{O(\mathrm{poly}(d)) \times n}$ that embeds $\mathcal{A}_p$, the $\ell_p$ subspace spanned by $A$'s columns, into $(\mathbb{R}^{O(\mathrm{poly}(d))}, \|\cdot\|_p)$; the distortion of our embeddings is only $O(\mathrm{poly}(d))$, and we can compute $\Pi A$ in $O(\mathrm{nnz}(A))$ time, i.e., input-sparsity time. Our result generalizes the input-sparsity time $\ell_2$ subspace embedding by Clarkson and Woodruff [STOC'13]; and for completeness, we present a simpler and improved analysis of their construction for $\ell_2$. These input-sparsity time $\ell_p$ embeddings are optimal, up to constants, in terms of their running time; and the improved running time propagates to applications such as $(1 \pm \epsilon)$-distortion $\ell_p$ subspace embedding and relative-error $\ell_p$ regression. For $\ell_2$, we show that a $(1 + \epsilon)$-approximate solution to the $\ell_2$ regression problem specified by the matrix $A$ and a vector $b \in \mathbb{R}^n$ can be computed in $O(\mathrm{nnz}(A) + d^3 \log(d/\epsilon)/\epsilon^2)$ time; and for $\ell_p$, via a subspace-preserving sampling procedure, we show that a $(1 \pm \epsilon)$-distortion embedding of $\mathcal{A}_p$ into $\mathbb{R}^{O(\mathrm{poly}(d))}$ can be computed in $O(\mathrm{nnz}(A) \cdot \log n)$ time, and we also show that a $(1 + \epsilon)$-approximate solution to the $\ell_p$ regression problem $\min_{x \in \mathbb{R}^d} \|Ax - b\|_p$ can be computed in $O(\mathrm{nnz}(A) \cdot \log n + \mathrm{poly}(d) \log(1/\epsilon)/\epsilon^2)$ time. Moreover, we can also improve the embedding dimension or equivalently the sample size to $O(d^{3+p/2} \log(1/\epsilon)/\epsilon^2)$ without increasing the complexity.
Improved matrix algorithms via the subsampled randomized Hadamard transform
SIAM J. Matrix Analysis Applications
Cited by 17 (3 self)
Several recent randomized linear algebra algorithms rely upon fast dimension reduction methods. A popular choice is the subsampled randomized Hadamard transform (SRHT). In this article, we address the efficacy, in the Frobenius and spectral norms, of an SRHT-based low-rank matrix approximation technique introduced by Woolfe, Liberty, Rokhlin, and Tygert. We establish a slightly better Frobenius norm error bound than is currently available, and a much sharper spectral norm error bound (in the presence of reasonable decay of the singular values). Along the way, we produce several results on matrix operations with SRHTs (such as approximate matrix multiplication) that may be of independent interest. Our approach builds upon Tropp's in “Improved Analysis of the ...
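For readers unfamiliar with the SRHT, the following is a minimal sketch of one common formulation applied to a single vector: flip signs at random, apply a Walsh-Hadamard transform, then keep $r$ coordinates and rescale. The names `fwht` and `srht` are illustrative, and sampling coordinates without replacement is an assumption here; definitions in the literature vary on that point.

```python
import math
import random

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform of a list whose length
    is a power of two. Runs in O(n log n) additions."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

def srht(x, r, seed=0):
    """Return sqrt(n/r) * R * H * D * x, where D is a random sign diagonal,
    H the orthonormal Hadamard matrix, and R keeps r coordinates chosen
    uniformly without replacement (an assumption of this sketch)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    rng = random.Random(seed)
    flipped = [rng.choice((-1.0, 1.0)) * v for v in x]   # D x
    transformed = fwht(flipped)                           # sqrt(n) * H D x
    kept = rng.sample(range(n), r)                        # rows selected by R
    scale = 1.0 / math.sqrt(r)                            # sqrt(n/r) * (1/sqrt(n))
    return [scale * transformed[i] for i in kept]
```

With $r = n$ the map is orthogonal, so the Euclidean norm of the input is preserved exactly; for $r < n$ it is preserved only approximately and with some probability.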
Improving CUR Matrix Decomposition and the Nyström Approximation via Adaptive Sampling
Cited by 17 (4 self)
The CUR matrix decomposition and the Nyström approximation are two important low-rank matrix approximation techniques. The Nyström method approximates a symmetric positive semidefinite matrix in terms of a small number of its columns, while CUR approximates an arbitrary data matrix by a small number of its columns and rows. Thus, CUR decomposition can be regarded as an extension of the Nyström approximation. In this paper we establish a more general error bound for the adaptive column/row sampling algorithm, based on which we propose more accurate CUR and Nyström algorithms with expected relative-error bounds. The proposed CUR and Nyström algorithms also have low time complexity and can avoid maintaining the whole data matrix in RAM. In addition, we give theoretical analysis for the lower error bounds of the standard Nyström method and the ensemble Nyström method. The main theoretical results established in this paper are novel, and our analysis makes no special assumption on the data matrices.
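As a rough illustration of the standard Nyström approximation described above, the sketch below forms $C W^{-1} C^T$ from a chosen set of column indices, where $C$ holds the selected columns of $K$ and $W$ is the intersection block. The helper names `solve` and `nystrom` are hypothetical, the column indices are passed in by hand rather than sampled, and $W$ is assumed invertible; a practical implementation would use a pseudo-inverse and one of the sampling schemes studied in the papers listed here.

```python
def solve(W, B):
    """Solve W X = B by Gauss-Jordan elimination with partial pivoting.
    W is c x c and assumed nonsingular; B is c x k."""
    c = len(W)
    aug = [list(W[i]) + list(B[i]) for i in range(c)]
    for col in range(c):
        pivot = max(range(col, c), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("W is numerically singular; choose other columns")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(c):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[c:] for row in aug]

def nystrom(K, idx):
    """Nystrom approximation C W^{-1} C^T of a symmetric PSD matrix K,
    with C = K[:, idx] and W = K[idx, idx]."""
    n, c = len(K), len(idx)
    C = [[K[i][j] for j in idx] for i in range(n)]
    W = [[K[i][j] for j in idx] for i in idx]
    Ct = [[C[i][a] for i in range(n)] for a in range(c)]
    X = solve(W, Ct)  # X = W^{-1} C^T, shape c x n
    return [[sum(C[i][a] * X[a][j] for a in range(c)) for j in range(n)]
            for i in range(n)]
```

When $K$ has rank $r$ and the $r$ selected columns span its column space, the approximation reproduces $K$ exactly, which makes a convenient sanity check.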
Sharp analysis of low-rank kernel matrix approximations
JMLR: Workshop and Conference Proceedings, vol. 30 (2013), 1–25
Cited by 13 (1 self)
We consider supervised learning problems within the positive-definite kernel framework, such as kernel ridge regression, kernel logistic regression or the support vector machine. With kernels leading to infinite-dimensional feature spaces, a common practical limiting difficulty is the necessity of computing the kernel matrix, which most frequently leads to algorithms with running time at least quadratic in the number of observations $n$, i.e., $O(n^2)$. Low-rank approximations of the kernel matrix are often considered as they allow the reduction of running time complexities to $O(p^2 n)$, where $p$ is the rank of the approximation. The practicality of such methods thus depends on the required rank $p$. In this paper, we show that in the context of kernel ridge regression, for approximations based on a random subset of columns of the original kernel matrix, the rank $p$ may be chosen to be linear in the degrees of freedom associated with the problem, a quantity which is classically used in the statistical analysis of such methods, and is often seen as the implicit number of parameters of non-parametric estimators. This result enables simple algorithms that have sub-quadratic running time complexity, but provably exhibit the same predictive performance as existing algorithms, for any given problem instance, and not only for worst-case situations.
The Fast Cauchy Transform and Faster Robust Linear Regression
Cited by 12 (5 self)
We provide fast algorithms for overconstrained $\ell_p$ regression and related problems: for an $n \times d$ input matrix $A$ and vector $b \in \mathbb{R}^n$, in $O(nd \log n)$ time we reduce the problem $\min_{x \in \mathbb{R}^d} \|Ax - b\|_p$ to the same problem with input matrix $\tilde{A}$ of dimension $s \times d$ and corresponding $\tilde{b}$ of dimension $s \times 1$. Here, $\tilde{A}$ and $\tilde{b}$ are a coreset for the problem, consisting of sampled and rescaled rows of $A$ and $b$; and $s$ is independent of $n$ and polynomial in $d$. Our results improve on the best previous algorithms when $n \gg d$, for all $p \in [1, \infty)$ except $p = 2$; in particular, they improve the $O(nd^{1.376+})$ running time of Sohler and Woodruff (STOC, 2011) for $p = 1$, that uses asymptotically fast matrix multiplication, and the ...
A statistical perspective on algorithmic leveraging
2013
Cited by 11 (2 self)
One popular method for dealing with large-scale data sets is sampling. Using the empirical statistical leverage scores as an importance sampling distribution, the method of algorithmic leveraging samples and rescales data matrices to reduce the data size before performing computations on the subproblem. Existing work has focused on algorithmic issues, but none of it addresses statistical aspects of this method. Here, we provide an effective framework to evaluate the statistical properties of algorithmic leveraging in the context of estimating parameters in a linear regression model. In particular, for several versions of leverage-based sampling, we derive results for the bias and variance. We show that from the statistical perspective of bias and variance, neither leverage-based sampling nor uniform sampling dominates the other. This result is particularly striking, given the well-known result that, from the algorithmic perspective of worst-case analysis, leverage-based sampling provides uniformly superior worst-case algorithmic results, when compared with uniform sampling. Based on these theoretical results, we propose and analyze two new leveraging algorithms: one constructs a smaller least-squares problem with “shrinked” leverage scores (SLEV), and the other solves a smaller and unweighted (or biased) least-squares problem (LEVUNW). The empirical results indicate that our theory is a good predictor of practical performance of existing and new leverage-based algorithms and that the new algorithms achieve improved performance.
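The leverage scores used as a sampling distribution above can be computed, for a small full-column-rank matrix, as the squared row norms of an orthonormal basis of its column space. The sketch below does this with modified Gram-Schmidt; the function names `leverage_scores` and `sampling_probabilities` are illustrative, and this exact computation is only meant to show the definition, not the fast approximations or the SLEV and LEVUNW variants discussed in the abstract.

```python
import math

def leverage_scores(A):
    """Leverage scores of the rows of a full-column-rank n x d matrix A,
    computed as squared row norms of an orthonormal basis Q of its
    column space (modified Gram-Schmidt)."""
    n, d = len(A), len(A[0])
    columns = [[float(A[i][j]) for i in range(n)] for j in range(d)]
    basis = []
    for v in columns:
        for q in basis:
            # Remove the component of v along each earlier basis vector.
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm < 1e-12:
            raise ValueError("A must have full column rank")
        basis.append([a / norm for a in v])
    return [sum(q[i] ** 2 for q in basis) for i in range(n)]

def sampling_probabilities(A):
    """Normalize leverage scores into an importance sampling distribution."""
    scores = leverage_scores(A)
    total = sum(scores)  # equals d when A has full column rank
    return [s / total for s in scores]
```

Rows with probability zero lie outside the influence of the column space entirely, while a uniform distribution corresponds to all rows having equal leverage.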
Iterative row sampling
In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, 2013
Cited by 11 (1 self)
There has been significant interest and progress recently in algorithms that solve regression problems involving tall and thin matrices in input sparsity time. These algorithms find a shorter equivalent of an $n \times d$ matrix where $n \gg d$, which allows one to solve a $\mathrm{poly}(d)$-sized problem instead. In practice, the best performances are often obtained by invoking these routines in an iterative fashion. We show these iterative methods can be adapted to give theoretical guarantees comparable to, and better than, the current state of the art. Our approaches are based on computing the importances of the rows, known as leverage scores, in an iterative manner. We show that alternating between computing a short matrix estimate and finding more accurate approximate leverage scores leads to a series of geometrically smaller instances. This gives an algorithm that runs in $O(\mathrm{nnz}(A) + d^{\omega+\theta}\epsilon^{-2})$ time for any $\theta > 0$, where the $d^{\omega+\theta}$ term is comparable to the cost of solving a regression problem on the small approximation. Our results are built upon the close connection between randomized matrix algorithms, iterative methods, and graph sparsification.