Results 1 - 10 of 43
Random features for large-scale kernel machines
- In Neural Information Processing Systems, 2007
- Cited by 49 (3 self)
To accelerate the training of kernel machines, we propose to map the input data to a randomized low-dimensional feature space and then apply existing fast linear methods. Our randomized features are designed so that the inner products of the transformed data are approximately equal to those in the feature space of a user-specified shift-invariant kernel. We explore two sets of random features, provide convergence bounds on their ability to approximate various radial basis kernels, and show that in large-scale classification and regression tasks linear machine learning algorithms that use these features outperform state-of-the-art large-scale kernel machines.
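As a rough illustration of the idea in this abstract, the sketch below builds random Fourier features for the Gaussian kernel exp(-gamma * ||x - y||^2), one common shift-invariant kernel. The function names, the `gamma` parameterization, and the feature count are assumptions made for this example, not details taken from the paper.

```python
import math
import random


def gaussian_kernel(x, y, gamma):
    """Exact Gaussian kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def make_random_fourier_features(dim, num_features, gamma, seed=0):
    """Return a map z with z(x) . z(y) approximately equal to the Gaussian kernel.

    Frequencies are drawn from N(0, 2 * gamma * I), the Fourier transform of
    the kernel; phases are uniform on [0, 2 * pi).
    """
    rng = random.Random(seed)
    sigma = math.sqrt(2.0 * gamma)
    weights = [[rng.gauss(0.0, sigma) for _ in range(dim)]
               for _ in range(num_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def feature_map(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return feature_map


def inner(u, v):
    """Plain inner product, the only operation a linear method needs."""
    return sum(a * b for a, b in zip(u, v))
```

A linear learner trained on `feature_map(x)` then stands in for the kernel machine; the approximation error of a single kernel value shrinks roughly like one over the square root of the number of features.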
Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes
- Journal of Machine Learning Research, 2006
- Cited by 45 (8 self)
This paper introduces a novel spectral framework for solving Markov decision processes (MDPs) by jointly learning representations and optimal policies. The major components of the framework described in this paper include: (i) a general scheme for constructing representations or basis functions by diagonalizing symmetric diffusion operators; (ii) a specific instantiation of this approach where global basis functions called proto-value functions (PVFs) are formed using the eigenvectors of the graph Laplacian on an undirected graph formed from state transitions induced by the MDP; (iii) a three-phased procedure called representation policy iteration (RPI), comprising a sample collection phase, a representation learning phase that constructs basis functions from samples, and a final parameter estimation phase that determines an (approximately) optimal policy within the (linear) subspace spanned by the (current) basis functions; (iv) a specific instantiation of the RPI framework using least-squares policy iteration (LSPI) as the parameter estimation method; (v) several strategies for scaling the proposed approach to large discrete and continuous state spaces, including the Nyström extension for out-of-sample interpolation of eigenfunctions, and the use of Kronecker sum factorization to construct compact eigenfunctions in product spaces such as factored MDPs; and (vi) a series of illustrative discrete and continuous control tasks, which both illustrate the concepts and provide a benchmark for evaluating the proposed approach. Many challenges remain to be addressed in scaling the proposed framework to large MDPs, and several elaborations of the proposed framework are briefly summarized at the end.
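To make component (ii) concrete, the sketch below uses a toy case that is an assumption of this example rather than a task from the paper: an undirected chain of n states. For that graph the eigenvectors of the combinatorial Laplacian L = D - A have a known closed form (cosines), so the code can check the eigenvector property directly without a numerical eigensolver. All function names are hypothetical.

```python
import math


def chain_laplacian(n):
    """Combinatorial Laplacian L = D - A of an undirected n-state chain."""
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        lap[i][i] += 1.0
        lap[i + 1][i + 1] += 1.0
        lap[i][i + 1] -= 1.0
        lap[i + 1][i] -= 1.0
    return lap


def chain_pvf(n, k):
    """k-th Laplacian eigenvector of the chain: a sampled cosine."""
    return [math.cos(math.pi * k * (i + 0.5) / n) for i in range(n)]


def chain_eigenvalue(n, k):
    """Eigenvalue paired with chain_pvf(n, k)."""
    return 2.0 - 2.0 * math.cos(math.pi * k / n)


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def eigen_residual(n, k):
    """Max-norm of L v - lambda v; close to zero when v is an eigenvector."""
    lap, v, lam = chain_laplacian(n), chain_pvf(n, k), chain_eigenvalue(n, k)
    return max(abs(lv - lam * x) for lv, x in zip(matvec(lap, v), v))
```

The low-order vectors (small k) vary smoothly along the chain, which is the property that makes such eigenvectors usable as a basis for value functions on the graph. For general state graphs a numerical eigensolver would replace the closed form.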
RELATIVE-ERROR CUR MATRIX DECOMPOSITIONS
- SIAM J. MATRIX ANAL. APPL., 2008
- Cited by 21 (7 self)
Many data analysis applications deal with large matrices and involve approximating the matrix using a small number of “components.” Typically, these components are linear combinations of the rows and columns of the matrix, and are thus difficult to interpret in terms of the original features of the input data. In this paper, we propose and study matrix approximations that are explicitly expressed in terms of a small number of columns and/or rows of the data matrix, and thereby more amenable to interpretation in terms of the original data. Our main algorithmic results are two randomized algorithms which take as input an $m \times n$ matrix $A$ and a rank parameter $k$. In our first algorithm, $C$ is chosen, and we let $A' = CC^{+}A$, where $C^{+}$ is the Moore–Penrose generalized inverse of $C$. In our second algorithm $C$, $U$, $R$ are chosen, and we let $A' = CUR$. ($C$ and $R$ are matrices that consist of actual columns and rows, respectively, of $A$, and $U$ is a generalized inverse of their intersection.) For each algorithm, we show that with probability at least $1 - \delta$, $\|A - A'\|_F \le (1 + \epsilon)\|A - A_k\|_F$, where $A_k$ is the “best” rank-$k$ approximation provided by truncating the SVD of $A$, and where $\|X\|_F$ is the Frobenius norm of the matrix $X$. The number of columns of $C$ and rows of $R$ is a low-degree polynomial in $k$, $1/\epsilon$, and $\log(1/\delta)$. Both the Numerical Linear Algebra community and the Theoretical Computer Science community have studied variants
TENSOR-CUR DECOMPOSITIONS FOR TENSOR-BASED DATA
- SIAM J. MATRIX ANAL. APPL., 2008
- Cited by 15 (5 self)
Motivated by numerous applications in which the data may be modeled by a variable subscripted by three or more indices, we develop a tensor-based extension of the matrix CUR decomposition. The tensor-CUR decomposition is most relevant as a data analysis tool when the data consist of one mode that is qualitatively different from the others. In this case, the tensor-CUR decomposition approximately expresses the original data tensor in terms of a basis consisting of underlying subtensors that are actual data elements and thus that have a natural interpretation in terms of the processes generating the data. Assume the data may be modeled as a (2+1)-tensor, i.e., an $m \times n \times p$ tensor $A$ in which the first two modes are similar and the third is qualitatively different. We refer to each of the $p$ different $m \times n$ matrices as “slabs” and each of the $mn$ different $p$-vectors as “fibers.” In this case, the tensor-CUR algorithm computes an approximation to the data tensor $A$ that is of the form $CUR$, where $C$ is an $m \times n \times c$ tensor consisting of a small number $c$ of the slabs, $R$ is an $r \times p$ matrix consisting of a small number $r$ of the fibers, and $U$ is an appropriately defined and easily computed $c \times r$ encoding matrix. Both $C$ and $R$ may be chosen by randomly sampling either slabs or fibers according to a judiciously chosen and data-dependent probability distribution, and both $c$ and $r$ depend on a rank parameter $k$, an error parameter $\epsilon$, and a failure probability $\delta$. Under
Subspace sampling and relative-error matrix approximation: Column-based methods
- In Proc. of the 10th RANDOM, 2006
- Cited by 14 (5 self)
Given an $m \times n$ matrix $A$ and an integer $k$ less than the rank of $A$, the “best” rank-$k$ approximation to $A$ that minimizes the error with respect to the Frobenius norm is $A_k$, which is obtained by projecting $A$ on the top $k$ left singular vectors of $A$. While $A_k$ is routinely used in data analysis, it is difficult to interpret and understand it in terms of the original data, namely the columns and rows of $A$. For example, these columns and rows often come from some application domain, whereas the singular vectors are linear combinations of (up to all) the columns or rows of $A$. We address the problem of obtaining low-rank approximations that are directly interpretable in terms of the original columns or rows of $A$. Our main results are two polynomial time randomized algorithms that take as input a matrix $A$ and return as output a matrix $C$, consisting of a “small” (i.e., a low-degree polynomial in $k$, $1/\epsilon$, and $\log(1/\delta)$) number of actual columns of $A$ such that $\|A - CC^{+}A\|_F \le (1 + \epsilon)\|A - A_k\|_F$ with probability at least $1 - \delta$. Our algorithms are simple, and they take time of the order of the time needed to compute the top $k$ right singular vectors of $A$. In addition, they sample the columns of $A$ via the method of “subspace sampling,” so-named since the sampling probabilities depend on the lengths of the rows of the top singular vectors and since they ensure that we capture entirely a certain subspace of interest.
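The column-based approximation $CC^{+}A$ can be sketched in a few lines, since $CC^{+}$ is the orthogonal projector onto the span of the chosen columns. The sketch below computes that projection with Gram-Schmidt. Note the swapped-in sampling rule: it draws columns with probability proportional to their squared norms, a simpler scheme assumed here for illustration, and not the subspace sampling described in the abstract (which needs the top singular vectors). All names are hypothetical.

```python
import random


def column(a, j):
    return [row[j] for row in a]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def orthonormal_basis(vectors, tol=1e-12):
    """Modified Gram-Schmidt; drops vectors already in the span."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(q, w)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def project_onto_columns(a, col_indices):
    """Return C C^+ A, where C holds the columns of A listed in col_indices."""
    basis = orthonormal_basis([column(a, j) for j in col_indices])
    m, n = len(a), len(a[0])
    out = [[0.0] * n for _ in range(m)]
    for j in range(n):
        a_j = column(a, j)
        for q in basis:
            c = dot(q, a_j)
            for i in range(m):
                out[i][j] += c * q[i]
    return out


def sample_columns_by_norm(a, count, seed=0):
    """Assumed stand-in sampler: probability proportional to squared column norm."""
    rng = random.Random(seed)
    n = len(a[0])
    weights = [dot(column(a, j), column(a, j)) for j in range(n)]
    return rng.choices(range(n), weights=weights, k=count)


def frobenius_error(a, b):
    return sum((x - y) ** 2
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b)) ** 0.5
```

Calling `frobenius_error(a, project_onto_columns(a, sample_columns_by_norm(a, count)))` gives the quantity that the abstract's bound compares against the best rank-$k$ error; the relative-error guarantee itself applies only to the paper's sampling probabilities, not to this stand-in.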
Conditional random sampling: A sketch-based sampling technique for sparse data
- In NIPS, 2006
- Cited by 14 (8 self)
We develop Conditional Random Sampling (CRS), a technique particularly suitable for sparse data. In large-scale applications, the data are often highly sparse. CRS combines sketching and sampling in that it converts sketches of the data into conditional random samples online in the estimation stage, with the sample size determined retrospectively. This paper focuses on approximating pairwise $l_2$ and $l_1$ distances and comparing CRS with random projections. For boolean (0/1) data, CRS is provably better than random projections. We show using real-world data that CRS often outperforms random projections. This technique can be applied in learning, data mining, information retrieval, and database query optimizations.
Fast Approximate Spectral Clustering
- 2009
- Cited by 13 (0 self)
Spectral clustering refers to a flexible class of clustering procedures that can produce high-quality clusterings on small data sets but which has limited applicability to large-scale problems due to its computational complexity of $O(n^3)$, with $n$ the number of data points. We extend the range of spectral clustering by developing a general framework for fast approximate spectral clustering in which a distortion-minimizing local transformation is first applied to the data. This framework is based on a theoretical analysis that provides a statistical characterization of the effect of local distortion on the mis-clustering rate. We develop two concrete instances of our general framework, one based on local k-means clustering (KASP) and one based on random projection trees (RASP). Extensive experiments show that these algorithms can achieve significant speedups with little degradation in clustering accuracy. Specifically, our algorithms outperform k-means by a large margin in terms of accuracy, and run several times faster than approximate spectral clustering based on the Nyström method, with comparable accuracy and significantly smaller memory footprint. Remarkably, our algorithms make it possible for a single machine to spectral cluster data sets with a million observations within several minutes.
Modeling Transfer Relationships Between Learning Tasks for Improved Inductive Transfer
- Cited by 11 (4 self)
In this paper, we propose a novel graph-based method for knowledge transfer. We model the transfer relationships between source tasks by embedding the set of learned source models in a graph using transferability as the metric. Transfer to a new problem proceeds by mapping the problem into the graph, then learning a function on this graph that automatically determines the parameters to transfer to the new learning task. This method is analogous to inductive transfer along a manifold that captures the transfer relationships between the tasks. We demonstrate improved transfer performance using this method against existing approaches in several real-world domains.
Large-Scale Manifold Learning
- Cited by 11 (3 self)
This paper examines the problem of extracting low-dimensional manifold structure given millions of high-dimensional face images. Specifically, we address the computational challenges of nonlinear dimensionality reduction via Isomap and Laplacian Eigenmaps, using a graph containing about 18 million nodes and 65 million edges. Since most manifold learning techniques rely on spectral decomposition, we first analyze two approximate spectral decomposition techniques for large dense matrices (Nyström and Column-sampling), providing the first direct theoretical and empirical comparison between these techniques. We next show extensive experiments on learning low-dimensional embeddings for two large face datasets: CMU-PIE (35 thousand faces) and a web dataset (18 million faces). Our comparisons show that the Nyström approximation is superior to the Column-sampling method. Furthermore, approximate Isomap tends to perform better than Laplacian Eigenmaps on both clustering and classification with the labeled CMU-PIE dataset.
Compact spectral bases for value function approximation using Kronecker factorization
- In Proceedings of the National Conference on Artificial Intelligence (AAAI), 2007
- Cited by 9 (5 self)
A new spectral approach to value function approximation has recently been proposed to automatically construct basis functions from samples. Global basis functions called proto-value functions are generated by diagonalizing a diffusion operator, such as a reversible random walk or the Laplacian, on a graph formed from connecting nearby samples. This paper addresses the challenge of scaling this approach to large domains. We propose using Kronecker factorization coupled with the Metropolis-Hastings algorithm to decompose reversible transition matrices. The result is that the basis functions can be computed on much smaller matrices and combined to form the overall bases. We demonstrate that in several continuous Markov decision processes, compact basis functions can be constructed without significant loss in performance. In one domain, basis functions were compressed by a factor of 36. A theoretical analysis relates the quality of the approximation to the spectral gap. Our approach generalizes to other basis constructions as well.
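The reason basis functions can be computed on smaller matrices and then combined rests on a standard linear-algebra fact: if u is an eigenvector of P1 and v is an eigenvector of P2, then the Kronecker product of u and v is an eigenvector of the Kronecker product of P1 and P2, with the eigenvalues multiplied. The sketch below checks this on two hypothetical 2-state transition matrices chosen for this example; the paper's procedure for finding the factors of a given transition matrix is not shown.

```python
def kron_matrix(a, b):
    """Kronecker product of two matrices given as nested lists."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def kron_vector(u, v):
    """Kronecker product of two vectors, ordered to match kron_matrix."""
    return [x * y for x in u for y in v]


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


# Hypothetical factor chains: symmetric (hence reversible) 2-state walks.
P1 = [[0.9, 0.1], [0.1, 0.9]]   # eigenvector [1, -1], eigenvalue 0.8
P2 = [[0.7, 0.3], [0.3, 0.7]]   # eigenvector [1, -1], eigenvalue 0.4

big = kron_matrix(P1, P2)               # 4 x 4 transition matrix
basis = kron_vector([1.0, -1.0], [1.0, -1.0])
image = matvec(big, basis)              # equals 0.8 * 0.4 times basis
```

Here two 2 x 2 eigenproblems replace one 4 x 4 eigenproblem; the saving grows with the size of the factors.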

