Results 1–10 of 68
Hogwild!: A lock-free approach to parallelizing stochastic gradient descent
, 2011
Cited by 143 (7 self)
Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called HOGWILD! which allows processors access to shared memory with the possibility of overwriting each other’s work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then HOGWILD! achieves a nearly optimal rate of convergence. We demonstrate experimentally that HOGWILD! outperforms alternative schemes that use locking by an order of magnitude.
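The lock-free scheme can be sketched in a few lines (a toy illustration, not the paper's code; the quadratic objective, step size, and thread count are our own assumptions): worker threads apply sparse SGD updates to shared parameters with no locking at all, and because each update touches a single coordinate, races rarely destroy progress.

```python
import random
import threading

# Minimize f(x) = sum_i (x_i - t_i)^2 with unlocked, racy updates.
TARGET = [3.0, -1.0, 2.0, 0.5]
x = [0.0] * len(TARGET)                  # shared state, no lock
STEP = 0.1

def worker(n_updates, rng):
    for _ in range(n_updates):
        i = rng.randrange(len(x))        # sparse update: one coordinate
        grad_i = 2.0 * (x[i] - TARGET[i])
        x[i] -= STEP * grad_i            # racy read-modify-write

threads = [threading.Thread(target=worker, args=(2000, random.Random(s)))
           for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

loss = sum((xi - ti) ** 2 for xi, ti in zip(x, TARGET))
```

With locking, every update would serialize on a mutex; here the sparsity of the updates keeps collisions harmless, which is exactly the regime the convergence analysis covers.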
Large-scale Matrix Factorization with Distributed Stochastic Gradient Descent
 In KDD
, 2011
Cited by 68 (7 self)
We provide a novel algorithm to approximately factor large matrices with millions of rows, millions of columns, and billions of nonzero elements. Our approach rests on stochastic gradient descent (SGD), an iterative stochastic optimization algorithm. Based on a novel “stratified” variant of SGD, we obtain a new matrix-factorization algorithm, called DSGD, that can be fully distributed and run on web-scale datasets using, e.g., MapReduce. DSGD can handle a wide variety of matrix factorizations and has good scalability properties.
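A rough sketch of the "stratified" SGD idea (assumed details and made-up sizes, not the authors' implementation): rows and columns are split into d blocks, and each sub-epoch visits a diagonal stratum of d row-block/column-block pairs. Pairs within a stratum share no rows or columns, so in DSGD they run on different workers; here we simply loop over them.

```python
import numpy as np

rng = np.random.default_rng(0)
d, rows, cols, rank = 2, 6, 6, 2
V = rng.random((rows, rank)) @ rng.random((rank, cols))  # rank-2 data
W = 0.5 * rng.random((rows, rank))
H = 0.5 * rng.random((cols, rank))
lr = 0.05
rblocks = np.array_split(np.arange(rows), d)
cblocks = np.array_split(np.arange(cols), d)

for epoch in range(300):
    for s in range(d):                   # one sub-epoch per stratum
        for b in range(d):               # independent block pairs
            for i in rblocks[b]:
                for j in cblocks[(b + s) % d]:
                    e = V[i, j] - W[i] @ H[j]
                    W[i], H[j] = W[i] + lr * e * H[j], H[j] + lr * e * W[i]

rmse = float(np.sqrt(np.mean((V - W @ H.T) ** 2)))
```

The interchangeability of block pairs inside a stratum is what lets DSGD run the inner loop on separate machines without conflicting updates.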
Dynamic anomalography: Tracking network anomalies via sparsity and low
 Topics in Sig. Proc
, 2013
Scalable Coordinate Descent Approaches to Parallel Matrix Factorization for Recommender Systems
Cited by 22 (1 self)
Abstract—Matrix factorization, when the matrix has missing values, has become one of the leading techniques for recommender systems. To handle web-scale datasets with millions of users and billions of ratings, scalability becomes an important issue. Alternating Least Squares (ALS) and Stochastic Gradient Descent (SGD) are two popular approaches to compute matrix factorization. There has been a recent flurry of activity to parallelize these algorithms. However, due to the cubic time complexity in the target rank, ALS is not scalable to large-scale datasets. On the other hand, SGD conducts efficient updates but usually suffers from slow convergence that is sensitive to the parameters. Coordinate descent, a classical optimization approach, has been used for many other large-scale problems, but its application to matrix factorization for recommender systems has not been explored thoroughly. In this paper, we show that coordinate descent based methods have a more efficient update rule compared to ALS, and are faster and have more stable convergence than SGD. We study different update sequences and propose the CCD++ algorithm, which updates rank-one factors one by one. In addition, CCD++ can be easily parallelized on both multi-core and distributed systems. We empirically show that CCD++ is much faster than ALS and SGD in both settings. As an example, on a synthetic dataset with 2 billion ratings, CCD++ is 4 times faster than both SGD and ALS using a distributed system with 20 machines.
Keywords—Recommender systems, Matrix factorization, Low-rank approximation, Parallelization.
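An illustrative sketch of the rank-one update order in CCD++ (assumed simplifications: fully observed matrix, no parallelism, made-up sizes): each outer pass peels factor t out of the residual R, refits u_t and v_t by closed-form coordinate updates, and folds the result back in.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k, reg = 8, 7, 3, 1e-3
A = rng.random((m, k)) @ rng.random((k, n))    # exactly rank-k data
U = 0.1 * rng.random((m, k))
V = 0.1 * rng.random((n, k))

R = A - U @ V.T
for _ in range(30):                 # outer passes
    for t in range(k):              # one rank-one factor at a time
        R += np.outer(U[:, t], V[:, t])        # put factor t back
        for _ in range(3):          # alternating closed-form updates
            U[:, t] = R @ V[:, t] / (V[:, t] @ V[:, t] + reg)
            V[:, t] = R.T @ U[:, t] / (U[:, t] @ U[:, t] + reg)
        R -= np.outer(U[:, t], V[:, t])        # peel it out again

err = float(np.linalg.norm(R) / np.linalg.norm(A))
```

Unlike ALS, no k×k linear system is solved per row; each rank-one refit is a cheap closed-form ratio, which is the efficiency argument the abstract makes.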
Conditional gradient algorithms for norm-regularized smooth convex optimization
, 2013
Cited by 21 (6 self)
Motivated by some applications in signal processing and machine learning, we consider two convex optimization problems where, given a cone K, a norm ‖·‖ and a smooth convex function f, we want either 1) to minimize the norm over the intersection of the cone and a level set of f, or 2) to minimize over the cone the sum of f and a multiple of the norm. We focus on the case where (a) the dimension of the problem is too large to allow for interior point algorithms, and (b) ‖·‖ is “too complicated” to allow for the computationally cheap Bregman projections required in first-order proximal gradient algorithms. On the other hand, we assume that it is relatively easy to minimize linear forms over the intersection of K and the unit ‖·‖-ball. Motivating examples are given by the nuclear norm with K being the entire space of matrices, or the positive semidefinite cone in the space of symmetric matrices, and the Total Variation norm on the space of 2D images. We discuss versions of the Conditional Gradient algorithm capable of handling our problems of interest, provide the related theoretical efficiency estimates, and outline some applications.
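A generic Conditional Gradient (Frank-Wolfe) sketch in the spirit of this setting (our own toy instance, not the paper's algorithm): minimize a smooth f over the unit nuclear-norm ball, where the linear minimization oracle only needs the top singular vector pair rather than a full projection.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 5, 4
u_true = rng.standard_normal(m); u_true /= np.linalg.norm(u_true)
v_true = rng.standard_normal(n); v_true /= np.linalg.norm(v_true)
Xstar = 0.8 * np.outer(u_true, v_true)      # inside the unit ball
f = lambda X: 0.5 * np.sum((X - Xstar) ** 2)

X = np.zeros((m, n))
for t in range(200):
    G = X - Xstar                           # gradient of f at X
    U, _, Vt = np.linalg.svd(G)
    S = -np.outer(U[:, 0], Vt[0])           # LMO over the nuclear ball
    gamma = 2.0 / (t + 2)                   # standard FW step size
    X = (1 - gamma) * X + gamma * S

gap = float(f(X))
```

The standard O(1/t) rate bounds the suboptimality here by roughly 2LD²/(t+2); the point of the LMO formulation is that a single leading singular pair replaces the expensive Bregman projection.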
Blind Deconvolution using Convex Programming
, 2012
Cited by 18 (1 self)
We consider the problem of recovering two unknown vectors, w and x, of length L from their circular convolution. We make the structural assumption that the two vectors are members of known subspaces, one with dimension N and the other with dimension K. Although the observed convolution is nonlinear in both w and x, it is linear in the rank-1 matrix formed by their outer product wx∗. This observation allows us to recast the deconvolution problem as a low-rank matrix recovery problem from linear measurements, whose natural convex relaxation is a nuclear norm minimization program. We prove the effectiveness of this relaxation by showing that for “generic” signals, the program can deconvolve w and x exactly when the maximum of N and K is almost on the order of L. That is, we show that if x is drawn from a random subspace of dimension N, and w is a vector in a subspace of dimension K whose basis vectors are “spread out” in the frequency domain, then nuclear norm minimization recovers wx∗ without error. We discuss this result in the context of blind channel estimation in communications. If we have a message of length N which we code using a random L × N coding matrix, and the encoded message travels through an unknown linear time-invariant channel of maximum length K, then the receiver can recover both the channel response and the message when L ≳ N + K, to within constant and log factors.
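The key lifting observation can be checked numerically (our own illustration, real-valued for simplicity so the conjugate in wx∗ plays no role): circular convolution, though bilinear in (w, x), is a fixed linear function of the rank-1 matrix X = wxᵀ, since y[n] sums the entries X[i][j] with i + j ≡ n (mod L).

```python
L = 5
w = [1.0, 2.0, 0.0, -1.0, 3.0]
x = [0.5, -2.0, 1.0, 4.0, 0.0]

# direct circular convolution
y = [sum(w[k] * x[(n - k) % L] for k in range(L)) for n in range(L)]

# the same measurements read off the outer-product matrix X = w x^T
X = [[wi * xj for xj in x] for wi in w]
y_lifted = [sum(X[i][j] for i in range(L) for j in range(L)
                if (i + j) % L == n) for n in range(L)]
```

Because each measurement is linear in X, recovering the rank-1 matrix (and hence w and x up to scale) becomes a low-rank matrix recovery problem, which the nuclear-norm program relaxes.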
Inexact Coordinate Descent: Complexity and Preconditioning
, 2013
Cited by 18 (3 self)
In this paper we consider the problem of minimizing a convex function using a randomized block coordinate descent method. One of the key steps at each iteration of the algorithm is determining the update to a block of variables. Existing algorithms assume that in order to compute the update, a particular subproblem is solved exactly. In this work we relax this requirement and allow for the subproblem to be solved inexactly, leading to an inexact block coordinate descent method. Our approach incorporates the best known results for exact updates as a special case. Moreover, these theoretical guarantees are complemented by practical considerations: the use of iterative techniques to determine the update, as well as the use of preconditioning for further acceleration.
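A sketch of the inexact idea on a toy quadratic (our own instance, not the paper's method): at each iteration a random block is chosen, and its subproblem is only solved approximately, here by a few gradient steps on the block instead of the exact linear solve.

```python
import numpy as np

rng = np.random.default_rng(3)
n, bsize = 8, 2
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)          # well-conditioned SPD Hessian
b = rng.standard_normal(n)
x_star = np.linalg.solve(Q, b)       # reference exact minimizer
f = lambda x: 0.5 * x @ Q @ x - b @ x

x = np.zeros(n)
blocks = [np.arange(i, i + bsize) for i in range(0, n, bsize)]
step = 1.0 / np.linalg.norm(Q, 2)
for _ in range(400):
    B = blocks[rng.integers(len(blocks))]
    for _ in range(3):               # inexact inner solve: 3 steps only
        grad_B = Q[B] @ x - b[B]     # block gradient (Qx - b) on B
        x[B] -= step * grad_B

gap = float(f(x) - f(x_star))
```

The exact method would solve the 2×2 block system each time; the inexact variant trades that for cheaper inner iterations, which is where preconditioning of the subproblem pays off.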
Giannakis, “Robust PCA as bilinear decomposition with outlier-sparsity regularization”
 IEEE Trans. Signal Process
, 2012
Cited by 17 (4 self)
Abstract—Principal component analysis (PCA) is widely used for dimensionality reduction, with well-documented merits in various applications involving high-dimensional data, including computer vision, preference measurement, and bioinformatics. In this context, the fresh look advocated here permeates benefits from variable selection and compressive sampling, to robustify PCA against outliers. A least-trimmed squares estimator of a low-rank bilinear factor analysis model is shown to be closely related to that obtained from an ℓ0-(pseudo)norm-regularized criterion encouraging sparsity in a matrix explicitly modeling the outliers. This connection suggests robust PCA schemes based on convex relaxation, which lead naturally to a family of robust estimators encompassing Huber’s optimal M-class as a special case. Outliers are identified by tuning a regularization parameter, which amounts to controlling sparsity of the outlier matrix along the whole robustification path of (group) least-absolute shrinkage and selection operator (Lasso) solutions. Beyond its ties to robust statistics, the developed outlier-aware PCA framework is versatile enough to accommodate novel and scalable algorithms to: i) track the low-rank signal subspace robustly, as new data are acquired in real time; and ii) determine principal components robustly in (possibly) infinite-dimensional feature spaces. Synthetic and real data tests corroborate the effectiveness of the proposed robust PCA schemes, when used to identify aberrant responses in personality assessment surveys, as well as unveil communities in social networks, and intruders from video surveillance data.
Index Terms—(Group) Lasso, outlier rejection, principal component analysis, robust statistics, sparsity.
Low-rank matrix factorization for deep neural network training with high-dimensional output targets
 in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing
, 2013
Cited by 11 (0 self)
While Deep Neural Networks (DNNs) have achieved tremendous success for large vocabulary continuous speech recognition (LVCSR) tasks, training of these networks is slow. One reason is that DNNs are trained with a large number of training parameters (i.e., 10–50 million). Because networks are trained with a large number of output targets to achieve good performance, the majority of these parameters are in the final weight layer. In this paper, we propose a low-rank matrix factorization of the final weight layer. We apply this low-rank technique to DNNs for both acoustic modeling and language modeling. We show on three different LVCSR tasks ranging between 50–400 hrs that a low-rank factorization reduces the number of parameters of the network by 30–50%. This results in roughly an equivalent reduction in training time, without a significant loss in final recognition accuracy, compared to a full-rank representation.
Index Terms—Deep Neural Networks, Speech Recognition
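A back-of-the-envelope illustration of the parameter saving (layer sizes here are assumptions for illustration, not taken from the paper): replacing a full final layer W of shape hidden × outputs by two factors W ≈ UV with small rank r cuts its parameter count whenever r·(hidden + outputs) < hidden·outputs.

```python
import numpy as np

hidden, outputs, r = 1024, 6000, 256
full_params = hidden * outputs
lowrank_params = r * (hidden + outputs)
saving = 1.0 - lowrank_params / full_params    # fraction of weights cut

# the factored layer still produces the same shape of output
rng = np.random.default_rng(5)
U = rng.standard_normal((hidden, r))
V = rng.standard_normal((r, outputs))
h = rng.standard_normal(hidden)                # a hidden activation
logits = (h @ U) @ V                           # two cheap matmuls
```

With these (hypothetical) sizes the final layer shrinks by roughly 70%, consistent in spirit with the 30–50% whole-network reductions reported above, since the final layer dominates the parameter count.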
FlexiFaCT: Scalable Flexible Factorization of Coupled Tensors on
Cited by 8 (1 self)
Given multiple data sets of relational data that share a number of dimensions, how can we efficiently decompose our data into the latent factors? Factorization of a single matrix or tensor has attracted much attention, as, e.g., in the Netflix challenge, with users rating movies. However, we often have additional side information, e.g., demographic data about the users in the Netflix example above. Incorporating the additional information leads to the coupled factorization problem. So far, it has been solved only for relatively small datasets. We provide a distributed, scalable method for decomposing matrices, tensors, and coupled data sets through stochastic gradient descent on a variety of objective functions. We offer the following contributions: (1) Versatility: our algorithm can perform matrix, tensor, and coupled factorization, with flexible objective functions including the Frobenius norm, the Frobenius norm with an ℓ1-induced sparsity, and non-negative factorization. (2) Scalability: FlexiFaCT scales to unprecedented sizes in both the data and model, with up to billions of parameters. FlexiFaCT runs on standard Hadoop. (3) Convergence proofs showing that FlexiFaCT converges on the variety of objective functions, even with projections.
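A toy sketch of coupled factorization by SGD (our own minimal instance with made-up sizes, not the FlexiFaCT system): a ratings matrix R and a side-information matrix D share the user factor U, and both are fit jointly by SGD on the summed Frobenius objective, the simplest of the flexible objectives mentioned above.

```python
import numpy as np

rng = np.random.default_rng(6)
users, items, feats, r = 8, 6, 4, 2
U_t = rng.random((users, r))
R = U_t @ rng.random((r, items))         # ratings, rank-r by design
D = U_t @ rng.random((r, feats))         # side info, same user factor

U = 0.5 * rng.random((users, r))
M = 0.5 * rng.random((items, r))
C = 0.5 * rng.random((feats, r))
lr = 0.05
for _ in range(8000):
    i = rng.integers(users)
    j = rng.integers(items)
    e = R[i, j] - U[i] @ M[j]                            # ratings cell
    U[i], M[j] = U[i] + lr * e * M[j], M[j] + lr * e * U[i]
    k = rng.integers(feats)
    e = D[i, k] - U[i] @ C[k]                            # coupled cell
    U[i], C[k] = U[i] + lr * e * C[k], C[k] + lr * e * U[i]

err = float(np.linalg.norm(R - U @ M.T) + np.linalg.norm(D - U @ C.T))
```

Because every update touches only one cell's factors, the same per-cell SGD step distributes across workers in the stratified style the paper builds on.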