Distributed Large-Scale Natural Graph Factorization
WWW 2013
Cited by 5 (1 self)
Natural graphs, such as social networks, email graphs, or instant messaging patterns, have become pervasive through the internet. These graphs are massive, often containing hundreds of millions of nodes and billions of edges. While some theoretical models have been proposed to study such graphs, their analysis is still difficult due to the scale and nature of the data. We propose a framework for large-scale graph decomposition and inference. To resolve the scale, our framework is distributed so that the data are partitioned over a shared-nothing set of machines. We propose a novel factorization technique that relies on partitioning a graph so as to minimize the number of neighboring vertices rather than edges across partitions. Our decomposition is based on a streaming algorithm. It is network-aware as it adapts to the network topology of the underlying computational hardware. We use local copies of the variables and an efficient asynchronous communication protocol to synchronize the replicated values in order to perform most of the computation without having to incur the cost of network communication. On a graph of 200 million vertices and 10 billion edges, derived from an email communication network, our algorithm retains convergence properties while allowing for almost linear scalability in the number of computers.
Sparkler: Supporting large-scale matrix factorization
In EDBT, 2013
Cited by 5 (2 self)
Low-rank matrix factorization has recently been applied with great success on matrix completion problems for applications like recommendation systems, link predictions for social networks, and click prediction for web search. However, as this approach is applied to increasingly larger datasets, such as those encountered in web-scale recommender systems like Netflix and Pandora, the data management aspects quickly become challenging and form a roadblock. In this paper, we introduce a system called Sparkler to solve such large instances of low-rank matrix factorizations. Sparkler extends Spark, an existing platform for running parallel iterative algorithms on datasets that fit in the aggregate main memory of a cluster. Sparkler supports distributed stochastic gradient descent as an approach to solving the factorization problem, an iterative technique that has been shown to perform very well in practice. We identify the shortfalls of Spark in solving large matrix factorization problems, especially when running on the cloud, and solve this by introducing a novel abstraction called “Carousel Maps” (CMs). CMs are well suited to storing large matrices in the aggregate memory of a cluster and can efficiently support the operations performed on them during distributed stochastic gradient descent. We describe the design, implementation, and the use of CMs in Sparkler programs. Through a variety of experiments, we demonstrate that Sparkler is faster than Spark by 4x to 21x, with bigger advantages for larger problems. Equally importantly, we show that this can be done without imposing any changes to the ease of programming. We argue that Sparkler provides a convenient and efficient extension to Spark for solving matrix factorization problems on very large datasets.
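To make the stochastic gradient descent approach to low-rank factorization concrete, here is a minimal single-machine sketch in plain Python. It does not use Spark or Carousel Maps and is not the Sparkler implementation; the function names `sgd_factorize` and `rmse`, the hyperparameters, and the toy ratings in the usage note are all assumptions made for illustration.

```python
import random


def sgd_factorize(ratings, n_rows, n_cols, rank=2, lr=0.05, reg=0.01,
                  epochs=200, seed=0):
    """Fit ratings[(i, j)] ~ dot(W[i], H[j]) by SGD over the observed entries.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    # Small positive initial factors, so the first updates move away from zero.
    W = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_rows)]
    H = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_cols)]
    entries = list(ratings.items())
    for _ in range(epochs):
        rng.shuffle(entries)  # visit the observed entries in random order
        for (i, j), r in entries:
            err = r - sum(w * h for w, h in zip(W[i], H[j]))
            for k in range(rank):
                w_ik, h_jk = W[i][k], H[j][k]  # use the old values for both updates
                W[i][k] += lr * (err * h_jk - reg * w_ik)
                H[j][k] += lr * (err * w_ik - reg * h_jk)
    return W, H


def rmse(ratings, W, H):
    """Root-mean-square error over the observed entries."""
    sq = [(r - sum(w * h for w, h in zip(W[i], H[j]))) ** 2
          for (i, j), r in ratings.items()]
    return (sum(sq) / len(sq)) ** 0.5
```

Calling `sgd_factorize({(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0}, 2, 2)` returns the two factor lists; each update touches only row `i` of `W` and row `j` of `H`, which is the locality that distributed variants of this method rely on.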
Block Stochastic Gradient Iteration for Convex and Nonconvex Optimization
2015
Cited by 5 (0 self)
The stochastic gradient (SG) method can quickly solve a problem with a large number of components in the objective, or a stochastic optimization problem, to a moderate accuracy. The block coordinate descent/update (BCD) method, on the other hand, can quickly solve problems with multiple (blocks of) variables. This paper introduces a method that combines the strengths of SG and BCD for problems with many components in the objective and with multiple (blocks of) variables. This paper proposes a block SG (BSG) method for both convex and nonconvex programs. BSG generalizes SG by updating all the blocks of variables in a Gauss–Seidel fashion (the update of the current block depends on the previously updated blocks), in either a fixed or randomly shuffled order. Although BSG does slightly more work at each iteration, it typically outperforms SG because of BSG’s Gauss–Seidel updates and larger step sizes, the latter of which are determined by the smaller per-block Lipschitz constants. The convergence of BSG is established for both convex and nonconvex cases. In the convex case, BSG has the same order of convergence rate as SG. In the nonconvex case, its convergence is established in terms of the expected violation of a first-order optimality condition. In both cases our analysis is nontrivial since the typical unbiasedness assumption no longer holds. BSG is numerically evaluated on the following problems: stochastic least squares and logistic regression, which are convex, and low-rank tensor recovery and bilinear logistic regression, which are nonconvex. On the convex problems, BSG performed significantly better than SG. On the nonconvex problems, BSG significantly outperformed the deterministic BCD method because the latter tends to stagnate early near local minimizers. Overall, BSG inherits the benefits of both SG approximation and block coordinate updates and is especially useful for solving large-scale nonconvex problems.
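The Gauss–Seidel block update described above can be sketched on a stochastic least-squares problem. This is a simplified illustration under stated assumptions, not the paper's algorithm: it draws a single sample per iteration, uses one constant step size for every block, and the function name `block_sg_least_squares` and its parameters are hypothetical.

```python
import random


def block_sg_least_squares(samples, dim, blocks, step=0.1, iters=5000,
                           shuffle_blocks=False, seed=0):
    """Minimize the average of 0.5 * (a . x - b)^2 over samples of (a, b).

    `blocks` is a list of index lists that partition range(dim).
    """
    rng = random.Random(seed)
    x = [0.0] * dim
    order = list(range(len(blocks)))
    for _ in range(iters):
        a, b = rng.choice(samples)  # one stochastic sample per iteration
        if shuffle_blocks:
            rng.shuffle(order)      # randomly shuffled block order
        for idx in order:
            # The residual is recomputed with the blocks already updated in
            # this iteration; that dependence is the Gauss-Seidel part.
            residual = sum(ai * xi for ai, xi in zip(a, x)) - b
            for k in blocks[idx]:
                x[k] -= step * residual * a[k]  # partial stochastic gradient step
    return x
```

With `blocks=[[0, 1], [2, 3]]` the second block sees the first block's fresh values within the same iteration, whereas plain SG would update all four coordinates from a single residual.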
Efficient Distributed Topic Modeling with Provable Guarantees
Cited by 5 (4 self)
Topic modeling for large-scale distributed web collections requires distributed techniques that account for both computational and communication costs. We consider topic modeling under the separability assumption and develop novel computationally efficient methods that provably achieve the statistical performance of the state-of-the-art centralized approaches while requiring insignificant communication between the distributed document collections. We achieve tradeoffs between communication and computation without actually transmitting the documents. Our scheme is based on exploiting the geometry of the normalized word-word co-occurrence matrix and viewing each row of this matrix as a vector in a high-dimensional space. We relate the solid angle subtended by extreme points of the convex hull of these vectors to topic identities and construct distributed schemes to identify topics.
Exploiting bounded staleness to speed up Big Data analytics
Cited by 5 (4 self)
Many modern machine learning (ML) algorithms are iterative, converging on a final solution via many iterations over the input data. This paper explores approaches to exploiting these algorithms’ convergent nature to improve performance, by allowing parallel and distributed threads to use loose consistency models for shared algorithm state. Specifically, we focus on bounded staleness, in which each thread can see a view of the current intermediate solution that may be a limited number of iterations out-of-date. Allowing staleness reduces communication costs (batched updates and cached reads) and synchronization (less waiting for locks or straggling threads). One approach is to increase the number of iterations between barriers in the oft-used Bulk Synchronous Parallel (BSP) model of parallelizing, which mitigates these costs when all threads proceed at the same speed. A more flexible approach, called Stale Synchronous Parallel (SSP), avoids barriers and allows threads to be a bounded number of iterations ahead of the current slowest thread. Extensive experiments with ML algorithms for topic modeling, collaborative filtering, and PageRank show that both approaches significantly increase convergence speeds, behaving similarly when there are no stragglers, but SSP outperforms BSP in the presence of stragglers.
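The bounded-staleness rule can be illustrated with a small single-process simulation. This is a sketch under assumptions, not the system evaluated in the paper: the class `SSPClock`, the function `simulate`, and the convention that a worker may start its next iteration only while its count of completed iterations is at most `staleness` ahead of the slowest worker are all introduced here for illustration.

```python
class SSPClock:
    """Tracks completed iterations per worker and enforces a staleness bound."""

    def __init__(self, n_workers, staleness):
        self.clocks = [0] * n_workers
        self.staleness = staleness

    def can_start_next(self, worker):
        # Assumed convention: proceed only if at most `staleness` completed
        # iterations ahead of the slowest worker. staleness=0 behaves like
        # a barrier after every iteration.
        return self.clocks[worker] - min(self.clocks) <= self.staleness

    def finish_iteration(self, worker):
        self.clocks[worker] += 1


def simulate(n_workers, staleness, ticks, straggler_period=4):
    """Worker 0 is a straggler that only gets a turn every `straggler_period` ticks."""
    table = SSPClock(n_workers, staleness)
    max_gap = 0
    for t in range(ticks):
        for w in range(n_workers):
            if w == 0 and t % straggler_period != 0:
                continue  # the straggler skips this tick
            if table.can_start_next(w):
                table.finish_iteration(w)
        max_gap = max(max_gap, max(table.clocks) - min(table.clocks))
    return table.clocks, max_gap
```

Under this convention the gap between the fastest and slowest worker never exceeds `staleness + 1` completed iterations, so a larger `staleness` lets fast workers keep computing while a straggler catches up.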
A distributed algorithm for large-scale generalized matching
PVLDB
Cited by 4 (1 self)
Generalized matching problems arise in a number of applications, including computational advertising, recommender systems, and trade markets. Consider, for example, the problem of recommending multimedia items (e.g., DVDs) to users such that (1) users are recommended items that they are likely to be interested in, (2) every user gets neither too few nor too many recommendations, and (3) only items available in stock are recommended to users. State-of-the-art matching algorithms fail at coping with large real-world instances, which may involve millions of users and items. We propose the first distributed algorithm for computing near-optimal solutions to large-scale generalized matching problems like the one above. Our algorithm is designed to run on a small cluster of commodity nodes (or in a MapReduce environment), has strong approximation guarantees, and requires only a polylogarithmic number of passes over the input. In particular, we propose a novel distributed algorithm to approximately solve mixed packing-covering linear programs, which include but are not limited to generalized matching problems. Experiments on real-world and synthetic data suggest that a practical variant of our algorithm scales to very large problem sizes and can be orders of magnitude faster than alternative approaches.
L.: Classification of sparse time series via supervised matrix factorization
2012
Cited by 4 (4 self)
Data sparsity is an emerging real-world problem observed in various domains ranging from sensor networks to medical diagnosis. Consequently, numerous machine learning methods have been designed to treat missing values. Nevertheless, sparsity, defined as missing segments, has not been thoroughly investigated in the context of time-series classification. We propose a novel principle for classifying time series, which, in contrast to existing approaches, avoids reconstructing the missing segments in time series and operates solely on the observed ones. Based on the proposed principle, we develop a method that avoids the noise incurred during the reconstruction of the original time series. Our method adapts supervised matrix factorization by projecting time series into a latent space through stochastic learning. Furthermore, the projected data is built in a supervised fashion via a logistic regression. Extensive experiments on a large collection of 37 data sets demonstrate the superiority of our method, which in the majority of cases outperforms a set of baselines that do not follow our proposed principle.
On model parallelization and scheduling strategies for distributed machine learning
In NIPS, 2014
Cited by 4 (4 self)
Distributed machine learning has typically been approached from a data parallel perspective, where big data are partitioned to multiple workers and an algorithm is executed concurrently over different data subsets under various synchronization schemes to ensure speedup and/or correctness. A sibling problem that has received relatively less attention is how to ensure efficient and correct model parallel execution of ML algorithms, where parameters of an ML program are partitioned to different workers and undergo concurrent iterative updates. We argue that model and data parallelisms impose rather different challenges for system design, algorithmic adjustment, and theoretical analysis. In this paper, we develop a system for model parallelism, STRADS, that provides a programming abstraction for scheduling parameter updates by discovering and leveraging changing structural properties of ML programs. STRADS enables a flexible tradeoff between scheduling efficiency and fidelity to intrinsic dependencies within the models, and improves memory efficiency of distributed ML. We demonstrate the efficacy of model-parallel algorithms implemented on STRADS versus popular implementations for topic modeling, matrix factorization, and Lasso.
Efficient minibatch training for stochastic optimization
In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Cited by 4 (0 self)
Stochastic gradient descent (SGD) is a popular technique for large-scale optimization problems in machine learning. In order to parallelize SGD, minibatch training needs to be employed to reduce the communication cost. However, an increase in minibatch size typically decreases the rate of convergence. This paper introduces a technique based on approximate optimization of a conservatively regularized objective function within each minibatch. We prove that the convergence rate does not decrease with increasing minibatch size. Experiments demonstrate that with suitable implementations of approximate optimization, the resulting algorithm can outperform standard SGD in many scenarios.
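One plausible way to write the per-minibatch subproblem described above is shown below. The notation is an assumption introduced here, since the abstract does not fix any symbols: $w_{t-1}$ is the previous iterate, $I_t$ the minibatch drawn at step $t$, $\ell_i$ the loss on example $i$, and $\gamma_t > 0$ the weight of the conservative penalty.

```latex
w_t \;\approx\; \arg\min_{w} \; \frac{1}{|I_t|} \sum_{i \in I_t} \ell_i(w)
      \;+\; \frac{\gamma_t}{2} \, \lVert w - w_{t-1} \rVert_2^2
```

In this form the quadratic term keeps each update close to the previous iterate, and the "approximate optimization" in the abstract would correspond to solving this subproblem only to limited accuracy rather than exactly.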
“All Roads Lead to Rome:” Optimistic Recovery for Distributed Iterative Data Processing
Cited by 4 (1 self)
Executing data-parallel iterative algorithms on large datasets is crucial for many advanced analytical applications in the fields of data mining and machine learning. Current systems for executing iterative tasks in large clusters typically achieve fault tolerance through rollback recovery. The principle behind this pessimistic approach is to periodically checkpoint the algorithm state. Upon failure, the system restores a consistent state from a previously written checkpoint and resumes execution from that point. We propose an optimistic recovery mechanism using algorithmic compensations. Our method leverages the robust, self-correcting nature of a large class of fixpoint algorithms used in data mining and machine learning, which converge to the correct solution from various intermediate consistent states. In the case of a failure, we apply a user-defined compensate function that algorithmically creates such a consistent state, instead of rolling back to a previous checkpointed state. Our optimistic recovery does not checkpoint any state and hence achieves optimal failure-free performance with respect to the overhead necessary for guaranteeing fault tolerance. We illustrate the applicability of this approach for three wide classes of problems. Furthermore, we show how to implement the proposed optimistic recovery mechanism in a data flow system. Similar to the Combine operator in MapReduce, our proposed functionality is optional and can be applied to increase performance without changing the semantics of programs. In an experimental evaluation on large datasets, we show that our proposed approach provides optimal failure-free performance. In the absence of failures, our optimistic scheme is able to outperform a pessimistic approach by a factor of two to five. In the presence of failures, our approach provides fast recovery and outperforms pessimistic approaches in the majority of cases.
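The idea of a compensate function can be illustrated on a PageRank-style fixpoint iteration. The following is a single-process sketch under assumptions, not the paper's data flow implementation: the function names, the toy graph in the usage note, and the specific compensation (spreading the missing probability mass uniformly over the lost vertices) are chosen here for illustration.

```python
def pagerank_step(ranks, out_links, damping=0.85):
    """One power-iteration step; every vertex is assumed to have out-links."""
    n = len(ranks)
    new = [(1.0 - damping) / n] * n
    for src, targets in out_links.items():
        share = damping * ranks[src] / len(targets)
        for dst in targets:
            new[dst] += share
    return new


def compensate(ranks, lost):
    """Rebuild a consistent state after the ranks of `lost` vertices are gone.

    The surviving ranks are kept, and the lost vertices share the missing
    probability mass equally, so the vector sums to 1 again.
    """
    surviving = sum(r for v, r in enumerate(ranks) if v not in lost)
    fill = (1.0 - surviving) / len(lost)
    return [fill if v in lost else r for v, r in enumerate(ranks)]


def run(out_links, n, iters, fail_at=None, lost=()):
    """Iterate from a uniform start; optionally simulate one failure."""
    ranks = [1.0 / n] * n
    for it in range(iters):
        if it == fail_at:
            # No checkpoint is read: the lost entries are simply re-created.
            ranks = compensate(ranks, set(lost))
        ranks = pagerank_step(ranks, out_links)
    return ranks
```

For example, with `links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}`, comparing `run(links, 4, 200)` with `run(links, 4, 200, fail_at=5, lost=[1, 3])` shows the self-correcting behavior: the iteration continues from the compensated state and approaches the same fixpoint, without rolling back.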