Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent
In KDD, 2011
Cited by 68 (7 self)
We provide a novel algorithm to approximately factor large matrices with millions of rows, millions of columns, and billions of nonzero elements. Our approach rests on stochastic gradient descent (SGD), an iterative stochastic optimization algorithm. Based on a novel “stratified” variant of SGD, we obtain a new matrix-factorization algorithm, called DSGD, that can be fully distributed and run on web-scale datasets using, e.g., MapReduce. DSGD can handle a wide variety of matrix factorizations and has good scalability properties.
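The stratification idea in the abstract above can be illustrated with a small single-machine sketch. It is not the authors' implementation: it assumes a squared loss with L2 regularization, a d x d blocking of the matrix, and strata formed by cyclic shifts so that the blocks in one stratum share no rows or columns. The names `sgd_block`, `squared_error`, and `stratified_sgd` are hypothetical.

```python
import random


def sgd_block(entries, W, H, lr, reg):
    """Run plain SGD over the observed (row, col, value) entries of one block."""
    for i, j, v in entries:
        err = v - sum(a * b for a, b in zip(W[i], H[j]))
        for k in range(len(W[i])):
            w, h = W[i][k], H[j][k]
            W[i][k] += lr * (err * h - reg * w)
            H[j][k] += lr * (err * w - reg * h)


def squared_error(entries, W, H):
    """Sum of squared reconstruction errors over the observed entries."""
    return sum((v - sum(a * b for a, b in zip(W[i], H[j]))) ** 2
               for i, j, v in entries)


def stratified_sgd(entries, n_rows, n_cols, rank=2, d=2,
                   epochs=50, lr=0.05, reg=0.01, seed=0):
    """Factor a sparse matrix as W * H^T, visiting one stratum at a time."""
    rng = random.Random(seed)
    W = [[rng.uniform(0, 0.5) for _ in range(rank)] for _ in range(n_rows)]
    H = [[rng.uniform(0, 0.5) for _ in range(rank)] for _ in range(n_cols)]

    # Assign every entry to a (row block, column block) pair.
    blocks = {}
    for i, j, v in entries:
        key = (i * d // n_rows, j * d // n_cols)
        blocks.setdefault(key, []).append((i, j, v))

    for _ in range(epochs):
        shifts = list(range(d))
        rng.shuffle(shifts)
        for s in shifts:
            # Stratum s holds blocks (p, (p + s) % d). They touch disjoint
            # rows and columns, so they could run in parallel; this sketch
            # simply loops over them.
            for p in range(d):
                block = blocks.get((p, (p + s) % d), [])
                rng.shuffle(block)
                sgd_block(block, W, H, lr, reg)
    return W, H
```

Calling `stratified_sgd(entries, n_rows, n_cols)` with a list of `(row, col, value)` triples returns the two factor matrices. In a distributed setting each block of a stratum would be handled by a different worker, which is where the independence of the blocks matters.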
Fast Coordinate Descent Methods with Variable Selection for Nonnegative Matrix Factorization
2011
Cited by 22 (3 self)
Nonnegative Matrix Factorization (NMF) is an effective dimension reduction method for nonnegative dyadic data, and has proven to be useful in many areas, such as text mining, bioinformatics and image processing. NMF is usually formulated as a constrained non-convex optimization problem, and many algorithms have been developed for solving it. Recently, a coordinate descent method, called FastHals [3], has been proposed to solve least squares NMF and is regarded as one of the state-of-the-art techniques for the problem. In this paper, we first show that FastHals has an inefficiency in that it uses a cyclic coordinate descent scheme and thus performs unneeded descent steps on unimportant variables. We then present a variable selection scheme that uses the gradient of the objective function to arrive at a new coordinate descent method. Our new method is considerably faster in practice and we show that it has theoretical convergence guarantees. Moreover, when the solution is sparse, as is often the case in real applications, our new method benefits by selecting important variables to update more often, thus resulting in higher speed. As an example, on a text dataset RCV1, our method is 7 times faster than FastHals, and more than 15 times faster when the sparsity is increased by adding an L1 penalty. We also develop new coordinate descent methods when error in NMF is measured by KL-divergence by applying the Newton method to solve the one-variable subproblems. Experiments indicate that our algorithm for minimizing the KL-divergence is faster than the Lee & Seung multiplicative rule by a factor of 10 on the CBCL image dataset.
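To make the contrast between cyclic and gradient-driven variable selection concrete, here is a minimal sketch for a single nonnegative least squares subproblem, min 0.5*h'Gh - b'h subject to h >= 0, with G = W'W and b = W'v. It is a simplified illustration, not the paper's algorithm: the function name `greedy_cd_nnls`, the stopping rule, and the assumption that every diagonal entry of G is positive are all choices made for this example.

```python
def greedy_cd_nnls(G, b, iters=1000, tol=1e-12):
    """Greedy coordinate descent for min 0.5*h'Gh - b'h subject to h >= 0.

    Each step updates the single coordinate whose exact one-variable
    minimization gives the largest decrease in the objective.
    Assumes G is symmetric with strictly positive diagonal entries.
    """
    n = len(b)
    h = [0.0] * n
    grad = [-bk for bk in b]  # gradient G h - b evaluated at h = 0
    for _ in range(iters):
        best_k, best_gain, best_delta = -1, tol, 0.0
        for k in range(n):
            # Exact minimizer along coordinate k, projected onto h_k >= 0.
            new_value = max(0.0, h[k] - grad[k] / G[k][k])
            delta = new_value - h[k]
            gain = -grad[k] * delta - 0.5 * G[k][k] * delta * delta
            if gain > best_gain:
                best_k, best_gain, best_delta = k, gain, delta
        if best_k < 0:
            break  # no coordinate yields a decrease larger than tol
        h[best_k] += best_delta
        # Keep the gradient up to date in O(n) instead of recomputing it.
        for r in range(n):
            grad[r] += best_delta * G[r][best_k]
    return h
```

A cyclic scheme would visit k = 0, 1, ..., n-1 in order regardless of the gain. The greedy rule spends its updates on the coordinates that matter, which is the intuition behind the speedups the abstract reports for sparse solutions.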
Beyond ‘Caveman Communities’: Hubs and Spokes for Graph Compression and Mining
Cited by 21 (9 self)
Given a real-world graph, how should we lay out its edges? How can we compress it? These questions are closely related, and the typical approach so far is to find clique-like communities, like the ‘cavemen graph’, and compress them. We show that the block-diagonal mental image of the ‘cavemen graph’ is the wrong paradigm, in full agreement with earlier results that real-world graphs have no good cuts. Instead, we propose to envision graphs as a collection of hubs connecting spokes, with super-hubs connecting the hubs, and so on, recursively. Based on this idea, we propose the SLASHBURN method (burn the hubs, and slash the remaining graph into smaller connected components). Our viewpoint has several advantages: (a) it avoids the ‘no good cuts’ problem, (b) it gives better compression, and (c) it leads to faster execution times for matrix-vector operations, which are the backbone of most graph processing tools. Experimental results show that our SLASHBURN method consistently outperforms other methods on all datasets, giving good compression and faster running time.
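The "burn the hubs, slash the rest" loop described above can be sketched as a node-ordering routine. This is a simplified reading of the abstract, not the published algorithm: it assumes an undirected graph with comparable node ids (e.g. integers), removes the `k` highest-degree nodes per round, keeps only the largest remaining component for the next round, and pushes the smaller components to the back of the ordering. The names `components` and `slashburn_order` are hypothetical.

```python
from collections import defaultdict


def components(nodes, adj):
    """Connected components of the subgraph induced by the set `nodes`."""
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps


def slashburn_order(edges, k=1):
    """Return a node ordering: hubs first, spokes last, applied recursively."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    alive = set(adj)
    front, back = [], []
    while alive:
        # Burn: remove the k nodes with the highest degree in the live graph.
        hubs = sorted(alive, key=lambda u: (-len(adj[u] & alive), u))[:k]
        front.extend(hubs)
        alive -= set(hubs)
        if not alive:
            break
        # Slash: keep the largest component, send the rest to the back.
        comps = sorted(components(alive, adj), key=len, reverse=True)
        spokes = [u for comp in comps[1:] for u in comp]
        back = spokes + back
        alive = set(comps[0])
    return front + back
```

Relabeling the nodes by their position in the returned list concentrates the nonzeros of the adjacency matrix near the first rows and columns, which is the kind of layout the abstract argues compresses better than a block-diagonal one.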
GigaTensor: Scaling Tensor Analysis Up By 100 Times - Algorithms and Discoveries
Cited by 18 (6 self)
Many data are modeled as tensors, or multi-dimensional arrays. Examples include the predicates (subject, verb, object) in knowledge bases, hyperlinks and anchor texts in Web graphs, sensor streams (time, location, and type), social networks over time, and DBLP conference-author-keyword relations. Tensor decomposition is an important data mining tool with various applications including clustering, trend detection, and anomaly detection. However, current tensor decomposition algorithms do not scale to large tensors with sizes in the billions and hundreds of millions of nonzeros: the largest tensor in the literature has sizes in the thousands and hundreds of thousands of nonzeros. Consider a knowledge base tensor consisting of about 26 million noun-phrases. The intermediate data explosion problem, associated with naive implementations of tensor decomposition algorithms, would require the materialization and the storage of a matrix whose largest dimension would be ≈ 7·10^14; this amounts to ~10 petabytes, or equivalently a few data centers' worth of storage, thereby rendering the tensor analysis of this knowledge base, in the naive way, practically impossible. In this paper, we propose GIGATENSOR, a scalable distributed algorithm for large-scale tensor decomposition. GIGATENSOR exploits the sparseness of real-world tensors, and avoids the intermediate data explosion problem by carefully redesigning the tensor decomposition algorithm. Extensive experiments show that our proposed GIGATENSOR solves 100× bigger problems than existing methods. Furthermore, we employ GIGATENSOR to analyze a very large real-world knowledge base tensor and present our astounding findings, which include the discovery of potential synonyms among millions of noun-phrases (e.g., the noun ‘pollutant’ and the noun-phrase ‘greenhouse gases’).
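The abstract does not spell out which product causes the intermediate data explosion. Assuming a CP/PARAFAC-style alternating least squares update, the usual culprit is the product of the unfolded tensor with a Khatri-Rao product of two factor matrices. The sketch below shows how that product can be accumulated directly from the nonzeros of a sparse three-way tensor without ever forming the large intermediate matrix. It is a single-machine illustration, not the distributed algorithm of the paper, and the name `sparse_mttkrp` is hypothetical.

```python
def sparse_mttkrp(nonzeros, B, C, n_rows):
    """Compute M[i][r] = sum over nonzeros (i, j, k, v) of v * B[j][r] * C[k][r].

    `nonzeros` is a list of (i, j, k, value) tuples for a sparse 3-way tensor.
    B and C are factor matrices given as lists of rows with `rank` columns.
    The work is proportional to (number of nonzeros) * rank, and the only
    output is an n_rows x rank matrix.
    """
    rank = len(B[0])
    M = [[0.0] * rank for _ in range(n_rows)]
    for i, j, k, v in nonzeros:
        for r in range(rank):
            M[i][r] += v * B[j][r] * C[k][r]
    return M
```

The naive route would first build a matrix with (rows of B) * (rows of C) rows, which is the kind of object whose size the abstract estimates at around 7·10^14 along its largest dimension.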
Nonnegative Matrix Factorization: A Comprehensive Review
IEEE Trans. Knowledge and Data Eng., 2013
Cited by 16 (2 self)
Nonnegative Matrix Factorization (NMF), a relatively novel paradigm for dimensionality reduction, has been in the ascendant since its inception. It incorporates the nonnegativity constraint and thus obtains a parts-based representation while correspondingly enhancing the interpretability of the problem. This survey paper mainly focuses on the theoretical research into NMF over the last 5 years, where the principles, basic models, properties, and algorithms of NMF along with its various modifications, extensions, and generalizations are summarized systematically. The existing NMF algorithms are divided into four categories: Basic NMF (BNMF),
Regularized Latent Semantic Indexing
Cited by 13 (2 self)
Topic modeling can boost the performance of information retrieval, but its real-world application is limited due to scalability issues. Scaling to larger document collections via parallelization is an active area of research, but most solutions require drastic steps such as vastly reducing input vocabulary. We introduce Regularized Latent Semantic Indexing (RLSI), a new method which is designed for parallelization. It is as effective as existing topic models, and scales to larger datasets without reducing input vocabulary. RLSI formalizes topic modeling as a problem of minimizing a quadratic loss function regularized by ℓ1 and/or ℓ2 norm. This formulation allows the learning process to be decomposed into multiple sub-optimization problems which can be optimized in parallel, for example via MapReduce. We particularly propose adopting ℓ1 norm on topics and ℓ2 norm on document representations, to create a model whose topics are compact and readable and which is useful for retrieval. Relevance ranking experiments on three TREC datasets show that RLSI performs better than LSI, PLSI, and LDA, and the improvements are sometimes statistically significant. Experiments on a web dataset, containing about 1.6 million documents and 7 million terms, demonstrate a similar boost in performance on a larger corpus and vocabulary than in previous studies.
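One way to write down the objective described above is the following. The notation is assumed for this illustration and does not come from the abstract: d_n is the term vector of document n, U is the term-topic matrix with columns u_k, v_n is the topic representation of document n, and λ1, λ2 are the regularization weights.

```latex
\min_{U,\,\{v_n\}} \;
  \sum_{n=1}^{N} \lVert d_n - U v_n \rVert_2^2
  \;+\; \lambda_1 \sum_{k=1}^{K} \lVert u_k \rVert_1
  \;+\; \lambda_2 \sum_{n=1}^{N} \lVert v_n \rVert_2^2
```

Under this form, fixing U makes the problem separate into one ridge-regression-like subproblem per document, and fixing the v_n makes it separate into one ℓ1-regularized subproblem per term (row of U). That separability is what would let the sub-optimization problems be solved in parallel.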
MadLINQ: Large-Scale Distributed Matrix Computation for the Cloud
Cited by 12 (1 self)
The computation core of many data-intensive applications can be best expressed as matrix computations. The MadLINQ project addresses the following two important research problems: the need for a highly scalable, efficient and fault-tolerant matrix computation system that is also easy to program, and the seamless integration of such specialized execution engines in a general-purpose data-parallel computing system. MadLINQ exposes a unified programming model to both matrix algorithm and application developers. Matrix algorithms are expressed as sequential programs operating on tiles (i.e., sub-matrices). For application developers, MadLINQ provides a distributed matrix computation library for .NET languages. Via the LINQ technology, MadLINQ also seamlessly integrates with DryadLINQ, a data-parallel computing system focusing on relational algebra. The system automatically handles the parallelization and distributed execution of programs on a large cluster. It outperforms current state-of-the-art systems by employing two key techniques, both of which are enabled by the matrix abstraction: exploiting extra parallelism using fine-grained pipelining and efficient on-demand failure recovery using a distributed fault-tolerant execution engine. We describe the design and implementation of MadLINQ and evaluate system performance using several real-world applications.
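The phrase "sequential programs operating on tiles" can be illustrated with a tiled matrix multiplication. This sketch is written in Python rather than a .NET language and does not use the MadLINQ API. The function name `tiled_matmul` and the `tile` parameter are hypothetical. It only shows the shape of a tile-level loop nest that a runtime could, in principle, schedule across machines.

```python
def tiled_matmul(A, B, tile=2):
    """Multiply A (n x m) by B (m x p), iterating tile by tile.

    The three outer loops walk over tiles; each (i0, j0, k0) step multiplies
    one tile of A by one tile of B and accumulates into one tile of C.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, p, tile):
            for k0 in range(0, m, tile):
                # Tile-level work: C[i0.., j0..] += A[i0.., k0..] * B[k0.., j0..]
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, m)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + tile, p)):
                            C[i][j] += a * B[k][j]
    return C
```

Each tile-level step depends only on two input tiles and one output tile, so steps that write to different output tiles are independent. A dependency structure of that kind is what makes automatic parallelization and pipelining possible.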
Efficient Document Clustering via Online Nonnegative Matrix Factorizations
Cited by 10 (1 self)
In recent years, Nonnegative Matrix Factorization (NMF) has received considerable interest from the data mining and information retrieval fields. NMF has been successfully applied in document clustering, image representation, and other domains. This study proposes an online NMF (ONMF) algorithm to efficiently handle very large-scale and/or streaming datasets. Unlike conventional NMF solutions, which require the entire data matrix to reside in memory, our ONMF algorithm proceeds with one data point or one chunk of data points at a time. Experiments with one-pass and multi-pass ONMF on real datasets are presented.
Sparkler: Supporting Large-Scale Matrix Factorization
In EDBT, 2013
Cited by 4 (2 self)
Low-rank matrix factorization has recently been applied with great success on matrix completion problems for applications like recommendation systems, link predictions for social networks, and click prediction for web search. However, as this approach is applied to increasingly larger datasets, such as those encountered in web-scale recommender systems like Netflix and Pandora, the data management aspects quickly become challenging and form a roadblock. In this paper, we introduce a system called Sparkler to solve such large instances of low-rank matrix factorizations. Sparkler extends Spark, an existing platform for running parallel iterative algorithms on datasets that fit in the aggregate main memory of a cluster. Sparkler supports distributed stochastic gradient descent as an approach to solving the factorization problem – an iterative technique that has been shown to perform very well in practice. We identify the shortfalls of Spark in solving large matrix factorization problems, especially when running on the cloud, and solve this by introducing a novel abstraction called “Carousel Maps” (CMs). CMs are well suited to storing large matrices in the aggregate memory of a cluster and can efficiently support the operations performed on them during distributed stochastic gradient descent. We describe the design, implementation, and the use of CMs in Sparkler programs. Through a variety of experiments, we demonstrate that Sparkler is faster than Spark by 4x to 21x, with bigger advantages for larger problems. Equally important, we show that this can be done without imposing any changes to the ease of programming. We argue that Sparkler provides a convenient and efficient extension to Spark for solving matrix factorization problems on very large datasets.
Distributed Large-Scale Natural Graph Factorization
In WWW, 2013
Cited by 4 (1 self)
Natural graphs, such as social networks, email graphs, or instant messaging patterns, have become pervasive through the internet. These graphs are massive, often containing hundreds of millions of nodes and billions of edges. While some theoretical models have been proposed to study such graphs, their analysis is still difficult due to the scale and nature of the data. We propose a framework for large-scale graph decomposition and inference. To resolve the scale, our framework is distributed so that the data are partitioned over a shared-nothing set of machines. We propose a novel factorization technique that relies on partitioning a graph so as to minimize the number of neighboring vertices rather than edges across partitions. Our decomposition is based on a streaming algorithm. It is network-aware as it adapts to the network topology of the underlying computational hardware. We use local copies of the variables and an efficient asynchronous communication protocol to synchronize the replicated values in order to perform most of the computation without having to incur the cost of network communication. On a graph of 200 million vertices and 10 billion edges, derived from an email communication network, our algorithm retains convergence properties while allowing for almost linear scalability in the number of computers.