Results 1–10 of 23
Scalable K-Means++
Abstract

Cited by 23 (2 self)
Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well-known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of k-means++ is its inherent sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means, which have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm, k-means||, obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.
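The D²-weighting at the core of k-means++ (each new center is sampled with probability proportional to its squared distance from the nearest center chosen so far) can be sketched as follows. This is a minimal illustration of the sequential seeding the abstract describes, not the parallel variant the paper proposes; the function name and signature are for illustration only.

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """k-means++ seeding: pick each new center with probability
    proportional to its squared distance to the nearest chosen center."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]  # first center uniformly at random
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen center
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = np.min((diffs ** 2).sum(axis=-1), axis=1)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)
```

Each of the k - 1 sampling steps requires a full pass over the data to recompute the distances, which is exactly the sequential bottleneck the paper targets.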
Fast greedy algorithms in mapreduce and streaming
 In SPAA
, 2013
Abstract

Cited by 20 (1 self)
Greedy algorithms are practitioners’ best friends—they are intuitive, simple to implement, and often lead to very good solutions. However, implementing greedy algorithms in a distributed setting is challenging, since the greedy choice is inherently sequential and it is not clear how to take advantage of the extra processing power. Our main result is a powerful sampling technique that aids in the parallelization of sequential algorithms. We then show how to use this primitive to adapt a broad class of greedy algorithms to the MapReduce paradigm; this class includes maximum cover and submodular maximization subject to p-system constraints. Our method yields efficient algorithms that run in a logarithmic number of rounds while obtaining solutions that are arbitrarily close to those produced by the standard sequential greedy algorithm. We begin with algorithms for modular maximization subject to a matroid constraint, and then extend this approach to obtain approximation algorithms for submodular maximization subject to knapsack or p-system constraints. Finally, we empirically validate our algorithms and show that they achieve the same solution quality as standard greedy algorithms but run in substantially fewer rounds.
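For reference, the standard sequential greedy algorithm for maximum k-cover, which the abstract takes as the baseline to parallelize, can be sketched as follows (names are illustrative):

```python
def greedy_max_cover(sets, k):
    """Sequential greedy for maximum k-cover: repeatedly take the set
    covering the most still-uncovered elements.
    This is the classic (1 - 1/e)-approximation."""
    covered, chosen = set(), []
    for _ in range(k):
        # index of the set with the largest marginal gain
        best = max(range(len(sets)), key=lambda i: len(sets[i] - covered))
        if not sets[best] - covered:
            break  # no set adds anything new
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered
```

Note that each iteration depends on the elements covered so far, which is the inherently sequential choice the paper's sampling technique works around.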
MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail
, 2012
Optimistic concurrency control for distributed unsupervised learning
 In NIPS
, 2013
Abstract

Cited by 7 (1 self)
Research on distributed machine learning algorithms has focused primarily on one of two extremes—algorithms that obey strict concurrency constraints, or algorithms that obey few or no such constraints. We consider an intermediate alternative in which algorithms optimistically assume that conflicts are unlikely and, if conflicts do arise, a conflict-resolution protocol is invoked. We view this “optimistic concurrency control” paradigm as particularly appropriate for large-scale machine learning algorithms, particularly in the unsupervised setting. We demonstrate our approach in three problem areas: clustering, feature learning, and online facility location. We evaluate our methods via large-scale experiments in a cluster computing environment.
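A minimal sketch of the serial-validation step in optimistic concurrency control, applied to a toy clustering setting: workers propose new centers in parallel, and a validator accepts a proposal only if it does not conflict (fall within eps) with an already-accepted center. This illustrates the general paradigm, not the paper's actual protocol; all names are hypothetical.

```python
import numpy as np

def occ_accept_centers(proposals, existing, eps):
    """Serial validation for an optimistic-concurrency clustering step.
    Proposals were generated optimistically in parallel; a proposal
    within eps of an accepted center is a conflict, and the toy
    conflict-resolution rule here simply discards it."""
    accepted = list(existing)
    for p in proposals:
        if all(np.linalg.norm(p - c) > eps for c in accepted):
            accepted.append(p)
    return accepted
```

The optimistic bet is that most proposals survive validation, so the cheap serial pass rarely undoes parallel work.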
Distributed Column Subset Selection on MapReduce
Abstract

Cited by 5 (3 self)
Given a very large data set distributed over a cluster of several nodes, this paper addresses the problem of selecting a few data instances that best represent the entire data set. The solution to this problem is of crucial importance in the big data era, as it enables data analysts to gain insight into the data and explore its hidden structure. The selected instances can also be used for data preprocessing tasks such as learning a low-dimensional embedding of the data points or computing a low-rank approximation of the corresponding matrix. The paper first formulates the problem as the selection of a few representative columns from a matrix whose columns are massively distributed, and then proposes a MapReduce algorithm for selecting those representatives. The algorithm first learns a concise representation of all columns using random projection, and then solves a generalized column subset selection problem at each machine, in which a subset of columns is selected from the submatrix on that machine such that the reconstruction error of the concise representation is minimized. The paper then demonstrates the effectiveness and efficiency of the proposed algorithm through an empirical evaluation on benchmark data sets.
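A rough single-machine sketch of the two-stage idea described above: compress the columns with a Gaussian random projection, then greedily select the columns that minimize the reconstruction error of the sketch. This is an illustrative simplification, not the paper's distributed algorithm, and the function name and parameters are assumptions.

```python
import numpy as np

def greedy_css(A, k, sketch_dim=None, rng=None):
    """Toy column subset selection: sketch the rows of A with a random
    projection, then greedily add the column whose inclusion most
    reduces the residual norm of the sketched matrix."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    d = sketch_dim or min(m, 20)
    # concise representation of all columns via random projection
    B = (rng.standard_normal((d, m)) / np.sqrt(d)) @ A
    chosen = []
    for _ in range(k):
        best, best_err = None, np.inf
        for j in range(n):
            if j in chosen:
                continue
            S = B[:, chosen + [j]]
            # residual after projecting B onto span of selected columns
            coeffs = np.linalg.lstsq(S, B, rcond=None)[0]
            err = np.linalg.norm(B - S @ coeffs)
            if err < best_err:
                best, best_err = j, err
        chosen.append(best)
    return chosen
```

In the distributed setting, the small sketch B is what gets shared across machines, while each machine selects only from its own local columns.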
Lessons from the congested clique applied to MapReduce
 In Proc. 21st Int. Colloq. on Structural Information and Communication Complexity
, 2014
Data Science and Distributed Intelligence: Recent Developments and Future Insights
Abstract

Cited by 2 (0 self)
… flooded our research papers and technical articles during the last two years. Also, due to the inherently distributed nature of the computational infrastructures supporting Data Science (such as Clouds and Grids), it is natural to view Distributed Intelligence as the most natural underlying paradigm for novel Data Science challenges. Following this major trend, in this paper we provide a background on these new terms, followed by a discussion of recent developments in the data mining and data warehousing areas in light of the aforementioned keywords. Finally, we offer our insights into the next stages of research and development in this area.
Parallel Algorithms for Geometric Graph Problems
, 2014
Abstract

Cited by 1 (0 self)
We give algorithms for geometric graph problems in modern parallel models such as MapReduce [DG04, KSV10, GSZ11, BKS13]. For example, for the Minimum Spanning Tree (MST) problem over a set of points in two-dimensional space, our algorithm computes a (1 + ε)-approximate MST. Our algorithms work in a constant number of rounds of communication, while using total space and communication proportional to the size of the data (linear-space and near-linear-time algorithms). In contrast, for general graphs, achieving the same result for MST (or even connectivity) remains a challenging open problem [BKS13], despite drawing significant attention in recent years. We develop a general algorithmic framework that, besides MST, also applies to Earth-Mover Distance (EMD) and the transportation cost problem. Our algorithmic framework has implications beyond the MapReduce model. For example, it yields a new algorithm for computing EMD cost in the plane in near-linear time, n^{1+o(1)}. We note that while [SA12b] have recently developed a near-linear-time algorithm for (1 + ε)-approximating EMD, our algorithm is fundamentally different and, for example, also solves the transportation (cost) problem, raised as an open question in [SA12b]. Furthermore, our algorithm immediately gives a (1 + ε)-approximation algorithm with n^δ space in the streaming-with-sorting model with 1/δ^{O(1)} passes. As such, it is tempting to conjecture that the parallel models may also constitute a concrete playground in the quest for efficient algorithms for EMD (and other similar problems) in the vanilla streaming model, a well-known open problem [P07, P49].
∗Supported in part by the Simons Postdoctoral Fellowship. Research initiated while at CMU.
Parallel graph decomposition and diameter approximation in o(diameter) time and linear space, arXiv:1407.3144
Embed and Conquer: Scalable Embeddings for Kernel k-Means on MapReduce
Abstract

Cited by 1 (1 self)
Kernel k-means is an effective method for data clustering that extends the commonly used k-means algorithm to work on a similarity matrix over complex data structures. It is, however, computationally very complex, as it requires the complete kernel matrix to be calculated and stored. Further, its kernelized nature hinders the parallelization of its computations on modern scalable infrastructures for distributed computing. In this paper, we define a family of kernel-based low-dimensional embeddings that allows for scaling kernel k-means on MapReduce via an efficient and unified parallelization strategy. We then propose two practical methods for low-dimensional embedding that adhere to our definition of the embedding family. Exploiting the proposed parallelization strategy, we present two scalable MapReduce algorithms for kernel k-means. We demonstrate the effectiveness and efficiency of the proposed algorithms through an empirical evaluation on benchmark datasets.
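One well-known kernel-based low-dimensional embedding of the kind the abstract describes is random Fourier features, which approximates the RBF kernel exp(-γ‖x−y‖²) so that ordinary (easily parallelized) k-means on the embedded points approximates kernel k-means. The sketch below illustrates that general idea, not the paper's specific embedding family; names and parameters are illustrative.

```python
import numpy as np

def rff_embed(X, D, gamma, rng=None):
    """Random Fourier features: map each row of X to a D-dimensional
    vector whose inner products approximate the RBF kernel
    exp(-gamma * ||x - y||^2), avoiding the full kernel matrix."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    # frequencies drawn from the Fourier transform of the RBF kernel
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)
```

After embedding, each machine can run standard k-means steps on its local rows, which is what makes the MapReduce parallelization straightforward.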