Results 1–10 of 16
The Combinatorial BLAS: Design, Implementation, and Applications
, 2010
Abstract

Cited by 58 (10 self)
This paper presents a scalable high-performance software library to be used for graph analysis and data mining. Large combinatorial graphs appear in many applications of high-performance computing, including computational biology, informatics, analytics, web search, dynamical systems, and sparse matrix methods. Graph computations are difficult to parallelize using traditional approaches due to their irregular nature and low operational intensity. Many graph computations, however, contain sufficient coarse-grained parallelism for thousands of processors, which can be uncovered by using the right primitives. We describe the Parallel Combinatorial BLAS, which consists of a small but powerful set of linear algebra primitives specifically targeting graph and data mining applications. We provide an extendible library interface and some guiding principles for future development. The library is evaluated using two important graph algorithms, in terms of both performance and ease of use. The scalability and raw performance of the example applications, using the Combinatorial BLAS, are unprecedented on distributed memory clusters.
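To make the abstract's central idea concrete, here is a minimal sketch (plain Python, not the Combinatorial BLAS API) of the kind of primitive it refers to: breadth-first search expressed as repeated sparse matrix-vector multiplication over a Boolean-like semiring, with the visited set acting as a mask.

```python
# Sketch: BFS as repeated "SpMV" on a sparse adjacency structure.
# The graph representation and function names are illustrative only.

def bfs_levels(adj, source):
    """adj: dict mapping vertex -> list of out-neighbors (a sparse
    matrix stored by row); returns dict vertex -> BFS level."""
    levels = {source: 0}
    frontier = {source}            # sparse vector: the current frontier
    level = 0
    while frontier:
        level += 1
        # One semiring matrix-vector product: expand the frontier,
        # masking out vertices that already have a level.
        nxt = set()
        for u in frontier:
            for v in adj.get(u, []):
                if v not in levels:
                    levels[v] = level
                    nxt.add(v)
        frontier = nxt
    return levels

# Small directed graph: 0 -> 1 -> 2, and 0 -> 3
graph = {0: [1, 3], 1: [2]}
print(bfs_levels(graph, 0))  # {0: 0, 1: 1, 3: 1, 2: 2}
```

Each while-loop iteration is one frontier expansion, which is exactly the coarse-grained unit of work such libraries parallelize across processors.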
Communication Optimal Parallel Multiplication of Sparse Random Matrices
, 2013
Abstract

Cited by 10 (6 self)
Parallel algorithms for sparse matrix-matrix multiplication typically spend most of their time on inter-processor communication rather than on computation, and hardware trends predict the relative cost of communication will only increase. Thus, sparse matrix multiplication algorithms must minimize communication costs in order to scale to large processor counts. In this paper, we consider multiplying sparse matrices corresponding to Erdős–Rényi random graphs on distributed-memory parallel machines. We prove a new lower bound on the expected communication cost for a wide class of algorithms. Our analysis of existing algorithms shows that, while some are optimal for a limited range of matrix density and number of processors, none is optimal in general. We obtain two new parallel algorithms and prove that they match the expected communication cost lower bound, and hence they are optimal.
Experimental evaluation of multi-round matrix multiplication on MapReduce
, 2014
Abstract

Cited by 1 (1 self)
Abstract—This paper proposes a Hadoop library, named M3, for performing dense and sparse matrix multiplication in MapReduce. The library features multi-round MapReduce algorithms that allow trading off the number of rounds against the amount of data shuffled in each round and the amount of memory required by the reduce functions. We claim that in cloud settings multi-round MapReduce algorithms are preferable to traditional monolithic algorithms, that is, algorithms requiring just one or two rounds. We perform an extensive experimental evaluation of the M3 library on an in-house cluster and on a cloud provider, aiming at assessing the performance of the library and at comparing the multi-round and monolithic approaches. Keywords—MapReduce, Hadoop, multi-round algorithms, matrix multiplication, experiments, cloud
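The round/memory trade-off the abstract describes can be illustrated with a toy sketch (plain Python, not the actual M3 API): computing C = A·B in several "rounds", where each round only keeps a bounded block of rows of A live, mimicking how a multi-round job caps per-reducer memory at the cost of more rounds.

```python
# Hypothetical sketch of a multi-round blocked multiply: each outer
# iteration is one "round" touching at most `block` rows of A.

def blocked_matmul(A, B, block):
    n, k = len(A), len(B[0])
    C = [[0] * k for _ in range(n)]
    for i0 in range(0, n, block):          # one round per block of rows
        for i in range(i0, min(i0 + block, n)):
            for j in range(k):
                C[i][j] = sum(A[i][t] * B[t][j] for t in range(len(B)))
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(blocked_matmul(A, B, block=1))  # [[19, 22], [43, 50]]
```

A smaller `block` means more rounds but less state held at once; a single round with `block = n` corresponds to the "monolithic" extreme the paper compares against.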
An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data
Abstract

Cited by 1 (0 self)
Abstract—General sparse matrix-matrix multiplication (SpGEMM) is a fundamental building block for numerous applications such as the algebraic multigrid method, breadth-first search, and the shortest path problem. Compared to other sparse BLAS routines, an efficient parallel SpGEMM algorithm has to handle extra irregularity from three aspects: (1) the number of nonzero entries in the result sparse matrix is unknown in advance, (2) very expensive parallel insert operations at random positions in the result sparse matrix dominate the execution time, and (3) load balancing must account for sparse data in both input matrices. Recent work on GPU SpGEMM has demonstrated rather good time and space complexity, but works best for fairly regular matrices. In this work we present a GPU SpGEMM algorithm that particularly focuses on the above three problems. Memory pre-allocation for the result matrix is organized by a hybrid method that saves a large amount of global memory space and efficiently utilizes the very limited on-chip scratchpad memory. Parallel insert operations on the nonzero entries are implemented through the GPU merge path algorithm, which is experimentally found to be the fastest GPU merge approach. Load balancing builds on the number of necessary arithmetic operations on the nonzero entries and is guaranteed in all stages. Compared with the state-of-the-art GPU SpGEMM methods in the cuSPARSE library and the CUSP library and the latest CPU SpGEMM method in the Intel Math Kernel Library, our approach delivers excellent absolute performance and relative speedups on a benchmark suite composed of 23 matrices with diverse sparsity structures. Keywords—sparse matrices; matrix multiplication; linear algebra; GPU; merging; parallel algorithms
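As background for the three problems listed in the abstract, here is a minimal row-wise (Gustavson-style) SpGEMM sketch in plain Python rather than CUDA. Note how the sparse accumulator `acc` grows unpredictably, which is exactly why the output's nonzero count is unknown in advance (problem 1) and why insertions into the result dominate (problem 2).

```python
# Sketch of row-wise SpGEMM; the dict-of-dicts matrix format is
# illustrative only, not the paper's GPU data structure.

def spgemm(A, B):
    """A, B: sparse matrices as {row: {col: value}}. Returns C = A * B."""
    C = {}
    for i, row in A.items():
        acc = {}                        # sparse accumulator for row i of C
        for k, a_ik in row.items():     # for each nonzero A[i, k] ...
            for j, b_kj in B.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a_ik * b_kj  # ... add a_ik * B[k, :]
        if acc:
            C[i] = acc
    return C

A = {0: {0: 1, 1: 2}, 1: {1: 3}}
B = {0: {0: 4}, 1: {0: 5, 1: 6}}
print(spgemm(A, B))  # {0: {0: 14, 1: 12}, 1: {0: 15, 1: 18}}
```

The per-row work is proportional to the number of scalar multiply-adds, which is the load-balancing measure (problem 3) the abstract says the GPU algorithm builds on.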
Approximate Subdivision Surface Evaluation in the Language of Linear Algebra
Abstract
We present an interpretation of approximate subdivision surface evaluation in the language of linear algebra. Specifically, vertices in the refined mesh can be computed by left-multiplying the vector of control vertices by a sparse matrix we call the subdivision operator. This interpretation is rather general: it applies to any level of subdivision, it holds for many common subdivision schemes (including Catmull-Clark and Loop), it can be extended to support hierarchical edit operations, and it subsumes sharpness and feature-adaptive schemes. Furthermore, our interpretation encourages high-performance implementations built on numerical linear algebra libraries. It is most applicable to subdivision of static control meshes undergoing deformation, i.e. animation, in which case it allows users to trade off time-to-first-frame and frame rate. We implemented our strategy as an extension to Pixar's production subdivision code and observed speedups of 2x to 14x using both multicore CPUs and GPUs.
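The "refined vertices = sparse operator times control vertices" idea can be shown with a toy one-dimensional analogue (open polyline midpoint subdivision, not Catmull-Clark or Loop): the operator is built once, and animating the control points only requires reapplying it each frame.

```python
# Toy illustration of a subdivision operator as a sparse matrix
# (rows stored as {col: weight} dicts). Names are illustrative only.

def subdivision_operator(n):
    """Operator S mapping n control points of an open polyline to
    2n-1 refined points: originals kept, midpoints inserted."""
    S = []
    for i in range(n - 1):
        S.append({i: 1.0})                # keep control point i
        S.append({i: 0.5, i + 1: 0.5})    # midpoint of i and i+1
    S.append({n - 1: 1.0})                # keep the last control point
    return S

def apply_op(S, points):
    # Sparse matrix-vector product: one refined point per row of S.
    return [sum(w * points[j] for j, w in row.items()) for row in S]

S = subdivision_operator(3)               # build once (time-to-first-frame)
print(apply_op(S, [0.0, 2.0, 6.0]))       # per frame: [0.0, 1.0, 2.0, 4.0, 6.0]
```

Composing operators for successive levels (S2 · S1) yields a single matrix per level, which is what makes library-backed sparse products attractive for the deformation use case the abstract describes.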
Fluid Pinchoff
Abstract
This 4608² image of a combustion simulation result was rendered by a hybrid-parallel (MPI+pthreads) ray-casting volume rendering implementation running on 216,000 cores of the JaguarPF supercomputer. Combustion simulation data courtesy of J. Bell and M. Day.
Bordered Heegaard Floer . . .
, 2008
Abstract
We construct Heegaard Floer theory for 3-manifolds with connected boundary. The theory associates to an oriented two-manifold a differential graded algebra. For a three-manifold with specified boundary, the invariant comes in two different versions, one of which (type D) is a module over the algebra and the other of which (type A) is an A∞-module. Both are well-defined up to chain homotopy equivalence. For a decomposition of a 3-manifold into two pieces, the A∞ tensor product of the type D module of one piece and the type A module of the other piece is ĤF of the glued manifold. As a special case of the construction, we specialize to the case of three-manifolds with torus boundary. This case can be used to give another proof of the surgery exact triangle for ĤF. We relate the bordered Floer homology of a three-manifold with torus boundary with the knot Floer homology of a filling.