Results 1–10 of 27
Communication-optimal parallel algorithm for Strassen’s matrix multiplication
In Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA ’12), 2012
Abstract

Cited by 28 (17 self)
Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high-performance computing. We obtain a new parallel algorithm that is based on Strassen’s fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice. A critical bottleneck in parallelizing Strassen’s algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA ’11) prove lower bounds on these communication costs, using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds, and so is communication-optimal. It exhibits perfect strong scaling within the maximum possible range.
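For readers unfamiliar with the recursion the abstract refers to, Strassen’s scheme trades the eight recursive block products of classical multiplication for seven. A minimal sequential sketch in Python (NumPy for the block arithmetic; the `leaf` cutoff is an illustrative tuning parameter, not something specified by the paper):

```python
import numpy as np

def strassen(A, B, leaf=64):
    """Multiply square matrices via Strassen's 7-product recursion.

    Falls back to classical multiplication at or below the (hypothetical)
    leaf size, and whenever the dimension is odd.
    """
    n = A.shape[0]
    if n <= leaf or n % 2 != 0:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Seven products where the classical recursion would use eight.
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

The parallelization question the paper addresses is how to distribute these seven subproblems across processors while keeping the data movement within the proven lower bounds.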
Graph Expansion Analysis for Communication Costs of Fast Rectangular Matrix Multiplication
, 2012
Strong scaling of matrix multiplication algorithms and memory-independent communication lower bounds
, 2012
Abstract

Cited by 7 (7 self)
A parallel algorithm has perfect strong scaling if its running time on P processors is linear in 1/P, including all communication costs. Distributed-memory parallel algorithms for matrix multiplication with perfect strong scaling have only recently been found. One is based on classical matrix multiplication (Solomonik and Demmel, 2011), and one is based on Strassen’s fast matrix multiplication (Ballard, Demmel, Holtz, Lipshitz, and Schwartz, 2012). Both algorithms scale perfectly, but only up to some number of processors where the interprocessor communication no longer scales. We obtain a memory-independent communication cost lower bound on classical and Strassen-based distributed-memory matrix multiplication algorithms. These bounds imply that no classical or Strassen-based parallel matrix multiplication algorithm can strongly scale perfectly beyond the ranges already attained by the two parallel algorithms mentioned above. The memory-independent bounds and the strong scaling bounds generalize to other algorithms.
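The scaling limit the abstract describes can be illustrated with a toy cost model: the computation term scales as 1/P, but the bandwidth term of a classical 3D-style algorithm scales only as P^(-2/3), so speedup eventually falls away from P. The functional form is the standard textbook one; the constants `t_flop` and `t_word` below are invented for illustration and are not taken from the paper:

```python
def runtime(n, P, t_flop=1e-9, t_word=1e-7):
    """Toy runtime model for classical parallel matrix multiplication:
    n^3 / P flops per processor, plus O(n^2 / P^(2/3)) words moved.
    Illustrative constants only."""
    return t_flop * n**3 / P + t_word * n**2 / P ** (2 / 3)

def speedup(n, P):
    """Speedup over the single-processor runtime under the toy model."""
    return runtime(n, 1) / runtime(n, P)
```

Because the communication term shrinks more slowly than 1/P, `speedup(n, P)` stays below P and drifts further from it as P grows, which is the qualitative behavior the memory-independent bounds make rigorous.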
Communication-Avoiding Parallel Strassen: Implementation and Performance
Abstract

Cited by 7 (4 self)
Matrix multiplication is a fundamental kernel of many high-performance and scientific computing applications. Most parallel implementations use classical O(n³) matrix multiplication, even though there exist Strassen-like matrix multiplication algorithms with lower arithmetic complexity, because the classical ones perform better in practice. We recently obtained a new parallel algorithm based on Strassen’s fast matrix multiplication (SPAA ’12) that minimizes communication: it communicates asymptotically less than all classical and all previous Strassen-based algorithms, and it attains the corresponding lower bounds. It is also the first parallel Strassen-based algorithm that exhibits perfect strong scaling. In this paper, we show that the new algorithm is also faster in practice. We benchmark and compare the performance of our new algorithm against previous algorithms on Franklin (Cray XT4), Hopper (Cray XE6), and Intrepid (IBM BG/P). We demonstrate significant speedups over previous algorithms both for large matrices and for small matrices on large numbers of processors. We model and analyze the performance of the algorithm, and predict its performance on future exascale platforms.
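The tension the abstract notes, that classical multiplication wins in practice for small matrices while Strassen-like algorithms win asymptotically, can be made concrete by counting arithmetic. The recurrence below is a standard textbook operation count (assuming power-of-two n), not the paper’s performance model:

```python
def classical_flops(n):
    """2n^3 - n^2 scalar operations for classical matrix multiplication."""
    return 2 * n**3 - n**2

def strassen_flops(n, leaf=1):
    """Textbook count for Strassen: F(n) = 7 F(n/2) + 18 (n/2)^2, where the
    18 (n/2)^2 term covers the block additions/subtractions. Assumes n is a
    power of two; `leaf` is an illustrative recursion cutoff."""
    if n <= leaf:
        return classical_flops(n)
    return 7 * strassen_flops(n // 2, leaf) + 18 * (n // 2) ** 2
```

At n = 2 Strassen already loses on raw operation count (25 vs. 12), and the extra additions keep it behind until n is fairly large, which is why tuned implementations switch to the classical kernel below a cutoff.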
Communication Lower Bounds for Distributed-Memory Computations
Abstract

Cited by 4 (1 self)
In this paper we propose a new approach to the study of the communication requirements of distributed computations, which advocates for the removal of the restrictive assumptions under which earlier results were derived. We illustrate our approach by giving tight lower bounds on the communication complexity required to solve several computational problems in a distributed-memory parallel machine, namely standard matrix multiplication, stencil computations, comparison sorting, and the Fast Fourier Transform. Our bounds rely only on a mild assumption on work distribution, and significantly strengthen previous results, which require either the computation to be balanced among the processors, or specific initial distributions of the input data, or an upper bound on the size of processors’ local memories.
Program-Centric Cost Models for Locality and Parallelism
, 2013
Abstract

Cited by 3 (1 self)
Good locality is critical for the scalability of parallel computations. Many cost models that quantify the locality and parallelism of a computation with respect to specific machine models have been proposed. A significant drawback of these machine-centric cost models is their lack of portability. Since the design and analysis of good algorithms in most machine-centric cost models is a nontrivial task, lack of portability can lead to a significant waste of design effort. Therefore, a machine-independent, portable cost model for locality and parallelism that is relevant to a broad class of machines can be a valuable guide for the design of portable and scalable algorithms, as well as for understanding the complexity of problems. This thesis addresses the problem of portable analysis by presenting program-centric metrics for measuring the locality and parallelism of nested-parallel programs written for shared-memory machines – metrics based solely on the program structure, without reference to machine parameters such as processors, caches, and connections. The metrics we present for this purpose are the parallel cache com…
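Program-centric metrics of the kind described here, such as the classical work and span (depth) of a nested-parallel program, are computed from the program structure alone. A sketch for a divide-and-conquer reduction; note that no processor count, cache size, or other machine parameter appears anywhere (this illustrates the general idea, not the thesis’s specific parallel cache metrics):

```python
def work_and_span(n):
    """Work and span of a balanced divide-and-conquer reduction over n items.

    Work counts every operation performed; span counts only the critical
    path, since the two recursive halves are assumed to run in parallel.
    """
    if n == 1:
        return 1, 1
    w_left, s_left = work_and_span(n // 2)
    w_right, s_right = work_and_span(n - n // 2)
    # One combine step joins the halves: it adds to both work and span,
    # but the halves themselves contribute max(), not sum, to the span.
    return w_left + w_right + 1, max(s_left, s_right) + 1
```

For a power-of-two n this gives work 2n − 1 and span log₂(n) + 1, the kind of machine-independent quantities such models reason about.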
A Lower Bound Technique for Communication on BSP with Application to the FFT
Abstract

Cited by 3 (1 self)
Communication complexity is defined, within the Bulk Synchronous Parallel (BSP) model of computation, as the sum of the degrees of all the supersteps. A lower bound on the communication complexity is derived for a given class of DAG computations in terms of the switching potential of a DAG, that is, the number of permutations that the DAG can realize when viewed as a switching network. The proposed technique yields a novel and tight lower bound for the FFT graph.
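For intuition about what “sum of superstep degrees” measures, consider the textbook BSP FFT: with n points on P processors (assumed here to be powers of two with 1 < P < n), roughly ceil(log n / log(n/P)) supersteps, each an (n/P)-relation, suffice. This is the classical upper-bound side of the story, not the paper’s switching-potential lower bound:

```python
import math

def bsp_fft_communication(n, P):
    """Sum of superstep degrees for the textbook blocked BSP FFT.

    Each of ceil(log2(n) / log2(n/P)) supersteps exchanges n/P words per
    processor, i.e. is an (n/P)-relation. Assumes n and P are powers of
    two with 1 < P < n (illustrative upper-bound scheme).
    """
    h = n // P  # degree of each superstep
    supersteps = math.ceil(math.log2(n) / math.log2(n // P))
    return h * supersteps
```

A tight lower bound of the kind the paper proves shows that no rescheduling of the FFT DAG can do asymptotically better than such schemes.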
Communication Bounds for Heterogeneous Architectures
Abstract

Cited by 3 (3 self)
As the gap between the cost of communication (i.e., data movement) and computation continues to grow, pursuing algorithms which minimize communication has become a critical research objective. Toward this end, we seek asymptotic communication lower bounds for general memory models and classes of algorithms. Recent work [2] has established lower bounds for a wide set of linear algebra algorithms on a sequential machine and on a parallel machine with identical processors. This work extends these previous bounds to a heterogeneous model in which processors access data and perform floating-point operations at differing speeds. We also present algorithms which prove that the lower bounds are tight (i.e., attainable) in the cases of dense matrix-vector and matrix-matrix multiplication.
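In a heterogeneous model like the one described, load balance requires assigning work in proportion to processor speeds rather than evenly. A sketch for distributing the rows of a dense matrix-vector multiply (an illustrative balancer of the computation only; the paper’s contribution concerns the communication costs, which this ignores):

```python
def partition_rows(n, speeds):
    """Split n rows among heterogeneous processors in proportion to their
    flop rates, so each finishes its local matrix-vector product at
    roughly the same time. Illustrative sketch, not the paper's algorithm."""
    total = sum(speeds)
    shares = [n * s // total for s in speeds]
    # Rows lost to integer truncation go to the fastest processors.
    leftover = n - sum(shares)
    for i in sorted(range(len(speeds)), key=lambda i: -speeds[i])[:leftover]:
        shares[i] += 1
    return shares
```

With speeds [1, 2, 3] and 100 rows, the fastest processor gets roughly half the rows, which equalizes per-processor compute time under the proportional-speed assumption.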