Results 1–10 of 26
Communication-optimal parallel algorithm for Strassen’s matrix multiplication
 In Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’12
, 2012
"... Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen’s fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix mul ..."
Abstract

Cited by 28 (17 self)
Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen’s fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice. A critical bottleneck in parallelizing Strassen’s algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA’11) prove lower bounds on these communication costs, using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds, and so is communication-optimal. It exhibits perfect strong scaling within the maximum possible range.
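To make the abstract's starting point concrete, here is a minimal sequential sketch of Strassen's recursion (the classical seven-product scheme, not the paper's communication-optimal parallel algorithm): replacing the eight recursive block products of classical multiplication with seven is what yields the O(n^log2(7)) arithmetic cost that the parallel algorithm builds on. The sketch assumes square matrices whose dimension is a power of two.

```python
def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def strassen(A, B):
    """Multiply two n x n matrices (n a power of two) with 7 recursive products."""
    n = len(A)
    if n == 1:                        # base case: scalar product
        return [[A[0][0] * B[0][0]]]
    h = n // 2                        # split each operand into four h x h blocks
    A11 = [row[:h] for row in A[:h]]; A12 = [row[h:] for row in A[:h]]
    A21 = [row[:h] for row in A[h:]]; A22 = [row[h:] for row in A[h:]]
    B11 = [row[:h] for row in B[:h]]; B12 = [row[h:] for row in B[:h]]
    B21 = [row[:h] for row in B[h:]]; B22 = [row[h:] for row in B[h:]]
    # Strassen's seven products (instead of the classical eight)
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    # Recombine the products into the four result blocks
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

The communication bottleneck the abstract refers to arises precisely from distributing these seven recursive subproblems across processors; the sequential recursion above only fixes the arithmetic.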
Communication-Avoiding Parallel Strassen: Implementation and Performance
"... Abstract—Matrix multiplication is a fundamental kernel of many high performance and scientific computing applications. Most parallel implementations use classical O(n 3) matrix multiplication, even though there exist Strassenlike matrix multiplication algorithms that have lower arithmetic complexit ..."
Abstract

Cited by 7 (4 self)
Matrix multiplication is a fundamental kernel of many high performance and scientific computing applications. Most parallel implementations use classical O(n^3) matrix multiplication, even though there exist Strassen-like matrix multiplication algorithms with lower arithmetic complexity, because the classical ones perform better in practice. We recently obtained a new parallel algorithm based on Strassen’s fast matrix multiplication (SPAA ’12) that minimizes communication: it communicates asymptotically less than all classical and all previous Strassen-based algorithms, and it attains the corresponding lower bounds. It is also the first parallel Strassen-based algorithm that exhibits perfect strong scaling. In this paper, we show that the new algorithm is also faster in practice. We benchmark and compare the performance of our new algorithm to previous algorithms on Franklin (Cray XT4), Hopper (Cray XE6), and Intrepid (IBM BG/P). We demonstrate significant speedups over previous algorithms both for large matrices and for small matrices on large numbers of processors. We model and analyze the performance of the algorithm, and predict its performance on future exascale platforms.
Graph Expansion Analysis for Communication Costs of Fast Rectangular Matrix Multiplication
, 2012
"... ..."
Strong scaling of matrix multiplication algorithms and memory-independent communication lower bounds
, 2012
"... A parallel algorithm has perfect strong scaling if its running time on P processors is linear in 1/P, including all communication costs. Distributedmemory parallel algorithms for matrix multiplication with perfect strong scaling have only recently been found. One is based on classical matrix multi ..."
Abstract

Cited by 7 (7 self)
A parallel algorithm has perfect strong scaling if its running time on P processors is linear in 1/P, including all communication costs. Distributed-memory parallel algorithms for matrix multiplication with perfect strong scaling have only recently been found. One is based on classical matrix multiplication (Solomonik and Demmel, 2011), and one is based on Strassen’s fast matrix multiplication (Ballard, Demmel, Holtz, Lipshitz, and Schwartz, 2012). Both algorithms scale perfectly, but only up to some number of processors beyond which the interprocessor communication no longer scales. We obtain a memory-independent communication cost lower bound on classical and Strassen-based distributed-memory matrix multiplication algorithms. These bounds imply that no classical or Strassen-based parallel matrix multiplication algorithm can strongly scale perfectly beyond the ranges already attained by the two parallel algorithms mentioned above. The memory-independent bounds and the strong scaling bounds generalize to other algorithms.
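The crossover the abstract describes can be illustrated numerically. A hedged sketch, using the asymptotic bandwidth lower-bound terms for classical matrix multiplication as stated in this line of work (memory-dependent n^3/(P·sqrt(M)), memory-independent n^2/P^(2/3)), with all constants dropped and the problem and memory sizes made up for illustration: the memory-dependent term falls like 1/P, so the bound can scale perfectly only until the memory-independent term takes over, at roughly P = n^3/M^(3/2).

```python
import math

def classical_bandwidth_bound(n, P, M):
    """Max of the two asymptotic lower-bound terms, constants dropped.

    n: matrix dimension, P: processor count, M: local memory per processor.
    """
    mem_dependent = n**3 / (P * math.sqrt(M))    # falls like 1/P
    mem_independent = n**2 / P ** (2.0 / 3.0)    # falls only like 1/P^(2/3)
    return max(mem_dependent, mem_independent)

# Hypothetical sizes, chosen so the crossover lands at a round number:
n, M = 2**12, 2**20
P_star = n**3 / M**1.5    # crossover where the two terms meet (64 here)

# Inside the range (P well below P_star), doubling P halves the bound;
# far beyond P_star the bound shrinks only by 2^(2/3) per doubling, which
# is exactly why perfect strong scaling must end near P_star.
ratio_inside = classical_bandwidth_bound(n, 8, M) / classical_bandwidth_bound(n, 16, M)
ratio_beyond = classical_bandwidth_bound(n, 512, M) / classical_bandwidth_bound(n, 1024, M)
```

Under these made-up sizes, `ratio_inside` is 2 while `ratio_beyond` is about 1.59, matching the abstract's claim that the scaling range is bounded.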
Communication Lower Bounds for Distributed-Memory Computations
"... In this paper we propose a new approach to the study of the communication requirements of distributed computations, which advocates for the removal of the restrictive assumptions under which earlier results were derived. We illustrate our approach by giving tight lower bounds on the communication co ..."
Abstract

Cited by 4 (1 self)
In this paper we propose a new approach to the study of the communication requirements of distributed computations, which advocates for the removal of the restrictive assumptions under which earlier results were derived. We illustrate our approach by giving tight lower bounds on the communication complexity required to solve several computational problems in a distributed-memory parallel machine, namely standard matrix multiplication, stencil computations, comparison sorting, and the Fast Fourier Transform. Our bounds rely only on a mild assumption on work distribution, and significantly strengthen previous results, which require either the computation to be balanced among the processors, or specific initial distributions of the input data, or an upper bound on the size of the processors’ local memories.
A Lower Bound Technique for Communication on BSP with Application to the FFT
"... Abstract. Communication complexity is defined, within the Bulk Synchronous Parallel (BSP) model of computation, as the sum of the degrees of all the supersteps. A lower bound to the communication complexity is derived for a given class of DAG computations in terms of the switching potential of a DAG ..."
Abstract

Cited by 3 (1 self)
Communication complexity is defined, within the Bulk Synchronous Parallel (BSP) model of computation, as the sum of the degrees of all the supersteps. A lower bound on the communication complexity is derived for a given class of DAG computations in terms of the switching potential of a DAG, that is, the number of permutations that the DAG can realize when viewed as a switching network. The proposed technique yields a novel and tight lower bound for the FFT graph.
Communication Bounds for Heterogeneous Architectures
"... As the gap between the cost of communication (i.e., data movement) and computation continues to grow, pursuing algorithms which minimize communication has become a critical research objective. Toward this end, we seek asymptotic communication lower bounds for general memory models and classes of alg ..."
Abstract

Cited by 3 (3 self)
As the gap between the cost of communication (i.e., data movement) and computation continues to grow, pursuing algorithms which minimize communication has become a critical research objective. Toward this end, we seek asymptotic communication lower bounds for general memory models and classes of algorithms. Recent work [2] has established lower bounds for a wide set of linear algebra algorithms on a sequential machine and on a parallel machine with identical processors. This work extends these previous bounds to a heterogeneous model in which processors access data and perform floating point operations at differing speeds. We also present algorithms which prove that the lower bounds are tight (i.e., attainable) in the cases of dense matrix-vector and matrix-matrix multiplication.
Beyond reuse distance analysis: Dynamic analysis for characterization of data locality potential
 ACM Trans. Archit. Code Optim
, 2013
"... Emerging computer architectures will feature drastically decreased flops/byte (ratio of peak processing rate to memory bandwidth) as highlighted by recent studies on Exascale architectural trends. Further, flops are getting cheaper while the energy cost of data movement is increasingly dominant. The ..."
Abstract

Cited by 2 (0 self)
Emerging computer architectures will feature drastically decreased flops/byte (ratio of peak processing rate to memory bandwidth), as highlighted by recent studies on Exascale architectural trends. Further, flops are getting cheaper while the energy cost of data movement is increasingly dominant. The understanding and characterization of data locality properties of computations is critical in order to guide efforts to enhance data locality. Reuse distance analysis of memory address traces is a valuable tool to perform data locality characterization of programs. A single reuse distance analysis can be used to estimate the number of cache misses in a fully associative LRU cache of any size, thereby providing estimates on the minimum bandwidth requirements at different levels of the memory hierarchy to avoid being bandwidth bound. However, such an analysis only holds for the particular execution order that produced the trace. It cannot estimate potential improvement in data locality through dependence preserving transformations that change the execution schedule of the operations in the computation. In this article, we develop a novel dynamic analysis approach to characterize the inherent locality properties of a computation and thereby assess the potential for data locality enhancement via dependence preserving transformations. The execution trace of a code is analyzed to extract a computational directed acyclic graph (CDAG) of the data dependences. The CDAG is then partitioned into convex subsets, and the convex partitioning is used to reorder the operations in the execution trace to enhance data locality.
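The abstract's core tool, reuse distance analysis, is simple enough to sketch. A minimal illustrative version (the trace and cache sizes below are made up, not taken from the article): the reuse distance of an access is the number of distinct addresses touched since the previous access to the same address, and an access misses in a fully associative LRU cache of capacity C exactly when its reuse distance is at least C, so one pass over the trace yields miss counts for every cache size at once.

```python
def reuse_distances(trace):
    """Reuse distance of each access; float('inf') marks a cold (first) touch."""
    stack = []                  # LRU stack: most recently used address last
    dists = []
    for addr in trace:
        if addr in stack:
            i = stack.index(addr)
            dists.append(len(stack) - 1 - i)   # distinct addresses since last use
            stack.pop(i)
        else:
            dists.append(float("inf"))          # never seen: cold miss
        stack.append(addr)                      # addr becomes most recent
    return dists

def lru_misses(trace, capacity):
    """Misses in a fully associative LRU cache: reuse distance >= capacity."""
    return sum(1 for d in reuse_distances(trace) if d >= capacity)
```

For the trace a b c a b c, every reuse has distance 2, so a 3-entry LRU cache incurs only the 3 cold misses while a 2-entry cache misses on all 6 accesses. This linear-scan formulation is quadratic in trace length; production tools use tree-based stacks, but the semantics are the same. The article's point is that such an analysis is tied to one execution order, which is what its CDAG-based approach relaxes.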