Communication-optimal parallel algorithm for Strassen’s matrix multiplication
 In Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’12, 2012
Cited by 28 (17 self)
Abstract:
Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high-performance computing. We obtain a new parallel algorithm that is based on Strassen’s fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice. A critical bottleneck in parallelizing Strassen’s algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA ’11) prove lower bounds on these communication costs, using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds, and so is communication-optimal. It exhibits perfect strong scaling within the maximum possible range.
SRUMMA: a matrix multiplication algorithm suitable for clusters and scalable shared memory systems
 In Proceedings of the Parallel and Distributed Processing Symposium, 2004
Cited by 16 (5 self)
Abstract:
This paper describes a novel parallel algorithm that implements a dense matrix multiplication operation with algorithmic efficiency equivalent to that of Cannon’s algorithm. It is suitable for clusters and scalable shared-memory systems. The approach differs from other parallel matrix multiplication algorithms in its explicit use of shared memory and remote memory access (RMA) communication rather than message passing. Experimental results on clusters (IBM SP, Linux/Myrinet) and shared-memory systems (SGI Altix, Cray X1) demonstrate consistent performance advantages over pdgemm from the ScaLAPACK/PBBLAS suite, the leading implementation of parallel matrix multiplication in use today. In the best case, on the SGI Altix, the new algorithm performs 20 times better than pdgemm for a matrix size of 1000 on 128 processors. The impact of zero-copy non-blocking RMA communication and shared-memory communication on matrix multiplication performance on clusters is also investigated.
CommunicationAvoiding Parallel Strassen: Implementation and Performance
Cited by 7 (4 self)
Abstract:
Matrix multiplication is a fundamental kernel of many high-performance and scientific computing applications. Most parallel implementations use classical O(n^3) matrix multiplication, even though Strassen-like algorithms with lower arithmetic complexity exist, because the classical ones perform better in practice. We recently obtained a new parallel algorithm based on Strassen’s fast matrix multiplication (SPAA ’12) that minimizes communication: it communicates asymptotically less than all classical and all previous Strassen-based algorithms, and it attains the corresponding lower bounds. It is also the first parallel Strassen algorithm that exhibits perfect strong scaling. In this paper, we show that the new algorithm is also faster in practice. We benchmark and compare the performance of our new algorithm against previous algorithms on Franklin (Cray XT4), Hopper (Cray XE6), and Intrepid (IBM BG/P). We demonstrate significant speedups over previous algorithms both for large matrices and for small matrices on large numbers of processors. We model and analyze the performance of the algorithm and predict its performance on future exascale platforms.
A High Performance Parallel Strassen Implementation
 Parallel Processing Letters, Vol. 6, 1995
Cited by 6 (0 self)
Abstract:
In this paper, we give what we believe to be the first high-performance parallel implementation of Strassen’s algorithm for matrix multiplication. We show how, under restricted conditions, this algorithm can be implemented plug-compatible with standard parallel matrix multiplication algorithms. Results obtained on a large Intel Paragon system show a 10-20% reduction in execution time compared to what we believe to be the fastest standard parallel matrix multiplication implementation available at this time. In Strassen’s algorithm, the operation count of matrix multiplication is reduced by replacing one multiplication with a set of smaller matrix multiplications together with a number of matrix additions. A net reduction in execution time is attained only if the savings in multiplications offset the increase in additions, which requires the matrices to be relatively large before a net gain is observed. The advantage of using parallel architectures...
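The trade-off the abstract describes, seven half-size products instead of eight, at the cost of extra block additions, can be sketched as follows. This is a minimal sequential Python/NumPy illustration of the standard Strassen scheme (not the paper's parallel implementation); the function name and base-case cutoff are our own choices, and the input order is assumed to be a power of two.

```python
import numpy as np

def strassen(A, B):
    """Strassen multiply for n x n matrices, n a power of two."""
    n = A.shape[0]
    if n <= 2:                      # small base case: classical multiply
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # seven recursive products instead of the classical eight
    M1 = strassen(A11 + A22, B11 + B22)
    M2 = strassen(A21 + A22, B11)
    M3 = strassen(A11, B12 - B22)
    M4 = strassen(A22, B21 - B11)
    M5 = strassen(A11 + A12, B22)
    M6 = strassen(A21 - A11, B11 + B12)
    M7 = strassen(A12 - A22, B21 + B22)
    # recombine using additions and subtractions only
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

Each level saves one block multiplication but pays for 18 block additions/subtractions, which is exactly why, as the abstract notes, the matrices must be relatively large before a net gain appears.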
Hierarchical Matrix-Matrix Multiplication based on Multiprocessor Tasks
 In Proc. of the International Conference on Computational Science – ICCS 2004, LNCS, 2004
Cited by 2 (2 self)
Abstract:
We consider the realization of matrix-matrix multiplication and propose a hierarchical algorithm implemented in a task-parallel way using multiprocessor tasks on distributed memory. The algorithm has been designed to minimize communication overhead while exhibiting high locality of memory references. The task-parallel realization makes the algorithm especially suited for clusters of SMPs, since tasks can then be mapped to the different cluster nodes to efficiently exploit the cluster architecture. Experiments on current cluster machines show that the resulting execution times are competitive with state-of-the-art methods like PDGEMM.
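The task-parallel idea, computing each block of C as an independent task that can be mapped to a processor group, can be illustrated in plain Python with a thread pool. This is a shared-memory stand-in for the paper's distributed multiprocessor tasks, not its actual algorithm; the 2x2 grid, function name, and use of NumPy are our own illustrative choices.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def task_parallel_matmul(A, B, grid=2):
    """Compute C = A @ B by splitting C into a grid x grid block grid and
    evaluating every C-block as an independent task (one task per block)."""
    n = A.shape[0]
    s = n // grid                      # block size (assumes grid divides n)

    def block(i, j):
        # the (i, j) block of C is a sum of `grid` partial block products
        acc = np.zeros((s, s))
        for k in range(grid):
            acc += A[i*s:(i+1)*s, k*s:(k+1)*s] @ B[k*s:(k+1)*s, j*s:(j+1)*s]
        return i, j, acc

    C = np.empty_like(A)
    tasks = [(i, j) for i in range(grid) for j in range(grid)]
    with ThreadPoolExecutor() as pool:
        for i, j, acc in pool.map(lambda t: block(*t), tasks):
            C[i*s:(i+1)*s, j*s:(j+1)*s] = acc
    return C
```

Because every C-block task reads only a row panel of A and a column panel of B, the tasks are independent and exhibit the kind of memory-reference locality the abstract emphasizes.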
Memory Efficient Parallel Matrix Multiplication Operation for Irregular Problems
Cited by 1 (0 self)
Abstract:
Regular distributions for storing dense matrices on parallel systems are not always used in practice. In many scientific applications [...] (SRUMMA) [1] to handle irregularly distributed matrices. Our approach relies on a distribution-independent algorithm that provides dynamic load balancing by exploiting data locality, and it achieves performance as good as the traditional approach, which relies on temporary arrays with a regular distribution, data redistribution, and regular matrix multiplication to handle the irregular case. The proposed algorithm is memory-efficient because temporary matrices are not needed. This feature is critical for systems like the IBM Blue Gene/L that offer a very limited amount of memory per node. The experimental results demonstrate very good performance across the range of matrix distributions and problem sizes motivated by real applications.
Recursion Removal in Fast Matrix Multiplication
 2003
Abstract:
Recursion removal improves the efficiency of recursive algorithms, especially algorithms with large formal parameters, such as fast matrix multiplication algorithms. In this article, a general method of breaking the recursion in fast matrix multiplication algorithms is introduced, generalized from the recursion removal of a specific fast matrix multiplication algorithm due to Winograd.
Ami Paz, Technion
Abstract:
In this work, we use algebraic methods for studying distance computation and subgraph detection tasks in the congested clique model. Specifically, we adapt parallel matrix multiplication implementations to the congested clique, obtaining an O(n^(1-2/ω))-round matrix multiplication algorithm, where ω < 2.3728639 is the exponent of matrix multiplication. In conjunction with known techniques from centralised algorithmics, this gives significant improvements over previous best upper bounds in the congested clique model. The highlight results include: triangle and 4-cycle counting in O(n^0.158) rounds, improving upon the O(n^(1/3))-round triangle counting algorithm of Dolev et al. [DISC 2012]; a (1+o(1))-approximation of all-pairs shortest paths in O(n^0.158) rounds, improving upon the Õ(n^(1/2))-round (2+o(1))-approximation algorithm of Nanongkai [STOC 2014]; and computing the girth in O(n^0.158) rounds, which is the first nontrivial solution in this model. In addition, we present a novel constant-round combinatorial algorithm for detecting 4-cycles.
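The link between matrix multiplication and triangle counting exploited here is classical: for an adjacency matrix A, trace(A^3)/6 counts triangles, since each triangle contributes six closed length-3 walks. A minimal NumPy illustration (dense, centralised, purely for intuition; it is not the congested-clique algorithm, and the names are ours):

```python
import numpy as np

def count_triangles(adj):
    """Count triangles in a simple undirected graph given its 0/1
    adjacency matrix: each triangle yields 6 closed walks of length 3."""
    A = np.asarray(adj, dtype=np.int64)
    return int(np.trace(A @ A @ A)) // 6

# K4 (complete graph on 4 vertices) contains C(4,3) = 4 triangles
K4 = np.ones((4, 4), dtype=np.int64) - np.eye(4, dtype=np.int64)
```

Any faster matrix multiplication routine can be dropped in for the `A @ A @ A` step, which is exactly how the round complexities above inherit the exponent ω.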
Optimal Solution to Matrix Parenthesization Problem Employing Parallel Processing Approach
Abstract:
The optimal matrix parenthesization problem is an optimization problem that can be solved using dynamic programming. The paper discusses the problem in detail. The results and their analysis reveal a considerable reduction in time, compared with simple left-to-right multiplication, on applying the matrix parenthesization algorithm. The time reduction varies from 0% to 96%, depending on the number of matrices and the sequence of dimensions. It is also observed that on applying the parallel matrix parenthesization algorithm, time is at first reduced in proportion to the number of processors; however, beyond some point, adding more processors does not yield any more throughput but only increases overhead and cost. A major advantage of the parallel algorithm used is that it does not depend on the number of matrices. Moreover, work is evenly distributed between the processors. Keywords: matrix parenthesization problem, parallel processing, algorithm
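The dynamic program behind matrix parenthesization is the standard matrix-chain recurrence m[i][j] = min over k of m[i][k] + m[k+1][j] + d_i * d_(k+1) * d_(j+1). A sequential Python sketch of it (the function name and example dimensions are ours, not the paper's):

```python
def matrix_chain_cost(dims):
    """Minimum number of scalar multiplications needed to compute
    A1 A2 ... An, where Ai has shape dims[i-1] x dims[i]."""
    n = len(dims) - 1                   # number of matrices
    m = [[0] * n for _ in range(n)]     # m[i][j]: best cost for Ai..Aj
    for length in range(2, n + 1):      # chain lengths 2 .. n
        for i in range(n - length + 1):
            j = i + length - 1
            m[i][j] = min(
                m[i][k] + m[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return m[0][n - 1]
```

For example, with dims = [40, 20, 30, 10, 30] the best order is (A1 (A2 A3)) A4 at 26000 scalar multiplications, versus 48000 for the left-to-right order, the kind of gap behind the 0%-96% range reported above.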
Generalizing of a High Performance Parallel Strassen Implementation on Distributed Memory MIMD Architectures
Abstract:
Strassen’s algorithm to multiply two n×n matrices reduces the asymptotic operation count from the O(n^3) of the traditional algorithm to O(n^2.81), so designing efficient parallelizations of this algorithm is essential. In this paper, we present our generalization of a parallel Strassen implementation that obtained very good performance on an Intel Paragon: about 20% faster for n ≈ 1000 and more than 100% faster for n ≈ 5000 in comparison with the parallel traditional algorithms (such as Fox and Cannon). Our method can be applied to all matrix multiplication algorithms on distributed-memory computers that use Strassen’s algorithm at the system level, and hence it offers a compatible way to find better parallel implementations of Strassen’s algorithm.
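The exponent 2.81 comes from log2(7): one level of Strassen's recursion performs 7 half-size multiplications, so for n = 2^k the multiplication count satisfies M(n) = 7 M(n/2) with M(1) = 1, giving 7^k = n^(log2 7) ≈ n^2.807. A quick sanity check (the recurrence is standard; the function name is ours):

```python
import math

def strassen_mult_count(n):
    """Number of scalar multiplications Strassen performs on an n x n
    input (n a power of two), via the recurrence M(n) = 7 M(n/2), M(1) = 1."""
    return 1 if n == 1 else 7 * strassen_mult_count(n // 2)

# For n = 2^k the count is 7^k = n^(log2 7), and log2(7) ≈ 2.807 < 3,
# which is the asymptotic gap over the classical n^3 algorithm.
```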