Results 1 - 10 of 44
SUMMA: Scalable universal matrix multiplication algorithm
, 1997
Cited by 58 (3 self)
In this paper, we give a straightforward, highly efficient, scalable implementation of common matrix multiplication operations. The algorithms are much simpler than previously published methods, yield better performance, and require less work space. MPI implementations are given, as are performance results on the Intel Paragon system.
PUMMA: Parallel Universal Matrix Multiplication Algorithms on Distributed Memory Concurrent Computers
, 1993
Cited by 57 (11 self)
0-5, NASA Ames Research Center, Moffet Field, CA 94035 134. William C. Skamarock, 3973 Escuela Court, Boulder, CO 80301 135. Richard Smith, Los Alamos National Laboratory, Group T-3, Mail Stop B2316, Los Alamos, NM 87545 136. Peter Smolarkiewicz, National Center for Atmospheric Research, MMM Group, P. O. Box 3000, Boulder, CO 80307 137. Jurgen Steppeler, DWD, Frankfurterstr 135, 6050 Offenbach, WEST GERMANY 138. Rick Stevens, Mathematics and Computer Science Division, Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439 139. Paul N. Swarztrauber, National Center for Atmospheric Research, P. O. Box 3000, Boulder, CO 80307 140. Wei Pai Tang, Department of Computer Science, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1 141. Harold Trease, Los Alamos National Laboratory, Mail Stop B257, Los Alamos, NM 87545 142. Robert G. Voigt, ICASE, MS 132-C, NASA Langley Research Center, Hampton, VA 23665 143. Mary F. Wheeler, Rice University, Department of Mathematical Sc
Public International Benchmarks for Parallel Computers
, 1994
Cited by 31 (0 self)
this report: David Bailey (NASA Ames Research Center), Michael Berry (University of Tennessee), Jack Dongarra (University of Tennessee/Oak Ridge National Laboratory), Vladimir Getov (University of Southampton), Tom Haupt (Syracuse University), Tony Hey (University of Southampton), Roger Hockney (University of Southampton), and David Walker (Oak Ridge National Laboratory). The following PARKBENCH participants were instrumental in defining/promoting the effort, attending meetings, and providing helpful comments and suggestions: Ed Brocklehurst (National Physical Laboratory), Koushik Ghosh (Cray Research), Charles Grassl (Cray Research), Ed Kushner (Intel SSD), Brian LaRose (Hewlett Packard), Todd Letsche (University of Tennessee), David Mackay (Intel SSD), Joanne Martin (IBM), Ramesh Natarajan (IBM, Yorktown Heights), Bodo Parady (Sun Microsystems), Robert Pennington (Pittsburgh Supercomputing Center), Philip Tannenbaum (NEC), Pearl Wang (George Mason University/US Geological Survey), and Patrick Worley (Oak Ridge National Laboratory). Special thanks are also due to Jack Dongarra in his role of host at our meetings in Knoxville, and to Mike Berry, who has served valiantly as secretary at our meetings and produced excellent minutes in difficult circumstances. This publication and the earlier report could not have been produced without the dedication of Roger Hockney, Mike Berry and Vladimir Getov, who devoted many hours to turning a collection of individual contributions into a coherent LaTeX document that was fit for publication.
Communication Lower Bounds for Distributed-Memory Matrix Multiplication
, 2004
Cited by 31 (0 self)
this paper. More specifically, we use the definitions of [10]: $\Theta(g(n))$ is the set of functions $f(n)$ such that there exist positive constants $c_1$, $c_2$, and $n_0$ such that $0 \le c_1 g(n) \le f(n) \le c_2 g(n)$ for all $n \ge n_0$; $O(g(n))$ is defined similarly using the weaker condition $0 \le f(n) \le c_2 g(n)$; $\Omega(g(n))$ is defined with the condition $0 \le c_1 g(n) \le f(n)$. The set $o(g(n))$ consists of functions $f(n)$ such that for any $c_2 > 0$ there exists a constant $n_0 > 0$ such that $0 \le f(n) < c_2 g(n)$ for all $n \ge n_0$
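As a quick illustration of the two-sided bound in the first definition above, here is a worked instance. The function $f(n) = 3n^2 + 2n$ and the constants are chosen purely for illustration and do not come from the paper.

```latex
\[
  0 \;\le\; 3\,n^2 \;\le\; 3n^2 + 2n \;\le\; 5\,n^2
  \qquad \text{for all } n \ge 1,
\]
% The right-hand inequality holds because 2n <= 2n^2 whenever n >= 1.
% Hence c_1 = 3, c_2 = 5, n_0 = 1 witness that 3n^2 + 2n lies in \Theta(n^2).
```

The same constants $c_2 = 5$, $n_0 = 1$ also show membership in $O(n^2)$, and $c_1 = 3$, $n_0 = 1$ show membership in $\Omega(n^2)$, since each of those conditions keeps only one side of the bound.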
Algorithmic redistribution methods for block cyclic decompositions
- IEEE Trans. on PDS
, 1996
Cited by 22 (2 self)
To my parents. Acknowledgments. The writer expresses gratitude and appreciation to the members of his dissertation committee, Michael Berry, Charles Collins, Jack Dongarra, Mark Jones and David Walker for their encouragement and participation throughout my doctoral experience. Special appreciation is due to Professor Jack Dongarra, Chairman, who provided sound guidance, support and appropriate commentaries during the course of my graduate study. I also would like to thank Yves Robert and R. Clint Whaley for many useful and instructive discussions on general parallel algorithms and message passing software libraries. Many valuable comments for improving the presentation of this document were received from L. Susan Blackford. Finally, I am grateful to the Department of Computer Science at the University of Tennessee for allowing me to do this doctoral research work here. A special debt of gratitude is owed to Joanne Martin, IBM POWERparallel Division, for awarding me an IBM Corporation Fellowship covering the tuition as well as a stipend for the 1994-96 academic years. This work was also supported
EcliPSe: A System for High Performance Concurrent Simulation
, 1991
Cited by 20 (10 self)
this paper describes our approach from the system point of view. The programming interface is described in detail in the next section, following which the design and salient implementation aspects are discussed. Representative results from a few simulation systems are then reported, and the concluding section discusses some of the critical issues in such an approach, the implications for applications other than stochastic simulation, and ongoing and future work
Scalability of Parallel Algorithms for Matrix Multiplication
- in Proc. of Int. Conf. on Parallel Processing
, 1991
Cited by 19 (0 self)
A number of parallel formulations of the dense matrix multiplication algorithm have been developed. For an arbitrarily large number of processors, any of these algorithms or their variants can provide near-linear speedup for sufficiently large matrix sizes, and none of the algorithms can be clearly claimed to be superior to the others. In this paper we analyze the performance and scalability of a number of parallel formulations of the matrix multiplication algorithm and predict the conditions under which each formulation is better than the others. We present a parallel formulation for hypercube and related architectures that performs better than any of the schemes described in the literature so far for a wide range of matrix sizes and numbers of processors. The superior performance and the analytical scalability expressions for this algorithm are verified through experiments on the Thinking Machines Corporation's CM-5 parallel computer for up to 512 processors. We show that special har...
Comparison of Scalable Parallel Matrix Multiplication Libraries
- in Proceedings of the Scalable Parallel Libraries Conference, Starkville, MS
, 1993
Cited by 17 (2 self)
This paper compares two general library routines for performing parallel distributed matrix multiplication. The PUMMA algorithm utilizes block scattered data layout, whereas BiMMeR utilizes virtual 2-D torus wrap. The algorithmic differences resulting from these different layouts are discussed as well as the general issues associated with different data layouts for library routines. Results on the Intel Delta for the two matrix multiplication algorithms are presented. 1. Introduction Matrix multiplication is a standard algorithm that is an important computational kernel in many applications including eigensolvers [3] and LU factorization [15]. Utilizing matrix multiplication is one of the principal ways of achieving high efficiency block algorithms in packages such as LAPACK [2]. The BLAS 3 routines were added to achieve this block performance on computers, and optimized versions are available on most serial machines [10]. For matrix multiplication, the BLAS 3 routine XGEMM is availa...
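The block scattered layout named in this abstract is commonly described as a 2-D block-cyclic mapping of matrix blocks onto a process grid. A minimal sketch under that assumption follows; the function names, block sizes, and grid shape are hypothetical and are not taken from PUMMA or BiMMeR.

```python
def block_owner(i, j, block_rows, block_cols, grid_rows, grid_cols):
    """Return the (row, col) grid coordinate of the process owning entry (i, j).

    Assumes a plain 2-D block-cyclic mapping: the matrix is cut into
    block_rows x block_cols blocks, and block (I, J) is assigned to
    process (I mod grid_rows, J mod grid_cols).
    """
    block_i = i // block_rows   # index of the block row containing row i
    block_j = j // block_cols   # index of the block column containing column j
    return block_i % grid_rows, block_j % grid_cols


def owner_map(n_rows, n_cols, block_rows, block_cols, grid_rows, grid_cols):
    """Build the full table of owners for an n_rows x n_cols matrix."""
    return [
        [
            block_owner(i, j, block_rows, block_cols, grid_rows, grid_cols)
            for j in range(n_cols)
        ]
        for i in range(n_rows)
    ]


if __name__ == "__main__":
    # 8 x 8 matrix, 2 x 2 blocks, 2 x 2 process grid (illustrative values).
    for row in owner_map(8, 8, 2, 2, 2, 2):
        print(" ".join(f"{r}{c}" for r, c in row))
```

Because consecutive blocks wrap around the grid, every process holds blocks from all regions of the matrix, which is the property usually cited for load balance in such layouts.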
Matrix Multiplication On The Intel Touchstone Delta
, 1993
Cited by 17 (2 self)
Matrix multiplication is a key primitive in block matrix algorithms such as those found in LAPACK. We present results from our study of matrix multiplication algorithms on the Intel Touchstone Delta, a distributed memory message-passing architecture with a two-dimensional mesh topology. We obtain an implementation that uses communications primitives highly suited to the Delta and exploits the single node assembly-coded matrix multiplication. Our algorithm is completely general, able to deal with arbitrary mesh aspect ratios and matrix dimensions, and has achieved parallel efficiency of 86% with overall peak performance in excess of 8 Gflops on 256 nodes for an $8800 \times 8800$ matrix. We describe our algorithm design and implementation, and present performance results that demonstrate scalability and robust behavior over varying mesh topologies. 1. Introduction Multiplication of two matrices is one of the most basic operations of scientific computing. Versions for serial computers h...
Matrix Multiplication on Hypercubes Using Full Bandwidth and Constant Storage
- in Proceedings of the Sixth Distributed Memory Computing Conference
, 1991
Cited by 17 (3 self)
For matrix multiplication on hypercube multiprocessors with the product matrix accumulated in place, a processor must receive about $P^2/\sqrt{N}$ elements of each input operand, with operands of size $P \times P$ distributed evenly over $N$ processors. With concurrent communication on all ports, the number of element transfers in sequence can be reduced to $P^2/(\sqrt{N} \log N)$ for each input operand. We present a two-level partitioning of the matrices and an algorithm for the matrix multiplication with optimal data motion and constant storage. The algorithm has sequential arithmetic complexity $2P^3$, and parallel arithmetic complexity $2P^3/N$. The algorithm has been implemented on the Connection Machine model CM-2. For the performance on the 8K CM-2, we measured about 1.6 Gflops, which would scale up to about 13 Gflops for a 64K full machine. 1 Introduction The multiplication of matrices is an important operation in many computationally intensive scientific applications. Effective use...
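The counts quoted in this abstract can be evaluated for concrete sizes. The sketch below is only an illustration: it assumes the garbled expressions read as P^2/sqrt(N) and P^2/(sqrt(N) log N), takes the logarithm to be base 2 (the hypercube dimension), and assumes linear scaling for the Gflops projection. The function names and sample sizes are hypothetical.

```python
import math


def hypercube_matmul_costs(p, n):
    """Evaluate the abstract's cost expressions for P x P operands on N processors.

    p -- matrix order (P in the abstract)
    n -- number of processors (N in the abstract), assumed to be a power of two
    """
    sqrt_n = math.sqrt(n)
    log_n = math.log2(n)  # assumption: log N means the hypercube dimension
    return {
        # elements of each input operand a processor must receive
        "elements_received": p * p / sqrt_n,
        # element transfers in sequence when all ports are used concurrently
        "transfers_all_ports": p * p / (sqrt_n * log_n),
        # arithmetic operation counts
        "sequential_flops": 2 * p ** 3,
        "parallel_flops": 2 * p ** 3 / n,
    }


def linearly_scaled_gflops(measured_gflops, measured_procs, target_procs):
    """Project a measured rate to a larger machine, assuming linear scaling."""
    return measured_gflops * target_procs / measured_procs


if __name__ == "__main__":
    print(hypercube_matmul_costs(1024, 4096))
    # 1.6 Gflops on 8K processors projected to a 64K machine
    print(linearly_scaled_gflops(1.6, 8 * 1024, 64 * 1024))
```

Under the linear-scaling assumption the projection gives 12.8 Gflops, consistent with the "about 13 Gflops" stated in the abstract.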

