Results 1 - 10 of 17
SUMMA: Scalable Universal Matrix Multiplication Algorithm
, 1997
Abstract

Cited by 92 (4 self)
In this paper, we give a straightforward, highly efficient, scalable implementation of common matrix multiplication operations. The algorithms are much simpler than previously published methods, yield better performance, and require less work space. MPI implementations are given, as are performance results on the Intel Paragon system.
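The formulation SUMMA parallelizes builds C = A·B as a sum of outer products of column panels of A with row panels of B, one panel pair per step. A minimal serial sketch of that panel loop (the function name `summa_like_matmul` and the panel width are illustrative assumptions, not the authors' MPI code):

```python
# Serial sketch of the SUMMA panel loop: C = A*B accumulated as a sum of
# outer products of column panels of A with row panels of B.  In the
# parallel algorithm each step broadcasts one panel pair across a 2-D
# process grid; here the "broadcast" is just the outer loop over panels.
def summa_like_matmul(A, B, panel=2):
    m, k = len(A), len(A[0])
    n = len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for p in range(0, k, panel):            # one panel pair per "step"
        for i in range(m):
            for j in range(n):
                for l in range(p, min(p + panel, k)):
                    C[i][j] += A[i][l] * B[l][j]
    return C
```

In the distributed algorithm, the inner triple loop becomes a local rank-`panel` update that each process applies after receiving its row and column panels.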
PUMMA: Parallel Universal Matrix Multiplication Algorithms on Distributed Memory Concurrent Computers
, 1993
Cited by 74 (10 self)
Public International Benchmarks for Parallel Computers: PARKBENCH Committee Report-1
 Scientific Programming
, 1994
Parallel Band Reduction and Tridiagonalization
 Proceedings, Sixth SIAM Conference on Parallel Processing for Scientific Computing
, 1993
Abstract

Cited by 17 (5 self)
This paper presents a parallel implementation of a blocked band reduction algorithm for symmetric matrices suggested by Bischof and Sun. The reduction to tridiagonal or block tridiagonal form is a special case of this algorithm. A blocked double torus wrap mapping is used as the underlying data distribution, and the so-called WY representation is employed to represent block orthogonal transformations. Preliminary performance results on the Intel Delta indicate that the algorithm is well-suited to a MIMD computing environment and that the use of a block approach significantly improves performance.

1 Introduction
Reduction to tridiagonal form is a major step in eigenvalue computations for symmetric matrices. If the matrix is full, the conventional Householder tridiagonalization approach [13, p. 276] or block variants thereof [12] is the method of choice. These two approaches also underlie the parallel implementations described, for example, in [15] and [10]. The approach described in this ...
A Programming Model for Block-Structured Scientific Calculations on SMP Clusters
 Ph.D. Dissertation, UCSD
, 1998
The PRISM Project: Infrastructure and Algorithms for Parallel Eigensolvers
, 1994
Abstract

Cited by 14 (6 self)
The goal of the PRISM project is the development of infrastructure and algorithms for the parallel solution of eigenvalue problems. We are currently investigating a complete eigensolver based on the Invariant Subspace Decomposition Algorithm for dense symmetric matrices (SYISDA). After briefly reviewing SYISDA, we discuss the algorithmic highlights of a distributed-memory implementation of this approach. These include a fast matrix-matrix multiplication algorithm, a new approach to parallel band reduction and tridiagonalization, and a harness for coordinating the divide-and-conquer parallelism in the problem. We also present performance results of these kernels as well as the overall SYISDA implementation on the Intel Touchstone Delta prototype.

1. Introduction
Computation of eigenvalues and eigenvectors is an essential kernel in many applications, and several promising parallel algorithms have been investigated [29, 24, 3, 27, 21]. The work presented in this paper is part of the PRI...
Parallelizing Strassen's Method for Matrix Multiplication on Distributed-Memory MIMD Architectures
 Computers &amp; Mathematics with Applications
, 1994
Abstract

Cited by 7 (0 self)
We present a parallel method for matrix multiplication on distributed-memory MIMD architectures based on Strassen's method. Our timing tests, performed on an Intel Paragon, demonstrate that our method realizes the potential of Strassen's method, with a complexity of 4.7M^2.807 at the system level rather than at the node level on which several earlier works have focused. The parallel efficiency is nearly perfect when the processor number is divisible by 7. The parallelized Strassen's method is always faster than the traditional matrix multiplication methods, whose complexity is 2M^3, coupled with the BMR method and the Ring method at the system level. The speed gain depends on the matrix order M: 20% for M ≈ 1000 and more than 100% for M ≈ 5000.

Key words: matrix multiplication, parallel computation, Strassen's method
AMS (MOS) Subject Classification: 65F30, 65Y05, 68Q25
Submitted to SIAM Journal on Scientific Computing ...
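The exponent in the quoted complexity 4.7M^2.807 follows from Strassen's recurrence: replacing eight half-size products with seven gives T(M) = 7T(M/2) + O(M^2), hence T(M) = O(M^log2(7)). A one-line check of the exponent (illustrative only):

```python
import math

# Strassen replaces 8 half-size multiplies with 7, so the operation
# count obeys T(M) = 7*T(M/2) + O(M^2)  =>  T(M) = O(M^log2(7)).
exponent = math.log2(7)
print(round(exponent, 3))  # 2.807, the exponent in the quoted 4.7*M^2.807
```

Note that measured speed gains (20% at M ≈ 1000, over 100% at M ≈ 5000) track this asymptotic exponent only loosely, since constants, additions, and communication all matter at finite M.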
A High Performance Parallel Strassen Implementation
 Parallel Processing Letters, Vol 6
, 1995
Abstract

Cited by 6 (0 self)
In this paper, we give what we believe to be the first high-performance parallel implementation of Strassen's algorithm for matrix multiplication. We show how, under restricted conditions, this algorithm can be implemented plug-compatible with standard parallel matrix multiplication algorithms. Results obtained on a large Intel Paragon system show a 10-20% reduction in execution time compared to what we believe to be the fastest standard parallel matrix multiplication implementation available at this time.

1 Introduction
In Strassen's algorithm, the total time complexity of the matrix multiplication is reduced by replacing it with smaller matrix multiplications together with a number of matrix additions, thereby reducing the operation count. A net reduction in execution time is attained only if the reduction in multiplications offsets the increase in additions. This requires the matrices to be relatively large before a net gain is observed. The advantage of using parallel architectures ...
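The seven-product recursion behind these Strassen papers can be sketched serially. This is the textbook formulation for power-of-two orders, in pure Python for clarity, not either paper's implementation; a real version would pad odd sizes and cut over to a tuned base-case multiply once the reduced multiplications no longer offset the extra additions (`strassen` and its helpers are illustrative names):

```python
# Textbook Strassen recursion for square matrices of power-of-two order:
# eight half-size products are replaced by seven, at the cost of extra
# matrix additions and subtractions.
def strassen(A, B):
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2

    def quad(M, r, c):                       # extract an h-by-h quadrant
        return [row[c:c + h] for row in M[r:r + h]]

    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

    def sub(X, Y):
        return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

    A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, h), quad(A, h, 0), quad(A, h, h)
    B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, h), quad(B, h, 0), quad(B, h, h)

    # the seven recursive products that replace eight
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))

    # reassemble the quadrants of C
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

Recursing all the way to 1x1 blocks, as here, is what the papers above avoid: they apply only a few Strassen levels and hand the base case to a fast conventional multiply.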
Analysis of a Class of Parallel Matrix Multiplication Algorithms
, 1998
Abstract

Cited by 4 (0 self)
Publications concerning parallel implementation of matrix-matrix multiplication continue to appear with some regularity. It may seem odd that an algorithm that can be expressed as one statement and three nested loops deserves this much attention. This paper provides some insights as to why this problem is complex: practical algorithms that use matrix multiplication tend to use differently shaped matrices, and the shape of the matrices can significantly impact the performance of matrix multiplication. We provide theoretical analysis and experimental results to explain the differences in performance achieved when these algorithms are applied to differently shaped matrices. This analysis sets the stage for hybrid algorithms which choose between the algorithms based on the shapes of the matrices involved. While the paper resolves a number of issues, it concludes with discussion of a number of directions yet to be pursued. ...