Results 1  10
of
39
GEMMBased Level 3 BLAS: HighPerformance Model Implementations and Performance Evaluation Benchmark
 ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE
, 1998
"... The level 3 Basic Linear Algebra Subprograms (BLAS) are designed to perform various matrix multiply and triangular system solving computations. Due to the complex hardware organization of advanced computer architectures the development of optimal level 3 BLAS code is costly and time consuming. Howev ..."
Abstract

Cited by 108 (10 self)
 Add to MetaCart
(Show Context)
The level 3 Basic Linear Algebra Subprograms (BLAS) are designed to perform various matrix multiply and triangular system solving computations. Due to the complex hardware organization of advanced computer architectures the development of optimal level 3 BLAS code is costly and time consuming. However, it is possible to develop a portable and highperformance level 3 BLAS library mainly relying on a highly optimized GEMM, the routine for the general matrix multiply and add operation. With suitable partitioning, all the other level 3 BLAS can be defined in terms of GEMM and a small amount of level 1 and level 2 computations. Our contribution is twofold. First, the model implementations in Fortran 77 of the GEMMbased level 3 BLAS are structured to reduced effectively data traffic in a memory hierarchy. Second, the GEMMbased level 3 BLAS performance evaluation benchmark is a tool for evaluating and comparing different implementations of the level 3 BLAS with the GEMMbased model implementations.
Locality Of Reference In Lu Decomposition With Partial Pivoting
 SIAM JOURNAL ON MATRIX ANALYSIS AND APPLICATIONS
, 1997
"... This paper presents a new partitioned algorithm for LU decomposition with partial pivoting. The new algorithm, called the recursively partitioned algorithm, is based on a recursive partitioning of the matrix. The paper analyzes the locality of reference in the new algorithm and the locality of refer ..."
Abstract

Cited by 90 (8 self)
 Add to MetaCart
This paper presents a new partitioned algorithm for LU decomposition with partial pivoting. The new algorithm, called the recursively partitioned algorithm, is based on a recursive partitioning of the matrix. The paper analyzes the locality of reference in the new algorithm and the locality of reference in a known and widely used partitioned algorithm for LU decomposition called the rightlooking algorithm. The analysis reveals that the new algorithm performs a factor of $\Theta(\sqrt{M/n})$ fewer I/O operations (or cache misses) than the rightlooking algorithm, where $n$ is the order of the matrix and $M$ is the size of primary memory. The analysis also determines the optimal block size for the rightlooking algorithm. Experimental comparisons between the new algorithm and the rightlooking algorithm show that an implementation of the new algorithm outperforms a similarly coded rightlooking algorithm on six different RISC architectures, that the new algorithm performs fewer cache misses than any other algorithm tested, and that it benefits more from Strassen's matrixmultiplication algorithm.
Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software
 SIAM REVIEW VOL. 46, NO. 1, PP. 3–45
, 2004
"... Matrix computations are both fundamental and ubiquitous in computational science and its vast application areas. Along with the development of more advanced computer systems with complex memory hierarchies, there is a continuing demand for new algorithms and library software that efficiently utilize ..."
Abstract

Cited by 81 (6 self)
 Add to MetaCart
Matrix computations are both fundamental and ubiquitous in computational science and its vast application areas. Along with the development of more advanced computer systems with complex memory hierarchies, there is a continuing demand for new algorithms and library software that efficiently utilize and adapt to new architecture features. This article reviews and details some of the recent advances made by applying the paradigm of recursion to dense matrix computations on today’s memorytiered computer systems. Recursion allows for efficient utilization of a memory hierarchy and generalizes existing fixed blocking by introducing automatic variable blocking that has the potential of matching every level of a deep memory hierarchy. Novel recursive blocked algorithms offer new ways to compute factorizations such as Cholesky and QR and to solve matrix equations. In fact, the whole gamut of existing dense linear algebra factorization is beginning to be reexamined in view of the recursive paradigm. Use of recursion has led to using new hybrid data structures and optimized superscalar kernels. The results we survey include new algorithms and library software implementations for level 3 kernels, matrix factorizations, and the solution of general systems of linear equations and several common matrix equations. The software implementations we survey are robust and show impressive performance on today’s high performance computing systems.
A ThreeDimensional Approach to Parallel Matrix Multiplication
 IBM Journal of Research and Development
, 1995
"... A threedimensional (3D) matrix multiplication algorithm for massively parallel processing systems is presented. The P processors are configured as a "virtual" processing cube with dimensions p 1 , p 2 , and p 3 proportional to the matrices' dimensionsM , N , and K. Each processor ..."
Abstract

Cited by 52 (0 self)
 Add to MetaCart
(Show Context)
A threedimensional (3D) matrix multiplication algorithm for massively parallel processing systems is presented. The P processors are configured as a "virtual" processing cube with dimensions p 1 , p 2 , and p 3 proportional to the matrices' dimensionsM , N , and K. Each processor performs a single local matrix multiplication of size M=p 1 \Theta N=p 2 \Theta K=p 3 . Before the local computation can be carried out, each subcube must receive a single submatrix of A and B. After the single matrix multiplication has completed, K=p 3 submatrices of this product must be sent to their respective destination processors and then summed together with the resulting matrix C. The 3D parallel matrix multiplication approach has a factor P 1=6 less communication than the 2D parallel algorithms. This algorithm has been implemented on IBM POWERparallel TM SP2 TM systems (up to 216 nodes) and has yielded close to the peak performance of the machine. The algorithm has been combined with Winog...
Implementation of Strassen's Algorithm for Matrix Multiplication
 In Proceedings of Supercomputing '96
, 1996
"... In this paper we report on the development of an efficient and portable implementation of Strassen's matrix multiplication algorithm. Our implementation is designed to be used in place of DGEMM, the Level 3 BLAS matrix multiplication routine. Efficient performance will be obtained for all matri ..."
Abstract

Cited by 42 (0 self)
 Add to MetaCart
In this paper we report on the development of an efficient and portable implementation of Strassen's matrix multiplication algorithm. Our implementation is designed to be used in place of DGEMM, the Level 3 BLAS matrix multiplication routine. Efficient performance will be obtained for all matrix sizes and shapes and the additional memory needed for temporary variables has been minimized. Replacing DGEMM with our routine should provide a significant performance gain for large matrices while providing the same performance for small matrices. We measure performance of our code on the IBM RS/6000, CRAY YMP C90, and CRAY T3D single processor, and offer comparisons to other codes. Our performance data reconfirms that Strassen's algorithm is practical for realistic size matrices. The usefulness of our implementation is demonstrated by replacing DGEMM with our routine in a large application code.
Tuning Strassen's Matrix Multiplication for Memory Efficiency
 IN PROCEEDINGS OF SC98 (CDROM
, 1998
"... Strassen's algorithm for matrix multiplication gains its lower arithmetic complexity at the expense of reduced locality of reference, which makes it challenging to implement the algorithm efficiently on a modern machine with a hierarchical memory system. We report on an implementation of thi ..."
Abstract

Cited by 41 (5 self)
 Add to MetaCart
Strassen's algorithm for matrix multiplication gains its lower arithmetic complexity at the expense of reduced locality of reference, which makes it challenging to implement the algorithm efficiently on a modern machine with a hierarchical memory system. We report on an implementation of this algorithm that uses several unconventional techniques to make the algorithm memoryfriendly. First, the algorithm internally uses a nonstandard array layout known as Morton order that is based on a quadtree decomposition of the matrix. Second, we dynamically select the recursion truncation point to minimize padding without affecting the performance of the algorithm, which we can do by virtue of the cache behavior of the Morton ordering. Each technique is critical for performance, and their combination as done in our code multiplies their effectiveness. Performance comparisons of our implementation with that of competing implementations show that our implementation often outperforms th...
Recursive Array Layouts and Fast Matrix Multiplication
, 1999
"... The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional columnmajor or rowmajor array layouts to incur high variability in memory system performance as matrix size var ..."
Abstract

Cited by 39 (0 self)
 Add to MetaCart
(Show Context)
The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional columnmajor or rowmajor array layouts to incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts to improve performance and reduce variability. Previous work on recursive matrix multiplication is extended to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication, and the more complex algorithms of Strassen and Winograd. While recursive layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.22.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms. For a purely sequential implementation, it is possible to reorder computation to conserve memory space and improve performance between ...