B. Kagstrom, P. Ling, and C. F. Van Loan. High Performance GEMM-Based Level-3 BLAS: Sample Routines for Double Precision Real Data. In High Performance Computing II, Durand M. and El Dabaghi F., eds., pages 269--281, Amsterdam, 1991. North-Holland.
....computations, including matrix multiplications. They showed that blocked algorithms transferred fewer words between fast and slow memory than algorithms that operated by row or by column. High-quality implementations of I/O-efficient matrix multiplication algorithms are widely available and used [2, 1, 5, 7, 11, 14, 17, 15, 16, 18, 19, 26]. The proof of the next theorem is very similar to the proof of Lemma 3.1. Theorem 7.1. Consider the conventional multiplication of two n by n matrices on a computer with a large slow memory and a fast cache that can contain M words. Arithmetic operations can only be performed on words that are in ....
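The blocking idea in the excerpt above can be illustrated with a minimal sketch. It is not taken from any of the cited papers: the function names `blocked_matmul` and `naive_matmul`, the tile size parameter `b`, and the traffic model (each tile of A and B counted as loaded once per tile-level multiply-add) are assumptions made for illustration only.

```python
def naive_matmul(A, B):
    """Reference product of two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def blocked_matmul(A, B, b):
    """Multiply square matrices A and B using b-by-b tiles.

    Returns (C, words_loaded). words_loaded counts the words of A and B
    brought into fast memory under the simplified model that every tile
    is loaded once for each tile-level multiply-add (an assumption, not
    a measurement of real cache behaviour).
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    words_loaded = 0
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            for k0 in range(0, n, b):
                i1, j1, k1 = min(i0 + b, n), min(j0 + b, n), min(k0 + b, n)
                # One tile of A and one tile of B enter fast memory here.
                words_loaded += (i1 - i0) * (k1 - k0) + (k1 - k0) * (j1 - j0)
                for i in range(i0, i1):
                    for k in range(k0, k1):
                        a = A[i][k]
                        for j in range(j0, j1):
                            C[i][j] += a * B[k][j]
    return C, words_loaded
```

Under this model, larger tiles reduce the count of loaded words, which is the qualitative effect the excerpt describes; the product itself is the same for every tile size.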
B. Kagstrom, P. Ling, and C. Van Loan. High performance GEMM-based level-3 BLAS: Sample routines for double precision real data. In M. Durand and F. El Dabaghi, editors, High Performance Computing II, pages 269--281, Amsterdam, 1991. North-Holland.
.... computer architectures, implementations in low-level languages have been made to ensure a desirable efficiency, though improved compiler technologies recently have allowed BLAS for some architectures to be coded in high-level languages, suitably structured, without significant loss of efficiency [3, 7, 10]. We discuss the issues involved in designing local BLAS for distributed memory architectures, programmed in languages with an array syntax. We report on the techniques used and the performance achieved in a subset of .... [Footnote 1: Also affiliated with the Division of Applied Sciences, Harvard University.]
....achieve close to peak performance even on single routines written in a completely architecturally independent way. However, by a suitable partitioning, unrolling, and possibly skewing of loops by the programmer, state-of-the-art compilers for some architectures produce very efficient code [3, 7, 10]. But, on many architectures, assembly-level programming is still required to achieve the desired level of efficiency in using registers, caches, memory and pipelines. The CM 200 belongs to this category of computer systems. In architectures with a single data path to memory, such as the CM 200, ....
Bo Kagstrom, P. Ling, and Charles Van Loan. High performance GEMM--based level--3 BLAS: Sample routines for double precision real data. In M. Durand and F. El Dabaghi, editors, High Performance Computing II, pages 269 -- 281. North Holland, 1991.
....and high performance model implementations of the GEMM-based level 3 BLAS in Fortran 77 and the GEMM-based level 3 BLAS benchmark, which is a tool for performance evaluation of different level 3 BLAS implementations. Some early results from the model implementations have been published in [8, 9]. In this contribution (talk) we will discuss design principles for the model implementations and present new performance results for different architectures (vector as well as RISC-based), including single-processor results for IBM SP2, Intel Paragon, Parsytec GC PowerPlus and Silicon Graphics. ....
....for both general and structured matrix multiplication. Multiple right-hand-side triangular system solving is also handled by the package as it is rich in matrix multiplication if properly organized. We have shown that one can live with just one highly optimized Level 3 BLAS routine: GEMM [13, 8]. This subprogram oversees a general matrix multiply of the form $C \leftarrow \alpha\,\mathrm{op}(A)\,\mathrm{op}(B) + \beta C$, where $\mathrm{op}(X)$ denotes $X$ or $X^T$. The structured matrix multiplication problems handled by the other Level 3 BLAS can be couched in terms of GEMM and a negligible amount of Level 1 and 2 computations, with ....
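The GEMM operation described in the excerpt above, C <- alpha op(A) op(B) + beta C with op(X) equal to X or its transpose, can be written as a short reference sketch. This is a simplified illustration, not the Fortran 77 interface of the cited work: the argument list of the hypothetical `gemm` below omits the dimension and leading-dimension parameters, and it returns a new matrix instead of overwriting C.

```python
def transpose(X):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]


def gemm(transa, transb, alpha, A, B, beta, C):
    """Return alpha * op(A) * op(B) + beta * C as a new matrix.

    transa and transb select op(X): 'N' means X, 'T' means X transposed.
    This simplified signature is an assumption made for illustration.
    """
    opA = transpose(A) if transa == 'T' else A
    opB = transpose(B) if transb == 'T' else B
    m, k, n = len(opA), len(opB), len(opB[0])
    # The shapes must be conformable: op(A) is m x k, op(B) is k x n, C is m x n.
    assert len(opA[0]) == k and len(C) == m and len(C[0]) == n
    return [[alpha * sum(opA[i][p] * opB[p][j] for p in range(k)) + beta * C[i][j]
             for j in range(n)]
            for i in range(m)]
```

For example, `gemm('T', 'N', 1, A, B, 0, C)` forms the product of A transposed with B and ignores the old contents of C.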
[Article contains additional citation context not shown here]
B. Kagstrom, P. Ling, and C. Van Loan. High Performance GEMM--Based Level--3 BLAS: Sample Routines for Double Precision Real Data. In M. Durand and F. El Dabaghi, editors, High Performance Computing II, pages 269--281, Amsterdam, 1991. North--Holland.
....distributed general GEMM operation. A novelty of ddgemm is that it only uses a (square) 2D submesh large enough to hold the different subarrays. This makes it possible to have several distributed GEMM operations going on in the complete 2D mesh, which is of interest, e.g., in a GEMM-based approach [11, 12]. As in the single-node dgemm, C is the only subarray changed, and since copies of subarrays A and B are used in data transfers between nodes, it is possible to use one of the subarrays A and B as C, e.g. A = A B Delta A. The distributed GEMM algorithm consists of the following four major ....
B. Kagstrom, P. Ling, and C. Van Loan. High Performance GEMM--Based Level 3 BLAS: Sample Routines for Double Precision Real Data. In M. Durand and F. El Dabaghi, editors, High Performance Computing II, pages 269--281, Amsterdam, 1991. North--Holland.
.... n, k, alpha, A, lda, B, ldb, beta, C, ldc ); TRMM ( side, uplo, trans, diag, m, n, alpha, A, lda, C, ldc ); TRSM ( side, uplo, trans, diag, m, n, alpha, A, lda, C, ldc ). [Section 3: GEMM-Based Level 3 BLAS Concept] We have shown that one can live with just one highly optimized level 3 BLAS routine: GEMM [21, 17]. This subprogram oversees a general matrix multiply of the form $C \leftarrow \alpha\,\mathrm{op}(A)\,\mathrm{op}(B) + \beta C$, where $\mathrm{op}(X)$ denotes $X$ or $X^T$. The structured matrix multiplication problems handled by the other level 3 BLAS can be couched in terms of GEMM and a small amount of level 1 and 2 computations, with ....
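To make the claim in the excerpt above concrete, the sketch below expresses one structured product, a symmetric matrix multiply C <- alpha A B + beta C in which only the upper triangle of A is read, as a sequence of general matrix multiplies on blocks. It is a hedged illustration and not the authors' routine: the helper `gemm`, the function `symm_via_gemm`, and the block size `nb` are hypothetical names, and expanding each small diagonal block into a full symmetric block is one possible way to handle the structured part.

```python
def gemm(transa, alpha, A, B, beta, C):
    """Return alpha * op(A) * B + beta * C; op(A) is A ('N') or A^T ('T')."""
    opA = [list(col) for col in zip(*A)] if transa == 'T' else A
    return [[alpha * sum(opA[i][p] * B[p][j] for p in range(len(B))) + beta * C[i][j]
             for j in range(len(C[0]))]
            for i in range(len(C))]


def symm_via_gemm(alpha, A, B, beta, C, nb=2):
    """Return alpha*A*B + beta*C for symmetric A, reading only its upper triangle."""
    n = len(A)
    out = []
    for r0 in range(0, n, nb):
        r1 = min(r0 + nb, n)
        # Small diagonal block: expand the stored upper triangle to a full block.
        D = [[A[min(i, j)][max(i, j)] for j in range(r0, r1)] for i in range(r0, r1)]
        rows = gemm('N', alpha, D, B[r0:r1], beta, C[r0:r1])
        if r0 > 0:
            # Blocks left of the diagonal are transposes of stored blocks above it.
            above = [row[r0:r1] for row in A[:r0]]
            rows = gemm('T', alpha, above, B[:r0], 1, rows)
        if r1 < n:
            # Blocks right of the diagonal are stored directly in the upper triangle.
            right = [row[r1:] for row in A[r0:r1]]
            rows = gemm('N', alpha, right, B[r1:], 1, rows)
        out.extend(rows)
    return out
```

All of the arithmetic on off-diagonal blocks goes through `gemm`; only the expansion of the diagonal blocks touches the symmetric structure, which mirrors the division of work the excerpt describes.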
....in 1990 [9, 10]. Some vendors provide highly optimized BLAS for their machines, see for example [2, 1, 16, 4, 24], while others provide optimized versions of some or none of the routines. Vendor-independent groups have also developed tuned level 3 kernels for different machines, for example [23, 17, 13, 6, 14], where some are based on the GEMM-based concept [17, 6, 14]. Today different implementations with different performance characteristics coexist, and it is becoming more important to evaluate different implementations thoroughly. The GEMM-based benchmark measures the performance of an arbitrary set ....
[Article contains additional citation context not shown here]
B. Kagstrom, P. Ling, and C. Van Loan. High Performance GEMM-Based Level 3 BLAS: Sample Routines for Double Precision Real Data. In M. Durand and F. El Dabaghi, editors, High Performance Computing II, pages 269--281, Amsterdam, 1991. North--Holland.
No context found.
B. Kagstrom, P. Ling, and C. F. Van Loan. High Performance GEMM-Based Level-3 BLAS: Sample Routines for Double Precision Real Data. In High Performance Computing II, Durand M. and El Dabaghi F., eds., pages 269--281, Amsterdam, 1991. North-Holland.