34 citations found.
Dongarra, J., Gustavson, F., Karp, A.: Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Review 26 (1984) 91--112

This paper is cited in the following contexts:
Learning from the Success of MPI - Gropp (2001)   (4 citations)  (Correct)

....not achieve perfect performance portability, defined as providing a single source that runs at (near) achievable peak performance on all platforms. This lack is sometimes given as a criticism of MPI, but it is a criticism that most other programming models also share. For example, Dongarra et al. [9] describe six different ways to implement matrix matrix multiply in Fortran for a single processor; not only is no one of the six optimal for all platforms but none of the six are optimal on modern cache based systems. Another example is the very existence of vendor optimized implementations of ....

Dongarra, J., Gustavson, F., Karp, A.: Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Review 26 (1984) 91--112


Recent Developments in Dense Numerical Linear Algebra - Higham (2000)   (Correct)

....the worst case. Vectorized and partitioned algorithms for the standard matrix factorizations are now well understood, and the algorithms in LAPACK represent the state of the art. Three representative references for LU factorization spanning the last 12 years include Dongarra, Gustavson and Karp [47], Ortega [105] and Dongarra, Duff, Sorensen and van der Vorst [52]. For variants of LU factorization, developing or choosing a partitioned algorithm may not be trivial. For example, developing a partitioned version of the block LDL factorization is somewhat complicated for the Bunch Kaufman ....

J. J. Dongarra, F. G. Gustavson, and A. Karp. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Review, 26(1):91--112, 1984.


A Systematic Approach to the Design and Analysis of Linear.. - Gunnels   (Correct)

....loop invariants which, we will see, lead to five different variants for LU factorization. Note that in this paper we will not concern ourselves with the question of whether the above conditions exhaust all possibilities. However, they do give rise to many commonly discussed algorithms. In fact, in [23] six variants, called the ijk orders, of A = LU are listed. The jki form is commonly known as a left looking algorithm while the ikj method is left looking on A . Together, they correspond to the row and column lazy variants discussed in this paper. The kij and kji forms both correspond to what ....

J. J. Dongarra, F. G. Gustavson, and A. Karp. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Review, 26(1):91-112, Jan. 1984.


A Partitioned Skyline LDL^T Factorization - Marques (1993)   (Correct)

....is a direct sum of 1 × 1 and 2 × 2 pivot blocks and L is a lower unit triangular matrix. Basically, L and U are evaluated by three nested loops, whose arrangement (column or row oriented) can strongly influence the computational performance of the process. Robert and Sguazzero [24] and Dongarra et al. [10], for instance, studied different ways of doing that nesting for dense matrices. In any case, the factorization phase is more time consuming than a later evaluation of x. All the same, a partitioning or blocking is usually applied to matrix calculations, so as to profit from the architecture of ....

J. J. Dongarra, F. G. Gustavson, and A. Karp. Implementing Linear Algebra Algorithms for Dense Matrices on a Vector Pipeline Machine. SIAM Review, 12:91-112, 1984.


A Parallel Algorithm for the Sylvester-Observer Equation - Bischof, Datta, al. (1994)   (Correct)

....LQ factorization The partial drawback of this algorithm is that it employs vector vector operations, which in general do not perform well on high performance processors since they require many memory accesses per floating point operation. For a discussion of this issue, see, for example, [3,9,11]. If we allow more workspace per processor, we can partially overcome this drawback and arrive at a variant that computes the forward solve Lz = c with matrix vector operations. In particular, if we allow for b columns of workspace for L, we do not have to update the right hand side y until b ....

J. J. Dongarra, F. G. Gustavson, and A. Karp. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Review, 26:91--112, 1984.


Software for the Scalable Solution of PDEs - Balay, Gropp, McInnes, Smith   (Correct)

....should use standards such as MPI, OpenMP, or HPF to maintain portability. Even with such standards, the much more difficult goal of performance portability (portability without sacrificing performance) can be challenging to achieve, particularly over a wide range of computer architectures [DGK84]. However, the benefits of portability are enormous. Computer performance continues to increase by leaps and bounds; portable applications can quickly take advantage of the fastest computers, independent of any particular vendor. Algorithms. Algorithmic improvements have been at least as ....

J. J. Dongarra, F. G. Gustavson, and A. Karp. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Review, 26(1):91--112, January 1984.


Stability of Methods for Matrix Inversion - Croz, Higham (1992)   (7 citations)  (Correct)

....L^{-1} a column at a time. Analogous i and k methods exist, which compute L^{-1} row wise or use outer products, respectively, and we comment on them at the end of the section. The names i, j and k refer to the outermost loop index, according to the convention introduced by [7], and used in [9] to describe the different possible orderings of the loops. The first method computes each column of X = L^{-1} independently, using conventional forward substitution. We write it as follows, to facilitate comparison with the second method. We use MATLAB style indexing ....

J.J. Dongarra, F.G. Gustavson and A. Karp, Implementing linear algebra algorithms for dense matrices on a vector pipeline machine, SIAM Review, 26 (1984), pp. 91--112.


Numerical Linear Algebra and Computer Architecture: An Evolving.. - Hedayat (1993)   (2 citations)  (Correct)

....survey in [60] and the monograph [46]. Block algorithms derive their efficiency by reuse of data, through reordering of computation. Similar to the matrix multiplication example on the Cyber 205, the effect of loop interchange on the access pattern of LU decomposition was studied in detail in [44] and [113]. The variants came to be known as right looking, left looking, and Crout versions of the algorithm. Typical performance figures and comparisons between the variants are given in [46] and [31] for high performance shared memory computers. An important architecture dependent parameter is ....

.... vector with a multiple of another (SAXPY). With the appearance of the CRAY 1, with only one path between memory and the vector registers, it soon became evident that data reuse in registers was essential to overcome this bottleneck and achieve what Dongarra refers to as super vector performance [44]. Level 2 BLAS, rich in matrix vector operation, were proposed to improve register reuse in such vector machines [42]. Experiences with parallel architectures with a hierarchy of memory indicated that, in order to maximize data reuse, it was necessary to introduce primitive kernels that have a ....

[Article contains additional citation context not shown here]

Dongarra J., Gustavson F., Karp A., Implementing linear algebra algorithms for dense matrices on a vector pipeline machine, SIAM Review,26,1984.


MOB Forms: A Class of Multilevel Block Algorithms for Dense.. - Juan Navarro Toni (1994)   (11 citations)  (Correct)

....paper we deal with the matrix multiplication operation C = C + A × B, where the size of C is i_max × j_max, that of A is i_max × k_max, and that of B is k_max × j_max. We assume that the matrices are large in all dimensions and stored by columns. Consider, for example, the ijk form [DoGK84]:
      DO 10 I = 1, Imax
      DO 10 J = 1, Jmax
      DO 10 K = 1, Kmax
   10 C(I,J) = C(I,J) + A(I,K) * B(K,J)
[Footnote 1: A preliminary presentation of MOB forms has been recently published in [NaJV93].] Figure 1 shows what we call a Data and Computation Diagram (DCD) for this form. In this diagram, the rectangular ....

J. Dongarra, F. Gustavson and A. Karp, Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Rev., 26 (1984), pp. 91--112.
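
As a rough illustration of the ijk form quoted in the context above (a sketch, not code from the cited papers), the following Python function runs the same triple loop C(I,J) = C(I,J) + A(I,K) * B(K,J) under any of the six loop orders. The name `matmul_update` and its `order` parameter are assumptions introduced for this example.

```python
from itertools import permutations


def matmul_update(C, A, B, order="ijk"):
    """Accumulate C = C + A*B in place using the given loop order.

    C is imax x jmax, A is imax x kmax, B is kmax x jmax (lists of rows).
    `order` is a hypothetical parameter: a permutation of "ijk", read from
    the outermost loop to the innermost loop.
    """
    if sorted(order) != ["i", "j", "k"]:
        raise ValueError("order must be a permutation of 'ijk'")
    extent = {"i": len(A), "j": len(B[0]), "k": len(B)}
    outer, middle, inner = order
    for x in range(extent[outer]):
        for y in range(extent[middle]):
            for z in range(extent[inner]):
                # Map the loop counters back to the i, j, k indices.
                idx = {outer: x, middle: y, inner: z}
                i, j, k = idx["i"], idx["j"], idx["k"]
                C[i][j] += A[i][k] * B[k][j]
    return C


if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0]]
    B = [[5.0, 6.0], [7.0, 8.0]]
    for p in permutations("ijk"):
        C = [[0.0, 0.0], [0.0, 0.0]]
        print("".join(p), matmul_update(C, A, B, "".join(p)))
```

All six orders perform the same arithmetic; they differ only in the order in which elements of A, B and C are visited, which is what the cited contexts relate to performance on a given machine.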


High Performance Scalable Matrix Algebra.. - von Laszewski.. (1992)   (Correct)

....is the fundamental problem in parallel computing and has great influence on the performance on such machines. The use of block based algorithms is one of the most efficient ways to improve the performance of numerical algorithms on distributed memory machines. Dongarra, Gustavson, and Karp [4] discussed six ways of implementing the LU factorization obtained by reordering the three nested loops that constitute the algorithm. Algorithm 1 explains the generic Gaussian elimination. The loop indices are i, j and k. Only three of these ways are applicable to the column oriented Fortran. ....

....kji, and jki. For example, the abbreviation jik points out that j is the loop index for the outermost loop and k for the inner most loop (compare to algorithm 1). Since the number of memory touches for the kji noblock algorithm is twice as high as for the jki noblock and the jik noblock algorithm [4], the running time for this algorithm is slower. Experiments show that the jki noblock algorithm performs better than the two others. 3.1 jik Noblock Algorithm Before the algorithm is described in detail it is useful to visualize the data dependencies of the n × n matrix elements between ....

Dongarra, J., Gustavson, F. G., and Karp, A. Implementing Linear Algebra Algorithms for Dense Matrices on a Vector Pipeline Machine. SIAM Review 26, 1 (Jan 1984), pp. 91--112.


Compiler Blockability of Dense Matrix Factorizations - Carr, Lehoucq (1997)   (13 citations)  (Correct)

....factorizations involve on the order of n^3 floating point operations for data that needs n^2 memory locations. With the advent of vector and parallel supercomputers, the efficiency of the factorizations was seen to depend dramatically upon the algorithmic form chosen for the implementation [16, 18, 32]. These studies concluded that managing the memory hierarchy is the single most important factor governing the efficiency of the software implementation computing the factorization. The motivation of the LAPACK [2] project was to recast the algorithms in the EISPACK [35] and LINPACK [14] software ....

J.J. Dongarra, F.G. Gustavson, and A. Karp. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Review, 26(1):91--112, January 1984.


Compiler Blockability of Dense Matrix Factorizations - Carr, Lehoucq (1996)   (13 citations)  (Correct)

....factorizations involve on the order of n^3 floating point operations for data that needs n^2 memory locations. With the advent of vector and parallel supercomputers, the efficiency of the factorizations was seen to depend dramatically upon the algorithmic form chosen for the implementation [14, 16, 29]. These studies concluded that managing the memory hierarchy is the single most important factor governing the efficiency of the software implementation computing the factorization. The motivation of the LAPACK [2] project was to recast the algorithms in the EISPACK [32] and LINPACK [15] software ....

J.J. Dongarra, F.G. Gustavson, and A. Karp. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Review, 26(1):91--112, January 1984.


A Partitioned Skyline LDL^T Factorization - Marques (1993)   (Correct)

....sum of 1 × 1 and 2 × 2 pivot blocks and L is a lower unit triangular matrix. Basically, L and U are evaluated by three nested loops, whose arrangement (column or row oriented) can strongly influence the computational performance of the process. Robert and Sguazzero [23] and Dongarra et al. [9], for instance, studied different ways of doing that nesting for dense matrices. In any case, the factorization phase is more time consuming than a later evaluation of x. All the same, a partitioning or blocking is usually applied to matrix calculations, so as to profit from the architecture of ....

J. J. Dongarra, F. G. Gustavson, and A. Karp. Implementing Linear Algebra Algorithms for Dense Matrices on a Vector Pipeline Machine. SIAM Review, 12:91--112, 1984.


Block Implementations of Symmetric QR and Jacobi Algorithms - Arbenz, Oetti (1992)   (2 citations)  (Correct)

....to avoid unnecessary memory references, as moving data between different levels of memory (registers, cache, main memory) is slow compared to arithmetic operations on the data. It has turned out that on vector processing computers matrix vector operations can be efficiently implemented (see [10]). On parallel shared memory computers with cache memory only matrix matrix operations perform optimally (see [13]). These facts have led to the development of new algorithms and redesign of old ones, which are based on these level 2 (matrix vector) and level 3 (matrix matrix) operations. This ....

....was obtained. These times on the CRAY confirm that algorithms with optimal exploitation of matrix vector operations have an optimal performance on poor vector processing machines without cache memory and that therefore matrix matrix operations can only yield small further improvements (compare [10]). The times of the new routines on four processors show that BLAS 2 and BLAS 3 primitives are well suited to perform in parallel. They are actually the only part of the algorithm executed in parallel. But BTRED is still inferior to STRED. This shows that the CRAY Y MP has a well balanced ....

J. J. Dongarra, F. G. Gustavson, and A. Karp. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Review, 26:91--112, 1984.


Parallelizing Unstructured Sparse Matrix Computations On.. - Venugopal (1993)   (Correct)

....would need to group together computations into a task in such a manner that no two computations in a task access data which lie in different columns of the matrix. Apart from the data access pattern, the size and shape of data partitions also depends on the architecture. Dongarra et al. [16] cite several methods of row oriented and column oriented partitionings for matrix multiplication and LU decomposition, choosing the partitioning depending on the vector machine used. The different sets of BLAS (Basic Linear Algebra Subprograms) routines also give an indication as to how ....

J. J. Dongarra, F. G. Gustavson, and A. Karp. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Review, 26-1:91--112, 1984.


Compiler Blockability of Dense Matrix Factorizations - Steve Carr   (13 citations)  (Correct)

....involve on the order of n^3 floating point operations for data that needs n^2 memory locations. With the advent of vector supercomputers, the efficiency of the factorizations was seen to depend dramatically upon the algorithmic form chosen for the implementation. Dongarra, Gustavson and Karp [15] gave a detailed study of the algorithmic issues involved in constructing an efficient LU factorization on the early CRAY supercomputers. The work of Ortega [27] and Gallivan, Plemmons and Sameh [17] considered both algorithmic and computational issues involved in the efficient implementation of ....

J.J. Dongarra, F.G. Gustavson, and A. Karp. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Review, 26(1):91--112, January 1984.


The Gauss-Huard algorithm and LU factorization - Walter Hoffmann (1998)   (Correct)

....in Gauss Huard elimination. The numerical stability of the Gauss Huard algorithm with row pivoting is proven in [4]. In the current paper we describe a relation of the Gauss Huard algorithm with LU factorization, or more precisely, with one out of essentially six variants of LU factorization [5]. 2 The original Gauss Huard algorithm As we want to demonstrate a theoretical relation between the Gauss Huard algorithm and LU factorization, we will not consider the subject of pivoting to achieve a numerically stable algorithm; for a version of the Gauss Huard algorithm that includes partial ....

....operations in an LU factorization: (2/3) n^3 - (1/2) n^2 + (5/6) n. 3 A relation with LU factorization For a comparison of the Gauss Huard algorithm with LU factorization, we focus on the existence of six variants in the data flow for computing an LU factorization as has been introduced in [5]. These six variants can be discriminated pairwise by one out of three fundamental operations which are at the basis of that specific variant: 1. Dot product of two vectors for the ijk and jik variant. 2. Matrix vector multiplication for the ikj and the jki variant. 3. Rank one matrix update for ....

J.J. Dongarra, F.G. Gustavson, and A. Karp. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Review, 26:91--112, 1984.


Memory Organization and Management for Linear.. - Navarro, Lang.. (1994)   (Correct)

....max, that of A is i_max × k_max, and that of B is k_max × j_max. We assume that each matrix element is a double precision floating point number (8 byte words), the matrices are large in all dimensions, dense and stored by columns, like in Fortran. Consider, for example, the ijk form [DoGK84]:
      DO 10 I = 1, Imax
      DO 10 J = 1, Jmax
      DO 10 K = 1, Kmax
   10 C(I,J) = C(I,J) + A(I,K) * B(K,J)
Figure 1.1 shows the dependence graph in the iteration space for i_max = j_max = k_max = 3. The nodes are placed in a three dimensional space, one dimension for each loop. The node with coordinates (i_p, j_p, k ....

J. Dongarra, F. Gustavson and A. Karp, Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Rev., 26 (1984), pp. 91--112.


Parallel Algorithms in Linear Algebra - Brent (1991)   (2 citations)  (Correct)

....growth of interest in parallel algorithms (including those for linear algebra problems) in recent years, so we can not attempt to be comprehensive. For more detailed discussions and additional references, the reader is referred to surveys such as those by Gallivan et al. [23], Dongarra et al. [16], and Heller [36]. 2. Basic concepts We assume that a parallel machine with P processors is available. Thus P measures the degree of parallelism; P = 1 is just the familiar serial case. When considering the solution of a particular problem, we let TP denote the time required to solve the problem ....

J. J. Dongarra, F. G. Gustavson and A. Karp, "Implementing linear algebra algorithms for dense matrices on a vector pipeline machine", SIAM Review 26 (1984), 91-112.


Orthogonal Reduction On Vector Computers - Mattingly, Meyer, Ortega (1989)   (1 citation)  (Correct)

....REDUCTION ON VECTOR COMPUTERS. R. Bruce Mattingly, Carl D. Meyer, James M. Ortega. 1. INTRODUCTION. This paper concerns the implementation of the QR factorization by Givens and Householder transformations on vector computers. Following the analysis of Dongarra, et al. [1984] for Gaussian elimination, various ijk forms for both Givens and Householder transformations are investigated. Conclusions concerning which of these forms have desirable or undesirable properties for vector computers are presented. These ijk forms utilize only rows or columns as the basic entities ....

....statement a_ij = a_ij - l_ik a_kj in Gaussian elimination where u_ik plays the role of the multiplier l_ik and v_kj plays the role of a_kj. With this interpretation, the rank one update code corresponds as well as could be expected to the kij form of Gaussian elimination as given in Dongarra, et al. [1984]. This is the motivation for the terminology kij form of Householder reduction. By interchanging the indices i and j in the rank one update code, the kji (inner product) form of Householder reduction, given in Figure 2, is obtained. THE kji (INNER PRODUCT) FORM OF HOUSEHOLDER REDUCTION For k = 1 ....

[Article contains additional citation context not shown here]

J. Dongarra, F. Gustavson, and A. Karp [1984], Implementing linear algebra algorithms for dense matrices on a vector pipeline machine, SIAM Rev., Vol 26, pp 91-112.


Sparse Gaussian Elimination on High Performance Computers - Li (1996)   (19 citations)  (Correct)

....nested loops: for ___ do for ___ do for ___ do a_ij = a_ij - (a_ik a_kj) / a_kk; (2.1) end for; end for; end for; The loop indices have variable names i, j, and k, but they will have different ranges. Six possible permutations of i, j and k are possible in the three nested loops. Dongarra et al. [29] studied the performance impact of each permutation for dense LU factorization algorithms on vector pipeline machines. Although the generic algorithm is very simple, significant complications in its actual implementation arise from sparsity, the need for numerical pivoting and diverse computer ....

J. Dongarra, F. Gustavson, and A. Karp. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Review, 26(1):91--112, January 1984.
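
To make the loop permutations mentioned in the context above concrete, here is a minimal Python sketch of two of the six orderings (kij and jki) for dense LU factorization without pivoting. It is an illustration only, not code from the cited papers: the function names `lu_kij` and `lu_jki` are assumptions, the loop ranges (left blank in the excerpt) are the usual textbook ones, and the multiplier a_ik / a_kk is stored once rather than recomputed inside the innermost loop.

```python
def lu_kij(a):
    """Factor a square matrix in place as A = L*U without pivoting (kij order).

    On return the strict lower triangle holds the multipliers of the unit
    lower triangular L and the upper triangle holds U. Assumes every pivot
    a[k][k] is nonzero; production codes add pivoting.
    """
    n = len(a)
    for k in range(n - 1):
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]  # multiplier l_ik = a_ik / a_kk
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return a


def lu_jki(a):
    """Same factorization with the loops reordered (jki, column oriented)."""
    n = len(a)
    for j in range(n):
        # Apply all earlier eliminations to column j.
        for k in range(j):
            for i in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
        # Scale the subdiagonal of column j to form the multipliers.
        for i in range(j + 1, n):
            a[i][j] /= a[j][j]
    return a


if __name__ == "__main__":
    m = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
    print(lu_kij([row[:] for row in m]))
    print(lu_jki([row[:] for row in m]))
```

Both functions perform the same floating point operations; only the order in which matrix entries are touched differs, which is the property the citing papers connect to performance.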


Compiler Blockability of Dense Matrix Factorizations - Carr, Lehoucq (1996)   (13 citations)  (Correct)

....factorizations involve on the order of n^3 floating point operations for data that needs n^2 memory locations. With the advent of vector and parallel supercomputers, the efficiency of the factorizations was seen to depend dramatically upon the algorithmic form chosen for the implementation [14, 16, 30]. These studies concluded that managing the memory hierarchy is the single most important factor governing the efficiency of the software implementation computing the factorization. The motivation of the LAPACK [2] project was to recast the algorithms in the EISPACK [33] and LINPACK [15] software ....

J.J. Dongarra, F.G. Gustavson, and A. Karp. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Review, 26(1):91--112, January 1984.


Mapping Arbitrary Non-Uniform Task Graphs onto Arbitrary.. - Chen, Eshaghian, Wu (1995)   (2 citations)  (Correct)

....the task modules onto the processor with the highest speed to avoid the relatively expensive communication cost. This is shown in figure 14(c). Finally, we give an example of mapping a real application task onto a heterogeneous system. We choose the Gaussian elimination algorithm used in LINPACK [13, 12]. The FORTRAN code is given in figure 15. Suppose it takes 1 unit of time to do an addition or subtraction, and it takes 2 units of time to do a multiplication or division of two real numbers. Also, we assume the communication amount of sending receiving each real number to be 1. A task graph for ....

J. J. Dongarra, F. Gustavson, and A. Karp. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Review, 26(1):91--112, 1984.


Multilevel Blocking and Prefetching for Linear.. - Garcia..   (Correct)

...., the size of A is i_max × k_max, and the size of B is k_max × j_max. We assume that each matrix element is a double precision floating point number (8 bytes), the matrices are large in all dimensions, dense and stored by columns, like in Fortran. Consider, for example, the ijk form [DoGK84]:
      DO 10 I = 1, Imax
      DO 10 J = 1, Jmax
      DO 10 K = 1, Kmax
   10 C(I,J) = C(I,J) + A(I,K) * B(K,J)
Figure 2.1 shows the dependence graph in the iteration space for i_max = j_max = k_max = 3. The nodes are placed in a three dimensional space, one dimension for each loop. The node with coordinates (i_p, j_p, k ....

J. Dongarra, F. Gustavson and A. Karp, Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Rev., 26 (1984), pp. 91--112.


On The Granularity And Clustering Of Directed Acyclic Task.. - Gerasoulis, Yang (1990)   (47 citations)  (Correct)

....which have been studied in the literature for a variety of parallel architectures, e.g. Ortega [25] pp. 88 and 241, Gerasoulis and Nelken [11], Robert, Tourancheau and Villard [27], Moler [24], Geist and Heath [8], Davis [5], Saad [28], Ipsen, Saad and Schultz [29], Dongarra, Gustavson and Karp [6], S. Y. Kung [20] p. 168 and many others. Here we consider the kji form without pivoting for the Gauss Jordan algorithm, e.g. Dongarra, Gustavson and Karp [6], Robert, Tourancheau and Villard [27], with interior loop partitioning, see figure 10. Gauss Jordan kji form for k = 1 : n for j = k+1 ....

....and Villard [27], Moler [24], Geist and Heath [8], Davis [5], Saad [28], Ipsen, Saad and Schultz [29], Dongarra, Gustavson and Karp [6], S. Y. Kung [20] p. 168 and many others. Here we consider the kji form without pivoting for the Gauss Jordan algorithm, e.g. Dongarra, Gustavson and Karp [6], Robert, Tourancheau and Villard [27], with interior loop partitioning, see figure 10. Gauss Jordan kji form:
   for k = 1 : n
     for j = k+1 : n+1
       T_k^j : { a_k,j = a_k,j / a_k,k
                 for i = 1 : n and i ≠ k
                   a_i,j = a_i,j - a_i,k * a_k,j
                 end }
     end
   end
Figure 10: The kji form for Gauss Jordan ....

J. J. Dongarra, F.G. Gustavson, and A. Karp, Implementing Linear Algebra Algorithms for Dense Matrices on a Vector Pipeline Machine, SIAM Review, vol. 26, pp. 91-112, 1984.
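
The kji Gauss Jordan form quoted in the second context above can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the authors' code: the function name `gauss_jordan_kji` is hypothetical, the matrix is taken to be an n x (n+1) augmented system stored as a list of rows, and every pivot is assumed nonzero because the quoted form uses no pivoting.

```python
def gauss_jordan_kji(a):
    """Reduce an n x (n+1) augmented matrix in place (kji order, no pivoting).

    After the call the last column holds the solution of the linear system.
    Columns 0..n-1 are not brought to the identity, because column k is
    never updated once it has served as the pivot column.
    """
    n = len(a)
    for k in range(n):
        for j in range(k + 1, n + 1):
            # Task T_k^j: scale the pivot-row entry, then update column j.
            a[k][j] = a[k][j] / a[k][k]
            for i in range(n):
                if i != k:
                    a[i][j] = a[i][j] - a[i][k] * a[k][j]
    return [row[n] for row in a]


if __name__ == "__main__":
    # Solves x + y = 3, x + 2y = 5.
    print(gauss_jordan_kji([[1.0, 1.0, 3.0], [1.0, 2.0, 5.0]]))
```

Each pair (k, j) touches only column j and the already fixed column k, which is what lets the citing paper treat T_k^j as a separate task.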


Preliminary LAPACK Users' Guide - Anderson, Bai, Bischof, Demmel.. (1991)   Self-citation (Dongarra)   (Correct)

....linear algebra software on these classes of machines. 2.2.1 Vectorization Designing vectorizable algorithms in linear algebra is usually straightforward; indeed for many computations there are several variants, all vectorizable, but with different characteristics in performance (see, for example, [16]). Linear algebra algorithms can come close to the peak performance of many machines principally because peak performance depends on some form of chaining of vector addition and multiplication operations, and this is just what the algorithms require. However, when the algorithms are realized ....

....and the code given above is not the code that is actually used in the LAPACK routine SPOTRF. We mentioned in subsection 2.2.1 that for many linear algebra computations there are several vectorizable variants, often referred to as i, j and k variants, according to a convention introduced in [16] and used in [20]. The same is true of the corresponding block algorithms. It turns out that the j variant that was chosen for LINPACK, and used in the above examples, is not the fastest on many machines, because it is based on solving triangular systems of equations, which can be significantly ....

J. J. Dongarra, F. G. Gustavson, and A. Karp. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Review, 26:91--112, 1984.


Standardized Numerical Linear Algebra Software - Dongarra, Eijkhout   Self-citation (Dongarra)   (Correct)

....not the end of the story, and the code given above is not the code actually used in the LAPACK routine SPOTRF. We mentioned earlier that for many linear algebra computations there are several algorithmic variants, often referred to as i, j, and k variants, according to a convention introduced in [15, 23] and explored further in [53, 54]. The same is true of the corresponding block algorithms. It turns out that the j variant chosen for LINPACK, and used in the above examples, is not the fastest on many machines, because it performs most of the work in solving triangular systems of equations, ....

J.J. Dongarra, F.C. Gustavson, and A. Karp. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Review, 26:91--112, 1984.


Key Concepts for Parallel Out-of-Core LU Factorization - Dongarra, Hammarling, Walker (1996)   (3 citations)  Self-citation (Dongarra)   (Correct)

....(the familiar recursive algorithm) computes a block row and column at each step and uses them to update the trailing submatrix. These variants have been called the i,j,k variants owing to the arrangement of loops in the algorithm. For a more complete discussion of the different variants, see [9, 15]. We now develop these block variants of LU factorization with partial pivoting. Figure 1 (panels: Left looking variant, Right looking variant): Memory access patterns for variants of LU decomposition. The shaded parts indicate the matrix elements accessed in forming ....

J. J. Dongarra, F. G. Gustavson, and A. Karp. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Review, 26:91--112, 1984.


Tools to Aid in the Analysis of Memory Access Patterns for.. - Brewer Dongarra (1988)   (2 citations)  Self-citation (Dongarra)   (Correct)

....a matrix in preparation to solving a system of linear equations via Gaussian elimination. Each method performs the same number of floating point operations; the algorithms differ only in the way in which the data is accessed. The three methods are block jki, block Crout, and block rank update (see [2-4, 6] for more details). When MAPA displays the trace file produced by merging the trace files from the execution of the instrumented versions of the three different programs, we obtain the picture shown in Figure 3. Figure 3. Display of Fortran execution, matrix of order 40, blocksize of 5. The ....

J. J. Dongarra, F. Gustavson, and A. Karp, "Implementing Linear Algebra Algorithms for Dense Matrices on a Vector Pipeline Machine," SIAM Review , vol. 26, 1, pp. 91-112, Jan. 1984.


Key Concepts For Parallel Out-Of-Core LU Factorization - Dongarra, Hammarling, Walker (1996)   (3 citations)  Self-citation (Dongarra)   (Correct)

....(the familiar recursive algorithm) computes a block row and column at each step and uses them to update the trailing submatrix. These variants have been called the i,j,k variants owing to the arrangement of loops in the algorithm. For a more complete discussion of the different variants, see [8, 13]. We now develop these block variants of LU factorization with partial pivoting. 2.1 Right Looking Algorithm Suppose that a partial factorization of A has been obtained so that the first k columns of L and the first k rows of U have been evaluated. Then we may write the partial factorization in ....

J. J. Dongarra, F. G. Gustavson, and A. Karp. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Review, 26:91--112, 1984.


A Parallel-Vector Algorithm for Rapid Structural.. - Storaasli, Nguyen.. (1990)   (2 citations)  (Correct)

No context found.

Dongarra, J. J., Gustafson, F. G., and Karp, A., "Implementing Linear Algebra Algorithms for Dense Matrices on a Vector Pipeline Machine", SIAM Review, Vol. 26, No. 1, January, 1984.


A Parallel Block Implementation of Level 3 BLAS for MIMD.. - Daydé, Duff, Petitet (1993)   (Correct)

No context found.

Dongarra, J.J., Gustavson, F.G., and Karp, A. (1984). Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Rev.(26), 91-112.

CiteSeer.IST - Copyright Penn State and NEC