| J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. Society for Industrial and Applied Mathematics, 1991. |
....of the matrix vector product. 4.3.2 Inner product. The Cray T3D computer is composed of pipelined RISC processors. It is well known that on such a processor the total time of a vector operation consists of a start-up time and the time to get one result multiplied by the length of the vector ([7] p. 56). The start-up time is denoted by t_is, and f_im is the maximum flop rate once the routine runs. Using this notation, the flop rate of the inner product, f_i(n), is given in terms of t_is and f_im (13). We give this formula only for the inner product. However also for the other operations the flop rate ....
....segment is used. The start-up time, transmission time and latency are obtained from the measurements of the inner product. It appears from Table 10 that the inner product behaviour does not match model (13) for vectors with fewer than 64 elements. This is possibly caused by strip mining effects ([7] p. 11). Applying the model to the remaining data and assuming a maximum flop rate of f_im = 33.4 Megaflops per second, we find a start-up time of approximately t_is = 12seconds for Table 9 and t_is = 7seconds for Table 10. We estimate the communication latency t_l from Tables 9 and 10. The inner ....
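The start-up model quoted in this context can be illustrated with a small sketch. It assumes, as one reading of the description and not as the cited paper's exact formula (13), that a length-n inner product performs 2n flops and takes the start-up time plus 2n divided by the maximum flop rate. The function names and the sample values in the usage note are hypothetical.

```python
def inner_product_flop_rate(n, t_startup, f_max):
    """Flop rate of a length-n inner product under a start-up time model.

    Assumption: 2*n flops in total, executed in t_startup + 2*n / f_max seconds.
    """
    flops = 2.0 * n
    total_time = t_startup + flops / f_max
    return flops / total_time


def half_performance_length(t_startup, f_max):
    """Vector length at which the modelled rate reaches half of f_max."""
    # Solving 2n / (t_startup + 2n / f_max) = f_max / 2 gives n = t_startup * f_max / 2.
    return t_startup * f_max / 2.0
```

With a hypothetical start-up time of 1e-5 seconds and a hypothetical peak of 1e8 flops per second, the modelled rate is half the peak at n = 500 and approaches the peak only for much longer vectors, which is why short vectors are dominated by start-up cost.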
J.J. Dongarra, I.S. Duff, D.C. Sorensen, and H.A. van der Vorst. Solving linear systems on vector and shared memory computers. SIAM, Philadelphia, 1991.
....Typical applications associate values to edges or nodes or both, and update the values of the nodes with a function of the values of their neighbors. It is well known that the performance of these operations in a sequential computer depends heavily on the internal representation of these grids [2]. For this reason, representations that lay out contiguously in memory all the edges associated with a particular node are commonly used. Figure 2 shows a contiguous representation of the grid in Figure 1. The advantage of .... This research was supported in part by the Advanced Research Projects ....
.... [Figure 4: Using spatial regularity. Recoverable panel labels: TRACE GENERATION, TRANSLATION TABLE, GENERATE SCHEDULE, GENERATE NEW TRACE, EXECUTOR; CHAOS, PARTI, PILAR.] CPIR is a C++ abstract base class that unifies the interface of our ....
[Article contains additional citation context not shown here]
J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst, Solving linear systems on vector and shared memory computers. SIAM, 1991.
....with constant absolute deviations are being graphed. These figures are not meant to show speedups over a well coded single processor version of the algorithm but only to show the relation of the different coherence methods. The LU decomposition is a partial pivot blocked right looking algorithm [7]. The blocking factor for all tests was 10. That is close to optimal for all measured test cases. The no coherence case produces numerous floating point exceptions due to both division by zero and overflow errors. Trapping these exceptions skews the results. To avoid this problem, all test cases ....
J. J. Dongarra, I. S. Duff, D.C. Sorensen, and H. A. van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. Society for Industrial and Applied Mathematics, 1991.
....This is due to the reduced bandwidth needed for the BBCS scheme and the reduction of overhead costs induced by short vectors. However, besides direct SMVM, often the transposed SMVM of a matrix is needed in applications. Such an application is the Bi Conjugate Gradient Iterative Solving Algorithm [4] that requires one normal and one transposed matrix vector multiplication at each iteration. This is presented in literature [4] as a disadvantage of the algorithm. This relates to the fact that storage formats such as CRS and JD either perform worse for the Transpose SMVM or it is not possible at ....
....besides direct SMVM, often the transposed SMVM of a matrix is needed in applications. Such an application is the Bi Conjugate Gradient Iterative Solving Algorithm [4] that requires one normal and one transposed matrix vector multiplication at each iteration. This is presented in literature [4] as a disadvantage of the algorithm. This relates to the fact that storage formats such as CRS and JD either perform worse for the Transpose SMVM or it is not possible at all to perform the transposed SMVM (e.g. JD for vector code). In the latter case we need to use a second copy of the matrix ....
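The difference between the normal and the transposed product in compressed row storage (CRS) can be seen in a short sketch. This is an illustrative pure-Python version, not the code of the citing paper; the function and argument names are hypothetical. The normal product gathers from x and accumulates one result per row, while the transposed product has to scatter into y through the column indices, which is the access pattern that tends to perform worse on vector hardware.

```python
def crs_matvec(values, col_idx, row_ptr, x):
    """Compute y = A x for A held in compressed row storage."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]  # indirect read (gather) from x
        y[i] = acc
    return y


def crs_matvec_transposed(values, col_idx, row_ptr, x, n_cols):
    """Compute y = A^T x from the same CRS arrays, without forming A^T."""
    y = [0.0] * n_cols
    for i in range(len(row_ptr) - 1):
        xi = x[i]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[col_idx[k]] += values[k] * xi  # indirect write (scatter) into y
    return y
```

For the 3 x 3 matrix with rows (1, 0, 2), (0, 3, 0), (4, 0, 5), the CRS arrays are values = [1, 2, 3, 4, 5], col_idx = [0, 2, 1, 0, 2] and row_ptr = [0, 2, 3, 5]; both products can then be formed from this single copy of the matrix.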
[Article contains additional citation context not shown here]
J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. Van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. SIAM Publications, Philadelphia, PA, 1990.
.... of a matrix is given by a triple of nonnegative integers: the number of eigenvalues less than, equal to, and greater than zero [15, 33]. Therefore, the number of eigenvalues in [ξ_L, ξ_R] is simply given by the difference ndR - ndL, where ndR and ndL are the number of negative 1 x 1 plus 2 x 2 pivots in D [11], for σ = ξ_R and σ = ξ_L, respectively. Such a test, also referred to as Sturm sequence check [3], allows as well the location in the spectrum of any pair (λ_k, x_k). 3.4 Spectrum Slicing and Computational Interval. A spectrum slicing strategy is useful when many eigensolutions are required, the eigenvalue ....
J. J. Dongarra, I. S. Duff, D.C. Sorensen, and H. A. Van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia, USA, 1991.
....unstructured grid and its contiguous layout edges or nodes or both and update the values of the nodes with a function of the values of their neighbors. It is well known that the performance of these operations in a sequential computer depends heavily on the internal representation of these grids [19]. For this reason, representations that lay out contiguously in memory all the edges associated with a particular node are commonly used. Figure 3.1 shows an unstructured grid and its contiguous representation. The advantage of this layout comes into play when the iterative algorithm traverses all ....
....is mainly due to improved single node performance. The reason is that enumeration adds an extra level of indirection for every memory access to obtain local addresses and, in particular, on machines with fast floating point hardware like the SP 1, indirection degrades performance significantly [19]. On the other hand, after conventional optimizations performed by the sequential compiler, the single node performance of the interval-based program is roughly the same as the original code. This is due to the local conversion transformation in the presence of intervals described in Section 4.8 ....
Jack J. Dongarra, Iain S. Duff, Danny C. Sorensen, and Henk A. van der Vorst. Solving linear systems on vector and shared memory computers. SIAM, 1991.
....For performance reasons, it is often necessary to formulate the algorithm as a blocked algorithm as illustrated in Fig. 2. The performance benefit comes from the fact that the algorithm is rich in matrix multiplication which allows processors with multi level memories to achieve high performance [7, 2, 8, 5]. Note 7 The algorithm in Fig. 2 is implemented by the more traditional MATLAB code given in Fig. 3. We claim that the introduction of indices to explicitly indicate the regions involved in the update complicates readability and reduces confidence in the correctness of the MATLAB implementation. ....
J. J. Dongarra, I. S. Duff, D. C. Sorensen, and Henk A. van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia, PA, 1991.
....[10] The LINPACK benchmark measures the performance attained by a given architecture when solving a linear system of equations in 64 bit arithmetic via an LU factorization with partial pivoting. High-performance implementations cast the LU factorization in terms of matrix multiplication [9]. The HPL implementation of this benchmark was used for this experiment [25]. Our approach to implementing dgemm was used to benchmark the 300 compute node, dual Pentium (R) 4 processor (2.4 GHz) based, cluster at the Center for Computational Research at the University at Buffalo, SUNY. This ....
Jack J. Dongarra, Iain S. Duff, Danny C. Sorensen, and Henk A. van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia, PA, 1991.
....2g. For performance reasons it is often necessary to formulate the algorithm as a blocked algorithm as illustrated in Fig. 2. The performance benefit comes from the fact that the algorithm is rich in matrix multiplication which allows processors with multi level memories to achieve high performance [10, 3, 12, 8]. Note 7 The algorithm in Fig. 2 is implemented by the Matlab code given in Fig. 3. We would like to claim that the introduction of indices to explicitly indicate the regions involved in the update complicates readability and reduces confidence in the correctness of the Matlab implementation. ....
Jack J. Dongarra, Iain S. Duff, Danny C. Sorensen, and Henk A. van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia, PA, 1991.
....in a distributed environment. Of course, all the algorithms we tried in this paper have their parallel equivalents, with some being easy to implement (e.g. the iterative method) and others (e.g. Crout factorization) requiring a significant amount of rework from their sequential counterparts ([3]). Using a network of workstations as a data paging farm is a promising notion. The mobile agent based approach borrows this idea, but it distinguishes itself from the one presented in [4] in two ways. First, we do not move all data to one single machine; rather we move computation to data, which ....
Jack J. Dongarra, Iain S. Duff, Danny C. Sorensen, and Henk A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania, 1991.
....different computations than the traditional algorithm, they differ formally only by an underlying change of variables. 1 Introduction Today high performance computers are used heavily to solve large linear systems, and they run variants of Gaussian Elimination and LU decomposition constantly [2, 8]. Gaussian Elimination is a prominent topic in every text on computational linear algebra. A comprehensive overview of modern numerical software references up to the late 1980s can be found in Chapter 3 of [5]. Gaussian Elimination is often presented as the for-loop program shown in Figure ....
J. Dongarra, I.S. Duff, D.C. Sorensen, H.A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, Philadelphia, PA: SIAM Press, 1991.
....IRI 8917907. Pivoting introduces complexity into the otherwise straightforward nested loop Gaussian elimination computation. The problem of managing this complexity without loss of performance has resulted in hundreds of different implementations for high performance machines. Dongarra et al. [3] survey a variety of techniques for squeezing additional performance out of Gaussian elimination on vector and parallel architectures. Robert [21] reviews many variations of the Gaussian elimination algorithm that have evolved to exploit features of particular parallel machine architectures. The ....
J. Dongarra, I.S. Duff, D.C. Sorensen, H.A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, Philadelphia, PA: SIAM Press, 1991.
....The characterizations also help in better understanding the effects of roundoff error and pivoting. 1 Introduction Today high performance computers are often used for little more than solving large linear systems, and they run variants of Gaussian Elimination and LU decomposition constantly [8, 31]. It is a prominent topic in every text on computational linear algebra. A comprehensive overview of modern numerical software references up to the late 1980s can be found in Chapter 3 of [14]. In its basic form, Gaussian Elimination is a sequence of transformations to an n × n square ....
J. Dongarra, I.S. Duff, D.C. Sorensen, H.A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, Philadelphia, PA: SIAM Press, 1991.
.... such that I( A is close to the identity matrix I [5]. Since the operations in the basic CG method consist of vector updates, inner products, and sparse matrix vector multiplies, efficient parallel versions of the algorithm have been demonstrated on many vector machines and MIMD multiprocessors [4, 6]. Preconditioned CG methods, however, have not enjoyed the same success. In many of the most popular preconditioning techniques, the preconditioner steps involve recurrence relations which do not vectorize or parallelize easily. Algorithmic solutions have been proposed which use different ....
....axis in the solution space along which to partition work for the different processors and simultaneously avoid heavy dependencies. The difficulty in performing computations involving recurrence relations is well known. Table 2 shows performance numbers on some vector computers as presented in [4]. The first column of numbers shows the absolute peak floating point performance of the machine. The remaining columns give the maximum performance in MFlops on each of the four vector operations that appear in .... [Figure 2: Coarse grain parallel ....]
Jack J. Dongarra, Iain S. Duff, Danny C. Sorensen, and Henk A. van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia. 1991.
....after which the 11 . 02 after which we must solve the triangular system L_12, overwriting the original A_12. These steps and the preceding discussion lead one directly to the algorithm in 2(c). 3.5 Column Lazy Algorithm. This algorithm is referred to as a left looking algorithm in [10] while Stewart [27] calls it Pickett's charge east. Let us assume that only (2) and (4) have been satisfied. Now it suffices to compute U_11, and L_21. Using the same techniques as before derives the algorithm in Fig. 2 (d). Again, this algorithm overwrites given n × n matrix A with its LU ....
Jack J. Dongarra, Iain S. Duff, Danny C. Sorensen, and Henk A. van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia, PA, 1991.
....in Fig. 2.3(c). The proof of the following theorem is similar to that of Theorem 1. Theorem 3 The algorithm in Fig. 2.3(c) overwrites a given nonsingular n × n matrix, A, with its LU factorization. 2.4.5 Column lazy algorithm. This algorithm is referred to as a left looking algorithm in [27] while Stewart [71] calls it Pickett's charge east. Let us assume that only (2.1) and (2.3) have been satisfied. Now it suffices to compute U_21. Using the same techniques as before one derives the algorithm in Fig. 2.3 (d). Again, this algorithm overwrites the given nonsingular n × n matrix, ....
Jack J. Dongarra, Iain S. Duff, Danny C. Sorensen, and Henk A. van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia, PA, 1991.
....(a part of the NASA High Performance Computing and Communications Program funded by the NASA Office of Space Science. 1 Introduction The high performance implementation of many linear algebra operations depends on the ability to cast most of the computation in terms of matrix matrix multiplication [2, 3, 6, 12]. High performance for matrix matrix multiplication itself results from the fact that, for this operation, the cost of moving b × b blocks of the operands between the layers of the memory hierarchy is proportional to b^2, which can be amortized over O(b^3) computations. These observations impact ....
Jack J. Dongarra, Iain S. Duff, Danny C. Sorensen, and Henk A. van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia, PA, 1991.
....A to a banded matrix with bandwidth 2k + 1. Such a matrix can also be viewed as a block tridiagonal matrix with k × k blocks. The complexity of obtaining such a form is comparable to that of obtaining a scalar tridiagonal form, but it can be computed more efficiently on parallel architectures [6]. Moreover, if k 2 # p, then steps (6.5) and (6.8) both still require O(np^2) flops. In many applications (e.g. PDEs) the matrix A has a special sparsity pattern that can also be exploited. For example, one often encounters matrices A that already have a banded form and therefore do not ....
J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, SIAM, Philadelphia, 1991.
.... of a matrix is given by a triple of nonnegative integers: the number of eigenvalues less than, equal to, and greater than zero [15, 33]. Therefore, the number of eigenvalues in [ξ_L, ξ_R] is simply given by the difference ndR - ndL, where ndR and ndL are the number of negative 1 × 1 plus 2 × 2 pivots in D [11], for σ = ξ_R and σ = ξ_L, respectively. Such a test, also referred to as Sturm sequence check [3], allows as well the location in the spectrum of any pair (λ_k, x_k). 3.4 Spectrum Slicing and Computational Interval. A spectrum slicing strategy is useful when many eigensolutions are ....
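The counting argument in this context can be sketched for the simplest case. The example below is a simplification under stated assumptions: it handles a standard (not generalized) symmetric tridiagonal eigenproblem and uses only 1 × 1 pivots, whereas the quoted text refers to a factorization with 1 × 1 and 2 × 2 pivots. The function names and the `tiny` safeguard for zero pivots are hypothetical choices made for illustration.

```python
def count_eigenvalues_below(diag, off, shift, tiny=1e-30):
    """Count eigenvalues of a symmetric tridiagonal matrix that lie below shift.

    diag holds the n diagonal entries and off the n-1 off-diagonal entries.
    The count is the number of negative pivots in the LDL^T factorization
    of T - shift*I (Sylvester's law of inertia).
    """
    count = 0
    pivot = 1.0  # overwritten at i = 0; only read for i > 0
    for i in range(len(diag)):
        previous = pivot
        pivot = diag[i] - shift
        if i > 0:
            pivot -= off[i - 1] ** 2 / previous
        if pivot == 0.0:
            pivot = tiny  # nudge an exactly zero pivot so the recurrence can continue
        if pivot < 0.0:
            count += 1
    return count


def count_eigenvalues_in_interval(diag, off, left, right):
    """Number of eigenvalues in the interval, as a difference of two counts."""
    return (count_eigenvalues_below(diag, off, right)
            - count_eigenvalues_below(diag, off, left))
```

For the 3 × 3 matrix with 2 on the diagonal and -1 on the off-diagonals, whose eigenvalues are 2 - sqrt(2), 2 and 2 + sqrt(2), two factorizations at shifts 0.5 and 2.5 are enough to establish that exactly two eigenvalues lie between them, without computing any eigenvalue.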
J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. Van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia, USA, 1991.
....and an efficient and highly accurate condition number estimator. In the design of this parallel dense matrix equation solver, the following issues were taken into account. Memory on each node: the amount of memory and its organization on a processor must be taken into account to achieve high performance [11,12]. Data decomposition resulting from the integral equation formulation: the algorithm which calculates the entries may lead to a natural decomposition of the matrix among the processors. Data decomposition that leads to an efficient factorization: the factorization algorithm may require a specific ....
....near the processor's peak rate. One reason is that the implementation is rich in vector operations, which makes access to main memory a bottleneck. In essence, O(n) operations are performed on data, and the processor cache, which allows high data access rates, cannot be utilized optimally [12]. [14] corrects this by reformulating the factorization in terms of operations. Partition A, L and U as follows .... This leads to the equations (14)-(17). We see that the LU factorization can be by overwriting a panel of width k with its LU factorization (15) by solving the ....
J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst, Solving linear systems on vector and shared memory computers, SIAM, Philadelphia, 1991.
....and a symmetric minimum degree (MMD) ordering, while the latter uses a multifrontal Gauss elimination and a combination of MMD and the Markowitz ordering strategy. Ordering schemes aim at reducing the fill-in and the number of operations to be performed during the factorization process, see [8, 9, 13] for details. By choosing appropriate values for , A becomes positive definite. Therefore, pivoting to assure numerical stability is not needed and the factorization can be performed more efficiently. Pivoting in MA27 can be suppressed by setting the variable U in common block MA27D to zero after ....
J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. Van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia, USA, 1991.
....of code that use the Array package. • Array inquiry methods: this group contains methods last, size, rank and shape. They are fast, descriptor only operations which return information about the (invariant) properties of the Array. • BLAS routines: the Basic Linear Algebra Subprograms (BLAS) [12] are the building blocks of efficient linear algebra codes. BLAS implements a variety of elementary vector vector (level 1), matrix-vector (level 2) and matrix matrix (level 3) operations. We have designed a BLAS class as part of the Array package. It provides the functionality of the BLAS ....
J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. Society for Industrial and Applied Mathematics, 1991.
....last years there has been a considerable interest for the linear system iterative solvers; they allow efficient calculations with implicit unfactored schemes yielding to better convergence and stability property. There is a large literature in the area of iterative solutions of linear systems [4, 5]. Although the Conjugate Gradient method for the solution of a symmetric positive definite system is well established, the methods in use for solving non symmetric systems are still evolving. The Bi Conjugate Gradient Stabilized (Bi-CGSTAB) [6] technique is an efficient method for solving non ....
....a_ij is equal to zero, the corresponding l_ij element (or u_ij element) is also set equal to zero. All the elements of the main diagonal of U are set equal to unity. Incomplete factorization results in an efficient algorithm on a scalar machine, but it is not easily vectorizable or parallelizable [4]. 4. Parallel Implementations. In recent years the use of MIMD parallel computers for solving CFD problems has strongly increased: it is evident that the computing resources that the new parallel architectures offer are the only ones that, in the future, should satisfy the computing power demand. ....
Dongarra, J., Duff, I., Sorensen, D., Van der Vorst, H. (1991) Solving Linear Systems on Vector and Shared Memory Computers, SIAM, Philadelphia.
....in a distributed environment. Of course, all the algorithms we tried in this paper have their parallel equivalents, with some being easy to implement (e.g. the iterative method) and others (e.g. Crout factorization) requiring a significant amount of rework from their sequential counterparts ([3]). Using a network of workstations as a data paging farm is a promising notion. The mobile agent based approach borrows this idea, but it distinguishes itself from the one presented in [4] in two ways. First, we do not move all data to one single machine; rather we move computation to data, which ....
Jack J. Dongarra, Iain S. Duff, Danny C. Sorensen, and Henk A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania, 1991.
....reduce A to a banded matrix with bandwidth 2k + 1. Such a matrix can also be viewed as a block tridiagonal matrix with k × k blocks. The complexity of obtaining such a form is comparable to that of obtaining a scalar tridiagonal form but it can be computed more efficiently on parallel architectures [6]. Moreover, if k 2 p then both steps (6.5) and (6.8) still require O(np^2) flops. In many applications (e.g. PDEs) the matrix A has a special sparsity pattern that can also be exploited. One e.g. often encounters matrices A that have already a banded form and therefore do not need a ....
J. Dongarra, I. Duff, D. Sorensen, and H. van der Vorst, Solving linear systems on vector and shared memory computers, Society for Industrial and Applied Mathematics, Philadelphia, 1991.
....its resultant fill in is prohibitive. Further details of iterative methods can be found in Young (1971) and Varga (1962), while sparse matrix research was reviewed in Duff (1977). Blackford et al. (1997) discusses the LAPACK library modifications to parallel processing for all types of matrices. Dongarra et al. (1990) discusses the impact of parallelization on matrix computations. 8.4.1 Successive iteration (linear). Suppose A and y are given and a guess at the solution x, say x_1, is known. Then if x = A^-1 y, the iteration x_{i+1} = G_i x_i + r_i, i = 1, 2, ..., is proposed where G_i is a matrix ....
Dongarra J.J., Duff I. and Sorensen D., Solving Linear Systems on Vector and Shared Memory Computers, SIAM, Philadelphia, 1990.
....to a banded matrix with bandwidth 2k + 1. Such a matrix can also be viewed as a block tridiagonal matrix with k × k blocks. The complexity of obtaining such a form is comparable to that of obtaining a scalar tridiagonal form but it can be computed more efficiently on parallel architectures [DDSvdV91]. Moreover, if k 2 p then both steps (21) and (24) still require O(np^2) flops. In many applications (e.g. PDEs) the matrix A has a special sparsity pattern that can also be exploited. One e.g. often encounters matrices A that have already a banded form and therefore do not need a ....
J. Dongarra, I. Duff, D. Sorensen, and H. van der Vorst, Solving linear systems on vector and shared memory computers, SIAM, 1991.
....methods. Besides these historical references, there are several more easily accessible references which provide an introduction to this class of methods. Among these are the following books: Axelsson and Barker [8], Ortega [83], Golub and Van Loan [62], Dongarra, Duff, Sorensen and Van der Vorst [50], Pommerell [85], Barrett et al. [16], Hackbusch [65] and Axelsson [7]. The following papers provide brief reviews: Axelsson [3, 4], Beauwens [21] and Il'in [67]. For more recent research, see the collections of papers in [61, 66]. The paper by Demmel, Heath and van der Vorst [43] contains a recent ....
....when the grid is stored as a 2D array. The use of hyperplane ordering has been investigated by Radicati and Vitaletti [86] for the IBM 3090. Numerical experiments on the CM2 can be found in Berryman et al. [25], Chan, Kuo and Tong [32] and Tong [91]. Results for the Alliant FX can be found in [50]. The performance on the CM2 is very poor for 2D problems. The hyperplane ordering for 3D has been described in detail for vector computers in van der Vorst [98]. Results which show very high efficiency for the Cyber 205 (a notorious machine for indirect addressing) are reported in Schlichting and ....
J.J. Dongarra, I.S. Duff, D.C. Sorensen, and Henk A. van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia, PA, 1991.
....strongly biased, in favor of direct methods, situation one still has a substantial gain with the MRAI approach where the work per step is just O(N) Similar conclusions, although less pronounced, can be made for the 2D case. Besides, direct sparse methods are much more difficult to parallelize [9], so that the picture will be even less favorable for them on a parallel computer. Furthermore, we note that the above mentioned RKC and VODPK codes virtually possess high parallelism too. However, the RKC code is in general less attractive since it does not work well for Jacobians with complex ....
J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia, PA, 1991.
....representative data. 2 A Look at Parallel Processing While collecting the data presented in Table 1, we were able to experiment with parallel processing on a number of computer systems. For these experiments, we used either the standard LINPACK algorithm or an algorithm based on matrix matrix [2] techniques. In the case of the LINPACK algorithm, the loop around the SAXPY can be performed in parallel. In the matrix matrix implementation the matrix product can be split into submatrices and performed in parallel. In either case, the parallelism follows a simple fork and join model where each ....
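The remark that the loop around the SAXPY can run in parallel can be made concrete with a column-oriented elimination sketch. This is an illustrative pure-Python version written for this listing, not the LINPACK source; the names `saxpy` and `lu_column_saxpy`, the list-of-columns storage and the row-swap convention (whole rows are exchanged, so the result satisfies P A = L U) are assumptions. At step k, each trailing column receives one SAXPY update that depends only on column k, so the updates for different columns are independent of each other.

```python
def saxpy(alpha, x, y, start):
    """Update y[i] += alpha * x[i] for i >= start (the Level 1 BLAS SAXPY pattern)."""
    for i in range(start, len(y)):
        y[i] += alpha * x[i]


def lu_column_saxpy(cols):
    """Factor a square matrix in place by Gaussian elimination with partial pivoting.

    cols is a list of columns. On return, the multipliers of the unit lower
    triangular factor sit below the diagonal and the upper triangular factor
    on and above it. Returns the pivot row chosen at each step.
    """
    n = len(cols)
    pivots = []
    for k in range(n):
        col_k = cols[k]
        p = max(range(k, n), key=lambda i: abs(col_k[i]))
        pivots.append(p)
        if col_k[p] == 0.0:
            raise ZeroDivisionError("matrix is singular")
        for col in cols:  # exchange rows k and p in every column
            col[k], col[p] = col[p], col[k]
        for i in range(k + 1, n):  # store the multipliers in column k
            col_k[i] /= col_k[k]
        # Each iteration of this loop touches only column j, so the
        # iterations could be distributed over processors (fork and join).
        for j in range(k + 1, n):
            saxpy(-cols[j][k], col_k, cols[j], k + 1)
    return pivots
```

For the matrix with rows (2, 1) and (4, 3), stored as the columns [2, 4] and [1, 3], the routine swaps the two rows, stores the multiplier 0.5 and leaves the upper triangular factor with rows (4, 3) and (0, -0.5).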
J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. Van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. SIAM Publications, Philadelphia, PA, 1990.
....the level 1 and 2 BLAS, as shown in Table 2.1. Hence on machines with a hierarchical memory (e.g. main memory, cache memory, vector registers) the level 3 BLAS involve less data movement per floating point operation, leading to faster execution. For a more detailed explanation see, for example, [52], [65] or [69, Ch. 1]. Work is ongoing to extend the BLAS standards to support parallelism and sparsity; see [48]. It is important to realise that the BLAS comprise subprogram specifications only; there is freedom in the method used to match the specifications. This freedom is most relevant in ....
....matrix factorizations are now well understood, and the algorithms in LAPACK represent the state of the art. Three representative references for LU factorization spanning the last 12 years include Dongarra, Gustavson and Karp [47], Ortega [105] and Dongarra, Duff, Sorensen and van der Vorst [52]. For variants of LU factorization, developing or choosing a partitioned algorithm may not be trivial. For example, developing a partitioned version of the block LDL factorization is somewhat complicated for the Bunch Kaufman partial pivoting strategy [4], [52], [94], [97]. For matrix ....
[Article contains additional citation context not shown here]
Jack J. Dongarra, Iain S. Duff, Danny C. Sorensen, and Henk A. van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. 1991. x+256 pp. ISBN 0-89871-270-X.
....representative data. 2 A Look at Parallel Processing While collecting the data presented in Table 1, we were able to experiment with parallel processing on a number of computer systems. For these experiments, we used either the standard LINPACK algorithm or an algorithm based on matrix matrix [2] techniques. In the case of the LINPACK algorithm, the loop around the SAXPY can be performed in parallel. In the matrix matrix implementation the matrix product can be split into submatrices and performed in parallel. In either case, the parallelism follows a simple fork and join model where each ....
J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. Van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. SIAM Publications, Philadelphia, PA, 1990.
....representative data. 2 A Look at Parallel Processing While collecting the data presented in Table 1, we were able to experiment with parallel processing on a number of computer systems. For these experiments, we used either the standard LINPACK algorithm or an algorithm based on matrix matrix [2] techniques. In the case of the LINPACK algorithm, the loop around the SAXPY can be performed in parallel. In the matrix matrix implementation the matrix product can be split into submatrices and performed in parallel. In either case, the parallelism follows a simple fork and join model where each ....
J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. Van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. SIAM Publications, Philadelphia, PA, 1990.
No context found.
J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. Society for Industrial and Applied Mathematics, 1991.
No context found.
Dongarra, J.J., Duff, I.S., Sorensen, D.C., Van der Vorst, H.A. Solving Linear Systems on Vector and Shared Memory Computers. SIAM Publications, 1991.
No context found.
J.J. Dongarra, I.S. Duff, D.C. Sorensen and H.A. van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. Society for Industrial and Applied Mathematics (SIAM), 1991.
No context found.
Dongarra J., Duff I., Sorensen D., Van Der Vorst H., "Solving Linear Systems on Vector and Shared Memory Computers", SIAM, 1991
No context found.
J. Dongarra, I. Duff, D. Sorensen, and H. van der Vorst, Solving Linear Systems on Vector and Shared-Memory Computers, SIAM, Philadelphia, 1991.
No context found.
J. Dongarra, I. Duff, D.C. Sorensen and H.A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, SIAM, 1991.
No context found.
J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, Society for Industrial and Applied Mathematics, Philadelphia, 1991.
No context found.
J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, Society for Industrial and Applied Mathematics, Philadelphia, 1991.
No context found.
Dongarra, J., Duff, I. S., Sorensen, D. C., and van der Vorst, H. A. Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia, 1991.
No context found.
Dongarra, J.J., Duff, I.S., Sorensen, D.C. and Van der Vorst, H.A.: Solving Linear Systems on Vector and Shared Memory Computers, SIAM(1991).
No context found.
J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst. Solving linear systems on vector and shared memory computers. SIAM, 1990.
No context found.
J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia, PA, 1991.
No context found.
Jack J. Dongarra, Iain S. Duff, Danny C. Sorensen, and Henk A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1991.
No context found.
Jack J. Dongarra, Iain S. Duff, Danny C. Sorensen, and Henk A. van der Vorst, Solving linear systems on vector and shared memory computers, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1991.
No context found.
J.J. Dongarra, I. S. Duff, D.C. Sorensen & H. A. van der Vorst, Solving linear systems on vector and shared memory computers (SIAM, Philadelphia, 1991).
No context found.
J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. Society for Industrial and Applied Mathematics, 1991.
No context found.
JJ Dongarra, IS Duff, DC Sorensen, and HA van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia, 1991.