73 citations found.
K. A. Gallivan, R. J. Plemmons, A. H. Sameh, Parallel algorithms for dense linear algebra computations, SIAM Review 32 (1) (1990) 54-135.


This paper is cited in the following contexts:


Communication-Efficient Parallel Gaussian Elimination - Tiskin (2003)

.... to the cost of the cube dag method (see Section 2). [Figure 1: Iterative block Gaussian elimination. Figure 2: Recursive block Gaussian elimination.] A lower communication cost for LU decomposition can be achieved by applying the block algorithm recursively (see e.g. [8, 7]; this standard method was suggested as a means of reducing the communication cost in [1] for the transitive closure problem). The BSP cost of block Gauss-Jordan elimination was analysed in [17]; we summarise the results here for completeness. Given a nonsingular matrix A, the algorithm produces ....
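
As an illustration of the recursive block method mentioned in this excerpt, here is a minimal sequential sketch in Python. It is not Tiskin's BSP algorithm and has no communication model. It assumes no pivoting is needed, i.e. every leading principal submatrix is nonsingular (for example, a strictly diagonally dominant matrix); the name `recursive_lu` is hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def recursive_lu(a: Matrix) -> Tuple[Matrix, Matrix]:
    """Return (L, U) with A = L U, L unit lower triangular, by recursive 2x2 blocking.

    No pivoting: assumes every leading principal submatrix of `a` is nonsingular.
    """
    n = len(a)
    if n == 1:
        return [[1.0]], [[float(a[0][0])]]
    h = n // 2
    a11 = [row[:h] for row in a[:h]]
    a12 = [row[h:] for row in a[:h]]
    a21 = [row[:h] for row in a[h:]]
    a22 = [row[h:] for row in a[h:]]

    # 1. Factor the leading block recursively: A11 = L11 U11.
    l11, u11 = recursive_lu(a11)
    # 2. U12 = L11^{-1} A12 by forward substitution (L11 has a unit diagonal).
    u12 = [[0.0] * (n - h) for _ in range(h)]
    for c in range(n - h):
        for i in range(h):
            u12[i][c] = a12[i][c] - sum(l11[i][k] * u12[k][c] for k in range(i))
    # 3. L21 = A21 U11^{-1}, solved row by row against the upper factor.
    l21 = [[0.0] * h for _ in range(n - h)]
    for r in range(n - h):
        for j in range(h):
            acc = a21[r][j] - sum(l21[r][k] * u11[k][j] for k in range(j))
            l21[r][j] = acc / u11[j][j]
    # 4. Schur complement S = A22 - L21 U12, factored recursively: S = L22 U22.
    schur = [[a22[i][j] - sum(l21[i][k] * u12[k][j] for k in range(h))
              for j in range(n - h)] for i in range(n - h)]
    l22, u22 = recursive_lu(schur)

    # Assemble L = [[L11, 0], [L21, L22]] and U = [[U11, U12], [0, U22]].
    lower = [l11[i] + [0.0] * (n - h) for i in range(h)]
    lower += [l21[i] + l22[i] for i in range(n - h)]
    upper = [u11[i] + u12[i] for i in range(h)]
    upper += [[0.0] * h + u22[i] for i in range(n - h)]
    return lower, upper
```

Each level factors the leading half, forms $U_{12}$ and $L_{21}$ by triangular substitution, and recurses on the Schur complement $A_{22} - L_{21}U_{12}$; in a parallel setting that matrix-product update is typically where most of the work sits.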

....its bottom element, we pre-multiply it by a $2 \times 2$ transformation matrix: $\begin{pmatrix} 1 & 0 \\ -a_2/a_1 & 1 \end{pmatrix}$ if $a_1 \neq 0$, and $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ if $a_1 = 0$. In a larger matrix, several such transformations on $2 \times 1$ blocks can be carried out simultaneously. This technique is known as pairwise pivoting (see e.g. [16, 8]). A similar pattern can be applied to perform elimination on numerical matrices. In this case, we have to be more careful in choosing the transformation, in order to achieve better numerical stability. A standard, numerically stable transformation known as Givens rotation is defined as $\begin{pmatrix} c & s \\ -s & c \end{pmatrix}$ ....
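
The two kinds of $2 \times 2$ transformation can be compared on a single $2 \times 1$ block with a small Python sketch. `pairwise_transform` uses the elimination matrix $\begin{pmatrix} 1 & 0 \\ -a_2/a_1 & 1 \end{pmatrix}$, or a row swap when $a_1 = 0$, with no magnitude comparison (this plain form is our assumption). `givens` builds the rotation $\begin{pmatrix} c & s \\ -s & c \end{pmatrix}$ with $c = a_1/r$, $s = a_2/r$, $r = \sqrt{a_1^2 + a_2^2}$, which is the usual choice but is not stated in the excerpt. All function names are hypothetical.

```python
import math
from typing import List, Tuple


def pairwise_transform(a1: float, a2: float) -> List[List[float]]:
    """Elimination matrix that zeroes the bottom entry of (a1, a2)."""
    if a1 != 0.0:
        return [[1.0, 0.0], [-a2 / a1, 1.0]]
    return [[0.0, 1.0], [1.0, 0.0]]  # a1 == 0: just swap the two rows


def givens(a1: float, a2: float) -> List[List[float]]:
    """Rotation [[c, s], [-s, c]] that zeroes the bottom entry of (a1, a2)."""
    r = math.hypot(a1, a2)
    if r == 0.0:
        return [[1.0, 0.0], [0.0, 1.0]]  # nothing to eliminate
    c, s = a1 / r, a2 / r
    return [[c, s], [-s, c]]


def apply_2x2(m: List[List[float]], v: Tuple[float, float]) -> Tuple[float, float]:
    """Pre-multiply the 2-vector v by the 2x2 matrix m."""
    return (m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1])
```

For example, `apply_2x2(pairwise_transform(2.0, 6.0), (2.0, 6.0))` returns `(2.0, 0.0)`, and `apply_2x2(givens(3.0, 4.0), (3.0, 4.0))` gives `(5.0, 0.0)` up to rounding. The rotation is orthogonal, so it cannot enlarge the 2-norm of the pair, whereas the multiplier $-a_2/a_1$ is unbounded when $|a_1|$ is small.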

[Article contains additional citation context not shown here]

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32(1):54-135, March 1990.


Recent Developments in Dense Numerical Linear Algebra - Higham (2000)

....level 1 and 2 BLAS, as shown in Table 2.1. Hence on machines with a hierarchical memory (e.g. main memory, cache memory, vector registers) the level 3 BLAS involve less data movement per floating-point operation, leading to faster execution. For a more detailed explanation see, for example, [52], [65], or [69, Ch. 1]. Work is ongoing to extend the BLAS standards to support parallelism and sparsity; see [48]. It is important to realise that the BLAS comprise subprogram specifications only; there is freedom in the method used to match the specifications. This freedom is most relevant in the ....
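
To make the data-movement argument concrete, the standard leading-order counts for square $n \times n$ operands are shown below, with $m$ the number of memory references and $f$ the number of floating-point operations. These are the usual textbook figures; we assume, but have not verified, that they match Table 2.1.

```latex
\begin{array}{llccc}
\text{Level} & \text{Operation} & m & f & f/m \\
1 & y \leftarrow \alpha x + y & 3n   & 2n   & 2/3 \\
2 & y \leftarrow A x + y      & n^2  & 2n^2 & 2   \\
3 & C \leftarrow A B + C      & 4n^2 & 2n^3 & n/2
\end{array}
```

Only the level 3 ratio grows with $n$, so each word brought into cache or registers can be reused roughly $n/2$ times.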

.... [Diagram: the stage at which each subdiagonal element is eliminated.] In general, there are $2n-3$ stages, in each of which up to $n/2$ elements are eliminated in parallel. This algorithm is discussed by Wilkinson [130, pp. 236-237] and Gallivan et al. [65]. Sorensen [119] derives an error bound for the factorization that is proportional to $4^n$, which, roughly, comprises a factor $2^n$ bounding the growth factor and a factor $2^n$ bounding $|L|$. This method, along with other variants such as pairwise pivoting, in which all row operations are between ....
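
The diagram in this excerpt did not survive extraction, so the ordering below is one valid scheme we assume for illustration: each subdiagonal entry $(i, j)$ is eliminated using the adjacent row $i-1$, each column is swept bottom-up, and column $j+1$ starts two stages after column $j$. Before each elimination the two rows are interchanged if that keeps the multiplier at most 1 in magnitude (adjacent-row, or neighbour, pivoting). The sketch checks the $2n-3$ stage count and that the eliminations in one stage touch disjoint row pairs, hence at most $n/2$ of them. Function names are hypothetical.

```python
from typing import Dict, List, Tuple


def neighbour_schedule(n: int) -> Dict[int, List[Tuple[int, int]]]:
    """Group subdiagonal entries (i, j), 0-based with i > j, into parallel stages.

    Entry (i, j) is eliminated with rows (i-1, i) at stage n - i + 2*j (1-based).
    """
    stages: Dict[int, List[Tuple[int, int]]] = {}
    for j in range(n - 1):
        for i in range(n - 1, j, -1):
            stages.setdefault(n - i + 2 * j, []).append((i, j))
    return stages


def neighbour_pivoting(a: List[List[float]]) -> Tuple[List[List[float]], List[List[float]]]:
    """Reduce square A to upper triangular R = M A by adjacent-row eliminations.

    Returns (R, M), where M accumulates all row interchanges and eliminations.
    """
    n = len(a)
    # Work on augmented rows [A | I]; the right half ends up holding M.
    w = [list(a[i]) + [1.0 if k == i else 0.0 for k in range(n)] for i in range(n)]
    schedule = neighbour_schedule(n)
    for stage in sorted(schedule):
        for i, j in schedule[stage]:  # disjoint row pairs: parallel in principle
            if abs(w[i][j]) > abs(w[i - 1][j]):
                w[i - 1], w[i] = w[i], w[i - 1]  # keeps |multiplier| <= 1
            if w[i][j] == 0.0:
                continue
            mult = w[i][j] / w[i - 1][j]
            w[i] = [p - mult * q for p, q in zip(w[i], w[i - 1])]
            w[i][j] = 0.0  # store an exact zero rather than rounding residue
    return [row[:n] for row in w], [row[n:] for row in w]
```

Both rows of each pair already have zeros to the left of column $j$ when they are combined, so earlier zeros are never refilled; this is what lets eliminations in different columns overlap.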

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32(1):54--135, 1990.


A Systematic Approach to the Design and Analysis of Linear.. - Gunnels

....Implementation tweaking is a standard part of the process when one is developing high-performance scientific applications intended to run on parallel architectures. In this area of research, algorithmic restructuring and code-level optimizations have traditionally been done by different groups [32]. Unfortunately, information that could be employed to make code more efficient is traditionally obscured in the translation from a high-level description into low-level code. Allowing the user to code in a domain-specific language such that high-level information is retained while automatically ....

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32(1):54-135, 1990.


A Highly Scalable Parallel Algorithm for Sparse Matrix.. - Gupta, Karypis, Kumar (1995)   (39 citations)

....factorization that substantially improve the state of the art in parallel direct solution of sparse linear systems both in terms of scalability and overall performance. It is well known that dense matrix factorization can be implemented efficiently on distributed memory parallel computers [8, 46, 10, 34]. We show that the parallel Cholesky factorization algorithm described here is as scalable as the best parallel formulation of dense matrix factorization on both mesh and hypercube architectures for a wide class of sparse matrices, including those arising in two- and three-dimensional finite ....

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32(1):54--135, March


Eager Combining: A Coherency Protocol for Increasing.. - Bianchini, LeBlanc (1994)   (6 citations)

....with synchronization. Optimization algorithms can avoid contention by examining or updating the global solution infrequently. Linear algebra algorithms can exploit the properties of numerical equations to improve locality of reference, and as a side effect eliminate most producer-consumer sharing [7]. Although most of these techniques reduce contention and improve locality of reference, they may introduce significant complexity in the algorithm, and do not generalize to all producer-consumer sharing. For example, the block algorithms used in linear algebra are quite complex (compared to the ....

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32(1):54--135, March 1990.


High Performance Algorithms To Solve Toeplitz And Block.. - Thirumalai (1996)   (5 citations)

....matrix-matrix primitives (BLAS3) may be used in the factorization of block Toeplitz matrices. On machines with a memory hierarchy, such as the Alliant FX/8 and present-day RISC-microprocessor-based workstations, this provides a significant improvement in performance compared to BLAS2 primitives [51]. In this section, we deal only with the factorization of symmetric positive definite block Toeplitz matrices. If the matrix is indefinite, the Schur algorithm may break down if a principal minor is encountered. Factorization of indefinite block Toeplitz matrices is discussed in a later chapter. ....

....requires some BLAS1 routines, such as dot products and triads, and some BLAS2 routines, such as matrix-vector products and rank-1 updates. If the block size m is very large, on machines with hierarchical memory, such as the Alliant FX/8 or the Cedar multiprocessor, a two-level blocking scheme [51] can be used where the hyperbolic Householders are blocked every k steps, and the block transformations are applied to the remaining portion of the pivot .... [Figure 2.2: Sparsity pattern of $U^{(k)}$.] [Figure 2.3: Sparsity ....

[Article contains additional citation context not shown here]

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh, "Parallel algorithms for dense linear algebra computations," SIAM Review, vol. 32, pp. 54--135, 1990.


A Jacobi Method By Blocks On A Mesh Of Processors - Giménez.. (1997)

....methods, the QR algorithm, divide and conquer (Cuppen's algorithm) and Jacobi's method. More recently, novel algorithms based on invariant subspace decomposition that use matrix-matrix multiplication have been proposed [2, 16]. For a complete survey and references, we suggest the reader turn to [12, 14]. Jacobi's method is the oldest, dating back to the mid-1800s. It has fallen out of favor and been resurrected on many occasions. Its most recent resurgence is mainly due to better stability properties [5] and straightforward parallelization [4, 7, 8, 11, 20, 21]. The primary purpose of this ....

K. A. Gallivan, R. J. Plemmons and A. H. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32(1):54--135, 1990.


Reordering of Sparse Matrices for Parallel Processing - Basermann, Weidner, Hansen, .. (1994)

....the second phase. The sparsity is normally preserved well when a large drop tolerance is used during the factorization. Therefore both the first phase of the factorization and the second phase of the factorization are normally performed by removing nonzero elements by using a large drop tolerance [27, 28, 29, 30, 53, 54]. This means that the factors L and U may be inaccurate, and should only be used as preconditioners in some iterative method. 3.5 Numerical Results. All experiments were carried out on an ALLIANT FX/80 computer by using all eight processors. The parallelism is exploited mainly by calling ....

....We selected a set containing all 25 matrices whose order is greater than 900. Two smaller matrices were added to the set: the first of them contains many nonzero elements, the second one arises from an air pollution model. It should be noted that the same set of test matrices has been used in [27, 28, 29, 30, 54]. Some information about the matrices used is given in Table 3.1. The right-hand side vectors b of the systems Ax = b were created so that all components of the solution vectors are equal to 1. The iteration process is terminated when the two-norm of the solution vector becomes less than ACCUR = ....

[Article contains additional citation context not shown here]

K. A. Gallivan, R. J. Plemmons and A. H. Sameh, "Parallel algorithms for dense linear algebra computations", SIAM Rev. 32 (1990), 54--135.


Software Interleaving - Ricardo Bianchini Mark (1994)

....is the key to significant performance improvement. Several techniques have been developed for reducing memory contention. Linear algebra algorithms can exploit the properties of numerical equations to improve locality of reference, and as a side effect eliminate most non-uniform memory addressing [4]. However, this approach may introduce significant complexity in the algorithm, and does not generalize to other classes of applications, such as the graph algorithms used in our study. Many other techniques for alleviating the effects of a non-uniform distribution of memory accesses assume ....

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32(1):54--135, March 1990.


Data Prefetching for Linear Algebra Operations on High.. - Garcia, Herrero, Navarro (1995)

....taking advantage of superscalar operation can therefore be regarded as the most important directions when exploiting the corresponding features of matrix multiplication algorithms. Many advances have been made in both directions, especially through the application of techniques such as blocking ([GaPS90], [LaRW91], [NaJL94], [Nava94]) and software pipelining ([Lam88], [RaST92], [AiNi88]). Blocking reduces data cache misses. However, this reduction is not enough to obtain optimal performance. Despite using blocking techniques, the processor is stalled during a considerable amount of time waiting for ....
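
A minimal sketch of loop blocking (tiling) for $C = AB$, written in Python for readability. Pure Python will not show the cache effect, so this only illustrates the loop structure a compiled implementation would use. The tile size `bs` is a hypothetical tuning parameter, usually chosen so that a few tiles fit in cache; software pipelining, the other technique mentioned, is not shown.

```python
from typing import List

Matrix = List[List[float]]


def blocked_matmul(a: Matrix, b: Matrix, bs: int = 2) -> Matrix:
    """C = A B computed tile by tile; `bs` is the (hypothetical) tile size."""
    n, k_dim, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, k_dim, bs):
            for jj in range(0, m, bs):
                # Multiply tile A[ii:ii+bs, kk:kk+bs] by tile B[kk:kk+bs, jj:jj+bs]
                # and accumulate into C[ii:ii+bs, jj:jj+bs]; these three tiles are
                # the working set that should stay resident in cache.
                for i in range(ii, min(ii + bs, n)):
                    ci = c[i]
                    for k in range(kk, min(kk + bs, k_dim)):
                        aik = a[i][k]
                        bk = b[k]
                        for j in range(jj, min(jj + bs, m)):
                            ci[j] += aik * bk[j]
    return c
```

The result is the same for every tile size; only the order in which the partial products are accumulated, and hence the memory access pattern, changes.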

K. A. Gallivan, R. J. Plemmons and A.H. Sameh, Parallel Algorithms for Dense Linear Algebra Computations, in Parallel Algorithms for Matrix Computations by K. A. Gallivan et al. SIAM, 1990, pp. 1--82.


Scalable Parallel Algorithms for Solving Sparse Systems of Linear.. - Gupta

....is the most time consuming phase in the direct solution of a sparse system of linear equations, there has been considerable interest in developing its parallel formulations. It is well known that dense matrix factorization can be implemented efficiently on distributed memory parallel computers [9, 41, 11, 29]. However, despite inherent parallelism in sparse direct methods, not much success has been achieved to date in developing their scalable parallel formulations [22, 51] and for several years, it has been a challenge to implement efficient sparse linear system solvers using direct methods on even ....

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32(1):54--135, March 1990. Also appears in K. A. Gallivan et al. Parallel Algorithms for Matrix Computations. SIAM, Philadelphia, PA, 1990.


Potential and Achievable Parallelism in Unsymmetric-Pattern.. - Hadfield, Davis (1994)

....and a structure is defined for the computations by the relationships between these dense submatrices. While parallelism is available between independent submatrices, there is also parallelism within these dense submatrices and means of exploiting this parallelism have been extensively investigated [8]. Duff and Johnsson [5] explored the available parallelism in the multifrontal method applied to symmetric-pattern matrices using analytical models based on unbounded parallelism. In this work, we explore both the available and achievable parallelism of a new multifrontal method for the LU ....

....has the processor that owns the current pivot column compute the multipliers (column of L) and broadcast them to the other processors. All of the processors in the frontal matrix's subcube then update their active columns using these multipliers. This is commonly referred to as a fan-out algorithm [8]. Node weights correspond to a predicted parallel execution time. This predicted time is based on an analytical model of the fan-out algorithm with specific parameters set according to results from an implementation and evaluation of this algorithm on the nCUBE 2. Equation (4) describes the time ....
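
The fan-out step described in this excerpt can be simulated sequentially. In this sketch, which is our own assumption and not the authors' nCUBE 2 code, column $j$ lives on virtual processor `j % nprocs`, pivoting is omitted (the leading principal submatrices are assumed nonsingular), and the broadcast is modelled by handing every processor the same copy of the multipliers. `fan_out_lu` is a hypothetical name.

```python
from typing import Dict, List

Matrix = List[List[float]]


def fan_out_lu(a: Matrix, nprocs: int) -> Matrix:
    """Sequential simulation of fan-out LU (no pivoting) on `nprocs` virtual processors.

    Column j is owned by processor j % nprocs (a column-wrap mapping we assume).
    Returns the packed factors: multipliers below the diagonal, U on and above it.
    """
    n = len(a)
    # local[p] maps a global column index to that column's data on processor p.
    local: List[Dict[int, List[float]]] = [{} for _ in range(nprocs)]
    for j in range(n):
        local[j % nprocs][j] = [a[i][j] for i in range(n)]

    for k in range(n - 1):
        col = local[k % nprocs][k]
        # 1. The owner of pivot column k computes the multipliers (column k of L).
        for i in range(k + 1, n):
            col[i] /= col[k]
        # 2. Broadcast: every processor receives its own copy of the multipliers.
        multipliers = col[k + 1:]
        # 3. Each processor updates the active columns (j > k) that it owns.
        for p in range(nprocs):
            for j, cj in local[p].items():
                if j > k:
                    for i in range(k + 1, n):
                        cj[i] -= multipliers[i - k - 1] * cj[k]
    return [[local[j % nprocs][j][i] for j in range(n)] for i in range(n)]
```

The packed result holds the strict lower part of $L$ (unit diagonal implied) below the diagonal and $U$ on and above it. It does not depend on `nprocs`, since only the owner of each column changes, not the arithmetic.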

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense linear algebra computations. In R. J. Plemmons, editor, Parallel Algorithms for Matrix Computations, pages 1--82. SIAM, Philadelphia, PA, 1990.


A High Performance Sparse Cholesky Factorization Algorithm.. - Karypis, Kumar (1994)   (5 citations)

....systems arising in certain applications, such as linear programming and some structural engineering applications, they are the only feasible methods for numerical factorization. It is well known that dense matrix factorization can be implemented efficiently on distributed memory parallel computers [4, 27, 7, 22]. However, despite inherent parallelism in sparse direct methods, not much success has been achieved to date in developing their scalable parallel formulations [15, 38] and for several years, it has been a challenge to implement efficient sparse linear system solvers using direct methods ....

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32(1):54--135, March 1990. Also appears in K. A. Gallivan et al. Parallel Algorithms for Matrix Computations. SIAM, Philadelphia, PA, 1990.


Block LU Factorization - Demmel et al. (1995)   (5 citations)

....primary 65F05, 65F25, 65G05. 1 Introduction. Block methods in matrix computations are widely recognised as being able to achieve high performance on modern vector and parallel computers. Their performance benefits have been investigated by various authors over the last decade (see, for example, [11, 14, 15]) and in particular by the developers of LAPACK [1]. The rise to prominence of block methods has been accompanied by the development of the level 3 Basic Linear Algebra Subprograms (BLAS3), a set of specifications of Fortran primitives for various types of matrix multiplication, together with ....

....$3 \times 3$ matrix. L is block lower triangular with identity matrices on the diagonal (and hence is lower triangular) and U is block upper triangular (but the diagonal blocks $U_{ii}$ are not triangular, in general). Block LU factorization has been discussed by various authors; see, for example, [5, 15, 22, 25]. It appears to have first been proposed for block tridiagonal matrices, which frequently arise in the discretization of partial differential equations [16, Sec. 4.5.1], [21, p. 59], [23], [27]. An attraction of block LU factorization is that one particular implementation has a greater amount of ....

[Article contains additional citation context not shown here]

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh, Parallel algorithms for dense linear algebra computations, SIAM Review, 32 (1990), pp. 54--135.


Stability of Block Algorithms with Fast Level 3 BLAS - Demmel, Higham (1992)   (11 citations)

....1 Introduction. A block algorithm in matrix computations is defined in terms of operations on submatrices rather than matrix elements. Such algorithms are well suited to many high-performance computers because their data locality properties lead to efficient usage of memory hierarchies [16], [17], [18, Ch. 1]. When a block algorithm is coded in Fortran, advantage can be taken of the level 3 Basic Linear Algebra Subprograms (BLAS3). The BLAS3 are a set of Fortran primitives for various types of matrix multiplication, together with solution of a triangular system with multiple right-hand ....

....2.34e-16 (4.56e-17), 6.81e-16 (1.07e-16), 2.31e-17 (7.16e-18), 2.99e-17 (8.38e-18). 2.3 Block Triangular LU Factorization. Next, we discuss the computation of a true block LU factorization $A = LU \in \mathbb{R}^{n \times n}$, where L and U are block lower triangular and block upper triangular, respectively [17, 29]. This factorization is not used in LAPACK; we consider it here because it provides a salutary example of how a plausible block algorithm can be unstable. Assuming that $A_{11} \in \mathbb{R}^{r \times r}$ is nonsingular, we can write $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} I & 0 \\ L_{21} & I \end{pmatrix} \begin{pmatrix} A_{11} & A_{12} \\ 0 & B \end{pmatrix} = LU, \quad (2.19)$ which ....
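
One step of the block factorization in (2.19) can be sketched as follows: it computes $L_{21} = A_{21}A_{11}^{-1}$ and the Schur complement $B = A_{22} - L_{21}A_{12}$. The helper `solve` (Gaussian elimination with partial pivoting) and all function names are our own, for illustration only. The diagonal block of $U$ is $A_{11}$ itself, which is generally not triangular, and nothing here guards against the instability the excerpt warns about: if $A_{11}$ is ill conditioned, $L_{21}$ and $B$ can be inaccurate.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def solve(m: Matrix, rhs: Matrix) -> Matrix:
    """Solve M X = RHS for square M by Gaussian elimination with partial pivoting."""
    n, ncol = len(m), len(rhs[0])
    aug = [list(m[i]) + list(rhs[i]) for i in range(n)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[p] = aug[p], aug[k]
        for i in range(k + 1, n):
            f = aug[i][k] / aug[k][k]
            for j in range(k, n + ncol):
                aug[i][j] -= f * aug[k][j]
    x = [[0.0] * ncol for _ in range(n)]
    for i in range(n - 1, -1, -1):  # back substitution
        for c in range(ncol):
            acc = aug[i][n + c] - sum(aug[i][j] * x[j][c] for j in range(i + 1, n))
            x[i][c] = acc / aug[i][i]
    return x


def block_lu(a: Matrix, r: int) -> Tuple[Matrix, Matrix]:
    """A = L U with L = [[I, 0], [L21, I]] and U = [[A11, A12], [0, B]]."""
    n = len(a)
    a11 = [row[:r] for row in a[:r]]
    a12 = [row[r:] for row in a[:r]]
    a21 = [row[:r] for row in a[r:]]
    a22 = [row[r:] for row in a[r:]]
    # L21 = A21 A11^{-1}, computed from the transposed system A11^T L21^T = A21^T.
    l21 = transpose(solve(transpose(a11), transpose(a21)))
    # B = A22 - L21 A12 is the Schur complement of A11 in A.
    b = [[a22[i][j] - sum(l21[i][k] * a12[k][j] for k in range(r))
          for j in range(n - r)] for i in range(n - r)]
    lower = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n - r):
        lower[r + i][:r] = l21[i]
    upper = [a11[i] + a12[i] for i in range(r)] + [[0.0] * r + b[i] for i in range(n - r)]
    return lower, upper
```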

[Article contains additional citation context not shown here]

K.A. Gallivan, R.J. Plemmons and A.H. Sameh, Parallel algorithms for dense linear algebra computations, SIAM Review, 32 (1990), pp. 54--135.


A High Performance Sparse Cholesky Factorization Algorithm.. - Karypis, Kumar (1994)   (5 citations)

....systems arising in certain applications, such as linear programming and some structural engineering applications, they are the only feasible methods for numerical factorization. It is well known that dense matrix factorization can be implemented efficiently on distributed memory parallel computers [4, 17, 20]. However, despite inherent parallelism in sparse direct methods, not much success has been achieved to date in developing their scalable parallel formulations [12, 28] and for several years, it has been a challenge to implement efficient sparse linear system solvers using direct methods on even ....

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32(1):54--135, March 1990.


Highly Scalable Parallel Algorithms for Sparse Matrix.. - Gupta, Karypis, Kumar (1995)   (39 citations)

....factorization that substantially improve the state of the art in parallel direct solution of sparse linear systems both in terms of scalability and overall performance. It is well known that dense matrix factorization can be implemented efficiently on distributed memory parallel computers [8, 47, 10, 35]. We show that the parallel Cholesky factorization algorithms described here are as scalable as the best parallel formulation of dense matrix factorization on both mesh and hypercube architectures for a wide class of sparse matrices, including those arising in two- and three-dimensional finite ....

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32(1):54--135, March 1990. Also appears in K. A. Gallivan et al. Parallel Algorithms for Matrix Computations. SIAM, Philadelphia, PA, 1990.


Analysis and Design of Scalable Parallel Algorithms for Scientific .. - Gupta (1995)   (2 citations)

....is the most time consuming phase in the direct solution of a sparse system of linear equations, there has been considerable interest in developing its parallel formulations. It is well known that dense matrix factorization can be implemented efficiently on distributed memory parallel computers [40, 109, 42, 81]. However, despite inherent parallelism in sparse direct methods, not much success has been achieved to date in developing their scalable parallel formulations [63, 123] and for several years, it has been a challenge to implement efficient sparse linear system solvers using direct methods on even ....

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32(1):54--135, March 1990. Also appears in K. A. Gallivan et al. Parallel Algorithms for Matrix Computations. SIAM, Philadelphia, PA, 1990.


A Scalable Parallel Algorithm for Sparse Matrix Factorization - Gupta, Kumar (1994)   (7 citations)

....incurs strictly less communication overhead than any known parallel formulation of sparse matrix factorization, and hence, can utilize a higher number of processors effectively. It is well known that dense matrix factorization can be implemented efficiently on distributed memory parallel computers [10, 39, 12, 26]. However, despite inherent parallelism in sparse direct methods, not much success has been achieved to date in developing their scalable parallel formulations [23, 47]. In this paper, we show that the parallel Cholesky factorization .... (Footnote: This work was supported by Army Research Office under ....)

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32(1):54--135, March 1990. Also appears in K. A. Gallivan et al. Parallel Algorithms for Matrix Computations. SIAM, Philadelphia, PA, 1990.


Numerical Linear Algebra and Computer Architecture: An Evolving.. - Hedayat (1993)   (2 citations)

.... to take advantage of the cache memory, vector registers, and multiprocessing capabilities of the IBM 3090 [1] (see also [121] for related results). A numerical linear algebra library based on block methods was designed and analyzed for the Cedar machine at the University of Illinois [60] in the mid-1980s. Calahan [20] introduced a block-oriented linear equation solver on the CRAY-2 uniprocessor with a software-managed local memory. Dayde and Duff [31] compared the performance of block LU factorization on three vector multiprocessors. Demmel and Higham investigated the ....

....to block algorithms, their implementation, and comparisons between various machines. It is indicative of the intense activity in design, restructuring and implementation of block linear algebra algorithms. For further work, the interested reader is referred to the comprehensive survey in [60], and the monograph [46]. Block algorithms derive their efficiency by reuse of data, through reordering of computation. Similar to the matrix multiplication example on the Cyber 205, the effect of loop interchange on the access pattern of LU decomposition was studied in detail in [44] and [113] ....

[Article contains additional citation context not shown here]

Gallivan K., Plemmons R., Sameh A., Parallel algorithms for dense linear algebra computations, SIAM Review, March 1990.


MOB Forms: A Class of Multilevel Block Algorithms for Dense.. - Juan Navarro Toni (1994)   (11 citations)

....results of the MOB forms in some present high-performance workstations are presented. 1 Introduction. In the last decade block algorithms have been proposed for dense linear algebra operations, with the objective of exploiting the data locality in architectures with a memory hierarchy [GaPS90]. These proposals are for one or two levels of the hierarchy. We can mention, for example, the LAPACK library [Ande92], the numerical codes developed to exploit the use of vector registers and the cache in Cedar [GaJM88], [GaPS90], and the algorithm to utilize the registers and the cache for the ....

....the data locality in architectures with a memory hierarchy [GaPS90]. These proposals are for one or two levels of the hierarchy. We can mention, for example, the LAPACK library [Ande92], the numerical codes developed to exploit the use of vector registers and the cache in Cedar [GaJM88], [GaPS90], and the algorithm to utilize the registers and the cache for the IBM RS/6000 [DoMR91]. Moreover, compiler techniques to generate these algorithms automatically are being developed [Wolf87], [CaKe92]. Although this blocking approach has produced dramatic improvements for the machines considered, it ....

[Article contains additional citation context not shown here]

K. A. Gallivan, R. J. Plemmons and A.H. Sameh, Parallel Algorithms for Dense Linear Algebra Computations, in Parallel Algorithms for Matrix Computations by K. A. Gallivan et al. SIAM, 1990, pp. 1--82.


Compiler Blockability of Dense Matrix Factorizations - Carr, Lehoucq (1997)   (13 citations)

....factorizations involve on the order of $n^3$ floating-point operations for data that needs $n^2$ memory locations. With the advent of vector and parallel supercomputers, the efficiency of the factorizations was seen to depend dramatically upon the algorithmic form chosen for the implementation [16, 18, 32]. These studies concluded that managing the memory hierarchy is the single most important factor governing the efficiency of the software implementation computing the factorization. The motivation of the LAPACK [2] project was to recast the algorithms in the EISPACK [35] and LINPACK [14] software ....

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32:54--135, 1990.


On The LU Factorization Of Sequences Of Identically Structured.. - Hadfield (1994)   (5 citations)

....recommend inclusion of pipelining to offset the cost of pivoting. For efficiency of pivot determination, row pivoting is preferred with column storage and column pivoting with row storage. A simple illustrative example of a basic (non-pipelined) algorithm (taken from Gallivan, Plemmons, and Sameh [59]) is shown in Figure 2-16. It uses column-oriented storage with a row pivoting scheme. Sparse Matrix Computations: While considerable attention has been paid to implementation of dense matrix operations on parallel architectures, less has been given to sparse matrix operations and there exists ....
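
The pairing of row pivoting with column storage can be illustrated by a basic right-looking LU factorization with row (partial) pivoting that keeps the matrix as a Python list of columns. This is a minimal sketch under our own assumptions (nonsingular input, hypothetical function name), not a transcription of the figure the excerpt refers to. The pivot search scans a single stored column, which is contiguous in this layout; the row interchange, by contrast, has to touch every column.

```python
from typing import List, Tuple


def lu_column_storage(cols: List[List[float]]) -> Tuple[List[List[float]], List[int]]:
    """LU with row (partial) pivoting on column-major storage (cols[j] is column j).

    Returns the packed factors, still stored by columns, and `perm`, such that
    row i of P A is row perm[i] of A. Assumes A is nonsingular.
    """
    c = [col[:] for col in cols]
    n = len(c)
    perm = list(range(n))
    for k in range(n - 1):
        # Pivot search: largest magnitude in column k on or below the diagonal.
        p = max(range(k, n), key=lambda i: abs(c[k][i]))
        if p != k:
            perm[k], perm[p] = perm[p], perm[k]
            for col in c:  # the row interchange touches every column
                col[k], col[p] = col[p], col[k]
        pivot = c[k][k]
        for i in range(k + 1, n):
            c[k][i] /= pivot  # multipliers overwrite column k below the diagonal
        for j in range(k + 1, n):  # column-by-column update of the trailing matrix
            ukj = c[j][k]
            for i in range(k + 1, n):
                c[j][i] -= c[k][i] * ukj
    return c, perm
```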

....data dependencies within a parallel solve operation are exactly those reflected by the true L and LU edges for the forward substitution and the true U and LU edges for the back substitution. In order to see this consider the data dependency chart for a lower triangular solve seen in Figure 6-7 [59]. At each diagonal entry, a component of the solution is determined and can then be multiplied by the rest of that column with the resulting set of updates subtracted from the right-hand side in a column-oriented approach. Each particular update term however, need only be applied just before the ....
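
The column-oriented approach can be contrasted with the row-oriented (inner-product) approach for a lower triangular solve $Lx = b$ using two short Python functions; the names are hypothetical. In the column-oriented version, as soon as $x_j$ is known, $x_j$ times column $j$ of $L$ is subtracted from the remaining right-hand side, which is the update pattern the excerpt describes.

```python
from typing import List


def row_oriented_solve(l: List[List[float]], b: List[float]) -> List[float]:
    """Solve L x = b; each x[i] comes from an inner product with row i of L."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        x[i] = (b[i] - sum(l[i][j] * x[j] for j in range(i))) / l[i][i]
    return x


def column_oriented_solve(l: List[List[float]], b: List[float]) -> List[float]:
    """Solve L x = b; once x[j] is known, x[j] times column j of L is
    subtracted from the remaining right-hand side entries."""
    x = list(b)
    n = len(x)
    for j in range(n):
        x[j] /= l[j][j]
        for i in range(j + 1, n):
            x[i] -= l[i][j] * x[j]
    return x
```

With `L = [[2.0, 0.0, 0.0], [1.0, 3.0, 0.0], [4.0, -1.0, 5.0]]` and `b = [2.0, 7.0, 17.0]`, both functions return `[1.0, 2.0, 3.0]`. They perform the same arithmetic in a different order; the column form exposes the independent updates of a whole column at each step.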

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense linear algebra computations. In R. J. Plemmons, editor, Parallel Algorithms for Matrix Computations, pages 1--82. SIAM, Philadelphia, PA, 1990.


Trading Off Parallelism and Numerical Stability - Demmel (1992)   (9 citations)

....parallel pivoting are all unstable, but on average only parallel pivoting is unstable. This is why we can use partial pivoting in practice: its worst case is very rare, but parallel pivoting is so often unstable as to be unusable. We note that an alternate kind of parallel pivoting discussed in [42] appears more stable, apparently because it eliminates entries in different columns as well as rows simultaneously. A final analysis of this problem remains to be done. We also note that, on many machines, the cost of partial pivoting is asymptotically negligible compared to the overall ....

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32:54--135, 1990.


Design And Performance Modeling Of Parallel Block Matrix.. - Dackland, Elmroth

....and shared memory multiprocessors have been the object of intensive research during the past few years. The results have laid a ground for efficient implementations of block algorithms for basic matrix computations on different hierarchical memory and shared memory environments (e.g. see [1, 6, 7, 8, 12, 14, 20]). The current development of scalable distributed memory multicomputers (DMM) asks for the corresponding basis of efficient block algorithms. Several research projects have also been focusing on algorithms for matrix factorizations for DMM. For example, in the mid-eighties non-block algorithms ....

....multicomputers (DMM) asks for the corresponding basis of efficient block algorithms. Several research projects have also been focusing on algorithms for matrix factorizations for DMM. For example, in the mid-eighties non-block algorithms were discussed in [16, 17, 23, 24]. For more references, see [14]. The development of distributed block algorithms has started more recently (e.g. see [3, 13]). This paper is a contribution to the design, analysis, and evaluation of distributed block algorithms for some matrix factorizations which are efficient, and scalable in the sense that they preserve ....

[Article contains additional citation context not shown here]

K. Gallivan, R. Plemmons and A. Sameh, "Parallel Algorithms for Dense Linear Algebra Computations", SIAM Review, Vol. 32 (1990), pp. 54-135.


Potential and Achievable Parallelism in the.. - Hadfield, Davis (1994)

....has the processor that owns the current pivot column compute the multipliers (column of L) and broadcast them to the other processors. All of the processors in the frontal matrix's subcube then update their active columns using these multipliers. This is commonly referred to as a fan-out algorithm [6]. Node weights correspond to a predicted parallel execution time. This predicted time is based on an analytical model of the fan-out algorithm with specific parameters set according to results from an implementation and evaluation of this algorithm on the nCUBE 2. Assembly ....

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh, Parallel algorithms for dense linear algebra computations, in Parallel Algorithms for Matrix Computations, SIAM, Philadelphia, PA, 1990.


Eager Combining: A Coherency Protocol for Increasing.. - Ricardo Bianchini (1994)   (6 citations)

....and Scott, 1991]. Optimization algorithms can avoid contention by examining or updating the global solution infrequently. Linear algebra algorithms can exploit the properties of numerical equations to improve locality of reference, and as a side effect eliminate most producer-consumer sharing [Gallivan et al. 1990]. Although most of these techniques reduce contention and improve locality of reference, they may introduce significant complexity in the algorithm, and do not generalize to all producer-consumer sharing. For example, the block algorithms used in linear algebra are quite complex (compared to the ....

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh, "Parallel Algorithms for Dense Linear Algebra Computations," SIAM Review, 32(1):54--135, March 1990.


A Separation Of Concerns Approach To The Design Of Parallel.. - Michel Chaudron

.... $l_{ij} e_i e_j^T = I - l_j e_j^T$ (9). By taking $W_m = V_{m;}$, an inner-product variant is obtained, which corresponds to the following schedule of our Gamma program: for i = 2 to N: $\Pi_{j=1}^{i-1}$ TS1(i, j) $\Pi_{l=1}^{i-1}$ TS2(i) TS3(i) (10). This variant is also known as the row sweep [6] or the ij variant [9]. By taking $W_m = V_{;m}$, a vector-update variant is obtained, which corresponds to the following schedule of our Gamma program: for j = 1 to N - 1: $\Pi_{i=j+1}^{N}$ TS1(i, j) TS3(i) (11). This variant is also known as the column sweep [6] or the ji variant [9]. In both cases ....

....In both cases, $W_m$ has off-diagonal nonzeros in only a single column or row. Multiplying with the $W_m$'s therefore involves vector-vector (BLAS1) operations. 6.3. BLAS2 and BLAS3 variants. The element variants (see Section 6.1) can be generalized to block form. For the ....

[Article contains additional citation context not shown here]

K.A. Gallivan, R.J. Plemmons, and A.H. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32(1):54--135, March 1990.
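The two element variants named in the context above (row sweep / ij, built on inner products, and column sweep / ji, built on vector updates) can be contrasted with a minimal sketch for a lower triangular system Lx = b. The function names are hypothetical, and the sketch assumes a nonsingular L stored as a list of rows.

```python
def row_sweep(L, b):
    """ij (inner-product) form: x_i = (b_i - sum_{j<i} L[i][j] * x_j) / L[i][i]."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = b[i]
        for j in range(i):          # inner product of row i of L with the known x_j
            s -= L[i][j] * x[j]
        x[i] = s / L[i][i]
    return x


def column_sweep(L, b):
    """ji (vector-update) form: once x_j is known, subtract x_j * L[:, j] below row j."""
    n = len(b)
    x = [float(v) for v in b]       # copy of b, overwritten by the solution
    for j in range(n):
        x[j] /= L[j][j]
        for i in range(j + 1, n):   # vector update (axpy) on the remaining entries
            x[i] -= L[i][j] * x[j]
    return x
```

Both functions perform the same arithmetic in a different loop order. The row sweep reads L by rows and reduces an inner product at each step, while the column sweep reads L by columns and applies one vector update per step.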


Special Purpose Parallel Computing - McColl (1993)   (9 citations)

....factorisation, solution of general, tridiagonal and triangular linear systems, matrix inversion, iterative solution of linear systems (e.g. using the conjugate gradient method), singular value decomposition, eigenvalue problems, QR factorisation, least squares problems, and recursive least squares [34, 85, 131, 132, 136, 137, 160, 162, 182, 185, 306, 311, 318, 331, 378]. The volume of published literature on this topic is huge. A bibliography on parallel numerical algorithms which contains over 2000 entries, many of them concerned with dense linear algebra computations, was produced in 1989 [296]. A useful entry point to the field is the survey paper [132] ....

....Unfortunately, many linear algebra computations arising in practical applications involve very large sparse matrices with no regular structure [80, 86, 105, 138, 159, 191]. The efficient parallel solution of linear systems of this kind is generally much more complex than in the case of dense ....

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense linear algebra computations. In Parallel Algorithms for Matrix Computations, K. A. Gallivan, M. T. Heath, E. Ng, J. M. Ortega, B. W. Peyton, R. J. Plemmons, C. H. Romine, A. H. Sameh, and R. G. Voigt, pages 1--82. SIAM Press, 1990.


A Parallel Formulation of Interior Point Algorithms - Karypis, Gupta, Kumar (1994)   (10 citations)

....concentrated on dense or nearly dense problems [9, 61, 48, 21, 28]. Eckstein [10] provides a good survey of work on parallel algorithms for dense linear programming problems. For dense matrices, there are efficient parallel formulations for both rank-one updates [9] and Cholesky factorizations [12, 30], making it easy to develop highly scalable formulations for dense LP problems [28, 9]. In contrast, attempts to develop general parallel algorithms for sparse linear programming problems have had limited success. Shu and Wu [60] developed parallel formulations for both the product form of ....

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel Algorithms for Dense Linear Algebra Computations. In Parallel Algorithms for Matrix Computations, pages 1--82. SIAM, 1990.


The Formal Derivation of Parallel Triangular System Solvers .. - Chaudron, van Duin

....$b > 1$, the component $T_{l;b;m}$ of the schedule $F_{l;b;h}$ describes a matrix-vector computation. According to the BLAS classification, this is a level 2 operation. For $b = 1$, the strategy $F_{l;1;h}$ describes a strategy in terms of vector-vector (inner product) operations that is also known as the row sweep [7] or ij method [14]. These are BLAS1 primitives. The inner product can be further refined by performing the computations in a recursive doubling manner. In its turn, the recursive doubling strategy can be refined by a strategy that performs the computation in a sequential, element-wise fashion ....

....Taking $b > 1$ yields a strategy where $T_{l;b;h}$ describes a matrix-vector multiplication, i.e. a BLAS level 2 operation. For $b = 1$, the matrix-vector multiplication reduces to vector-vector operations, which are BLAS level 1 operations. This strategy is known in the literature as the column sweep [7] or ji method [14]. The vector update can be further refined by computing the vector products in a sequential fashion. Hence, this computation proceeds by element-wise (point-point) computations. In BLAS terminology, these can be called level 0 BLAS operations. 5 Conclusions and Related Work We ....

K.A. Gallivan, R.J. Plemmons, and A.H. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32(1):54--135, March 1990.
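The recursive doubling refinement of the inner product mentioned in the context above can be sketched as a pairwise (fan-in) summation. This is a sequential sketch under stated assumptions: the function name is hypothetical, and each iteration of the `while` loop stands for one parallel step in which all pairwise additions are independent, so n products are reduced in ceil(log2 n) steps.

```python
def recursive_doubling_dot(x, y):
    """Inner product by pairwise (recursive doubling / fan-in) summation.

    Returns (value, steps), where steps counts the parallel addition steps.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have equal length")
    partial = [a * b for a, b in zip(x, y)]  # all products: one parallel step
    steps = 0
    while len(partial) > 1:
        # Add neighbours pairwise; an odd trailing element is carried over unchanged.
        nxt = [partial[i] + partial[i + 1] for i in range(0, len(partial) - 1, 2)]
        if len(partial) % 2:
            nxt.append(partial[-1])
        partial = nxt
        steps += 1
    return (partial[0] if partial else 0.0), steps
```

For example, `recursive_doubling_dot([1, 2, 3, 4, 5], [1, 1, 1, 1, 1])` returns `(15, 3)`: five partial products shrink to three, then two, then one. The sequential element-wise refinement from the context corresponds to replacing the tree with a single running sum.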


Parallelizing Strassen's Method for Matrix Multiplication.. - Chou, Deng, Li, Wang (1994)   (1 citation)

....only on large matrices, which require large machines such as parallel computers. Thus, designing efficient parallel algorithms for these methods becomes essential. The parallelization of general linear algebra routines on distributed-memory MIMD architectures has achieved reasonable success [8]. But, due to the complexity of these dedicated MM methods, progress has not been comparable. In fact, Manber [12] claimed in 1989 that the S method cannot be easily parallelized. In addition, the Winograd method has not been parallelized so far. Indeed, the attempts for the parallelization of ....

Gallivan, K. A., Plemmons, R. J. and Sameh, A. H. Parallel algorithms for dense linear algebra computations. SIAM Rev. 32 (1990), 54--135.


Techniques for the Interactive Development of Numerical Linear.. - Marsolf (1997)   (3 citations)

....are not present in the original algorithm. Recently, Gallivan et al. have pointed out that it is often possible to move easily from standard algorithms with limited parallelism to highly parallel algorithms by using higher-level transformations that exploit algebraic knowledge of the operations [GPS90]. It is this merger of high-level transformations, algebraic knowledge, and more traditional compiler strategies applied to an algebraically expressive language that is to be exploited by the restructuring techniques within the FALCON environment. 2.2 Languages Instead of developing tools to ....

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel Algorithms for Dense Linear Algebra Computations. SIAM Review, 32(1):54--135, March 1990.


Software Libraries For Linear Algebra Computations On High.. - Dongarra, Walker (1995)   (42 citations)

....of one particular block algorithm, we now describe examples of the performance achieved with two well-known block algorithms: LU and Cholesky factorizations. No extra floating-point operations or extra working storage are required for either of these simple block algorithms. See Gallivan et al. [31] and Dongarra et al. [19] for surveys of algorithms for dense linear algebra on high performance computers. Table 3 illustrates the speed of the LAPACK routine for LU factorization of a real matrix, SGETRF in single precision on CRAY machines, and DGETRF in double precision on all other machines. ....

K. Gallivan, R. Plemmons, and A. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32(1):54--135, 1990.


Notification And Multicast Networks For Synchronization.. - Andrews, Beckmann.. (1992)   (9 citations)

....before writing to a successor's flag, and reading the critical data from memory after acquiring the lock. 2.3.3 A Triangular Solve Algorithm Figure 5 shows pseudo-code for a triangular system solver using multicast. It is a blocked column-sweep algorithm with parallel column updates [14] using static assignment of rows to processors. A is a lower triangular matrix, and xy are solution vectors that initially contain the right-hand-side vectors and eventually contain the solution. The figure shows the code executed by each processor. Figure 6 illustrates one iteration of the outer ....

Gallivan K., Plemmons R., and Sameh A. Parallel Algorithms for Dense Linear Algebra Computations. SIAM Review, 32, 1 (March 1990), 54-135.
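A blocked column sweep of the kind described in the context above can be sketched sequentially as follows. This is not the multicast code of Andrews et al.: the function name and the block size parameter `nb` are assumptions, and the comments only indicate where their static row-to-processor assignment would split the work.

```python
def blocked_column_sweep(L, b, nb):
    """Solve L x = b (L lower triangular, nonsingular) by a blocked column sweep.

    Each step solves one nb-by-nb diagonal block, then updates all remaining
    entries with a single matrix-vector product (a BLAS2-style operation).
    """
    n = len(b)
    x = [float(v) for v in b]  # copy of b, overwritten by the solution
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        # 1) Solve the diagonal block L[k0:k1, k0:k1] with an unblocked column sweep.
        for j in range(k0, k1):
            x[j] /= L[j][j]
            for i in range(j + 1, k1):
                x[i] -= L[i][j] * x[j]
        # 2) Update rows below the block: x[k1:] -= L[k1:, k0:k1] @ x[k0:k1].
        #    The rows are independent, so with a static row assignment each
        #    processor could update its own rows once x[k0:k1] is available.
        for i in range(k1, n):
            x[i] -= sum(L[i][j] * x[j] for j in range(k0, k1))
    return x
```

With `nb = 1` this reduces to the element-wise column sweep; larger blocks turn most of the work into matrix-vector updates.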


The Design and Implementation of the ScaLAPACK LU, .. - Choi, Dongarra.. (1994)   (7 citations)

....which the BLAS and the BLACS are available. This paper presents the implementation details, performance, and scalability of the ScaLAPACK routines for the LU, QR, and Cholesky factorization of dense matrices. These routines have been studied on various parallel platforms by many other researchers [13, 19, 12]. We maintain compatibility between the ScaLAPACK codes and their LAPACK equivalents by isolating as much of the distributed memory operations as possible inside the PBLAS and ScaLAPACK auxiliary routines. Our goal is to simplify the implementation of complicated parallel routines while still ....

K. Gallivan, R. Plemmons, and A. Sameh. Parallel Algorithms for Dense Linear Algebra Computations. SIAM Review, 32:54--135, 1990.


The Design of Linear Algebra Libraries for High Performance.. - Dongarra, Walker (1993)   (1 citation)

....of one particular block algorithm, we now describe examples of the performance achieved with two well-known block algorithms: LU and Cholesky factorizations. Neither extra floating-point operations nor extra working storage is required for either of these simple block algorithms. See Gallivan et al. [33] and Dongarra et al. [19] for surveys of algorithms for dense linear algebra on high performance computers. Table 3 illustrates the speed of the LAPACK routine for LU factorization of a real matrix, SGETRF in single precision on CRAY machines, and DGETRF in double precision on all other machines. ....

K. Gallivan, R. Plemmons, and A. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32(1):54--135, 1990.


Numerical Methods in Aero-Optics - Plemmons   Self-citation (Plemmons)

....by h in (10) is a matrix that we denote by H. Here, in the spatially invariant case, H is block Toeplitz with Toeplitz blocks. Thus the fast Fourier transform (FFT) can be used in computations involving H (e.g. [4, 5, 47]), with efficient implementation possible on high performance architectures [19]. A classical approach employed for solving (9) is penalized least squares, also called Tikhonov regularization in the inverse problems literature [15]. It requires minimization of an expression $\|Hf - g\|^2 + \alpha J(f)$ (11), where $\|\cdot\|$ denotes the norm on $L^2(\Omega)$ and $\alpha$ is a ....

K. Gallivan, R. J. Plemmons, and A. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32:54--135, 1990.
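The penalized least squares problem in the context above can be illustrated with a small dense sketch. Two assumptions are made here and are not from the source: the penalty is taken as $J(f) = \|f\|^2$ (standard-form Tikhonov), and the minimizer is computed from the normal equations $(H^T H + \alpha I) f = H^T g$ with a Cholesky factorization, ignoring the block Toeplitz structure that an FFT-based method would exploit. The function name `tikhonov` is hypothetical.

```python
import math


def tikhonov(H, g, alpha):
    """Minimize ||H f - g||^2 + alpha * ||f||^2 for dense H (list of rows).

    Solves (H^T H + alpha I) f = H^T g via Cholesky; requires alpha > 0,
    which makes the system matrix symmetric positive definite.
    """
    m, n = len(H), len(H[0])
    # Form A = H^T H + alpha I and rhs = H^T g.
    A = [[sum(H[k][i] * H[k][j] for k in range(m)) + (alpha if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    rhs = [sum(H[k][i] * g[k] for k in range(m)) for i in range(n)]
    # Cholesky factorization A = C C^T with C lower triangular.
    C = [[0.0] * n for _ in range(n)]
    for j in range(n):
        C[j][j] = math.sqrt(A[j][j] - sum(C[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            C[i][j] = (A[i][j] - sum(C[i][k] * C[j][k] for k in range(j))) / C[j][j]
    # Forward substitution C y = rhs.
    y = [0.0] * n
    for i in range(n):
        y[i] = (rhs[i] - sum(C[i][k] * y[k] for k in range(i))) / C[i][i]
    # Back substitution C^T f = y.
    f = [0.0] * n
    for i in reversed(range(n)):
        f[i] = (y[i] - sum(C[k][i] * f[k] for k in range(i + 1, n))) / C[i][i]
    return f
```

With H equal to the identity and $\alpha = 1$, the regularized solution is $g/2$; larger $\alpha$ shrinks $f$ further toward zero.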


An Environment for the Development of Numerical.. - De Rose, Gallivan.. (1997)   Self-citation (Gallivan)

....that are not present in the original algorithm. Gallivan et al. have pointed out that it is often possible to move easily from standard algorithms with limited parallelism to highly parallel algorithms by using higher-level transformations that exploit algebraic knowledge of the operations [GPS90]. It is this merger of high-level transformations, algebraic knowledge, and more traditional compiler strategies applied to an algebraically expressive language that is to be exploited by the restructuring techniques within the FALCON environment. The approach in [BBC 93] is to distribute ....

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel Algorithms for Dense Linear Algebra Computations. SIAM Review, 32(1):54--135, March 1990.


Parallel Numerical Algorithms and Software - Houstis, Sameh, Vavalis.. (1998)   Self-citation (Sameh)

....the factorization $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} I & 0 \\ L_{21} & I \end{pmatrix} \begin{pmatrix} A_{11} & A_{12} \\ 0 & B \end{pmatrix}$, where $A_{11}$ is a square matrix. The block LU algorithm is given in Table 3, where statements (i) and (ii) can be implemented in several ways [21]. Table 3. The main step of four LU versions. The arrow is used to represent the portion of the array which is overwritten by the new information obtained in each phase. Version 1: (i) solve for $G$: $L_{i-1} G = C$; (ii) solve for $M$: .... Version 2: (i) factor $A_{11} = L_{11} U_{11}$; ....

....this form of the algorithm becomes the BLAS2 version based on rank-1 updates. As with Versions 1-4, which produce the classical LU factorization, the computations of Version 5 can be reorganized so that different combinations of BLAS3 primitives and different shapes of submatrices are used [21]. For distributed-memory implementations we consider the two basic storage schemes: storage of A by rows and by columns. These storage schemes lead to the so-called Row Storage with Row Pivoting (RSRP) and Column Storage with Row Pivoting (CSRP) algorithms. Gaussian elimination with pairwise ....

[Article contains additional citation context not shown here]

K. Gallivan, R. Plemmons, and A. Sameh, Parallel algorithms for dense linear algebra computations, SIAM Rev., 32 (1990), pp. 54--135.
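Applying the 2x2 block factorization from the context above recursively to the trailing block $B = A_{22} - L_{21} A_{12}$ gives a right-looking block LU. The sketch below is one such organization under stated assumptions (no pivoting, so all leading principal minors must be nonzero; hypothetical name `block_lu` and block size `nb`); it is not a transcription of the Versions 1-5 in Table 3 of the citing paper.

```python
def block_lu(A, nb):
    """Right-looking block LU without pivoting.

    Returns one matrix holding unit lower L strictly below the diagonal
    and U on and above it, so that A = L U.
    """
    n = len(A)
    a = [list(map(float, row)) for row in A]  # work on a copy of the input
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        # (a) Factor the diagonal block A11 = L11 U11 in place (unblocked).
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                a[i][k] /= a[k][k]
                for j in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]
        # (b) U12 = L11^{-1} A12: forward substitution with unit-lower L11.
        for j in range(k1, n):
            for i in range(k0, k1):
                a[i][j] -= sum(a[i][p] * a[p][j] for p in range(k0, i))
        # (c) L21 = A21 U11^{-1}: solve X U11 = A21 one row at a time.
        for i in range(k1, n):
            for j in range(k0, k1):
                a[i][j] = (a[i][j] - sum(a[i][p] * a[p][j] for p in range(k0, j))) / a[j][j]
        # (d) Schur complement B = A22 - L21 U12 (the matrix-matrix, BLAS3-style update).
        for i in range(k1, n):
            for j in range(k1, n):
                a[i][j] -= sum(a[i][p] * a[p][j] for p in range(k0, k1))
    return a
```

Step (d) is where most of the arithmetic goes for large n, which is why block formulations of this kind are attractive on machines with hierarchical memory.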


On Solving Block Toeplitz Systems Using a Block Schur Algorithm - Srikanth Thirumalai (1994)   Self-citation (Gallivan)

....on the use of block hyperbolic Householder matrices to represent products of hyperbolic Householder reflectors. Block operations are desirable since they are rich in level 3 BLAS operations [6] [7]. The algorithm can also be used to factor symmetric positive definite Toeplitz matrices by forgoing some of the Toeplitz structure in the matrix and considering it to be a block Toeplitz matrix. The factorization of symmetric indefinite Toeplitz matrices is handled by a modification to the block .... (Footnote: available as CSRD Report 1416 and, in shortened form, in the Proceedings of the 1994 ICPP, pp. 274-281.)

....matrices requires some BLAS1 routines such as dot products and triads and some BLAS2 routines such as matrix-vector products and rank-1 updates. If the block size m is very large, then on machines with hierarchical memory like the Alliant FX/8 or the Cedar multiprocessor, a two-level blocking scheme [7] can be used, where the hyperbolic Householder transformations are blocked every k steps and the block transformations are applied to the remaining portion of the pivot block or the entire generator matrix. If the block size is small, then the generation of (V, Y) or (Y, T) can be carried through until the m-th step ....

[Article contains additional citation context not shown here]

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32:54--135, 1990.


Communication-Efficient Parallel Gaussian Elimination - Tiskin (2004)

No context found.

K. A. Gallivan, R. J. Plemmons, A. H. Sameh, Parallel algorithms for dense linear algebra computations, SIAM Review 32 (1) (1990) 54-135.


Highly Scalable Parallel Algorithms for Sparse Matrix.. - Gupta, Karypis, Kumar (1995)   (39 citations)

No context found.

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32(1):54--135, March 1990.


The Design and Analysis of Bulk-Synchronous Parallel Algorithms - Tiskin (1998)   (7 citations)

No context found.

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel algorithms for dense linear algebra computations. SIAM Review, 32(1):54--135, March 1990.


A Block Version of the Eskow-Schnabel Modified Cholesky.. - Daydé (1995)   (1 citation)

No context found.

Gallivan, K., Plemmons, R.J., and Sameh, A.H. (1990). Parallel algorithms for dense linear algebra computations. SIAM Rev. (32), 54-135.


Parallel and Distributed Scientific Computing - A.. - Petitet..

No context found.

Gallivan, K., Plemmons, R., Sameh, A., Parallel algorithms for dense linear algebra computations, SIAM Review 32, 1990.


Stability Of The Partitioned Inverse Method For Parallel.. - Higham, Pothen (1994)   (1 citation)

No context found.

K. A. Gallivan, R. J. Plemmons, and A. H. Sameh, Parallel algorithms for dense linear algebra computations, SIAM Review, 32 (1990), pp. 54--135.


Parallel and Distributed Scientific Computing - A.. - Petitet..

No context found.

Gallivan K., Plemmons R. and Sameh A., Parallel Algorithms for Dense Linear Algebra Computations, SIAM Review 32, 1990, 54-135.


A Numerical Linear Algebra Problem Solving.. - Petitet.. (1998)

No context found.

K. Gallivan, R. Plemmons and A. Sameh, Parallel Algorithms for Dense Linear Algebra Computations, SIAM Review, 32(1), 1990

CiteSeer.IST - Copyright Penn State and NEC