Results 1-10 of 20
Scalability Issues Affecting the Design of a Dense Linear Algebra Library
Journal of Parallel and Distributed Computing, 1994
Cited by 29 (15 self)
This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distributed memory concurrent computers. These routines form part of the ScaLAPACK mathematical software library that extends the widely used LAPACK library to run efficiently on scalable concurrent computers. To ensure good scalability and performance, the ScaLAPACK routines are based on block-partitioned algorithms that reduce the frequency of data movement between different levels of the memory hierarchy, and particularly between processors. The block cyclic data distribution, which is used in all three factorization algorithms, is described. An outline of the sequential and parallel block-partitioned algorithms is given. Approximate models of the algorithms' performance are presented to indicate which factors in the design of the algorithm have an impact upon scalability. These models are compared with timing results on a 128-node Intel iPSC/860 hypercube. It is shown that the routines are highl...
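The abstract above names the block cyclic data distribution without defining it. As a rough illustration only, the usual index mapping can be sketched as follows; the function names and the zero-based, zero-offset convention are assumptions made for this sketch and are not taken from the paper or from the ScaLAPACK interface.

```python
def block_cyclic_owner(i, nb, p):
    """Map a global index i to (owning process, local index).

    Assumes zero-based indices, block size nb, p processes, and that
    block 0 lives on process 0 (no source offset).
    """
    block = i // nb            # which block the index falls in
    proc = block % p           # blocks are dealt out round-robin
    local_block = block // p   # how many earlier blocks this process holds
    return proc, local_block * nb + i % nb


def owner_2d(i, j, mb, nb, P, Q):
    """Apply the 1D mapping independently to rows and columns of a P x Q grid."""
    prow, li = block_cyclic_owner(i, mb, P)
    pcol, lj = block_cyclic_owner(j, nb, Q)
    return (prow, pcol), (li, lj)
```

Under these assumptions, with 2 x 2 blocks on a 2 x 3 grid, `owner_2d(5, 6, 2, 2, 2, 3)` places global entry (5, 6) on process (0, 0) at local position (3, 2).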
The Multicomputer Toolbox Approach to Concurrent BLAS
Proc. Scalable High Performance Computing Conf. (SHPCC), 1993
Cited by 29 (8 self)
Concurrent Basic Linear Algebra Subprograms (CBLAS) are a sensible approach to extending the successful Basic Linear Algebra Subprograms (BLAS) to multicomputers. We describe many of the issues involved in general-purpose CBLAS. Algorithms for dense matrix-vector and matrix-matrix multiplication on general P × Q logical process grids are presented, and experiments are run to demonstrate their performance characteristics. This work was supported in part by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy. Work performed under the auspices of the U.S. Department of Energy by the Lawrence Livermore National Laboratory under contract No. W-7405-ENG-48. Submitted to Concurrency: Practice & Experience. Address correspondence to: Mississippi State University, Engineering Research Center, PO Box 6176, Mississippi State, MS 39762. 601-325-8435. tony@cs.msstate.edu. Falgout, Skjellum, Smith & Still  The Multicomputer Toolbo...
The Design and Evolution of Zipcode
Parallel Computing, 1994
Cited by 22 (9 self)
Zipcode is a message-passing and process-management system that was designed for multicomputers and homogeneous networks of computers in order to support libraries and large-scale multicomputer software. The system has evolved significantly over the last five years, based on our experiences and identified needs. Features originally unique to Zipcode were its simultaneous support of static process groups, communication contexts, and virtual topologies, forming the "mailer" data structure. Point-to-point and collective operations reference the underlying group, and use contexts to avoid mixing up messages. Recently, we have added "gather-send" and "receive-scatter" semantics, based on persistent Zipcode "invoices," both as a means to simplify message passing and as a means to reveal more potential runtime optimizations. Key features in Zipcode appear in the forthcoming MPI standard. Keywords: Static Process Groups, Contexts, Virtual Topologies, Point-to-Point Communica...
Broadway: A Software Architecture for Scientific Computing
The Architecture of Scientific Software, 2000
Cited by 20 (2 self)
Scientific programs rely heavily on software libraries. This paper describes the limitations of this reliance and shows how it degrades software quality. We offer a solution that uses a compiler to automatically optimize library implementations and the application programs that use them. Using examples and experiments with the PLAPACK parallel linear algebra library and the MPI message passing interface, we present our solution, which includes a simple declarative annotation language that describes certain aspects of a library's implementation. We also show how our approach can yield simpler scientific programs that are easier to understand, modify and maintain.
A Poly-Algorithm for Parallel Dense Matrix Multiplication on Two-Dimensional Process Grid Topologies
1995
Cited by 14 (1 self)
In this paper, we present several new and generalized parallel dense matrix multiplication algorithms of the form C = αAB + βC on two-dimensional process grid topologies. These algorithms can deal with rectangular matrices distributed on rectangular grids. We classify these algorithms coherently into three categories according to the communication primitives used and thus we offer a taxonomy for this family of related algorithms. All these algorithms are represented in the data distribution independent approach and thus do not require a specific data distribution for correctness. The algorithmic compatibility condition result shown here ensures the correctness of the matrix multiplication. We define and extend the data distribution functions and introduce permutation compatibility and algorithmic compatibility. We also discuss a permutation compatible data distribution (modified virtual 2D data distribution). We conclude that no single algorithm always achieves the best performance...
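To make the update C = αAB + βC on a rectangular process grid concrete, here is a small single-process simulation. It is a sketch under stated assumptions, not one of the algorithms from the paper: entries of C are assigned cyclically to a hypothetical P x Q grid, communication is omitted entirely (every simulated process simply reads the rows of A and columns of B it needs), and all function names are invented for illustration.

```python
def matmul_update(alpha, A, B, beta, C):
    """Sequential reference: return alpha*A*B + beta*C (lists of lists)."""
    k, n = len(B), len(B[0])
    return [[alpha * sum(A[i][l] * B[l][j] for l in range(k)) + beta * C[i][j]
             for j in range(n)]
            for i in range(len(A))]


def grid_matmul_update(alpha, A, B, beta, C, P, Q):
    """Simulate a P x Q grid in which process (p, q) updates only the
    entries C[i][j] with i % P == p and j % Q == q (assumed cyclic layout)."""
    m, k, n = len(A), len(B), len(B[0])
    result = [row[:] for row in C]
    for p in range(P):
        for q in range(Q):
            # Work done by simulated process (p, q) on the entries it owns.
            for i in range(p, m, P):
                for j in range(q, n, Q):
                    dot = sum(A[i][l] * B[l][j] for l in range(k))
                    result[i][j] = alpha * dot + beta * C[i][j]
    return result
```

Because every entry of C has exactly one owner, the simulated result matches the sequential reference for any grid shape, including rectangular matrices on rectangular grids. A real implementation would differ mainly in how the needed pieces of A and B reach each process.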
The Data-Distribution-Independent Approach to Scalable Parallel Libraries
1995
The Multicomputer Toolbox - First-Generation Scalable Libraries
In Proceedings of HICSS-27. IEEE Computer, 1994
Cited by 11 (8 self)
"First-generation" scalable parallel libraries have been achieved, and are maturing, within the Multicomputer Toolbox. The Toolbox includes sparse, dense, iterative linear algebra, a stiff ODE/DAE solver, and an open software technology for additional numerical algorithms, plus an inter-architecture Makefile mechanism for building applications. We have devised C-based strategies for useful classes of distributed data structures, including distributed matrices and vectors. The underlying Zipcode message-passing system has enabled process-grid abstractions of multicomputers, communication contexts, and process groups, all characteristics needed for building scalable libraries, and
The Multicomputer Toolbox: Current and Future Directions
Proceedings of the Scalable Parallel Libraries Conference. IEEE Computer, 1993
Cited by 6 (1 self)
The Multicomputer Toolbox is a set of "first-generation" scalable parallel libraries. The Toolbox includes sparse, dense, direct and iterative linear algebra, a stiff ODE/DAE solver, and an open software technology for additional numerical algorithms. The Toolbox has an object-oriented design; C-based strategies for classes of distributed data structures (including distributed matrices and vectors) as well as uniform calling interfaces are defined. At a high level in the Toolbox, data-distribution-independence (DDI) support is provided. DDI is needed to build scalable libraries, so that applications do not have to redistribute data before calling libraries. Data-distribution-independent mapping functions implement this capability. Data-distribution-independent algorithms are sometimes more efficient than fixed-data-distribution counterparts, because redistribution of data can be avoided. Underlying the system is a "performance and portability layer," which includes interfaces to sequent...
Explicit Parallel Programming in C++ based on the Message-Passing Interface (MPI)
1996
Cited by 5 (1 self)
Introduction. Explicit parallel programming using the Message Passing Interface (MPI), a de facto standard created by the MPI Forum [15], is quickly becoming the strategy of choice for performance-portable parallel application programming on multicomputers and networks of workstations, so it is inevitably of interest to C++ programmers who use such systems. MPI programming is currently undertaken in C and/or Fortran-77, via the language bindings defined by the MPI Forum [15]. While the committee deferred the job of defining a C++ binding for MPI to MPI-2 [16], it is already possible to develop parallel programs in C++ using MPI, with the added help of one of several support libraries [2, 6, 13]. These systems all strive to enable immediate C++ programming based on MPI. The first such enabling system, MPI++, is the focus of this chapter. MPI++ was an early effort on our part to let us leverage MPI while programming in C++. Here this system is, to a large extent, our vehicle to i
Dense and Iterative Concurrent Linear Algebra in the Multicomputer Toolbox
In Proceedings of the Scalable Parallel Libraries Conference (SPLC '93), 1993
Cited by 5 (0 self)
The Multicomputer Toolbox includes sparse, dense, and iterative scalable linear algebra libraries. Dense direct and iterative linear algebra libraries are covered in this paper, as well as the distributed data structures used to implement these algorithms; concurrent BLAS are covered elsewhere. We discuss uniform calling interfaces and functionality for linear algebra libraries. We include a detailed explanation of how the level-3 dense LU factorization works, including features that support data distribution independence with a blocked algorithm. We illustrate the data motion for this algorithm, and for a representative iterative algorithm, PCGS. We conclude that data distribution independent libraries are feasible and highly desirable. Much work remains to be done in performance tuning of these algorithms, though good portability and application-relevance have already been achieved.