Results 1 - 10 of 20
Scalability Issues Affecting the Design of a Dense Linear Algebra Library
Journal of Parallel and Distributed Computing, 1994
"... This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distributed memory concurrent computers. These routines form part of the ScaLAPACK mathematical software library that extends the widely-used LAPACK library to run efficiently on scalable concurrent computers ..."
Abstract
-
Cited by 29 (15 self)
- Add to MetaCart
This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distributed memory concurrent computers. These routines form part of the ScaLAPACK mathematical software library that extends the widely-used LAPACK library to run efficiently on scalable concurrent computers. To ensure good scalability and performance, the ScaLAPACK routines are based on block-partitioned algorithms that reduce the frequency of data movement between different levels of the memory hierarchy, and particularly between processors. The block cyclic data distribution that is used in all three factorization algorithms is described. An outline of the sequential and parallel block-partitioned algorithms is given. Approximate models of the algorithms' performance are presented to indicate which factors in the design of the algorithm have an impact upon scalability. These models are compared with timing results on a 128-node Intel iPSC/860 hypercube. It is shown that the routines are highly scalable ...
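The block-cyclic mapping the abstract refers to is simple index arithmetic: deal blocks of nb consecutive indices out to P processes in round-robin order. A minimal sketch of that map (the helper name is ours for illustration, not ScaLAPACK's API):

```cpp
#include <stdio.h>

/* Map a global index to (process, local index) under a 1-D block-cyclic
 * distribution with block size nb over P processes. The function name
 * is illustrative; ScaLAPACK's own index utilities differ. */
static void block_cyclic_map(int g, int nb, int P, int *proc, int *loc)
{
    int block = g / nb;                 /* which block the index falls in */
    *proc = block % P;                  /* blocks dealt out cyclically    */
    *loc  = (block / P) * nb + g % nb;  /* offset in the owner's array    */
}

int main(void)
{
    int nb = 4, P = 3;
    for (int g = 0; g < 16; g++) {
        int p, l;
        block_cyclic_map(g, nb, P, &p, &l);
        printf("global %2d -> process %d, local %2d\n", g, p, l);
    }
    return 0;
}
```

Applying the same map independently to rows (over P process rows) and columns (over Q process columns) yields the two-dimensional block cyclic distribution used by all three factorizations.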
The Multicomputer Toolbox Approach to Concurrent BLAS
Proc. Scalable High Performance Computing Conf. (SHPCC), 1993
"... Concurrent Basic Linear Algebra Subprograms (CBLAS) are a sensible approach to extending the successful Basic Linear Algebra Subprograms (BLAS) to multicomputers. We describe many of the issues involved in general-purpose CBLAS. Algorithms for dense matrix-vector and matrix-matrix multiplication on ..."
Abstract
-
Cited by 29 (8 self)
- Add to MetaCart
Concurrent Basic Linear Algebra Subprograms (CBLAS) are a sensible approach to extending the successful Basic Linear Algebra Subprograms (BLAS) to multicomputers. We describe many of the issues involved in general-purpose CBLAS. Algorithms for dense matrix-vector and matrix-matrix multiplication on general P × Q logical process grids are presented, and experiments demonstrating their performance characteristics are reported.
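The P × Q logical process grids these algorithms assume can be pictured with MPI's Cartesian topology calls, a later standardization of the same idea (the Toolbox itself ran over Zipcode, so this is an analogy rather than the paper's own code):

```cpp
#include <mpi.h>
#include <stdio.h>

/* Build a P x Q logical process grid of the kind the CBLAS algorithms
 * assume, using MPI Cartesian communicators as a modern analogue. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int dims[2] = {0, 0};               /* let MPI choose P and Q */
    MPI_Dims_create(nprocs, 2, dims);
    int periods[2] = {0, 0};
    MPI_Comm grid;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);

    int coords[2];
    MPI_Cart_coords(grid, rank, 2, coords);
    printf("rank %d -> grid position (%d,%d) of %dx%d\n",
           rank, coords[0], coords[1], dims[0], dims[1]);

    /* Row and column subcommunicators, the channels along which
     * grid-based matrix-matrix multiplication broadcasts panels. */
    MPI_Comm row, col;
    int keep_row[2] = {0, 1}, keep_col[2] = {1, 0};
    MPI_Cart_sub(grid, keep_row, &row);
    MPI_Cart_sub(grid, keep_col, &col);

    MPI_Comm_free(&row); MPI_Comm_free(&col); MPI_Comm_free(&grid);
    MPI_Finalize();
    return 0;
}
```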
The Design and Evolution of Zipcode
Parallel Computing, 1994
"... Zipcode is a message-passing and process-management system that was designed for multicomputers and homogeneous networks of computers in order to support libraries and large-scale multicomputer software. The system has evolved significantly over the last five years, based on our experiences and iden ..."
Abstract
-
Cited by 22 (9 self)
- Add to MetaCart
(Show Context)
Zipcode is a message-passing and process-management system that was designed for multicomputers and homogeneous networks of computers in order to support libraries and large-scale multicomputer software. The system has evolved significantly over the last five years, based on our experiences and identified needs. Features of Zipcode that were originally unique to it were its simultaneous support of static process groups, communication contexts, and virtual topologies, forming the "mailer" data structure. Point-to-point and collective operations reference the underlying group, and use contexts to avoid mixing up messages. Recently, we have added "gather-send" and "receive-scatter" semantics, based on persistent Zipcode "invoices," both as a means to simplify message passing, and as a means to reveal more potential runtime optimizations. Key features in Zipcode appear in the forthcoming MPI standard. Keywords: Static Process Groups, Contexts, Virtual Topologies, Point-to-Point Communication ...
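Zipcode's "gather-send" over a persistent "invoice" corresponds closely to what MPI later standardized as a derived datatype combined with a persistent request, one of the features the abstract says carried into MPI. A hedged sketch of that MPI analogue (not Zipcode's actual API):

```cpp
#include <mpi.h>

/* "Gather-send" in the MPI idiom: describe a strided column of a
 * row-major matrix once, then send it repeatedly without packing.
 * Zipcode's own "invoice" interface differed in detail. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    enum { M = 8, N = 8 };
    double a[M][N];

    /* One column of a row-major M x N matrix: M doubles at stride N. */
    MPI_Datatype column;
    MPI_Type_vector(M, 1, N, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0 && size > 1) {
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = i * N + j;

        /* Persistent send: the "invoice" is set up once... */
        MPI_Request req;
        MPI_Send_init(&a[0][2], 1, column, 1, 0, MPI_COMM_WORLD, &req);
        MPI_Start(&req);                 /* ...and reused per message */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Request_free(&req);
    } else if (rank == 1) {
        double col[M];                   /* receive-scatter side: land
                                            the column contiguously   */
        MPI_Recv(col, M, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}
```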
Broadway: A Software Architecture for Scientific Computing
The Architecture of Scientific Software, 2000
"... Scientific programs rely heavily on software libraries. This paper describes the limitations of this reliance and shows how it degrades software quality. We offer a solution that uses a compiler to automatically optimize library implementations and the application programs that use them. Using exa ..."
Abstract
-
Cited by 20 (2 self)
- Add to MetaCart
Scientific programs rely heavily on software libraries. This paper describes the limitations of this reliance and shows how it degrades software quality. We offer a solution that uses a compiler to automatically optimize library implementations and the application programs that use them. Using examples and experiments with the PLAPACK parallel linear algebra library and the MPI message passing interface, we present our solution, which includes a simple declarative annotation language that describes certain aspects of a library's implementation. We also show how our approach can yield simpler scientific programs that are easier to understand, modify and maintain.
A Poly-Algorithm for Parallel Dense Matrix Multiplication on Two-Dimensional Process Grid Topologies
1995
"... In this paper, we present several new and generalized parallel dense matrix multiplication algorithms of the form C = αAB + βC on two-dimensional process grid topologies. These algorithms can deal with rectangular matrices distributed on rectangular grids. We classify these algori ..."
Abstract
-
Cited by 14 (1 self)
- Add to MetaCart
(Show Context)
In this paper, we present several new and generalized parallel dense matrix multiplication algorithms of the form C = αAB + βC on two-dimensional process grid topologies. These algorithms can deal with rectangular matrices distributed on rectangular grids. We classify these algorithms coherently into three categories according to the communication primitives used, and thus offer a taxonomy for this family of related algorithms. All of these algorithms are expressed in the data-distribution-independent approach and thus do not require a specific data distribution for correctness. The algorithmic compatibility condition established here ensures the correctness of the matrix multiplication. We define and extend the data distribution functions and introduce permutation compatibility and algorithmic compatibility. We also discuss a permutation-compatible data distribution (modified virtual 2D data distribution). We conclude that no single algorithm always achieves the best performance ...
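For concreteness, the operation every variant in the taxonomy computes is the general update C = αAB + βC; a plain sequential reference kernel makes the target semantics explicit (the parallel poly-algorithms distribute exactly this computation over the grid):

```cpp
#include <stdio.h>

/* Sequential reference for C = alpha*A*B + beta*C with A (m x k),
 * B (k x n), C (m x n), all row-major. */
static void gemm_ref(int m, int n, int k, double alpha,
                     const double *A, const double *B,
                     double beta, double *C)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            double dot = 0.0;
            for (int p = 0; p < k; p++)
                dot += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * dot + beta * C[i * n + j];
        }
}

int main(void)
{
    double A[2 * 2] = {1, 2, 3, 4};
    double B[2 * 2] = {5, 6, 7, 8};
    double C[2 * 2] = {1, 1, 1, 1};
    gemm_ref(2, 2, 2, 1.0, A, B, 0.5, C);   /* C = A*B + 0.5*C */
    for (int i = 0; i < 2; i++)
        printf("%6.1f %6.1f\n", C[i * 2 + 0], C[i * 2 + 1]);
    return 0;
}
```

The poly-algorithms differ only in how pieces of A and B reach the processes that own pieces of C, which is what the paper's taxonomy by communication primitive classifies.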
The Data-Distribution-Independent Approach to Scalable Parallel Libraries
1995
"... ..."
(Show Context)
The Multicomputer Toolbox: First-Generation Scalable Libraries
In Proceedings of HICSS-27, IEEE Computer Society Press, 1994
"... 1 \First-generation " scalable parallel libraries have been achieved, and are maturing, within the Mul-ticomputer Toolbox. The Toolbox includes sparse, dense, iterative linear algebra, a sti ODE/DAE solver, and an open software technology for additional numerical algorithms, plus an inter-arch ..."
Abstract
-
Cited by 11 (8 self)
- Add to MetaCart
(Show Context)
"First-generation" scalable parallel libraries have been achieved, and are maturing, within the Multicomputer Toolbox. The Toolbox includes sparse, dense, iterative linear algebra, a stiff ODE/DAE solver, and an open software technology for additional numerical algorithms, plus an inter-architecture Makefile mechanism for building applications. We have devised C-based strategies for useful classes of distributed data structures, including distributed matrices and vectors. The underlying Zipcode message-passing system has enabled process-grid abstractions of multicomputers, communication contexts, and process groups, all characteristics needed for building scalable libraries, and ...
The Multicomputer Toolbox: Current and Future Directions
Proceedings of the Scalable Parallel Libraries Conference, IEEE Computer Society Press, 1993
"... The Multicomputer Toolbox is a set of "firstgeneration " scalable parallel libraries. The Toolbox includes sparse, dense, direct and iterative linear algebra, a stiff ODE/DAE solver, and an open software technology for additional numerical algorithms. The Toolbox has an object-oriented des ..."
Abstract
-
Cited by 6 (1 self)
- Add to MetaCart
(Show Context)
The Multicomputer Toolbox is a set of "first-generation" scalable parallel libraries. The Toolbox includes sparse, dense, direct and iterative linear algebra, a stiff ODE/DAE solver, and an open software technology for additional numerical algorithms. The Toolbox has an object-oriented design; C-based strategies for classes of distributed data structures (including distributed matrices and vectors) as well as uniform calling interfaces are defined. At a high level in the Toolbox, data-distribution-independence (DDI) support is provided. DDI is needed to build scalable libraries, so that applications do not have to redistribute data before calling libraries. Data-distribution-independent mapping functions implement this capability. Data-distribution-independent algorithms are sometimes more efficient than fixed-data-distribution counterparts, because redistribution of data can be avoided. Underlying the system is a "performance and portability layer," which includes interfaces to sequential ...
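The data-distribution-independent mapping functions mentioned here can be pictured as a small interface of ownership and localization maps that an algorithm queries instead of hard-coding a layout. A hypothetical C-style sketch (names are ours, not the Toolbox's):

```cpp
#include <stdio.h>

/* Hypothetical sketch of data-distribution independence (DDI): an
 * algorithm sees only this mapping interface and never assumes a
 * particular layout. Names are illustrative, not the Toolbox's API. */
typedef struct {
    int nprocs;                              /* processes in this dimension */
    int nb;                                  /* block size (if used)        */
    int (*owner)(int g, int nprocs, int nb); /* global index -> owner       */
    int (*local)(int g, int nprocs, int nb); /* global index -> local index */
} dist_map;

/* Two interchangeable distributions behind the same interface. */
static int cyclic_owner(int g, int p, int nb) { (void)nb; return g % p; }
static int cyclic_local(int g, int p, int nb) { (void)nb; return g / p; }
static int blkcyc_owner(int g, int p, int nb) { return (g / nb) % p; }
static int blkcyc_local(int g, int p, int nb)
{
    return ((g / nb) / p) * nb + g % nb;
}

int main(void)
{
    dist_map cyc = { 4, 1, cyclic_owner, cyclic_local };
    dist_map blk = { 4, 2, blkcyc_owner, blkcyc_local };
    /* The same query works under either distribution, so a caller
     * never has to redistribute data to satisfy the library. */
    for (int g = 0; g < 8; g++)
        printf("g=%d: cyclic -> P%d[%d], block-cyclic -> P%d[%d]\n", g,
               cyc.owner(g, cyc.nprocs, cyc.nb),
               cyc.local(g, cyc.nprocs, cyc.nb),
               blk.owner(g, blk.nprocs, blk.nb),
               blk.local(g, blk.nprocs, blk.nb));
    return 0;
}
```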
Explicit Parallel Programming in C++ based on the Message-Passing Interface (MPI)
1996
"... Introduction Explicit parallel programming using the Message Passing Interface (MPI), a de facto standard created by the MPI Forum [15], is quickly becoming the strategy of choice for performance-portable parallel application programming on multicomputers and networks of workstations, so it is inev ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
Explicit parallel programming using the Message Passing Interface (MPI), a de facto standard created by the MPI Forum [15], is quickly becoming the strategy of choice for performance-portable parallel application programming on multicomputers and networks of workstations, so it is inevitably of interest to C++ programmers who use such systems. MPI programming is currently undertaken in C and/or Fortran-77, via the language bindings defined by the MPI Forum [15]. While the committee deferred the job of defining a C++ binding for MPI to MPI-2 [16], it is already possible to develop parallel programs in C++ using MPI, with the added help of one of several support libraries [2, 6, 13]. These systems all strive to enable immediate C++ programming based on MPI. The first such enabling system, MPI++, is the focus of this chapter. MPI++ was an early effort on our part to let us leverage MPI while programming in C++. Here this system is, to a large extent, our vehicle to i...
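The kind of layering MPI++ provided can be suggested in a few lines: wrap the MPI C handles and calls in small classes so C++ programs get object-style communication. This is a hypothetical sketch of the idea, not MPI++'s actual class hierarchy:

```cpp
#include <mpi.h>

// Hypothetical minimal C++ wrapper over MPI's C binding, illustrating
// the layering that MPI++ and similar support libraries provided.
class Comm {
public:
    explicit Comm(MPI_Comm c) : comm_(c) {}
    int rank() const { int r; MPI_Comm_rank(comm_, &r); return r; }
    int size() const { int s; MPI_Comm_size(comm_, &s); return s; }
    void send(const double *buf, int n, int dest, int tag) const {
        // const_cast accommodates pre-MPI-3 headers that take void*.
        MPI_Send(const_cast<double *>(buf), n, MPI_DOUBLE, dest, tag, comm_);
    }
    void recv(double *buf, int n, int src, int tag) const {
        MPI_Recv(buf, n, MPI_DOUBLE, src, tag, comm_, MPI_STATUS_IGNORE);
    }
private:
    MPI_Comm comm_;
};

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    Comm world(MPI_COMM_WORLD);
    double x = 3.14;
    if (world.rank() == 0 && world.size() > 1)
        world.send(&x, 1, 1, 0);        // rank 0 sends to rank 1
    else if (world.rank() == 1)
        world.recv(&x, 1, 0, 0);
    MPI_Finalize();
    return 0;
}
```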
Dense and Iterative Concurrent Linear Algebra in the Multicomputer Toolbox
In Proceedings of the Scalable Parallel Libraries Conference (SPLC '93), 1993
"... The Multicomputer Toolbox includes sparse, dense, and iterative scalable linear algebra libraries. Dense direct, and iterative linear algebra libraries are covered in this paper, as well as the distributed data structures used to implement these algorithms; concurrent BLAS are covered elsewhere. We ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
(Show Context)
The Multicomputer Toolbox includes sparse, dense, and iterative scalable linear algebra libraries. Dense direct and iterative linear algebra libraries are covered in this paper, as well as the distributed data structures used to implement these algorithms; concurrent BLAS are covered elsewhere. We discuss uniform calling interfaces and functionality for linear algebra libraries. We include a detailed explanation of how the level-3 dense LU factorization works, including features that support data distribution independence with a blocked algorithm. We illustrate the data motion for this algorithm, and for a representative iterative algorithm, PCGS. We conclude that data-distribution-independent libraries are feasible and highly desirable. Much work remains to be done in performance tuning of these algorithms, though good portability and application-relevance have already been achieved.
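The level-3 blocked LU factorization described here follows the standard right-looking pattern: factor a narrow panel, triangular-solve the block row to its right, then update the trailing submatrix with a matrix-matrix product, the step that makes the algorithm level-3. A minimal sequential sketch without pivoting; the Toolbox version adds pivoting and the distributed data motion the paper illustrates:

```cpp
#include <stdio.h>

/* Right-looking blocked LU without pivoting: the sequential skeleton
 * of the level-3 factorization. A is n x n, row-major, overwritten
 * with L (unit lower triangular) and U. */
static void lu_blocked(double *A, int n, int nb)
{
    for (int k = 0; k < n; k += nb) {
        int kb = (k + nb < n) ? nb : n - k;

        /* Factor the panel A[k:n, k:k+kb] (unblocked elimination). */
        for (int j = k; j < k + kb; j++)
            for (int i = j + 1; i < n; i++) {
                A[i * n + j] /= A[j * n + j];        /* L multiplier  */
                for (int p = j + 1; p < k + kb; p++) /* panel update  */
                    A[i * n + p] -= A[i * n + j] * A[j * n + p];
            }

        /* Triangular solve: U12 = L11^{-1} * A12. */
        for (int j = k; j < k + kb; j++)
            for (int i = j + 1; i < k + kb; i++)
                for (int p = k + kb; p < n; p++)
                    A[i * n + p] -= A[i * n + j] * A[j * n + p];

        /* Trailing update: A22 -= L21 * U12 (the GEMM-rich step). */
        for (int i = k + kb; i < n; i++)
            for (int j = k + kb; j < n; j++) {
                double s = 0.0;
                for (int p = k; p < k + kb; p++)
                    s += A[i * n + p] * A[p * n + j];
                A[i * n + j] -= s;
            }
    }
}

int main(void)
{
    double A[16] = {4, 3, 2, 1,
                    3, 4, 3, 2,
                    2, 3, 4, 3,
                    1, 2, 3, 4};
    lu_blocked(A, 4, 2);
    for (int i = 0; i < 4; i++)
        printf("%7.3f %7.3f %7.3f %7.3f\n",
               A[i * 4 + 0], A[i * 4 + 1], A[i * 4 + 2], A[i * 4 + 3]);
    return 0;
}
```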