Efficient Run-time Support for Irregular Block-Structured Applications
, 1998
Cited by 38 (14 self)
Parallel implementations of scientific applications often rely on elaborate dynamic data structures with complicated communication patterns. We describe a set of intuitive geometric programming abstractions that simplify coordination of irregular block-structured scientific calculations without sacrificing performance. We have implemented these abstractions in KeLP, a C++ run-time library. KeLP's abstractions enable the programmer to express complicated communication patterns for dynamic applications, and to tune communication activity with a high-level, abstract interface. We show that KeLP's flexible communication model effectively manages elaborate data motion patterns arising in structured adaptive mesh refinement, and achieves performance comparable to hand-coded message-passing on several structured numerical kernels. To appear in J. Parallel and Distributed Computing. 1 Introduction. Many scientific numerical methods employ structured irregular representations to improve accura...
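The abstract does not spell out what the "geometric programming abstractions" look like. As a rough illustration only, the sketch below shows how rectangular index regions with grow and intersect operations can describe which data must move between blocks to fill ghost cells. The `Region` class and `ghost_transfers` function are hypothetical names invented for this example; they are not KeLP's C++ API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical rectangular index set with inclusive corners (not KeLP's API)."""
    lo: tuple
    hi: tuple

    def is_empty(self):
        # A region is empty if any lower bound exceeds its upper bound.
        return any(l > h for l, h in zip(self.lo, self.hi))

    def grow(self, width):
        # Pad the region by `width` cells on every side (the ghost layer).
        return Region(tuple(l - width for l in self.lo),
                      tuple(h + width for h in self.hi))

    def intersect(self, other):
        # Component-wise max of lower corners and min of upper corners.
        return Region(tuple(max(a, b) for a, b in zip(self.lo, other.lo)),
                      tuple(min(a, b) for a, b in zip(self.hi, other.hi)))


def ghost_transfers(blocks, width=1):
    """Return (src, dst, region) copies needed to fill each block's ghost cells."""
    transfers = []
    for dst, d in enumerate(blocks):
        halo = d.grow(width)
        for src, s in enumerate(blocks):
            if src == dst:
                continue
            overlap = halo.intersect(s)
            if not overlap.is_empty():
                transfers.append((src, dst, overlap))
    return transfers
```

For two adjacent 4x4 blocks, `ghost_transfers([Region((0, 0), (3, 3)), Region((4, 0), (7, 3))])` yields one single-column copy in each direction. The communication pattern falls out of set operations on the block geometry, with no hand-written index arithmetic per neighbor.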
PLAPACK: Parallel Linear Algebra Package
, 1997
Cited by 32 (9 self)
The PLAPACK project represents an effort to provide an infrastructure for implementing application-friendly, high-performance linear algebra algorithms. The package uses a more application-centric data distribution, which we call Physically Based Matrix Distribution, as well as an object-based (MPI-like) style of programming. It is this style of programming that allows for highly compact codes, written in C but usable from FORTRAN, that more closely reflect the underlying blocked algorithms. We show that this can be attained without sacrificing high performance. 1 Introduction. Parallel implementation of most dense linear algebra operations is a relatively well understood process. Nonetheless, availability of general-purpose, high-performance parallel dense linear algebra libraries is severely hampered by the fact that translating the sequential algorithms, which typically can be described without filling up more than half a chalkboard, to a parallel code requires careful manipulation ...
A Three-Dimensional Approach to Parallel Matrix Multiplication
- IBM Journal of Research and Development
, 1995
Cited by 30 (0 self)
A three-dimensional (3D) matrix multiplication algorithm for massively parallel processing systems is presented. The $P$ processors are configured as a "virtual" processing cube with dimensions $p_1$, $p_2$, and $p_3$ proportional to the matrices' dimensions $M$, $N$, and $K$. Each processor performs a single local matrix multiplication of size $M/p_1 \times N/p_2 \times K/p_3$. Before the local computation can be carried out, each subcube must receive a single submatrix of A and B. After the single matrix multiplication has completed, $K/p_3$ submatrices of this product must be sent to their respective destination processors and then summed together with the resulting matrix C. The 3D parallel matrix multiplication approach has a factor of $P^{1/6}$ less communication than the 2D parallel algorithms. This algorithm has been implemented on IBM POWERparallel™ SP2™ systems (up to 216 nodes) and has yielded close to the peak performance of the machine. The algorithm has been combined with Winog...
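The decomposition described in this abstract can be sketched as a serial simulation. The example assumes A is M x K and B is K x N, so that the third cube dimension splits the inner (summation) index. That mapping, the helper names `split` and `matmul_3d`, and the use of plain Python lists are assumptions made for illustration, not details taken from the paper. Each iteration of the triple loop stands in for one virtual processor, and the accumulation into C stands in for the final summation of partial products.

```python
def split(n, parts):
    """Return `parts` contiguous (start, stop) ranges that cover range(n)."""
    base, extra = divmod(n, parts)
    bounds, start = [], 0
    for r in range(parts):
        stop = start + base + (1 if r < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def matmul_3d(A, B, p1, p2, p3):
    """Serially simulate a p1 x p2 x p3 virtual processor cube computing C = A B."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for i0, i1 in split(M, p1):          # cube dimension 1: rows of C
        for j0, j1 in split(N, p2):      # cube dimension 2: columns of C
            for k0, k1 in split(K, p3):  # cube dimension 3: inner index
                # One virtual processor: a single local block product.
                partial = [[sum(A[i][k] * B[k][j] for k in range(k0, k1))
                            for j in range(j0, j1)]
                           for i in range(i0, i1)]
                # Stand-in for the reduction: sum partial blocks into C.
                for di, i in enumerate(range(i0, i1)):
                    for dj, j in enumerate(range(j0, j1)):
                        C[i][j] += partial[di][dj]
    return C
```

Calling `matmul_3d(A, B, 1, 1, 1)` reduces to an ordinary single-block product, which is a convenient check that the blocked version gives the same result for any cube shape.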
Communication Overlap in Multi-Tier Parallel Algorithms
- Proceedings of Supercomputing
, 1998
Cited by 25 (9 self)
Hierarchically organized multicomputers such as SMP clusters offer new opportunities and new challenges for high-performance computation, but realizing their full potential remains a formidable task. We present a hierarchical model of communication targeted to block-structured, bulk-synchronous applications running on dedicated clusters of symmetric multiprocessors. Our model supports node-level rather than processor-level communication as the fundamental operation, and is optimized for aggregate patterns of regular section moves rather than point-to-point messages. These two capabilities work synergistically. They provide flexibility in overlapping communication and overcome deficiencies in the underlying communication layer on systems where internode communication bandwidth is at a premium. We have implemented our communication model in the KeLP2.0 run-time library. We present empirical results for five applications running on a cluster of Digital AlphaServer 2100s. Four of the applicatio...
A Programming Methodology for Dual-tier Multicomputers
- IEEE Transactions on Software Engineering
, 1999
Cited by 23 (6 self)
Hierarchically organized ensembles of shared memory multiprocessors possess a richer and more complex model of locality than previous-generation multicomputers with single-processor nodes. These dual-tier computers introduce many new degrees of freedom into the programmer's performance model. We present a methodology for implementing block-structured numerical applications on dual-tier computers, and a run-time infrastructure, called KeLP2, that implements the methodology. KeLP2 supports two levels of locality and parallelism via hierarchical SPMD control flow, run-time geometric meta-data, and asynchronous collective communication. It effectively overlaps communication in cases where non-blocking point-to-point message passing can fail to tolerate communication latency, either due to an incomplete implementation or because the point-to-point model is inappropriate. KeLP's abstractions hide considerable detail without sacrificing performance, and dual-tier applications written in KeLP...
Algorithmic redistribution methods for block cyclic decompositions
- IEEE Trans. on PDS
, 1996
Cited by 22 (2 self)
To my parents. Acknowledgments. The writer expresses gratitude and appreciation to the members of his dissertation committee, Michael Berry, Charles Collins, Jack Dongarra, Mark Jones and David Walker for their encouragement and participation throughout my doctoral experience. Special appreciation is due to Professor Jack Dongarra, Chairman, who provided sound guidance, support and appropriate commentaries during the course of my graduate study. I also would like to thank Yves Robert and R. Clint Whaley for many useful and instructive discussions on general parallel algorithms and message passing software libraries. Many valuable comments for improving the presentation of this document were received from L. Susan Blackford. Finally, I am grateful to the Department of Computer Science at the University of Tennessee for allowing me to do this doctoral research work here. A special debt of gratitude is owed to Joanne Martin, IBM POWERparallel Division, for awarding me an IBM Corporation Fellowship covering the tuition as well as a stipend for the 1994-96 academic years. This work was also supported
Runtime Support for Multi-Tier Programming of Block-Structured Applications on SMP Clusters
- International Scientific Computing in Object-Oriented Parallel Environments Conference (ISCOPE ’97)
, 1997
Cited by 16 (4 self)
We present a small set of programming abstractions to simplify efficient implementations for block-structured scientific calculations on SMP clusters. We have implemented these abstractions in KeLP 2.0, a C++ class library. KeLP 2.0 provides hierarchical SPMD control flow to manage two levels of parallelism and locality. Additionally, to tolerate high inter-node communication costs, KeLP 2.0 combines inspector/executor communication analysis with overlap of communication and computation. We illustrate how these programming abstractions hide the low-level details of thread management, scheduling, synchronization, and message-passing, but allow the programmer to express efficient algorithms with intuitive geometric primitives. 1 Introduction. Multi-tier parallel computers, such as clusters of symmetric multiprocessors (SMPs), have emerged as important platforms for high-performance computing [1]. A multi-tier computer, with several levels of locality and parallelism, presents a more c...
A framework for adaptive algorithm selection in STAPL
- In Proc. ACM SIGPLAN Symp. Prin. Prac. Par. Prog. (PPoPP), pp. 277–288
, 2005
Cited by 15 (3 self)
Writing portable programs that perform well on multiple platforms or for varying input sizes and types can be very difficult because performance is often sensitive to the system architecture, the runtime environment, and input data characteristics. This is even more challenging on parallel and distributed systems due to the wide variety of system architectures. One way to address this problem is to adaptively select the best parallel algorithm for the current input data and system from a set of functionally equivalent algorithmic options. Toward this goal, we have developed a general framework for adaptive algorithm selection for use in the Standard Template Adaptive Parallel Library (STAPL). Our framework uses machine learning techniques to analyze data collected by STAPL installation benchmarks and to determine tests that will select among algorithmic options at run-time. We apply a prototype implementation of our framework to two important parallel operations, sorting and matrix multiplication, on multiple platforms and show that the framework determines run-time tests that correctly select the best performing algorithm from among several competing algorithmic options in 86-100% of the cases studied, depending on the operation and the system.
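The idea of a run-time test that chooses among functionally equivalent algorithms can be sketched as follows. This is a minimal illustration, not STAPL's interface. The function names, the "presortedness" input measure, and the thresholds (`small=32`, `nearly_sorted=0.9`) are all hypothetical. In the framework the abstract describes, such tests are derived by machine learning from installation benchmarks rather than hard-coded as they are here.

```python
def insertion_sort(xs):
    """Simple quadratic sort; fast on tiny or nearly sorted inputs."""
    out = list(xs)
    for i in range(1, len(out)):
        key, j = out[i], i - 1
        while j >= 0 and out[j] > key:
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = key
    return out


def merge_sort(xs):
    """O(n log n) sort used when the input is large and disordered."""
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]


def presortedness(xs):
    """Fraction of adjacent pairs already in order (1.0 for trivial inputs)."""
    if len(xs) < 2:
        return 1.0
    in_order = sum(1 for a, b in zip(xs, xs[1:]) if a <= b)
    return in_order / (len(xs) - 1)


def select_sort(xs, small=32, nearly_sorted=0.9):
    """Hypothetical run-time test: pick an algorithm from input characteristics."""
    if len(xs) <= small or presortedness(xs) >= nearly_sorted:
        return insertion_sort
    return merge_sort


def adaptive_sort(xs):
    """Dispatch to whichever algorithm the run-time test selects."""
    return select_sort(xs)(xs)
```

Both candidates return the same result for any input, so the selection only affects running time. That is the property that lets the choice be deferred to run-time.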
Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit
- Intern. J. High Perf. Comp. Applications
, 2005
Cited by 13 (8 self)
This paper describes capabilities, evolution, performance, and applications of the Global Arrays (GA) toolkit. GA was created to provide application programmers with an interface that allows them to distribute data while maintaining the type of global index space and programming syntax similar to that available when programming on a single processor. The goal of GA is to free programmers from the low-level management of communication and allow them to deal with their problems at the level at which they were originally formulated. At the same time, compatibility of GA with MPI enables the programmer to take advantage of the existing MPI software/libraries when available and appropriate. The variety of applications that have been implemented using Global Arrays attests to the
A Programming Model for Block-Structured Scientific Calculations on SMP Clusters
- Ph.D. Dissertation, UCSD
, 1998
"... [None] ..."

