Titanium: A High-Performance Java Dialect
- In ACM, 1998
Cited by 192 (27 self)
Titanium is a language and system for high-performance parallel scientific computing. Titanium uses Java as its base, thereby leveraging the advantages of that language and allowing us to focus ...
Parallel Programming in Split-C
- In Proceedings of Supercomputing '93, 1993
Cited by 150 (18 self)
We introduce the Split-C language, a parallel extension of C intended for high performance programming on distributed memory multiprocessors, and demonstrate the use of the language in optimizing parallel programs. Split-C provides a global address space with a clear concept of locality and unusual assignment operators. These are used as tools to reduce the frequency and cost of remote access. The language allows a mixture of shared memory, message passing, and data parallel programming styles while providing efficient access to the underlying machine. We demonstrate the basic language concepts using regular and irregular parallel programs and give performance results for various stages of program optimization. 1 Overview. Split-C is a parallel extension of the C programming language that supports efficient access to a global address space on current distributed memory multiprocessors. It retains the "small language" character of C and supports careful engineering and optimization of ...
Empirical Evaluation of the CRAY-T3D: A Compiler Perspective
- 1995
Cited by 51 (9 self)
Most recent MPP systems employ a fast microprocessor surrounded by a shell of communication and synchronization logic. The CRAY-T3D provides an elaborate shell to support global-memory access, prefetch, atomic operations, barriers, and block transfers. We provide a detailed empirical performance characterization of these primitives using micro-benchmarks and evaluate their utility in compiling for a parallel language. We have found that the raw performance of the machine is quite impressive and the most effective forms of communication are prefetch and write. Other shell provisions, such as the bulk transfer engine and the external Annex register set, are cumbersome and of little use. By evaluating the system in the context of a language implementation, we shed light on important trade-offs and pitfalls in the machine architecture. 1 Introduction. In 1991 and 1992 a wave of large-scale parallel machines were announced that followed the "shell" approach [25], including the Thinking...
Analyses and Optimizations for Shared Address Space Programs
- Journal of Parallel and Distributed Computing, 1996
Cited by 41 (10 self)
We present compiler analyses and optimizations for explicitly parallel programs that communicate through a shared address space. Any type of code motion on explicitly parallel programs requires a new kind of analysis to ensure that operations reordered on one processor cannot be observed by another. The analysis, called cycle analysis, is based on work by Shasha and Snir and checks for cycles among interfering accesses. We improve the accuracy of their analysis by using additional information from synchronization analysis, which handles post-wait synchronization, barriers, and locks. We also make the analysis efficient by exploiting the common code image property of SPMD programs. We demonstrate the use of this analysis by optimizing remote access on distributed memory machines by automatically transforming programs written in a conventional shared memory style into Split-C programs, which have primitives for non-blocking memory operations and one-way communication. The optimizations...
The Finite Volume, Finite Element, and Finite Difference Methods as Numerical Methods for Physical Field Problems
- Journal of Computational Physics, 2000
Cited by 38 (1 self)
The present work describes an alternative to the classical partial differential equations-based approach to the discretization of physical field problems. This alternative is based on a preliminary reformulation of the mathematical model in a partially discrete form, which preserves as much as possible the physical and geometrical content of the original problem, and is made possible by the existence and properties of a common mathematical structure of physical field theories. The goal is to maintain the focus, both in the modeling and in the discretization step, on the physics of the problem, thinking in terms of numerical methods for physical field problems, and not for a particular mathematical form (for example, a partial differential equation) into which the original physical problem happens to be translated.
LoGPC: Modeling Network Contention in Message-Passing Programs
- IEEE Transactions on Parallel and Distributed Systems, 1998
Cited by 35 (4 self)
In many real applications, for example those with frequent and irregular communication patterns or those using large messages, network contention and contention for message processing resources can be a significant part of the total execution time. This paper presents a new cost model, called LoGPC, that extends the LogP [9] and LogGP [4] models to account for the impact of network contention and network interface DMA behavior on the performance of message-passing programs. We validate LoGPC by analyzing three applications implemented with Active Messages [11, 18] on the MIT Alewife multiprocessor. Our analysis shows that network contention accounts for up to 50% of the total execution time. In addition, we show that the impact of communication locality on the communication costs is at most a factor of two on Alewife. Finally, we use the model to identify tradeoffs between synchronous and asynchronous message passing styles.
Optimizing Parallel Programs with Explicit Synchronization
- In Proceedings of the ACM SIGPLAN '95 Conference on Programming Language Design and Implementation, 1995
Cited by 32 (1 self)
We present compiler analyses and optimizations for explicitly parallel programs that communicate through a shared address space. Any type of code motion on explicitly parallel programs requires a new kind of analysis to ensure that operations reordered on one processor cannot be observed by another. The analysis, based on work by Shasha and Snir, checks for cycles among interfering accesses. We improve the accuracy of their analysis by using additional information from post-wait synchronization, barriers, and locks. We demonstrate the use of this analysis by optimizing remote access on distributed memory machines. The optimizations include message pipelining, to allow multiple outstanding remote memory operations, conversion of two-way to one-way communication, and elimination of communication through data re-use. The performance improvements are as high as 20-35% for programs running on a CM-5 multiprocessor using the Split-C language as a global address layer.
Mixed Consistency: A Model for Parallel Programming (Extended Abstract)
- 1994
Cited by 26 (4 self)
A general purpose parallel programming model called mixed consistency is developed for distributed shared memory systems. This model combines two kinds of weak memory consistency conditions: causal memory and pipelined random access memory, and provides four kinds of explicit synchronization operations: read locks, write locks, barriers, and await operations. The resulting suite of memory and synchronization operations can be tailored to solve most programming problems in an efficient manner. Conditions are also developed under which the net effect of programming in this model is the same as programming with sequentially consistent memory. Several examples are included to illustrate the model and the correctness conditions. Keywords: distributed shared memory, memory consistency, concurrency, synchronization.
Mimetic Discretizations for Maxwell's Equations and the Equations of Magnetic Diffusion
Cited by 19 (10 self)
We construct reliable finite-difference methods for approximating the solutions to Maxwell's equations and equations of magnetic field diffusion using discrete analogs of differential operators that satisfy the identities and theorems of vector and tensor calculus in discrete form. These methods mimic many fundamental properties of the underlying physical problem, including the conservation laws, the symmetries in the solution, the nondivergence of particular vector fields and they do not allow spurious modes. The constructed method can be applied when there are strongly discontinuous properties of the media and nonorthogonal, nonsmooth computational grids. In this paper we apply discrete vector analysis techniques [1]-[4] to construct mimetic finite-difference methods to Maxwell's first-order curl equations (hyperbolic type) and to the equations of magnetic diffusion (parabolic type). The system of first-order Maxwell's curl equations can be written as follows: $\partial \vec{B} / \partial t = -$cur...
Heap Analysis and Optimizations for Threaded Programs
- In Proceedings of the 1997 Conference on Parallel Architectures and Compilation Techniques, 1997
Cited by 16 (8 self)
Traditional compiler optimizations such as loop invariant removal and common sub-expression elimination are standard in all optimizing compilers. The purpose of this paper is to present new versions of these optimizations that apply to programs using dynamically allocated data structures, and to show the effect of these optimizations on the performance of multithreaded programs. In this paper we show how heap pointer analyses can be used to support better dependence testing, new applications of the above traditional optimizations, and high-quality code generation for multithreaded architectures. We have implemented these analyses and optimizations in the EARTH-C compiler to study their impact on the performance of generated multithreaded code. We provide both static and dynamic measurements showing the effect of the optimizations applied individually, and together. We note several general trends, and discuss the performance tradeoffs, and suggest when specific optimizations are general...

