Results 1 - 10 of 333
Global Optimizations for Parallelism and Locality on Scalable Parallel Machines
- In Proceedings of the SIGPLAN '93 Conference on Programming Language Design and Implementation, 1993
"... Data locality is critical to achieving high performance on large-scale parallel machines. Non-local data accesses result in communication that can greatly impact performance. Thus the mapping, or decomposition, of the computation and data onto the processors of a scalable parallel machine is a key i ..."
Abstract - Cited by 256 (20 self)
Data locality is critical to achieving high performance on large-scale parallel machines. Non-local data accesses result in communication that can greatly impact performance. Thus the mapping, or decomposition, of the computation and data onto the processors of a scalable parallel machine is a key issue in compiling programs for these architectures.
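A minimal sketch of the decomposition idea mentioned above, assuming a simple one-dimensional block distribution and an owner-computes rule; the function names are illustrative and not taken from the paper:

```python
def block_bounds(n, p, num_procs):
    """Half-open range [lo, hi) of array elements owned by processor p."""
    block = (n + num_procs - 1) // num_procs   # ceiling division
    lo = p * block
    hi = min(n, lo + block)
    return lo, hi

def local_compute(a, b, p, num_procs):
    """Owner computes: each processor updates only the elements it owns,
    so this loop runs without any non-local accesses."""
    lo, hi = block_bounds(len(a), p, num_procs)
    for i in range(lo, hi):
        a[i] = 2 * b[i]

if __name__ == "__main__":
    n, num_procs = 10, 4
    a, b = [0] * n, list(range(n))
    for p in range(num_procs):       # simulate the processors in turn
        local_compute(a, b, p, num_procs)
    print(a)                         # [0, 2, 4, ..., 18]
```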
Data and Computation Transformations for Multiprocessors
- In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1995
"... Effective memory hierarchy utilization is critical to the performance of modern multiprocessor architectures. We havedeveloped the first compiler system that fully automatically parallelizes sequential programs and changes the original array layouts to improve memory system performance. Our optimiza ..."
Abstract - Cited by 177 (15 self)
Effective memory hierarchy utilization is critical to the performance of modern multiprocessor architectures. We have developed the first compiler system that fully automatically parallelizes sequential programs and changes the original array layouts to improve memory system performance. Our optimization algorithm consists of two steps. The first step chooses the parallelization and computation assignment such that synchronization and data sharing are minimized. The second step then restructures the layout of the data in the shared address space with an algorithm that is based on a new data transformation framework. We ran our compiler on a set of application programs and measured their performance on the Stanford DASH multiprocessor. Our results show that the compiler can effectively optimize parallelism in conjunction with memory subsystem performance. 1 Introduction In the last decade, microprocessor speeds have been steadily improving at a rate of 50% to 100% every year [16]. Meanwh...
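As a rough illustration of the second step, the sketch below re-lays out a column-distributed matrix so that each processor's columns become contiguous in memory, the kind of layout change that reduces data sharing across processors; the cyclic distribution and the function name are assumptions for illustration, not the compiler's actual transformation:

```python
def to_processor_major(matrix, num_procs):
    """Gather each processor's (cyclically distributed) columns into one
    contiguous buffer, so processors touch disjoint regions of memory."""
    rows, cols = len(matrix), len(matrix[0])
    blocks = []
    for p in range(num_procs):
        owned = [c for c in range(cols) if c % num_procs == p]
        # flatten the owned columns into one contiguous block
        block = [matrix[r][c] for c in owned for r in range(rows)]
        blocks.append(block)
    return blocks

if __name__ == "__main__":
    m = [[4 * r + c for c in range(4)] for r in range(3)]
    print(to_processor_major(m, num_procs=2))
    # [[0, 4, 8, 2, 6, 10], [1, 5, 9, 3, 7, 11]]
```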
Symbolic Bounds Analysis of Pointers, Array Indices, and Accessed Memory Regions
- PLDI 2000, 2000
"... This paper presents a novel framework for the symbolic bounds analysis of pointers, array indices, and accessed memory regions. Our framework formulates each analysis problem as a system of inequality constraints between symbolic bound polynomials. It then reduces the constraint system to a linear p ..."
Abstract - Cited by 134 (15 self)
This paper presents a novel framework for the symbolic bounds analysis of pointers, array indices, and accessed memory regions. Our framework formulates each analysis problem as a system of inequality constraints between symbolic bound polynomials. It then reduces the constraint system to a linear program. The solution to the linear program provides symbolic lower and upper bounds for the values of pointer and array index variables and for the regions of memory that each statement and procedure accesses. This approach eliminates fundamental problems associated with applying standard fixed-point approaches to symbolic analysis problems. Experimental results from our implemented compiler show that the analysis can solve several important problems, including static race detection, automatic parallelization, static detection of array bounds violations, elimination of array bounds checks, and reduction of the number of bits used to store computed values.
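A toy example of the flavor of this analysis, assuming bounds are tracked as linear expressions c0 + c1*n in a single symbolic parameter n; the representation and helper names are hypothetical, not the paper's framework:

```python
def add_bounds(b1, b2):
    """Add two symbolic bounds; (c0, c1) stands for c0 + c1*n."""
    return (b1[0] + b2[0], b1[1] + b2[1])

def may_exceed(upper, limit):
    """Conservatively decide whether `upper` can exceed `limit`,
    comparing coefficients (n is assumed non-negative and unbounded)."""
    if upper[1] != limit[1]:
        return upper[1] > limit[1]
    return upper[0] > limit[0]

# Loop:  for i in 0 .. n-1:  a[i + 1] = ...   with len(a) == n
i_upper = (-1, 1)                          # i <= n - 1
idx_upper = add_bounds(i_upper, (1, 0))    # i + 1 <= n
last_valid = (-1, 1)                       # highest legal index is n - 1
print(may_exceed(idx_upper, last_valid))   # True: the access a[n] is possible
```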
A Survey of Collective Communication in Wormhole-Routed Massively Parallel Computers
- IEEE Computer, 1994
"... Massively parallel computers (MPC) are characterized by the distribution of memory among an ensemble of nodes. Since memory is physically distributed, MPC nodes communicate by sending data through a network. In order to program an MPC, the user may directly invoke low-level message passing primitive ..."
Abstract - Cited by 110 (6 self)
Massively parallel computers (MPC) are characterized by the distribution of memory among an ensemble of nodes. Since memory is physically distributed, MPC nodes communicate by sending data through a network. In order to program an MPC, the user may directly invoke low-level message passing primitives, may use a higher-level communications library, or may write the program in a data parallel language and rely on the compiler to translate language constructs into communication operations. Whichever method is used, the performance of communication operations directly affects the total computation time of the parallel application. Communication operations may be either point-to-point, which involves a single source and a single destination, or collective, in which more than two processes participate. This paper discusses the design of collective communication operations for current systems that use the wormhole routing switching strategy, in which messages are divided into small pieces and...
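A small sketch of how one collective operation can be composed from point-to-point messages, here a binomial-tree broadcast from process 0; this is a generic illustration and not a wormhole-routing algorithm from the survey:

```python
import math

def binomial_broadcast(num_procs):
    """Messages per round for a binomial-tree broadcast from process 0,
    built purely from point-to-point sends."""
    have_data = {0}
    rounds = []
    for _ in range(math.ceil(math.log2(max(num_procs, 2)))):
        reach = len(have_data)                       # processes reached so far
        sends = [(src, src + reach) for src in sorted(have_data)
                 if src + reach < num_procs]
        have_data.update(dst for _, dst in sends)    # reached set doubles
        rounds.append(sends)
    return rounds

# 8 processes: 3 rounds with 1, 2, and 4 messages respectively
print(binomial_broadcast(8))
```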
Compiler Optimizations for Eliminating Barrier Synchronization
- In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1995
"... This paper presents novel compiler optimizations for reducing synchronization overhead in compiler-parallelized scientific codes. A hybrid programming model is employed to combine the flexibility of the fork-join model with the precision and power of the singleprogram, multiple data (SPMD) model. By ..."
Abstract - Cited by 91 (13 self)
This paper presents novel compiler optimizations for reducing synchronization overhead in compiler-parallelized scientific codes. A hybrid programming model is employed to combine the flexibility of the fork-join model with the precision and power of the single-program, multiple-data (SPMD) model. By exploiting compile-time computation partitions, communication analysis can eliminate barrier synchronization or replace it with less expensive forms of synchronization. We show that computation partitions and data communication can be represented as systems of symbolic linear inequalities for high flexibility and precision. These optimizations have been implemented in the Stanford SUIF compiler. We extensively evaluate their performance using standard benchmark suites. Experimental results show barrier synchronization is reduced by 29% on average and by several orders of magnitude for certain programs. 1 Introduction Parallel machines with shared address spaces and coherent caches provide an attracti...
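A simplified sketch of the kind of reasoning involved, assuming block-partitioned loops: if every processor's reads in the second loop fall inside its own writes from the first loop, the intervening barrier is unnecessary. The helper names are hypothetical; this is not the SUIF implementation:

```python
def block(n, p, P):
    """Iterations of a length-n loop assigned to processor p of P (block partition)."""
    b = (n + P - 1) // P
    return range(p * b, min(n, (p + 1) * b))

def barrier_needed(n, P, writes_loop1, reads_loop2):
    """True if some processor reads, in the second loop, an element it did
    not write itself in the first loop (so the barrier must stay)."""
    for p in range(P):
        written = {x for i in block(n, p, P) for x in writes_loop1(i)}
        read    = {x for i in block(n, p, P) for x in reads_loop2(i)}
        if not read <= written:
            return True
    return False

n, P = 16, 4
# loop 1: a[i] = ...   loop 2: b[i] = a[i]      -> barrier can be eliminated
print(barrier_needed(n, P, lambda i: {i}, lambda i: {i}))        # False
# loop 1: a[i] = ...   loop 2: b[i] = a[i + 1]  -> barrier must remain
print(barrier_needed(n, P, lambda i: {i}, lambda i: {i + 1}))    # True
```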
Commutativity Analysis: A New Analysis Technique for Parallelizing Compilers
- ACM Transactions on Programming Languages and Systems, 1997
"... This article presents a new analysis technique, commutativity analysis, for automatically parallelizing computations that manipulate dynamic, pointer-based data structures. Commutativity analysis views the computation as composed of operations on objects. It then analyzes the program at this granula ..."
Abstract - Cited by 87 (11 self)
This article presents a new analysis technique, commutativity analysis, for automatically parallelizing computations that manipulate dynamic, pointer-based data structures. Commutativity analysis views the computation as composed of operations on objects. It then analyzes the program at this granularity to discover when operations commute (i.e., generate the same final result regardless of the order in which they execute). If all of the operations required to perform a given computation commute, the compiler can automatically generate parallel code. We have implemented a prototype compilation system that uses commutativity analysis as its primary analysis technique
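A minimal illustration of the commutativity property itself, using a hypothetical object class rather than anything from the article: two operations commute when either execution order leaves the object in the same final state, which is what licenses parallel execution:

```python
class Node:
    """Hypothetical object whose operations commute: the additions do not
    depend on the order in which calls arrive."""
    def __init__(self):
        self.total = 0
        self.visits = 0

    def visit(self, weight):
        self.total += weight
        self.visits += 1

def commutes(make_obj, op_a, op_b):
    """Execute the two operations in both orders on fresh objects and
    compare the final states."""
    x, y = make_obj(), make_obj()
    op_a(x); op_b(x)
    op_b(y); op_a(y)
    return vars(x) == vars(y)

# Both orders leave the node with total == 8 and visits == 2,
# so the two invocations could safely run in parallel.
print(commutes(Node, lambda n: n.visit(3), lambda n: n.visit(5)))   # True
```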
An Integrated Compile-Time/Run-Time Software Distributed Shared Memory System
- In Proceedings of the 7th Symposium on Architectural Support for Programming Languages and Operating Systems, 1996
"... On a distributed memory machine, hand-coded message passing leads to the most efficient execution, but it is difficult to use. Parallelizing compilers can approach the performance of hand-coded message passing by translating data-parallel programs into message passing programs, but efficient executi ..."
Abstract - Cited by 82 (10 self)
On a distributed memory machine, hand-coded message passing leads to the most efficient execution, but it is difficult to use. Parallelizing compilers can approach the performance of hand-coded message passing by translating data-parallel programs into message passing programs, but efficient execution is limited to those programs for which precise analysis can be carried out. Shared memory is easier to program than message passing and its domain is not constrained by the limitations of parallelizing compilers, but it lags in performance. Our goal is to close that performance gap while retaining the benefits of shared memory. In other words, our goal is (1) to make shared memory as efficient as message passing, whether hand-coded or compiler-generated, (2) to retain its ease of programming, and (3) to retain the broader class of applications it supports. To this end we have designed and implemented an integrated compile-time and run-time software DSM system. The programming model remain...
A Linear Algebra Framework for Static HPF Code Distribution
, 1995
"... High Performance Fortran (hpf) was developed to support data parallel programming for simd and mimd machines with distributed memory. The programmer is provided a familiar uniform logical address space and specifies the data distribution by directives. The compiler then exploits these directives to ..."
Abstract - Cited by 81 (11 self)
High Performance Fortran (HPF) was developed to support data parallel programming for SIMD and MIMD machines with distributed memory. The programmer is provided a familiar uniform logical address space and specifies the data distribution by directives. The compiler then exploits these directives to allocate arrays in the local memories, to assign computations to elementary processors, and to migrate data between processors when required. We show here that linear algebra is a powerful framework to encode HPF directives and to synthesize distributed code with space-efficient array allocation, tight loop bounds, and vectorized communications for INDEPENDENT loops. The generated code includes traditional optimizations such as guard elimination, message vectorization and aggregation, overlap analysis... The systematic use of an affine framework makes it possible to prove the compilation scheme correct. An early version of this paper was presented at the Fourth International Workshop on Comp...
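A small sketch of the element-to-processor mappings that BLOCK and CYCLIC distribution directives induce, written as simple arithmetic formulas; these are standard textbook mappings shown for illustration, not the paper's linear-algebra encoding:

```python
def block_owner(i, n, P):
    """(processor, local offset) of global element i under a BLOCK
    distribution of an n-element array over P processors."""
    b = (n + P - 1) // P          # block size
    return i // b, i % b

def cyclic_owner(i, P):
    """(processor, local offset) of global element i under CYCLIC."""
    return i % P, i // P

n, P = 10, 3
print([block_owner(i, n, P) for i in range(n)])   # blocks of size 4, 4, 2
print([cyclic_owner(i, P) for i in range(n)])     # round-robin assignment
```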
On the Design of Chant: A Talking Threads Package
- In Proc. Supercomputing '94, pp. 350-359, Washington, D.C., 1994
"... Lightweight threads are becoming increasingly useful in supporting parallelism and asynchronous control structures in applications and language implementations. However, lightweight thread packages traditionally support only shared memory synchronization and communication primitives, limiting their ..."
Abstract - Cited by 79 (11 self)
Lightweight threads are becoming increasingly useful in supporting parallelism and asynchronous control structures in applications and language implementations. However, lightweight thread packages traditionally support only shared memory synchronization and communication primitives, limiting their use in distributed memory environments. We introduce the design of a runtime interface, called Chant, that supports lightweight threads with the capability of communication using both point-to-point and remote service request primitives, built from standard message passing libraries. This is accomplished by extending the POSIX pthreads interface with global thread identifiers, global thread operations, and message passing primitives. This paper introduces the Chant interface and describes the runtime issues in providing an efficient, portable implementation of such an interface. In particular, we present performance results of the initial portion of our runtime system: point-to-point message passing among threads. We examine the issue of thread scheduling in the presence of polling for messages, and measure the overhead incurred when using this interface as opposed to using the underlying communication layer directly. We show that our design can accommodate various polling methods, depending on the level of support present in the underlying thread system, and imposes little overhead in point-to-point message passing over the existing communication layer.
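A rough sketch of the programming model described above, using Python threads and queues in place of pthreads and a message-passing library; the names chant_send, chant_recv, and the global-thread-id registry are hypothetical stand-ins for the Chant interface, not its actual API:

```python
import threading, queue

mailboxes = {}                          # global thread id -> message queue

def register(gtid):
    """Create a mailbox for a globally named thread before it starts."""
    mailboxes[gtid] = queue.Queue()

def chant_send(dst_gtid, msg):
    """Point-to-point message to a thread anywhere in the registry."""
    mailboxes[dst_gtid].put(msg)

def chant_recv(my_gtid):
    """Blocking receive (a real runtime might poll the network instead)."""
    return mailboxes[my_gtid].get()

def worker(my_gtid, peer_gtid):
    chant_send(peer_gtid, f"hello from {my_gtid}")
    print(my_gtid, "got:", chant_recv(my_gtid))

if __name__ == "__main__":
    # a global thread id could be a (node, local id) pair on a real machine
    g1, g2 = ("node0", 1), ("node0", 2)
    for g in (g1, g2):
        register(g)
    threads = [threading.Thread(target=worker, args=(g1, g2)),
               threading.Thread(target=worker, args=(g2, g1))]
    for t in threads: t.start()
    for t in threads: t.join()
```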