Results 11 - 20 of 20
Compile-time Synchronization Optimizations for Software DSMs
, 1998
Cited by 9 (4 self)
Software distributed-shared-memory (DSM) systems provide a desirable target for parallelizing compilers due to their flexibility. However, studies show synchronization and load imbalance are significant sources of overhead. In this paper, we investigate the impact of compilation techniques for eliminating synchronization overhead in software DSMs, developing new algorithms to handle situations found in practice. We evaluate the contributions of synchronization elimination algorithms based on 1) dependence analysis, 2) communication analysis, 3) exploiting coherence protocols in software DSMs, and 4) aggressive expansion of parallel SPMD regions. We also found suppressing expensive parallelism to be useful for one application. Experiments indicate these techniques eliminate almost all parallel task invocations and reduce the number of barriers executed by 66% on average. On a 16-processor IBM SP-2, speedups are improved on average by 35%, and are tripled for some applications.
Reducing synchronization overhead for compiler-parallelized codes on software DSMs
- Languages and Compilers for Parallel Computing, Tenth International Workshop, LCPC'97, volume 1366 of Lecture Notes in Computer Science
, 1997
Cited by 7 (6 self)
Software distributed-shared-memory (DSM) systems provide an appealing target for parallelizing compilers due to their flexibility. Previous studies demonstrate such systems can provide performance comparable to message-passing compilers for dense-matrix kernels. However, synchronization and load imbalance are significant sources of overhead. In this paper, we investigate the impact of compilation techniques for eliminating barrier synchronization overhead in software DSMs. Our compile-time barrier elimination algorithm extends previous techniques in three ways: 1) we perform inexpensive communication analysis through local subscript analysis when using chunk iteration partitioning for parallel loops, 2) we exploit delayed updates in lazy-release-consistency DSMs to eliminate barriers guarding only anti-dependences, and 3) when possible we replace barriers with customized nearest-neighbor synchronization. Experiments on an IBM SP-2 indicate these techniques can improve parallel performance by 20% on average and by up to 60% for some applications.
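The idea of checking whether a barrier can be replaced by nearest-neighbor synchronization can be illustrated with a small sketch. It is not the paper's algorithm: a compiler would reason symbolically about subscripts, whereas this version simply enumerates iterations. The function names, the `a[i + c]` reference pattern, and the ceiling-based chunk size are all assumptions made for illustration.

```python
def chunk_owner(i, n, p):
    """Processor that executes iteration i when n iterations are split
    into p contiguous chunks (assumed chunk size: ceil(n / p))."""
    chunk = -(-n // p)  # ceiling division
    return i // chunk


def neighbors_needed(offsets, n, p):
    """For each processor, the set of other processors whose chunks it
    reads when the loop body references a[i + c] for each offset c."""
    needed = {q: set() for q in range(p)}
    for i in range(n):
        me = chunk_owner(i, n, p)
        for c in offsets:
            j = i + c
            if 0 <= j < n:
                other = chunk_owner(j, n, p)
                if other != me:
                    needed[me].add(other)
    return needed


def nearest_neighbor_only(offsets, n, p):
    """True if every cross-processor read targets an adjacent processor,
    so pairwise synchronization could stand in for a global barrier."""
    needed = neighbors_needed(offsets, n, p)
    return all(abs(q - me) == 1 for me, qs in needed.items() for q in qs)
```

With 16 iterations on 4 processors, a stencil reading `a[i - 1]` and `a[i + 1]` only touches adjacent chunks, while a reference such as `a[i + 8]` reaches a non-adjacent processor and would still need wider synchronization.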
Eliminating Barrier Synchronization for Compiler-Parallelized Codes on Software DSMs
- International Journal of Parallel Programming
, 1998
Cited by 6 (0 self)
Software distributed-shared-memory (DSM) systems provide an appealing target for parallelizing compilers due to their flexibility. Previous studies demonstrate such systems can provide performance comparable to message-passing compilers for dense-matrix kernels. However, synchronization and load imbalance are significant sources of overhead. In this paper, we investigate the impact of compilation techniques for eliminating barrier synchronization overhead in software DSMs. Our compile-time barrier elimination algorithm extends previous techniques in three ways: 1) we perform inexpensive communication analysis through local subscript analysis when using chunk iteration partitioning for parallel loops, 2) we exploit delayed updates in lazy-release-consistency DSMs to eliminate barriers guarding only anti-dependences, 3) when possible we replace barriers with customized nearest-neighbor synchronization. Experiments on an IBM SP-2 indicate these techniques can improve parallel performance ...
Software Support For Improving Locality in Advanced Scientific Codes
, 2000
Cited by 5 (0 self)
Programs can achieve good performance only if they possess data locality. This paper describes our proposal to develop and evaluate software support for improving locality for advanced scientific applications for both sequential and parallel machines. The basic premise is that both compile-time analyses and sophisticated run-time systems are necessary. Run-time systems are needed because many programs are not analyzable statically. Compiler support is crucial both for inserting interfaces to the run-time system and for directly applying program transformations where possible. We examine locality optimizations needed for three features of advanced scientific applications (3D arrays, irregular accesses, and pointers). Preliminary experimental evaluation is very encouraging, but much work remains to automate and improve these compiler and run-time systems. We propose to extend locality optimizations in several directions, handling: cache conflicts between multiple data, deep memory hierar...
Efficient Support for Two-Dimensional Data Distributions in Distributed Shared Memory Systems
, 2001
Cited by 2 (2 self)
Despite their clear advantage in scalability, two-dimensional data distributions are not efficiently supported by current distributed shared memory (DSM) systems. This is because sharing between nodes occurs on both columns and rows. Sharing in two dimensions is not a good match for DSM systems, because either a row- or column-major data layout of pages leads to (1) severe thrashing, if a strong memory consistency is used, or (2) exchange of unnecessary data between nodes, if a relaxed memory consistency is used. This paper introduces the 2D protocol, which is a variant of the write-shared protocol that efficiently supports two-dimensional data distributions. It does this by providing an API through which the user informs the DSM system of truly shared elements within pages. This allows the DSM, at synchronization points, to send only truly shared data, instead of all the changes made to the entire page. As the problem size and/or the number of nodes grows, programs written using a...
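The difference between sending every change made to a page and sending only the truly shared elements can be sketched as follows. This is a toy model, not the 2D protocol's actual API: the `Page` class, `register_shared`, and the twin-based diffing are hypothetical names and an assumed mechanism chosen only to make the contrast concrete.

```python
class Page:
    """Toy model of one DSM page (hypothetical, for illustration only)."""

    def __init__(self, size):
        self.data = [0] * size
        self.twin = list(self.data)  # assumed: pristine copy used for diffing
        self.shared = set()          # indices the user declared truly shared

    def register_shared(self, indices):
        """Stand-in for the user-facing call that marks shared elements."""
        self.shared.update(indices)

    def write(self, idx, value):
        self.data[idx] = value

    def full_diff(self):
        """Every element changed since the twin was taken."""
        return {i: v for i, (v, old) in enumerate(zip(self.data, self.twin))
                if v != old}

    def shared_diff(self):
        """Only the changed elements that were registered as shared."""
        return {i: v for i, v in self.full_diff().items() if i in self.shared}
```

If a node writes all eight elements of a page but only the two boundary elements are registered as shared, `full_diff()` carries eight entries while `shared_diff()` carries two, which is the saving the abstract describes at synchronization points.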
SUIF-Adapt: An Integrated Compiler/Run-Time System for Global and Dynamic Data Distributions
, 2002
Cited by 2 (0 self)
Distributing data is one of the key problems in implementing efficient distributed-memory parallel programs. The problem is especially difficult in programs where (1) data redistribution between computational phases is considered or (2) the participating processors (nodes) executing a parallel application are not dedicated. In either case, the commonly used BLOCK and CYCLIC distributions no longer suffice. We have investigated this problem...
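For readers unfamiliar with the BLOCK and CYCLIC distributions mentioned above, a minimal sketch of their ownership functions follows. The function names and the ceiling-based block size are assumptions for illustration; the sketch says nothing about the more general distributions the paper goes on to investigate.

```python
def block_owner(i, n, p):
    """BLOCK: contiguous runs of ceil(n / p) elements per processor."""
    size = -(-n // p)  # ceiling division
    return i // size


def cyclic_owner(i, p):
    """CYCLIC: elements are dealt out round-robin across processors."""
    return i % p


if __name__ == "__main__":
    n, p = 8, 4
    print([block_owner(i, n, p) for i in range(n)])
    print([cyclic_owner(i, p) for i in range(n)])
```

Both mappings are fixed once `n` and `p` are known, which is why they fit poorly when data must be redistributed between phases or when processors are not dedicated.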
Design and Evaluation of a Computation Partitioning Framework for Data-Parallel Compilers
, 2001
Cited by 1 (1 self)
In this paper, we present the design and evaluation of a flexible computation partitioning framework used in the Rice dHPF compiler for High Performance Fortran. Our CP framework supports a more general class of static computation partitionings than previous data-parallel compilers, enables sophisticated partitionings that maximize parallelism in the presence of arbitrary control flow, and supports several novel optimizations that have proven essential for obtaining high overall performance when parallelizing scientific programs. In earlier work, we have shown that the dHPF compiler is able to effectively parallelize HPF versions of existing Fortran codes and achieve speedups that are comparable with hand-coded parallel performance [2, 1]. For example, code generated by dHPF for the NAS application benchmarks SP and BT is within 0-21% of the performance of sophisticated hand-coded message-passing versions of the codes, and these results are achieved with HPF versions that require changes to fewer than 6% of the lines of the original serial codes. Three new CP-based optimizations in the dHPF compiler were key to achieving this level of performance. Two of these three optimizations, along with another new algorithm presented in this paper, require the full generality of our CP framework and could not be implemented in any other existing compiler that we are aware of. In data distribution-based languages such as Fortran D [13], Vienna Fortran [11], and High Performance Fortran (HPF) [20], it is natural to represent a CP for a statement instance as the set of processor(s) that "own" a particular array element or scalar variable (by virtue of a data distribution). For example, the widely-used owner-computes rule [25] (a simple heuristic for computation partitioning selection) ...
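The owner-computes rule named above can be sketched in a few lines: each iteration of an assignment runs on the processor that owns the left-hand-side element. The statement `a[i] = b[i-1] + b[i+1]`, the function name, and the `owner` callback are assumptions for illustration and do not reflect how dHPF represents computation partitionings internally.

```python
def owner_computes(n, p, owner):
    """Partition iterations 1..n-2 of `a[i] = b[i-1] + b[i+1]` so that
    iteration i runs on the processor owning a[i].

    `owner` is a hypothetical callback mapping an index to a processor id.
    """
    cp = {q: [] for q in range(p)}
    for i in range(1, n - 1):
        cp[owner(i)].append(i)
    return cp


if __name__ == "__main__":
    # Block distribution of 8 elements over 2 processors.
    print(owner_computes(8, 2, lambda i: i // 4))
    # Cyclic distribution of the same array.
    print(owner_computes(8, 2, lambda i: i % 2))
```

Changing only the data distribution changes the computation partitioning, which shows how tightly the heuristic couples the two.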
Hardware support for flexible distributed shared memory
- IEEE Transactions on Computers
, 1998
Cited by 1 (0 self)
Workstation-based parallel systems are attractive due to their low cost and competitive uniprocessor performance. However, supporting a cache-coherent global address space on these systems involves significant overheads. We examine two approaches to coping with these overheads. First, DSM-specific hardware can be added to the off-the-shelf component base to reduce overheads. Second, application-specific coherence protocols can avoid some overheads by exploiting programmer (or compiler) knowledge of an application’s communication patterns. To explore the interaction between these approaches, we simulated four designs that add DSM acceleration hardware to a collection of off-the-shelf workstation nodes. Three of the designs support user-level software coherence protocols, enabling application-specific protocol optimizations. To verify the feasibility of our hardware approach, we constructed a prototype of the simplest design. Measured speedups from the prototype match simulation results closely. We find that, even with aggressive DSM hardware support, custom protocols can provide significant speedups for some applications. In addition, the custom protocols are generally effective at reducing the impact of other overheads, including those due to less aggressive hardware support and larger network latencies. However, for three of our benchmarks, the additional hardware acceleration provided by our most aggressive design avoids the need to develop more efficient custom protocols. Index Terms: Parallel systems, distributed shared memory, cache coherence protocols, fine-grain cache coherence, coherence protocol optimization, workstation clusters.
Reflections on "Tempest and Typhoon: User-level Shared Memory"
, 1994
Introduction. Tempest and Typhoon have emerged as among the most influential contributions of the Wisconsin Wind Tunnel project, a collaborative effort with Prof. Mark D. Hill, several staff members, and a large group of graduate students. This retrospective focuses on the origins of the Tempest and Typhoon ideas and their subsequent evolution. The Beginnings. The seeds of the project began to germinate in late 1990 and early 1991 with our effort to rapidly prototype large-scale shared-memory multiprocessors. Because other research groups had a one- to two-year lead in their prototyping efforts (and considerably more resources), our project started with the goal of exploiting the parallel computers that our department was acquiring with funding from NSF's Institutional Infrastructure program. During this exploratory phase, we made the essential observation that shared-memory systems permit a continuum of implementations, ranging from full hardware support to s
An Integrated Compiler-Time/Run-Time Approach to Reducing Contention in OpenMP Programs
Contention is one of the largest obstacles to the scalability of many software DSM programs. It is caused by multiple threads reading from one thread simultaneously. In this paper, we present an integrated compiler/runtime approach to reducing communication contention in software DSM programs in the context of supporting OpenMP on networks of workstations. Our approach relies on a combination of compiler and runtime support to precisely predict the global communication patterns for both regular and irregular applications. The compiler uses regular section analysis to compute each thread's access pattern between synchronization points. The access pattern is then combined with runtime information to derive global communication patterns, which are used by the runtime system to optimize communication. The optimizations include multicast and communication staggering. We measure the effect of the optimizations on a 32-node Pentium-II cluster for four applications: Modified Gram-Schmidt, 3D-F...
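Communication staggering can be illustrated with a simple rotation schedule in which no two threads read from the same source in the same round. This is only one possible schedule, assumed here for illustration; the abstract does not specify how the runtime actually orders requests, and both function names are hypothetical.

```python
def staggered_schedule(p):
    """Rounds 1..p-1: in round r, thread t reads from thread (t + r) % p.

    Returns a list of rounds, each a list of (reader, source) pairs.
    """
    return [[(t, (t + r) % p) for t in range(p)] for r in range(1, p)]


def max_readers_per_source(round_pairs):
    """Largest number of simultaneous readers hitting any one source."""
    counts = {}
    for _, src in round_pairs:
        counts[src] = counts.get(src, 0) + 1
    return max(counts.values())


if __name__ == "__main__":
    for rnd in staggered_schedule(4):
        print(rnd, max_readers_per_source(rnd))
```

In an unstaggered pattern where every thread first fetches from thread 0, that thread serves all requests at once; under the rotation each source serves exactly one reader per round, and every reader still visits every other thread once.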

