Global Communication Analysis and Optimization
In Proc. ACM SIGPLAN Conference on Programming Language Design and Implementation, 1996
Cited by 46 (2 self)
Reducing communication cost is crucial to achieving good performance on scalable parallel machines. This paper presents a new compiler algorithm for global analysis and optimization of communication in data-parallel programs. Our algorithm is distinct from existing approaches in that rather than handling loop-nests and array references one by one, it considers all communication in a procedure and their interactions under different placements before making a final decision on the placement of any communication. It exploits the flexibility resulting from this advanced analysis to eliminate redundancy, reduce the number of messages, and reduce contention for cache and communication buffers, all in a unified framework. In contrast, single loop-nest analysis often retains redundant communication, and more aggressive dataflow analysis on array sections can generate too many messages or cache and buffer contention. The algorithm has been implemented in the IBM pHPF compiler for High Performan...
Compiler and Software Distributed Shared Memory Support for Irregular Applications
1997
Cited by 46 (3 self)
We investigate the use of a software distributed shared memory (DSM) layer to support irregular computations on distributed memory machines. Software DSM supports irregular computation through demand fetching of data in response to memory access faults. With the addition of a very limited form of compiler support, namely the identification of the section of the indirection array accessed by each processor, many of these on-demand page fetches can be aggregated into a single message, and prefetched prior to the access fault. We have measured the performance of this approach for two irregular applications, moldyn and nbf, using the TreadMarks DSM system on an 8-processor IBM SP2. We find that it has similar performance to the inspector-executor method supported by the CHAOS run-time library, while requiring much simpler compile-time support. For moldyn, it is up to 23% faster than CHAOS, depending on the input problem’s characteristics; and for nbf, it is no worse than 14% slower. If we include the execution time of the inspector, the software DSM-based approach is always faster than CHAOS. The advantage of this approach increases as the frequency of changes to the indirection array increases. The disadvantage of this approach is the potential for false sharing overhead when the data set is small or has poor spatial locality,
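The aggregation described in this abstract can be illustrated with a small message-counting sketch. It is not the TreadMarks or CHAOS API. The function names, the `elems_per_page` parameter, and the `page_owner` mapping are all hypothetical, and the sketch assumes every touched page starts out remote and that an aggregated request costs one message per owning processor.

```python
def pages_touched(indirection, start, stop, elems_per_page):
    """Sorted page numbers referenced through indirection[start:stop]."""
    return sorted({idx // elems_per_page for idx in indirection[start:stop]})


def demand_fetch_messages(indirection, start, stop, elems_per_page):
    """Messages under pure demand fetching: one per first-touch page fault."""
    cached = set()
    messages = 0
    for idx in indirection[start:stop]:
        page = idx // elems_per_page
        if page not in cached:
            cached.add(page)   # the fault brings the page in
            messages += 1
    return messages


def aggregated_fetch_messages(indirection, start, stop, elems_per_page, page_owner):
    """Messages when the needed pages are requested up front.

    Assumption: one combined request per processor that owns a needed page.
    """
    needed = pages_touched(indirection, start, stop, elems_per_page)
    return len({page_owner(page) for page in needed})


if __name__ == "__main__":
    # Hypothetical section of an indirection array accessed by one processor.
    indirection = [0, 9, 17, 3, 25, 18, 10]
    print(demand_fetch_messages(indirection, 0, 7, 8))
    print(aggregated_fetch_messages(indirection, 0, 7, 8, lambda p: p // 2))
```

Knowing only which section of the indirection array a processor will read is enough to compute the page set before the loop runs, which is the limited compiler support the abstract refers to.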
Compiling for the Multiscalar Architecture
1998
Cited by 45 (2 self)
High-performance, general-purpose microprocessors serve as compute engines for computers ranging from personal computers to supercomputers. Sequential programs constitute a major portion of the real-world software that runs on these computers. State-of-the-art microprocessors exploit instruction level parallelism (ILP) to achieve high performance on such applications by searching for independent instructions in a dynamic window of instructions and executing them on a wide-issue pipeline. Increasing the window size and the issue width to extract more ILP may hinder achieving high clock speeds, limiting overall performance. The Multiscalar architecture employs multiple small windows and many narrow-issue processing units to exploit ILP at high clock speeds. Sequential programs are partitioned into code fragments called tasks, which are speculatively executed in parallel. Inter-task register dependences are honored via communication and synchronization and inter-task control flow and memory depe...
A Unified Framework for Optimizing Communication in Data-Parallel Programs
IEEE Transactions on Parallel and Distributed Systems, 1996
Cited by 34 (1 self)
This paper presents a framework, based on global array data-flow analysis, to reduce communication costs in a program being compiled for a distributed memory machine. We introduce the available section descriptor, a novel representation of communication involving array sections. This representation allows us to apply techniques for partial redundancy elimination to obtain powerful communication optimizations. With a single framework, we are able to capture optimizations like (i) vectorizing communication, (ii) eliminating communication that is redundant on any control flow path, (iii) reducing the amount of data being communicated, (iv) reducing the number of processors to which data must be communicated, and (v) moving communication earlier to hide latency, and to subsume previous communication. We show that the bidirectional problem of eliminating partial redundancies can be decomposed into simpler unidirectional problems even in the context of an array section representation, w...
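A much-simplified illustration of optimizations (ii) and (iii) from this abstract: along a single straight-line path, a section that has already been received does not need to be sent again. This is not the paper's available section descriptor. It assumes one-dimensional half-open index ranges, no control flow, and no writes to the remote data between uses, and the names `subtract` and `plan_communication` are hypothetical.

```python
def subtract(needed, available):
    """Parts of the half-open range `needed` not covered by any range in `available`."""
    lo, hi = needed
    missing = []
    for a_lo, a_hi in sorted(available):
        if a_hi <= lo or a_lo >= hi:
            continue                      # no overlap with what is still needed
        if a_lo > lo:
            missing.append((lo, a_lo))    # gap before this available range
        lo = max(lo, a_hi)
        if lo >= hi:
            break
    if lo < hi:
        missing.append((lo, hi))
    return missing


def plan_communication(uses):
    """Ranges that must actually be communicated for a sequence of remote uses."""
    available = []
    messages = []
    for needed in uses:
        missing = subtract(needed, available)
        messages.extend(missing)          # communicate only what is not yet available
        available.extend(missing)
    return messages


if __name__ == "__main__":
    # Three uses of a remote array section; the second is fully redundant and
    # the third needs only the part not already received.
    print(plan_communication([(0, 50), (0, 50), (25, 75)]))
```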
An HPF Compiler for the IBM SP2
1995
Cited by 34 (0 self)
We describe pHPF, a research prototype HPF compiler for the IBM SP series parallel machines. The compiler accepts as input Fortran 90 and Fortran 77 programs, augmented with HPF directives; sequential loops are automatically parallelized. The compiler supports symbolic analysis of expressions. This allows parameters such as the number of processors to be unknown at compile-time without significantly affecting performance. Communication schedules and computation guards are generated in a parameterized form at compile-time. Several novel optimizations and improved versions of well-known optimizations have been implemented in pHPF to exploit parallelism and reduce communication costs. These optimizations include elimination of redundant communication using data-availability analysis; using collective communication; new techniques for mapping scalar variables; coarse-grain wavefronting; and communication reduction in multi-dimensional shift communications. We present experimenta...
A Comparison of Locality Transformations for Irregular Codes
2000
Cited by 28 (8 self)
Researchers have proposed several data and computation transformations to improve locality in irregular scientific codes. We experimentally compare their performance and present GPART, a new technique based on hierarchical clustering. Quality partitions are constructed quickly by clustering multiple neighboring nodes with priority on nodes with high degree, and repeating a few passes. Overhead is kept low by clustering multiple nodes in each pass and considering only edges between partitions. Experimental results show GPART matches the performance of more sophisticated partitioning algorithms to within 6%–8%, with a small fraction of the overhead. It is thus useful for optimizing programs whose running times are not known.
Interprocedural Partial Redundancy Elimination and Its Application To Distributed Memory Compilation
University of Maryland, 1995
Cited by 28 (7 self)
Partial Redundancy Elimination (PRE) is a general scheme for suppressing partial redundancies which encompasses traditional optimizations like loop-invariant code motion and redundant code elimination. In this paper we address the problem of performing this optimization interprocedurally. We use interprocedural partial redundancy elimination for placement of communication and communication preprocessing statements while compiling for distributed memory parallel machines.
1 Introduction
Partial Redundancy Elimination (PRE) is a well known technique for optimizing code by suppressing partially redundant computations. It encompasses traditional optimizations like invariant code motion and redundant computation elimination. It is widely used in optimizing compilers for performing common subexpression elimination and strength reduction. More recently, it has been used for more complex code placement tasks like placement of communication statements while compiling for parallel machines (...
Improving Compiler and Run-Time Support for Adaptive Irregular Codes
In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 1998
Cited by 28 (8 self)
Irregular reductions form the core of adaptive irregular codes. On distributed-memory multiprocessors, they are parallelized either using sophisticated run-time systems (e.g., CHAOS, PILAR) or the shared-memory interface supported by software DSMs (e.g., CVM, TreadMarks). We introduce LOCALWRITE, a new technique based on the owner-computes rule which eliminates the need for buffers or synchronized writes but may replicate computation. We evaluate its performance for irregular codes while varying connectivity, locality, and adaptivity. LOCALWRITE improves performance by 50-150% compared to using replicated buffers, and can match or exceed gather/scatter for applications with low locality or high adaptivity.
1 Introduction
Scientists are beginning to exploit parallelism to provide the computing power they need for research and development. As they attempt to model more complex problems, irregular adaptive computations become increasingly important. The core of these applications is fre...
Improving Compiler and Run-Time Support for Irregular Reductions
1998
Cited by 17 (1 self)
Compilers for distributed-memory multiprocessors parallelize irregular reductions either by generating calls to sophisticated run-time systems or relying on the shared-memory interface supported by software DSMs. Run-time systems gather/scatter nonlocal results (e.g., CHAOS, PILAR) while software DSMs apply local reductions to replicated buffers (e.g., CVM, TreadMarks). We introduce LOCALWRITE, a new technique for parallelizing irregular reductions based on the owner-computes rule. It eliminates the need for buffers or synchronized writes, but may replicate computation. We investigate the impact of connectivity (node/edge ratio), locality (accesses to local data) and adaptivity (edge modifications) on their relative performance. LOCALWRITE improves performance by 50-150% compared to using replicated buffers. Gather/scatter using CHAOS generally provides the best performance, but LOCALWRITE can outperform CHAOS for applications with low locality or high adaptivity. We also discover that the flush-update coherence protocol can improve performance by 15-25% for software DSMs over an invalidate protocol.
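The owner-computes idea behind LOCALWRITE can be sketched as a sequential simulation. This is an illustration under stated assumptions, not the authors' implementation: the function names, the `owner` mapping, and the edge-list representation are hypothetical, and each simulated processor scans the full edge list where a real system would presumably precompute its own iteration list.

```python
def sequential_reduction(num_nodes, edges, force):
    """Reference irregular reduction: every edge updates both of its endpoints."""
    acc = [0.0] * num_nodes
    for u, v in edges:
        f = force(u, v)
        acc[u] += f
        acc[v] -= f
    return acc


def localwrite_reduction(num_nodes, edges, force, owner, num_procs):
    """Owner-computes sketch: a processor writes only to the nodes it owns.

    An edge whose endpoints have different owners is evaluated by both owners,
    so computation is replicated, but no buffers or synchronized writes are needed.
    Returns the result and the total number of force evaluations.
    """
    acc = [0.0] * num_nodes
    evaluations = 0
    for p in range(num_procs):            # each pass models one processor
        for u, v in edges:
            if owner(u) != p and owner(v) != p:
                continue                  # edge touches no local node
            f = force(u, v)
            evaluations += 1
            if owner(u) == p:
                acc[u] += f               # local write only
            if owner(v) == p:
                acc[v] -= f               # local write only
    return acc, evaluations


if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
    force = lambda u, v: float(u - v)     # stand-in for a real force computation
    owner = lambda node: node // 2        # block distribution over 2 processors
    print(sequential_reduction(4, edges, force))
    print(localwrite_reduction(4, edges, force, owner, 2))
```

In this example the two edges that cross the partition boundary are evaluated twice, which is the replicated computation the abstract mentions; higher locality means fewer such edges.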
Compiler Support for Software Prefetching
1998
Cited by 17 (0 self)
Due to the growing disparity between processor speed and main memory speed, techniques that improve cache utilization and hide memory latency are often needed to help applications achieve peak performance. Compiler-directed software prefetching is a hybrid software/hardware strategy that addresses this need. In this form of prefetching, the compiler inserts cache prefetch instructions into a program during the compilation process. During the program's execution, the hardware executes the prefetch instructions in parallel with other operations, bringing data items into the cache prior to the point where they are actually used, eliminating processor stalls due to cache misses. In this dissertation, we focus on the compiler's role in software prefetching. In a set of experimental studies, we evaluate the performance of current software prefetching strategies, first for sequential benchmark programs running on a simulated uniprocessor machine, and then for a set of parallel benchmarks on a...

