Communication-minimal partitioning of parallel loops and data arrays for cache-coherent distributed-memory multiprocessors (1996)

by Rajeev Barua, David Kranz, Anant Agarwal
In Proceedings of the Ninth Workshop on Languages and Compilers for Parallel Computing
ftp://cag.lcs.mit.edu/pub/papers/pdf/barua-lcpc96.pdf

Abstract:

Harnessing the full performance potential of cache-coherent distributed shared-memory multiprocessors without inordinate user effort requires compilation technology that can automatically manage multiple levels of the memory hierarchy. This paper describes a working compiler for such machines that automatically partitions loops and data arrays to optimize locality of access. The compiler implements a solution to the problem of finding communication-minimal partitions of loops and data. Loop and data partitions specify the distribution of loop iterations and array data across processors. A good loop partition maximizes the cache hit rate, while a good data partition minimizes remote cache misses. The problems of finding loop and data partitions interact when multiple loops access arrays with differing reference patterns. Our algorithm handles programs with multiple nested parallel loops accessing many arrays, where the array access indices are general affine functions of the loop variables. It discovers communication-minimal partitions when communication-free partitions do not exist. The compiler also uses sub-blocking to handle finite cache sizes. We present a cost model that estimates the cost of a loop and data partition given machine parameters such as cache, local, and remote access timings. Minimizing the cost as estimated by our model is an NP-complete problem, as is the fully general problem of partitioning. We present a heuristic method that provides good approximate solutions in polynomial time. The loop and data partitioning algorithm has been implemented in the compiler for the MIT Alewife machine. The paper presents results obtained from a working compiler on a 16-processor machine for three real applications: Tomcatv, Erlebacher, and Conduct. Our results demonstrate that combined optimization of loops and data can improve runtime by nearly a factor of two over optimization of loops alone.
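To make the interaction between tile shape, cache hits, and remote misses concrete, here is a minimal sketch in Python. It is not the paper's cost model or algorithm, which the abstract does not spell out. It assumes a toy setting: a two-deep parallel loop whose iteration (i, j) reads A[i+di, j+dj] for a small set of constant offsets, one rectangular tile of iterations per processor, an unbounded cache with one array element per line (so each distinct element misses exactly once), and a data partition that places element (i, j) on the processor that executes iteration (i, j). The function names and the timing values are hypothetical.

```python
from itertools import product


def touched_elements(rows, cols, offsets):
    """Distinct array elements read by a rows x cols tile of iterations.

    Iteration (i, j) is assumed to read A[i + di, j + dj] for each (di, dj).
    """
    return {
        (i + di, j + dj)
        for i, j in product(range(rows), range(cols))
        for di, dj in offsets
    }


def footprint(rows, cols, offsets):
    """Size of the tile's data footprint (its number of cold misses here)."""
    return len(touched_elements(rows, cols, offsets))


def tile_cost(rows, cols, offsets, t_cache, t_local, t_remote):
    """Toy estimate of memory cost for one processor's tile.

    Assumptions (not from the paper): every distinct element misses once;
    elements whose index lies inside the tile are in local memory, all
    others are remote; every other access is a cache hit.
    """
    touched = touched_elements(rows, cols, offsets)
    accesses = rows * cols * len(offsets)
    local_misses = sum(1 for (i, j) in touched if 0 <= i < rows and 0 <= j < cols)
    remote_misses = len(touched) - local_misses
    hits = accesses - len(touched)
    return hits * t_cache + local_misses * t_local + remote_misses * t_remote


if __name__ == "__main__":
    stencil = [(0, 0), (1, 0), (0, 1)]    # hypothetical reference pattern
    timings = dict(t_cache=1, t_local=10, t_remote=40)  # made-up cycle counts
    for shape in [(64, 1), (8, 8), (1, 64)]:  # 64 iterations per processor
        print(shape, footprint(*shape, stencil), tile_cost(*shape, stencil, **timings))
```

Under these assumptions the square 8 x 8 tile touches 80 distinct elements while the 64 x 1 strip touches 129, so the square tile has both more cache hits and fewer remote misses for the same amount of work. A real compiler would also have to account for cache line size, finite capacity, and several loops sharing the same arrays, which is where the loop and data partitioning problems start to interact.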
