Current parallelizing compilers fail to identify a significant fraction of parallelizable loops because the loops' access patterns are complex or cannot be fully determined at compile time. Because such loops arise frequently in practice, we advocate a novel framework for identifying them: speculatively execute the loop as a doall, then apply a fully parallel data dependence test to determine whether it had any cross-iteration dependences; if the test fails, the loop is re-executed serially. In our experience, a significant amount of the available parallelism in Fortran programs can be exploited only after loops are transformed through privatization and reduction parallelization, so our methods speculatively apply these transformations and then check their validity at run-time. Another important contribution of this paper is a novel method for reduction recognition that goes beyond syntactic pattern matching: it detects at run-time whether the values stored in an array participate in a reduction operation, even if they are transferred through private variables or are affected by statically unpredictable control flow. We present experimental results on loops from the PERFECT Benchmarks which substantiate our claim that these techniques can yield significant speedups, often superior to those obtainable by inspector/executor methods.
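The speculate, test, and re-execute cycle described above can be illustrated with a minimal Python sketch. It is not the paper's test: it assumes a single shared array, omits privatization and reduction handling entirely, and emulates the doall by running iterations in reverse order on one thread. The names `speculative_doall`, `written_by`, and `read_by` are hypothetical and chosen only for this illustration.

```python
def speculative_doall(a, n_iters, body):
    """Run a loop speculatively; fall back to serial execution on failure.

    `body(i, read, write)` performs iteration i, touching the shared list
    `a` only through the `read(k)` and `write(k, v)` callbacks.
    Returns True if the speculative run was valid, False if it was
    rolled back and the loop was re-executed serially.
    """
    backup = list(a)   # checkpoint so a failed speculation can be undone
    written_by = {}    # element index -> set of iterations that wrote it
    read_by = {}       # element index -> set of iterations that read it

    def make_accessors(i):
        def read(k):
            read_by.setdefault(k, set()).add(i)
            return a[k]

        def write(k, v):
            written_by.setdefault(k, set()).add(i)
            a[k] = v

        return read, write

    # Speculative phase: a non-serial order stands in for parallel execution.
    for i in reversed(range(n_iters)):
        read, write = make_accessors(i)
        body(i, read, write)

    # Dependence test (simplified assumption): every written element has
    # exactly one writer, and no other iteration reads that element.
    ok = all(
        len(writers) == 1 and read_by.get(k, set()) <= writers
        for k, writers in written_by.items()
    )

    if not ok:
        a[:] = backup  # roll back the speculative state
        for i in range(n_iters):
            body(i, lambda k: a[k], lambda k, v: a.__setitem__(k, v))
    return ok
```

A loop whose subscripts come from an index array that happens to be a permutation passes this test, while a recurrence such as `a[i+1] = a[i] + 1` fails it and is re-executed serially. Adding privatization would relax the single-writer condition for elements that are always written before they are read within an iteration.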