Abstract:
Speculative parallel execution of statically non-analyzable codes on Distributed Shared-Memory (DSM) multiprocessors is challenging because of the long latencies and the distribution of memory in these machines. However, such an approach may well be the best way of speeding up codes whose dependences cannot be analyzed by the compiler. In this paper, we extend past work by proposing a hardware scheme for the speculative parallel execution of loops that have a modest number of cross-iteration dependences. In this case, when a dependence violation is detected, we locally repair the state. Then, depending on the situation, we either re-execute one out-of-order iteration or restart parallel execution from that point on. The general algorithm, called the Unified Privatization and Reduction algorithm (UPAR), privatizes on demand at cache-line level, executes reductions in parallel, and merges the last values and the partial results of reductions on the fly, leaving minimal residual work at loop end. UPAR allows for completely dynamic scheduling and does not slow down when the working set of an iteration is larger than the cache size. Simulations indicate good speedups relative to sequential execution. The hardware support for reduction optimizations brings, on average, a 50% performance improvement and can be used in both speculative and normal execution.
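As a rough software analogy for the idea of executing reductions in parallel on private copies and merging the partial results, the sketch below may help. It is not the hardware scheme described in the abstract: the function names, the dictionary used as an on-demand private copy, and the cyclic split of iterations are all assumptions made for illustration, and the merge here happens at loop end rather than on the fly.

```python
from concurrent.futures import ThreadPoolExecutor


def sequential_reduction(n_bins, idx, val):
    """Reference loop: a[idx[k]] += val[k], with subscripts known only at run time."""
    a = [0] * n_bins
    for i, v in zip(idx, val):
        a[i] += v
    return a


def parallel_reduction(n_bins, idx, val, n_workers=4):
    """Hypothetical software sketch: each worker accumulates into a private copy."""

    def worker(w):
        # Private partial results, created on demand only for touched elements
        # (a loose analogy to privatizing on demand; an assumption of this sketch).
        private = {}
        for k in range(w, len(idx), n_workers):  # cyclic split of iterations
            private[idx[k]] = private.get(idx[k], 0) + val[k]
        return private

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker, range(n_workers)))

    # Merge step: combine the private partial results into the shared array.
    a = [0] * n_bins
    for partial in partials:
        for i, s in partial.items():
            a[i] += s
    return a
```

Integer addition is used so that the merged result matches the sequential loop exactly; with floating-point values the two may differ slightly because the order of additions changes.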
Speculative Parallel Execution of Loops with Cross-Iteration Dependences in DSM Multiprocessors – Zhang, Rauchwerger, et al. – 1999