Advances in processor design have led to the development of Chip Multiprocessors (CMPs) and Simultaneous Multithreading (SMT) processors that can exploit Thread-Level Parallelism (TLP) in suitably parallelized applications. The performance of these processors, however, depends largely on the programming model chosen for parallelization. For several important classes of applications, the commonly used spatial decomposition model of parallelism uses the memory hierarchy of these processors inefficiently: it increases cache misses and memory bus utilization, and the applications perform poorly as a result. As an alternative, we propose the Synchronized Pipelined Parallelism Model (SPPM) for parallelizing such applications on these processors. SPPM restructures applications as a producer-consumer chain and uses the high inter-processor communication bandwidth of these multiprocessor systems (either a fast processor interconnect or one or more shared cache levels) for communication between threads. SPPM is also effective on conventional Symmetric Multiprocessor (SMP) systems that have a high-speed front-side bus. In this paper, we demonstrate that SPPM provides better performance than the spatial decomposition model on these processors for the target applications, while keeping the overall cache miss rate and, consequently, the memory bus utilization at around the same levels as those of the original sequential applications.
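The producer-consumer restructuring described above can be illustrated with a minimal sketch. This is not the SPPM implementation from the paper: the function `run_pipeline`, its parameters, and the use of a bounded queue as the synchronization mechanism are all assumptions made for illustration. The small queue bound stands in for the idea that the consumer should follow closely behind the producer, so that handed-off data is still likely to be in a shared cache level when it is read. Python threads share one interpreter lock, so the sketch shows the structure only, not the performance behavior.

```python
import queue
import threading


def run_pipeline(blocks, produce, consume, max_in_flight=2):
    """Run produce() and consume() as a two-stage producer-consumer chain.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    # A small bound keeps the producer from running far ahead of the consumer.
    handoff = queue.Queue(maxsize=max_in_flight)
    results = []
    done = object()  # sentinel marking the end of the stream

    def producer():
        for block in blocks:
            handoff.put(produce(block))  # blocks while the queue is full
        handoff.put(done)

    def consumer():
        while True:
            item = handoff.get()  # blocks while the queue is empty
            if item is done:
                break
            results.append(consume(item))

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Stage 1 builds a small block of data; stage 2 reduces it.
    out = run_pipeline(range(4), lambda b: [b * 10 + i for i in range(3)], sum)
    print(out)
```

With one producer and one consumer, the FIFO queue preserves block order, so results come back in the order the blocks were produced. A spatial decomposition of the same work would instead give each thread its own disjoint slice of the blocks and run both stages on that slice.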