  Analysis and Evaluation of The Synchronized Pipelined Parallelism Model

by Srinivas Vadlamani, Stephen Jenks
http://spds.ece.uci.edu/%7Esvadlama/papers/SPPM-TR-2006.pdf

Abstract:

Advances in processor design have led to the development of Chip Multiprocessors (CMP) and Simultaneous Multithreading (SMT) processors that can exploit Thread-Level Parallelism (TLP) in suitably parallelized applications. The performance of these processors, however, largely depends on the programming model chosen for parallelization. For several important classes of applications, the commonly used spatial decomposition model of parallelism leads to inefficient usage of the memory hierarchy on these processors because of increased cache misses and memory bus utilization, causing the applications to perform poorly. As an alternative, we propose the Synchronized Pipelined Parallelism Model (SPPM) for parallelizing such applications on these processors. SPPM restructures applications as a producer-consumer chain and utilizes the high inter-processor communication bandwidth on these multiprocessor systems (either a fast processor interconnect or one or more shared cache levels) for communication between threads. SPPM is also effective on conventional Symmetric Multiprocessor (SMP) systems that have a high-speed front-side bus. In this paper, we demonstrate that SPPM provides better performance than the spatial decomposition model on these processors for the target applications, while maintaining the overall cache miss rate and, consequently, the memory bus utilization at around the same levels as those in the original sequential applications.
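The producer-consumer restructuring described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the function name `run_pipeline`, the parameters `block_size` and `max_ahead`, and the two toy stages (doubling, then summing) are all hypothetical, chosen only to show the shape of a synchronized two-stage chain. Python threads will not reproduce the cache behavior the paper measures; the sketch illustrates the structure only.

```python
import queue
import threading


def run_pipeline(data, block_size=4, max_ahead=2):
    """Run a two-stage producer-consumer chain over blocks of `data`.

    Hypothetical illustration: stage 1 doubles each element of a block,
    stage 2 sums the block. `max_ahead` bounds how many blocks the
    producer may run ahead of the consumer.
    """
    # A bounded queue keeps the two stages loosely synchronized.
    blocks = queue.Queue(maxsize=max_ahead)
    results = []

    def producer():
        # Stage 1: transform one block at a time and hand it on.
        for start in range(0, len(data), block_size):
            block = [x * 2 for x in data[start:start + block_size]]
            blocks.put(block)  # waits if the consumer is max_ahead blocks behind
        blocks.put(None)  # sentinel: no more blocks

    def consumer():
        # Stage 2: pick up each block shortly after it was produced.
        while True:
            block = blocks.get()
            if block is None:
                break
            results.append(sum(block))

    threads = [
        threading.Thread(target=producer),
        threading.Thread(target=consumer),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The bounded queue stands in for the synchronization in the model's name: the producer cannot get more than `max_ahead` blocks in front, so the consumer reads data that was written recently, which on real CMP or SMT hardware is the data most likely to still sit in a shared cache.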
