In parallel programs using the shared-variable paradigm, run-time communication overhead manifests itself along three principal dimensions: shared data accesses (including memory contention, cache misses, and non-local memory access latencies), inter-process synchronization operations, and global barrier synchronizations. Performance measurements that quantify the rate at which an algorithm's communication costs increase as more processors are used are integral to the study of the algorithm's efficiency and scalability. In this thesis, we explore the problem of performance characterization of a multiprocessor in the context of the shared-variable programming model, with emphasis on characterizing dynamic run-time behavior. We have developed a hierarchical model that characterizes multiprocessor system performance using a multi-phase computation structure with concurrent asynchronous execution within a phase. We propose two sets of system characterization parameters that completely describe the static and dynamic behavior of a given input workload on a target multiprocessor system. The characterization parameters are calibrated by experimental measurements on the input workload. A series of