Clusters of symmetric multiprocessors (SMPs) are important platforms for high-performance computing. With the success of hardware cache-coherent distributed shared memory (DSM), considerable effort has also gone into supporting the coherent shared address space programming model in software on clusters. Much research has been done on fast communication on clusters and on protocols for supporting software shared memory across them. However, the performance of software virtual memory (SVM) is still far from that achieved on hardware DSM systems. The goal of this paper is to improve the performance of SVM on system area network clusters by considering communication and protocol layer interactions. We first examine which communication system bottlenecks stand in the way of improving the parallel performance of SVM clusters: in particular, which parameters of the communication architecture are most important to improve further relative to processor speed, which are already adequate on modern systems for most applications, and how this will change with technology in the future. We find that the most important communication subsystem cost to improve is the overhead of generating and delivering interrupts for asynchronous protocol processing.