Results 1 - 10 of 75
CRL: High-Performance All-Software Distributed Shared Memory, 1995
"... This paper introduces the C Region Library (CRL), a new all-software distributed shared memory (DSM) system. CRL requires no special compiler, hardware, or operating system support beyond the ability to send and receive messages. It provides a simple, portable shared address space programming model ..."
Abstract
-
Cited by 208 (13 self)
- Add to MetaCart
This paper introduces the C Region Library (CRL), a new all-software distributed shared memory (DSM) system. CRL requires no special compiler, hardware, or operating system support beyond the ability to send and receive messages. It provides a simple, portable shared address space programming model that is capable of delivering good performance on a wide range of multiprocessor and distributed system architectures. We have developed CRL implementations for two platforms: the CM-5, a commercial multicomputer, and the MIT Alewife machine, an experimental multiprocessor offering efficient support for both message passing and shared memory. We present results for up to 128 processors on the CM-5 and up to 32 processors on Alewife. In a set of controlled experiments, we demonstrate that CRL is the first all-software DSM system capable of delivering performance competitive with hardware DSMs. CRL achieves speedups within 30% of those provided by Alewife's native support for shared memory, eve...
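The region model described above can be pictured with a short sketch. The function names and single-node stub bodies below are assumptions chosen for illustration, not CRL's actual interface; in a real all-software DSM the start/end calls would run the coherence protocol rather than the no-ops shown.

/* Minimal single-node sketch of a region-style DSM programming model.
 * Names and stub bodies are illustrative assumptions only. */
#include <stdio.h>
#include <stdlib.h>

typedef void *region_t;

static region_t rgn_create(size_t nbytes) { return calloc(1, nbytes); }
static void *rgn_map(region_t r)          { return r; }   /* local mapping   */
static void rgn_start_write(region_t r)   { (void)r; }    /* acquire/fetch   */
static void rgn_end_write(region_t r)     { (void)r; }    /* publish updates */

int main(void)
{
    region_t rgn = rgn_create(sizeof(long));   /* shared region holding one counter */
    long *counter = rgn_map(rgn);

    rgn_start_write(rgn);    /* ordinary loads/stores happen inside an open section */
    (*counter)++;
    rgn_end_write(rgn);      /* on a real DSM, updates become visible to other nodes here */

    printf("counter = %ld\n", *counter);
    free(rgn);
    return 0;
}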
Fine-grain Access Control for Distributed Shared Memory
- In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI), 1994
"... This paper discusses implementations of fine-grain memory access control, which selectively restricts reads and writes to cache-block-sized memory regions. Fine-grain access control forms the basis of efficient cache-coherent shared memory. This paper focuses on low-cost implementations that require ..."
Abstract
-
Cited by 186 (33 self)
- Add to MetaCart
(Show Context)
This paper discusses implementations of fine-grain memory access control, which selectively restricts reads and writes to cache-block-sized memory regions. Fine-grain access control forms the basis of efficient cache-coherent shared memory. This paper focuses on low-cost implementations that require little or no additional hardware. These techniques permit efficient implementation of shared memory on a wide range of parallel systems, thereby providing shared-memory codes with a portability previously limited to message passing. This paper categorizes techniques based on where access control is enforced and where access conflicts are handled. We incorporated three techniques that require no additional hardware into Blizzard, a system that supports distributed shared memory on the CM-5. The first adds a software lookup before each shared-memory reference by modifying the program's executable. The second uses the memory's error correcting code (ECC) as cache-block valid bits. The third is...
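As a rough illustration of the first technique (a software lookup inserted before each shared-memory reference), the sketch below checks a per-block state table before a load. The table layout, block size, and fault handler are assumptions for illustration, not Blizzard's actual implementation, which inserts an equivalent check by rewriting the program's executable.

/* Sketch of a software access-control check before a shared load. */
#include <stdio.h>
#include <stdint.h>

#define BLOCK_SHIFT 5                    /* 32-byte "cache blocks" (assumed)   */
#define NBLOCKS     1024

typedef enum { INVALID, READ_ONLY, READ_WRITE } block_state_t;

static block_state_t state[NBLOCKS];     /* per-block access-control state     */
static uint8_t shared_mem[NBLOCKS << BLOCK_SHIFT];

static void access_fault(size_t block, int write)
{
    /* A real system would run the coherence protocol here to fetch the block
     * or upgrade its permission; this stub just marks it accessible. */
    state[block] = write ? READ_WRITE : READ_ONLY;
}

/* The check that binary rewriting would insert before each shared load. */
static int checked_load(size_t offset)
{
    size_t block = offset >> BLOCK_SHIFT;
    if (state[block] == INVALID)
        access_fault(block, 0);
    return shared_mem[offset];
}

int main(void)
{
    printf("value = %d\n", checked_load(100));   /* first access "misses"      */
    printf("value = %d\n", checked_load(101));   /* same block, no fault taken */
    return 0;
}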
Performance Evaluation of Two Home-Based Lazy Release Consistency Protocols for Shared Virtual Memory Systems
- In Proceedings of the Operating Systems Design and Implementation Symposium, 1996
"... This paper investigates the performance of shared virtual memory protocols on large-scale multicomputers. Using experiments on a 64-node Paragon, we show that the traditional Lazy Release Consistency (LRC) protocol does not scale well, because of the large number of messages it requires, the large a ..."
Abstract
-
Cited by 160 (20 self)
- Add to MetaCart
(Show Context)
This paper investigates the performance of shared virtual memory protocols on large-scale multicomputers. Using experiments on a 64-node Paragon, we show that the traditional Lazy Release Consistency (LRC) protocol does not scale well, because of the large number of messages it requires, the large amount of memory it consumes for protocol overhead data, and because of the difficulty of garbage collecting that data. To achieve more scalable performance, we introduce and evaluate two new protocols. The first, Home-based LRC (HLRC), is based on the Automatic Update Release Consistency (AURC) protocol. Like AURC, HLRC maintains a home for each page to which all updates are propagated and from which all copies are derived. Unlike AURC, HLRC requires no specialized hardware support. We find that the use of homes provides substantial improvements in performance and scalability over LRC. Our second protocol, called Overlapped Home-based LRC (OHLRC), takes advantage of the communication processor found on each node of the Paragon to offload some of the protocol overhead of HLRC from the critical path followed by the compute processor. We find that OHLRC provides modest improvements over HLRC. We also apply overlapping to the base LRC protocol, with similar results. Our experiments were done using five of the Splash-2 benchmarks. We report overall execution times, as well as detailed breakdowns of elapsed time, message traffic, and memory use for each of the protocols.
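A single-node sketch of the home-based idea, under assumed page and word granularities: a writer diffs its dirty copy of a page against a pristine twin and ships only the changed words to the page's home copy at release time. The names and the encoding below are illustrative, not the paper's protocol.

/* Sketch: twin/diff creation at the writer, applied to the home copy. */
#include <stdio.h>
#include <string.h>

#define PAGE_WORDS 16                  /* toy page of 16 words */

/* Send to the home copy every word where the local copy differs from the twin. */
static void propagate_diff_to_home(const long *twin, const long *local, long *home)
{
    for (int i = 0; i < PAGE_WORDS; i++)
        if (local[i] != twin[i])
            home[i] = local[i];        /* in HLRC this travels in a message */
}

int main(void)
{
    long home[PAGE_WORDS] = {0};       /* master copy kept at the home node */
    long twin[PAGE_WORDS], local[PAGE_WORDS];

    memcpy(local, home, sizeof home);  /* fetch the page from home on first access */
    memcpy(twin, local, sizeof home);  /* make a twin before the first write       */

    local[3] = 42;                     /* writes go to the local copy only         */
    local[7] = 7;

    propagate_diff_to_home(twin, local, home);   /* at release time */
    printf("home[3]=%ld home[7]=%ld\n", home[3], home[7]);
    return 0;
}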
Cashmere-2L: Software Coherent Shared Memory on a Clustered Remote-Write Network
- In Proceedings of the 16th ACM Symposium on Operating Systems Principles, 1997
"... Low-latency remote-write networks, such as DEC’s Memory Channel, provide the possibility of transparent, inexpensive, large-scale shared-memory parallel computing on clusters of shared memory multiprocessors (SMPs). The challenge is to take advantage of hardware shared memory for sharing within an S ..."
Abstract
-
Cited by 131 (28 self)
- Add to MetaCart
(Show Context)
Low-latency remote-write networks, such as DEC’s Memory Channel, provide the possibility of transparent, inexpensive, large-scale shared-memory parallel computing on clusters of shared memory multiprocessors (SMPs). The challenge is to take advantage of hardware shared memory for sharing within an SMP, and to ensure that software overhead is incurred only when actively sharing data across SMPs in the cluster. In this paper, we describe a “two-level” software coherent shared memory system—Cashmere-2L—that meets this challenge. Cashmere-2L uses hardware to share memory within a node, while exploiting the Memory Channel’s remote-write capabilities to implement “moderately lazy” release consistency with multiple concurrent writers, directories, home nodes, and page-size coherence blocks across nodes. Cashmere-2L employs a novel coherence protocol that allows a high level of ...
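The two-level split can be sketched as a simple dispatch: accesses to pages homed on the local SMP node rely on hardware coherence alone, while accesses to remotely homed pages fall into the software protocol. The node-id encoding and function names below are assumptions for illustration, not Cashmere-2L's data structures.

/* Sketch: intra-node sharing via hardware, cross-node sharing via software. */
#include <stdio.h>

#define PAGES      8
#define LOCAL_NODE 0

static int home_node[PAGES] = {0, 0, 1, 1, 2, 2, 3, 3};   /* toy home assignment */

static void software_fetch_from_home(int page)
{
    /* Stand-in for the cross-node path: request the page (or updates) from
     * its home node over the remote-write network, then map it locally. */
    printf("page %d: software protocol, fetch from node %d\n", page, home_node[page]);
}

static void access_page(int page)
{
    if (home_node[page] == LOCAL_NODE)
        printf("page %d: hardware-shared within the node, no software action\n", page);
    else
        software_fetch_from_home(page);
}

int main(void)
{
    access_page(1);     /* intra-node: hardware coherence only  */
    access_page(5);     /* cross-node: software coherence path  */
    return 0;
}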
Efficient Support for Irregular Applications on Distributed-Memory Machines, 1995
"... Irregular computation problems underlie many important scientific applications. Although these problems are computationally expensive, and so would seem appropriate for parallel machines, their irregular and unpredictable run-time behavior makes this type of parallel program difficult to write and a ..."
Abstract
-
Cited by 90 (13 self)
- Add to MetaCart
Irregular computation problems underlie many important scientific applications. Although these problems are computationally expensive, and so would seem appropriate for parallel machines, their irregular and unpredictable run-time behavior makes this type of parallel program difficult to write and adversely affects run-time performance. This paper explores three issues -- partitioning, mutual exclusion, and data transfer -- crucial to the efficient execution of irregular problems on distributed-memory machines. Unlike previous work, we studied the same programs running in three alternative systems on the same hardware base (a Thinking Machines CM-5): the CHAOS irregular application library, Transparent Shared Memory (TSM), and eXtensible Shared Memory (XSM). CHAOS and XSM performed equivalently for all three applications. Both systems were somewhat (13%) to significantly (991%) faster than TSM.
Improving Release-Consistent Shared Virtual Memory using Automatic Update
- In the 2nd IEEE Symposium on High-Performance Computer Architecture, 1996
"... Shared virtual memory is a software technique to provide shared memory on a network of computers without special hardware support. Although several relaxed consistency models and implementations are quite effective, there is still a considerable performance gap between the "software-only" ..."
Abstract
-
Cited by 89 (21 self)
- Add to MetaCart
Shared virtual memory is a software technique to provide shared memory on a network of computers without special hardware support. Although several relaxed consistency models and implementations are quite effective, there is still a considerable performance gap between the "software-only" approach and the hardware approach that uses directory-based caches. Automatic update is a simple communication mechanism, implemented in the SHRIMP multicomputer, that forwards local writes to remote memory transparently. In this paper we propose a new lazy release consistency based protocol, called Automatic Update Release Consistency (AURC), that uses automatic update to propagate and merge shared memory modifications. We compare the performance of this protocol against a software-only LRC implementation on several Splash-2 applications and show that the AURC approach can substantially improve the performance of LRC. For 16 processors, the average speedup has increased from 5.9 under LRC to 8.3 under AURC.
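A toy sketch of the automatic-update idea, with the forwarding done explicitly in software for illustration (the SHRIMP hardware does it transparently per store): every local write to a shared page is also propagated to the page's home copy, so the home stays current and no diffs need to be computed at release time.

/* Sketch: writes forwarded to the home copy as they happen. */
#include <stdio.h>

#define PAGE_WORDS 16

static long local_copy[PAGE_WORDS];   /* writer's mapped copy              */
static long home_copy[PAGE_WORDS];    /* home copy, kept on another node   */

/* Stand-in for a store that automatic-update hardware would snoop and forward. */
static void au_write(int i, long value)
{
    local_copy[i] = value;            /* ordinary local store              */
    home_copy[i]  = value;            /* forwarded write to remote memory  */
}

int main(void)
{
    au_write(3, 42);
    au_write(7, 7);
    /* At a release the home already has the updates; a later acquirer can
     * fetch the page from home without waiting for diff creation. */
    printf("home[3]=%ld home[7]=%ld\n", home_copy[3], home_copy[7]);
    return 0;
}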
SoftFLASH: Analyzing the Performance of Clustered Distributed Virtual Shared Memory
- In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, 1996
"... One potentially attractive way to build large-scale shared-memory machines is to use small-scale to medium-scale shared-memory machines as clusters that are interconnected with an off-the-shelf network. To create a shared-memory programming environment across the clusters, it is possible to use a vi ..."
Abstract
-
Cited by 81 (0 self)
- Add to MetaCart
(Show Context)
One potentially attractive way to build large-scale shared-memory machines is to use small-scale to medium-scale shared-memory machines as clusters that are interconnected with an off-the-shelf network. To create a shared-memory programming environment across the clusters, it is possible to use a virtual shared-memory software layer. Because of the low latency and high bandwidth of the interconnect available within each cluster, there are clear advantages in making the clusters as large as possible. The critical question then becomes whether the latency and bandwidth of the top-level network and the software system are sufficient to support the communication demands generated by the clusters. To explore these questions, we have built an aggressive kernel implementation of a virtual shared-memory system using SGI multiprocessors and 100Mbyte/sec HIPPI interconnects. The system obtains speedups on 32 processors (four nodes, eight ...
MGS: A Multigrain Shared Memory System
- In Proceedings of the 23rd Annual International Symposium on Computer Architecture, 1996
"... Abstract Parallel workstations, each comprising 10-100 processors, promisecost-effective general-purpose multiprocessing. This paper explores the coupling of such small- to medium-scale shared mem-ory multiprocessors through software over a local area network to synthesize larger shared memory syste ..."
Abstract
-
Cited by 67 (6 self)
- Add to MetaCart
(Show Context)
Parallel workstations, each comprising 10-100 processors, promise cost-effective general-purpose multiprocessing. This paper explores the coupling of such small- to medium-scale shared memory multiprocessors through software over a local area network to synthesize larger shared memory systems. We call these systems Distributed Scalable Shared-memory Multiprocessors (DSSMPs). This paper introduces the design of a shared memory system that uses multiple granularities of sharing, and presents an implementation on the Alewife multiprocessor, called MGS. Multigrain shared memory enables the collaboration of hardware and software shared memory, and is effective at exploiting a form of locality called multigrain locality. The system provides efficient support for fine-grain cache-line sharing, and resorts to coarse-grain page-level sharing only when locality is violated. A framework for characterizing application performance on DSSMPs is also introduced. Using MGS, an in-depth study of several shared memory applications is conducted to understand the behavior of DSSMPs. We find that unmodified shared memory applications can exploit multigrain sharing. Keeping the number of processors fixed, applications execute up to 85% faster when each DSSMP node is a multiprocessor as opposed to a uniprocessor. We also show that tightly-coupled multiprocessors hold a significant performance advantage over DSSMPs on unmodified applications. However, a best-effort implementation of a kernel from one of the applications allows a DSSMP to almost match the performance of a tightly-coupled multiprocessor.

Large-scale shared memory multiprocessors have traditionally been built using custom communication interfaces, high performance VLSI networks, and special-purpose hardware support for shared memory. These systems achieve good performance on a wide range of applications; however, they are costly. Despite attempts to make cost (in addition to performance) scalable, fundamental obstacles prevent large tightly-coupled systems from being cost effective. Power distribution, clock distribution, cooling, and other packag...
Higher-Order Distributed Objects, 1995
"... IONS 3.1 Scheme 48 Kali Scheme is implemented as an extension to Scheme 48 [Kelsey and Rees 1994], an implementation of Scheme [Clinger and Rees 1991]. Scheme is a lexically scoped dialect of Lisp. Scheme 48 is based on as byte-coded interpreter written in a highly optimized, restricted dialect of ..."
Abstract
-
Cited by 62 (4 self)
- Add to MetaCart
Kali Scheme is implemented as an extension to Scheme 48 [Kelsey and Rees 1994], an implementation of Scheme [Clinger and Rees 1991]. Scheme is a lexically scoped dialect of Lisp. Scheme 48 is based on a byte-coded interpreter written in a highly optimized, restricted dialect of Scheme called Pre-Scheme, which compiles to C. Because of the way it is implemented, the system is very portable and is reasonably efficient for an interpreted system. Unlike other Scheme implementations, Scheme 48 is roughly 10-15 times slower than a highly optimized Scheme compiler generating native code [Kranz et al. 1986].

(define-record-type thread ... continuation ...)
(define current-thread ...)
(define (spawn thunk)
  (let ((thread (make-thread)))
    (set-thread-continuation! thread
      (lambda (ignore) (thunk) (terminate-current-thread)))
    (context-switch thread)))
(define (context-switch thread)
  (add-to-queue! runnable-threads current-thread)
  (switch-to-thread thre...
Understanding Application Performance on Shared Virtual Memory Systems
- In Proceedings of the 23rd Annual Symposium on Computer Architecture, 1996
"... Many researchers have proposed interesting protocols for shared virtual memory (SVM) systems, and demonstrated performance improvements on parallel programs. However, there is still no clear understanding of the performance potential of SVM systems for different classes of applications. This paper b ..."
Abstract
-
Cited by 60 (20 self)
- Add to MetaCart
(Show Context)
Many researchers have proposed interesting protocols for shared virtual memory (SVM) systems, and demonstrated performance improvements on parallel programs. However, there is still no clear understanding of the performance potential of SVM systems for different classes of applications. This paper begins to fill this gap, by studying the performance of a range of applications in detail and understanding it in light of application characteristics. We first develop a brief classification of the inherent data sharing patterns in the applications, and how they interact with system granularities to yield the communication patterns relevant to SVM systems. We then use detailed simulation to compare the performance of two SVM approaches---Lazy Release Consistency (LRC) and Automatic Update Release Consistency (AURC)---with each other and with an all-hardware CC-NUMA approach. We examine how performance is affected by problem size, machine size, key system parameters, and the use of less opt...