Results 1 - 10 of 208
Shasta: A Low Overhead, Software-Only Approach for Supporting Fine-Grain Shared Memory
"... This paper describes Shasta, a system that supports a shared address space in software on clusters of computers with physically distributed memory. A unique aspect of Shasta compared to most other software distributed shared memory systems is that shared data can be kept coherent at a fine granulari ..."
Abstract - Cited by 236 (5 self)
This paper describes Shasta, a system that supports a shared address space in software on clusters of computers with physically distributed memory. A unique aspect of Shasta compared to most other software distributed shared memory systems is that shared data can be kept coherent at a fine granularity. In addition, the system allows the coherence granularity to vary across different shared data structures in a single application. Shasta implements the shared address space by transparently rewriting the application executable to intercept loads and stores. For each shared load or store, the inserted code checks to see if the data is available locally and communicates with other processors if necessary. The system uses numerous techniques to reduce the run-time overhead of these checks. Since Shasta is implemented entirely in software, it also provides tremendous flexibility in supporting different types of cache coherence protocols. We have implemented an efficient
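The access-check mechanism described above is concrete enough to sketch. The following C fragment is a minimal illustration, under assumed block sizes and data structures, of the kind of inline state check a Shasta-like system inserts before a shared load; it is not Shasta's actual generated code, and fetch_block() stands in for the real protocol miss handler.

/* Minimal sketch of an inline access check in a Shasta-like system.
 * Block size, the state table, and fetch_block() are illustrative
 * stand-ins, not Shasta's generated code. */
#include <stdint.h>

#define BLOCK_SHIFT 6                         /* example: 64-byte coherence blocks */
#define NUM_BLOCKS  (1u << 16)
enum { STATE_INVALID = 0, STATE_SHARED, STATE_EXCLUSIVE };

static uint8_t block_state[NUM_BLOCKS];       /* per-block coherence state */

static void fetch_block(uintptr_t block)      /* stand-in for the protocol miss handler */
{
    /* A real system would message the block's home node here. */
    block_state[block] = STATE_SHARED;
}

static inline long checked_load(const long *addr)
{
    uintptr_t block = ((uintptr_t)addr >> BLOCK_SHIFT) % NUM_BLOCKS;
    if (block_state[block] == STATE_INVALID)  /* the inserted fast-path check */
        fetch_block(block);                   /* slow path: fetch data, update state */
    return *addr;                             /* data is now valid locally */
}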
Performance Evaluation of Two Home-Based Lazy Release Consistency Protocols for Shared Virtual Memory Systems
- In Proceedings of the Operating Systems Design and Implementation Symposium
, 1996
"... This paper investigates the performance of shared virtual memory protocols on large-scale multicomputers. Using experiments on a 64-node Paragon, we show that the traditional Lazy Release Consistency (LRC) protocol does not scale well, because of the large number of messages it requires, the large a ..."
Abstract - Cited by 160 (20 self)
This paper investigates the performance of shared virtual memory protocols on large-scale multicomputers. Using experiments on a 64-node Paragon, we show that the traditional Lazy Release Consistency (LRC) protocol does not scale well, because of the large number of messages it requires, the large amount of memory it consumes for protocol overhead data, and because of the difficulty of garbage collecting that data. To achieve more scalable performance, we introduce and evaluate two new protocols. The first, Home-based LRC (HLRC), is based on the Automatic Update Release Consistency (AURC) protocol. Like AURC, HLRC maintains a home for each page to which all updates are propagated and from which all copies are derived. Unlike AURC, HLRC requires no specialized hardware support. We find that the use of homes provides substantial improvements in performance and scalability over LRC. Our second protocol, called Overlapped Home-based LRC (OHLRC), takes advantage of the communication processor found on each node of the Paragon to offload some of the protocol overhead of HLRC from the critical path followed by the compute processor. We find that OHLRC provides modest improvements over HLRC. We also apply overlapping to the base LRC protocol, with similar results. Our experiments were done using five of the Splash-2 benchmarks. We report overall execution times, as well as detailed breakdowns of elapsed time, message traffic, and memory use for each of the protocols.
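To make the home-based idea concrete, the following C sketch shows the release-time step an HLRC-like protocol performs: diffing a dirty page against its twin and shipping only the changed bytes to the page's fixed home. The structures and the send_diff() routine are illustrative assumptions, not the protocol's actual implementation.

/* Sketch of the release-time step in an HLRC-style protocol.  At a release,
 * the writer diffs each dirty page against its twin and sends only the
 * changed bytes to the page's home, which applies them; a later faulting
 * reader fetches the whole up-to-date page from that home. */
#include <string.h>
#define PAGE_SIZE 4096

struct page_meta {
    int  home_node;              /* fixed home to which all updates propagate */
    char twin[PAGE_SIZE];        /* pristine copy made at the first write */
    char data[PAGE_SIZE];        /* local working copy */
    int  dirty;
};

/* Stand-in for sending a diff record to the home node's protocol handler. */
static void send_diff(int home, long page_id, int off, char byte)
{
    (void)home; (void)page_id; (void)off; (void)byte;
}

static void flush_page_at_release(long page_id, struct page_meta *p)
{
    if (!p->dirty)
        return;
    for (int off = 0; off < PAGE_SIZE; off++)     /* compare against the twin */
        if (p->data[off] != p->twin[off])
            send_diff(p->home_node, page_id, off, p->data[off]);
    memcpy(p->twin, p->data, PAGE_SIZE);          /* twin is up to date again */
    p->dirty = 0;
}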
Mondrian Memory Protection
, 2002
"... Mondrian memory protection (MMP) is a fine-grained protection scheme that allows multiple protection domains to flexibly share memory and export protected services. In contrast to earlier pagebased systems, MMP allows arbitrary permissions control at the granularity of individual words. We use a com ..."
Abstract - Cited by 124 (3 self)
Mondrian memory protection (MMP) is a fine-grained protection scheme that allows multiple protection domains to flexibly share memory and export protected services. In contrast to earlier page-based systems, MMP allows arbitrary permissions control at the granularity of individual words. We use a compressed permissions table to reduce space overheads and employ two levels of permissions caching to reduce run-time overheads. The protection tables in our implementation add less than 9% overhead to the memory space used by the application. Accessing the protection tables adds less than 8% additional memory references to the accesses made by the application. Although it can be layered on top of demand-paged virtual memory, MMP is also well-suited to embedded systems with a single physical address space. We extend MMP to support segment translation which allows a memory segment to appear at another location in the address space. We use this translation to implement zero-copy networking underneath the standard read system call interface, where packet payload fragments are connected together by the translation system to avoid data copying. This saves 52% of the memory references used by a traditional copying network stack.
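As a rough illustration of word-granularity permissions, the C sketch below checks a store against a flat table holding two permission bits per word. A real MMP design uses compressed tables and two levels of permission caching in hardware; the flat table and helper names here are assumptions made only for clarity.

/* Toy word-granularity permission check.  A flat table stores 2 permission
 * bits per 32-bit word; MMP itself uses compressed, hardware-walked tables
 * with permission caches, so this is only an illustration of the idea. */
#include <stdint.h>

enum perm { PERM_NONE = 0, PERM_RO = 1, PERM_RW = 2, PERM_EXEC = 3 };

#define WORDS (1u << 20)                    /* toy 4 MB space of 32-bit words */
static uint8_t perm_table[WORDS / 4];       /* 4 two-bit entries per table byte */

static enum perm get_perm(uint32_t word_index)
{
    uint8_t byte = perm_table[word_index / 4];
    return (enum perm)((byte >> ((word_index % 4) * 2)) & 3u);
}

static int check_store(uint32_t addr)       /* addr is a byte address */
{
    uint32_t word_index = (addr / 4) % WORDS;
    return get_perm(word_index) == PERM_RW; /* stores need read-write permission */
}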
An analysis of dag-consistent distributed shared-memory algorithms
- in Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA
, 1996
"... In this paper, we analyze the performance of parallel multi-threaded algorithms that use dag-consistent distributed shared memory. Specifically, we analyze execution time, page faults, and space requirements for multithreaded algorithms executed by a FP(C) workstealing thread scheduler and the BACKE ..."
Abstract - Cited by 98 (19 self)
In this paper, we analyze the performance of parallel multi-threaded algorithms that use dag-consistent distributed shared memory. Specifically, we analyze execution time, page faults, and space requirements for multithreaded algorithms executed by a work-stealing thread scheduler and the BACKER algorithm for maintaining dag consistency. We prove that if the accesses to the backing store are random and independent (the BACKER algorithm actually uses hashing), the expected execution time T_P(C) of a “fully strict” multithreaded computation on P processors, each with an LRU cache of C pages, is O(T_1(C)/P + m C T_∞), where T_1(C) is the total work of the computation including page faults, T_∞ is its critical-path length excluding page faults, and m is the minimum page transfer time. As
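For readability, the bound quoted above can be restated in display form, using the same symbols as in the abstract:

% Expected execution time of a fully strict computation on P processors,
% each with an LRU cache of C pages, under the BACKER coherence algorithm:
%   T_1(C)  -- total work, including page faults
%   T_inf   -- critical-path length, excluding page faults
%   m       -- minimum page transfer time
\[
  T_P(C) \;=\; O\!\left(\frac{T_1(C)}{P} + m\,C\,T_\infty\right)
\]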
An Efficient Implementation of Java's Remote Method Invocation
- In Seventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’99
, 1999
"... Java offers interesting opportunities for parallel computing. In particular, Java Remote Method Invocation provides an unusually flexible kind of Remote Procedure Call. Unlike RPC, RMI supports polymorphism, which requires the system to be able to download remote classes into a running application. ..."
Abstract - Cited by 91 (25 self)
Java offers interesting opportunities for parallel computing. In particular, Java Remote Method Invocation provides an unusually flexible kind of Remote Procedure Call. Unlike RPC, RMI supports polymorphism, which requires the system to be able to download remote classes into a running application. Sun's RMI implementation achieves this kind of flexibility by passing around object type information and processing it at run time, which causes a major run-time overhead. Using Sun's JDK 1.1.4 on a Pentium Pro/Myrinet cluster, for example, the latency for a null RMI (without parameters or a return value) is 1228 µsec, which is about a factor of 40 higher than that of a user-level RPC. In this paper, we study an alternative approach for implementing RMI, based on native compilation. This approach allows for better optimization, eliminates the need for processing of type information at run time, and makes a lightweight communication protocol possible. We have built a Java system based on a n...
Piccolo: Building Fast, Distributed Programs with Partitioned Tables
"... Piccolo is a new data-centric programming model for writing parallel in-memory applications in data centers. Unlike existing data-flow models, Piccolo allows computation running on different machines to share distributed, mutable state via a key-value table interface. Piccolo enables efficient appli ..."
Abstract - Cited by 86 (3 self)
Piccolo is a new data-centric programming model for writing parallel in-memory applications in data centers. Unlike existing data-flow models, Piccolo allows computation running on different machines to share distributed, mutable state via a key-value table interface. Piccolo enables efficient application implementations. In particular, applications can specify locality policies to exploit the locality of shared state access and Piccolo’s run-time automatically resolves write-write conflicts using user-defined accumulation functions. Using Piccolo, we have implemented applications for several problem domains, including the PageRank algorithm, k-means clustering and a distributed crawler. Experiments using 100 Amazon EC2 instances and a 12-machine cluster show Piccolo to be faster than existing data flow models for many problems, while providing similar fault-tolerance guarantees and a convenient programming interface.
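The accumulation mechanism is the core of the model, so a small sketch may help. The C program below shows a toy key-value table whose concurrent updates are merged by a user-defined accumulation function (addition, as in PageRank); the table layout and function names are simplifications invented here, not Piccolo's actual C++ interface.

/* Toy partitioned-table sketch: concurrent updates to the same key are
 * merged by a user-supplied accumulation function instead of requiring
 * explicit conflict handling in application code. */
#include <stdio.h>

#define TABLE_SIZE 1024

typedef double (*accumulator_t)(double current, double update);

struct kv_table {
    double        values[TABLE_SIZE];
    accumulator_t accumulate;            /* resolves write-write conflicts */
};

static double sum_accumulator(double current, double update)
{
    return current + update;             /* e.g., summing PageRank contributions */
}

static void table_update(struct kv_table *t, int key, double update)
{
    t->values[key] = t->accumulate(t->values[key], update);
}

int main(void)
{
    struct kv_table ranks = { .accumulate = sum_accumulator };
    table_update(&ranks, 7, 0.25);       /* contribution from one "machine"  */
    table_update(&ranks, 7, 0.50);       /* contribution from another        */
    printf("rank[7] = %.2f\n", ranks.values[7]);   /* prints 0.75 */
    return 0;
}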
SoftFLASH: Analyzing the Performance of Clustered Distributed Virtual Shared Memory
- In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems
, 1996
"... One potentially attractive way to build large-scale shared-memory machines is to use small-scale to medium-scale shared-memory machines as clusters that are interconnected with an off-the-shelf network. To create a shared-memory programming environment across the clusters, it is possible to use a vi ..."
Abstract - Cited by 81 (0 self)
One potentially attractive way to build large-scale shared-memory machines is to use small-scale to medium-scale shared-memory machines as clusters that are interconnected with an off-the-shelf network. To create a shared-memory programming environment across the clusters, it is possible to use a virtual shared-memory software layer. Because of the low latency and high bandwidth of the interconnect available within each cluster, there are clear advantages in making the clusters as large as possible. The critical question then becomes whether the latency and bandwidth of the top-level network and the software system are sufficient to support the communication demands generated by the clusters. To explore these questions, we have built an aggressive kernel implementation of a virtual shared-memory system using SGI multiprocessors and 100 MByte/sec HIPPI interconnects. The system obtains speedups on 32 processors (four nodes, eight
Active Message Applications Programming Interface and Communication Subsystem Organization
, 1995
"... High-performance network hardware is advancing, with multi-gigabit link bandwidths and sub-microsecond switch latencies. Network-interface hardware also continues to evolve, although the design space remains large and diverse. One critical abstraction, a simple, portable, and general-purpose communi ..."
Abstract - Cited by 77 (6 self)
High-performance network hardware is advancing, with multi-gigabit link bandwidths and sub-microsecond switch latencies. Network-interface hardware also continues to evolve, although the design space remains large and diverse. One critical abstraction, a simple, portable, and general-purpose communications interface, is required to make effective use of these increasingly high-performance networks and their capable interfaces. Without a new interface, the software overheads of existing ones will dominate communication costs, and many applications may not benefit from the advancements in network hardware. This document specifies a new active message communications interface for these networks. Its primitives, in essence an instruction set for communications, map efficiently onto underlying network hardware and compose effectively into higher-level protocols and applications. For high-performance implementations, the interface enables direct application-network interface interactions. In...
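The essence of the interface is that a message names a handler that runs on arrival. The self-contained C program below illustrates that idea with a local, simulated delivery path; the function names (am_register, am_request) are invented for this sketch and are not the primitives defined by the specification.

/* Illustration of the active-message idea: a short message carries the
 * index of a handler that runs on arrival and integrates the payload into
 * the computation.  Delivery is simulated locally; a real implementation
 * would hand the message to the network interface. */
#include <stdio.h>

typedef void (*am_handler_t)(int src_node, void *payload, int len);

#define MAX_HANDLERS 16
static am_handler_t handler_table[MAX_HANDLERS];

static void am_register(int idx, am_handler_t h)
{
    handler_table[idx] = h;
}

static void am_request(int dst_node, int handler_idx, void *payload, int len)
{
    (void)dst_node;                       /* simulated: invoke the handler directly */
    handler_table[handler_idx](0, payload, len);
}

static void incr_handler(int src, void *payload, int len)
{
    (void)src; (void)len;
    (*(long *)payload)++;                 /* e.g., a remote counter increment */
}

int main(void)
{
    long counter = 0;
    am_register(0, incr_handler);
    am_request(1, 0, &counter, sizeof counter);
    printf("counter = %ld\n", counter);   /* prints 1 */
    return 0;
}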
MGS: A multigrain shared memory system
- In Proceedings of the 23rd Annual International Symposium on Computer Architecture
, 1996
"... Abstract Parallel workstations, each comprising 10-100 processors, promisecost-effective general-purpose multiprocessing. This paper explores the coupling of such small- to medium-scale shared mem-ory multiprocessors through software over a local area network to synthesize larger shared memory syste ..."
Abstract - Cited by 67 (6 self)
Parallel workstations, each comprising 10-100 processors, promise cost-effective general-purpose multiprocessing. This paper explores the coupling of such small- to medium-scale shared memory multiprocessors through software over a local area network to synthesize larger shared memory systems. We call these systems Distributed Scalable Shared-memory Multiprocessors (DSSMPs). This paper introduces the design of a shared memory system that uses multiple granularities of sharing, and presents an implementation on the Alewife multiprocessor, called MGS. Multigrain shared memory enables the collaboration of hardware and software shared memory, and is effective at exploiting a form of locality called multigrain locality. The system provides efficient support for fine-grain cache-line sharing, and resorts to coarse-grain page-level sharing only when locality is violated. A framework for characterizing application performance on DSSMPs is also introduced. Using MGS, an in-depth study of several shared memory applications is conducted to understand the behavior of DSSMPs. We find that unmodified shared memory applications can exploit multigrain sharing. Keeping the number of processors fixed, applications execute up to 85% faster when each DSSMP node is a multiprocessor as opposed to a uniprocessor. We also show that tightly-coupled multiprocessors hold a significant performance advantage over DSSMPs on unmodified applications. However, a best-effort implementation of a kernel from one of the applications allows a DSSMP to almost match the performance of a tightly-coupled multiprocessor. 1 Introduction Large-scale shared memory multiprocessors have traditionally been built using custom communication interfaces, high performance VLSI networks, and special-purpose hardware support for shared memory. These systems achieve good performance on a wide range of applications; however, they are costly. Despite attempts to make cost (in addition to performance) scalable, fundamental obstacles prevent large tightly-coupled systems from being cost effective. Power distribution, clock distribution, cooling, and other packaging
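The multigrain policy itself is simple enough to sketch. The C fragment below, using an invented page-to-node mapping and a stubbed software slow path, shows the decision an MGS-like DSSMP makes: rely on hardware cache-line coherence within a node and fall back to software page-level sharing only when an access crosses node boundaries.

/* Sketch of the multigrain decision in an MGS-style DSSMP.  Accesses that
 * stay inside a hardware shared-memory node use cache-line coherence for
 * free; accesses that cross node boundaries take the software page-level
 * path.  The page-to-node mapping and slow path are invented stand-ins. */
#include <stdint.h>

#define PAGE_SHIFT 12
#define NUM_NODES  4

static int node_of_page(uintptr_t page)         /* illustrative static distribution */
{
    return (int)(page % NUM_NODES);
}

static void software_page_fetch(uintptr_t page)
{
    (void)page;  /* a real system would run the page-level DSM protocol here */
}

static inline long shared_load(int my_node, const long *addr)
{
    uintptr_t page = (uintptr_t)addr >> PAGE_SHIFT;
    if (node_of_page(page) != my_node)          /* coarse grain only across nodes */
        software_page_fetch(page);
    return *addr;                               /* fine grain (hardware) within a node */
}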