The Relative Importance of Concurrent Writers and Weak Consistency Models
In Proceedings of the 16th International Conference on Distributed Computing Systems, 1996
Cited by 95 (20 self)
This paper presents a detailed comparison of the relative importance of allowing concurrent writers versus the choice of the underlying consistency model. Our comparison is based on single- and multiple-writer versions of a lazy release consistent (LRC) protocol, and a single-writer sequentially consistent protocol, all implemented in the CVM software distributed shared memory system. We find that in our environment, which we believe to be representative of distributed systems today and in the near future, the consistency model has a much higher impact on overall performance than the choice of whether to allow concurrent writers. The multiple-writer protocol performs an average of 9% better than the single-writer LRC protocol, but 34% better than the single-writer sequentially consistent protocol. Set against this, MW-LRC required an average of 72% memory overhead, compared to 10% overhead for the single-writer protocols.
SIMPLE: A methodology for programming high performance algorithms on clusters of symmetric multiprocessors (SMPs)
Journal of Parallel and Distributed Computing, 1999
Cited by 52 (13 self)
We describe a methodology for developing high performance programs running on clusters of SMP nodes. Our methodology is based on a small kernel (SIMPLE) of collective communication primitives that make efficient use of the hybrid shared and message passing environment. We illustrate the power of our methodology by presenting experimental results for sorting integers, two-dimensional fast Fourier transforms (FFT), and constraint-satisfied searching. Our testbed is a cluster of DEC AlphaServer 2100 4/275 nodes interconnected by an ATM switch.
Online Data-Race Detection via Coherency Guarantees
1996
Cited by 43 (4 self)
We present the design and evaluation of an on-the-fly data-race-detection technique that handles applications written for the lazy release consistent (LRC) shared memory model. We require no explicit association between synchronization and shared memory. Hence, shared accesses have to be tracked and compared at the minimum granularity of data accesses, which is typically a single word. The novel aspect of this system is that we are able to leverage information used to support the underlying memory abstraction to perform on-the-fly data-race detection, without compiler support. Our system consists of a minimally modified version of the CVM distributed shared memory system, and instrumentation code inserted by the ATOM code re-writer. We present an experimental evaluation of our technique by using our system to look for data races in four unaltered programs. Our system correctly found read-write data races in a program that allows unsynchronized read access to a global tour bound, and a write-write race in a program from a standard benchmark suite. Overall, our mechanism reduced program performance by approximately a factor of two.
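As a rough illustration of comparing word-granularity accesses between unsynchronized intervals, the sketch below flags conflicting accesses in intervals that are not ordered by synchronization. It is not the paper's implementation: the `Interval` record, the vector-timestamp ordering rule, and the `find_races` helper are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Interval:
    """Hypothetical record of one process's accesses between two synchronization points."""
    proc: int                                   # id of the process that ran the interval
    clock: tuple                                # clock[p] = latest interval of process p seen so far
    reads: set = field(default_factory=set)     # word addresses read
    writes: set = field(default_factory=set)    # word addresses written


def precedes(a, b):
    """Assumed ordering rule: a precedes b if b has already seen a's interval index."""
    return b.clock[a.proc] >= a.clock[a.proc]


def find_races(intervals):
    """Report conflicting word accesses in intervals of different processes that are unordered."""
    races = []
    for a, b in combinations(intervals, 2):
        if a.proc == b.proc or precedes(a, b) or precedes(b, a):
            continue  # same process, or ordered by synchronization
        write_write = a.writes & b.writes
        read_write = ((a.reads & b.writes) | (a.writes & b.reads)) - write_write
        races += [("write-write", addr, a.proc, b.proc) for addr in sorted(write_write)]
        races += [("read-write", addr, a.proc, b.proc) for addr in sorted(read_write)]
    return races


# Process 0 writes word 16; process 1 reads it once without synchronizing,
# and once after having seen process 0's interval.
i0 = Interval(proc=0, clock=(1, 0), writes={16})
i1 = Interval(proc=1, clock=(0, 1), reads={16})
i2 = Interval(proc=1, clock=(1, 2), reads={16}, writes={32})
print(find_races([i0, i1, i2]))
```

Only the unsynchronized read in `i1` is reported; the read in `i2` is ordered after the write because its timestamp already covers process 0's interval.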
A Protocol-Centric Approach to On-The-Fly Race Detection
1998
Cited by 16 (1 self)
We present the design and evaluation of a new data-race-detection technique. Our technique executes at runtime rather than post-mortem, and handles unmodified shared-memory applications that run on top of CVM, a software distributed shared memory system. We do not assume explicit associations between synchronization and shared data, and require neither compiler support nor program source. Instead, we use a binary code re-writer to instrument instructions that may access shared memory. The most novel aspect of our system is that we are able to use information from the underlying memory system implementation in order to reduce the number of comparisons made at run time. We present an experimental evaluation of our techniques by using our system to look for data races in five common shared-memory programs. We quantify the effect of several optimizations to the basic technique: data flow analysis, instrumentation batching, runtime code modification, and instrumentation inlining. Our syste...
Strings: A High-Performance Distributed Shared Memory for Symmetrical Multiprocessor Clusters
In Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing, 1998
Cited by 14 (7 self)
This paper describes Strings, a multi-threaded DSM developed by us. The distinguishing feature of Strings is that it incorporates POSIX.1c threads multiplexed on kernel light-weight processes for better performance. The kernel can schedule multiple threads across multiple processors using these lightweight processes. Thus, Strings is designed to exploit data parallelism at the application level and task parallelism at the DSM system level. We show how using multiple kernel threads can improve the performance even in the presence of false sharing, using matrix multiplication as a case study. We also show the performance results with benchmark programs from the SPLASH-2 suite [17]. Though similar work has been demonstrated with SoftFLASH [18], our implementation is completely in user space and thus more portable. Some other research has studied the effect of clustering in SMPs using simulations [19]. We have shown results from runs on an actual network of SMPs.
Exploring Thread-Level Speculation in Software: The Effects of Memory Access Tracking Granularity
2001
Cited by 6 (0 self)
Speculative execution is often the only way to overcome dataflow-imposed limitations and exploit parallelism when dependences can be discovered only at run-time. It also facilitates automatic parallelization of programs that exhibit complicated memory access patterns, which make complete compile-time dependence analysis either impossible or extremely complicated.
Parallel and Distributed Programming with Pthreads and Rthreads
IPPS/SPDP International Parallel Processing Symposium & 9th Symposium on Parallel and Distributed Processing, 1998
Cited by 5 (3 self)
This paper describes Rthreads (Remote threads), a software distributed shared memory system that supports sharing of global variables on clusters of computers with physically distributed memory. Other DSM systems either use virtual memory to implement coherence on networks of workstations or require programmers to adopt a special programming model. Rthreads uses primitives to read and write remote data and to synchronize remote accesses similar to the DSM systems that are based on special programming models. Unique aspects of Rthreads are: The primitives are syntactically and semantically closely related to the POSIX thread model (Pthreads). A precompiler automatically transforms Pthreads (source) programs into Rthreads (source) programs. After the transformation the programmer is still able to alter the Rthreads code for optimizing run-time. Moreover, Pthreads and Rthreads can be mixed within a single program. We support heterogeneous workstation clusters by implementing the Rthreads ...
Design Issues for a High-Performance Distributed Shared Memory on Symmetrical Multiprocessor Clusters
Journal of Networks, Software Tools and Applications, 1999
Cited by 3 (3 self)
This paper describes Strings, a high performance distributed shared memory system designed for SMP clusters. The distinguishing feature of this system is the use of a fully multi-threaded runtime system, using kernel-level threads. Strings allows multiple application threads to be run on each node in a cluster. Since most modern UNIX systems can multiplex these threads on kernel-level lightweight processes, applications written using Strings can exploit multiple processors on an SMP machine. This paper describes some of the architectural details of the system and illustrates the performance improvements with benchmark programs from the SPLASH-2 suite, some computational kernels, and a full-fledged application. It is found that using multiple processes on SMP nodes provides good speedups only for a few of the programs. Multiple application threads can improve the performance in some cases, but other programs show a slowdown. If kernel threads are used additionally, the overall performance improves significantly in all programs tested. Other design decisions also have a beneficial impact, though to a lesser degree.
Multicast-based Runtime System for Highly Efficient Causally Consistent Software-only DSM
In Lecture Notes in Computer Science 1586, IPPS/SPDP’99 Workshops, 1999
Cited by 2 (1 self)
This paper introduces the application of IP multicasting to enhance software-only DSM systems and, at the same time, to simplify the programming model by offering a simple memory consistency model. The described algorithm is the foundation of a runtime system implemented as filesystems for the Windows NT and FreeBSD operating systems.
Keywords: distributed shared memory, DSM, memory coherence protocol, IPv4/IPv6 multicasting, causal consistency, vector logical clock
1 Introduction. Software distributed shared memory (DSM) realized on a network of workstations connected through a conventional computer network has gained a lot of attention in both research and industry. On the other hand, group communication, also known as multicasting, on IP networks is becoming widespread over the Internet. The new IP generation, IPv6, relies heavily on the availability of multicasting. This paper presents the idea of an application of IP multicasting for the purpose of coherence tr...
Adaptive Prefetching Technique for Shared Virtual Memory
Cited by 2 (0 self)
Though shared virtual memory (SVM) systems promise low-cost solutions for high performance computing, they suffer from long memory latencies. These latencies are usually caused by repetitive invalidations on shared data. Since shared data are accessed through synchronizations and the patterns by which threads synchronize are repetitive, a prefetching scheme based on such repetitiveness would reduce memory latencies. Based on this observation, we propose a prefetching technique which predicts future access behavior by analyzing access history per synchronization variable. Our technique was evaluated on an 8-node SVM system using the SPLASH-2 benchmark. The results show that our technique could achieve a 34% to 45% reduction in memory access latencies.
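The idea of keeping access history per synchronization variable can be sketched as below. This is a minimal illustration under stated assumptions, not the predictor from the paper: the `SyncHistoryPrefetcher` class, its `min_repeats` threshold, and the `on_acquire` / `on_page_fault` hooks are hypothetical names introduced here.

```python
from collections import defaultdict


class SyncHistoryPrefetcher:
    """Hypothetical predictor: remembers which pages faulted after each synchronization variable."""

    def __init__(self, min_repeats=2):
        # Assumed policy: a page is prefetched once it has faulted at least
        # min_repeats times following acquires of the same variable.
        self.min_repeats = min_repeats
        self.fault_counts = defaultdict(lambda: defaultdict(int))  # sync_var -> page -> count
        self.current_var = None

    def on_acquire(self, sync_var):
        """Called when a thread acquires sync_var; returns the pages predicted to be needed."""
        self.current_var = sync_var
        history = self.fault_counts[sync_var]
        return sorted(page for page, count in history.items() if count >= self.min_repeats)

    def on_page_fault(self, page):
        """Called on a remote page fault; charges the fault to the most recent acquire."""
        if self.current_var is not None:
            self.fault_counts[self.current_var][page] += 1


prefetcher = SyncHistoryPrefetcher()
first = prefetcher.on_acquire("lock_a")    # no history yet
prefetcher.on_page_fault(3)
prefetcher.on_page_fault(7)
second = prefetcher.on_acquire("lock_a")   # each page seen only once so far
prefetcher.on_page_fault(3)
third = prefetcher.on_acquire("lock_a")    # page 3 has now repeated
print(first, second, third)
```

Keying the history by synchronization variable rather than by thread is what lets the repetition in the synchronization pattern drive the prediction; the counting threshold is one simple choice among many.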

