Using Network Interface Support to Avoid Asynchronous Protocol Processing in Shared Virtual Memory Systems
In Proceedings of the 26th International Symposium on Computer Architecture, 1999
Cited by 42 (7 self)
The performance of page-based software shared virtual memory (SVM) is still far from that achieved on hardware-coherent distributed shared memory (DSM) systems. The interrupt cost for asynchronous protocol processing has been found to be a key source of performance loss and complexity. This paper shows that by providing simple and general support for asynchronous message handling in a commodity network interface (NI), and by altering SVM protocols appropriately, protocol activity can be decoupled from asynchronous message handling and the need for interrupts or polling can be eliminated. The NI mechanisms needed are generic, not SVM-dependent. They also require neither visibility into the node memory system nor code instrumentation to identify memory operations. We prototype the mechanisms and such a synchronous home-based LRC protocol, called GeNIMA (GEneral-purpose Network Interface support in a shared Memory Abstraction), on a cluster of SMPs with a programmable NI, though the mechan...
Fine-Grain Distributed Shared Memory on Clusters of Workstations
1997
Cited by 30 (10 self)
Shared memory, one of the most popular models for programming parallel platforms, is becoming ubiquitous both in low-end workstations and high-end servers. With the advent of low-latency networking hardware, clusters of workstations strive to offer the same processing power as high-end servers for a fraction of the cost. In such environments, shared memory has been limited to page-based systems that control access to shared memory using the memory's page protection to implement shared memory coherence protocols. Unfortunately, false sharing and fragmentation problems force such systems to resort to weak consistency shared memory models that complicate the shared memory programming model.
Sirocco: Cost-Effective Fine-Grain Distributed Shared Memory
In Proceedings of the 6th International Conference on Parallel Architectures and Compilation Techniques, 1998
Cited by 19 (4 self)
Software fine-grain distributed shared memory (FGDSM) provides a simplified shared-memory programming interface with minimal or no hardware support. Originally software FGDSMs targeted uniprocessor-node parallel machines. This paper presents Sirocco, a family of software FGDSMs implemented on a network of low-cost SMPs. Sirocco takes full advantage of SMP nodes by implementing inter-node sharing directly in hardware and overlapping computation with protocol execution. To maintain correct shared-memory semantics, however, SMP nodes require mechanisms to guarantee atomic coherence operations. Multiple SMP processors may also result in contention for shared resources and reduce performance. SMP nodes also impact the cost trade-off. While SMPs typically command higher price premiums, for a given system size SMP nodes substantially reduce networking hardware requirements compared to uniprocessor nodes. In this paper
Toward A Cost-Effective DSM Organization That Exploits Processor-Memory Integration
2000
Cited by 15 (0 self)
Dramatic increases in the number of transistors that can be integrated on a VLSI chip will soon allow commodity microprocessors to include both processor and a sizable fraction of main memory on chip. Distributed Shared-Memory (DSM) multiprocessors typically use the latest off-the-shelf microprocessors and thus will be affected by the upcoming processor-memory integration. In this paper, we explore how a cache-coherent DSM machine built around Processor-In-Memory (PIM) chips might be cost-effectively organized. To take advantage of the close coupling between processor and memory, we propose tagging the memory and organizing it as a cache. Furthermore, commercial considerations dictate the use of off-the-shelf hardware largely designed for uniprocessors. Consequently, we keep the directory control off-chip. To keep the multiprocessor cheap and simple, and to allow for reconfigurability, directory control is performed by chips that are identical to the ones used as compute nodes. As a result, ...
A Programming Model for Block-Structured Scientific Calculations on SMP Clusters
Ph.D. Dissertation, UCSD, 1998
Improving the Performance of Shared Virtual Memory on System Area Networks
1998
Cited by 10 (2 self)
As clusters of workstations, uniprocessor or symmetric multiprocessors (SMPs), become important platforms for parallel computing, there is increasing research interest in supporting the attractive, shared address space programming model across them in software. The reason is that it may provide successful low-cost, high-performance alternatives to both tightly-coupled, hardware-coherent distributed shared memory machines and to scalable servers. In both these cases, the clusters are formed with off-the-shelf, high-end PCs or workstations and system area networks that track technologies well. Given that a shared memory abstraction is an attractive programming model for this architecture, there has been a lot of research in fast communication on clusters connected with system area networks and in protocols for supporting software shared memory across them. However, the end performance of applications that were written for the more proven hardware-coherent shared memory is still not...
Responsiveness without Interrupts
1999
Cited by 8 (0 self)
this paper is a characterization of the delays actually observed in a suite of applications. We show that the majority of notification delays result from a small number of large delays. These delays can dominate any gains achieved through use of new network technologies. The impact of these delays can be considerable. Our applications averaged more than 31% slower without interrupts than with them. This result argues that the problem is serious, and needs to be addressed either by including interrupts in emerging standards, or through use of the techniques discussed below
A Reconfigurable Extension to the Network Interface of Beowulf Clusters
2001
Cited by 8 (5 self)
With a focus on commodity PC systems, Beowulf clusters traditionally lack the cutting-edge network architectures, memory subsystems, and processor technologies found in their more expensive supercomputer counterparts. What Beowulf clusters lack in technology, they more than make up for with their significant cost advantage over traditional supercomputers. We propose an architectural extension that adds reconfigurable computing to the network interface of Beowulf clusters. This enhances both the network and processor capabilities of the cluster. Furthermore, for some applications, the proposed extension partially compensates for weaknesses in the PC memory subsystem. We discuss two applications, the 2D Fast Fourier Transform (FFT) and integer sorting, which benefit from the resulting architecture.
Cost Effectiveness of an Adaptable Computing Cluster
2001
Cited by 6 (2 self)
With a focus on commodity PC systems, Beowulf clusters traditionally lack the cutting-edge network architectures, memory subsystems, and processor technologies found in their more expensive supercomputer counterparts. What Beowulf clusters lack in technology, they more than make up for with their significant cost advantage over traditional supercomputers. This paper presents the cost implications of an architectural extension that adds reconfigurable computing to the network interface of Beowulf clusters. A quantitative notion of cost-effectiveness is formulated to evaluate computing technologies. Here, cost-effectiveness is considered in the context of two applications: the 2D Fast Fourier Transform (2D-FFT) and integer sorting.