52 citations found.
Steven K. Reinhardt, Robert W. Pfile, and David A. Wood. Decoupled hardware support for distributed shared memory. May 1996.


This paper is cited in the following contexts:


Scalable Inter-Cluster Communication Systems for Clustered.. - Jiang, Yeung (1997)   (Correct)

....these small to medium scale multiprocessors as clusters, and the MPPs built by assembling these clusters together as clustered multiprocessors. Though people have proposed to construct scalable multiprocessors by integrating commodity computer clusters, many chose to build custom hardware ([4], [11], and [7]) to provide efficient, protected inter cluster communication. While these systems achieve impressive performance, their inclusion of complicated custom hardware dramatically increases the cost and design time of the system, which is undesirable. This paper studies the problem of how to ....

....are getting faster. Welsh et al. [14] have studied how to provide fast user level communication on Fast Ethernet and ATM. Their results are impressive, but to provide efficient inter cluster communication for parallel applications, parallelism still has to be exploited. Much work ([4], [11], and [7]) has been done to provide efficient, protected inter cluster communication by building custom hardware. We focus on commodity hardware given the enormous cost advantage of leveraging commodity components. 6 Summary This paper describes issues of how to design scalable, efficient, ....

Steven K. Reinhardt, Robert W. Pfile, and David A. Wood. Decoupled Hardware Support for Distributed Shared Memory. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 34--43, May 1996.


Fine-Grain Distributed Shared Memory on Clusters of Workstations - Schoinas (1997)   (3 citations)  (Correct)

....access patterns. While other hardware shared memory systems integrate message passing and shared memory [HGDG94] none offers the same flexibility to develop application specific protocols as user (rather than system) libraries. High end Tempest implementations such as the Typhoon designs [RLW94,RPW96] include extensive hardware support for fine grain access control and protocol actions. Such designs achieve performance competitive to other hardware shared memory approaches [RPW96] but still offer Tempest's flexibility to support coherence protocols. Blizzard, however, implements coherence in ....

....as user (rather than system) libraries. High end Tempest implementations such as the Typhoon designs [RLW94,RPW96] include extensive hardware support for fine grain access control and protocol actions. Such designs achieve performance competitive to other hardware shared memory approaches [RPW96] but still offer Tempest's flexibility to support coherence protocols. Blizzard, however, implements coherence in software but maintains fine grain access semantics either in software or hardware using mostly commodity hardware and software. Tempest's flexibility to support custom coherence ....

[Article contains additional citation context not shown here]

Steven K. Reinhardt, Robert W. Pfile, and David A. Wood. Decoupled hardware support for distributed shared memory. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.


Exploiting Instruction-Level Parallelism for Memory System.. - Pai (2000)   (Correct)

....simple processors, but neither work validates these approximations. The Wisconsin Wind Tunnel II uses a more detailed analysis at the basic block level that accounts for pipeline latencies and functional unit resource constraints to model a superscalar HyperSPARC processor [FW97, MRF 97, RPW96]. However, this model does not account for memory overlap, which, as our results show, is an important factor in determining the behavior of more aggressive ILP processors. Brooks et al. describe the Cerberus Multiprocessor Simulator, a parallelized instruction-driven simulator for single issue ....

Steven K. Reinhardt, Robert W. Pfile, and David A. Wood. Decoupled Hardware Support for Distributed Shared Memory. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 34--43, May 1996.


Limits on the Performance of Software Shared Memory: A.. - Bilas, Jiang, Zhou..   (Correct)

....levels except in the network links and switches themselves. The processor has a P6 like instruction set, and is assumed to be a 1 IPC processor. As discussed earlier, for SC we assume hardware access control at any desired (but fixed) power of two granularity, like in the Typhoon zero prototype [28] but with a more efficient simulated mechanism. The data cache hierarchy consists of a 8 KBytes first-level direct mapped write through cache and a 512 KBytes second level two way set associative cache, each with a line size of 32 Bytes. The write buffer [27] has 26 entries, 1 cache line wide ....

....closely approximate those that are found in the real handlers in an SVM implementation. 2.2.4 Simulator Validation We validated the simulator in two ways. First, by setting the communication architecture parameters to be similar (relative to processor speed) to those on the Typhoon zero platform [28], we were able to compare its results with a previous comparison of page grained HLRC and fine grained SC. Taking into account the higher hardware access control time on Typhoon zero, we found that the results for the comparison were surprisingly close for all applications. Second, we compared ....

Steven K. Reinhardt, Robert W. Pfile, and David A. Wood. Decoupled hardware support for distributed shared memory. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.


Applying Programming Language Implementation Techniques to.. - Schnarr (2000)   (2 citations)  (Correct)

....of the target machine. For example, ATUM uses special microcoded versions of load store instructions to perform cache simulation without modifying the target executable [2] WWT uses existing hardware features (e.g. ECC bit on memory) or special purpose hardware (i.e. the Typhoon 0 hardware [63]) to simulate cache coherent shared memory. 16 This approach can execute programs quickly, but building hardware is expensive and time consuming, and existing hardware lacks the flexibility to handle many kinds of simulation. A common software approach to accelerating simulation is to translate ....

Steven K. Reinhardt, Robert W. Pfile, and David A. Wood, "Decoupled Hardware Support for Distributed Shared Memory," in the Proceedings of the 23rd International Symposium on Computer Architecture (ISCA), May 1996.


Fine-Grain Protocol Execution Mechanisms & Scheduling Policies on .. - Falsafi (1998)   (Correct)

.... state, and tightly integrate them with the request and message queues and one or more embedded processors on a custom board [K 94,RLW94]. Less integrated designs with hardware support for only performance critical resources are also feasible based on the desired cost and performance trade offs [RPW96,BLA 94]. 2.2 Protocol Execution Semantics There are two types of protocol execution semantics: single threaded and multithreaded. This section describes the two execution semantics and discusses their advantages disadvantages. 2.2.1 Single Threaded Protocol Execution Single threaded ....

....and typically only access a fine-grain (e.g. 32-256 bytes) memory block and update the corresponding protocol state. As such, these protocols are likely to incur minimal cache interference under a multiplexed policy. Network interfaces equipped with data caches (such as Typhoon's block buffer [RPW96] or CNI's cachable queues [MFHW96]) allow protocols to leave the protocol data in the network interface cache. A requesting (computation) processor may directly load the data from the network interface cache reducing the protocol thread's cache interference with computation. Protocols may also ....

[Article contains additional citation context not shown here]

Steven K. Reinhardt, Robert W. Pfile, and David A. Wood. Decoupled hardware support for distributed shared memory. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.


Accelerating Shared Virtual Memory via General-purpose.. - Bilas, Jiang, al. (2001)   (1 citation)  (Correct)

....refers only to wire latency, whereas in [50] latency is an end-to-end metric including host overhead and packet processing cost. Various types of hardware support to accelerate protocols have been examined for SVM in [25] and [35] and for fine grained software DSM in the Typhoon zero prototype [41]. In [30] Karlsson et al. find that the latency and bandwidth of an ATM switch is acceptable in a clustered SVM architecture. In [33] a Lazy Release Consistency protocol for hardware cache coherence is presented. In a different context, they find that applications are more sensitive to the ....

S. K. Reinhardt, R. W. Pfile, and D. A. Wood. Decoupled hardware support for distributed shared memory. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 34--43, New York, May 22--24, 1996. ACM Press.


Limits to the Performance of Software Shared Memory: A.. - Bilas, Jiang, Zhou.. (1997)   (1 citation)  (Correct)

....Figure 2: Simulated node architecture. The fine grained access control needed for FG can be provided via either code instrumentation [18, 17] or hardware support [16]. Code instrumentation is also used for polling to handle asynchronous incoming messages, which would otherwise cause expensive and frequent interrupts (interrupts are much less frequent in the coarser grained SVM, so the tradeoffs between interrupts and polling are less clear 1 Protocols using ....

....software FG assuming very efficient hardware access control. The cost of each protocol handler is computed according to the protocol task it performs. The simulator has been validated against real system implementations for both FG (by setting parameters close to those of the Typhoon zero system [16] and comparing with it) and SVM for our real cluster [1]. The results, omitted for space reasons, are surprisingly accurate. Applications: Table 1 shows the applications and the problem sizes we use in this work. These applications are written for hardware DSM and they are known to deliver ....

[Article contains additional citation context not shown here]

S. K. Reinhardt, R. W. Pfile, and D. A. Wood. Decoupled hardware support for distributed shared memory. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 34--43, New York, May 22--24, 1996. ACM Press.


Performance Portability and Scalability in Shared-Address-Space.. - Jiang (2000)   (Correct)

....software rather than hardware in tightly coupled systems such as the SGI Origin2000 discussed in Chapter 2. Studies on software coherent shared address space multiprocessors have largely used applications as they were written for hardware cache coherent machines. The performance evaluations so far [43, 22, 46, 61, 31, 89, 35, 8] point out that for certain classes of applications there is a large performance gap between hardware cache coherent and software coherent systems. However, it should be possible to modify or restructure applications to interact better with software coherence protocols and granularities, and to ....

....some programmer intervention. The granularities used are 64 bytes in all other cases than the regular applications: FFT (4 KBytes), LU (4 KBytes), and Ocean (1 KBytes). The fine grained access control needed for FG can be provided via either code instrumentation [68, 66] or hardware support [61]. Code instrumentation is also used for polling to handle asynchronous incoming messages, which would otherwise cause expensive and frequent interrupts (interrupts are much less frequent in the coarser 1 Protocols using more complex, delayed consistency or single writer eager release consistency ....

[Article contains additional citation context not shown here]

S. K. Reinhardt, R. W. Pfile, and D. A. Wood. Decoupled hardware support for distributed shared memory. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 34--43, New York, May 22--24, 1996. ACM Press.


Mechanisms for Efficient Shared-Memory, Lock-Based Synchronization - Kagi (1999)   (2 citations)  (Correct)

.... of dual mappings appear also in the Thinking Machines CM 5 [TMC91] to send commands to the vector units, in the AP1000 multicomputer to initiate data transfer, in the Stanford FLASH multiprocessor [HGDG94] to initiate user level DMA transfers, and in Wisconsin's Typhoon 0 prototype [RPW96] to modify the fine grain access control bits. The first careful description and study is due to the members of the Princeton SHRIMP project [BDFL96, BLA 94]. They use these ideas to support very efficient user level DMA transfers. The SHRIMP system uses user level DMA transfers to let ....

Steven K. Reinhardt, Robert W. Pfile, and David A. Wood. Decoupled hardware support for distributed shared memory. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 34--43, May 1996.


Hierarchical Directory Controllers In The Numachine Multiprocessor - Grbic (1996)   (1 citation)  (Correct)

....The OBIC ASIC interfaces to the bus, implements the snooping bus logic and manages the remote cache. The SCLIC ASIC contains a programmable protocol engine which implements the directory-based coherence protocol. The DataPump provides the protocol for the SCI network. 2.4.6 Typhoon 0 Typhoon 0 [21][23] is a part of the Wisconsin Wind Tunnel project aimed at a parallel programming interface called Tempest. This interface provides shared memory and message passing which can be built on a variety of parallel computers. Typhoon is a Tempest implementation on high performance custom hardware using a ....

S. K. Reinhardt, R. W. Pfile and D. A. Wood, "Decoupled Hardware Support for Distributed Shared Memory," Proceedings of the 23rd Annual International Symposium on Computer Architecture, Philadelphia, PA, May 1996, pp. 34-43.


Limits to the Performance of Software Shared Memory: A.. - Angelos Bilas Dongming (1997)   (1 citation)  (Correct)

....node architecture. The fine grained access control needed for FG can be provided via either code instrumentation [7, 18] or hardware support [17]. (Footnote 1: Protocols using more complex, delayed consistency or single-writer eager release consistency were found to perform only a little better in [22].) Code instrumentation is also used for polling to handle asynchronous incoming messages, which would otherwise cause expensive and frequent interrupts (interrupts are much less frequent in the coarser grained SVM, so the tradeoffs between interrupts and polling are less clear there). Since we do ....

....software FG assuming very efficient hardware access control. The cost of each protocol handler is computed according to the protocol task it performs. The simulator has been validated against real system implementations for both FG (by setting parameters close to those of the Typhoon zero system [17] and comparing with it) and SVM for our real cluster [1]. The results, omitted for space reasons, are surprisingly accurate. Applications: Table 1 shows the applications and the problem sizes we use in this work. These applications are written for hardware DSM and they are known to deliver ....

[Article contains additional citation context not shown here]

S. K. Reinhardt, R. W. Pfile, and D. A. Wood. Decoupled hardware support for distributed shared memory. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 34--43, New York, May 22--24, 1996. ACM Press.


Toward A Cost-Effective DSM Organization That Exploits.. - Torrellas, Yang, Nguyen (2000)   (6 citations)  (Correct)

....of the main memory is on the processor chip, while part of it is off chip. Eventually, most of the memory may be on chip. In these designs, processor memory communication is fairly inexpensive. In the past, there have been efforts trying to use commodity workstations as nodes to build DSM machines [2, 11, 14, 17]. In this paper, we explore how to design a cache coherent DSM machine around commodity Processor-In-Memory (PIM) chips like the one in Figure 1 (c). In our opinion, such a design should be guided by several principles. First, given the close coupling between processor and on chip memory (20 30 ....

S. Reinhardt, R. Pfile, and D. Wood. Decoupled Hardware Support for Distributed Shared Memory. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 34-43, May 1996.


Design and Evaluation of Network Interfaces for System Area.. - Mukherjee (1998)   (Correct)

....aggressive CNI that is, CNI i Q m in my nomenclature exploits all eight opportunities for performance improvement outlined above. Chapter 3 develops and optimizes two mechanisms that CNIs use to communicate with processors. A cachable device register derived from cachable control registers [101] is a coherent, cachable block of memory used to transfer status, control, or data between a CNI and a processor. Cachable queues generalize cachable device registers from one cachable, coherent memory block to a contiguous region of cachable, coherent blocks managed as a circular queue. An ....

....main memory and the second hop from main memory to the processor cache. Princeton's UDMA dramatically reduces the DMA initiation overhead on the send side by allowing processors to initiate DMA directly through a two instruction sequence from user space without OS intervention. Reinhardt, et al. [101] demonstrated that UDMA can be used as cheaply on the receive side as well. Unfortunately, the UDMA initiation scheme suffers from side effects (Section 2.4) and, like traditional DMA, transfers data in two hops. Transferring data between processor caches and ULNIs through cache block transfers, ....

[Article contains additional citation context not shown here]

Steven K. Reinhardt, Robert W. Pfile, and David A. Wood. Decoupled Hardware Support for Distributed Shared Memory. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.


Binding Time in Distributed Shared Memories - Kong (1999)   (Correct)

....into two parts: space always allocated for memory lines as in CC NUMA and (unallocated) space utilized for data replication as in DVSM. All of the schemes discussed so far could still apply for the hybrid use of the local memory. For example, the Stache approach [RLW94] implemented in Typhoon 0 [RPW96] uses the unallocated space as the local memory of S COMA, i.e. it allocates memory at page granularity in software and manages coherence at line granularity in hardware. The above memory architectures do not allow the physical address of a memory location to be changed. Thus, to rearrange a ....

Steven K. Reinhardt, Robert W. Pfile, and David A. Wood. Decoupled hardware support for distributed shared memory. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 34--43, Philadelphia, Pennsylvania, May 22--24, 1996.


Memory Organizations in Hybrid DSM: A Performance Comparison - Moga, Gefflaut, Dubois (1997)   (Correct)

....explored. In these approaches, some hardware support is provided to assist the software. Some examples include support for remote writes in the SHRIMP project [3], support for remote reads and writes in the Cashmere project [12], and support for fine grain sharing in Blizzard E [22], Typhoon 0 [19], and START NG [5]. This trend goes all the way to adding a dedicated processor to execute the software protocol and interact tightly with the network interface, as in Typhoon [18] [19] or Flash [14]. This paper explores three memory organizations for hybrid DSM systems. These hybrid architectures ....

.... reads and writes in the Cashmere project [12] and support for fine grain sharing in Blizzard E [22] Typhoon 0 [19] and START NG [5] This trend goes all the way to adding a dedicated processor to execute the software protocol and interact tightly with the network interface, as in Typhoon [18] [19] or Flash [14] This paper explores three memory organizations for hybrid DSM systems. These hybrid architectures are inspired from three hardware DSMs: CC NUMA, COMA [24] and Simple COMA [20] and are respectively called SCC NUMA (Software CC NUMA) SC COMA (Software Controlled COMA) and SS COMA ....

[Article contains additional citation context not shown here]

Steven K. Reinhardt, Robert W. Pfile, David A. Wood. Decoupled Hardware Support for Distributed Shared Memory. 23rd International Symposium on Computer Architecture (ISCA), May 1996.


Hardware vs. Software Implementation of COMA - Moga, Gefflaut, Dubois (1997)   (Correct)

....to study a hybrid COMA is motivated, in part, by the intuition that a COMA maximizes node hit rates, hence reducing protocol engine occupancy. Another reason is that studies involving NUMA protocols in hybrid architectures with varying degrees of hardware aggressiveness are plentiful [10][8][22]. In the current version of our system, the targeted platform is a network of uniprocessor workstations. Hence, the protocol runs on the main processor, but the solution could easily be adapted if a dedicated processor is available. Fine grain memory access checking support and the controller for ....

....processor is available. Fine grain memory access checking support and the controller for a set-associative memory are incorporated in a single, relatively simple functional unit. This could be easily integrated in the memory controller or just be plugged into the local bus, like in Typhoon 0 [22] or START NG [5]. Our results show that a software implemented protocol engine can perform very well, even when compared to an ideal hardware implementation. Addressing some of the concerns about performing protocol actions on the main processor [10] [11] we show that efficient switching of the ....

[Article contains additional citation context not shown here]

S.K. Reinhardt, R.W. Pfile, D.A. Wood. Decoupled Hardware Support for Distributed Shared Memory. Proc. of the 23rd International Symposium on Computer Architecture, May 1996.


The Impact of Memory Organization in Hybrid DSM - Moga, Gefflaut, Dubois (1997)   (Correct)

....explored. In these approaches, some hardware support is provided to assist the software. Some examples include support for remote writes in the SHRIMP project [1], support for remote reads and writes in the Cashmere project [12], and support for fine grain sharing in Blizzard E [23], Typhoon 0 [20], and START NG [3]. This trend goes all the way to adding a dedicated processor to execute the software protocol and interact tightly with the network interface, as in Typhoon [19] [20] or Flash [14]. This paper explores four memory organizations for hybrid DSM systems. These hybrid architectures ....

.... reads and writes in the Cashmere project [12], and support for fine grain sharing in Blizzard E [23], Typhoon 0 [20], and START NG [3]. This trend goes all the way to adding a dedicated processor to execute the software protocol and interact tightly with the network interface, as in Typhoon [19] [20] or Flash [14]. This paper explores four memory organizations for hybrid DSM systems. These hybrid architectures are inspired from four hardware DSMs: CC NUMA, RC NUMA (NUMA with remote data cache) [27], Simple COMA [21], and COMA [25], and are respectively called SCC, SRC, SSC, and SC. The hybrid ....

[Article contains additional citation context not shown here]

S. K. Reinhardt, R. W. Pfile, and D. A. Wood. Decoupled Hardware Support for Distributed Shared Memory. Proc. of the 23rd Annual International Symposium on Computer Architecture, pages 34-43, May 1996.


Design and Performance of the Software-controlled COMA - Moga (1998)   (Correct)

....software between the hardware and the kernel, although non standard and slightly less flexible, is a much lower overhead solution. A dedicated protocol processor can be integrated at different degrees in a node. At the low end, one of the processors in an SMP cluster can be used for this purpose [63][16] In addition to off loading the application processor(s) the protocol processor does not need to switch context and can respond to events faster, by using polling. At another level, the protocol processor is integrated with the NI and the ACD, as a bus snooper [61] Private data paths ....

....fragmentation. By reducing the usable capacity of the main memory cache, fragmentation directly increases the node miss ratio. The page faulting overhead is very significant in some cases. Two previous studies either assumed a very low penalty for a page fault [67] or very low memory pressure [63]. There is no real indication from our results that S COMA is effective at dealing with fine grain sharing in the general case, unless large amounts of memory can be used without concerns for efficiency (Ocean and LU have coarse grain sharing and would probably perform well on a pure ....

[Article contains additional citation context not shown here]

S.K. Reinhardt, R. W. Pfile, and D. A. Wood. Decoupled Hardware Support for Distributed Shared Memory. Proc. of the 23rd Annual Int'l Symposium on Computer Architecture, pages 34-43, May 1996.


Processor Mechanisms for Software Shared Memory - Carter   (Correct)

....good performance, as most blocks are shared by fewer than five processors. In the last several years, a number of systems have appeared which use a combination of hardware and software to implement shared memory in order to support multiple shared memory protocols efficiently. The Tempest [29] [30] project at the University of Wisconsin Madison showed that substantial performance improvements could be obtained on many programs by selecting a shared memory protocol which matched the needs of the application, and developed the Teapot [7] language to simplify the process of implementing ....

....shared memory protocols on Alewife, as both of these protocols require changes in the manner in which the requesting processor responds to a remote memory reference. 7.3 Typhoon To provide better performance while retaining the flexibility of software-only shared-memory systems, the Typhoon [29] [30] project at the University of Wisconsin explored the use of a dedicated co-processor to execute shared memory handlers, eliminating the context switch overhead involved in starting up shared memory handlers on a software-only system, and allowing handlers to be executed in parallel with user ....

Steven K. Reinhardt, Robert W. Pfile, and David A. Wood. Decoupled hardware support for distributed shared memory. In Proceedings of the 23rd Annual Symposium on Computer Architecture, pages 34-43, May 1996.


Parallel Communication Mechanisms for Sparse, Irregular Applications - Chong   (Correct)

.... 15 N A 10 [Dun92a] Moy91] Intel Paragon 50.0 4 Theta 8 Mesh 2800 56.0 12 N A 10 [Int91] Moy91] Stanford DASH 33.0 2 Theta 4 4 proc clusters 480 14.5 31 120 30 [LLG 92] Stanford FLASH 200.0 4 Theta 8 Mesh 3200 16.0 62 352 40 [HKO 94] Wisconsin T0 #200.0 none simulated N A N A 200 1461 40 [RPW96] Rei96] Wisconsin T1 #200.0 none simulated N A N A 200 401 40 [RPW96] Rei96] Cray T3D 150.0 4 Theta 2 Theta 2 Torus 2 proc clusters 4800 32.0 15 100 23 [T3D93] A 95] Cray T3E 300.0 4 Theta 4 Theta 2 Torus 19200 64.0 110 300 600 80 [Sco96b] Sco96a] SGI Origin 200.0 Hypercube 4 proc clusters ....

.... 12 N A 10 [Int91] Moy91] Stanford DASH 33.0 2 Theta 4 4 proc clusters 480 14.5 31 120 30 [LLG 92] Stanford FLASH 200.0 4 Theta 8 Mesh 3200 16.0 62 352 40 [HKO 94] Wisconsin T0 #200.0 none simulated N A N A 200 1461 40 [RPW96] Rei96] Wisconsin T1 #200.0 none simulated N A N A 200 401 40 [RPW96] Rei96] Cray T3D 150.0 4 Theta 2 Theta 2 Torus 2 proc clusters 4800 32.0 15 100 23 [T3D93] A 95] Cray T3E 300.0 4 Theta 4 Theta 2 Torus 19200 64.0 110 300 600 80 [Sco96b] Sco96a] SGI Origin 200.0 Hypercube 4 proc clusters 10800 54.0 60 150 61 [ Gal96] projected, # simulated, latencies ....

Steven K. Reinhardt, Robert W. Pfile, and David A. Wood. Decoupled hardware support for distributed shared memory. In Proceedings of the 23rd Annual Symposium on Computer Architecture, 1996.


Wisconsin Wind Tunnel II: A Fast and Portable Parallel.. - Mukherjee (1997)   (24 citations)  Self-citation (Reinhardt Wood)   (Correct)

....our experiments. 7 Workshop on Performance Analysis and Its Impact on Design (PAID) June 1, 1997 WWT II is the successor to WWT, but is more detailed and flexible compared to WWT. Table 3 lists the differences between WWT II and WWT. We have already used WWT II for several research efforts [9, 14, 18, 16]. For this study, we have chosen a 32 node S COMA [10] shared memory machine as our target architecture. Each target node has a single processor and a 256 kilobyte processor cache. Hardware coherence is implemented through a full map directory protocol. Each host processor in WWT II simulates one ....

Steven K. Reinhardt, Robert W. Pfile, and David A. Wood. Decoupled Hardware Support for Distributed Shared Memory. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.


Relaxed Consistency and Coherence Granularity in DSM.. - Zhou, Iftode, Singh, Li (1997)   (19 citations)  Self-citation (Wood)   (Correct)

.... Examples of providing fine grained access control include taking advantage of architectural features such as the ECC bits to trap access faults [29], using software instrumentation for shared reads and writes [29, 28], and building special access control hardware for commodity workstations [25]. The finer the granularity, the less false sharing and fragmentation occur, and hence the less need to use relaxed models. A disadvantage of fine-grain coherence is that the smaller granularity may result in excessive misses and poor remote bandwidth. To date, the performance tradeoffs between ....

.... Examples of providing fine grained access control include taking advantage of architectural features such as the ECC bits to trap access faults [29], using software instrumentation for shared reads and writes [29, 28], and building special access control hardware for commodity workstations [25]. Keleher [17] compared sequential consistency with single writer and multiple writer LRC protocols at page granularity and concludes that on average the multiple writer version is 9 better than the single writer one and 34 better than the sequential consistency one. His study used 8 ....

S. K. Reinhardt, R. W. Pfile, and D. A. Wood. Decoupled Hardware Support for Distributed Shared Memory. In Proceedings of the 23rd Annual Symposium on Computer Architecture, May 1996.


Memory Management for Networked Servers - Zhou (2000)   (Correct)

No context found.

Steven K. Reinhardt, Robert W. Pfile, and David A. Wood. Decoupled hardware support for distributed shared memory. May 1996.


Memory Systems for Parallel Programming - Richards (1996)   (1 citation)  (Correct)

No context found.

Steven K. Reinhardt, Robert W. Pfile, and David A. Wood. Decoupled Hardware Support for Distributed Shared Memory. In Proc. of the 23rd Annual Int'l Symp. on Computer Architecture (ISCA'96), May 1996.


CiteSeer.IST - Copyright Penn State and NEC