J. Heinlein, K. Gharachorloo, S. Dresser, and A. Gupta. Integration of message passing and shared memory in the Stanford FLASH multiprocessor. In 6th International Confer5

This paper is cited in the following contexts:

Fine-Grain Distributed Shared Memory on Clusters of Workstations - Schoinas (1997)   (3 citations)  (Correct)

....an attractive target to build runtime systems for parallel languages [LRV94,CL96] where compilers can tailor the coherence protocol to incorporate high level knowledge about the application access patterns. While other hardware shared memory systems integrate message passing and shared memory [HGDG94], none offers the same flexibility to develop application specific protocols as user (rather than system) libraries. High end Tempest implementations such as the Typhoon designs [RLW94,RPW96] include extensive hardware support for fine grain access control and protocol actions. Such designs ....

....associated with sending and receiving messages. Therefore, a newer generation of low overhead messaging interfaces attacked the software overheads. Some software architectures originated in the networking community [DP93,Osb94] while others arose from the multicomputer community [vECGS92,PLC95, HGDG94]. The emergence of system area networks and networks of workstations has blurred the distinction. In general, however, the latter have been more preoccupied with low latencies than the former. Among the key proposals that emerged from the multicomputer community have been the Berkeley active ....

[Article contains additional citation context not shown here]

John Heinlein, Kourosh Gharachorloo, Scott A. Dresser, and Anoop Gupta. Integration of message passing and shared memory in the Stanford FLASH multiprocessor. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI), pages 38-50, 1994.


Architectural Support for an Efficient Implementation of a.. - Grahn, Stenström (1995)   (Correct)

....routed synchronous mesh with a flit size of 64 bits. The mesh is clocked at 50 MHz resulting in a fall through time of 40 ns for each node. The bandwidth into and out of each processor node is 400 Mbytes/s. The latency and bandwidth in the mesh are comparable to the mesh used in the Stanford FLASH [10, 11]. We correctly model contention of all parts in the system. Table 1 shows the time it takes to satisfy a read request from different levels in the memory hierarchy in the hardware only implementation assuming no contention. However, in our simulations a request usually takes longer time as a ....

....protocols. The research has evolved along two main directions: either a separate protocol processor is used to execute the software handlers that emulate the coherence protocol or the handlers are executed on the compute processor. The first direction is represented by, e.g., the Stanford FLASH [10, 11] and the Wisconsin Typhoon [18]. These projects suggest using a separate processor to execute the software handlers. The processor can be located in a central node controller as in FLASH or located in the network interface as in Typhoon. Like us, these projects have the goal to achieve flexibility ....

J. Heinlein, K. Gharachorloo, S. Dresser, and A. Gupta, "Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor," In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), pages 38-50, October, 1994.


FUGU: Implementing Translation and Protection in a .. - Mackenzie.. (1994)   (9 citations)  (Correct)

....impact on performance: 1. User initiated message sends can be permitted to globally named, pre-negotiated areas of physical memory at the receiver, for instance as remote write operations. Translation and protection are handled in analogy to virtual memory. SHRIMP [5] and bulk transfers in FLASH [15, 10] use remote write. 2. User level access to the network hardware can be preserved if the machine is rigidly partitioned and all hardware in the partition, including the network, is context switched. The CM-5 adopts this solution [17]. 3. User initiated transfers between memories can use explicit ....

....messages require the receiving processor to determine the destination of the data. FUGU extends Alewife features for multiuser operation and uses an exokernel operating system. In FUGU, bulk transfers require translation at the receiving processor, but do not pre-negotiate for pages. FLASH [15, 10] is a multimodel machine with a microcoded, kernel level coprocessor for message handling including shared memory protocol messages. Bulk transfers in FLASH are in the form of remote writes which avoid using the receiving processor, but require pre-negotiating the sending addresses in shared ....

John Heinlein, Kourosh Gharachorloo, Scott Dresser, and Anoop Gupta. Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor. In Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI), pages 38--50. ACM, October 1994.


Mechanisms for Efficient Shared-Memory, Lock-Based Synchronization - Kagi (1999)   (2 citations)  (Correct)

....a means to distinguish between regular reads and reads with intent to modify. Other implementations of dual mappings appear also in the Thinking Machines CM-5 [TMC91] to send commands to the vector units, in the AP1000 multicomputer to initiate data transfer, in the Stanford FLASH multiprocessor [HGDG94] to initiate user level DMA transfers, and in Wisconsin's Typhoon-0 prototype [RPW96] to modify the fine grain access control bits. The first careful description and study is due to the members of the Princeton SHRIMP project [BDFL96, BLA 94]. They use these ideas to support very ....

John Heinlein, Kourosh Gharachorloo, Scott Dresser, and Anoop Gupta. Integration of message passing and shared memory in the Stanford FLASH multiprocessor. In Proceedings of the Sixth Symposium on Architectural Support for Programming Languages and Operating Systems, pages 38--50, October 1994.


Protected, User-level DMA for SHRIMP Network Interface - Blumrich, Dubnicki, Felten.. (1996)   (12 citations)  (Correct)

....size or destination address of the transfer to be controlled; the size is hardwired to a fixed value, and the destination address is a fixed circular buffer in the receiver's address. In addition, the transfer is not a DMA, since the CPU stalls while the transfer is occurring. The Flash system [7] uses a technique similar to ours for communicating requests from user processes to communication hardware. Flash uses the equivalent of our memory proxy addresses (which they call shadow addresses) to allow user programs to specify memory addresses to Flash's communication hardware. The Flash ....

John Heinlein, Kourosh Gharachorloo, Scott Dresser, and Anoop Gupta. Integration of message passing and shared memory in the Stanford FLASH multiprocessor. In Proceedings of 6th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 38--50, October 1994.


Design and Evaluation of Network Interfaces for System Area.. - Mukherjee (1998)   (Correct)

....translation in the ULNI TLB. The ULNI TLB can be augmented with protection bits and process identifiers, similar to those in a modern processor TLB, and updated along with the translations to ensure protected ULNI access to main memory. Other researchers have explored these issues in detail [133, 47, 105]. Chapter 5 evaluates the impact of alternate buffering strategies on seven parallel scientific applications. 2.4 Cache NI Registers in Processor and NI Caches Unlike main memory, peripheral I/O device memory, such as ULNI memory, is typically not cached in processor caches. Instead, ULNI ....

....invalidation signals help a CNI to avoid having stale copies of CQ blocks in the processor cache when the CNI writes new messages arriving from the network into the CQ. The shadow address space technique has been used before to communicate special signals from a processor to an I/O device [11, 47], but not in an I/O bridge. In this technique, the I/O bridge creates a shadow space for the regular I/O space by some invertible function 1. PCI supports only two coherent transactions: memory read line and memory write and invalidate. The invalidate command. Consequently, we need to fake an ....

[Article contains additional citation context not shown here]

John Heinlein, Kourosh Gharachorloo, Scott A. Dresser, and Anoop Gupta. Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI), pages 38--50, 1994.


Architectural Mechanisms for Explicit Communication.. - Umakishore.. (1995)   (5 citations)  (Correct)

....processor before it is actually used) and poststore (which is sender initiated and sends the data as soon as it is produced to potential consumer processors). Further, there have been several recent proposals to provide message passing style communication primitives in a shared memory machine [14, 17, 19]. We refer to a specific combination of memory model and coherence protocol, together with some explicit communication primitives as a memory system in this paper. All such latency reducing and tolerating mechanisms have simply one goal, namely, to make the parallel machine appear as close as ....

J. Heinlein, K. Gharachorloo, S. A. Dresser, and A. Gupta. Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, October 1994.


Processor Mechanisms for Software Shared Memory - Carter   (Correct)

....number of methods of implementing flexible shared memory, including the Blizzard systems which were mentioned earlier and the Typhoon systems, which explored the performance tradeoffs involved in using different co processor architectures to execute shared memory protocols. The Stanford FLASH [13] [14] machine also implements flexible shared memory through the use of a co processor to execute shared memory protocols. FLASH is based on the SGI Origin architecture, and replaces the cache coherence controller with a custom protocol processor known as MAGIC. The MAGIC chip has been optimized for ....

....accesses and invoking software handlers, while the M-Machine and the full Typhoon system are able to start handlers very quickly. In addition, the M-Machine's network has significantly lower latency than the Myrinet used in the Typhoon systems. 7.4 FLASH The Stanford FLASH machine [19] [13] [14] is another example of an architecture optimized for efficient execution of multiple shared memory protocols. Like Typhoon, FLASH implements shared memory through the use of a dedicated protocol processor, known as MAGIC. In FLASH, the protocol processor is attached to the memory interface bus of ....

John Heinlein, Kourosh Gharachorloo, Scott Dresser, and Anoop Gupta. Integration of message passing and shared memory in the Stanford FLASH multiprocessor. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI), pages 38-50, October 1994


Efficient Strategies for Software-Only Directory Protocols.. - Grahn, Stenström (1995)   (7 citations)  (Correct)

....4 by 4 wormhole routed synchronous mesh with a flit size of 64 bits. The mesh is clocked at 50 MHz resulting in a fall through time of 40 ns for each node. The bandwidth into and out of each processor node is 400 Mbytes/s. The latency and bandwidth in the mesh are comparable to the Stanford FLASH [9, 10]. We correctly model contention of all parts in the system. Table 2 shows the time it takes to satisfy a read request from different levels in the memory hierarchy in the hardware only implementation assuming no contention. A critical timing assumption is how many cycles we charge for the software ....

J. Heinlein, K. Gharachorloo, S. Dresser, and A. Gupta, "Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor", In Proceedings of ASPLOS-VI, pages 38-50, October, 1994.


Address Translation Mechanisms in Network Interfaces - Schoinas, Hill (1998)   (9 citations)  (Correct)

....service the miss. Designs that perform the lookup and the miss handling in the NI correspond to network coprocessors or network microcontrollers [1,24]. Designs that perform the lookup in the NI and the miss handling in the CPU correspond to software TLBs or custom hardware finite state machines [20,36]. This classification reveals another interesting design point in which both the lookup and the miss handling are performed on the CPU through an interface that allows user level software to control the mappings that are installed in the NI translation structures. Table 1 shows the design space ....

....not consider such misses further because both in single copy and minimal messaging the limiting factor is how fast the kernel can allocate new pages or swap in old pages from secondary storage. We can implement NI translation structures in software, similar to the software TLBs proposed for FLASH [20]. To implement software structures, we need an NI microcontroller that is flexible enough to synchronize with the node's CPU to access its own page tables in main memory. Such structures have small associativity and many entries. The lookup overhead is directly proportional to the number of ....

[Article contains additional citation context not shown here]

John Heinlein, Kourosh Gharachorloo, Scott A. Dresser, and Anoop Gupta. Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 38--50, San Jose, California, 1994.


Early Experience with Message-Passing on the SHRIMP.. - Felten, Alpert.. (1996)   (23 citations)  (Correct)

....traditional network interfaces and thus their implementations of the NX message passing library manage communication buffers in the kernel [37, 35]. Current machines like the Intel Paragon and Meiko CS-2 attack software overhead by adding a separate processor on every node just for message passing [34, 25, 23, 22, 20]. This approach, however, does not eliminate the overhead of the software protocol on the message processor, which is still tens of microseconds in software overhead. Distributed systems offer a wider range of communication abstractions, including remote procedure call [8, 39, 4] ordered ....

John Heinlein, Kourosh Gharachorloo, Scott A. Dresser, and Anoop Gupta. Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor. In Proceedings of 6th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 38--50, October 1994.


Operating System Support for High-Speed Communication - Druschel (1996)   (13 citations)  (Correct)

....memory; the network subsystem can share physical memory dynamically with other subsystems, applications, and file caches. A number of specialized network interfaces exist that support user level network access, for example SHRIMP [3], Memory Channel [11], Hamlyn [4], Telegraphos [19], and MAGIC [12]. These interfaces are specialized to support a shared memory abstraction on loosely coupled multicomputers, and they attach to dedicated networks. An ADC, on the other hand, is a software mechanism implemented with minimal assist from a general purpose network adaptor. As such, it can support ....

J. Heinlein, K. Gharachorloo, S. Dresser, and A. Gupta. Integration of message passing and shared memory in the Stanford FLASH multiprocessor. In Proceedings of the 6th Conference on Architectural Support for Programming Languages and Operating Systems, pages 38--50, San Jose, CA, Oct. 1994. ACM.


Exploiting Two-Case Delivery for Fast Protected Messaging - Mackenzie, Kubiatowicz, .. (1998)   (14 citations)  (Correct)

....on a simulator that give the performance of virtual buffering. Section 6 concludes. 2 Related Work Recent architectures demonstrate emerging agreement that it is important to provide support for efficient, fine grain message passing, even in conjunction with hardware support for shared memory [1, 2, 13, 18, 27, 30]. The trend in message interfaces has been to reduce end to end overhead by providing user access to the interface hardware. We build on previous work in messaging models and mechanisms. Model. The UDM model is similar to Active Messages [35] and related to Remote Queues (RQ) [6] as an efficient ....

....network interface low. Using virtual memory is particularly natural when the processor initiates all the buffering because existing support for virtual memory (e.g. the processor's TLB) is reused. It requires a relatively complex DMA engine or coprocessor to manipulate virtual memory independently [13, 25, 29]. Buffering adds a performance cost when used. The buffered path introduces two components of overhead over the fast path. First, there is an extra copy operation: an operating system handler must copy the message from the network interface to memory. Second, the user handler must now retrieve the ....

John Heinlein, Kourosh Gharachorloo, Scott Dresser, and Anoop Gupta. Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor. In Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 38--50, October 1994.


An Efficient Virtual Network Interface in the FUGU Scalable.. - Mackenzie (1998)   (1 citation)  (Correct)

....in software. Using virtual memory is particularly natural when the processor initiates all the buffering because existing support for virtual memory (e.g. the processor's TLB) is reused. It requires a relatively complex DMA engine or coprocessor to manipulate virtual memory independently [26, 57, 66, 80]. However, virtual buffering is usable in any system that employs buffering. For instance, a system that performs limited buffering in hardware could implement virtual buffering by using interrupts to dynamically expand the buffers [56, 57, 80]. One simple way to perform limited buffering would be ....

John Heinlein, Kourosh Gharachorloo, Scott Dresser, and Anoop Gupta. Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor. In Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 38--50, October 1994.


Mechanisms for Efficient, Protected Messaging - Lee   (Correct)

....Mhz i860 20 S[38] Active Message 1024 max Round Trip LogP analysis 116 S[20] NX library mesg exhange SP2 66.7 Mhz 96 S[20] MPI F library RS 6000 mesg exhange SHRIMP 60 Mhz Pentium 9. 5 S [25] User Level DMA w Automatic Update FLASH 100 Mhz T5 R4000 100 cyc [17] Shared Memory remote read 175 cyc [41] Active Message fetch and add AP1000 25 Mhz 65.6 S[22] Line Sending SPARC ping pong Buffering Receiving Alewife 33 Mhz SPARCLE 14.8 S[15] GID round trip null RPC Myrinet VMMC 166 Mhz Pentium 19.6 S[42] LAN based ping pong Multicomputer iWarp 20 Mhz 800 cycles [27] Message Passing using ....

....at the destination, but can give no guarantees due to the unprotected integer tags. In any case, the message interface privileges are open to abuse, as no facility is provided for regulating user-level message handlers. System control over message handlers is accomplished on the FLASH system [41] via the virtual address translation layer in its virtual memory mapped message interface. A handler is accessible to a user only if the corresponding entry point is mapped into the user's virtual memory domain. User level handlers are, however, not supported. In general, a simple protection model ....

John Heinlein, Kourosh Gharachorloo, Scott Dresser, Anoop Gupta, "Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor", in ASPLOS VI, 1994, pp. 38--50.


A Survey of User-Level Network Interfaces for System Area Networks - Mukherjee (1997)   (7 citations)  (Correct)

....for cachable ULNI registers. In this method the I/O bridge fakes invalidation signals on the I/O bus using a technique called the shadow address space. The shadow address space technique has been used before to communicate special signals and address translations from a processor to an I/O device [5, 22], but not in an I/O bridge. In this technique, the I Home Caching ULNI Registers In Non Coherent I/O Coherent I/O Coherent I/O I/O Bridge Invalidation Support Main Memory ULNI Cache No No No Processor Cache Slow Yes Yes ULNI Processor Cache Slow Slow Yes TABLE 2. Caching I/O bus ULNI ....

....multiple copies of data within the user's virtual space. Such access requires the ULNI to support a full blown address translation scheme. Several alternatives exist from stashing the entire page table into the ULNI [46] to caching the translations in ULNI data structures, either in software [22] or in hardware [39]. There are two problems associated with caching the translations in the ULNI: how to fill the ULNI translation buffer and how to avoid stale copies of translations when the operating system has remapped a page or swapped a page to disk. The ULNI translation buffer can be ....

[Article contains additional citation context not shown here]

John Heinlein, Kourosh Gharachorloo, Scott A. Dresser, and Anoop Gupta. Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI), pages 38--50, 1994.


Adaptive Granularity: Transparent Integration of Fine- and.. - Daeyeon Park (1996)   (1 citation)  (Correct)

....such as Stanford FLASH and Wisconsin Typhoon integrate both models within a single architecture and implement coherence protocols in software rather than in hardware. In order to use the bulk transfer facility on these machines, several approaches have been proposed such as explicit messages [9, 22] and new programming models [3]. In explicit messages, message passing communication primitives such as send/receive or memory copy are used selectively to communicate coarse grain data, while load/store communication is used for fine grain data [22]. In other words, two communication paradigms ....

....machines. Second, because bulk transferred data can be cached, coherence has to be maintained amongst the processors. Enforcing global coherency for arbitrarily sized bulk data using a standard load/store mechanism substantially increases the hardware complexity and/or software overhead [9]. Third, data alignment becomes an issue because bulk transfers might not start or end at cache line boundaries. Supporting arbitrary data alignment adds to hardware and software costs [9]. Finally, recent studies indicate that bulk transfer may not actually help the performance of shared memory ....

[Article contains additional citation context not shown here]

J. Heinlein, K. Gharachorloo, S. Dresser, and A. Gupta. Integration of message passing and shared memory in the Stanford FLASH multiprocessor. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 38--50, October 1994.


Connection Resource Management for Compiler-Generated Communication - Hinrichs   (Correct)

....MAGIC chip that takes care of data movement and cache coherence policies. Since the MAGIC chip is programmable, Flash can flexibly support a number of different cache and data movement policies. The differences between the bulk transfers and traditional cache coherence protocols are discussed in [HGDG94]. In addition to these protocols, the system programmer can create efficient protocols for other communication patterns. For example, a replication protocol could implement a more efficient algorithm than the naive one-to-many approach, and a simple reduction communication pattern could ....

J. Heinlein, K. Gharachorloo, S. Dresser, and A. Gupta. Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor. In Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 38--50, San Jose, October 1994.


Flow Control Considerations in Network-Based Architectures - Konstantinidou, Ngai   (Correct)

....can be multiphase, that is in response to one message received many messages may have to be created, flow control and in particular buffer management issues are extremely important for correctness. It is generally agreed that both of these programming paradigms have certain limitations [14, 15, 17, 13]. The message passing model requires substantial knowledge and effort by the programmer in order to optimize the data placement and communication. In addition, it can suffer from performance problems caused by the need to match sends to receives, to buffer unanticipated messages, and to perform ....

J. Heinlein, K. Gharachorloo, S. Dresser, and A. Gupta, "Integration of message passing and shared memory in the Stanford FLASH multiprocessor", in Proc. Sixth Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 38--50, 1994.


The Impact of Data Transfer and Buffering Alternatives on.. - Mukherjee (1998)   (6 citations)  (Correct)

....We do not consider this option here. ferred. Unfortunately, users cannot provide authenticated physical addresses of data buffers without violating most operating systems' protection model. Consequently, NIs must be prepared to fetch authentic physical addresses from the operating system [35, 17, 42]. To avoid the complexity of building an NI that fetches and manages authentic physical addresses, Blumrich, et al. [2, 31] proposed a low overhead data transfer initiation scheme called User Level DMA (UDMA). In this scheme users provide authentic physical addresses to the NI via a sequence of ....

John Heinlein, Kourosh Gharachorloo, Scott A. Dresser, and Anoop Gupta. Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI), pages 38--50, 1994.


Semi-structured Portable Library for Multiprocessor Servers - Tsilikas, Fleury   (Correct)

No context found.

J. Heinlein, K. Gharachorloo, S. Dresser, and A. Gupta. Integration of message passing and shared memory in the Stanford FLASH multiprocessor. In 6th International Confer5


Operating System Support for High-Speed Communication - Druschel (1996)   (13 citations)  (Correct)

No context found.

Heinlein, J., Gharachorloo, K., Dresser, S., et al. Integration of message passing and shared memory in the Stanford FLASH multiprocessor. In Proceedings of the 6th Conference on Architectural Support for Programming Languages and Operating Systems (San Jose, CA, Oct. 1994), pp. 38--50.


Building Secure and Reliable Network Applications - Birman (1996)   (121 citations)  (Correct)

No context found.

J. Heinlein, K. Gharachorloo, S. Dresser and A. Gupta. Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor. In 6th International Conference on Architectural Support for Programming Languages and Operating Systems (Oct. 1994), 38--50.


Shared Regions: A strategy for efficient cache management in.. - Sandhu (1995)   (2 citations)  (Correct)

No context found.

J. Heinlein, K. Gharachorloo, S. Dresser, and A. Gupta. Integration of message passing and shared memory in the Stanford FLASH multiprocessor. In Sixth Int'l. Symp. on Architectural Support for Programming Languages and Operating Systems, pages 38--50, Oct 1994.


Multithreaded Systems - Kavi, Lee, Hurson   (Correct)

No context found.

Heinlein J. et al., "Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor," Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, Oct., 1994.

