| D. S. Henry and C. F. Joerg. A tightly-coupled processor-network interface. In Proc. Fifth Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS V), pages 111--122, Oct. 1992. |
....quickly by the processor, and acts as a staging area for outgoing messages. A zero copy message protocol allows messages to be delivered directly to user space without copying. Not all of these ideas are new. For example, previous research has explored the use of user level network interfaces [3,9,11,13,18]. However, this specific combination of features is unique, in that it exposes interrupts directly to user level programs. The important aspect of our architecture lies in its support for user level messaging (for both interprocessor communication and I/O) in a general purpose operating system ....
....message arrival notification. Sends, receives, and notifications all make passes through operating system code. Since the operating system code is unlikely to reside in the cache, these system calls result in cache misses. [Figure 1: Anatomy of a message for a kernel mode NI] User level interfaces [3,9,11,13,18] and zero copy protocols [5,7] significantly reduce the overhead of message sends and receives by eliminating operating system and copying overhead on the message send and receive sides. Notifications still have significant opportunity for optimization, as they remain the performance and ....
Dana S. Henry and Christopher F. Joerg. A Tightly-Coupled Processor-Network Interface. In Proceedings of the 5th International ASPLOS, October 1992, pp. 111-122.
....overall file access latency using client initiated RDMA is 20% to 40% lower than with using RPC. We note that this is a rough estimate based on measurements of a prototype with few optimizations. We plan for a more thorough evaluation in the future. NICs that are tightly coupled with the host [29], [30], [31], [14], [32] aim at lowering the NIC overhead as well as the overhead of the NIC interaction with the host for control and data transfer. Previous research [31] has pointed to the importance of NIC design for low latency RPC communication. Scheduling delays included in TNullRPC can be ....
D. S. Henry and C. F. Joerg, "A Tightly-Coupled Processor-Network Interface," in Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating System (ASPLOS), Boston, MA, October 1992, pp. 111--121.
....as load increases, and fosters support for interactive response times. In addition, the fact that interacting processes are guaranteed to execute simultaneously allows them to access hardware communication devices in user mode, without the overheads associated with operating system protection [29, 39, 23]. A distributed hierarchical control (DHC) scheme for supporting gang scheduling has been proposed previously [16]. DHC defines a control structure over the parallel machine and combines time slicing with a buddy system partitioning scheme. Given the DHC framework, this paper investigates several ....
Henry, D. S., and Joerg, C. F. A tightly coupled processor-network interface. Fifth Intl. Conf. Architect. Support for Prog. Lang. & Operating Syst., Sep. 1992, pp. 111--122.
....in the register file, so that marshaling and unmarshaling would be cheaper. We did not directly simulate the extra registers, but simply changed the accounting for the marshaling and unmarshaling costs. The performance advantages of such an organization have been explored by Henry and Joerg [42]. [Chapter heading: Static Computation Migration in Prelude] To evaluate the benefits of computation migration, we implemented static computation migration in the Prelude system. This chapter describes the Prelude implementation, and summarizes some of our performance results. A more complete description ....
D.S. Henry and C.F. Joerg. "A Tightly-Coupled Processor-Network Interface". In Proceedings of the 5th Conference on Architectural Support for Programming Languages and Systems, pages 111--122, October 1992.
....mapped or register based. The Connection Machine CM-5 provides access to the network through a memory mapped interface [21]. Register based approaches provide tighter coupling by moving the network interface into the processor and providing direct access to the interface through special registers [5, 9]. One of the problems with the above systems is that they are typically optimized for short messages, thus limiting the achievable bandwidth for large transfers. Another drawback is that the compute processor handles the complete transfer, thus taking cycles away from the main computation. ....
Dana S. Henry and Christopher F. Joerg. A tightly coupled processor-network interface. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 111--122, September 1992.
....gradually as load increases, and fosters support for interactive response times. And the fact that interacting processes are guaranteed to execute simultaneously allows them to access hardware communication devices in user mode, without the overheads associated with operating system protection [29, 39, 23]. A distributed hierarchical control (DHC) scheme for supporting gang scheduling has been proposed previously [16]. DHC defines a control structure over the parallel machine and combines time slicing with a buddy system partitioning scheme. Given the DHC framework, this paper investigates several ....
D. S. Henry and C. F. Joerg, "A tightly coupled processor-network interface". In 5th Intl. Conf. Architect. Support for Prog. Lang. & Operating Syst., pp. 111--122, Sep 1992.
....[Figure 3: Processor node organization for a software only directory protocol. Block labels: second level cache, memory, network interface, network, processor chip, FLWB, SLWB, Interrupt Buffer (IB), Send Buffer (SB), local bus module.] ....interfaced to the second level cache bus as proposed in [12], where the first entry in the buffer (last for SB) is accessible from software. However, we propose to access the buffers through memory-mapped addresses instead of register-mapped as in [12] to adjust to mainstream processor designs. Coherence interrupts, also called high availability interrupts, ....
....for a software only directory protocol. ....interfaced to the second level cache bus as proposed in [12], where the first entry in the buffer (last for SB) is accessible from software. However, we propose to access the buffers through memory-mapped addresses instead of register-mapped as in [12] to adjust to mainstream processor designs. Coherence interrupts, also called high availability interrupts, need to be precise as pointed out in [14] to avoid protocol deadlock; we cannot delay the software handler execution until a pending load completes. To see this, consider two nodes, i and ....
D. S. Henry and C. F. Joerg, "A Tightly-Coupled Processor-Network Interface," In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-V), pages 111-122, October, 1992.
....major sources of the communication overhead. The communication hardware aspect includes the architecture and placement of the network interface, and the interconnection network and its services. Many architectures have been proposed for the network interfaces. They are classified as (1) direct [52, 7, 63, 80, 97, 88] and (2) memory based [48, 112, 126, 23]. Direct network interfaces allow a processor to directly access the network queue. However, they mostly ignore the issue of multiprogramming. That is, only a single thread can use the network interface at a time. Memory based interfaces provide protection ....
D. S. Henry and C. F. Joerg, "A Tightly-Coupled Processor-Network Interface", Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems, 1992.
....a network interface for stock workstations can only communicate with the processor through a bus. A straightforward message passing interface could be implemented as memory mapped registers such as in CM-5 [14] or as a packet sized array of memory mapped registers as suggested by Joerg and Henry [9] (Figure 1). These interfaces are passive devices that only respond to the processor's direct manipulation through memory mapped operations. A user program composes an outbound packet by writing the content of the packet, with its header, to the registers through memory mapped writes. The content ....
....parallel processing occurs in frequent and small size messages. The communication overhead must be further minimized by giving the user processes direct control of the network interface. These low-overhead user level network interface designs can be found in many contemporary MPP architectures [9, 14]. However, these designs typically involve the support of custom system or CPU design. In most contemporary workstation designs, the RISC microprocessors are optimized for cached accesses while the bus architectures are optimized for blocked transfers. The network interface design must take these ....
D. S. Henry and C. F. Joerg. A tightly-coupled processor-network interface. In Proceedings of ASPLOS, October 1992.
....Most tightly coupled interface designs use special purpose message instructions (e.g. a send command) in which general purpose processor registers are the operands. Some examples include the Message Driven Processor (MDP) [22], the Caltech Mosaic [24], the Henry-Joerg network interface [29], and the Start (*T) [30] network interface. An exception is iWarp from CMU [23], whose systolic communication model is based on operands rather than operators. A send command is constructed by using a message register as the destination of an arithmetic operation; a receive command is constructed ....
Dana S. Henry and Christopher F. Joerg. A tightly coupled processor-network interface. Proc. of the 5th ASPLOS, October 1992, pp. 111-122.
....appropriately distribute among the compiler, operating system, and hardware the functionality that is needed to efficiently support I/O bound applications. One approach to a tighter coupling of the network and processor is to directly map the network ports into the processor's register file [10][49]. This approach makes it efficient for a compiler to send a message since the compiler simply needs to build a message in the processor's register file. A shortcoming of this approach is that it bypasses the operating system, and we want to take advantage of the operating system functions that ....
Dana S. Henry and Christopher F. Joerg. A tightly-coupled processor-network interface. In Proceedings of 5th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 111--122, October 1992.
....message passing for streamlining data and control transfer between workstations on a local area network. Though our approach is similar in nature, our emphasis lies on user-level communication and compiler support for parallel programming; their emphasis is on distributed applications. In [Henry Joerg 92b], the authors propose a network interface design that provides special support for Id [Nikhil 90] programs that have been compiled to Berkeley's Threaded Abstract Machine [Culler et al. 91a]. They reduce communication overhead by implementing message dispatching, forwarding and replying in ....
D. S. Henry and C. F. Joerg. A tightly-coupled processor-network interface. In Proceedings of 5th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 111--122, October 1992.
....interface. [Section 2.1: Processing Nodes] A fundamental decision in designing the processing node is whether to use commodity or custom processors. A custom processor design can improve communication performance; for example, the architect can integrate the network interface more tightly with the CPU [Henry Joerg 92a] or include special purpose communication instructions in the CPU. The Kendall Square KSR-1 shared memory computer [KSR 92] uses both approaches; its processors provide instructions for prefetching or post storing cache lines, plus a host of instructions that control the memory system, ....
....be useful to examine how the communication requirements for other programming models differ from those of data parallel languages. For example, Henry and Joerg designed a NI for use with the TAM [Culler et al. 91b] model of execution and reached conclusions that are somewhat different from ours [Henry Joerg 92a]. While this is a fairly extreme example, in that their programming model is radically different from the data parallel model, it shows the tight connection between the choice of a programming model and the design of the communication architecture. [Section 7.2: Conclusions] We have identified ....
D. S. Henry and C. F. Joerg. A tightly-coupled processor-network interface. In Proceedings of 5th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 111--122, October 1992.
....transfers, and is implemented using the CMAM_xfer function, which splits up the transfer into a sequence of hardware packets at the source, and the CMAM_handle_left_xfer function, which reassembles the packets at the destination. [Footnote 1] While this is not the most efficient type of network interface [13, 8, 4], it has the significant virtue that no changes to the processor are required. Many researchers believe that this type of interface is basically representative of future network interfaces. [Footnote 2] The CM-5 NI also supports an interrupt driven interface for reception; however, the cost is very high ....
....exploring what impact advanced network features (adaptive routing, virtual channels) have on network interface complexity and software overhead. Our work addresses some of these issues. Research on network interfaces has focused primarily on reducing message injection (and reception) overhead [13, 8, 19, 4] or offloading the communication onto a coprocessor [14, 16, 3]. Such efforts are complementary to our goal of software protocol overhead reduction. Improvements in network interface can reduce the basic communication cost in our studies. While reducing the basic cost is important, as can be seen ....
D. S. Henry and C. F. Joerg. A tightly-coupled processor-network interface. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 111--122, 1992.
....is still hundreds of CPU instructions. In addition, the node is complex and expensive to build. Several projects have taken the approach of lowering communication latency by bringing the network all the way into the processor and mapping the network interface FIFOs to special processor registers [2, 6, 3]. Writing and reading these registers queues and dequeues data from the FIFOs respectively. While this is efficient for fine grain, low latency communication, it requires the use of a nonstandard CPU, and it does not support the protection of multiple contexts in a multiprogramming environment. ....
Dana S. Henry and Christopher F. Joerg. A tightly-coupled processor-network interface. In Proceedings of 5th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 111--122, October 1992.
....to remote virtual memory buffers. Digital's Memory Channel [20] uses an approach that is similar to PRAM on the sending side and to SHRIMP's automatic update on the receiving side. Another direct approach allows applications to compose and retrieve messages using network interface registers [22, 42]. In addition to network interface registers, the Cray T3E [42] supports remote memory accesses with an approach similar to UDMA. It uses complete page tables to describe global communication segments and all communication pages are pinned in memory. A network interface typically can hold only a ....
Dana S. Henry and Christopher F. Joerg. A tightly-coupled processor-network interface. In Proceedings of 5th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 111--122, October 1992.
.... also supports user level message passing, but places more burden on application programs by requiring them to construct their own message headers [15]. Some previous machines have worked to streamline the hardware-software interface by mapping network interface FIFOs into processor registers [14, 24, 37]. Such approaches go against SHRIMP's goal of using commodity CPUs. A slightly less integrated approach, mapping FIFOs to memory rather than registers, was employed in the CM-5 [42]. CM-5 implementation restrictions limited the degree of multiprogramming, however, and applications were still ....
Dana S. Henry and Christopher F. Joerg. A Tightly-Coupled Processor-Network Interface. In Proceedings of 5th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 111--122, October 1992.
....is still hundreds of CPU instructions. In addition, the node is complex and expensive to build. Several projects have taken the approach of lowering communication latency by bringing the network all the way into the processor and mapping the network interface FIFOs to special processor registers [5, 11, 7]. Writing and reading these registers queues and dequeues data from the FIFOs respectively. While this is efficient for fine grain, low latency communication, it requires the use of a non-standard CPU, and it does not support the protection of multiple contexts in a multiprogramming environment. ....
Dana S. Henry and Christopher F. Joerg. A tightly-coupled processor-network interface. In Proceedings of 5th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 111--122, October 1992.
....Second, the nodes, NI, and the network all have finite buffering, so software buffer management is required. Third, the CM-5 network provides error detection at the packet level, but no error correction, requiring a software protocol to ensure reliable delivery. And finally, the CM-5 network hardware only supports packets with five 32 bit words, so a typical message is broken .... [Footnote 1] While this is not the most efficient type of network interface [12, 6], it requires no changes to the processor. Many researchers believe that this type of interface is representative of future network interfaces.
....remains significant over the range of packet sizes. For finite sequence multi packet deliveries, the messaging overhead is lower, but still significant, accounting for 9-11% of the total cost. [Heading: Improved network interfaces and DMA hardware] If network interfaces can be integrated on chip, as in [12, 6], the basic cost of communication can be reduced, but this will not reduce protocol costs in the messaging layer on which our study focuses. If the base cost is reduced, that increases the importance of the costs in the rest of the messaging layer. Similarly, while DMA hardware can reduce the cost ....
[Article contains additional citation context not shown here]
D. S. Henry and C. F. Joerg. A tightly-coupled processor-network interface. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 111--122, 1992.
....is still hundreds of CPU instructions. In addition, the node is complex and expensive to build. Several projects have taken the approach of lowering communication latency by bringing the network all the way into the processor and mapping the network interface FIFOs to special processor registers [3, 8, 4]. Writing and reading these registers queues and dequeues data from the FIFOs respectively. While this is efficient for fine grain, low latency communication, it requires the use of a non-standard CPU, and it does not support the protection of multiple contexts in a multiprogramming environment. ....
Dana S. Henry and Christopher F. Joerg. A tightly-coupled processor-network interface. In Proceedings of 5th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 111--122, October 1992.
....[Table 3.5: Comparison of CNI with other network interfaces. Surviving row fragments: T Zero [103]: Partial, Partial, No; SHRIMP [12]: Yes, Write Through, No; DI Multicomputer [23]: No, No, Network Interface.] ....communicate through the cachable memory accesses, for which most processors and buses are optimized. Henry and Joerg [50] and Dally, et al. [34] advocate changes to a processor's registers. MIT Alewife [2] and Fugu [72] rely on a custom cache controller. MIT StarT-NG [22] requires a co-processor interface at the same level as the L2 cache. AP1000 [110] requires integrated cache and DMA controllers. Stanford FLASH ....
....of buffering using a synthetic workload and concluded that buffering messages in virtual memory can occur only rarely for realistic applications. However, in contrast I found that for two of my seven macrobenchmarks, buffering can play a significant role in improving performance. Henry and Joerg [50] compared the performance of three NIs mapped respectively to the processor registers, L1 cache bus, and an off chip L2 cache bus. However, unlike my study, they did not examine the impact of buffering on the performance of these NIs. [Section 5.6: Conclusions] In this chapter I have systematically ....
Dana S. Henry and Christopher F. Joerg. A Tightly-Coupled Processor-Network Interface. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS V), pages 111--122, October 1992.
.... the sender and the receiver, and eliminating the dedicated memory for message arrival, as is found on the J-Machine [8]. Register mapped network interfaces have been used previously in the Mars Machine [2], J-Machine, and iWarp [4], and have been described by *T [26] as well as Henry and Joerg [15]. However, none of these systems provide protection for user level messages. Systems, like the J-Machine, that provide user access to the network interface without atomicity must temporarily disable interrupts to allow the sending process to complete the message. The M-Machine's atomic SEND ....
Henry, D. S., and Joerg, C. F. A tightly-coupled processor-network interface. In Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS V) (Oct. 1992), ACM, pp. 111--122.
No context found.
D. S. Henry and C. F. Joerg. A tightly-coupled processor-network interface. In Proc. Fifth Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS V), pages 111--122, Oct. 1992.
No context found.
Dana S. Henry and Christopher F. Joerg. A tightly-coupled processor-network interface. In Proc. Fifth Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS V), pages 111--122, October 1992.
No context found.
D. S. Henry and C. F. Joerg, "A tightly-coupled processor-network interface," in Proceedings of 5th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 111--122, Oct. 1992.