Al Davis, Mark Swanson, and Mike Parker. Efficient Communication Mechanisms for Cluster Based Parallel Computing. Communication, Architecture, and Applications for Network-Based Parallel Computing, 1997, pp. 1-15
....that benefit from the increased CPU idle time during I/O operations. In many cases applications can realize performance improvements without extensive software restructuring, as the user-level I/O architecture does not restrict the use or management of I/O buffers as previous work has often done [24][27][29][62][75][99]. The prototype implementation of this architecture is able to improve the aggregate bandwidth of 23 disk I/O streams of a distributed storage architecture by a factor of two while reducing CPU occupancy by almost a factor of 100. The significantly reduced overhead enables a ....
....for the duration of the transfer. This pinning or locking of pages can either be done before every DMA transfer, or when the buffer is initially allocated. Existing solutions for high-performance communication networks often require the application to specify the communication buffers in advance [24][32][39]. During the buffer setup, the kernel pins the pages in physical memory, and the address mapping is made available to the I/O device. In addition, the kernel can arrange the physical pages contiguously. The application is then able to initiate DMA transfers using this prearranged buffer ....
A. Davis, M. Swanson, and M. Parker, "Efficient Communication Mechanisms for Cluster Based Parallel Computing," Proc. 1st Int'l Workshop Communication and Architectural Support for Network-based Parallel Computing (CANPC '97), IEEE CS Press, Los Alamitos, Calif., 1998, pp. 1-15.
....From a software-oriented standpoint, they can be grouped into two main families as well, namely the kernel level and the user level. We classify and compare these different approaches. 3.2.4.1 Commodity versus custom devices. Using custom devices rather than commodity off-the-shelf ones (like in [DSP97, HKO 94, KOH 94, BLA 94, MK96]) may lead to very low latency and high bandwidth communication. The performance characteristics as well as the technological level of some custom devices are often much better than off-the-shelf ones. The performance superiority is often achieved through ....
A. Davis, M. Swanson, and M. Parker. Efficient Communication Mechanisms for Cluster Based Parallel Computing. In Proc. of the 1st International Workshop on Communication and Architectural Support for Network-Based Parallel Computing (CANPC'97), number 1199 in Lecture Notes in Computer Science. Springer, February 1997.
....From a software-oriented standpoint, they can be grouped into two main families as well, namely the kernel level and the user level. We classify and compare these different approaches. 3.2.4.1 Commodity versus custom devices. Using custom devices rather than commodity off-the-shelf ones (like in [34, 42, 46, 15, 51]) may lead to very low latency and high bandwidth communication. The performance characteristics as well as the technological level of some custom devices are often much better than off-the-shelf ones. The performance superiority is often achieved through a closer integration between the ....
A. Davis, M. Swanson, and M. Parker. Efficient Communication Mechanisms for Cluster Based Parallel Computing. In Proc. of the 1st International Workshop on Communication and Architectural Support for Network-Based Parallel Computing (CANPC'97), number 1199 in Lecture Notes in Computer Science. Springer, February 1997.
....We believe this drawback is offset by the decrease in host overhead. Transferring specialized functionality to the communication coprocessor has been investigated for message-passing parallel machines. For instance, [21] investigates the benefits of running Active Messages on the coprocessor. [1, 6, 20] propose new hardware architectures for network interfaces. Their primary objective is to eliminate the I/O bus bottleneck by integrating the network interface with the memory system. We believe that VCM-like abstractions will prove useful for such hardware architectures. 7. Conclusions This ....
A. Davis, M. Swanson, and M. Parker. Efficient communication mechanisms for cluster based parallel computing. www.cs.utah.edu/projects/avalanche/index.html, Dec. 1996.
....user processes have to explicitly fragment and reassemble messages longer than the Ethernet MTU. Net exhibits quite low (30.1 µs) one-way latency, but is comparable to Linux TCP/IP sockets as for asymptotic bandwidth. 2.2.5 Message passing on custom communication devices. A number of projects [19, 30, 38, 7, 41] use either special-purpose or proprietary communication devices in order to achieve low latency and high bandwidth communication. The performance characteristics as well as the technological level of some custom expensive devices are often much better than off-the-shelf ones. The performance ....
A. Davis, M. Swanson, and M. Parker. Efficient Communication Mechanisms for Cluster Based Parallel Computing. In Proc. of the 1st International Workshop on Communication and Architectural Support for Network-Based Parallel Computing (CANPC'97), number 1199 in Lecture Notes in Computer Science. Springer, February 1997.
....of the underlying network to the application level. Our research addresses the fundamental question of how to structure and implement the interface between applications and the network. In contrast to other recent attempts at improving network performance which focus on lowering latencies [7, 22, 34], the approach presented in this paper is based on the assumption that the critical step in improving the performance of parallel applications on COWs is the reduction of host communication overheads. We posit that low latencies are important only to the extent that their reduction does not lead ....
....well. Transferring specialized functionality to the communication coprocessor has been investigated in the context of message-passing parallel machines. For instance, in [29] the coprocessor is used to implement Active Messages. New hardware architectures for network interfaces are proposed in [1, 7, 27]. Their primary objective is to eliminate the I/O bus bottleneck by integrating the network interface with the memory system. We believe that VCM-like abstractions will prove useful for such hardware architectures. 7 Conclusions and Future Work. This paper presents the design and implementation of ....
A. Davis, M. Swanson, and M. Parker, Efficient communication mechanisms for cluster based parallel computing, in Proceedings of the 1st International Workshop on Communication and Architectural Support for Network-Based Parallel Computing, D. K. Panda and C. B. Stunkel eds., Springer-Verlag, Heidelberg, 1997, pp. 1-15.
.... networks of workstations via a combination of hardware and software: Dolphin's SCI interface [19], PRAM [24], Memory Channel [13], Myrinet [6], ServerNet [26], Active Messages [12], Fast Messages [17], Galactica Net [16], Hamlyn [9], U-Net [27], NOW [1], Parastation [28], StarT-Jr [15], Avalanche [10], Panda [2] and SHRIMP [4] provide efficient message passing on networks of workstations based on memory-mapped interfaces. We view our work as complementary to these projects, in the sense that we propose a fast message notification mechanism that can improve the performance of all these message ....
A. Davis, M. Swanson, and M. Parker. Efficient Communication Mechanisms for Cluster Based Parallel Computing. Technical report, University of Utah, Dept. of Computer Science, 1996.
....receives, and notifications all make passes through operating system code. Since the operating system code is unlikely to reside in the cache, these system calls result in cache misses. [Figure 1: Anatomy of a message for a kernel-mode NI] User-level interfaces [3,9,11,13,18] and zero-copy protocols [5,7] significantly reduce the overhead of message sends and receives by eliminating operating system and copying overhead on the message send and receive sides. Notifications still have significant opportunity for optimization, as they remain the performance and scalability bottleneck in general ....
....overhead and latency. Having the NI on-die gives the processor access to it on a per-cycle basis. This close coupling further reduces the overhead in getting information to and from the NI. Message sends and receives do not have to go out over slow and inefficient I/O buses. A zero-copy protocol [5,7] is used to eliminate copying overhead for received messages. The combination of user-level access to a closely coupled NI and the zero-copy protocol allows for efficient sends and receives. 3.2 User-level Notifications. Part of the inefficiency of interrupt processing is due to the legacy view ....
[Article contains additional citation context not shown here]
....simulations is an obvious approach. This paper reports on such a parallelization effort and its unusual approach of performing distributed simulation within a shared memory model. 1.1 The Avalanche Architecture. The Avalanche distributed system will be a cluster or network of 32 to 64 workstations [7] interconnected with a Myrinet network. Its unique aspects lie in providing a communications interface supporting extremely efficient message passing and distributed shared memory (DSM) and designing that interface to plug in to commodity workstations. All interactions between processors occur as ....
Swanson, M. R., Davis, A., and Parker, M. Efficient Communication Mechanisms for Cluster Based Parallel Computing. In Workshop on Communication and Architectural Support for Network-based Parallel Computing (CANPC 97) (February 1997), vol. 1199 of Lecture Notes in Computer Science, Springer-Verlag, pp. 1-15.