M.-C. Rosu, K. Schwan, and R. Fujimoto. Supporting parallel applications on clusters of workstations: The intelligent network interface approach. In Proceeding of the 6th International Symposium on High Performance Distributed Computing (HPDC 97), 1997.
....microprocessors have become less expensive and more powerful, leading many to consider the potential benefits of adding a dedicated network coprocessor to free the host machine's resources for more important work. This goal applies equally to true parallel architectures [8] and networked workstations [7]. The goal of the Cal Poly Intelligent Network Interface Card (CiNIC) project is to offload the network functions from a host machine onto a dedicated network coprocessor. In this case, we define network functions to be those processes that manage the movement of data to and from The work ....
ROSU, M., SCHWAN, K., AND FUJIMOTO, R. Supporting parallel applications on clusters of workstations: the intelligent network interface approach. In Proceedings of the Sixth IEEE International Symposium on High Performance Distributed Computing (Portland, OR, Aug. 1997), pp. 159--168.
....busses will continue to inhibit gigabit networking) leads one to focus on adding more processing capabilities to the NIC. Indeed, many gigabit networks have embedded processors on the NIC that researchers are exploiting in many ways. In this way, we are similar to Typhoon [14], Georgia Tech's VCM [15], RWCP's GigaE PM project [16], and the University of British Columbia's GMS NP project [6]. All of these use a processor on the NIC to accelerate distributed computing. However, these solutions (1) are based on embedded processors with a fraction of the computing power of reconfigurable logic and ....
M.-C. Rosu, K. Schwan, and R. Fujimoto. Supporting parallel applications on clusters of workstations: The intelligent network interface approach. In Proceeding of the 6th International Symposium on High Performance Distributed Computing (HPDC 97), 1997.
....NIC could perform all of the protocol processing for a node, offering higher bandwidth and lower latency communications. Unlike modern system architectures, an Intelligent NIC should be able to handle an arbitrarily high bandwidth connection. Several researchers have proposed similar intelligence [1, 15, 16, 17]. Combined Compute Protocol Accelerator takes advantage of the opportunity to tightly couple a high-performance computing core with a network interface. This is the most interesting of the three modes as it has the advantage of very low latency from the Figure 1. One ACC node computing ....
....to inhibit gigabit speed networking) leads one to focus on adding more of the communications processing to the NIC. Indeed, many Gigabit Ethernet NICs have embedded processors that researchers are exploiting in various ways. In this way, our work is similar to Typhoon [16], Georgia Tech's VCM [17], RWCP's GigaE PM project [18], and the University of British Columbia's GMS NP project [4]. All of these use a processor on the NIC to accelerate distributed computing. However, these solutions (1) are based on less powerful (inexpensive) embedded processors and (2) ignore the potential of adding ....
M.-C. Rosu, K. Schwan, and R. Fujimoto. Supporting parallel applications on clusters of workstations: The intelligent network interface approach. In Proceeding of the 6th International Symposium on High Performance Distributed Computing (HPDC 97), 1997.
....an SMP for communication processing benefits lightweight protocols and improves performance when communication is a bottleneck. Indeed, many gigabit networks now include embedded processors on the NIC for various network processing tasks. Research efforts such as Typhoon [13], Georgia Tech's VCM [14], RWCP's GigaE PM project [16], and the University of British Columbia's GMS NP project [4] all use such a processor to accelerate distributed computing. Similarly, research at CMU explored hardware to augment an ATM card to boost distributed programming speeds with a Hardware Assisted Remote Put ....
M.-C. Rosu, K. Schwan, and R. Fujimoto. Supporting parallel applications on clusters of workstations: The intelligent network interface approach. In Proceeding of the 6th International Symposium on High Performance Distributed Computing (HPDC 97), 1997.
....are several programming models that can be exploited. The first task is to provide an extension of MPI to enable clusters scattered around the world to communicate efficiently. Several projects are under development: MPI Connect [35], Nexus [36], PACX MPI [32], MPI Plus [56], Data Exchange [33], VCM [57], and MagPIe [46]. Several metacomputing projects are currently building the infrastructure on top of which such extensions of MPI may utilize distributed computing capacity. The most prominent systems are Globus [37] (see http://www.globus.org) and Legion [42] (see http://www.cs.virginia.edu ....
M.C Rosu, K. Schwan, and R. Fujimoto. Supporting parallel applications on clusters of workstations: the virtual communication machines-based architecture. Cluster Computing, 1:51-67, 1998.
....publications specifically describing performance tools for them, nor any using our firmware-based approach. Other work has used only microbenchmark and statistical methods [7] or high-level software measurements [28]. There are some other research projects on other programmable network interfaces [11, 24, 26]. They also study the placement of functionality between the host and the network interface. However, the primary focus of these projects is on reducing the software overhead of communication to achieve maximum performance from the raw hardware, instead of on collecting the performance data of the ....
M.-C. Rosu, K. Schwan, and R. Fujimoto. Supporting Parallel Applications on Clusters of Workstations: The Intelligent Network Interface Approach. In Proc. of the 6th IEEE International Symposium on High Performance Distributed Computing, Aug. 1997.
No context found.
M. Rosu, K. Schwan, and R. Fujimoto. Supporting Parallel Applications on Clusters of Workstations: The Virtual Communication Machine-based Architecture. In Proceedings of Cluster Computing, 1998.
No context found.
M.-C. Rosu, K. Schwan, and R. Fujimoto. Supporting Parallel Applications on Clusters of Workstations: The Virtual Communication Machine-based Architecture. Cluster Computing, May 1998.
....by approaches like XDR are potentially avoided. Furthermore, when sender and receiver use the same native data representation, such as in exchanges between homogeneous architectures, this approach allows received data to be used directly from the message buffer, eliminating high copy overheads [10, 11]. When sender's and receiver's formats differ, NDR's DCG-based conversions have efficiency similar to that of systems that rely on a priori agreements to make use of compile or link time stub generation. However, because NDR's conversion routines are dynamically generated at data exchange ....
....and unpack operations. However, relegating these tasks to the communicating applications means that the communicating components must agree on the format of messages. In addition, the semantics of application-side pack/unpack operations generally imply a costly data copy to or from message buffers [16, 10]. Other packages, such as MPI, support the creation of user-defined data types for messages and fields and provide some marshalling and unmarshalling support for them. Although this provides some level of flexibility, MPI does not have any mechanisms for run-time discovery of data types of unknown ....
[Article contains additional citation context not shown here]
M.-C. Rosu, K. Schwan, and R. Fujimoto, "Supporting parallel applications on clusters of workstations: The virtual communication machine-based architecture," Cluster Computing, Special Issue on High Performance Distributed Computing, vol. 1, pp. 51-67, January 1998.
....site located on the cluster machine acts as the primary mirror, with other sites acting as secondary mirrors. In an actual, deployed operational system, it should be possible to separate data mirroring from processing functionality, thereby permitting us to use application-specific extensions [14, 15] of operating system kernels of communication co-processors (i.e., network interface boards) to reduce mirroring overheads. The resulting software architecture used for data mirroring, then, separates the application-specific code (i.e., business logic) executed by the Event Derivation Engine from ....
M.C. Rosu, K. Schwan, and R. Fujimoto, "Supporting Parallel Applications on Clusters of Workstations: The Intelligent Network Interface Approach", In Proc of Sixth Symposium on High Performance Distributed Computing (HPDC6), Portland, Aug. 1997.
....by approaches like XDR are potentially avoided. Furthermore, when sender and receiver use the same native data representation, such as in exchanges between homogeneous architectures, this approach allows received data to be used directly from the message buffer, eliminating high copy overheads [10, 11]. When sender's and receiver's formats differ, NDR's DCG-based conversions have efficiency similar to that of systems that rely on a priori agreements to make use of compile or link time stub generation. However, because NDR's conversion routines are dynamically generated at data exchange ....
....and unpack operations. However, relegating these tasks to the communicating applications means that the communicating components must agree on the format of messages. In addition, the semantics of application-side pack/unpack operations generally imply a costly data copy to or from message buffers [16, 10]. Other packages, such as MPI, support the creation of user-defined data types for messages and fields and provide some marshalling and unmarshalling support for them. Although this provides some level of flexibility, MPI does not have any mechanisms for run-time discovery of data types of unknown ....
[Article contains additional citation context not shown here]
M.-C. Rosu, K. Schwan, and R. Fujimoto, "Supporting parallel applications on clusters of workstations: The virtual communication machine-based architecture," Cluster Computing, Special Issue on High Performance Distributed Computing, vol. 1, pp. 51-67, January 1998.
....Each NI has a high performance host CPU/NI interconnect (e.g., a PCI bus), direct connections to the switch, a programmable CoProcessor supporting protocol processing, and local memory with direct connections to disk devices and other peripherals. The NIs used in our research include ATM FORE [19], Myrinet [24], and I2O-compliant network interface boards (Intelligent I/O Industry Consortium) [15, 11]. This paper employs a server configured as 16 quad Pentium Pro nodes connected via I2O-based NIs, each of which has two 100Mbps Ethernet links, a PCI interface to the host CPU, and two SCSI ....
.... [Figure 2. Distributed VCM Architecture.] Contributions. The DVCM idea, its realization for CoProcessor-based NI architectures, and its utility for attaining high performance on cluster machines have been explained and evaluated in previous papers [19]. This paper's novel contributions are the following: • We demonstrate the feasibility of implementing the DVCM architecture on COTS (Commercial Off The Shelf) runtime support resident on NIs, namely, the VxWorks embedded real-time OS from Wind River Systems [27] running on Intel i960RD I2O ....
[Article contains additional citation context not shown here]
M.-C. Rosu, K. Schwan, and R. Fujimoto. Supporting Parallel Applications on Clusters of Workstations: The Intelligent Network Interface Approach. Proceedings of the 6th IEEE International Symposium on High Performance Distributed Computing, Aug. 1997.
.... such as in exchanges between homogeneous architectures, this approach allows received data to be used directly from the message buffer, making it feasible for middleware to effectively utilize high performance communication layers like FM [13] or the zero copy messaging demonstrated by Rosu et al. [17] and Welsh et al. [19]. When conversion between formats is necessary, these DCG conversions are of the same order of efficiency as the compile time generated stub routines used by the fastest systems relying upon a priori agreements [14]. However, because the conversion routines are derived at ....
....order to change message formats can be a significant impediment to the integration, deployment and evolution of complex systems. In addition, the semantics of application-side pack/unpack operations generally imply a data copy to or from message buffers, with a significant impact on performance [13, 17]. Packages which perform internal marshalling, such as MPI, could avoid data copies and offer more flexible semantics in matching fields provided by senders and receivers. However, existing packages have failed to capitalize on those opportunities. For example, MPI's type matching rules require ....
[Article contains additional citation context not shown here]
M.-C. Rosu, K. Schwan, and R. Fujimoto. Supporting parallel applications on clusters of workstations: The virtual communication machine-based architecture. Cluster Computing, Special Issue on High Performance Distributed Computing, 1, January 1998.
....layers (see Figure 2): (1) the VCM-based interface, (2) the DVCM extension modules, and (3) the DVCM run-time system. In this section, we briefly describe the first two layers and thoroughly describe and evaluate the run-time system; a detailed description of the first two layers is included in [16, 15]. VCM-based Interface Layer. On each workstation in the cluster, a local NI may be abstracted to the local processes as a Virtual Communication Machine (VCM). The main components of the VCM are the address space and the extensible instruction set. The VCM address space is the union of the memory ....
....In contrast, receive rates decrease in steps because the receiver has to handle an entire ATM cell before determining how much of its content is useful data. Both send and receive rates are an order of magnitude higher than the rates achievable at application level with the same hardware [16]. This is because (1) CoProcessor-to-CoProcessor messaging requires lower CoProcessor overhead than application level messaging, and (2) I/O bus .... [Figure: Control Message Latency (Microseconds) vs. Message Size (Bytes); series include Average Latency, Minimum ....]
[Article contains additional citation context not shown here]
M.-C. Rosu, K. Schwan, and R. Fujimoto. Supporting Parallel Applications on Clusters of Workstations: The Virtual Communication Machine-based Architecture. Cluster Computing, 1:1--17, Jan. 1998.
....to improve at much slower rates, resulting in a performance gap that will continue to widen in the foreseeable future. This implies that interactions between the network and hosts utilizing main memory are expensive. Additional costs arise for such interactions from overheads due to I/O bus usage [4, 5], communication protocol implementations (e.g., if interrupts are used [2]), and interactions with the host CPU's memory management and caching infrastructure [3]. Consequently, network-based applications that produce, transport, and process large data sets suffer substantial losses in performance ....
M. Rosu, K. Schwan, and R. Fujimoto. Supporting parallel applications on clusters of workstations: the virtual communication machine-based architecture. Cluster Computing, pp. 1029, November 1997.
....firmware, the kernel extension, and the user-level library relatively independent. 4. Implementation. This section briefly describes our current implementation that uses a cluster of Sun UltraSPARCs I Model 170 running Solaris 2.5 equipped with FORE SBA 200E network cards; details can be found in [18, 19]. The FORE SBA 200E cards can transfer data directly between the wire and the host memory, bypassing the card memory. The VCM is implemented as an interpreter running on the coprocessor (a 25 MHz i960 microprocessor) and the VCM address space is included in the SPARC's DVMA space. The kernel ....
....card memory. The coprocessor polls the command word and writes into the status word while the host polls the status word and writes into the command word. 4.2. VCM interpreter. The main loop of the VCM interpreter considers the following requests in the order in which they are listed below (see [18, 19] for details): • Protection-related instructions: inform the VCM about changes in its address space or in the application's connection (ATM VCIs) set. • VCM programs: all instruction segments are checked for programs to execute. For each program, all its instructions (limited to several ....
M.-C. Rosu, K. Schwan, and R. Fujimoto. Supporting Parallel Applications on Clusters of Workstations: The Intelligent Network Interface Approach. Georgia Institute of Technology, College of Computing, TR GIT-CC-97-16, May 1997.
....a perfect match exists between the resources installed on the NI and the resources required for driving the network link at full speed. Most likely, every well-designed NI card hosts a certain amount of unused resources. Furthermore, our previous work on the Virtual Communication Machine (VCM) [17] has demonstrated that the amounts of additional resources required by DVCM plus typical application-specific extensions are small. This paper demonstrates the benefits of the DVCM extensible architecture with an extension module that implements a task useful in most parallel applications: ....
....(3) describe the extension module for sender coordination and its performance, and (4) provide a brief overview of the related research. 2. The Distributed Virtual Communication Machine. The DVCM architecture is the natural extension of our previous work on the Virtual Communication Machine (VCM) [16, 17]. The VCM is a programmable abstraction of the network interface. It consists of an execution unit and an address space. The VCM execution unit implements a simple set of commands, most of which are used for assembling/disassembling application data from/into network messages. The command set is ....
[Article contains additional citation context not shown here]
M.-C. Rosu, K. Schwan, and R. Fujimoto. Supporting Parallel Applications on Clusters of Workstations: The Virtual Communication Machine-based Architecture. Cluster Computing, 1:1--17, Jan. 1998.
....(3) describe the extension module for sender coordination and its performance, and (4) provide a brief overview of the related research. 2. The Distributed Virtual Communication Machine. The DVCM architecture is the natural extension of our previous work on the Virtual Communication Machine (VCM) [16, 17]. The VCM is a programmable abstraction of the network interface. It consists of an execution unit and an address space. The VCM execution unit implements a simple set of commands, most of which are used for assembling/disassembling application data from/into network messages. The command set is ....
....and communication systems [2, 7, 19, 26] has demonstrated the fact that a single set of system primitives cannot easily satisfy the requirements of every user provided application program. One of the proposed solutions is to customize the application interaction with the network interface [8, 16, 17]. The research reported here goes further and considers customizing the interaction of the application with the entire network, viewed as a single entity. The DVCM architecture provides two mechanisms with which applications can customize their interactions with the network. First, applications ....
[Article contains additional citation context not shown here]
M.-C. Rosu, K. Schwan, and R. Fujimoto. Supporting Parallel Applications on Clusters of Workstations: The Intelligent Network Interface Approach. Proceedings of the 6th IEEE International Symposium on High Performance Distributed Computing, Aug. 1997.
No context found.
M.-C. Rosu, K. Schwan, and R. Fujimoto. Supporting parallel applications on clusters of workstations: The intelligent network interface approach. In Proceeding of the 6th International Symposium on High Performance Distributed Computing (HPDC 97), 1997.