| CULLER, D. E., LIU, L. T., MARTIN, R. P., AND YOSHIKAWA, C. O. Assessing Fast Network Interfaces. In IEEE Micro (Feb. 1996), vol. 16, pp. 35--43. |
....L, an upper bound on the latency, or delay, incurred in communicating a message; o, the message handling overhead; g, the gap, which is the reciprocal of the available per-processor communication bandwidth; and P, the number of processor/memory modules. Examples of utilization of LogP are shown in [4], [5] and [6]. A model that has recently gained wide consideration in the scientific community is the BSP (Bulk Synchronous Parallel) model [7], [8] because it is neither too abstract nor too low level and it seems to meet many of the requirements listed above [9]. It was proposed by Valiant as a ....
D. Culler, L. T. Liu, R. P. Martin, and C. Yoshikawa, "Assessing Fast Network Interfaces," IEEE Micro, vol. 16, pp. 35--43, February 96.
....a sequence of contention-free steps and are thus limited by the injection overhead and the base network latency. In the absence of contention the two parameters L and o can be properly estimated and the model can lead to important and effective optimizations. Examples of LogP usage are shown in [32], [56] and [97]. 4.3.3 BSP. The Bulk Synchronous Parallel or BSP model was proposed by Valiant as a bridging model that provides a standard interface between the domains of parallel architectures and algorithms. In the BSP model, a parallel architecture consists of a set of processors, each with ....
David Culler, Lok Tin Liu, Richard P. Martin, and Chad Yoshikawa. Assessing Fast Network Interfaces. IEEE Micro, 16(1):35--43, February 96.
....streaming bulk performance and LogP microbenchmarks. For Split-C, we measure raw communication primitive performance and benchmark five applications. 6.1 AMVIA. The performance of AMVIA is evaluated from three benchmarks: one-way message time, streaming performance and the LogP microbenchmark [Cul93, Cul96]. One-way message time is a measure of the average time it takes for a message of a given size to be transmitted from a source node and received by the destination node. It is measured through a series of ping-pong tests in which a message is sent to a destination node which reflects it back to ....
D. E. Culler, L. Tin Liu, R. P. Martin and C. Yoshikawa, "Assessing Fast Network Interfaces.", IEEE Micro, vol. 16, (no. 1), Feb 1996, pp. 35-43
....interrupt on message arrival; the implementation of a spanning tree forwarding protocol; the implementation of a reliable credit-based multicast protocol; and the addition of an efficient primitive to obtain a sequence number. Several other networks also have programmable interface processors [7], so some of the optimizations may be applicable on such networks as well. 4.3 Layer Collapsing. The layered structure of our system also adds some overhead. Since FM and Panda are two independent systems, they each maintain their own pool of buffers, so they both have the overhead of buffer ....
D.E. Culler, L.T. Liu, R.P Martin, and C.O. Yoshikawa. Assessing Fast Network Interfaces. IEEE Micro, 16(1):35--43, February 1996.
....size messages. These results are not specific for the system under evaluation, and are of quite general applicability. We have used a series of communication microbenchmarks to measure the parameters of the LogP model. These specialized benchmarks are similar to those employed by Culler et al. [5] to characterize the Active Messages communication library on three different machines [12]. What differentiates our work is the focus on the interaction between the messaging layer and the underlying network architecture. The paper is organized as follows. In section 2 we give a quick ....
....as a basis for further extensions addressing specific issues in algorithm design like the use of bulk data transfers [1]. Less attention has been given to the model as a tool to drive architectural choices and evaluate the different components making up a parallel system. Recently, Culler et al. [5] have used the model to compare network interfaces. They measured the LogP parameters using microbenchmarks on three parallel machines characterized by different network architectures. They then derived from these data a more detailed performance characterization than the traditional one based on ....
[Article contains additional citation context not shown here]
D.E. Culler, L.T. Liu, R.P. Martin, C.O. Yoshikawa, "Assessing Fast Network Interfaces", IEEE Micro, 16(1), pp. 35--43, Feb. 1996.
....packets). The retransmitting implementations (columns 4 and 5) require more NI send buffers, because they cannot release send buffers until an acknowledgement arrives. These implementations always use the same window size (8 packets). 4.1 Unicast Performance. Table 4 shows the values of the LogP [10] parameters, the end-to-end latency, and the fetch-and-add latency for RXno Mni, RXni Mni, and RXh Mh. Since multicast forwarding plays no role in these measurements, the table and this section refer to these implementations as RXno, RXni, and RXh, respectively. Since ....
....with a host-level sliding window protocol for reliability and flow control [9]. Several papers describe NI-supported multicast protocols [5, 14, 17, 25]. We are the first to compare efficient NI-level and host-level multicasts and their impact on application performance. Araki et al. applied LogP [10] measurements to several user-level communication systems [1] with different reliability strategies. They compare systems with different programming interfaces (e.g. memory-mapped communication and message passing) and do not consider multicast. We compare implementations of one interface and ....
D. Culler, L. Liu, R. Martin, and C. Yoshikawa. Assessing Fast Network Interfaces. IEEE Micro, 16(1):35--43, February 1996.
....bandwidth (g) and the number of processors (P). Furthermore, it is assumed that the network has finite capacity, such that at most ⌈L/g⌉ messages can be in transit from any processor or to any processor at any given moment during program execution. Some interesting applications of LogP are shown in [27, 36, 3, 14]. The remainder of this paper is organized as follows. Section 2 presents three sequential FFT algorithms and some properties of the butterfly graph that are used in section 3 to implement two variants of the transpose parallel FFT algorithm. While the first one strictly separates the computation ....
David Culler, Lok Tin Liu, Richard P. Martin, and Chad Yoshikawa. Assessing Fast Network Interfaces. IEEE Micro, 16(1):35-43, February 96.
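The capacity constraint quoted in the excerpt above can be sketched in a few lines. This is only an illustration: the function names and the sample values are hypothetical and come from none of the citing papers. The second helper uses the usual LogP estimate for one small point-to-point message, send overhead plus latency plus receive overhead, with both overheads taken as the same o.

```python
import math


def logp_capacity(latency: float, gap: float) -> int:
    """Upper bound on messages in transit to or from one processor: ceil(L / g)."""
    if latency < 0 or gap <= 0:
        raise ValueError("latency must be >= 0 and gap must be > 0")
    return math.ceil(latency / gap)


def one_way_time(latency: float, overhead: float) -> float:
    """Usual LogP estimate for one small message: o (send) + L + o (receive)."""
    return 2 * overhead + latency


if __name__ == "__main__":
    # Hypothetical parameter values, in arbitrary time units.
    print(logp_capacity(latency=6.0, gap=4.0))
    print(one_way_time(latency=6.0, overhead=2.0))
```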
....Ethernet networks are still the most commonly used network interconnect for parallel processing using a workstation cluster, whilst 64 and 128 bit internal busses can now operate in excess of 100MHz. Providing higher bandwidth networks for NoW parallel computing has been a focus for much research [Culler et al. 1996]. Another major limitation of NoW parallel computing is the relatively high latency for data transfer across a network [Culler et al. 1993]. For example, 100 Mbit/s Ethernet has a latency of the order of half a millisecond when protocol and operating system overheads are included [Dutton and ....
Culler, David E., Liu, Lok Tin, Martin, Richard P., and Yoshikawa, Chad O. (1996). Assessing fast network interfaces. IEEE Micro, pages 35--43.
....the computational overhead of the clock synchronization algorithm is not an issue for most message sizes. The difference between adaptive and non-adaptive is negligible. From the same program we can also compute one-way host-to-host latency, which is the parameter RTT/2 in the Berkeley LogP model [3]. This measures the time from when the sending node issues the send request until all the data arrives in the receiving node's memory. The impact of running clock synchronization code is about 0.3s to 0.6s for this microbenchmark, which is 0.2 in the worst case. The adaptive part of the ....
D. Culler et al. Assessing fast network interfaces. IEEE Micro, Feb. 1996.
....G for modeling the gap per byte for long messages, which are typically handled more efficiently. Other variants of LogP have also been proposed where the overhead at the sender and the receiver side is treated separately as o_s and o_r, and where some parameters depend on the message size [3, 5, 6]. For practical use of LogP, the actual parameters of a parallel computing platform have to be measured. The main problem is how to accurately measure the gap parameter. The measurement methods described in [3, 5] measure the gap by sending large sequences of messages in order to saturate the ....
....separately as o_s and o_r, and where some parameters depend on the message size [3, 5, 6]. For practical use of LogP, the actual parameters of a parallel computing platform have to be measured. The main problem is how to accurately measure the gap parameter. The measurement methods described in [3, 5] measure the gap by sending large sequences of messages in order to saturate the communication links, in which case the link capacity (as expressed via the gap) can be observed. This measurement procedure has two drawbacks. First, it is highly intrusive and may disturb other ongoing communication. ....
[Article contains additional citation context not shown here]
D. E. Culler, L. T. Liu, R. P. Martin, and C. O. Yoshikawa. Assessing Fast Network Interfaces. IEEE Micro, 16(1):35--43, Feb. 1996.
.... of the libraries, which lead to implementation variants and, possibly, to performance differences; basic communication performance results (throughputs and round trip latencies) and a more detailed quantitative assessment of the libraries, based on the LogP model [17] and the methodology of [18]. The LogP analysis has been performed to obtain a better understanding of if and how specific features or implementation variants show up in performance differences. The rest of the paper is basically organized according to the above aspects. Section II describes the SCI platform that has been ....
....Thus, communication of bulk data incurs high overhead in the sense of the LogP parameters of section VI. III. Communication Libraries. A. SCI Active Messages. Active Messages (AM) can be regarded as lightweight asynchronous remote procedure calls (RPCs), each of which is a request-reply pair [18]. A request AM becomes active on the receiving end in that it invokes a user-level, non-blocking message handler to service the request and to send back a reply AM, which in turn is handled by the reply handler on the requesting node. AM typically are short, supporting a fixed set of primitive ....
[Article contains additional citation context not shown here]
David Culler, Lok Tin Liu, Richard P. Martin, and Chad O. Yoshikawa, "Assessing Fast Network Interfaces," IEEE MICRO, vol. 16, no. 1, pp. 35--43, Feb. 1996.
....Microbenchmarks. In this section I characterize the performance of seven NIs using two microbenchmarks: round-trip latency and bandwidth. These microbenchmarks capture the baseline performance of these NIs. An alternative approach would be to characterize the NIs using the Berkeley LogP model [30]. The LogP model characterizes NI accesses with three parameters: latency (L), overhead or processor occupancy (o), and bandwidth (g). However, I refrain from using this model because the latency and overhead components of this model do not uniformly capture the same metrics for all of my NIs. For ....
David Culler, Lok Tin Liu, Richard Martin, and Chad Yoshikawa. Assessing Fast Network Interfaces. IEEE Micro, 16(1), February 1996.
....the one-sided throughput T(S) of the messaging system as perceived at the sender side. A similar parameter can be defined for the receiver side. • The number P of processors connected to the messaging system. Measuring the LogP parameters is tricky business. The interested reader may look at [CLMY96]. An experience of such a measurement is reported in [ILM98]. 2.3 BSP. The LogP model is more accurate than simple throughput curves, however it still lacks a precise methodology for deriving performance predictions from parallel programs in a reasonable way. The ability of making such ....
D. Culler, L. Liu, R. Martin, and C. Yoshikawa. Assessing Fast Network Interfaces. IEEE Micro, 16(1):35--43, February 1996.
....the one-sided throughput T(S) of the messaging system as perceived at the sender side. A similar parameter can be defined for the receiver side. • The number P of processors connected to the messaging system. Measuring the LogP parameters is tricky business. The interested reader may look at [30]. An experience of such a measurement is reported in [45]. 2.3 BSP. The LogP model is more accurate than simple throughput curves, however it still lacks a precise methodology for deriving performance predictions from parallel programs in a reasonable way. The ability of making such predictions is ....
D. Culler, L. Liu, R. Martin, and C. Yoshikawa. Assessing Fast Network Interfaces. IEEE Micro, 16(1):35--43, February 1996.
No context found.
CULLER, D. E., LIU, L. T., MARTIN, R. P., AND YOSHIKAWA, C. O. Assessing Fast Network Interfaces. In IEEE Micro (Feb. 1996), vol. 16, pp. 35--43.
No context found.
CULLER, D. E., LIU, L. T., MARTIN, R. P., AND YOSHIKAWA, C. O. Assessing Fast Network Interfaces. In IEEE Micro (Feb. 1996), vol. 16, pp. 35--43.
No context found.
D. Culler, L. Liu, R. Martin, and C. Yoshikawa. Assessing Fast Network Interfaces. In IEEE Micro Magazine, Feb. 1996, pp. 35-43.
No context found.
D. Culler, L. Liu, R. Martin, and C. Yoshikawa. Assessing Fast Network Interfaces. In IEEE Micro Magazine, Feb. 1996, pp. 35-43.
....the RRB into the data structure of the receiving process. Additional copy operations are avoided as far as possible. 3.3 Message passing APIs. Active Messages 2.0. Active Messages (AM) can be regarded as lightweight asynchronous remote procedure calls (RPCs), each of which is a request-reply pair [5]. A request AM becomes active on the receiving end in that it invokes a user-level, non-blocking message handler to service the request and to send back .... [Figure: Active Messages; curves labeled by delay (0.00 s, 0.25 s, 0.50 s, 1.00 s, 2.00 s, 4.00 s) and "send only, no delay"; axis label begins "Burst ...."]
....queue (UMQ), increasing the number of potential copy operations. Furthermore, the RRB as well as the UMQ have to be checked on initiation of a receive operation. 3.4 Experiments and results. Short Message Performance. Short message performance was assessed using the LogP model described in [5]. The numbers for all three layers are shown in Table 1 and were derived from the graphs given in Figures 4, 5, and 6, respectively, while the roundtrip times (RTT) were measured separately. They clearly reflect the implementation differences. Active Messages, being the layer with the leanest ....
D. Culler, L. T. Liu, R. P. Martin, and C. O. Yoshikawa. Assessing Fast Network Interfaces. IEEE MICRO, 16(1):35--43, Feb. 1996.
....close to the minimal message layer for the hardware used in our performance comparison. Active Messages [5] was the first widely used ULN. Originally developed for parallel supercomputers, it was implemented on a wide variety of network interface hardware, including several fast cluster networks [13]; it was intended as a kind of assembly language for communication. Rather than expose any particular underlying queue structure, it provided a very simple RPC like programming interface. Small messages correspond essentially to the arguments passed in registers across a procedure call. Large ....
....performance results presented above, we look at a simple cross section of the latency for sending a single packet. The time measures are derived from the LogP [12] conceptual model of the VIA architecture. Additionally, we apply the simple extensions to the LogP model as suggested by Culler et al. [13]. A summary of the parameters and our results for a 4 byte packet are given in Table 2. We determined the values of the four parameters using micro benchmarks developed for VIA; we assume a virtual interface has already been established and connected. The shaded values are those we believe to be ....
D. E. Culler, L. Tin Liu, R. P. Martin and C. Yoshikawa, Assessing Fast Network Interfaces, IEEE Micro, vol. 16, (no. 1), Feb 1996, pp. 35-43
....a 64K I/O space that is separate from memory space. Unfortunately, not all I/O operations are observable using this count, as operations to memory-mapped devices appear in memory space, not I/O space. • The number of non-halted cycles. • The number of interrupts. We use the ubench benchmark [13] to measure the LogGP values for our communication systems. This benchmark first measures the send overhead, by direct measurement .... [Footnote 1: The memory bus is only critical when the cache is not large enough to hold the messages being transferred. For small messages, the cache was sufficient, and all ....]
....differential between AM-VIA and AM-UDP is decreasing, we break o_s and o_r into their component costs, including instructions and I/O operations for both user and kernel space. We ignore L because previous work has shown that most applications can effectively overlap L with other computation [13]. Furthermore, Figure 6 shows that L is relatively constant for both AM-VIA and AM-UDP and so cannot be the cause of the decreasing performance differential. The measured L also serves as a check on our methodology. Figure 6 shows that, as expected, L stays relatively constant because the NIs and ....
[Article contains additional citation context not shown here]
D. E. Culler, L. T. Liu, R. P. Martin, and C. O. Yoshikawa. Assessing Fast Network Interfaces. In IEEE Micro, volume 16, pages 35--43, Feb. 1996.
....a 64K I/O space that is separate from memory space. Unfortunately, not all I/O operations are observable using this count, as operations to memory-mapped devices appear in memory space, not I/O space. • The number of non-halted cycles. • The number of interrupts. We use the ubench benchmark [13] to measure the LogGP values for our communication systems. This benchmark first measures the send overhead, o_s, by direct measurement. Next, it measures gap, g. From these two parameters and knowledge of the protocol, we then infer the receive overhead, o_r. Finally, we can derive the latency, L, ....
....differential between AM-VIA and AM-UDP is decreasing, we break o_s and o_r into their component costs, including instructions and I/O operations for both user and kernel space. We ignore L because previous work has shown that most applications can effectively overlap L with other computation [13]. Furthermore, Figure 4 shows that L is relatively constant for both AM-VIA and AM-UDP and so cannot be the cause of the decreasing performance differential. The measured L also serves as a check on our methodology. Figure 4 shows that, as expected, L stays relatively constant because the NIs and ....
[Article contains additional citation context not shown here]
D. E. Culler, L. T. Liu, R. P. Martin, and C. O. Yoshikawa. Assessing Fast Network Interfaces. In IEEE Micro, volume 16, pages 35--43, Feb. 1996.
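The last step described in the excerpt above, deriving L once the overheads are known, might look like the sketch below. It assumes the common LogP accounting in which a round trip costs 2(o_s + L + o_r); the excerpt does not spell out its own formula or how o_r is inferred, so this relation, the function name, and the sample values are assumptions made for illustration only.

```python
def derive_latency(rtt: float, o_send: float, o_recv: float) -> float:
    """Solve RTT = 2 * (o_s + L + o_r) for L (assumed accounting, see lead-in)."""
    latency = rtt / 2 - o_send - o_recv
    if latency < 0:
        # A negative result means the measurements contradict the assumed model.
        raise ValueError("overheads exceed half the round-trip time")
    return latency


if __name__ == "__main__":
    # Hypothetical measurements, all in the same time unit.
    print(derive_latency(rtt=20.0, o_send=3.0, o_recv=4.0))
```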
....a 64K I/O space that is separate from memory space. Unfortunately, not all I/O operations are observable using this count, as operations to memory-mapped devices appear in memory space, not I/O space. • The number of non-halted cycles. • The number of interrupts. We use the ubench benchmark [13] to measure the LogGP values for our communication systems. This benchmark first measures the send overhead, o_s, by direct measurement. Next, it measures gap, g. From these two parameters and knowledge of the protocol, we then infer the receive overhead, o_r. Finally, we can derive the latency, L, ....
....differential between AM-VIA and AM-UDP is decreasing, we break o_s and o_r into their component costs, including instructions and I/O operations for both user and kernel space. We ignore L because previous work has shown that most applications can effectively overlap L with other computation [13]. Furthermore, Figure 4 shows that L is relatively constant for both AM-VIA and AM-UDP and so cannot be the cause of the decreasing performance differential. The measured L also serves as a check on our methodology. Figure 4 shows that, as expected, L stays relatively constant because the NIs and ....
[Article contains additional citation context not shown here]
Culler, D. E., Liu, L. T., Martin, R. P., and Yoshikawa, C. O. Assessing Fast Network Interfaces. In IEEE Micro (Feb. 1996), vol. 16, pp. 35-43.
No context found.
D.E. Culler, L.T. Liu, R.P. Martin and C.O. Yoshikawa, Assessing Fast Network Interfaces, IEEE Micro, Vol. 16, No. 1, pp. 35-43, Feb. 1996.
No context found.
Culler, D., Liu, L.T., Martin, R.P., Yoshikawa, C.O., Assessing Fast Network Interfaces, IEEE Micro, 16(1), (1996), pp 35-43.