Software Overhead in Messaging Layers: Where Does the Time Go?
In Proceedings of the Sixth Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), 1994
Cited by 68 (10 self)
Despite improvements in network interfaces and software messaging layers, software communication overhead still dominates the hardware routing cost in most systems. In this study, we identify the sources of this overhead by analyzing software costs of typical communication protocols built atop the active messages layer on the CM-5. We show that up to 50-70% of the software messaging costs are a direct consequence of the gap between specific network features such as arbitrary delivery order, finite buffering, and limited fault-handling, and the user communication requirements of in-order delivery, end-to-end flow control, and reliable transmission. However, virtually all of these costs can be eliminated if routing networks provide higher-level services such as in-order delivery, end-to-end flow control, and packet-level fault-tolerance. We conclude that significant cost reductions require changing the constraints on messaging layers: we propose designing networks and network interfaces...
Packet Routing In Fixed-Connection Networks: A Survey
1998
Cited by 26 (3 self)
We survey routing problems on fixed-connection networks. We consider many aspects of the routing problem and provide known theoretical results for various communication models. We focus on (partial) permutation, k-relation routing, routing to random destinations, dynamic routing, isotonic routing, fault tolerant routing, and related sorting results. We also provide a list of unsolved problems and numerous references.
Do Faster Routers Imply Faster Communication?
In First International Workshop, PCRCW'94, volume 853 of LNCS, 1994
Cited by 19 (0 self)
Despite significant improvements in network interfaces and software messaging layers, software communication overhead still dominates the hardware routing cost in most parallel systems. In this study, we identify the sources of this overhead by relating user communication services to particular network hardware features. Based on a detailed analysis of the active messages layer on the CM-5, we assign the software messaging cost to specific user communication services and network features. Our study shows that 50-70% of the software cost of messaging can be attributed to providing end-to-end flow control, in-order delivery, and reliable transmission services. This overhead is a direct effect of specific network features (arbitrary delivery order, finite buffering, and limited fault-handling) and is unlikely to be eliminated through improved software implementations. We conclude that reducing this software overhead requires changing the constraints on messaging layers...
Exploiting Two-Case Delivery for Fast Protected Messaging
In HPCA, 1998
Cited by 17 (3 self)
We propose and evaluate two complementary techniques to protect and virtualize a tightly-coupled network interface in a multicomputer. The techniques allow efficient, direct application access to network hardware in a multiprogrammed environment while gaining most of the benefits of a memory-based network interface. First, two-case delivery allows an application to receive a message directly from the network hardware in ordinary circumstances, but provides buffering transparently when required for protection. Second, virtual buffering stores messages in virtual memory on demand, providing the convenience of effectively unlimited buffer capacity while keeping actual physical memory consumption low. The evaluation is based on workloads of real and synthetic applications running on a simulator and partly on emulated hardware. The results show that the direct path is also the common path, justifying the use of software buffering. Further results show that physical buffering requirements ...
The Impact of Packetization in Wormhole-Routed Networks
1993
Cited by 12 (4 self)
Packetization is used in a variety of commercial multicomputers because of its potential performance advantages: higher throughput and a better distribution of message latencies. However, packetization has two significant drawbacks, 1) fragmentation and reassembly overhead and 2) increased traffic volume for routing and sequencing information. In this paper, we examine the performance benefits of packetization in existing dimension-order routed networks and in likely future router designs including adaptive routing and virtual lanes. Our studies show that packetization has a mixed effect on performance in dimension-order routers. Packetizing uniform-sized traffic reduces network throughput dramatically. However, if the traffic is a bimodal distribution of sizes, packetization reduces the variance of latencies for short messages, and increases the network's overall throughput. On the other hand, packetization has no significant impact on the performance of advanced networks with adaptive routing and virtual lanes. Advanced routers without packetization give nearly identical performance to the corresponding packetizing networks under uniform-sized or bimodal traffic. Packetization may be unnecessary in such networks.
Least Common Ancestor Networks
1993
Cited by 12 (2 self)
Least Common Ancestor Networks (LCANs) are introduced and shown to be a class of networks that include fat-trees, baseline networks, SW-banyans and the router networks of the TRAC 1.1 and 2.0, and the CM-5. Some LCAN properties are stated and the permutation routing capabilities of an important subclass are analyzed. Simulation results for three permutation classes verify the accuracy of an iterative solution for a randomized routing strategy.
Workloads and Performance Metrics for Evaluating Parallel Interconnects
27, Summer-Fall, 1994
Cited by 7 (0 self)
Introduction: From the earliest days of distributed and parallel systems, researchers have been using simulation and a variety of workloads to evaluate parallel interconnects [15, 16, 14]. While simulation facilitates rapid evaluation of alternatives, the absence of parallel systems means that the workloads used are an approximation of actual usage. This approximation is critical because it directly affects simulation results and thus evaluation of the parallel interconnect. In recent years, the increasing use of massively-parallel systems has dramatically changed the situation for parallel interconnect workloads. With several major vendors of massively-parallel systems and many large installations, there is at last a significant base of parallel application programs [3]. While the usage of these systems is still in rapid flux, clearly identifiable uses and performance requirements for parallel interconnects have emerged. Such information presents an opportunity for the communi
Efficient Techniques for Fast Nested Barrier Synchronization
In ACM Symposium on Parallel Algorithms and Architectures, 1995
Cited by 5 (2 self)
Two hardware barrier synchronization schemes are presented which can support deep levels of control nesting in data parallel programs. Hardware barriers are usually an order of magnitude faster than software implementations. Since large data parallel programs often have several levels of nested barriers, these schemes provide significant speedups in the execution of such programs on MIMD computers. The first scheme performs code transformations and uses two single-bit-trees to implement unlimited levels of nested barriers. However, this scheme increases the code size. The second scheme uses a more expensive integer-tree to support an exponential number of nested barriers without increasing the code size. Using hardware already available on commercial MIMD computers, this scheme can support more than four billion levels of nesting.
Introduction: The data parallel programming model allows a natural way of expressing the large degree of parallelism involved in most computationally inte...
Efficient Broadcasting Procedures for Constrained Reconfigurable Meshes
1996
Cited by 5 (1 self)
Broadcast operations on reconfigurable meshes that only use column or row buses of known lengths are easily simulated by constrained reconfigurable meshes that restrict the lengths of bus components to a practical number of bus segments in a single cycle. Frequently, however, reconfigurable mesh algorithms make use of arbitrarily shaped and sized non-branching buses. This paper presents an optimal single source and an efficient multiple source broadcasting procedure for arbitrary linear buses under the constrained reconfigurable mesh model.
A Cache Coherence Protocol for the Bidirectional Ring Based Multiprocessor
In International Conference on Parallel and Distributed Computing and Systems, 1999
Cited by 4 (1 self)
In this paper, a new cache protocol for ring based shared memory multiprocessors is discussed and analyzed. The proposed protocol uses multicasting of rings to reduce the message traversal length. The simulation results show that the proposed protocol with a bidirectional ring improved the system performance by 8% to 30% as compared to Barroso's protocol using a unidirectional ring. Assuming the bidirectional ring structure in both cases, the proposed protocol yields up to a 13% performance improvement over Barroso's protocol.

