A High-Speed Network Interface for Distributed-Memory Systems: Architecture and Applications (1996)
Venue: ACM Trans. on Computer Systems
Citations: 4 (0 self)
Citations
997 | High Performance Fortran language specification, version 1.1
- High Performance Fortran Forum
- 1994
Citation Context: ...e rows, or the rows and columns of the distributed-memory system. Block-cyclic distributions are widely used by applications and are for example supported by languages such as High-Performance Fortran [17]. I/O of a distributed data structure becomes harder as the partitioning is finer. A good measure of the granularity is the size of blocks of data that are contiguous both on the network and in the sy...
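The block-cyclic partitioning discussed in this context is easy to illustrate: a 1-D array is cut into fixed-size blocks that are dealt out to nodes round-robin. The sketch below is a hypothetical illustration (the function name and parameters are ours, not the paper's):

```python
def block_cyclic_owner(global_index, block_size, num_nodes):
    """Map a global element index to (owning node, local index)
    under a 1-D block-cyclic distribution."""
    block = global_index // block_size    # which block holds the element
    node = block % num_nodes              # blocks are dealt out cyclically
    local_block = block // num_nodes      # blocks this node holds before it
    return node, local_block * block_size + global_index % block_size

# With block_size=4 and num_nodes=3: elements 0-3 land on node 0, 4-7 on
# node 1, 8-11 on node 2, then 12-15 wrap back to node 0, and so on.
print(block_cyclic_owner(13, 4, 3))  # -> (0, 5)
```

A smaller `block_size` shrinks the contiguous runs any one node owns, which is exactly why the excerpt notes that I/O of a distributed structure gets harder as the partitioning becomes finer.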
573 | Supporting real-time applications in an integrated services packet network: Architecture and mechanism. In
- Clark, Shenker, et al.
- 1992
Citation Context: ...uses more bandwidth. This anomaly could be avoided on an interconnect that supports the reservation of bandwidth for critical connections, as is for example being explored in the networking community [9, 3]. Figure 16 gives the breakdown of the time spent during the transfer to the HIB. The times are for a node in the last row of the iWarp system. The iWarp node performs the reshuffling and then waits f...
348 | An Analysis of TCP Processing Overhead
- Clark, Jacobson, et al.
- 1989
Citation Context: ...t overlap send/receives with processing and interactions with the distributed-memory system. We also optimized the UDP/TCP/IP implementation itself using standard techniques such as header prediction [10]. The above optimizations keep the communication overhead within acceptable bounds. On a DEC Alpha workstation 3000/400, it takes about 300 microseconds from the time a user-level application issues a...
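Header prediction, the optimization cited here, exploits the fact that most arriving TCP segments carry the next in-sequence data on an established connection, so a handful of comparisons can route them to a fast path. The sketch below is a hypothetical rendering of the idea (the dictionary field names are ours); real implementations operate on raw headers in the kernel.

```python
def receive_segment(conn, seg):
    """Fast-path an in-order, flag-free data segment; otherwise fall back."""
    if (seg["flags"] == {"ACK"}                 # pure data: no SYN/FIN/RST/URG
            and seg["seq"] == conn["rcv_nxt"]   # exactly the next byte expected
            and seg["wnd"] == conn["snd_wnd"]): # advertised window unchanged
        conn["recv_buf"] += seg["data"]         # prediction hit: deliver directly
        conn["rcv_nxt"] += len(seg["data"])
        return "fast"
    return slow_path(conn, seg)                 # full TCP input processing

def slow_path(conn, seg):
    # Placeholder for the general-case code (reassembly, flag handling, ...).
    return "slow"

conn = {"rcv_nxt": 100, "snd_wnd": 8192, "recv_buf": b""}
seg = {"flags": {"ACK"}, "seq": 100, "wnd": 8192, "data": b"hello"}
print(receive_segment(conn, seg))  # -> fast
```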
140 | iWarp: an integrated solution to high-speed parallel computing.
- Borkar
- 1988
Citation Context: ...peed network standards in various stages of development by standards bodies. These include ATM (Asynchronous Transfer Mode) [16] and Fibre Channel [26]. Meanwhile, distributed-memory computer systems [6, 23, 27, 32, 44] are becoming the architecture of choice for many supercomputer applications. The reason is that they are inherently scalable, and provide relatively inexpensive computing cycles compared with traditi...
92 | Hypercube supercomputers.
- Hayes, Mudge
- 1989
Citation Context: ...ver the internal interconnect to the I/O node, which forwards it to the external device, such as a network or disk. Input follows the inverse path. This approach to I/O is very common, e.g. the NCube [22], and the Intel iPSC [32] and Paragon [25] machines follow the same approach. 4 Transport protocol processing Protocol processing (e.g. TCP or UDP over IP) is one of the potential bottlenecks in netwo...
91 | Task Parallelism in a High Performance Fortran Framework.
- Gross, O'Hallaron, et al.
- 1994
Citation Context: ...tional efficiency. Program generators can be application-specific (e.g. Apply (image processing) [21] and Assign (signal processing) [36]), or more general (e.g. the Fx parallelizing FORTRAN compiler [19]). iWarp systems communicate with the outside world through I/O nodes that are linked into the torus at the "edge" of the array. Figure 3 shows the example of a HIPPI interface connected to the iWarp ...
87 | Supporting systolic and memory communication in iWarp
- Borkar, Cohn, et al.
- 1990
Citation Context: ...y are configured as a torus. The communication system supports high-speed interprocessor communication for a variety of communication models, including systolic communication and memory communication [7]. In systolic communication, the CPU writes data directly onto the interconnect, thus minimizing communication latency. Memory communication is supported through the use of spools, on-chip DMA engines...
76 | Design and Evaluation of Primitives for Parallel I/O
- Bordawekar, et al.
- 1993
Citation Context: ...shuffling parallelizes very well, many links can be used at the same time. As a result, the distributed-memory system can reshuffle data efficiently. A similar approach has been proposed for disk I/O [5]. The creation of large messages inside the distributed-memory system is done by making the interaction with the network interface a two-step process. First, data is reshuffled from the distribution t...
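The two-step interaction described in this context can be sketched in miniature: data held block-cyclically is first reshuffled into per-node contiguous buffers, so that the network interface then sends a few large messages instead of many small blocks. This is a hypothetical, sequential illustration (names and sizes are ours), not code from the paper:

```python
def reshuffle_to_contiguous(array, block_size, num_nodes):
    """Step 1: gather each node's cyclic blocks into one contiguous buffer."""
    buffers = [[] for _ in range(num_nodes)]
    for start in range(0, len(array), block_size):
        node = (start // block_size) % num_nodes   # cyclic block ownership
        buffers[node].extend(array[start:start + block_size])
    # Step 2 would hand each buffer to the network interface as one message.
    return buffers

print(reshuffle_to_contiguous(list(range(12)), 2, 3))
# -> [[0, 1, 6, 7], [2, 3, 8, 9], [4, 5, 10, 11]]
```

Because each node's blocks go to disjoint buffers, the per-node gathers can proceed in parallel across the links of the interconnect, which is the property the excerpt relies on.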
58 | Interprocessor Collective Communication Library (InterCom).
- Barnett, Gupta, et al.
- 1994
Citation Context: ...arity of the distributions to achieve good reshuffling performance; this is important, given our performance goals. Note that the data reorganization is in effect a collective communication operation [18, 4]. 9 Applicability to Other Systems We have presented a description of the architecture and implementation of a high-bandwidth network interface for the iWarp system. In this section we examine how the ...
56 | Multiprocessor File System Interfaces
- Kotz
- 1993
Citation Context: ...ificant hurdle when doing I/O over the HIPPI interface, and how it limits the throughput. Data reorganization has been proposed to achieve high bandwidth access to disks in distributed-memory systems [5, 29, 38]. For example, in [5] the authors study the problem of implementing high-speed file I/O in the Intel Touchstone Delta. They observed that in order to achieve good performance it is important to send l...
53 | A Systematic Approach to Host Interface Design for High-Speed Networks
- Steenkiste
- 1994
Citation Context: ...tized over larger packets and because these operations make heavy use of a critical resource: the memory bus. The key to making these operations efficient is to streamline the flow of data during I/O [41], so that the number of times that the data is touched is minimized. Example optimizations include the elimination of redundant data copy operations and the calculation of the checksum while data is b...
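One concrete instance of the streamlining mentioned here is folding the Internet checksum into the copy loop, so the bytes cross the memory bus once instead of twice. A hypothetical sketch (in Python for clarity; in practice the fused pass happens in C or in hardware):

```python
def copy_and_checksum(src):
    """Copy src while accumulating the 16-bit ones-complement Internet checksum."""
    dst = bytearray(len(src))
    total = 0
    padded = src if len(src) % 2 == 0 else src + b"\x00"  # pad odd length
    for i in range(0, len(padded), 2):
        dst[i:i + 2] = src[i:i + 2]             # copy as we go: one data pass
        total += (padded[i] << 8) | padded[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold in the end-around carry
    return bytes(dst), (~total) & 0xFFFF

data, ck = copy_and_checksum(b"\x00\x00")
print(hex(ck))  # -> 0xffff (checksum of all-zero data)
```

A separate copy pass followed by a separate checksum pass would touch every byte twice; the fused loop is the "touch the data once" optimization the excerpt describes.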
47 | Survey of traffic control schemes and protocols
- Bae, Suda
- 1991
Citation Context: ...uses more bandwidth. This anomaly could be avoided on an interconnect that supports the reservation of bandwidth for critical connections, as is for example being explored in the networking community [9, 3]. Figure 16 gives the breakdown of the time spent during the transfer to the HIB. The times are for a node in the last row of the iWarp system. The iWarp node performs the reshuffling and then waits f...
39 | Software Support for Outboard Buffering and Checksumming
- Kleinpaste, Steenkiste, et al.
- 1995
Citation Context: ...ol processing, while the CAB provides support for per-byte operations: data transfer, checksumming and buffering. The CAB architecture used on the HIB is similar to the Gigabit Nectar workstation CAB [42, 28], which provides support for per-byte operations for network communication on workstations. The operation of the network interface is similar to that of a sequential system, except that the data sourc...
32 | The Touchstone 30-Gigaflop DELTA prototype
- Lillevik
- 1991
Citation Context: ...peed network standards in various stages of development by standards bodies. These include ATM (Asynchronous Transfer Mode) [16] and Fibre Channel [26]. Meanwhile, distributed-memory computer systems [6, 23, 27, 32, 44] are becoming the architecture of choice for many supercomputer applications. The reason is that they are inherently scalable, and provide relatively inexpensive computing cycles compared with traditi...
29 | Asynchronous Transfer Mode
- Prycker
- 1991
Citation Context: ...econd or 1.6 Gbit/second. In addition to HIPPI, there are a number of high-speed network standards in various stages of development by standards bodies. These include ATM (Asynchronous Transfer Mode) [16] and Fibre Channel [26]. Meanwhile, distributed-memory computer systems [6, 23, 27, 32, 44] are becoming the architecture of choice for many supercomputer applications. The reason is that they are inh...
25 | A host interface architecture for high-speed networks
- Steenkiste, Zill, et al.
- 1992
Citation Context: ...ol processing, while the CAB provides support for per-byte operations: data transfer, checksumming and buffering. The CAB architecture used on the HIB is similar to the Gigabit Nectar workstation CAB [42, 28], which provides support for per-byte operations for network communication on workstations. The operation of the network interface is similar to that of a sequential system, except that the data sourc...
24 | Cray T3D System Architecture Overview
- Adams
- 1993
Citation Context: ...rol unit of the stream manager. The data interface will have to be reimplemented using the native communication interface for the system, e.g. Nx on the Paragon [25] or remote put/get on the Cray T3D [1]. Using a different communication interface will also affect the implementation of the execution module since it optimizes data transfers. Finally, the system software of most distributed-memory syste...
23 | Latency and Bandwidth Considerations in Parallel Robotics Image Processing
- Webb
- 1993
Citation Context: ...on-specific data distributions. 7.3 Stereo-Vision In the stereo-vision application developed at CMU by Jon Webb, multi-baseline video images from four cameras are correlated to generate a depth image [45]. As part of that application, a "digital VCR" was implemented to display the four images on a framebuffer. The requirements called for real-time display along with the additional computations in the ...
21 | Physical schemas for large multidimensional arrays in scientific computing applications
- Seamons
- 1994
Citation Context: ...ificant hurdle when doing I/O over the HIPPI interface, and how it limits the throughput. Data reorganization has been proposed to achieve high bandwidth access to disks in distributed-memory systems [5, 29, 38]. For example, in [5] the authors study the problem of implementing high-speed file I/O in the Intel Touchstone Delta. They observed that in order to achieve good performance it is important to send l...
17 | The parallel protocol engine
- Kaiserswerth
- 1993
Citation Context: ...sequential. There is potential parallelism between transmit and receive processing (if one is transmitting and receiving at the same time), and ACK and data processing can sometimes proceed in parallel [37], but overall, useful parallelism is limited. For these reasons, it is desirable to have protocol processing performed in a central location, i.e. the network interface. A number of distributed system...
17 | Analyzing communication latency using the Nectar communication processor
- Steenkiste
- 1992
Citation Context: ...o provide a simpler network interface and to minimize the amount of work that is assigned to it by performing some of the communication tasks on the distributed-memory system itself. Earlier research [40] shows that the time spent on sending and receiving network data is distributed over several operations such as copying data, buffer management, protocol processing, and interrupt handling, and differ...
16 | The Assign parallel program generator
- O’Hallaron
- 1991
Citation Context: ...on and computation concurrently on individual cells to achieve additional efficiency. Program generators can be application-specific (e.g. Apply (image processing) [21] and Assign (signal processing) [36]), or more general (e.g. the Fx parallelizing FORTRAN compiler [19]). iWarp systems communicate with the outside world through I/O nodes that are linked into the torus at the "edge" of the array. Figu...
13 | Planning under uncertainty using parallel computing, Annals of Operations Research
- Dantzig
- 1988
Citation Context: ...tion Application plant, using a stochastic model [12] of the system. The stochastic optimization problem can be transformed into a deterministic problem using the certainty equivalence transformation [15]. Our implementation of the stochastic optimization problem [13] was distributed across a heterogeneous system: the Intel iWarp at the CMU campus and the Cray C-90 and TMC CM2 at the Pittsburgh Supercom...
10 | Low-level vision on Warp and the Apply programming model
- Harney, Webb, et al.
- 1987
Citation Context: ...n the system, performing communication and computation concurrently on individual cells to achieve additional efficiency. Program generators can be application-specific (e.g. Apply (image processing) [21] and Assign (signal processing) [36]), or more general (e.g. the Fx parallelizing FORTRAN compiler [19]). iWarp systems communicate with the outside world through I/O nodes that are linked into the to...
9 | Spiral K-space MRI of cortical activation
- Noll, Cohen, et al.
- 1995
Citation Context: ...the brain. The current application represents the first step in this process: obtaining reconstructed, processed and rendered images based on Magnetic Resonance Imaging (MRI) data in a timely manner [34]. Our implementation is mapped on three architectures: the iWarp system at CMU performs pixel classification (brain/non-...
8 | High Speed Networking at Cray Research
- Nicholson
- 1991
Citation Context: ...processor or shared-memory multiprocessor supercomputers. However, while traditional sequential or shared-memory supercomputers such as the Cray have been able to make good use of the HIPPI bandwidth [33], distributed-memory machines have been much less successful. The network interfaces of distributed-memory machines often have low sustained bandwidth, do not perform network protocol processing, or m...
7 | Distributing a chemical process optimization application over a gigabit network
- Clay, Steenkiste
- 1995
Citation Context: ...tem. The stochastic optimization problem can be transformed into a deterministic problem using the certainty equivalence transformation [15]. Our implementation of the stochastic optimization problem [13] was distributed across a heterogeneous system: the Intel iWarp at the CMU campus and the Cray C-90 and TMC CM2 at the Pittsburgh Supercomputer Center (PSC). Specifically, the iWarp is used to generate ...
7 | Programmed Communication Service Tool Chain User’s Guide. Carnegie Mellon University, release 2.8 edition
- Hinrichs
- 1991
Citation Context: ...eams implementation for iWarp. 5.3.1 Data and control interface The data interface between the distributed-memory system and the network interface is based on the PCS and ESPL communication libraries [24, 7]. PCS is used to create application-specific connections, and ESPL is a fast spooling library that achieves bandwidths close to the 40 MByte/second link rate, even for short messages. To support stripi...
5 | Data reshuffling in support of fast I/O for distributed-memory machines
- Bornstein, Steenkiste
- 1994
Citation Context: ...s architecture does not imply that each application has to provide the code to transfer data to and from the network interface. For example, libraries can be built for common data distributions (e.g. [8]). The components interact through a data interface and a control interface. The data interface transfers data between the network interface and the application on the distributed-memory system. Two...
5 | Architecture implications of high-speed I/O for distributed- memory computers
- Gross, Steenkiste
- 1994
Citation Context: ...ributed over the private memories of the nodes. This means that the communication software has to perform scatter and gather operations to collect or distribute the data that makes up the data stream [20]. In networking terms, this is an architecture-specific data transformation that is part of the presentation layer. The three processing tasks that are hard to implement efficiently for distributed-me...
5 | A programmable HIPPI interface for a graphics supercomputer
- Singh, Tell, et al.
- 1993
Citation Context: ...allelism is limited. For these reasons, it is desirable to have protocol processing performed in a central location, i.e. the network interface. A number of distributed systems use a similar approach [39]. One protocol processing task that does parallelize well is the checksum calculation for the Internet protocols, and it could be performed efficiently on the distributed-memory system. Unfortunately,...
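The reason the Internet checksum parallelizes well, as this context notes, is that the 16-bit ones-complement sum is associative: each node can sum its own piece, and the partial sums are folded together afterward. A hypothetical sketch (restricted to even-length pieces to avoid byte-alignment bookkeeping):

```python
def partial_sum(data):
    """16-bit ones-complement sum of an even-length byte string."""
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return total

def combine(partials):
    """Fold per-node partial sums into the final Internet checksum."""
    total = 0
    for p in partials:
        total += p
        total = (total & 0xFFFF) + (total >> 16)
    return (~total) & 0xFFFF

msg = b"\x12\x34\x56\x78\x9a\xbc"
whole = combine([partial_sum(msg)])
split = combine([partial_sum(msg[:2]), partial_sum(msg[2:])])
assert whole == split  # same checksum whether summed whole or in pieces
```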
5 | Architecture-Independent Global Image Processing
- Webb
- 1990
Citation Context: ...stems have very diverse data distributions. Figure 12 shows some of the data mappings that are used by iWarp applications. The row-swath partitioning is used by the Adapt image processing environment [46]. The coarse-grain block partitioning is used by several image processing applications and fine-grain block partitioning is used in the iWarp implementation of the LAPACK library [30]. These three exa...
4 | Processing Element Design for a Parallel Computer
- Kaneko, Nakajima, et al.
- 1990
Citation Context: ...peed network standards in various stages of development by standards bodies. These include ATM (Asynchronous Transfer Mode) [16] and Fibre Channel [26]. Meanwhile, distributed-memory computer systems [6, 23, 27, 32, 44] are becoming the architecture of choice for many supercomputer applications. The reason is that they are inherently scalable, and provide relatively inexpensive computing cycles compared with traditi...
4 | Los Alamos multiple crossbar network crossbar interfaces
- 1992
Citation Context: ...arch projects working on high-speed network I/O for distributed-memory systems have been described in the literature. A group at Los Alamos National Labs has developed the CrossBar Interconnect (CBI) [43]. The CBI is an outboard protocol processor that performs protocol processing for supercomputers connected to HIPPI networks. It has two full-duplex HIPPI connections: one to the supercomputer and on...
3 | Parallel Fourier inversion by the scan-line method
- Noll, Webb, et al.
- 1995
Citation Context: ...ee architectures: the iWarp system at CMU performs pixel classification (brain/non-brain) and surface triangularization [35], the C-90 at PSC performs scalar processing, and the Intel Paragon at CMU performs the rendering. The input to the iWarp component of the application consists of 52 MRI slices with pixel values repre...
2 | Scheduling in the Presence of Uncertainty: Probabilistic Solution of the Assignment Problem
- Clay
- 1991
Citation Context: ...manifold representing a plant. The generated data is then sent to the C-90 at PSC for analysis, and finally, the C90 and CM2 solve the resulting linear assignment problem using a heterogeneous solver [11]. We collected data for several application runs, corresponding to input sizes of 1k, 2k and 4k; the samples generated by iWarp in these runs correspond to 256 MB, 1 GB and 4 GB of data. The program d...
2 | Solution of large-scale modeling and optimization problems using heterogeneous supercomputing systems
- Clay, McRae
- 1991
Citation Context: ...ring Department at CMU. The application optimizes a system, for example a chemical plant, using a stochastic model [12] of the system. The stochastic optimization problem can be transformed into a deterministic problem using the certainty equivalence transformation [15]. Our implementation of the stochastic optimizati...
2 | Supercomputing with Transputers - Past, Present and Future
- Hey
- 1990
Citation Context: ...peed network standards in various stages of development by standards bodies. These include ATM (Asynchronous Transfer Mode) [16] and Fibre Channel [26]. Meanwhile, distributed-memory computer systems [6, 23, 27, 32, 44] are becoming the architecture of choice for many supercomputer applications. The reason is that they are inherently scalable, and provide relatively inexpensive computing cycles compared with traditi...
2 | Experiments with a Gigabit Neuroscience Application on the CM-2
- Kwan, Terstriep
- 1993
Citation Context: ...very different data and program structures on the distributed-memory system. Several projects have looked at the difficulty introduced by the data distribution and representation. Kwan and Terstriep [31] show how the data distribution and data representation on the CM2 is a significant hurdle when doing I/O over the HIPPI interface, and how it limits the throughput. Data reorganization has been propo...
1 | A new approach to automatic parallelization of blocked linear algebra computations
- Kung, Subhlok
- 1991
Citation Context: ...sing environment [46]. The coarse-grain block partitioning is used by several image processing applications and fine-grain block partitioning is used in the iWarp implementation of the LAPACK library [30]. These three examples are instances of block-cyclic partitionings: the data set is divided in blocks, which are distributed in a cyclic fashion across either the rows, or the rows and columns of the ...