Intel Corporation. Paragon XP/S product overview, 1991.
....system or run-time system involvement in message construction, packetization, and protocol processing affects the cache. One way to alleviate this problem is to add a messaging processor. Two examples of systems that follow this approach are the Paragon XP/S and the EDS. The Paragon XP/S [23] system tries to minimize cache pollution by adding a second processor (with a separate on-chip cache) specifically to handle messaging operations (Figure 2.7). Thus, the compute processor's cache is largely unaffected by communication; only the messaging processor's cache suffers. Memory ....
Intel Corporation. Paragon XP/S Product Overview, 1991.
....networks of parallel systems is the unique restriction on the scheduler imposed by wormhole switching [3], a popular technique used in these networks. Wormhole switching and its variations are widely used in a variety of parallel systems and, more recently, in system area networks [4], [5], [6], [7], [8], [9]. Wormhole switching is distinguished by the fact that the granularity of flow control in the network can be smaller than a packet. This unit of flow control is called a flit. In order to not add to the per-flit overhead, only the head flit (the first flit) of a packet contains ....
Intel Corporation, Paragon XP/S Product Overview. 1991.
....networks, both direct and indirect, are constructed out of switches connected together in a certain topology. Wormhole switching is a popular switching technique used in the implementation of switches in interconnection networks for parallel systems, and more recently, in system area networks [1-3, 10, 16, 18]. This paper considers the problem of fair and efficient scheduling of packets in wormhole networks for parallel systems. Even though the solution presented in this paper is designed for the unique requirements of wormhole switches, it may also be applied to wide area networks such as the ....
Intel Corporation. Paragon XP/S Product Overview, 1991.
....cache. That is, they directly involve the cache in message sends and/or receives. An alternative design is to logically connect the network to primary memory, using DMA transfers for interprocessor communication. Such an approach is utilized by a wealth of machines, for instance the Intel Paragon [10], Cray T3D [6], MIT Alewife [11], Bull ECRC ICL Siemens EDS [17], Meiko CS-2 [9], and Caltech Mosaic C [13], as well as systems designed around the Inmos T9000 Transputer [12]. The tradeoff between the two interfaces is, at the highest level, parallel performance versus sequential performance. ....
Intel Corporation. Paragon XP/S Product Overview, 1991.
....multiprocessors (DMMs) because of its simplicity, regularity, and suitability for VLSI implementation. DMMs can be based either on message-passing or shared-memory paradigms. There are quite a few 2D-mesh-based commercial or prototype systems built recently. Typical examples are the Intel Paragon [1] for a message-passing system and the Stanford DASH [2] for a shared-memory system. The Cray T3D [3] is also an example of a distributed shared-memory multiprocessor that employs a 3D mesh as an interconnection network. In a multitasking DMM (both shared and distributed memory), the operating ....
....and distributed memory), the operating system not only specifies which job will be executed next, but it also must specify which processors would execute the job. In the current implementations, while the processor allocation has been an integral part of the message-passing operating system [1, 4], it has remained an optional feature in the distributed shared-memory implementation [5]. It is well known that inefficient allocation of processors and data in a distributed shared-memory system causes increased memory access overhead due to the differences in local and remote memory access ....
[Article contains additional citation context not shown here]
Intel, Paragon XP/S: Product Overview, Intel Corporation, 1991.
....and network throughput. In this paper we evaluate the performance of a 2D torus network with wormhole routing and virtual channel flow control in shared-memory multiprocessors. We selected a 2D torus network with bidirectional links for our performance study, because it is a popular topology [5, 6, 7, 8]. Also, mesh networks without end-around connections have significant performance degradations at the boundary nodes, even under uniform communication [6, 9]. The performance of wormhole networks with virtual channels has been evaluated in various studies [4, 7, 10] in a message-passing ....
Intel Corporation. Paragon XP/S - Product Overview, 1991.
....effectively for parallel three-dimensional graphics rendering, there must be mechanisms to rapidly composite and transfer images from processor nodes to a display device. On today's multicomputers, frame buffers are normally attached through I/O channels, and HiPPI frame buffers are commonly used [6, 7, 13, 17, 21]. The bandwidth available to the frame buffer thus is limited by the HiPPI channel itself, as ....
Intel Corporation. Paragon XP/S Product Overview. Intel Corporation, 1991.
....An interconnection network is characterized by its topology, routing, and flow control [4]. Network performance is also affected by the rate at which messages are injected into and consumed from the network. Recently proposed wormhole routing [4] is becoming the trend to build large-scale systems [9]. Commercially available network topologies are meshes and hypercubes that fall into the class of generalized k-ary n-cubes (without wraparounds), where k is the number of nodes along each dimension and n is the number of dimensions. The maximum sustainable message injection/consumption rate is ....
Intel Corporation. Paragon XP/S Product Overview, 1991.
....for a parallel file system. Among the possible network choices, HIPPI is currently the most popular one, and most manufacturers of distributed memory parallel systems either provide or have announced a HIPPI connection (e.g., CM-2 [20], CM-5 [7], iPSC/860 [12], NCube2 [14], iWarp [4], Paragon XP/S [11], Maspar [2, 15]). As far as an application on the parallel system is concerned, the exact characteristics of the external links do not matter, and the I/O node provides an appropriate abstraction. We can think of the I/O nodes as establishing the periphery of the parallel system, although ....
Intel Corporation. Paragon XP/S Product Overview, March 1991.
....on task assignment. Below, we briefly review three common routing algorithms, with different adaptivity, proposed by the researchers. A routing is minimal if the path selected is one of the shortest paths between the source and destination processor. Deterministic or oblivious routing used in [10, 11, 15] defines a single path from a source to a destination node and thus has zero adaptivity. Such routing is simple to implement and deadlock free. However, it does not make effective use of all communication links in a system. Even with an optimal task processor assignment, some link contention ....
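As an illustration of the deterministic, minimal routing described in the excerpt above, the following Python sketch assumes a 2D mesh without wraparound links and uses dimension-order (XY) routing as the deterministic rule. The excerpt itself does not name a specific algorithm, and the function name `dimension_order_route` and the (x, y) coordinate convention are hypothetical.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (x, y) coordinates on the mesh; an assumed convention


def dimension_order_route(src: Node, dst: Node) -> List[Node]:
    """Return the single XY path from src to dst on a 2D mesh (no wraparound).

    The offset in X is corrected completely before any move in Y, so every
    source/destination pair has exactly one path (zero adaptivity).
    """
    x, y = src
    path = [src]
    # Phase 1: travel along the X dimension until the column matches.
    step = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step
        path.append((x, y))
    # Phase 2: travel along the Y dimension until the row matches.
    step = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(dimension_order_route((0, 0), (2, 1)))
```

The hop count of the returned path equals the Manhattan distance between the two nodes, so the route is minimal in the sense used in the excerpt; the fixed X-then-Y order is also what leaves some links unused under certain traffic patterns.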
....processor which is distance d away. The units of measurements for hcomp and hcomm need to be same in order to combine computation and communication cost directly. We assume that the host architecture supports concurrent computation and communication, which is common in present-day systems [10]. In this paper, without loss of generality, we assume that contention for injection and consumption channels are negligible. Systems are assumed to support a multiport communication model so that a processor can perform communication on all its incoming and outgoing links simultaneously. The ....
Intel Corporation, Paragon XP/S Product Overview, 1991.
....utilization of network channels difficult. Packetization, breaking large messages into sets of smaller messages for transmission, has been suggested as a technique for improving the performance of wormhole-routed networks. A number of commercial systems incorporate some form of packetization [13, 4, 8]. [Figure: latency versus load rate for message lengths 8, 16, 32, 64, 128, and 256; 256 nodes, 2D mesh 16x16, dimension-order routers, uniform traffic] ....
....how packetization affects the network service of other short messages. 3 Packetization and Dimension-order Routers. In this section, we investigate the effect of packetization on the performance of a dimension-order router. The routers modeled are quite similar to those in the Intel Paragon [4] and the Symult 2010 [13]. In the nonpacketizing network, the messages are transmitted in the traditional fashion. In the packetizing network, each message is broken into packets of 16, 32, or 64 flits and queued in the network input queue. This produces highly correlated traffic, so the resulting ....
Intel Corporation. Paragon XP/S product overview. Product Overview, 1991.
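The packetization step described in the excerpt above (each message broken into packets of 16, 32, or 64 flits) can be sketched in Python as follows. This is a minimal illustration only: the function name `packetize` is hypothetical, sizes are counted in flits, and per-packet header overhead is ignored, which the cited study may well model differently.

```python
from typing import List


def packetize(message_flits: int, packet_flits: int) -> List[int]:
    """Split a message of message_flits flits into packet sizes.

    Every packet carries packet_flits flits except possibly the last,
    which carries whatever remains. Header flits are not modeled here
    (an assumption made for this sketch).
    """
    if message_flits < 0 or packet_flits <= 0:
        raise ValueError("message_flits must be >= 0 and packet_flits > 0")
    full, rest = divmod(message_flits, packet_flits)
    return [packet_flits] * full + ([rest] if rest else [])


if __name__ == "__main__":
    # Packet sizes taken from the excerpt; the 512-flit message is illustrative.
    for size in (16, 32, 64):
        print(size, len(packetize(512, size)))
```

Because all packets of one message enter the input queue back to back, a smaller packet size yields more packets per message, which is one way to see why the excerpt describes the resulting traffic as highly correlated.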
....workloads. Finally, section 5 concludes this paper, summarizing our results and discussing possible future directions. 2 Background. High performance routing networks, the subject of significant study over the last ten years, are currently in widespread use in machines such as the Intel Paragon [11], Intel iWARP [5, 25], NCUBE 2 [23], and the MIT J machine [15, 16]. All of these multicomputer systems use direct networks, meaning that the computing nodes are embedded in the network topology, and as a result, some nodes are closer than others. In addition to use in multicomputers, direct ....
....latency distribution of short messages with and without long messages present. With long messages, we have a bimodal traffic load. Without, we have a uniform-sized traffic load, a familiar point of reference. Dimension-order routers without virtual lanes (such as those found in the Intel Paragon [11]) are used for both traffic loads. For this study, the long messages studied are 4% of the total number of messages and 512 flits long. The short messages in the bimodal and uniform-size traffic load are each 24 flits. The total load rate is fixed at 11% of maximum wire capacity for both cases, ....
Intel Corporation. Paragon XP/S Product Overview. Product Literature, 1991.
....represent a wide range of choices in density of interconnection. We also focus only on routers that use wormhole routing, a low-cost approach to flow control that allows small simple routers. Wormhole routers for k-ary n-cubes have been used in a variety of commercial and research machines [11, 27, 14, 2, 1, 24, 3]. Communication performance also depends critically on the routing algorithm used to map communications to hardware resources. Routing approaches can be divided into two categories: deterministic and adaptive routing. In deterministic routers, each message is routed along a fixed path, determined ....
....EMRC routing chips are also dimension-order, wormhole routers [31]. These chips are self-timed and use byte-wide channels to achieve 166 MB/s. The typical path formation latency for the head of a packet is approximately 30 ns. The Intel Paragon router is descended from the original Caltech MRCs [11, 18]. The Paragon router is a deterministic router and comparable to our designs, as it is implemented in a similar technology (0.8 micron CMOS gate array) and gives performance comparable to our designs. Published figures for its delay and channel bandwidth are 40 nanoseconds and 200 megabytes/second ....
Intel Corporation. Paragon XP/S product overview. Product Overview, 1991.
....used. Throughout this chapter, we use the network model described below. Network Topology. We basically study n-dimensional mesh networks. We focus on low (two, three, or four) dimensional mesh networks because they are currently in widespread use in many machines such as the Intel Paragon [14], Intel iWARP [8, 43], the MIT J machine [21, 24], Stanford DASH [35], and Tera Computer's TERA machine [5]. Unless stated otherwise, the two and four dimensional networks used contain 256 nodes. The size was determined for reasonable amounts of simulation time. For simplicity, we study networks ....
....distribution of short messages with and without long messages present. With long messages, we have a bimodal traffic load. Without them, we have a uniform-sized traffic load, a familiar point of reference. Dimension-order routers without virtual lanes (such as those found in the Intel Paragon [14]) are used for both traffic loads. For this study, the long messages studied are 4% of the total number of messages and 512 flits long. The short messages in both cases are 24 flits each. The total load rate is fixed at 11% of maximum wire capacity for both cases, equivalent to injection of one ....
Intel Corporation. Paragon XP/S Product Overview. Product Overview, 1991.
....iWarp applications perform output by sending data over the internal interconnect to the I/O node, which forwards it to the external device, such as a network or disk. Input follows the inverse path. This approach to I/O is very common, e.g. the NCube [22] and the Intel iPSC [32] and Paragon [25] machines follow the same approach. 4 Transport protocol processing. Protocol processing (e.g. TCP or UDP over IP) is one of the potential bottlenecks in network communication. In this section we describe how the iWarp HIPPI interface supports protocol processing. 4.1 High level design. While it ....
....be quite different. We expect to see changes in the implementation of the control and data interfaces, the execution module and the control unit of the stream manager. The data interface will have to be reimplemented using the native communication interface for the system, e.g. Nx on the Paragon [25] or remote put/get on the Cray T3D [1]. Using a different communication interface will also affect the implementation of the execution module since it optimizes data transfers. Finally, the system software of most distributed memory systems supports some form of multiprogramming. This will ....
Intel Corporation. Paragon XP/S Product Overview, March 1991.
....the benefits of adaptive routing against these significant costs in deciding whether or not to include adaptivity. 2 Background. High performance routing networks, the subject of significant study over the last fifteen years, are currently in widespread use in machines such as the Intel Paragon [9], Intel iWarp [7, 30], Ncube 2 [26], and the MIT J Machine [11, 13]. All of these multicomputer systems use direct networks, meaning that the computing nodes are embedded in the network topology, and as a result, some nodes are closer than others. In addition to use in multicomputers, direct ....
....output freed for subsequent communications. While no single router architecture is appropriate for all routing algorithms and possible design parameters, this basic architecture is attractive for a range of routing algorithms [14, 8, 28] and has in fact been used for a number of practical routers [14, 15, 29, 4, 31, 9]. Our canonical router architecture captures an attractive design for networks of modest dimension and modest numbers of virtual channels, providing a basis for direct comparisons amongst a set of routing algorithms. Beyond three or four dimensions and about four virtual channels, requiring a full ....
[Article contains additional citation context not shown here]
Intel Corporation. Paragon XP/S product overview. Product Overview, 1991.
No context found.
Intel Corporation. Paragon XP/S product overview, 1991.
No context found.
Intel Corporation. Paragon XP/S product overview, 1991.
No context found.
Intel Corporation, Paragon XP/S Product Overview (1991).
No context found.
Intel Corp., Paragon XP/S Product Overview. Intel Corp., 1991.
No context found.
Intel Corp. Paragon XP/S Product Overview, March 1991.
No context found.
Intel Corporation. Paragon XP/S Product Overview, 1991.
No context found.
Intel Corporation. Paragon XP/S Product Overview, 1991.