| Cray Research, Inc. CRAY T3D System Architecture Overview, hr-04033 edition, Sept. 1993. |
....of the received packets to replying packets for status reports which indicate cancellations. The method, in which a replying packet includes the content of the received packet with a fifo entry command when it has been cancelled, is called the return to sender method, and is widely used [2, 8, 60]. In the MBCF FIFOe scheme, the return to sender method is extended to keep the order of point to point packets intact. [Figure: MBCF FIFOe packet fields and FIFO pointers] ....
....this high performance and will attain lower latencies of communication than hardware support implementations with embedded processors, such as MBP2P. The hardware support solution is still effective in terms of relieving main processors of the load of communications processing. 8.6.1 T3D The T3D [8] is one of the earliest distributed memory parallel computers with a logical address based remote DMA mechanism. The T3D has a dedicated network in which there is no packet loss. The first MBP was also able to perform similar logical address based remote memory accesses, and the MBP is an earlier ....
Cray Research, Inc. Cray T3D System Architecture Overview. Cray Research, Inc., March 1993.
....the cost of limiting message length to the number of cluster registers. Most messages fit easily in this size and larger messages can be packetized and reassembled with very low overhead. Automatic translation of virtual processor numbers to physical processor identifiers is used in the Cray T3D [7]. The use of virtual addresses as message destinations in the M Machine has two advantages. When combined with translation hardware, it provides protection for user initiated messages, without incurring the overhead of operating system invocation, as messages may not be sent to processors mapped ....
CRAY RESEARCH, INC. Cray T3D System Architecture Overview. Chippewa Falls, WI, 1993.
....this thesis makes in addressing these challenges. Finally, I outline the structure of the rest of this document. 1.1 Challenges 1.1.1 A Class of Parallel Architectures The class of parallel computer architectures addressed in this thesis is represented by machines such as the Cray T3D [59] and T3E [62] and the NEC SX series of vector supercomputers. This class has several defining characteristics. Massively Parallel Processors (MPP) These machines are massively parallel; they are scalable to hundreds or thousands of processors. Distributed, Shared Memory (DSM) Each ....
....resources In this section I discuss mechanisms for naming distributed objects in NUMA DSMs. Such mechanisms have been provided at the hardware, library, and programming language level, and I discuss each in turn. 2.1.1 Hardware support In conventional NUMA DSM architectures such as the Cray T3D [59] and T3E [62] the NEC SX series, etc. node manages its own local memory. Processes running on multiple nodes can conspire to allocate the same local memory locations to a distributed object; the name of the common location thus becomes the name of the distributed object. This approach requires ....
C. Research. Cray T3D system architecture overview, 1993.
....this thesis makes in addressing these challenges. Finally, I outline the structure of the rest of this document. 1.1 Challenges 1.1.1 A Class of Parallel Architectures The class of parallel computer architectures addressed in this thesis is represented by machines such as the Cray T3D [59] and T3E [62] and the NEC SX series of vector supercomputers. This class has several defining characteristics. Massively Parallel Processors (MPP) These machines are massively parallel; they are scalable to hundreds or thousands of processors. Distributed, Shared Memory (DSM) Each ....
....resources In this section I discuss mechanisms for naming distributed objects in NUMA DSMs. Such mechanisms have been provided at the hardware, library, and programming language level, and I discuss each in turn. 2.1.1 Hardware support In conventional NUMA DSM architectures such as the Cray T3D [59] and T3E [62] the NEC SX series, etc. node manages its own local memory. Processes running on multiple nodes can conspire to allocate the same local memory locations to a distributed object; the name of the common location thus becomes the name of the distributed object. This approach requires ....
C. Research. Cray T3D system architecture overview, 1993.
....its inherent advantages like low latency communication and reduced communication hardware overhead [13]. In addition to the basic wormhole routing switching, systems are gradually incorporating multiple communication ports and routing schemes with varying adaptivity. Intel Paragon [10], Cray T3D [4], and Stanford DASH [11] are some early representative systems in this trend. These systems provide low latency communication when the traffic in the system is low. However, with increase in communication traffic, messages undergo severe link contention and the system starts performing poorly. ....
Cray Research, Inc., Cray T3D System Architecture Overview, 1993.
....on the remote node needed to fulfill a get request are shown. Figure 1(a) pictures the actions of a hardware supported get implementation. As can be seen, the NIC directly accesses the physical memory to service the request, bypassing the host CPU. Variations on this approach are implemented in [6, 8, 9, 5]. In Figure 1(b) the actions required by the interrupt driven get implementation are illustrated. After the request message arrives at the NIC, it interrupts the processor. An interrupt handler is then invoked on one of the host processors which then builds a response message by copying the ....
Cray Research, Inc., Eagan, MN. Cray T3D System Architecture Overview, March 1993.
.... register (e.g. to register 0, as in the PA RISC architectures [15]) or provide programmable prefetch engines [6] or programmable stream buffers [19]. Hardware only prefetching [2, 9, 12, 14, 29] thus has the advantage of being transparent, and some commercial machines include such mechanisms [5, 7, 28]. However, due to its speculative nature, care must be taken to keep from lowering application performance by increasing contention in the caches and wasting bus bandwidth on useless prefetches. Most prefetching research in the literature focuses on fetching data structures with regular access ....
Cray Research, Inc. CRAY T3D System Architecture Overview, hr-04033 edition, Sept. 1993.
....wide set variants. Also, we can easily implement and analyze new routing algorithms by simply defining the routing function. 2.2 The node internal structure The past few years have seen the introduction of many different types of parallel architectures. Many commercial machines, such as the Cray T3D [7] and the Intel Paragon [16], are multicomputers that have nodes with multiple processors and adopt low dimensional wormhole-routed cubes as their interconnection networks. Some nodes are specialized to handle parallel I/O. Other important research prototypes, such as the Stanford FLASH, provide a ....
Cray Research Inc. Cray T3D System Architecture Overview, 1 th edition, September 1993.
....the requests and the other to the replies, in order to avoid deadlocks caused by the coherency protocols. Other important academic prototypes that use low dimensional cubes are Alewife [11] and the J machine [12]. This list also includes many of the most popular commercial machines. The Cray T3D [13] and T3E [14] use a three-dimensional cube [13] and the topology of both the Intel Delta and Paragon is a bi-dimensional cube [15]. A fair comparison of the communication performance of these machines is not an easy task because they all have different technological characteristics. On the other ....
....order to avoid deadlocks caused by the coherency protocols. Other important academic prototypes that use low dimensional cubes are Alewife [11] and the J machine [12]. This list also includes many of the most popular commercial machines. The Cray T3D [13] and T3E [14] use a three-dimensional cube [13] and the topology of both the Intel Delta and Paragon is a bi-dimensional cube [15]. A fair comparison of the communication performance of these machines is not an easy task because they all have different technological characteristics. On the other hand, theoretical models of the interconnection ....
Cray Research Inc., Cray T3D System Architecture Overview, 1 th ed., September 1993.
....[76] and the Quadrics Alenia. The MIMD (multiple instruction streams, multiple data streams) model allows processors to execute different instruction streams, and is therefore more flexible. Many general purpose commercial parallel machines introduced in the last years are based on the MIMD model [31], [88], [153], [110], [69], [143]; these include several commercial hypercube and mesh multicomputers [46] and the Kendall Square KSR 1 [24], [133], [136]. The Connection Machine CM 5 [100] combines features of both MIMD and SIMD models. MIMD architectures can be further classified in ....
....systems for a long time, for example in the C.mmp system [162]. Crossbar networks are also the basic building blocks used to form other dynamic networks such as multistage networks. Crossbars find application even within the individual nodes of static networks such as hypercubes and meshes [153], [31], where they facilitate fast routing of data through intermediate nodes. A conventional two sided crossbar consists of a set of input lines and a set of output lines, logically placed perpendicular to each other, with switches placed at each point where the lines cross, as shown in figure 2.4. ....
[Article contains additional citation context not shown here]
Cray Research Inc. Cray T3D System Architecture Overview, 1 th edition, September 1993.
....system is reliable and sender nodes have no copies for transmitted packets, this MBCF FIFOe method can be implemented by copying contents of the requesting packets into cancellation reports. It shows that this flow control scheme is an extension of the return to sender method, which is widely used [9, 10, 11], to keep orders of point to point packets intact. 4.1.7 Memory Based SIGNAL (MBCF SIGNAL) MBCF SIGNAL command is accompanied with a remote invocation of the user specified program in the privilege of the target task. The invocation mechanism of the MBCF SIGNAL is similar to that of UNIX's ....
Cray Research, Inc.: Cray T3D System Architecture Overview. (March 1993).
....part of the proposed Message Passing Interface standard [41]. The main characteristic of these techniques is that they separate interprocessor data transfers from producer consumer synchronization. A number of (physically) distributed memory machines such as the Fujitsu AP1000 [26], the Cray T3D [12], the Cray T3E [47] and the Meiko CS 2 [8] already offer efficient low level remote memory access (RMA) primitives which provide a processor with the capability of accessing the memory of another processor without the direct involvement of the latter. To preserve the original semantics, however, ....
....in these parameters has resulted in substantial savings in execution times on the Cray T3E multiprocessor machine. Our approach argues for separation of data transfer and synchronization and for optimizing each of them using data flow analysis techniques. Several machines such as the Cray T3D [12], the Cray T3E [47], the Fujitsu AP1000 [26] and the Meiko CS 2 [8] offer remote memory access primitives that allow efficient implementation of the Put Synch primitives. In addition, one way communication is a key part of the proposed extensions to the Message Passing Interface standard [41] ....
Cray Research Inc. Cray T3D system architecture overview. 1993.
....environment. Much recent research has therefore focussed on extending parallel processing solutions to such networks of commodity workstations (NOWs). Traditionally, parallel processing machines have been built with processing nodes interconnected in regular topologies such as a mesh [12], torus [5], hypercube [11], multistage interconnection network (MIN) [22, 40], etc. Such regular topologies have important mathematical properties that make message communication easier/better by making message routing simpler, lowering the average distance per communication, and/or increasing the bisection ....
....with increase in n p . Thus, this table requires less than O(nDn p ) memory. A method for constructing k binomial trees with minimized contention on irregular switch based networks has also been proposed in the literature [16]. [Figure 5: Examples of k binomial trees on a multicast set size of 16: (a) the 3 binomial tree, and (b) the 4 binomial tree.] 3.4.2 Multicasting using Switch Support Another method for improving multicast performance is to ....
Cray Research, Inc. Cray T3D System Architecture Overview, 1993.
....computing. Much recent research has therefore focussed on extending parallel processing solutions to such networks of commodity workstations (NOWs). Traditionally, parallel processing machines have been built with processing nodes interconnected in regular topologies such as a mesh [14], torus [7], hypercube [13], multistage interconnection network (MIN) [26, 47], etc. Such regular topologies have important mathematical properties that make message communication easier/better by making message routing simpler, lowering the average distance per communication, and/or increasing the bisection ....
....can be modeled as m pipelined multicasts of 1 packet each. First, the delay between the arrival of consecutive packets of a multicast at a node is calculated. Then, it is shown that this interval is dependent only on the number of children the root (of the multicast tree) has. [Figure 8: The break up of a 3 packet multicast over 7 destinations using a binomial multicast tree.] Thus, a pipelined model of an m packet multicast ....
Cray Research, Inc. Cray T3D System Architecture Overview, 1993.
....irregular network model, in which each port of a switch may be connected to a processor cluster also. By processor cluster, we mean a multiprocessor with a regular interconnection topology, such as a bus based shared memory multiprocessor, a mesh connected scalable parallel computer like Cray T3D [6], or a multicomputer interconnected by a multistage interconnection network like SP2 [7]. In networks of workstations and processor clusters, a multicast usually needs two steps: 1) the source node sends the multicast message to the destinations connected to a port of a switch directly and to ....
Cray Research, Inc. Cray T3D System Architecture Overview, 1993.
....N.C. 27708, U.S.A, sandeep cs.duke.edu] A preliminary version of this paper appeared in Int'l Parallel Processing Symp. 1995 [19]. as hypercubes [1]. Examples of machines with such topologies include the MasPar MP 1 [3], Intel Paragon, MIT J Machine [6], Tera HORIZON [17], Cray T3D [4, 13], and Polymorphic Torus [9]. A torus is a mesh with wrap around links. Although meshes and tori are generally regarded as close families, there are still some distinctions: i) As opposed to meshes, all nodes of a torus are topologically symmetric, ii) a torus has a smaller (about half) diameter ....
....but different messages to be delivered to each of the remaining P - 1 processors. In this paper, complete exchange is achieved by a sequence of phases, where a phase consists of a subset of nodes communicating in a contention free manner. On parallel machines such as CM 5 [10] and Cray T3D [4, 13] which provide a dedicated network for barrier synchronization, these phases can be executed in a lockstep manner, one after another. However on machines which do not provide such mechanism, either phases could be separated by software barrier or they could be executed asynchronously. We ....
Cray Research, Inc. Cray T3D System Architecture Overview, 1993.
....of parallel systems using such processor clusters as building blocks instead of single processors, thus allowing a modular and hierarchical approach to building large systems. Prominent examples of processor-cluster based systems are the Stanford DASH [8], Intel Paragon [11] and Cray T3D [5]. Typically, the interconnections connecting the processor clusters of these systems are scalable meshes, tori, or multistage networks. This is referred to as the inter cluster network. The interconnection inside a processor cluster is referred to as the intra cluster topology. Current ....
....2 Processor cluster based systems Computing nodes having more than one processor on a single multi chip module or processor board are becoming increasingly available. In the recent past many parallel multiprocessor systems have been developed using such processor clusters, e.g. the CRAY T3D [5], Intel Paragon [11] and the Stanford DASH [8]. Most of these systems are two level architectures as shown in Figure 1. The processor clusters are interconnected through a scalable inter cluster network, e.g. 3D torus, 2D mesh. The cluster configuration can vary from a simple star connection as ....
[Article contains additional citation context not shown here]
Cray Research Inc. Cray T3D System Architecture Overview, 1993.
....Support A major factor which affects the performance of the CCDP scheme is the effectiveness of the hardware support for prefetching on the system. The CCDP scheme can make use of the prefetch hardware on existing systems. In fact, we have previously implemented the scheme on the Cray T3D [8], which is a large scale DSM system with a shared address space and simple system level prefetch hardware support. However, the conventional prefetch hardware of the Cray T3D is not optimized to handle the two types of data prefetching operations used by the CCDP scheme. Although our Cray T3D ....
....each processor of a large scale non cache coherent DSM multiprocessor. 3.1 Architectural Model In this paper, we assume a large scale non cache coherent DSM multiprocessor. The processing nodes are interconnected by a high bandwidth network. This architecture is similar to that of the Cray T3D [8]. The organization of a node is shown in Figure 1. Each node consists of a processor, a data cache and its associated DCPFU, a local memory module, and various support circuitry. The collection of the local memory modules in all of the nodes form a logically shared global memory. The support ....
[Article contains additional citation context not shown here]
Cray Research, Inc. Cray T3D System Architecture Overview, March 1993.
No context found.
Cray Research, Inc. CRAY T3D System Architecture Overview, hr-04033 edition, Sept. 1993.
No context found.
Cray Research, Inc. Cray T3D System Architecture Overview, Technical Report, Sept. 1993.
No context found.
Cray Research, Inc., Eagan, MN. Cray T3D System Architecture Overview, March 1993.
No context found.
Cray Research, Inc. Cray T3D System Architecture Overview. Cray Research, Inc., March 1993.
No context found.
Cray Research, Inc. Cray T3D System Architecture Overview, 1993.
No context found.
Cray Research, Inc. Cray T3D System Architecture Overview, 1993.
No context found.
Cray Research. Cray T3D System Architecture Overview. Cray Research Inc., Tech. Rep. HR04033, 1993.