19 citations found.
Cray Research, Inc. CRAY T3D System Architecture Overview Manual. 1993.


This paper is cited in the following contexts:
RAMA: An easy-to-use, high-performance parallel file system - Miller, Katz (1997)   (2 citations)

....link was slower than 10 MB/s, hardly faster than a disk. On such a system, pseudo-random placement as done in RAMA would be a poor choice because it would place too high a load on the interprocessor links. However, interconnection networks have become faster; each processor in the Cray T3D [5] is connected to its neighbors by six links, each capable of transferring over 150 MB/s. The gap between network and disk speeds will only widen, since network technology is electronic while disk speeds are limited by mechanics. As Fig. 12 shows, the message traffic created by RAMA does not place ....

Cray Research, Inc. Cray T3D system architecture overview manual, Sept. 1993, Publication number HR-04033.


A Framework for Parallel Job Scheduling - Subramanian (1995)

....section correspond roughly to the CRAY T3D. Do not take these figures too precisely; we only wish to convey a feel for their relative order of magnitude. .... [Thi91a] and MasPar's MP-2 [Mas91]. Examples of MIMD machines are Thinking Machines' CM-5 [Thi91b, L 92] and Cray Research's T3D [Oed93, Cra93]. From this difference in instruction fetching, several other hardware differences follow as corollaries: Fine Grain vs Coarse Grain: In an SIMD machine, the back-end processors do not need to fetch and decode instructions. Therefore, they do not need a control unit or an instruction cache at ....

....if a new job arrives, existing jobs can be squeezed to free some PEs. We did not consider re-allocation because, typically, it is prohibitively expensive: massive amounts of data have to be rearranged between processors through a very slow router network. For example, consider a CRAY T3D [Cra93] where each PE has 8 Mwords of memory, and the router takes (optimistically) about 1 microsecond to deliver one word from each PE. Therefore a whole rearrangement of memory takes on the order of 8 seconds, which is a terribly long time for a machine with a 6.7 nanosecond clock. Unfortunately, ....

Cray Research, Inc., Eagan, MN. Cray T3D System Architecture Overview Manual, 1993.
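
The reallocation estimate in the Subramanian excerpt above can be checked with a few lines of arithmetic. This is only a sketch under two labeled assumptions that are not stated in the excerpt: that 1 Mword means 2**20 words, and that the per-word router time is 1 microsecond, which is the value consistent with the quoted total of about 8 seconds. The names below are illustrative.

```python
# Assumption: 1 Mword = 2**20 words (the excerpt does not define the unit).
WORDS_PER_PE = 8 * 2**20
# Assumption: the router delivers one word from each PE in about 1 microsecond.
ROUTER_TIME_PER_WORD_S = 1e-6
# Clock period quoted in the excerpt.
CLOCK_PERIOD_S = 6.7e-9


def rearrangement_time_s(words_per_pe: int, time_per_word_s: float) -> float:
    """Time to move every word of every PE once, with all PEs sending in parallel."""
    return words_per_pe * time_per_word_s


total_s = rearrangement_time_s(WORDS_PER_PE, ROUTER_TIME_PER_WORD_S)
clock_cycles = total_s / CLOCK_PERIOD_S
print(f"{total_s:.2f} s, roughly {clock_cycles:.2e} clock cycles")
```

Under these assumptions the rearrangement costs a little over 8 seconds, which is more than a billion clock periods and matches the excerpt's point that reallocation is prohibitively expensive.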


A Compiler Abstraction for Machine Independent Parallel.. - Chamberlain, Choi.. (1998)   (1 citation)

....paradigm used to implement data transfer, allowing it to focus on machine-independent communication optimizations. Ironman calls are implemented using the optimal communication mechanism of the target machine, be it message passing (SP2 [3]), put- and get-based shared memory operations (T3D [15]), or cache-coherent assignment (SGI PowerChallenge [1]), and, like message passing libraries, the calls are made available as a custom library on each platform. Compilation with this library achieves data transfer specialized to the machine rather than forcing it into the one-size-fits-all ....

....for a purely communication-oriented micro-benchmark. Finally, we evaluate five benchmark programs that use the Ironman interface with the goal of evaluating the benefits of Ironman in applications. 3.1 Methodology Experiments were run on two platforms: the Intel Paragon [10] and the Cray T3D [15] (see Figure 4). On the Paragon, we use the MPICH [14] implementation of MPI and the native NX communication library routines. On the T3D we use a vendor-optimized version of PVM [16], CRI/EPCC MPI [6], and the native SHMEM [4] library routines. All benchmark programs were run on dedicated ....

Cray Research Inc. Cray T3D System Architecture Overview Manual. Mendota Heights, MN, 1993.


Architectural Support for Compiler-Generated Data-Parallel Programs - Klaiber (1994)   (1 citation)

....computers. To fully realize the advantages of parallel processing, we need to design efficient communication mechanisms. Existing communication architectures span a spectrum ranging from message passing [Arlauskas 88, Intel 91a, Dally 90, TMC 91b] to remote memory access [Crowther et al. 85, Cray 93], shared memory [Sequent 87, Lenoski et al. 92, Agarwal et al. 91], and cache-only architectures [Hagersten 92a, KSR 92]. These communication architectures are often used directly by the programmer, a fact that has influenced their design, much as assembly language programming has influenced ....

....processors reduces overall design time for the machine and results in faster time to market. In fact, most existing parallel computers, such as Intel's series of message passing machines [Arlauskas 88, Bokhari 90, Intel 91b, Intel 91a], the Thinking Machines CM-5 [TMC 91b], or the Cray T3D [Cray 93] (a NUMA machine), use processing nodes built around a commercial off-the-shelf microprocessor. Instead of requiring special-purpose processor instructions, these machines control communication through hardware external to the processor. We conclude that the choice between custom or commodity ....

[Article contains additional citation context not shown here]

Cray Research, Inc., 2360 Pilot Knob Road, Mendota Heights, MN 55120. CRAY T3D System Architecture Overview Manual (HR-04033), 1993.


The Express Broadcast Network: A Network for Low-Latency.. - Bolding (1994)   (2 citations)

....source node. A failure on a link causes loss of communication to all of its descendants. Furthermore, the only node that can broadcast in this tree is the single source node. Cray Research's T3D uses a special network to implement barrier synchronization and broadcast primitives using static trees [2]. This allows multiple source nodes, but still suffers from single-point failures. To avoid these problems, the EBN incorporates a dynamic redundant tree structure. The EBN tree structure is dynamic in that any node may become the root of the broadcast tree at will. This is accomplished by ....

....around any number of faulty links to eventually reach all connected nodes in the network. This is shown in Figure 1, where the broadcast reaches the entire network despite the presence of faults. The EBN has, in its simplest form, two basic communication primitives: eurekas and barriers. Eurekas [2] are one-to-all broadcast messages and are used when the entire network is waiting for a single event. Barriers are used when the network is waiting for all nodes to experience a common event. Barriers can be implemented by hav.... Figure 1: A broadcast using the EBN. A message originates from the ....

Cray Research. Cray T3D System Architecture Overview Manual, 1993.
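
The distinction the Bolding excerpt draws between eurekas and barriers can be modeled as an OR versus an AND over per-node event flags. The sketch below is a purely logical illustration with hypothetical function names; it does not model the EBN hardware, its tree structure, or the T3D's barrier network.

```python
from typing import Sequence


def eureka_fired(node_events: Sequence[bool]) -> bool:
    """Eureka: the whole network is released as soon as any one node reports the event."""
    return any(node_events)


def barrier_reached(node_events: Sequence[bool]) -> bool:
    """Barrier: the network is released only once every node has reported the event."""
    return all(node_events)


# Hypothetical four-node network in which only node 1 has seen the event so far.
nodes = [False, True, False, False]
print(eureka_fired(nodes), barrier_reached(nodes))
```

With only one node reporting, the eureka condition holds while the barrier condition does not, which is why the two primitives suit "first result found" and "everyone has arrived" synchronization respectively.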


Efficient Techniques for Nested and Disjoint Barrier.. - Ramakrishnan, Scherson, .. (1999)   (4 citations)

....Since data-parallel programs involve frequent barrier synchronizations, a computer intended to run data-parallel programs must implement them efficiently. For this purpose, current MIMD computers, including the CM-5 and T3D, provide a dedicated barrier tree exclusively for barrier synchronizations [4, 15, 19]. 1.1 Limitations from Control Nesting Current MIMD computers provide just one barrier tree per user application. However, data-parallel programs very often require the simultaneous use of more than one barrier synchronization tree, due to data-dependent conditionals and loops, that have ....

....imposed by the barrier tree network may run contrary to these contiguity considerations. The only proposed solution that we are aware of involves using several barrier trees, one for each partition, and in each barrier tree, masking off the processors irrelevant to that synchronization [6, 4]. Apart from being wasteful in terms of hardware, this solution places an a priori limit on the number of partitions that can be created. In this paper, we present a design for an MDBS network which matches the data network topology, and is therefore free of the constraints imposed by barrier ....

Cray Research, Inc., Eagan, MN. Cray T3D System Architecture Overview Manual, 1993.


On The Implementation And Effectiveness Of Autoscheduling For.. - Moreira (1995)   (16 citations)

....to as distributed shared memory machines. As in the case of UMA machines, processors can have private caches, adding another level to the hierarchy of memory accesses. [Figure 2.1: Centralized memory organization (Cray C90)] The Cray T3D [17] is an example of a NUMA machine, while the Convex Exemplar [18] is a cluster NUMA machine. A survey of distributed shared memory systems can be found in [19]. Different approaches are used to enhance performance by exploiting nonuniformity of memory accesses. The importance of the approach ....

....
Latency[ 2] = 0
Latency[ 3] = 0
Latency[ 4] = 0
Latency[ 5] = 0
Latency[ 6] = 0
Latency[ 7] = 100215
Latency[ 8] = 103453
Latency[ 9] = 70088
Latency[10] = 41828
Latency[11] = 23233
Latency[12] = 11318
Latency[13] = 5380
Latency[14] = 2374
Latency[15] = 906
Latency[16] = 396
Latency[17] = 86
Latency[18] = 46
Latency[19] = 3673
Total = 362996
Average = 8.718440
We note that the minimum latency observed is seven cycles, as expected (the Omega Gammah ; 2) network has three stages of switch, therefore the minimum latency is three cycles in the ....

[Article contains additional citation context not shown here]

Cray Research, Inc., Cray T3D System Architecture Overview Manual, 1993. Available from http://www.cray.com.


Scheduling Computationally Intensive Data Parallel Programs - Raghu Subramanian Systems

....processors of the back end through the tree network. In an MIMD machine, all processors have instruction fetching capabilities. Examples of SIMD machines are Thinking Machines' CM-2 [9] and MasPar's MP-2 [18]. Examples of MIMD machines are Thinking Machines' CM-5 [31, 15] and Cray Research's T3D [21, 6]. Freezing an SIMD machine at any instant and looking at all the back-end processors, one finds all of them executing exactly the same instruction (which was just broadcast to them by the front end) or perhaps idling. This architectural constraint makes it impossible to space-slice the machine, ....

....if a new job arrives, existing jobs can be squeezed to free some PEs. We did not consider re-allocation because, typically, it is prohibitively expensive: massive amounts of data have to be rearranged between processors through a very slow router network. For example, consider a CRAY T3D [6], where each PE has 8 Mwords of memory, and the router takes (optimistically) about 1 microsecond to deliver one word from each PE. Therefore a whole rearrangement of memory takes on the order of 8 seconds, which is a terribly long time for a machine with a 6.7 nanosecond clock. Unfortunately, most ....

Cray Research, Inc., Eagan, MN. Cray T3D System Architecture Overview Manual, 1993.


Memory Organization in Multi-Channel Optical Networks: NUMA.. - Xiao, Bennett

....multi-channel optical networks offer both high bandwidth and scalable broadcast capability, particularly in a form of WDM known as broadcast-and-select. The absence of scalable broadcast in conventional networks has governed the choice of memory system configuration for many systems, e.g. [18, 1, 17, 6, 5], and has in particular favored cache-coherent non-uniform memory access (CC-NUMA, or simply NUMA) over cache-only memory access (COMA) architecture [27]. In this paper, we investigate the impact of the availability of scalable broadcast on these two memory organization alternatives. Using ....

....convenient broadcast but are bandwidth limited, and cannot support the communication needs of large parallel systems. Many research and commercial multiprocessor systems use some kind of high-bandwidth, packet-switched network, such as two- or three-dimensional meshes or fat-tree connections [18, 1, 17, 6, 5]. These networks allow high-speed point-to-point communication, but one-to-all broadcast normally takes multiple point-to-point messages. 2.2 NUMA in Conventional Interconnection Networks In NUMA systems, each memory block is assigned a home node according to a memory placement policy. A ....

Cray Research Inc. Cray T3D system architecture overview manual, HR-04033, 1993.


Efficient Techniques for Fast Nested Barrier.. - Ramakrishnan, Scherson, .. (1995)   (2 citations)

....Since data-parallel programs involve frequent barrier synchronizations, a computer intended to run data-parallel programs must implement them efficiently. For this purpose, current MIMD computers, including the CM-5 and T3D, provide a dedicated barrier tree exclusively for barrier synchronizations [2, 13, 16]. Current MIMD computers provide just one barrier tree per user application. However, data-parallel programs very often require the simultaneous use of more than one barrier synchronization tree, due to data-dependent conditionals and loops, that have barriers nested in them. Any reasonably large ....

Cray Research, Inc., Eagan, MN. Cray T3D System Architecture Overview Manual, 1993.


Cost Effective Fault Tolerance for Network Routing - Yost (1995)   (1 citation)

....node. A failure on a link causes loss of communication to all of its descendants. Furthermore, the only node that can broadcast in this tree is the single source node. Cray Research's T3D uses a special network to implement barrier synchronization and broadcast primitives using static trees [Cray Research 93]. This allows multiple source nodes, but still suffers from single-point failures. To avoid these problems, the EBN incorporates a dynamic redundant tree structure. Each node in the network maintains a listening mode, checking the incoming links for messages. Any node may become the root of ....

....In a faulty network with all minimum-length paths blocked between two nodes, the propagation scheme automatically reorganizes the tree and extends its depth as needed using alternate paths. The EBN has, in its simplest form, two basic communication primitives: eurekas and barriers. Eurekas [Cray Research 93] are one-to-all broadcast messages and are used when the entire network is waiting for a single event. Barriers are used when the network is waiting for all nodes to experience a common event. Barriers can be implemented by having all nodes broadcast eureka messages if they have not experienced ....

Cray Research. Cray T3D System Architecture Overview Manual, 1993.


Architectural Mechanisms for Explicit.. - Ramachandran.. (1995)   (6 citations)

....an application might need to switch between protocols as its data sharing and access patterns change. These issues need further investigation and are part of our ongoing work. 7 Related Work There are many related projects that have similar goals to ours. One commercial product, the Cray T3D [8], provides shared memory but does not maintain cache coherence. It is totally up to the system software to ensure coherence of cached data. In terms of primitives for explicit communication, they offer mechanisms to access the remote memory, block transfer, and prefetch. Another shared memory ....

Cray Research, Inc., Minnesota. The Cray T3D System Architecture Overview Manual, 1993.


An Extended Dominating Node Approach to Broadcast and Global.. - Tsai, McKinley (1996)   (2 citations)

....time is approximately 0.45 microseconds. More recent MPCs exhibit much lower startup latencies. The Cray T3D, for example, has a startup latency (combined send and receive latencies) of approximately 1.5 microseconds and a raw data transfer rate of 300 megabytes/sec on each data channel [7]. However, the ratio of startup latency to per-hop network delay for the T3D is not much different than that of the nCUBE-2 (450 versus 378). One interpretation of this characteristic is that software overhead remains the dominant part of overall total communication latency and that multiple ports ....

Cray Research, Inc., CRAY T3D System Architecture Overview Manual, 1993.


A Compiler Abstraction for Machine Independent Parallel.. - Chamberlain, Choi.. (1997)   (1 citation)

....paradigm used to implement data transfer, allowing it to focus on machine-independent communication optimizations. Ironman calls are implemented using the optimal communication mechanism of the target machine, be it message passing (SP2 [3]), put- and get-based shared memory operations (T3D [15]), or cache-coherent assignment (SGI PowerChallenge [1]), and, like message passing libraries, the calls are made available as a custom library on each platform. Compilation with this library achieves data transfer specialized to the machine rather than forcing it into the one-size-fits-all ....

....for a purely communication-oriented micro-benchmark. Finally, we evaluate five benchmark programs that use the Ironman interface with the goal of evaluating the benefits of Ironman in applications. 3.1 Methodology Experiments were run on two platforms: the Intel Paragon [10] and the Cray T3D [15] (see Figure 4). On the Paragon, we use the MPICH [14] implementation of MPI and the native NX communication library routines. On the T3D we use a vendor-optimized version of PVM [16], CRI/EPCC MPI [6], and the native SHMEM [4] library routines. All benchmark programs were run on dedicated ....

Cray Research Inc. Cray T3D System Architecture Overview Manual. Mendota Heights, MN, 1993.


Application Oriented Resource Management on Large Scale.. - Ekanadham, Moreira, Naik (1995)   (2 citations)

....be performed efficiently. 1 INTRODUCTION The recent years have seen a growth in the use of large-scale parallel systems, those with hundreds and sometimes thousands of processors, for the execution of complex scientific codes. Examples of these systems include the IBM SP2 [7], Cray Research T3D [2], and Thinking Machines CM-5 [10]. These parallel systems are usually installed at supercomputing centers and shared by many users with varying demands. To accommodate multiple simultaneous users, and, thus, improve utilization of the machine, most current large-scale parallel systems divide the ....

Cray Research Inc., Cray T3D System Architecture Overview Manual, Eagan, Minn., 1994.


Processor Allocation in Multiprogrammed Distributed-Memory.. - Naik, Setia, al. (1997)   (8 citations)

....the static partitioning of the processors into a fixed number of disjoint sets has been most often used to allocate resources to individual jobs. On more recent general-purpose distributed systems, processors can be regrouped and processor partitions can be formed just prior to scheduling a job [1, 5]. On such systems, adaptive partitioning policies have been considered, where the number of processors allocated to a job is determined based on the system state at job arrivals and departures. Several batch processing schedulers (e.g. EASY, ....

.... 2 Characteristics of the Parallel Environment We consider a parallel computing environment that is fairly generic and is based on the characteristics of large-scale parallel platforms, such as IBM's RS/6000 SP systems and Cray's T3D and T3E systems [1, 5]. We focus on these systems as they have proven to be commercially viable. The parallel system assumed in our study consists of a large number of processors, each with its own local memory. In our simulation experiments the system has 512 identical processors, each having 512 MB of local memory and ....

Cray Research Inc, Eagan, MN. Cray T3D System Architecture Overview Manual, 1994.


Autoscheduling in a Distributed Shared-Memory Environment - José Moreira (1994)   (8 citations)

....parallel processors. They combine the scalability of message passing architectures with the easier programming model of shared memory. For a survey of distributed shared memory architectures, see [18]. Detailed information on the architecture and programming model of the Cray T3D can be found in [5, 19, 20]. A technical description of the KSR-1, a cache-only variant of distributed shared memory, is provided in [14]. Examples of languages that support the specification of data distribution include Fortran D [6], Vienna Fortran 90 [2], High Performance Fortran [11, 15], and Cray MPP Fortran [20] ....

Cray Research Inc. CRAY T3D System Architecture Overview Manual, 1994.


An Application-driven Study of Parallel System.. - Sivasubramaniam.. (1999)   (1 citation)  Self-citation (Inc)

....factors often play an important role in its physical realization. Agarwal [11] and Dally [12] show that wire delays (due to increased wire lengths associated with planar layouts) of higher-dimensional networks make low-dimensional networks more viable. The 2-dimensional [50] and 3-dimensional [51, 52] toroids are common topologies used in current-day networks, and it would be interesting to project link bandwidth requirements for these topologies. A metric that is often used to compare different networks is the bisection bandwidth available per processor. On a k-ary n-cube, the ....

Cray Research, Inc., Minnesota, The Cray T3D System Architecture Overview Manual, 1993.


An Investigation of the Unified Parallel C Memory Consistency Model - Huang (2002)

No context found.

Cray Research, Inc. CRAY T3D System Architecture Overview Manual. 1993.
