S. L. Scott, "Synchronization and communication in the T3E multiprocessor," Technical report, Cray Research, Inc., 1996.
....The processing performed by the memory in the Hamal architecture is therefore limited to simple single cycle atomic memory operations such as addition, maximum and boolean logic. These operations are useful for efficient synchronization and are similar to those of the Tera [Alverson90] and Cray T3E [Scott96] memory systems. 3.3.5 Memory Traps and Forwarding Pointers Three trap bits (T, U, V) are associated with every 128 bit data memory word. The meaning of the T bit depends on the contents of the memory word. If the word contains a valid data pointer, the pointer is interpreted as a forwarding ....
.... consecutive virtual addresses in a segment may be distributed among any power of two number of memory units [Alverson90]. The Cray T3E features an address centrifuge which can extract user specified bits from a virtual address and use them to form the ID for the node on which the data resides [Scott96]. The Hamal processor contains no global segment or translation tables; virtual addresses are routed to physical nodes based exclusively on the upper address bits. To compensate for this somewhat rigid mapping and to allow applications to lay out an object in a flexible manner without performing ....
[Article contains additional citation context not shown here]
Steven L. Scott, "Synchronization and Communication in the T3E Multiprocessor", Proc. ASPLOS VII, 1996, pp. 26-36.
....computers, hardware routers, performance evaluation, VLSI design, simulation. 1 Introduction Multiprocessor performance has considerably increased during the last decade. On the one hand, distributed shared-memory multiprocessors (DSMs) are becoming widespread (SGI Origin 2000 [16], Cray T3E [20]). On the other hand, nowadays message passing multicomputers constitute the frontier of computing power (ASCI Project [15]). As processor computing power increases, communication performance should increase accordingly in order to adequately balance the system. Interconnection networks have also ....
S. L. Scott, "Synchronization and Communication in the T3E Multiprocessor", Proc. ASPLOS VII, Cambridge, MA, October 1996
....class for all highly parallel machines built in the last years, there are still architectural differences in hardware and system software. Older systems are designed and used as a large coprocessor to a front end computer (e.g. T3D [11]); modern systems allow stand alone operation (e.g. T3E [12]). In contrast to the coprocessor solutions, stand alone operation requires an operating system on the parallel computer. Evolving a parallel operating system from a conventional microkernel based operating system (e.g. Mach, Windows NT) seems to be a practicable approach at first sight. But it is ....
S. L. Scott: "Synchronization and Communication in the T3E Multiprocessor", Operating Systems Review, vol. 30, no. 5, pp. 26-36, Cambridge, MA, December, 1996.
....Raven environment executing on a single processor, on a set of distributed processors, and on a shared memory multiprocessor. keywords: multiprocessor, distributed shared memory, verification, simulation, high level validation 1 Introduction Contemporary multiprocessors, such as the Cray T3E [2] and the SGI Origin [3, 4] have tens of millions of gates per node. This complexity makes the verification of the design a very problematic and time consuming endeavor. This verification problem is rooted in the exponential growth of IC technology. As design complexity and integration increase, ....
Steven L. Scott, "Synchronization and Communication in the T3E Multiprocessor," Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII), pp. 26--36, October 1996.
.... the SPMD paradigm (Simple Program Multiple Data). It was implemented using the MPI (Message Passing Interface) message passing standard library [10]. The main advantage of using this library is that it is presently implemented in many computers, which guarantees the portability of the code [11, 12]. A distributed memory multicomputer, a CRAY T3E, was used to test this software. This computer is a very powerful and flexible parallel scalable system [11]. It consists of 16 up to 2048 processors connected by a wide bandwidth bidirectional 3-D torus network. Each cell includes a Dec Alpha ....
....main advantage of using this library is that it is presently implemented in many computers, which guarantees the portability of the code [11, 12]. A distributed memory multicomputer, a CRAY T3E, was used to test this software. This computer is a very powerful and flexible parallel scalable system [11]. It consists of 16 up to 2048 processors connected by a wide bandwidth bidirectional 3-D torus network. Each cell includes a Dec Alpha 21164 microprocessor, local memory and control logic. The capacity of local memory can range from 64 Mbytes to 2 Gbytes. We have analysed the case of the GaAs ....
S. L. Scott, "Synchronization and communication in the T3E multiprocessor," Tech. Rep., Cray Research, Inc., 1996.
.... the SPMD paradigm (Simple Program Multiple Data) [14]. It was implemented using the MPI (Message Passing Interface) message passing standard library [15]. The main advantage of using this library is that it is presently implemented in many computers, which guarantees the portability of the code [16, 17]. A CRAY T3E distributed memory multicomputer was used to test this software. This computer is a very powerful and flexible parallel scalable system [16]. It comprises up to 2,048 processors connected by a wide bandwidth bidirectional 3-D torus network. Each cell includes a Dec Alpha 21164 ....
....The main advantage of using this library is that it is presently implemented in many computers, which guarantees the portability of the code [16, 17]. A CRAY T3E distributed memory multicomputer was used to test this software. This computer is a very powerful and flexible parallel scalable system [16]. It comprises up to 2,048 processors connected by a wide bandwidth bidirectional 3-D torus network. Each cell includes a Dec Alpha 21164 microprocessor, local memory and control logic. The capacity of local memory can range from 64 Mbytes to 2 Gbytes. We have analyzed a gradual HBT device such ....
S. L. Scott, "Synchronization and communication in the T3E multiprocessor," Technical report, Cray Research, Inc., 1996.
....against an equivalent Verilog test bench. We establish lower and upper bounds on the performance of the Raven environment executing on a single processor, on a set of distributed processors, and on a shared memory multiprocessor. 1 Introduction Contemporary multiprocessors, such as the Cray T3E [2] and the SGI Origin [3, 4] have tens of millions of gates per node. This complexity makes the verification of the design a very problematic and time consuming endeavor. This verification problem is rooted in the exponential growth of IC technology. As design complexity and integration increase, ....
Steven L. Scott, "Synchronization and Communication in the T3E Multiprocessor," Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII), pp. 26--36, October 1996.
....critical. Although the GLOW extensions described in the next chapter can be applied to many different topologies, I use the k-ary n-cube networks of SCI rings in 2 and 3 dimensions. Because of its scalability, this topology has great appeal in real world systems such as the CRAY T3D [21] and T3E [87], SGI Origin 2000 [62], Convex Exemplar 2000 [1]. The rings use a 500 MHz clock; 16 bits of data can be transferred every clock cycle through every link, giving a total of 1 GB/sec bandwidth. This is equivalent to the actual IEEE 1596 standard which describes a 250 MHz network that sends 16 bits of ....
....of large systems. Typically, a large distributed shared memory multiprocessor comprises a number of nodes (16 or more) each containing one or more processors, caches and a part of the global memory. The nodes are connected with a high performance network. In real systems (CRAY T3D [21] and T3E [87], SGI Origin 2000 [62], Convex Exemplar 2000 [1]) large diameter networks such as multi dimensional tori are preferred over large centralized switches (e.g. crossbar switches) since the latter can be significantly more expensive with current technology. GLOW is intended for systems with ....
[Article contains additional citation context not shown here]
Steven L. Scott, "Synchronization and Communication in the T3E Multiprocessor." In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), October 1996.
....network throughput. Another trend has been to accept a reduction in communication performance in tightly coupled multiprocessor networks in return for greater generality in router designs. Examples of this are in the designs of the FLASH message processor [42] and the Cray T3E multiprocessor [65]. FLASH is a tightly coupled shared memory multiprocessor consisting of a large number of processing nodes connected by a two dimensional mesh network. It uses a general purpose node controller called the MAGIC chip. The MAGIC chip trades off performance for flexibility and low hardware overhead ....
S. L. Scott, "Synchronization and communication in the T3E multiprocessor," Proc. Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, October 1996.
....we used for the tests, the Cray T3D, already provides dedicated hardware for remote read and write operations. The new Cray T3E uses enhanced virtual shared memory mechanisms to make these shared memory mechanisms more usable, and also to implement message passing primitives more efficiently [29]. Note, however, that remote memory operations can be exploited to transfer data when the sender (receiver) knows the address to which it has to write (read). In our support, this is the case of communications which migrate data tiles toward underloaded processors, and receive back the updates. ....
S. L. Scott, "Synchronization and Communication in the T3E Multiprocessor," in Proc. of the Seventh Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII), Oct. 1996.
....data structure before the lock has been acquired, while allowing nondependent instructions to execute. Delayed response device registers are different from traditional full/empty bits (such as those used in dataflow computers [14], the fut and cfut of the J-Machine [12], or the Cray T3E's E-registers [16]) in several ways. First of all, DRDRs are used to detect conditions, such as whether or not a message has arrived, rather than the validity of a particular word of memory. This separation of the condition from the region of memory allows more flexibility, even when DRDRs are used only to oversee ....
Steven Scott. "Synchronization and Communication in the T3E Multiprocessor," In Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII), pages 26--36, 1996.
....memory multicomputers using the MIMD strategy (Multiple Instruction Multiple Data) under the SPMD paradigm (Simple Program Multiple Data). A distributed memory multicomputer, a CRAY T3E, was used to test this software. This computer is a very powerful and flexible parallel scalable system [20]. It consists of 16 up to 2048 processors connected by a wide bandwidth bidirectional 3-D torus network. Each cell includes a Dec Alpha 21164 microprocessor, local memory and control logic. The capacity of local memory can range from 64 Mbytes to 2 Gbytes. Our program was implemented using the MPI ....
....local memory can range from 64 Mbytes to 2 Gbytes. Our program was implemented using the MPI (Message Passing Interface) message passing standard library [21]. The main advantage of using this library is that it is presently implemented in many computers, which guarantees the portability of the code [20, 22]. All the simulator code, from mesh generation and refinement to the construction and solution of the systems of equations with the different preconditioners, has been parallelized. The load balancing obtained was almost optimum or even optimum in many cases. 4 APPLICATION TO ABRUPT InP InGaAs ....
S. L. Scott, "Synchronization and communication in the T3E multiprocessor," Tech. Rep., Cray Research, Inc., 1996.
....performance when no locality is available by allowing the on chip caches to be bypassed. However, if the data to be loaded were in the data cache, then accessing that data via E-registers would be sub optimal because the cache backmap would first have to flush the data from data cache to memory [5] [6] [7]. Figure 2 shows the measured one way communication bandwidth for different message sizes using MPI. The test program uses all of the 28 processors available in the system for parallel applications. There is always the same sender processor and one receiver processor that varies. The measures ....
S. L. Scott. "Synchronization and Communication in the T3E Multiprocessor", Proceedings of ASPLOS VII, October 1996.
....the torus topology due to deadlocks. However, it would be interesting to observe how network resource control, in general, can be applied to such symmetric topology networks. Note that torus networks are currently being used in several commercial multicomputers such as Cray T3D [25] and T3E [99], and Fujitsu AP3000 [98]. Figure 3.18 compares node throughput of 16x16 torus networks when RR and ALU biasing are used, respectively. We follow the virtual channel assignment method described in [100] where virtual channels are decided when packets are injected. It is slightly different from ....
.... messages of up to 4 Kbytes are guaranteed to be delivered within 20 msecs, which is much less than the delay bounds required for interactive video audio applications [18, 19]. Note that existing multicomputer networks with round robin arbitration, for instance, the Intel Paragon [24] and Cray T3E [99], can theoretically provide delay guarantees of a few days for 1,024 node configurations. It should be emphasized that the delay bounds we derived above hold true independent of network states or other connections' behaviors. In RCQ, such traffic isolation is supported 1 The frame transmission ....
S. L. Scott. "Synchronization and communication in the T3E multiprocessor," in Architectural Support for Programming Languages and Operating Systems (ASPLOS VII), Cambridge, MA, Oct. 1996, pp. 26--36. Available from http://reality.sgi.com/ sls craypark/Papers/asplos96.html.
.... to perform some amount of computation; proposals vary on the granularity and intent of the computation, from simple operations on single bit operands, to synchronization support, to implementation of entire data structures [24]. Systems using smart memory for synchronization include the Cray T3E [26], a shipping commercial system, and Cedar [18], an older research system in many ways similar to the T3E. The smart memory implements atomic operations such as compare-and-swap, fetch-and-increment, and test-and-add. In early work in this area, described by Stone [29], arithmetic and logical ....
....complete (e.g. due to a cache miss or some consistency action) would not necessarily stall the CPU [11,13]. In prefetching, data is moved into a cache before needed. Prefetching can be accomplished by having the programmer or compiler insert prefetch instructions for data ahead of its use [11,21,26], by having hardware regularly issue fetches for memory locations of some fixed stride (the stride and timing set by the programmer or even determined automatically) [6,26], or, simplest of all, by using long cache lines. 5.5 Latency Hiding Discussion Each of these approaches is quite different ....
[Article contains additional citation context not shown here]
S.L. Scott, "Synchronization and communication in the T3E multiprocessor," in Proc. of the Int. Conf. on Architectural Support for Programming Languages and Operating Systems, October 1996.
No context found.
S. L. Scott, "Synchronization and communication in the T3E multiprocessor," Technical report, Cray Research, Inc., 1996.
No context found.
S. L. Scott. `Synchronization and communication in the T3E multiprocessor'. Technical report, Cray Research, Inc. (1996).
No context found.
S. L. Scott, "Synchronization and communication in the T3E multiprocessor," Tech. Rep., Cray Research, Inc., 1996.