V. Karamcheti and A. A. Chien. "A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D," Proceedings of ISCA 95, Santa Margherita, Italy, pp. 298-307, 1995.
....trusts those consumers that have adequately responded to data requests and does not trust those who have not responded. This algorithm also has the property of unifying the seemingly diverse implementations of push-based and pull-based approaches often found in message-passing layers [23]. When producers are the bottleneck for the data transfer, all consumers respond quickly to data requests, and the choice of destinations degenerates to a random choice from the entire set of consumers, a push-based approach. However, when the consumers are the bottleneck, each producer has ....
V. Karamcheti and A. A. Chien. A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D. In 22nd Annual International Symposium on Computer Architecture, pages 298--307, Santa Margherita Ligure, Italy, June 22--24, 1995.
....are unresponsive but the network has a relatively large amount of buffering (in VIA, the size of host memory). Adding (traditional) barriers therefore does little to help the network, but makes nodes and the cluster as a whole less tolerant of unresponsiveness. Based on results from both the CM-5 and the T3D, Karamcheti and Chien [57] also conclude that fan-in is an important problem in parallel computing. Their solution is to employ pull messaging, in which receivers pull messages from each sender (using remote reads), as opposed to more traditional push messaging. This ensures that messages are not transmitted faster than the receiver can process them. ....
Vijay Karamcheti and Andrew Chien. A comparison of architectural support for messaging in the TMC CM-5 and the Cray T3D. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA '95), pages 298--307, Santa Margherita Ligure, Italy, June 1995. Available from http://www-csag.ucsd.edu/papers/cm5-t3d-messaging.ps.
....4D 340 [Baskett et al. 1988], for message-passing machines such as the Intel iPSC/860 [Berrendorf and Helin 1992], and for heterogeneous networks of workstations. While no implementation currently exists for shared memory machines with incoherent caches such as the Cray T3D [Arpaci et al. 1995; Karamcheti and Chien 1995], it would be possible to implement Jade on such machines. 5.1 Overview. Strictly speaking, there are two Jade implementations: one for shared memory platforms and one for message-passing platforms. While each implementation is tailored for its own specific computational environment, the ....
....use by subsequently created tasks. 5.3.2 Extensions for Incoherent Caches. In this section we have assumed that the hardware fully implements the abstraction of a single shared address space. Machines with incoherent caches, however, only partially implement this abstraction [Arpaci et al. 1995; Karamcheti and Chien 1995]. These machines automatically fetch and cache remote memory, but rely on software to keep the caches consistent. While no Jade implementation currently exists for machines with incoherent caches, we believe that Jade could be a useful programming language for such machines. The most difficult ....
Karamcheti, V. and Chien, A. 1995. A comparison of architectural support for messaging in the TMC CM-5 and the Cray T3D. In Proceedings of the 22nd International Symposium on Computer Architecture. ACM, New York.
....so as not to involve the YMP. Also, the black art of the MPP environment variables is less than satisfying. In some cases, one must tune these variables just to get the program to run which we find unacceptable. We would like to see CRI support a fast native message passing environment such as FM[7]. On the positive side, the data network is much faster than on other MPPs; most notably the CM 5. Also, floating point performance was easier to obtain with the DEC Alpha processors than with the CM 5 vector units when programming in MIMD model. In general, vector unit memory management problems ....
Vijay Karamcheti and Andrew Chien. A Comparison of Architectural Support for Messaging on the TMC CM-5 and the Cray T3D. In Proceedings of ISCA, 1995.
....unique goal [12, 25, 30, 31] but FM is distinguished by its hardware context (Myrinet) and high performance. The Fast Messages project focuses on optimizing the software messaging layer that resides between lower level communication services and the hardware. It is available on both the Cray T3D [22, 23] and Myricom's Myrinet [6]. Using the Myrinet, FM provides MPP-like communication performance on workstation clusters. FM on the Myrinet achieves low latency, high bandwidth messaging for short messages, delivering 32 µs latency and 16 MBytes/s bandwidth for 128-byte packets (user level to ....
....SBus bandwidth, and hence are not critical performance factors. 3 The Fast Messages Approach 3.1 Illinois Fast Messages (FM) 1.0. Illinois Fast Messages (FM) is a high performance messaging layer which is available on several parallel platforms (Cray T3D and workstation clusters) [22, 23]. The design goal of FM is to deliver network hardware performance to the application level with a simple interface. FM is appropriate for implementors of compilers, language runtimes, communications libraries, and in some cases application programmers. Function Operation FM send ....
Vijay Karamcheti and Andrew A. Chien. A comparison of architectural support for messaging on the TMC CM-5 and the Cray T3D. In Proceedings of the International Symposium on Computer Architecture, 1995. Available from http://www-csag.cs.uiuc.edu/papers/cm5-t3d-messaging.ps.
....communication: number of message start-ups (the first term in Equation 1) and node contention. The communication features of state-of-the-art HPC platforms such as the CM-5, SP2, T3D, and workstation clusters interconnected by an ATM network or Myrinet are shown in Table 1. The table is based on [5, 6, 7, 8, 9, 10] and our own measurements. It should be noted that the numbers vary depending on the version of the software environment used for message passing. [Figure 1. A scenario depicting ....]
V. Karamcheti and A. A. Chien, "A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D," Proc. of ISCA'95, June 1995.
....bus NI, such as CNI 32 Q m . 5.5 Related Work To the best of my knowledge, this work is the first to systematically identify, examine, and explore the data transfer and buffering parameters that underlie the design of high performance NIs for fine grain communication. Karamcheti and Chien [58] compared the messaging support in TMC CM 5 and Cray T3D and concluded that requiring processor involvement for message reception can significantly degrade performance. I improve upon their work by exposing and examining the design space of data transfer and buffering parameters. Blumrich, et al. ....
Vijay Karamcheti and Andrew A. Chien. A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 298--307, 1995.
....name server. 1 To the best of our knowledge, no current generation implementation supports a full 64-bit address space. The communication module and the program load module may be built on top of any well defined low level messaging layer, such as TCP/IP, Active Messages [130, 106] and Fast Messages [76]. They give actors the illusion of a completely connected network. Moreover, the hierarchical organization offers the runtime system some degree of network independence, and thus, portability. The node manager delivers messages from remote nodes, creates actors in response to remote requests, and ....
V. Karamcheti and A.A. Chien. A Comparison of Architectural Support for Messaging on the TMC CM-5 and the Cray T3D. In Proceedings of International Symposium of Computer Architecture, 1995.
....prefetch queue. The processor then explicitly extracts the prefetched word from the queue when it is needed. The prefetch queue can only store 16 words of data. When the prefetch queue is full, no more data can be prefetched until the prefetched words are extracted from the queue. Previous studies [1, 16] indicated that the overhead of interacting with the DTB Annex and the prefetch queue is significant. Software support for shared address space and data prefetching is provided. The programmer can use a compiler directive in the Cray MPP Fortran (CRAFT) language [8] to declare shared data and to ....
V. Karamcheti and A. Chien. A comparison of architectural support for messaging in the TMC CM-5 and the Cray T3D. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 298--307, June 1995.
....dependence analyzers [12, 69] can handle non-affine expressions, the other complexities involved in these access patterns prevent them from privatizing X to parallelize this loop. 2.1.2 Communication Analysis. Communication optimization for distributed memory machines, such as the Cray T3D [5, 23, 39] and the SGI Origin [65], showed a need for gathering even more precise array access information for supporting efficient data movement and copying between distributed memories. For instance, for data movement in these systems, the communication analyzer often needs to selectively decide between ....
V. Karamcheti and A. Chien. A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D. International Symposium on Computer Architecture, June 1995.
....of the communication bandwidth offered by the expensive fast interconnect to remain unutilized. With low maximum message rate per node, the probability of consumption delay can be small leading to limited benefits from higher number of consumption channels. Several research studies are ongoing [27, 19, 26, 6] for designing messaging protocols and hardware to reduce such overheads. In the near future we expect efficient protocols to offer t s in the range of few tens to hundreds of cycles. Similarly, in hardware based implementations of Distributed Shared Memory (DSM) systems with small messages for ....
V. Karamcheti and A. A. Chien. A Comparison of Architectural Support for Messaging on the TMC CM-5 and the Cray T3D. In Proc. of the Int. Symp. on Computer Architecture, pages 298--307, 1995.
....release CRI EPCC 1.3, developed by the Edinburgh Parallel Computing Centre in collaboration with Cray Research Inc. We have also conducted some tests by exploiting a kernel MPI library built on the top of the Fast Messaging (FM) layer developed at the University of Illinois at Urbana Champaign [17]. The experiments are concerned with a SUPPLE implementation of the benchmark illustrated in Section 2 on a set of synthetic data sets presented in Section 4.1. We have also compared the SUPPLE implementation with an HPF style one. For the HPF implementation we have used the Cray CRAFT Fortran, ....
....their optimality in terms of percentage with respect to the optimum time, then for = 0.038 this percentage ranges from 6% to 9.4%, while for = 0.3 it ranges between 0.88% and 1.69%. We have carried out some other tests by adopting an MPI kernel built on top of the FM messaging layer [17]. More specifically, we used the Pull FM layer, which improves performance in the presence of network traffic which may produce output contention. Note that, since the workload in our tests is concentrated, our support may cause output contention because many messages are sent to the same ....
V. Karamcheti and A. A. Chien, "A comparison of architectural support for messaging on the TMC CM-5 and Cray T3D," in Proceedings of the 22nd Annual International Symposium on Computer Architecture, New York, June 22--24 1995, pp. 298--307, ACM Press.
.... Mhz MDP 43 cyc[10] Streaming Injection 1024 max round trip null RPC CS 2 40 Mhz 20 S[39] Channel SPARC 24.6 S[23] DMA w active message Hardware Table Lookup 174 S[21] PARMACS macros ping pong 206 S[20] mpsc library mesg exhange T3D 150 Mhz 21064 600nS[26] Shared Memory 2048 max remote read 2:76 S [40] Fast Messages F I Specific 16 byte Fetch and Increment Hardware Support 120 S[26] Interrupt Driven User Level Message Message Handler T 88100MP dispatch 20 cyc microthreading remote load [18] NOW HP9000 735 50 S [24] LAN based 125Mhz PA RISC 7150 sockets on Active Message cluster of ....
Vijay Karamcheti, Andrew A. Chien, "A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D", in ISCA 1995, pp. 298--307.
....Second, with limited buffering and large bursts of messages (a common occurrence in loosely synchronized parallel applications), a processor must constantly monitor ULNI status changes and remove messages from the limited ULNI buffers to avoid clogging up the network. Karamcheti and Chien [28] have shown that processor performance can degrade significantly if it is required to constantly monitor ULNI status in this fashion. Third, since ULNIs provide direct access to the network without operating system intervention, it must be virtualized to allow multiple processes to access the ....
Vijay Karamcheti and Andrew A. Chien. A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 298--307, 1995.
....algorithms where our approach is based on queueing models and attempts to model the constant factors that are of concern to application programmers and machine designers. A number of researchers have examined application performance in an empirical setting. For example, Karamcheti et al. [16] studied the network interface architectures in the Cray T3D and TMC CM-5 and examined several messaging implementations for reducing output contention effects. Holt et al. [13] studied the performance of cache coherent distributed shared memory machines using four parameters similar to LogP. They ....
V. Karamcheti and A. A. Chien, "A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D", Proc. of the ISCA'95, Santa Margherita Ligure Italy, 1995.
....issues. Specific techniques include: aggressive flow sensitive interprocedural analysis [17, 18]; directed cloning and optimization (procedure and object inlining) [6, 20]; compiler managed locality and memory latency management [27]; efficient, robust communication primitives [12, 14]; hybrid stack heap execution (efficient multithreading) [14, 19]; and view caching [13]. 1.2 Application Suite. In this paper, we use a suite of seven irregular applications to evaluate parallel programming support in ICC . Table 1 briefly describes the applications. Although spanning ....
Vijay Karamcheti and Andrew A. Chien. A comparison of architectural support for messaging on the TMC CM-5 and the Cray T3D. In Proceedings of the International Symposium on Computer Architecture, 1995. Available from http://www-csag.cs.uiuc.edu/papers/cm5-t3d-messaging.ps.
....other hand, if a messaging layer's guarantees are too strong (i.e. they provide more functionality than is generally needed) the messaging layer's common case performance may be needlessly degraded. Analysis of the literature and our ongoing studies to support fine grained parallel computing [5, 12, 13, 14] have led to the conclusion that a low level messaging layer should provide the following key guarantees: reliable delivery, in-order delivery, and control over scheduling of communication work (decoupling). As mentioned in the previous section, studies of communication software costs [12] ....
V. Karamcheti and A. A. Chien. A comparison of architectural support for messaging on the TMC CM-5 and the Cray T3D. In Proceedings of the International Symposium on Computer Architecture, pages 298-307, 1995. Available from http://www-csag.cs.uiuc.edu/papers/cm5-t3d-messaging.ps.
....codes for execution. However, several recent studies on NCC machines imply that using a shared memory model with one-sided communication primitives may be a better way to program an NCC machine than using a message passing model with two-sided communication. For instance, some experimental studies [3, 13, 25] with micro-benchmarking indicated that one-sided implementations are likely more efficient on NCC machines than two-sided implementations because they utilize the architectural features of the machines more efficaciously. In addition, other studies [8, 15] concluded that (1) Put/Get is ....
V. Karamcheti and A. Chien. A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D. International Symposium on Computer Architecture, June 1995.
....API, but also to applications written to a wide range of higher level communication APIs such as MPI, SHMEM Put/Get and Global Arrays [5, 7]. To implement these higher level APIs efficiently, FM must provide the right set of delivery guarantees: too weak or too strong will reduce performance. See [6, 8] for more detailed discussion of these issues. FM provides the following guarantees to enable simple, high performance implementations of a wide range of user-level APIs: in-order delivery, reliable delivery, and decoupling of the host processor and the network. Together, these ....
Karamcheti, V. and A.A. Chien. A comparison of architectural support for messaging on the TMC CM-5 and the Cray T3D. in International Symposium on Computer Architecture. 1995.
....may block which allows forwarding on the stack. Table 1: Various thread interaction schemas in the hybrid stack heap execution model. 4.5 Fast Communication and Thread Scheduling. To support fine grained, distributed programs efficiently, the Concert implementation is built atop Fast Messages (FM) [20], which utilizes novel implementation techniques such as receiver-initiated data transfer to support high performance messaging in the face of irregular communication that is unsynchronized with ongoing computation (a consequence of our dynamic programming model). These low overhead, robust ....
Vijay Karamcheti and Andrew A. Chien. A comparison of architectural support for messaging on the TMC CM-5 and the Cray T3D. In Proceedings of the International Symposium on Computer Architecture, 1995. Available from http://www-csag.cs.uiuc.edu/papers/cm5-t3d-messaging.ps.
....The runtime system exposes specialized versions [22] of important runtime primitives, such as remote method invocation and synchronization via futures [15] to the compiler to exploit compile time information. In addition, communication is realized via low overhead messaging layers: Fast Messages [23, 24] on the CRAY T3D and Active Messages [48] on the TMC CM 5. In addition to the standard features of the programming model, IC CEDAR also utilizes general placement directives of collection of objects (similar to map arrays in HPF) for spatial based object distribution and grouping. 3. IC CEDAR ....
....accesses are to objects already accessed in the same time step. Although multithreading in the execution model is effective to hide the communication and remote invocation latency, the processor overhead of communication and synchronization dominates. Even with a low overhead communication layer [23], an access to a remote object involves sending and receiving two messages (request and reply) and the execution of a remote accessor handler, costing nearly 12 microseconds of total processor overhead on the T3D. In addition, a remote invocation causes processor synchronization overhead when the ....
Vijay Karamcheti and Andrew A. Chien. A comparison of architectural support for messaging on the TMC CM-5 and the Cray T3D. In Proceedings of the International Symposium on Computer Architecture, 1995. Available from http://www-csag.cs.uiuc.edu/papers/cm5-t3d-messaging.ps.
....processor, synonyms can occur. If a run time table of Annex entries is kept, performing the lookup in software would likely cause greater delay than simply updating the Annex register (23 cycles). So it is their conclusion that only one Annex register is required. This is corroborated by work in [8, 10]. The performance of cached reads, uncached reads, and remote writes was also presented. An uncached read costs roughly 610 ns, while a cached read required 765 ns to complete. A remote write needs roughly 880 ns. All of these measurements were performed on accesses to an adjacent node, for ....
....through operating system invocation, achieves the highest transfer rate at roughly 140 MB/s for 512KB reads. From Culler's measurements, for reads of more than 16 KB, the BLT mechanism achieves better bandwidth than the other methods. Another inspection of the Cray T3D hardware shell is done in [8]. In this work, Chien compares the T3D's communication hardware to that of the Thinking Machines CM-5. Because the CM-5 is focused on supporting the data parallel programming model, it requires processor intervention to process incoming messages. Therefore, if the processors are working ....
V. Karamcheti and A. Chien. A comparison of architectural support for messaging in the TMC CM-5 and Cray T3D. In International Conference on Computer Architecture, pages 298--307, 1995.
No context found.
Vijay Karamcheti and Andrew A. Chien. A comparison of architectural support for messaging on the TMC CM-5 and the Cray T3D. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA '95), pages 298--307, Santa Margherita Ligure, Italy, June 1995. Available from http://www-csag.ucsd.edu/papers/cm5-t3d-messaging.ps.
No context found.
Vijay Karamcheti, Andrew A. Chien, "A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D", Proc. ISCA 1995, pp. 298-307.