41 citations found.
V. Karamcheti and A. A. Chien. "A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D," Proceedings of ISCA 95, Santa Margherita, Italy, pp. 298-307, 1995.

This paper is cited in the following contexts:
An Information-Based Approach to Distributed Systems Design - Arpaci-Dusseau..

.... trusts those consumers that have adequately responded to data requests and does not trust those who have not responded. This algorithm also has the property of unifying the seemingly diverse implementations of push-based and pull-based approaches often found in message-passing layers [23]. When producers are the bottleneck for the data transfer, all consumers respond quickly to data requests, and the choice of destinations degenerates to a random choice from the entire set of consumers: a push-based approach. However, when the consumers are the bottleneck, each producer has ....

V. Karamcheti and A. A. Chien. A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D. In 22nd Annual International Symposium on Computer Architecture, pages 298--307, Santa Margherita Ligure, Italy, June 22--24, 1995.


Unresponsiveness-Tolerant Collective Communication - Pakin

....are unresponsive but the network has a relatively large amount of buffering (in VIA, the size of host memory). Adding (traditional) barriers therefore does little to help the network, but makes nodes and the cluster as a whole less tolerant of unresponsiveness. Based on results on both the CM-5 and T3D, Karamcheti and Chien [57] also conclude that fan-in is an important problem in parallel computing. Their solution is to employ pull messaging, in which receivers pull messages from each sender (using remote reads), as opposed to more traditional push messaging. This ensures that messages are not transmitted faster than the receiver can process them. ....

Vijay Karamcheti and Andrew Chien. A comparison of architectural support for messaging in the TMC CM-5 and the Cray T3D. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA '95), pages 298--307, Santa Margherita Ligure, Italy, June 1995. Available from http://www-csag.ucsd.edu/papers/cm5-t3d-messaging.ps.


The Design, Implementation, and Evaluation of Jade - Rinard, Lam (1998)

....4D/340 [Baskett et al. 1988], for message-passing machines such as the Intel iPSC/860 [Berrendorf and Helin 1992], and for heterogeneous networks of workstations. While no implementation currently exists for shared memory machines with incoherent caches such as the Cray T3D [Arpaci et al. 1995; Karamcheti and Chien 1995], it would be possible to implement Jade on such machines. 5.1 Overview. Strictly speaking, there are two Jade implementations: one for shared memory platforms and one for message passing platforms. While each implementation is tailored for its own specific computational environment, the ....

....use by subsequently created tasks. 5.3.2 Extensions for Incoherent Caches. In this section we have assumed that the hardware fully implements the abstraction of a single shared address space. Machines with incoherent caches, however, only partially implement this abstraction [Arpaci et al. 1995; Karamcheti and Chien 1995]. These machines automatically fetch and cache remote memory, but rely on software to keep the caches consistent. While no Jade implementation currently exists for machines with incoherent caches, we believe that Jade could be a useful programming language for such machines. The most difficult ....

Karamcheti, V. and Chien, A. 1995. A comparison of architectural support for messaging in the TMC CM-5 and the Cray T3D. In Proceedings of the 22nd International Symposium on Computer Architecture. ACM, New York.


Binary-Swap Volumetric Rendering on the T3D - Hansen, Krogh, Painter, de..

....so as not to involve the YMP. Also, the black art of the MPP environment variables is less than satisfying. In some cases, one must tune these variables just to get the program to run which we find unacceptable. We would like to see CRI support a fast native message passing environment such as FM[7]. On the positive side, the data network is much faster than on other MPPs; most notably the CM 5. Also, floating point performance was easier to obtain with the DEC Alpha processors than with the CM 5 vector units when programming in MIMD model. In general, vector unit memory management problems ....

Vijay Karamcheti and Andrew Chien. A Comparison of Architectural Support for Messaging on the TMC CM-5 and the Cray T3D. In Proceedings of ISCA, 1995.


High Performance Messaging on Workstations: - Illinois Fast Messages

....unique goal [12, 25, 30, 31] but FM is distinguished by its hardware context (Myrinet) and high performance. The Fast Messages project focuses on optimizing the software messaging layer that resides between lower level communication services and the hardware. It is available on both the Cray T3D [22, 23] and Myricom's Myrinet [6]. Using the Myrinet, FM provides MPP-like communication performance on workstation clusters. FM on the Myrinet achieves low latency, high bandwidth messaging for short messages, delivering 32 µs latency and 16 MBytes/s bandwidth for 128 byte packets (user level to ....

....SBus bandwidth, and hence are not a critical performance factors. 3 The Fast Messages Approach 3.1 Illinois Fast Messages (FM) 1. 0 Illinois Fast Messages (FM) is a high performance messaging layer which is available on several parallel platforms (Cray T3D and workstation clusters) [22, 23]. The design goal of FM is to deliver network hardware performance to the application level with a simple interface. FM is appropriate for implementors of compilers, language runtimes, communications libraries, and in some cases application programmers. Function Operation FM send ....

Vijay Karamcheti and Andrew A. Chien. A comparison of architectural support for messaging on the TMC CM-5 and the Cray T3D. In Proceedings of the International Symposium on Computer Architecture, 1995. Available from http://www-csag.cs.uiuc.edu/papers/cm5-t3d-messaging.ps.


Portable and Scalable Algorithms for Irregular All-to-All.. - Liu, Wang, Prasanna (1996)   (4 citations)

....communication: number of message start-ups (the first term in Equation 1) and node contention. The communication features of the state-of-the-art HPC platforms such as CM5, SP2, T3D, and workstation clusters interconnected by an ATM network or Myrinet are shown in Table 1. The table is based on [5, 6, 7, 8, 9, 10] and our own measurements. It should be noted that the numbers vary depending on the version of the software environment used for message passing. Figure 1. A scenario depicting ....

V. Karamcheti and A. A. Chien, "A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D," Proc. of ISCA'95, June 1995.


Design and Evaluation of Network Interfaces for System Area.. - Mukherjee (1998)

....bus NI, such as CNI 32 Q m . 5.5 Related Work To the best of my knowledge, this work is the first to systematically identify, examine, and explore the data transfer and buffering parameters that underlie the design of high performance NIs for fine grain communication. Karamcheti and Chien [58] compared the messaging support in TMC CM 5 and Cray T3D and concluded that requiring processor involvement for message reception can significantly degrade performance. I improve upon their work by exposing and examining the design space of data transfer and buffering parameters. Blumrich, et al. ....

Vijay Karamcheti and Andrew A. Chien. A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 298--307, 1995.


Thal: An Actor System For Efficient And Scalable Concurrent.. - Kim (1997)   (8 citations)

....name server. 1 To our best knowledge, no current generation implementation supports a full 64 bit address space. The communication module and the program load module may be built on top of any well defined low level messaging layer, such as TCP IP, Active Messages [130, 106] and Fast Messages [76]. They make actors an illusion of the completely connected network. Moreover, the hierarchical organization offers the runtime system some degree of network independence, and thus, portability. The node manager delivers messages from remote nodes, creates actors in response to remote requests, and ....

V. Karamcheti and A.A. Chien. A Comparison of Architectural Support for Messaging on the TMC CM-5 and the Cray T3D. In Proceedings of International Symposium of Computer Architecture, 1995.


Maintaining Cache Coherence through Compiler-Directed Data.. - Lim, Yew (1998)

....prefetch queue. The processor then explicitly extracts the prefetched word from the queue when it is needed. The prefetch queue can only store 16 words of data. When the prefetch queue is full, no more data can be prefetched until the prefetched words are extracted from the queue. Previous studies [1, 16] indicated that the overhead of interacting with the DTB Annex and the prefetch queue is significant. Software support for shared address space and data prefetching is provided. The programmer can use a compiler directive in the Cray MPP Fortran (CRAFT) language [8] to declare shared data and to ....

V. Karamcheti and A. Chien. A comparison of architectural support for messaging in the TMC CM-5 and the Cray T3D. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 298--307, June 1995.


Compiling For Distributed Memory Multiprocessors Based On Access.. - Paek (1997)

....dependence analyzers [12, 69] can handle non affine expressions, the other complexities involved in these access patterns prevent them from privatizing X to parallelize this loop. 2.1. 2 Communication Analysis Communication optimization for distributed memorymachines, such as the Cray T3D [5, 23, 39] and the SGI Origin [65] showed a need for gathering even more precise array access information for supporting efficient data movementand copying between distributed memories. For instance, for data movement in these systems, the communication analyzer often needs to selectively decide between ....

V. Karamcheti and A. Chien. A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D. International Symposium on Computer Architecture, June 1995.


Alleviating Consumption Channel Bottleneck in Wormhole-Routed.. - Basak, Panda (1997)   (3 citations)

....of the communication bandwidth offered by the expensive fast interconnect to remain unutilized. With low maximum message rate per node, the probability of consumption delay can be small leading to limited benefits from higher number of consumption channels. Several research studies are ongoing [27, 19, 26, 6] for designing messaging protocols and hardware to reduce such overheads. In the near future we expect efficient protocols to offer t s in the range of few tens to hundreds of cycles. Similarly, in hardware based implementations of Distributed Shared Memory (DSM) systems with small messages for ....

V. Karamcheti and A. A. Chien. A Comparison of Architectural Support for Messaging on the TMC CM-5 and the Cray T3D. In Proc. of the Int. Symp. on Computer Architecture, pages 298--307, 1995.


SUPPLE: an Efficient Run-Time Support for Non-Uniform Parallel .. - Orlando, Perego (1996)   (2 citations)

....release CRI EPCC 1.3, developed by the Edinburgh Parallel Computing Centre in collaboration with Cray Research Inc. We have also conducted some tests by exploiting a kernel MPI library built on the top of the Fast Messaging (FM) layer developed at the University of Illinois at Urbana Champaign [17]. The experiments are concerned with a SUPPLE implementation of the benchmark illustrated in Section 2 on a set of synthetic data sets presented in Section 4.1. We have also compared the SUPPLE implementation with an HPF style one. For the HPF implementation we have used the Cray CRAFT Fortran, ....

....their optimality in terms of percentage with respect to the optimum time, then for = 0:038 this percentage ranges from 6 to 9:4 , while for = 0:3 it ranges between 0:88 and 1:69 . We have carried out some other tests by adopting an MPI kernel build on the top of the FM messaging layer [17]. More specifically, we used the Pull FM layer, which improves performance in the presence of a network traffic which may produce output contention. Note that, since the workload in our tests is concentrated, our support may cause output contention because many messages are sent to the same ....

V. Karamcheti and A. A. Chien, "A comparison of architectural support for messaging on the TMC CM-5 and Cray T3D," in Proceedings of the 22nd Annual International Symposium on Computer Architecture, New York, June 22--24 1995, pp. 298--307, ACM Press.


Mechanisms for Efficient, Protected Messaging - Lee

.... Mhz MDP 43 cyc[10] Streaming Injection 1024 max round trip null RPC CS 2 40 Mhz 20 S[39] Channel SPARC 24.6 S[23] DMA w active message Hardware Table Lookup 174 S[21] PARMACS macros ping pong 206 S[20] mpsc library mesg exhange T3D 150 Mhz 21064 600nS[26] Shared Memory 2048 max remote read 2:76 S [40] Fast Messages F I Specific 16 byte Fetch and Increment Hardware Support 120 S[26] Interrupt Driven User Level Message Message Handler T 88100MP dispatch 20 cyc microthreading remote load [18] NOW HP9000 735 50 S [24] LAN based 125Mhz PA RISC 7150 sockets on Active Message cluster of ....

Vijay Karamcheti, Andrew A. Chien, "A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D", in ISCA 1995, pp. 298--308.


A Survey of User-Level Network Interfaces for System Area Networks - Mukherjee (1997)   (7 citations)

....Second, with limited buffering and large bursts of messages a common occurrence in loosely synchronized parallel applications a processor must constantly monitor ULNI status changes and remove messages from the limited ULNI buffers to avoid clogging up the network. Karamcheti and Chien [28] have shown that processor performance can degrade significantly if it is required to constantly monitor ULNI status in this fashion. Third, since ULNIs provide direct access to the network without operating system intervention, it must be virtualized to allow multiple processes to access the ....

Vijay Karamcheti and Andrew A. Chien. A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 298--307, 1995.


LoGPC: Modeling Network Contention in Message-Passing Programs - Moritz, Frank (1998)   (13 citations)

....algorithms where our approach is based on queueing models and attempts to model the constant factors that are of concern to application programmers and machine designers. A number of researchers have examined application performance in an empirical setting. For example, Karamcheti et al. [16] studied the network interface architectures in the CrayT3D and TMC CM 5 and examined several messaging implementations for reducing output contention effects. Holt et al. [13] studied the performance of cache coherent distributed shared memory machines using four parameters similar to LogP. They ....

V. Karamcheti and A. A. Chien, "A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D", Proc. of the ISCA'95, Santa Margherita Ligure Italy, 1995.


Evaluating High Level Parallel Programming Support for.. - Chien, al. (1997)   (3 citations)

....issues. Specific techniques include: aggressive flow sensitive interprocedural analysis [17, 18]; directed cloning and optimization (procedure and object inlining) [6, 20]; compiler managed locality and memory latency management [27]; efficient, robust communication primitives [12, 14]; hybrid stack heap execution (efficient multithreading) [14, 19]; view caching [13]. 1.2 Application Suite. In this paper, we use a suite of seven irregular applications to evaluate parallel programming support in ICC++. Table 1 briefly describes the applications. Although spanning ....

Vijay Karamcheti and Andrew A. Chien. A comparison of architectural support for messaging on the TMC CM-5 and the Cray T3D. In Proceedings of the International Symposium on Computer Architecture, 1995. Available from http://www-csag.cs.uiuc.edu/papers/cm5-t3d-messaging.ps.


Efficient Support of Location Transparency in Concurrent.. - Wooyoung Kim (1995)   (6 citations)

....is [Figure 2: Internal structure of the runtime kernel, showing the memory management module, name server, actor interface, node manager, dispatcher, network interface, communication module, and program load module.] straightforward as long as a well defined messaging layer is supported (for example, [34, 19, 20]). Messages in Hal have some unique properties. In particular, all actor messages have a destination mail address and a method selector. Many of them may also contain a continuation address. These properties are exploited in the implementation of communication module by customizing the CMAM layer ....

V. Karamcheti and A.A. Chien. A Comparison of Architectural Support for Messaging on the TMC CM-5 and the Cray T3D. In Proceedings of International Symposium of Computer Architecture, 1995.


The Impact of Data Transfer and Buffering Alternatives on.. - Mukherjee (1998)   (6 citations)

....that transferring messages in cache block units and buffering messages in coherent memory space can improve performance. However, they neither examined alternative block transfer or buffering mechanisms nor evaluated the key parameters that affect the performance of such NIs. Karamcheti and Chien [21] compared the messaging support in TMC CM 5 and Cray T3D and concluded that requiring processor involvement for message reception can significantly degrade performance. We improve upon their work by exposing and examining the design space of data transfer and buffering parameters. Blumrich, et al. ....

Vijay Karamcheti and Andrew A. Chien. A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 298--307, 1995.


Managing Concurrent Access for Shared Memory Active Messages - Lumetta, Culler (1998)   (21 citations)

....of time, a behavior termed blocking. Non blocking algorithms [9, 10, 17, 19] hence guarantee that some process makes progress in a finite amount of time, which implies that they do not enforce mutual exclusion. The remaining algorithms do not use locks but can still result in blocking behavior [2, 12, 22]. We follow Valois [22] and adopt the term lock free for this third category. Non blocking algorithms are advantageous on multiprogrammed systems, since locks interact poorly with timesharing. These algorithms follow a common design strategy and are simpler than their optimized locking ....

....deliver superior results for one-to-one communication, but can be detrimental for more complex communication patterns. Polling each additional queue incurs a significant fraction of total message overhead in user level communication layers [15]. Both Brewer et al. [2] and Karamcheti and Chien [12] address concurrent message queues on the Cray T3D, a NUMA machine, with algorithms very similar to ours. Their algorithms use remote FETCH INCREMENT support to claim queue entries from a static queue, but rely on the receiver to reset the queue after all entries have been claimed and processed. ....

[Article contains additional citation context not shown here]

V. Karamcheti and A. A. Chien. A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D. In Int. Symp. on Comp. Arch., pp. 298--307, June 1995.


Synchronization and Communication in the T3E Multiprocessor - Scott (1996)   (52 citations)

....communication, Shmem [9]. Shmem is a shared memory based message passing library that supports direct memory to memory transfers without involving the operating system. Researchers at Illinois have also found the shared memory instrumental in achieving good messaging performance [22]. The interconnection network has also proven to be a strength. The 3D torus is wiring efficient [1] and scales well to large numbers of processors, providing sub-microsecond access latencies and a bisection bandwidth of over 70 GB/s with 1024 processors. The T3D is the only machine with a ....

....processor. Since these are special hardware resources, they must be protected by the operating system. The message queue also requires OS involvement on the receiving side, as user and OS messages share the same queue, significantly increasing message latency. The Illinois messaging implementation [22] did not use the dedicated messaging hardware. The DTB Annex allows a single DTB entry to map a physical page on all processors in a parallel program 3 , but every processor must use the same mapping. So while DTB coverage is significantly amplified, memory management is inflexible; moving a ....

[Article contains additional citation context not shown here]

Karamcheti, V. and A. A. Chien, "A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D", Proc. 22nd International Symposium on Computer Architecture, pp 298-307, 1995.


High Performance Messaging on Workstations: Illinois Fast.. - Pakin, Lauria, Chien (1995)   (275 citations)

....unique goal [12, 25, 30, 31] but FM is distinguished by its hardware context (Myrinet) and high performance. The Fast Messages project focuses on optimizing the software messaging layer that resides between lower level communication services and the hardware. It is available on both the Cray T3D [22, 23] and Myricom's Myrinet [6]. Using the Myrinet, FM provides MPP-like communication performance on workstation clusters. FM on the Myrinet achieves low latency, high bandwidth messaging for short messages, delivering 32 µs latency and 16 MBytes/s bandwidth for 128 byte packets (user level to ....

....SBus bandwidth, and hence are not a critical performance factors. 3 The Fast Messages Approach 3.1 Illinois Fast Messages (FM) 1. 0 Illinois Fast Messages (FM) is a high performance messaging layer which is available on several parallel platforms (Cray T3D and workstation clusters) [22, 23]. The design goal of FM is to deliver network hardware performance to the application level with a simple interface. FM is appropriate for implementors of compilers, language runtimes, communications libraries, and in some cases application programmers. Function Operation FM send ....

Vijay Karamcheti and Andrew A. Chien. A comparison of architectural support for messaging on the TMC CM-5 and the Cray T3D. In Proceedings of the International Symposium on Computer Architecture, 1995. Available from http://www-csag.cs.uiuc.edu/papers/cm5-t3d-messaging.ps.


A Compiler-Directed Cache Coherence Scheme Using Data Prefetching - Lim, Yew (1997)

....to be prefetched. Each prefetch instruction transfers one 64-bit word of data from the memory of a remote PE to the local PE's prefetch queue. The processor then extracts the prefetched word from the queue when it is needed. The prefetch queue can only store 16 words of data. Previous studies [1, 9] indicated that the overhead of interacting with the DTB Annex and the prefetch queue is significant. Software support for shared address space and data prefetching is provided. The programmer can use a compiler directive in the Cray MPP Fortran (CRAFT) language [6] to declare shared data and to ....

V. Karamcheti and A. Chien. A comparison of architectural support for messaging in the TMC CM-5 and the Cray T3D. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 298--307, June 1995.


Global Address Space, Non-Uniform Bandwidth: A Memory System.. - Stricker, Gross (1997)

....we discuss the important parameters relevant to the memory and communication system interface in Section 3. In addition to the technical reference material of the vendors (DEC 8400 [8, 6, 9, 7], Cray T3D [8, 1, 3], Cray T3E [12, 4]), other research groups evaluated some aspects of these machines [2, 10, 15]. An empirical study comparing the two Alpha processors based on standard benchmarks provides useful insights using performance metrics not related to the memory system [5]. Our goal is to measure and compare the memory systems of these modern parallel systems. Since it is nowadays usual to use a ....

V. Karamcheti and A. Chien. A comparison of architectural support for messaging on the TMC CM-5 and Cray T3D. In Proc. 22nd Intl. Symposium on Computer Architecture, pages 298--307, Santa Marguerita di Ligure, June 1995. ACM.


Benefits of Processor Clustering in Designing Large Parallel.. - Basak, Panda (1996)

....in the range of few to tens of microseconds, can severely limit the rate at which messages can be sent or received from a processor [17]. This can lead to much of the communication bandwidth offered by the expensive fast interconnect to remain unutilized. Several research studies are ongoing [18, 12] for designing messaging protocols and hardware to reduce such overheads. However, even with efficient protocols these overheads cannot be eliminated. Even though the overheads get lowered, these will continue to be reasonably high compared to network speeds. For example, on NCUBE 2 messaging ....

....are attached to each node (cluster) of the torus and share a common hardware interface to the network. Each processor has a private memory. For communication across processors we used the remote memory access routines shmem_put() and shmem_get(), offering the maximum communication bandwidth [2, 18]. 9.2 Traffic patterns/applications. We selected three traffic patterns/applications [19]: Bit Permute Complement exchanges (BPC), Fast Fourier Transform (FFT), and LU matrix decomposition. These are briefly discussed below: 1. BPC (Bit permute complement). A communication round in BPC involves a ....

V. Karamcheti and A. A. Chien. A Comparison of Architectural Support for Messaging on the TMC CM-5 and the Cray T3D. In Proc. of the Int. Symp. on Computer Architecture, 1995.


Efficient Layering for High Speed Communication: Fast Message .. - Lauria, Pakin, Chien (1998)   (34 citations)  Self-citation (Chien)

....other hand, if a messaging layer's guarantees are too strong (i.e. they provide more functionality than is generally needed), the messaging layer's common case performance may be needlessly degraded. Analysis of the literature and our ongoing studies to support fine grained parallel computing [5, 12, 13, 14] have led to the conclusion that a low level messaging layer should provide the following key guarantees: reliable delivery, in-order delivery, and control over scheduling of communication work (decoupling). As mentioned in the previous section, studies of communication software costs [12] ....

V. Karamcheti and A. A. Chien. A comparison of architectural support for messaging on the TMC CM-5 and the Cray T3D. In Proceedings of the International Symposium on Computer Architecture, pages 298-307, 1995. Available from http://www-csag.cs.uiuc.edu/papers/cm5-t3d-messaging.ps.


An Advanced Compiler Framework for Noncache-coherent Multiprocessors - Paek   Self-citation (Architecture)

....codes for execution. However, several recent studies on NCC machines imply that using a shared memory model with one sided communication primitives may be a better way to program an NCC machine than using a message passing model with two sided communication. For instance, some experimental studies [3, 13, 25] with micro benchmarking indicated that one sided implementations are likely more efficient on NCC machines than two sided implementations because they utilize the architectural features of the machines more efficaciously . In addition, other studies [8, 15] concluded that (1) Put Get is ....

V. Karamcheti and A. Chien. A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D. International Symposium on Computer Architecture, June 1995.


FM-QoS: A Quality of Service Messaging Substrate for.. - Connelly (1999)   Self-citation (Chien Messaging)

....API, but also to applications written to a wide range of higher level communication APIs such as MPI, SHMEM Put/Get, and Global Arrays [5, 7]. To implement these higher level APIs efficiently, FM must provide the right set of delivery guarantees: too weak or too strong will reduce performance. See [6, 8] for more detailed discussion of these issues. FM provides the following guarantees to enable simple, high performance implementations of a wide range of user-level APIs: in-order delivery, reliable delivery, and decoupling of the host processor and the network. Together, these ....

Karamcheti, V. and A.A. Chien. A comparison of architectural support for messaging on the TMC CM-5 and the Cray T3D. in International Symposium on Computer Architecture. 1995.


High Level Parallel Programming: The Illinois Concert.. - Chien, Dolby, Ganguly, .. (1998)   (3 citations)  Self-citation (Karamcheti)

....may block which allows forwarding on the stack. Table 1: Various thread interaction schemas in the hybrid stack heap execution model. 4.5 Fast Communication and Thread Scheduling. To support fine grained, distributed programs efficiently, the Concert implementation is built atop Fast Messages (FM) [20], which utilizes novel implementation techniques such as receiver initiated data transfer to support high performance messaging in the face of irregular communication that is unsynchronized with ongoing computation (a consequence of our dynamic programming model). These low overhead, robust ....

Vijay Karamcheti and Andrew A. Chien. A comparison of architectural support for messaging on the TMC CM-5 and the Cray T3D. In Proceedings of the International Symposium on Computer Architecture, 1995. Available from http://www-csag.cs.uiuc.edu/papers/cm5-t3d-messaging.ps.


Optimizing COOP Languages: Study of a Protein Dynamics.. - Zhang, Karamcheti, Ng.. (1996)   Self-citation (Karamcheti Uiuc)

....The runtime system exposes specialized versions [22] of important runtime primitives, such as remote method invocation and synchronization via futures [15] to the compiler to exploit compile time information. In addition, communication is realized via low overhead messaging layers: Fast Messages [23, 24] on the CRAY T3D and Active Messages [48] on the TMC CM 5. In addition to the standard features of the programming model, IC CEDAR also utilizes general placement directives of collection of objects (similar to map arrays in HPF) for spatial based object distribution and grouping. 3. IC CEDAR ....

....accesses are to objects already accessed in the same time step. Although multithreading in the execution model is effective to hide the communication and remote invocation latency, the processor overhead of communication and synchronization dominates. Even with a low overhead communication layer [23], an access to a remote object involves sending and receiving two messages (request and reply) and the execution of a remote accessor handler, costing nearly 12 microseconds of total processor overhead on the T3D. In addition, a remote invocation causes processor synchronization overhead when the ....

Vijay Karamcheti and Andrew A. Chien. A comparison of architectural support for messaging on the TMC CM-5 and the Cray T3D. In Proceedings of the International Symposium on Computer Architecture, 1995. Available from http://www-csag.cs.uiuc.edu/papers/cm5-t3d-messaging.ps.


Communication Characterization of a Cray T3D - Vlaovic (1997)   Self-citation (Cray)

....processor, synonyms can occur. If a run-time table of Annex entries is kept, performing the lookup in software would likely cause greater delay than simply updating the Annex register (23 cycles). So it is their conclusion that only one Annex register is required. This is corroborated by work in [8, 10]. The performance of cached reads, uncached reads, and remote writes was also presented. An uncached read costs roughly 610 ns, while a cached read required 765 ns to complete. A remote write needs roughly 880 ns. All of these measurements were performed on accesses to an adjacent node, for ....

....through operating system invocation, achieves the highest transfer rate at roughly 140 MB/s for 512 KB reads. From Culler's measurements, for reads of more than 16 KB, the BLT mechanism achieves better bandwidth than the other methods. Another inspection of the Cray T3D hardware shell is done in [8]. In this work, Chien compares the T3D's communication hardware to that of the Thinking Machines CM-5. Because the CM-5 is focused on supporting the data-parallel programming model, it requires processor intervention to process incoming messages. Therefore, if the processors are working ....

V. Karamcheti and A. Chien. A comparison of architectural support for messaging in the TMC CM-5 and Cray T3D. In International Conference on Computer Architecture, pages 298--307, 1995.


Coherent Network Interfaces for Fine-Grain Communication - Mukherjee, Falsafi, al. (1996)   (25 citations)  Self-citation (Cm)

....up into the network. CNI16Qm further simplifies software flow control in the messaging layer by allowing messages to smoothly overflow to main memory when the device cache fills. This avoids processor intervention for message buffering, which, otherwise, could significantly degrade performance [25]. Block Transfer. The increase in bandwidth obtained by transferring messages in whole cache-block units has a major impact on performance. Gauss and moldyn do bulk transfers and appbt communicates with moderately large (128-byte) shared-memory blocks. Gauss performs a one-to-all broadcast of a ....

Vijay Karamcheti and Andrew A. Chien. A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 298--307, 1995.


Supporting High Level Programming with High.. - Chien, Dolby.. (1997)   (3 citations)  Self-citation (Chien)

....may block which allows forwarding on the stack Table 1. Various thread interaction schemas in the hybrid stack-heap execution model. 3.5 Fast Communication and Thread Scheduling. To support fine-grained, distributed programs efficiently, the Concert implementation is built atop Fast Messages (FM) [24], which utilizes novel implementation techniques such as receiver-initiated data transfer to support high-performance messaging in the face of irregular communication that is unsynchronized with ongoing computation (a consequence of our dynamic programming model). These low-overhead, robust ....

V. Karamcheti and A. A. Chien. A comparison of architectural support for messaging on the TMC CM-5 and the Cray T3D. In Proceedings of the International Symposium on Computer Architecture, 1995. Available from http://www-csag.cs.uiuc.edu/papers/cm5-t3d-messaging.ps.


FM-QoS: Real-time Communication Using Selfsynchronizing.. - Connelly, Chien (1997)   (7 citations)  Self-citation (Chien)

....but also to applications written to a wide range of higher-level communication APIs such as MPI, SHMEM Put/Get and Global Arrays [21, 4]. To implement these higher-level APIs efficiently, FM must provide the right set of delivery guarantees. Too weak or too strong will reduce performance. See [17, 25] for more detailed discussion of these issues. FM provides the following guarantees to enable simple, high-performance implementation of a wide range of user-level APIs: in-order delivery, reliable delivery, and decoupling of the host processor and the network. Together, these guarantees ....

V. Karamcheti and A. A. Chien. A comparison of architectural support for messaging on the TMC CM-5 and the Cray T3D. In International Symposium on Computer Architecture, 1995.


View Caching: Efficient Software Shared Memory for Dynamic.. - Karamcheti, Chien (1997)   (2 citations)  Self-citation (Chien)

....of these components; application knowledge about data access semantics drives this customization, focusing particularly on decreasing message traffic and required synchronization. Reducing synchronization enables use of one-sided messages that can be efficiently supported in software (see [13]) or directly in hardware using the put/get support available in several current-day parallel machines. To keep customization manageable, our framework only permits selection from among a predefined set of component implementations by specifying values for a series of parameters. As we shall see ....

....System [6] which consists of an optimizing compiler and a high-performance runtime. Programmer annotations guide selection of view caching protocols. We consider four alternative DSM implementations: 1. Object-consistent DSM (messaging) (OC-M) protocols use a messaging interface to the hardware [13]. 2. Object-consistent DSM (put/get) (OC-P) protocols utilize the T3D's remote memory access capabilities. 3. View caching (messaging) (VC-M). 4. View caching (put/get) (VC-P). All implementations use identical messaging and put/get interfaces, so performance differences are entirely attributable ....

V. Karamcheti and A. A. Chien. A comparison of architectural support for messaging on the TMC CM-5 and the Cray T3D. In Proceedings of the International Symposium on Computer Architecture, 1995.


Runtime Mechanisms for Efficient Dynamic Multithreading - Karamcheti, Plevyak, Chien (1996)   (7 citations)  Self-citation (Chien)

....that protocols involving sends from within message handlers must be deadlock-free. This is another place where the availability of a compiler helps reduce the cost of primitive runtime mechanisms: a compiler can enforce the required discipline. The T3D implementation of the Fast Messages interface [30] makes use of hardware support for fetch-and-increment and remote memory access [34] to perform buffer management and data transfer without involving the destination processor. This decouples the sending processor from destination processor activity, improving communication performance. The ....

....buffering resources (e.g., by reclaiming consumed buffers) and any delay in this participation holds up the senders. The performance degradation due to unresponsive receivers and output contention even with modest fan-in can be severe, increasing send overheads by up to an order of magnitude [30]. Our solution exploits hardware support on the T3D (also present in several current and likely future machines) to build a distributed message queue with lazy receiver-initiated data transfer, which decouples senders from receivers and eliminates output contention. Using the T3D's atomic swap ....

[Article contains additional citation context not shown here]

Karamcheti, V., and Chien, A. A. A comparison of architectural support for messaging on the TMC CM-5 and the Cray T3D. In Proceedings of the International Symposium on Computer Architecture (1995).


Exploring Structured Adaptive Mesh Refinement (SAMR).. - Ganguly, Bryan.. (1997)   Self-citation (Uiuc)

....utilizes stack-based sequential execution and creates threads lazily only when required. The runtime system exposes a hierarchy of high-performance COOP primitives [11] to allow compile-time specialization. In addition, communication is realized via low-overhead messaging layers: Fast Messages [12, 13] on the CRAY T3D and Active Messages [24] on the TMC CM-5. 3 Application Structure. Our implementation of SAMR code was originally written in C (making use of Fortran libraries) and then ported to the Illinois Concert system for parallel execution. We first discuss the basic structure of the ....

Vijay Karamcheti and Andrew A. Chien. A comparison of architectural support for messaging on the TMC CM-5 and the Cray T3D. In Proceedings of the International Symposium on Computer Architecture, 1995. Available from http://www-csag.cs.uiuc.edu/papers/cm5-t3d-messaging.ps.


Optimizing COOP Languages: Study of a Protein Dynamics.. - Zhang, Karamcheti, Ng.. (1996)   Self-citation (Chien)

....accesses incurs significant processor overhead for communication and synchronization. A remote object access involves sending and receiving two messages and a handler execution, costing nearly 12 microseconds of total processor overhead on the T3D even with a low overhead communication layer [9]. In addition, a remote invocation causes processor synchronization overhead when the current computation blocks, requiring a thread context switch. Without effective node level reuse, the overhead of communication and synchronization limits the parallel efficiency of the force kernel to 40 on 16 ....

V. Karamcheti and A. A. Chien. A comparison of architectural support for messaging on the TMC CM-5 and the Cray T3D. In ISCA'95, 1995.


Fast Messages (FM): Efficient, Portable Communication.. - Pakin, Karamcheti, Chien (1997)   (8 citations)  Self-citation (Karamcheti)

....the other hand, if a messaging layer's guarantees are too strong (i.e., they provide more functionality than is generally needed), the messaging layer's common-case performance may be needlessly degraded. Analysis of the literature and our ongoing studies to support fine-grained parallel computing [12, 28, 29, 30] have led to the conclusion that a low-level messaging layer should provide the following key guarantees: • Reliable delivery, • Ordered delivery, and • Control over scheduling of communication work (decoupling). Previous studies of communication cost in the CM-5 multicomputer system [28] ....

.... [Figure 2: Push messaging (labels: source, destination, data transfer, remote stores, extract)] While push messaging minimizes latency at low network loads, performance can degrade if there is output contention or if the receiver allows its incoming buffers to fill by not servicing the network often enough [29]. If messages arrive at a receiver faster than the receiver's memory can process them or the receiver extracts them, the writes back up into the network, adding to network contention and increasing average latency. This effect has been observed in many irregular parallel computations, and provides ....

[Article contains additional citation context not shown here]

Vijay Karamcheti and Andrew A. Chien. A comparison of architectural support for messaging on the TMC CM-5 and the Cray T3D. In Proceedings of the International Symposium on Computer Architecture, 1995. Available from http://www-csag.cs.uiuc.edu/papers/cm5-t3d-messaging.ps.


A Simulation Access Language and Framework with Applications to.. - Cheng (2004)   (1 citation)

No context found.

V. Karamcheti and A. A. Chien. "A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D," Proceedings of ISCA 95, Santa Margherita, Italy, pp. 298-307, 1995.


Unresponsiveness-Tolerant Collective Communication - Pakin (2001)

No context found.

Vijay Karamcheti and Andrew A. Chien. A comparison of architectural support for messaging on the TMC CM-5 and the Cray T3D. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA '95), pages 298--307, Santa Margherita Ligure, Italy, June 1995. Available from http://www-csag.ucsd.edu/papers/cm5-t3d-messaging.ps.


A Lightweight Idempotent Messaging Protocol for Faulty - Brown (2002)   (1 citation)

No context found.

Vijay Karamcheti, Andrew A. Chien, "A Comparison of Architectural Support for Messaging in the TMC CM-5 and the Cray T3D", Proc. ISCA 1995, pp. 298-307.
