T.T. Kwan, B.K. Totty and D.A. Reed, Communication and computation performance of the CM-5, in: Proc. Supercomputing '93, (1993), 192-201.

This paper is cited in the following contexts:

k-ary n-trees: High Performance Networks for Massively.. - Petrini, Vanneschi (1997)   (1 citation)

....adopted by several parallel computers as the Connection Machine CM-5 [10], the Data Diffusion Machine [15] and the Meiko CS-2 [14]. Unfortunately, not much is known on the communication performance of the fat trees. Most of the literature deals with the CM-5 and focuses on raw network performance [7], [12], [13]. Typical communication patterns include simple sends and ping-pong between pairs of nodes. Block permutations of data and grid shifts have been shown to have little or no contention on the CM-5. This makes the data network very efficient for regular communication patterns commonly used ....

T. T. Kwan, B. K. Totty, and D. A. Reed. Communication and Computation performance of the CM-5. In Supercomputing '93, pages 192--201, November 1993.


Image Feature Extraction on Connection Machine CM-5 - Viktor Prasanna (1994)   (1 citation)

....of size m between a pair of PNs takes $T_d + m\tau_d$ time using the data network. 3. Suppose each PN has m units of data to be routed to a single destination using the data network and the set of all destinations is a permutation, then the data can be routed in $T_d + m\tau_d$ time. It has been measured [5] that the startup time $T_d$ is around 40-90 μsec, depending on the communication primitives used. These times are measured by sending a 0-byte message between two PNs using the data network. For regular data communication patterns, $\tau_d$ is in the range of 0.100 to 0.123 μsec/byte [5]. This model favors ....

....has been measured [5] that the startup time $T_d$ is around 40-90 μsec, depending on the communication primitives used. These times are measured by sending a 0-byte message between two PNs using the data network. For regular data communication patterns, $\tau_d$ is in the range of 0.100 to 0.123 μsec/byte [5]. This model favors communicating long messages over communicating a large number of short messages. A unit of data is defined as a fixed-size data structure to contain image data (a contour pixel, a label, etc.) in our analysis. Using this model, we can quantify the communication time and predict the ....

T. Kwan, B. Totty, and D. Reed, "Communication and Computation Performance of the CM-5," Proc. of Supercomputing '93, pages 192-201, 1993.
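
The cost model quoted in the excerpts above (a fixed startup time plus a per-byte cost times the message size) can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the additive form is read from the surrounding description, and the parameter values are assumed midpoints of the quoted ranges, taken to be in microseconds and microseconds per byte.

```python
# Sketch of the linear cost model: time = startup + m * per_byte.
# All names are illustrative; values are assumed midpoints of the quoted ranges.

STARTUP_US = 65.0      # assumed midpoint of the quoted 40-90 microsecond startup time
PER_BYTE_US = 0.1115   # assumed midpoint of the quoted 0.100-0.123 per-byte cost


def transfer_time_us(m_bytes, startup_us=STARTUP_US, per_byte_us=PER_BYTE_US):
    """Estimated time in microseconds to send m_bytes between two nodes."""
    return startup_us + m_bytes * per_byte_us


def cost_per_byte_us(m_bytes):
    """Effective cost per byte, including the amortized startup time."""
    return transfer_time_us(m_bytes) / m_bytes


if __name__ == "__main__":
    # The effective per-byte cost falls as messages get longer.
    for m in (16, 1024, 65536):
        print(m, round(cost_per_byte_us(m), 4))
```

Because the startup term is paid once per message, the effective per-byte cost shrinks as the message grows, which is why the model favors a few long messages over many short ones.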


Network Performance under Physical Constraints - Petrini, Vanneschi (1997)

....the property that the overall communication bandwidth remains constant at each level. Other references to fat trees include [5], [6]. Unfortunately, not much is known on the communication performance of the fat trees. Most of the literature deals with the CM-5 and focuses on raw network performance [7], [8], [9]. Thanks to their simplicity and expandability, low-dimensional cubes have been adopted as interconnection networks by many massively parallel machines. In the Stanford Dash there are two distinct cubes that support the cache coherence mechanisms [10]: one is dedicated to the requests and ....

T. T. Kwan, B. K. Totty, and D. A. Reed, "Communication and Computation performance of the CM-5," in Supercomputing'93, pp. 192--201, November 1993.


Efficient Personalized Communication on Wormhole Networks - Petrini, Vanneschi (1997)

....processors to route the message in the ascending and descending phases. Other references to fat trees include [30], [31]. Unfortunately, not much is known on the communication performance of the fat trees. Most of the literature deals with the CM-5 and focuses on raw network performance [32], [33], [34]. Typical communication patterns include simple sends and ping-pong between pairs of nodes. Block permutations of data and grid shifts have been shown to have little or no contention on the CM-5. This makes the data network very efficient for regular communication patterns commonly used ....

T. T. Kwan, B. K. Totty, and D. A. Reed, "Communication and Computation performance of the CM-5," in Supercomputing'93, pp. 192--201, November 1993.


Communication Performance of Wormhole Interconnection Networks - Petrini (1997)

....processors to route the message in the ascending and descending phases. Other references to fat trees include [70], [88]. Unfortunately, not much is known on the communication performance of the fat trees. Most of the literature deals with the CM-5 and focuses on raw network performance [94], [103], [107]. Typical communication patterns include simple sends and ping-pong between pairs of nodes. Block permutations of data and grid shifts have been shown to have little or no contention on the CM-5. This makes the data network very efficient for regular communication patterns commonly ....

Thomas T. Kwan, Brian K. Totty, and Daniel A. Reed. Communication and Computation performance of the CM-5. In Supercomputing'93, pages 192--201, November 1993.


The Cranium Network Interface Architecture: Support for Message.. - McKenzie (1997)

....that are presented here. Network interfaces that were previously studied included those in the Thinking Machines CM-5, the Intel Paragon, the MIT J-Machine, the Motorola Star T, the Intel Touchstone Delta and the UW Meerkat. 6.4.1 Study #1: CM-5 vs. Paragon. This study by Kwan, Totty and Reed [94] focused on the measurements gathered from the CM-5 and the Paragon. Through the use of simple throughput and latency benchmarks, the authors demonstrated that the CM-5 achieves a throughput rating of 8 MB/sec out of 20 MB/sec maximum and the Paragon achieves 20 MB/sec out of 200 MB/sec maximum. ....

Thomas T. Kwan, Brian K. Totty and Daniel A. Reed. Communication and computation performance of the CM-5. Proc. of Supercomputing 93, Portland OR, November 1993, pp. 192-201.


Polygon Rendering For Interactive Visualization On Multicomputers - Ellsworth (1996)   (4 citations)

....and calculation occurs when each triangle is redistributed as soon as it is transformed, sending each in its own message. This is extremely expensive in most systems, where the cost per byte of small messages is much larger than the cost for large messages (as shown in Appendix B, [Bokh90], and [Kwan93]). Instead, several triangles should be buffered and then sent in a single message. The optimum size of the messages is discussed in section 4.4.7. Two previous multicomputer implementations that overlap the redistribution are Pixel-Planes 5 [Fuch89] and Crockett and Orloff's iPSC/860 ....

Kwan, Thomas T., Brian K. Totty, and Daniel A. Reed, "Communication and Computation Performance of the CM-5," in Proceedings of Supercomputing '93, Portland, Oregon, November 15--17, 1993, IEEE Computer Society Press, Los Alamitos, CA, 1993, pp. 192--201.


K-ary N-trees: High Performance Networks for Massively.. - Petrini, Vanneschi (1995)   (1 citation)

....processors to route the message in the ascending and descending phases. Other references to fat trees include [HH89, Ken91]. Unfortunately, not much is known on the communication performance of the fat trees. Most of the literature deals with the CM-5 and focuses on raw network performance [KTR93, LTD 92, MB94]. Typical communication patterns include simple sends and ping-pong between pairs of nodes. Block permutations of data and grid shifts have been shown to have little or no contention on the CM-5. This makes the data network very efficient for regular communication patterns ....

Thomas T. Kwan, Brian K. Totty, and Daniel A. Reed. Communication and Computation performance of the CM-5. In Supercomputing'93, pages 192--201, November 1993.


Machine independent Analytical models for cost evaluation.. - Pasetto, Vanneschi (1996)

....data from a single channel and send data to a single channel at a time. The model can be extended to use a family of functions indexed by the number of simultaneous input and output communications required. We use this function to model specific node architectures: for example, CM-5 nodes [KTR93] cannot do computation and communication in parallel; on such a machine the definition of P is simply the sum of all the parameters. Transputer-based systems instead [Hey89, Cok91] have a direct memory access engine for each communication channel (4 input and 4 output channels); this leads to ....

....is the number of PEs in dimension j) and i is the dimension through which we move data; this parameter heavily depends on network flow control and routing algorithm. We don't explicitly include contention in the network performance model because recent studies and experiments [BYA89, Dal90, AmKV93, KTR93, SGC93] show that, if the communication pattern does not exhibit hot spots, message latency rises very slowly until a specific load level (dependent on routing and topological solutions) is reached. If the load increases above the threshold the network quickly saturates. This means that, if we ....

[Article contains additional citation context not shown here]

Thomas T. Kwan, Brian K. Totty, and Daniel A. Reed. Communication and Computation Performance of the CM-5. In Supercomputing '93, pages 192--201, November 1993.


Thal: An Actor System For Efficient And Scalable Concurrent.. - Kim (1997)   (8 citations)

....the TMC CM-5. [Figure 6.2: The communication topology of the implementation of the broadcast primitive.] broadcast on the data network was more efficient when sending bulk data because the bandwidth of the data network is much higher than that of the control network [84]. Since Active Messages are not buffered [131], sending bulk data from one node to another requires a three-phase protocol. The source sends size information and the destination acknowledges with a buffer address. Then the source sends data without any concern about overflow. However, because ....

T. T. Kwan, B. K. Totty, and D. A. Reed. Communication and Computation Performance of the CM-5. In Proceedings of Supercomputing '93, pages 192--201, 1993.


Scalable Parallel Implementations of Perceptual Grouping on.. - Prasanna, Wang (1994)

....single destination and the set of all destinations is a permutation, then the data can be routed in $T_d + m\tau_d$ time using the data network. 4. Broadcasting a message containing m units of data from a PN to all PNs can be performed in $T_c + m\tau_c$ time using the control network. It has been measured [4] that the startup time $T_d$ is around 60-90 μsec and $T_c$ is around 2 μsec. These times are measured by sending a 0-byte message between two PNs using the data network or broadcasting a 0-byte message from a PN to all the PNs using the control network. The transmission rate $\tau_c$ has been observed to be ....

....time $T_d$ is around 60-90 μsec and $T_c$ is around 2 μsec. These times are measured by sending a 0-byte message between two PNs using the data network or broadcasting a 0-byte message from a PN to all the PNs using the control network. The transmission rate $\tau_c$ has been observed to be 1.25 μsec/byte [4]. For regular data communication patterns, $\tau_d$ is in the range of 0.100 to 0.123 μsec/byte [4]. Using this model, we can quantify the communication times and predict the running times of our implementations. A unit of data is defined as a fixed-size data structure to contain image data in our ....

[Article contains additional citation context not shown here]

T. Kwan, B. Totty, and D. Reed, "Communication and Computation Performance of the CM5," Supercomputing '93, pp. 192-201, 1993.
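
The per-byte transmission rates in the excerpts above can be turned into bandwidth figures for comparison with the megabytes-per-second numbers quoted by other citing papers. The sketch below assumes the rates are in microseconds per byte and takes 1 MB as 10^6 bytes; the function names are hypothetical.

```python
# Convert a per-byte cost into a bandwidth, and evaluate the broadcast model.
# Assumption: rates are in microseconds per byte, and 1 MB = 10**6 bytes.


def bandwidth_mb_per_s(per_byte_us):
    """Bytes per microsecond is numerically equal to megabytes per second."""
    return 1.0 / per_byte_us


def broadcast_time_us(m_bytes, startup_us=2.0, per_byte_us=1.25):
    """Control-network broadcast time: startup plus per-byte cost times size."""
    return startup_us + m_bytes * per_byte_us


if __name__ == "__main__":
    print("control network:", bandwidth_mb_per_s(1.25), "MB/s")
    print("data network:", bandwidth_mb_per_s(0.123), "to", bandwidth_mb_per_s(0.100), "MB/s")
    print("1 KB broadcast:", broadcast_time_us(1024), "microseconds")
```

Under these assumptions a control-network rate of 1.25 per byte corresponds to 0.8 MB/s, and the data-network range of 0.100 to 0.123 per byte corresponds to roughly 8 to 10 MB/s.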


Benchmarking the Computation and Communication Performance of.. - Kivanc Dincer (1996)   (1 citation)

....elimination code and give the corresponding real and estimated execution times in order to show the accuracy of the estimated performance figures. Related Work: There are numerous articles in the literature about benchmarking different aspects of recent parallel architectures or supercomputers [3, 4, 11, 12, 13, 14, 16]. There are also several benchmark suites specially developed to provide a common ground to test the performance of different high performance computers [1, 2, 10, 15]. Some of them investigate the use of real application programs, while others employ short kernel codes to evaluate the performance, ....

T.T. Kwan, B.K. Totty, and D.A. Reed. Communication and Computation Performance of the CM-5. In Proc. of Supercomputing 1993, pages 192--201 (1993).


Synchronized MIMD Computing - Kuszmaul (1994)   (3 citations)

....60 cycles for realistic packets with polling and requires hundreds of cycles with interrupts. Because of the prohibitive cost of interrupts, all of our experiments use polling. At 60 cycles per 16-byte packet, the payload bandwidth is limited to 8.8 megabytes per second. Kwan, Totty, and Reed [KTR93] measured the actual one-way bandwidth at 8.3 megabytes per second using Thinking Machines' message-passing library. These numbers only cover the case in which a processor is sending or receiving, however. When a processor is both sending and receiving, the bidirectional bandwidth is somewhere ....

....bandwidth matching, the network remains uncongested even though less polling occurs. Optimum performance requires both bandwidth matching and limited polling. Note that Strata sustains more bandwidth for all pairs than Kwan et al. saw for individual messages, 10.66 versus 10.4 megabytes per second [KTR93]. The net improvement over CMMD without barriers is about 390%. Although limited polling can improve performance, it is not very robust. When other cuts of the network, such as the bisection, become bottlenecks, limited polling causes congestion. We expect that limited polling is appropriate ....

T. T. Kwan, B. K. Totty, and D. A. Reed. Communication and computation performance of the CM-5. In Proceedings of Supercomputing '93, pages 192--201, November 1993.
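
The payload bandwidth limit quoted in the first excerpt above follows from the per-packet cycle cost. The sketch below reproduces that arithmetic. The 33 MHz node clock is an assumption that does not appear in the excerpt; it was chosen because it reproduces the quoted 8.8 megabytes per second. The function name is hypothetical.

```python
# Payload bandwidth implied by a fixed cycle cost per packet.
# Assumption: a 33 MHz node clock (not stated in the excerpt).

CLOCK_MHZ = 33.0        # assumed processor clock, in cycles per microsecond
CYCLES_PER_PACKET = 60  # quoted cost of handling one packet with polling
PAYLOAD_BYTES = 16      # quoted payload per packet


def payload_bandwidth_mb_per_s(cycles_per_packet, payload_bytes, clock_mhz):
    """Bytes per microsecond, which is numerically megabytes per second."""
    packets_per_us = clock_mhz / cycles_per_packet
    return packets_per_us * payload_bytes


if __name__ == "__main__":
    print(payload_bandwidth_mb_per_s(CYCLES_PER_PACKET, PAYLOAD_BYTES, CLOCK_MHZ))
```

With these inputs the measured one-way figure of 8.3 megabytes per second sits a little below the computed ceiling, as the excerpt describes.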


Language And Compiler Mechanisms For Parallel Programming With .. - Raghavachari (1998)

....communication in Ace uses the CM-5's Active Message library (CMAML) [95]. For reference, the overhead of sending an Active Message request is approximately 2.03 μs, that of processing a reply is 1.90 μs, and the achievable bandwidth for large transfers (using CMAML_scopy) is 8.4 MBytes/s [54]. The CM-5 has a dedicated network for collective communication, and therefore, implements these operations efficiently. Broadcasts can achieve a bandwidth of 0.8 MBytes/s, reductions have a latency of approximately 18.3 μs, and barrier synchronization, a latency of 5 μs [54]. On the CM-5, all ....

....scopy) is 8.4 MBytes/s [54]. The CM-5 has a dedicated network for collective communication, and therefore, implements these operations efficiently. Broadcasts can achieve a bandwidth of 0.8 MBytes/s, reductions have a latency of approximately 18.3 μs, and barrier synchronization, a latency of 5 μs [54]. On the CM-5, all application code as well as the runtime system code was compiled with gcc -O2 (version 2.6.2). The Cray T3E [86] is a distributed-memory machine with some hardware support for a shared address space. In contrast to the CM-5, communication is essentially a one-way operation; with ....

T. Kwan, B. Totty, and D. Reed. Communication and computation performance of the CM-5. In Supercomputing '93, Nov. 1993.


An Efficient Data Parallel Algorithm For 2-D Convolutions - Dykes

....the lowest level have a theoretical bandwidth of 20 MB/s, messages traversing up to the second level have a theoretical minimum bandwidth of 10 MB/s, and messages communicated through the third level and beyond have a theoretical minimum bandwidth of 5 MB/s. CM-5 performance studies by Kwan et al. [9] and Ponnusamy et al. [13] found that in the absence of contention the message transmission latencies and bandwidths are independent of partition size and network levels crossed. Under very light loads, a simple send and receive had a bandwidth of approximately 8.3 MB/s in both studies. Under full ....

....The CM-5 Control Network is responsible for communication patterns that may involve all the processors in a single operation, including synchronization, broadcast, reduction, and error signaling. Its design is optimized for low latency rather than high bandwidth. Measurements from Kwan et al. [9] show Control Network broadcast, reduction, and synchronization operations to be independent of partition size. Fastest of these operations is synchronization, requiring only 5 μs. Rapid synchronization is critical to data-parallel programs because all nodes are synchronized at the beginning of ....

[Article contains additional citation context not shown here]

T. T. Kwan, B. K. Totty and D. A. Reed, Communication and computation performance of the CM-5, Proc. Supercomputing '93, (1993) 192-201.


Parallelization of Perceptual Grouping on Distributed Memory.. - Cho-Li Wang (1995)

....further data communication. The algorithm takes advantage of the power of the CM-5 control network in performing broadcast, synchronization, and cooperative operations. In the CM-5, the above operations can be performed in constant time even as the machine size increases to fairly large numbers [5]. However, these operations take $O(\log P \times T_d)$ communication time in most distributed-memory machines using generic software approaches. If the workload in a processor is not a dominant part of the total parallel execution time, large speedups cannot be obtained if we use the same ....

T. Kwan, B. Totty, and D. Reed, "Communication and Computation Performance of the CM5," Proc. of Supercomputing '93, pp. 192-201, 1993.


Performance of the CM-5 Scalable File System - Thomas Kwan (1994)   (11 citations)  Self-citation (Kwan Reed)

....an asymptote. Because the CMMD synchronous broadcast mode uses both the data and control networks to coordinate data transmission and synchronization of the processors [2], and because the broadcast function of the CM-5's control network has a bandwidth of approximately 800 kilobytes/second [9], this limitation is a major cause for the asymptotes of Figure 3b. Figure 3a shows the data rate for synchronous broadcast writes. In contrast to synchronous broadcast reads, where the data rate quickly approaches an asymptote, the performance of broadcast writes continues to increase with array ....

Kwan, T. T., Totty, B. K., and Reed, D. A. Communication and Computation Performance of the CM-5. In Supercomputing 1993 (Nov 1993), pp. 192--201.


An Efficient Data Parallel Algorithm for 2-D Convolutions - Sandra Dykes Xiaodong

No context found.

T.T. Kwan, B.K. Totty and D.A. Reed, Communication and computation performance of the CM-5, in: Proc. Supercomputing '93, (1993), 192-201.


Comparative Evaluation and Case Studies of Shared-Memory.. - Data-Parallel Execution..

No context found.

T. T. Kwan, B. K. Totty and D. A. Reed, "Communication and computation performance of the CM-5", Supercomputing 93, IEEE Computer Society Press, November 1993, pp. 192-201.


Distributed Image Edge Detection Methods and Performance - Xiaodong Zhang Hong

No context found.

T. T. Kwan, B. K. Totty and D. A. Reed, "Communication and computation performance of the CM-5", Supercomputing 93, IEEE Computer Society Press, November 1993, pp. 192-201.


May 11, 1994 1 Measurements of Active Messages - Performance On The

No context found.

Thomas T. Kwan, Brian K. Totty, and Daniel A. Reed. Communication and Computation Performance of the CM-5. In Proceedings of Supercomputing'93 (Portland, OR, November 1993).


Performance Analysis of Wormhole Routed k-ary n-trees - Petrini, Vanneschi (1998)

No context found.

T. T. Kwan, B. K. Totty, and D. A. Reed, "Communication and Computation performance of the CM-5," In Supercomputing'93, pages 192--201, November 1993.


Communication and Computation Patterns of Large Scale Image.. - Sandra Dykes (1994)

No context found.

T. T. Kwan, B. K. Totty and D. A. Reed, "Communication and Computation performance of the CM-5," Proceedings of Supercomputing '93, (Nov. 1993), pp. 192-201.
