13 citations found.
Eric L. Boyd, John-David Wellman, Santosh G. Abraham, and Edward S. Davidson. Evaluating the Communication Performance of MPPs Using Synthetic Sparse Matrix Multiplication Workloads. In Proc. of International Conference on Supercomputing 93, pages 240--250, 1993.


This paper is cited in the following contexts:
Experience with Fine-Grain Communication in EM-X.. - Sato, Kodama..

....mechanisms provided in parallel machines, but also the relative usefulness of various mechanisms provided in the system, as evidenced by their impact on application performance. Sparse matrix computations have previously been used in the context of evaluating uniprocessor cache performance. Boyd [2] proposed a method to evaluate the communication performance using a simple parallel sparse matrix multiplication as a synthetically generated workload. We begin our discussion in Section 2 by giving a brief overview of the EM-X architecture and its programming environment. Section 3 describes ....

Eric L. Boyd, John-David Wellman, Santosh G. Abraham, and Edward S. Davidson. Evaluating the Communication Performance of MPPs Using Synthetic Sparse Matrix Multiplication Workloads. In Proc. of International Conference on Supercomputing 93, pages 240--250, 1993.


A Comparative Evaluation of Techniques for Studying.. - Anand.. (1994)   (3 citations)

....and/or the hardware would also incur the cost of redesigning the input models. 4 Experimentation 4.1 Overview The experimentation technique for evaluating parallel systems uses real or synthetic workloads and measures their performance on actual hardware. For instance, several studies [22, 11, 47, 49] experiment with the KSR-1 hardware for evaluating its computation, communication and scalability properties. The scalability of the KSR-1 is studied in [47] using applications drawn from the NAS benchmark suite [9]. Similarly, an experimental evaluation of the computation and communication ....

....to compare the KSR-1 and DASH multiprocessors. Experimentation has also been used to study the performance and scalability of specific system artifacts such as locality, synchronization, and interconnection network. The interconnection network and locality properties of the KSR-1 are studied in [22, 11, 49]. Lenoski et al. [36] study the performance and scalability of the cache, synchronization primitives and the interconnection network of DASH. They implement artificial workloads which exercise different synchronization alternatives and the prefetch capabilities of the DASH prototype, and measure ....

[Article contains additional citation context not shown here]

E. L. Boyd, J-D. Wellman, S. G. Abraham, and E. S. Davidson. Evaluating the communication performance of MPPs using synthetic sparse matrix multiplication workloads. In Proceedings of the ACM 1993 International Conference on Supercomputing, pages 240--250, July 1993.


Micro Benchmark Analysis of the KSR1 - Rafael Saavedra (1993)   (17 citations)

....is 5 megabytes for one node without contention to about 4 megabytes when 32 nodes are active, and declining to less than 2.5 megabytes per second per node with a load of 60 nodes. 7. Related Work Recently several other researchers have been investigating the performance of the KSR1. Boyd et al. [1] show a method of measuring communications performance on multiprocessors using a synthetic workload based on matrix multiplication of generated matrices. Other researchers [10, 9] have also reported on experiments to measure the performance effects of specific features of the KSR1. The ....

Boyd, E., Wellman, J.D., Abraham, S., and Davidson, E., "Evaluating the Communication Performance of MPPs Using Synthetic Sparse Matrix Multiplication Workloads", Proc. of the 7th ACM Int. Conf. on Supercomputing, Tokyo, Japan, July 1993.


Modelling of Communication Contention in Multiprocessors - Tron, Plateau (1994)

.... [16] Boyd et al. developed a benchmark in which the user can control some communication parameters, such as the average number of point-to-point data communications per processor, the degree of sharing (the number of variables read but not owned by a processor), and the computation-to-communication ratio [3]. However, all these parameters are averages: the user controls the average traffic in the network, but not its distribution. This benchmark uses synthetic sparse matrix multiplication. 2.3 Conclusion If classical benchmarks are useful to evaluate the efficiency of a single ....

....they are difficult to use for evaluation of communication in a parallel machine. Genesis tries to fill this gap, but it includes a generic method only for point-to-point communications. One initiative to introduce contention in evaluation of communication is synthetic sparse matrix multiplication [3], but it gives results only for one scheme of communication. 3 Methodology for Studying Communication in a Network Under Contention In our experiment, the load of a network is defined as the average load of each physical link in the network. The load of a physical link is defined as the number ....

E. L. Boyd, J. D. Wellman, S. G. Abraham, and E. S. Davidson. Evaluating the Communication Performance of MPPs Using Synthetic Sparse Matrix Multiplication Workloads. In ICS93, 1993.


The Effects of Thread Placement on the KSR1 - Wagner Smirni

....data in a multithreaded application. The results indicate that strategic thread placement across multiple rings of the KSR1 can substantially reduce memory access time for shared data items. Related work includes studies which focus on the implementation of specific applications on the KSR1 [3, 4], analysis using micro benchmarks [5] and performance studies of the KSR1 [6, 7]. 2 Memory architecture The general KSR architecture is a multiprocessor system composed of a hierarchy of rings of proces .... [Figure: ACE:0 Ring A, ACE:0 Ring B] ....

E. Boyd, J. Wellman, S. Abraham, and E. Davidson, "Evaluating the communication performance of MPPs using synthetic sparse matrix multiplication workloads," in Proceedings of the International Conference on Supercomputing, 1993.


Characterizing the Performance Space of Shared Memory.. - Rafael Saavedra (1993)   (6 citations)

....to the results we observe. After that, the memory micro benchmark is described, followed by the results from its use. 2. KSR1 and DASH Architectural Descriptions In this section we briefly present the architectures of the KSR1 and the Stanford Dash. More details about the KSR1 can be found in [Boyd93, KSR92, Rost93, Saav93b]. The Dash architecture is fully described in [Leno92a, Leno92b]. The descriptions given here are included to help the reader understand the performance results we present later for these machines. 2.1. Architecture of the KSR1 The KSR1 is organized as a ring of rings, with up to thirty-two ....

....far have data contention of degree no greater than two. We have performed other experiments where we increased the amount of contention and will report on them later. 6. Related Work Recently several other researchers have been investigating performance of the KSR1 and DASH machines. Boyd et al. [Boyd93] show a method of measuring communications performance on multiprocessors using a synthetic workload based on matrix multiplication of generated matrices. Singh et al. [Sing93] present the performance results of several kernel codes and some of the SPLASH benchmark suite on the KSR1 and DASH ....

Boyd, E., Wellman, J.D., Abraham, S., and Davidson, E., "Evaluating the Communication Performance of MPPs Using Synthetic Sparse Matrix Multiplication Workloads", Proc. of the 7th ACM Int. Conf. on Supercomputing, Tokyo, Japan, July 1993.


Data Distributions For Sparse Matrix Vector Multiplication - Romero, Zapata (1995)   (15 citations)

....in the distributed memory. The most important drawback of this approach is the large number of messages that are generated as a consequence of accessing a distributed data addressing table. In fact, the communications have a dominant impact on the performance of massively parallel processors [7]. Besides, this table occupies a relevant amount of memory. In order to enable the compiler to apply more optimizations and simplify the task of the programmer, Bik and Wijshoff [4] have implemented a restructuring compiler which automatically converts programs operating on dense matrices into ....

E.L. Boyd, J.D. Wellman, S.G. Abraham and E.S. Davidson, "Evaluating the Communication Performance of MPPs Using Synthetic Sparse Matrix Multiplication Workloads", ACM Int'l Conf. on Supercomputing, (Tokyo), July 1993.


Evaluating the Effect of Auto-update on the Kendall Square.. - Harzallah, Li, Sevcik (1993)

....it has been shown that delaying the poststore operation increases the probability of benefiting from the broadcast, thereby improving the performance, the issue of how long this delay should be and whether this optimization can be automated is still an open question. Boyd et al. [BWAD93] propose a methodology to evaluate the communication performance of MPPs. They consider a single-ring KSR1 system, and construct a set of synthetic workloads for which they vary the average amount of communication per processor, the degree of sharing among the processors, and the computation to ....

E. L. Boyd, J. Wellman, S. G. Abraham, and E. S. Davidson. Evaluating the communication performance of MPPs using synthetic sparse matrix multiplication workloads. In International Conference on Supercomputing. ACM, July 1993.


Sparse Block and Cyclic Data Distributions for Matrix.. - Asenjo Romero (1995)   (3 citations)

.... is the large number of messages that are generated as a consequence of accessing a distributed data addressing table, and its associated overhead of memory (value-based distributions [13]). In fact, the communications have a dominant impact on the performance of massively parallel processors [6]. Besides, this table occupies a relevant amount of memory. In order to enable the compiler to apply more optimizations and simplify the task of the programmer, Bik and Wijshoff [4] have implemented a restructuring compiler which automatically converts programs operating on dense matrices into ....

E.L. Boyd, J.D. Wellman, S.G. Abraham and E.S. Davidson, "Evaluating the Communication Performance of MPPs Using Synthetic Sparse Matrix Multiplication Workloads", ACM Int'l Conf. on Supercomputing, (Tokyo), July 1993.


Data and Program Restructuring of Irregular Applications for.. - Karen Tomko (1994)   (7 citations)  Self-citation (Abraham)

....Sequent Symmetry and SGI Power Challenge Series, and cache-only memory architectures (COMA) like the Kendall Square Research KSR1 and KSR2. The Kendall Square Research KSR1 was used to evaluate our methods. We give a brief description of the architecture here, much of which has been taken from [19, 4]. The KSR1 is characterized by a hierarchical ring interconnection network and cache-only memory architecture. Each cell, consisting of a 20 megahertz processor, a 512 kilobyte subcache, and a 32 megabyte local cache, is connected to a unidirectional pipelined slotted ring. Up to 32 processors may ....

Eric Boyd, John-David Wellman, Santosh G. Abraham, and Edward Davidson. Evaluating the communication performance of MPPs using synthetic sparse matrix multiplication workloads. In Proceedings of the International Conference on Supercomputing, pages 240--250, 1993.


Partitioning Regular Applications for Cache-Coherent.. - Karen Tomko (1994)   Self-citation (Abraham)

....in exclusive mode in order to be updated. A state diagram of the coherency protocol for the KSR1 is given in Figure 3. 3.1 KSR1 Architecture The Kendall Square Research KSR1 was used to evaluate our methods. We give a brief description of the architecture here, much of which has been taken from [31, 8]. The KSR1 is characterized by a hierarchical ring interconnection network and cache-only memory architecture. Each cell, consisting of a 20 megahertz processor, a 512 kilobyte subcache, and a 32 megabyte local cache, is connected to a unidirectional pipelined slotted ring. Up to 32 processors may ....

Eric Boyd, John-David Wellman, Santosh Abraham, and Edward Davidson. Evaluating the communication performance of MPPs using synthetic sparse matrix multiplication workloads. In Proceedings of the International Conference on Supercomputing, pages 240--250, 1993.


Modeling Computation and Communication Performance of.. - Boyd, Abandah, Lee..   (2 citations)  Self-citation (Boyd Davidson)

....and the Thinking Machines CM5 [7][8]. We believe that it can also be extended to shared memory MPPs, such as the Kendall Square Research KSR2, the Convex Exemplar, and the Cray T3D [11][12][13][14] by using techniques to expose and characterize the implicit communication, as demonstrated in [6][9][10]. All experiments in this paper were run on an IBM SP2 with 32 Thin Node 66 POWER2 processors running the AIX 3.2.5 operating system. An overview of the SP2 architecture is given in Section 2. The single node SP2 (POWER2) MACS model is detailed in Section 3, with extensions to include the ....

E. L. Boyd, J. D. Wellman, S. G. Abraham, E. S. Davidson. "Evaluating the Communication Performance of MPPs Using Synthetic Sparse Matrix Multiplication Workloads," Proceedings of the 1993 International Conference on Supercomputing, July, 1993, pp. 240-250.


Modeling Load Imbalance and Fuzzy Barriers for Scalable.. - Eichenberger, Abraham (1995)   (2 citations)  Self-citation (Abraham)

....For example, an algorithm using near neighbor communication generates communication proportional to p d and p d=p when partitioned along one and two dimensions respectively, where d is the dimension size. Other architectural and hardware factors such as network topology and automatic update [14] also have a secondary effect on $Events$. 4.1 Performance Distribution Assuming that the delays are independent identically normally distributed random variables $X_1, \ldots, X_{Events}$ with parameters $\mu_{event}$ and $\sigma^2_{event}$: $Events$ number of communication events for one ....

....is a cache line of 16 words, and therefore the total number of communication events per processor ($Events$) is equal to $4\lceil d_y/16 \rceil$. On the KSR1, we estimated the standard deviation of a single communication event ($\sigma_{event}$) by running our program for 56 processors and obtained 17 s [14]. Figure 9 illustrates the idle time generated by communication delays in the SOR program on the KSR1. We notice that the idle time indeed varies as described in Equation (9). For example, multiplying the data size ($d_y$) by four doubles the idle time spent at a synchronization barrier. 7.3 An ....

E. L. Boyd, J.-D. Wellman, S. G. Abraham, and E. S. Davidson. Evaluating the communication performance of MPPs using synthetic sparse matrix multiplication workloads. Proceedings of the International Conference on Supercomputing, pages 240--250, July 1993.
