J.-C. Wang, T.-H. Lin, and S. Ranka. Distributed Scheduling of Unstructured Collective Communication on the CM-5. Technical Report CRPC-TR94502, Syracuse University, Syracuse, NY, 1994.
.... of this algorithm were developed (still dependent upon network topology), such as the Optimal Circuit Switched, Hypercube, or Mesh Algorithm [38, 10, 25, 37, 12, 13, 14, 1, 32, 23, 34, 19, 24], the Pairwise Exchange (PEX) algorithm [43, 41, 42], and the general Linear Permutation algorithm [45]. For our comparison, we consider the standard algorithm consisting of $p$ steps, such that during step $i$ ($0 \le i \le p - 1$), processor $j$ sends data labelled for processor $k = i \oplus j$ directly to $P_k$. Figure 3 presents the results of our comparison, providing empirical support for the notion ....
J.-C. Wang, T.-H. Lin, and S. Ranka. Distributed Scheduling of Unstructured Collective Communication on the CM-5. Technical Report CRPC-TR94502, Syracuse University, Syracuse, NY, 1994.
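The step rule quoted in the excerpt above ($k = i \oplus j$) can be illustrated with a minimal sketch. The function names `xor_schedule` and `is_contention_free` are hypothetical, and the sketch assumes that $p$ is a power of two so that every XOR result is a valid processor index; the excerpt itself does not state that restriction.

```python
def xor_schedule(p):
    """Return schedule[i][j]: the destination of processor j during step i.

    Assumption: p is a power of two, so i ^ j always lies in range(p).
    """
    if p <= 0 or p & (p - 1):
        raise ValueError("p must be a power of two")
    return [[i ^ j for j in range(p)] for i in range(p)]


def is_contention_free(schedule):
    """True if, in every step, each processor is the target of exactly one send."""
    p = len(schedule)
    return all(sorted(step) == list(range(p)) for step in schedule)


if __name__ == "__main__":
    schedule = xor_schedule(8)
    for i, step in enumerate(schedule):
        print(f"step {i}: {step}")
    print("contention free:", is_contention_free(schedule))
```

In this sketch step 0 maps each processor to itself (a local copy), and every later step pairs the processors off, so no node receives from two senders at once.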
....algorithms was done by Goldman, Peters and Trystram [5]. 1.1 Related works. Several algorithms for personalized communications tend to be network or machine dependent [1, 4, 6]; only some of them [12, 3] are scalable across a number of different platforms. Ranka and Wang [12, 13] (see also [11, 8]) proposed two methods for solving the h-relation. The first one uses a fixed pattern based on the linear permutation with pairwise exchanges. The other approach was a random scheduling that avoids node contention. Bader, Helman and JáJá [3] proposed a two-phase method that outperforms the ....
....the problem all elements are supposed to have the same size. Without loss of generality assume that $h$ is an integral multiple of $p$. The processors will be labeled $P_0, \ldots, P_{p-1}$. 2.1 Fixed pattern algorithm. The linear permutation (LP) algorithm was proposed by Ranka, Lin and Wang [11]. The algorithm is suitable for a power-of-two number of processors. In the LP algorithm each processor $P_i$ sends the messages to processor $P_{i \oplus k}$ and receives the messages from $P_{i \oplus k}$, where $0 < k < p$ (see figure 1). For all processors, in parallel do $j = i \oplus k$; $P_i$ receives the ....
J-C Wang, T-H Lin, and S. Ranka. Distributed scheduling of unstructured collective communication on the CM-5. Technical report, School of Computer and Information Science, Syracuse University, July 1993.
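The LP excerpt above is cut off in the middle of its pseudocode, so the following is only a sequential simulation of the pairwise exchange it describes, not the authors' implementation. The function name `lp_exchange` and the nested-list message layout are assumptions made for illustration, as is treating $k = 0$ as a local copy.

```python
def lp_exchange(outgoing):
    """Simulate a linear permutation exchange on p processors.

    outgoing[i][j] is the message processor i holds for processor j.
    Returns incoming, where incoming[i][j] is the message processor i
    received from processor j. Assumption: p is a power of two.
    """
    p = len(outgoing)
    if p == 0 or p & (p - 1):
        raise ValueError("number of processors must be a power of two")
    incoming = [[None] * p for _ in range(p)]
    for i in range(p):
        # k = 0 would pair a processor with itself: keep the message locally.
        incoming[i][i] = outgoing[i][i]
    for k in range(1, p):
        # On a real machine this inner loop runs on all processors in parallel.
        for i in range(p):
            j = i ^ k                          # exchange partner in step k
            incoming[i][j] = outgoing[j][i]    # P_i receives from P_j
    return incoming


if __name__ == "__main__":
    p = 4
    outgoing = [[f"{i}->{j}" for j in range(p)] for i in range(p)]
    for row in lp_exchange(outgoing):
        print(row)
```

Because each step pairs $P_i$ with $P_{i \oplus k}$ and the partner relation is symmetric, after $p - 1$ steps every processor has exchanged messages with every other processor exactly once.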
....three fourths of the maximum user payload bandwidth per processor of 12 MB/s per processor [28]. This is consistent with the results achieved by other research teams that have achieved 6.4 MB/s per processor (Culler at UC Berkeley, [11]) and 7.72 MB/s per processor (Ranka at Syracuse University, [46]) for similar data movements on the CM-5. Note that some of these cited results are for low-level implementations using message passing algorithms. For large enough data sets, the SP-2 achieves greater than 24.8 MB/s per processor for the matrix transpose algorithm, using a high performance switch ....
J.-C. Wang, T.-H. Lin, and S. Ranka. Distributed Scheduling of Unstructured Collective Communication on the CM-5. Personal Communication, 1994.
.... of this algorithm were developed (still dependent upon network topology), such as the Optimal Circuit Switched, Hypercube, or Mesh Algorithm [37, 10, 25, 36, 12, 13, 14, 1, 31, 23, 33, 19, 24], the Pairwise Exchange (PEX) algorithm [43, 41, 42], and the general Linear Permutation algorithm [45]. For our comparison, we consider the standard algorithm consisting of $p$ steps, such that during step $i$ ($0 \le i \le p - 1$), processor $j$ sends data labelled for processor $k = i \oplus j$ directly to $P_k$. Figure 3 presents the .... Footnote 4: Note that the personalized communication is more general than a ....
J.-C. Wang, T.-H. Lin, and S. Ranka. Distributed Scheduling of Unstructured Collective Communication on the CM-5. Technical Report CRPC-TR94502, Syracuse University, Syracuse, NY, 1994.
....when the block size is 1 or 2. In cyclic distribution we had the worst performance. Also, the simple storage scheme gave better results than the two optimized schemes. Many-to-many personalized communication: For many-to-many personalized communication, the linear permutation scheduling algorithm [7] using active messages [3] was used. We have results similar to those for local computation. That is, as mask density increases, the compact message scheme gives a better performance than the simple storage scheme. Generally, as the block size increases, the compact message scheme gives a better ....
J-C. Wang, T-H. Lin, and S. Ranka. Distributed Scheduling of Unstructured Collective Communication on the CM-5. In Proc. of Hawaii International Conference on System Sciences, 1993.
.... Many-to-Many Personalized Communication. In our experiments the result vector in PACK (the input vector in UNPACK) was fixed to be distributed in block, and the linear permutation scheduling algorithm [9] using active messages [4] was used for many-to-many personalized communication. Readers are referred to [1] for a comparison of different communication scheduling algorithms. Results for the communication step are similar to those for the local computation. That is, as mask density increases the ....
J-C. Wang, T-H. Lin, and S. Ranka. Distributed Scheduling of Unstructured Collective Communication on the CM-5. In Proc. of Hawaii International Conference on System Sciences, 1993.
....Also, for exact timing on the CM-5, processor 0 was always selected as a hot processor. For the transportation primitive, linear permutation using active messages and a randomized distributed algorithm using active messages were used in the direct algorithm and the two-stage algorithm, respectively [21, 22, 15]. Also, for the vector prefix reduction sum, a CM-5 global function was used [4]. 6.1 Performance Evaluation of Ten Schemes. In each category we used various hot degrees $\delta = 0.0, 0.2, 0.4, 0.6, 0.8, 1.0$ and fixed the value of $H$ to 1 in order to compare the performances of ten different schemes ....
J-C. Wang, T-H. Lin, and S. Ranka. Distributed Scheduling of Unstructured Collective Communication on the CM-5. In Proc. of Hawaii International Conference on System Sciences, 1993.