### Table 5.1: Characteristics of the parallel machines used in our study.

1996

### Table 1 presents the equivalent parallel machine implied by …

### Table 1. Timings (in seconds) for the NAS CG benchmark on 3 parallel machines. (Column headers: Parallel [machine], Number of [processors], Baseline, Improvements, Final.)

1995

"... In PAGE 12: ... We ran the benchmark problem on two massively parallel machines: the 1024-processor nCUBE 2 at Sandia and the 128-node Intel iPSC/860 at NASA/Ames. The timing results for the benchmark calculation are shown in Table 1. Five timings are given for each machine.... ..."

Cited by 32

### Table 1. The BSP cost parameters for a variety of shared and distributed-memory parallel machines.

1997

"... In PAGE 3: ... Similarly, [10] shows how careful construction of barriers can reduce the value of l. Table 1 shows the values of l and g for a variety of parallel machines (the benchmarks used to calculate these constants are described in [7]). Returning to the problem of summing n values posed at the start of this section, it is natural to distribute the data amongst the processors in n/p-sized chunks, when n > p.... In PAGE 3: ... Combining the cost of locally summing each processor's n/p-sized chunk of data with the cost of the summation of p values gives a total cost for summing n values on p processors of n/p + log p (1 + g + l). It is clear from this cost formula, and from the values of l and g in Table 1, that the logarithmic number of barrier synchronisations used in this algorithm dominates the cost unless n > p log p (1 + g + l). For a network of eight workstations, therefore, n must be greater than 20,000,000 elements before the computation time starts to dominate the communication time; even for an eight-processor Cray T3D, n must be greater than 4,200.... In PAGE 8: ... In general g and l are functions of p but, for purpose-built parallel machines, they are sub-linear in p. For example, Table 1 shows that g is approximately constant for the Cray T3E and l is logarithmic in p. Therefore, to provide a meaningful lower bound on the speedup, upper bounds on the values of l and g can be used as long as the dependence is not too great, as in the case of the Cray systems.... In PAGE 8: ... However, due to the shared-bus nature of Ethernet, only a single pair of processors can be involved in communication at any time. This can be observed in Table 1 as g ∝ p and l ∝ p log p for full h-relations, where the constants of proportionality are half the values of g and l for a two-processor configuration.
The speedup can now be refined to: k₁ n log n / (k₂ (n/p) log n + k₁ p² log p² + p² g₂ + 0.5 n g₂ + 1.5 l₂ p log p)  (6). For reasonably large p and n ≫ p², this simplifies to:... ..."
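The BSP cost formula quoted above (total cost n/p + log p (1 + g + l) for summing n values, with computation dominating only once n > p log p (1 + g + l)) can be sketched directly. The function names below and any parameter values are illustrative assumptions, not the benchmarked constants from the paper's Table 1.

```python
import math

def bsp_sum_cost(n, p, g, l):
    """BSP cost of summing n values on p processors, per the
    snippet: n/p local additions, then a logarithmic combining
    tree whose log p levels each cost (1 + g + l).
    (Function name is a hypothetical label for this sketch.)"""
    return n / p + math.log2(p) * (1 + g + l)

def computation_dominates_from(p, g, l):
    """Rough threshold on n beyond which local computation
    dominates communication: n > p * log p * (1 + g + l)."""
    return p * math.log2(p) * (1 + g + l)
```

With g = l = 0 (an idealised machine) the cost of summing 8 values on 8 processors reduces to 1 local addition plus 3 combining steps; plugging in measured g and l for a real machine reproduces the kind of thresholds the snippet cites for the workstation network and the Cray T3D.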

Cited by 8

### Table 4. The speedups and maximum speedups for the test sample under the mapping/load-balancing methods on an SP2 parallel machine

"... In PAGE 22: ...4. Comparisons of the speedups under the mapping/load-balancing methods for the test sample. The speedups and the maximum speedups under the mapping/load-balancing methods on the 10, 30, and 50 processors of an SP2 parallel machine for the test sample are shown in Table 4. From Table 4, we can see that if the initial mapping .... In PAGE 22: ... is performed by a mapping method (for example, AE/ORB) and the same mapping method or a load-balancing method (DD, MD, PCMPLB) is performed for each refinement, the proposed load-balancing method has the best speedup among the mapping/load-balancing methods. From Table 4, we can also see that if the initial mapping is performed by a mapping method (for example, AE/ORB) and the same mapping method or a load-balancing method (DD, MD, PCMPLB) is performed for each refinement, the proposed load-balancing method has the best maximum speedup among the mapping/load-balancing methods.... ..."
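The speedup metric compared in Table 4 is the standard ratio of serial to parallel runtime; a minimal sketch, assuming the "maximum speedup" is the best ratio observed across refinement steps (the timing values below are hypothetical, not the SP2 measurements):

```python
def speedup(t_serial, t_parallel):
    """Classic speedup: serial runtime divided by parallel runtime."""
    return t_serial / t_parallel

def max_speedup(t_serial, parallel_timings):
    """Best speedup over a sequence of per-refinement parallel
    timings (an assumed reading of 'maximum speedup' here)."""
    return max(t_serial / t for t in parallel_timings)

# Hypothetical timings: 100 s serial, three refinement runs in parallel.
s = speedup(100.0, 10.0)                      # 10x on one run
m = max_speedup(100.0, [20.0, 10.0, 25.0])    # best of the three runs
```

Under this reading, a load-balancing method "has the best maximum speedup" when its best per-refinement ratio exceeds that of the competing mapping/load-balancing methods.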

Cited by 1