### Table 7.3: Values for the Single Node Model

### Table 2: Comparison between the model and simulation under single node accumulation traffic and single node broadcast traffic for mean delay, blocking probability (pb), and maximum outgoing link utilization error.

1997

"... In PAGE 7: ... and simulation histograms are equivalent to the ones for the single node accumulation tra c pattern shown in Figure 5. In Table2 , we compare the model and the simulation mean delay, blocking probability and outgoing link utilization. We found good agreement between the model and simulation for the mean delay ( ), and outgoing link utilization.... ..."

Cited by 2

### Table XV. Single-Node Performance of the Linear Algebra Kernels

1998

Cited by 3

### Table XVI. Breakdown of the Single-Node Time of the Linear Algebra Benchmarks

1998

Cited by 3

### Table XVII. Single-Node Performance of the Application Kernels

1998

Cited by 3

### Table XVIII. Breakdown of the Single-Node Time of the Application Kernel Benchmarks

1998

Cited by 3

### Table 14: Breakdown of the single-node time of the linear algebra benchmarks.

"... In PAGE 30: ... To understand the overhead incurred on the codes generated by the pgf90 and pghpf compilers, we measured the time spent on segments of the code that would cause communication when running on parallel nodes, and compared them with those of the sequential codes. Table14 lists the time breakdowns for the three versions of each benchmark. Table 14 shows that the overhead of the... In PAGE 30: ... Table 14 lists the time breakdowns for the three versions of each benchmark. Table14 shows that the overhead of the... ..."

### Table 13: GEMM-ratios on a single node of the Parsytec GC/PP for the original level 3 BLAS model implementations from netlib. Underlying routines: BP-DGEMM and original netlib BLAS (second column), BP-DGEMM, POL-DGEMV and original netlib BLAS (third column).

in GEMM-Based Level 3 BLAS: High-Performance Model Implementations and Performance Evaluation Benchmark

"... In PAGE 24: ...odel implementations. BP-DGEMM builds on the work in [8]. The remaining under- lying routines of the GEMM-based library are from the original level 1 and 2 BLAS model implementations. In Table13 we show the GEMM-ratios from a single node of Parsytec GC/PowerPlus. The comparison is here with the original level 3 BLAS model implementations from netlib.... In PAGE 24: ... The comparison is here with the original level 3 BLAS model implementations from netlib. In the right-most part of Table13 we show results where the original DGEMV has been replaced by POL-DGEMV. These clearly demonstrates the bene ts of using an optimized level 2 GEMV routine as well.... ..."

### Table 1: Results of computational benchmark for the mesh heat diffusion application, running on a single node of the IBM SP2 using Fortran. Grid size is 640,000 points. Times are in milliseconds.

1996

"... In PAGE 13: ... The computational benchmark measures values for the fol- lowing times: Toverhead (start and terminate process), Tinit (initialize grid val- ues), Tcomp (calculate new values for all grid points), and Toutput (output re- sults). Results are given in Table1 . Observe that results for this benchmark are independent of the choice of archetype implementation.... In PAGE 29: ...Time (msecs) Toverhead 100 Tread const 15 Tinit 222 Tcomp 427 Tcheck converge 134 Tcopy values 62 Toutput 15 Table1 0: Results of computational benchmark for the mesh-spectral Poisson solver application, running on a single node of the IBM SP2 using Fortran. Grid size is 800 by 800 points.... In PAGE 30: ...Measurement n = 1 n = 4 n = 9 n = 16 n = 25 n = 36 Toverhead 3850 9630 15950 27380 48330 86150 Tset mesh 0 0 0 0 0 0 Tblk to one 2928 1957 1365 1219 954 1241 Tone to blk 2951 2474 2051 2026 2222 1792 Tbdry exchg 0 11 13 14 16 20 Tbcast 0 4 14 35 75 145 Tdata bounds 0 0 0 0 0 0 Tglobal max dp 0 12 25 53 96 171 Tintersect 0 0 0 0 0 0 Tlocal pos 0 0 0 0 0 0 Tlocal to global 0 0 0 0 0 0 Tpack 0 0 0 0 0 0 Tunpack 0 0 0 0 0 0 Table1 1: Results of communication benchmark for the mesh-spectral Poisson solver application, running on the speci ed number of nodes on the IBM SP2 using Fortran M, without the crossbar switch. Grid size is 800 by 800 points.... In PAGE 34: ... The model also correctly predicts that overall the mesh-spectral version of the application per- forms better. 
Time (secs) on n nodes, not including host node:

| | n = 1 | n = 4 | n = 9 | n = 16 | n = 25 | n = 36 |
|---|---|---|---|---|---|---|
| Mesh-Spectral Expected Elapsed Time | 512 | 152 | 91 | 81 | 98 | 142 |
| Mesh-Spectral Actual Elapsed Time | 521 | 151 | 91 | 78 | 95 | 130 |
| Mesh Expected Elapsed Time | 496 | 149 | 94 | 78 | 91 | 91 |
| Mesh Actual Elapsed Time | 522 | 148 | 87 | 69 | 58 | 62 |

Table 13: Elapsed times for the mesh and mesh-spectral Poisson solver applications, running on the specified number of nodes (plus a "host" node for the mesh version) on the IBM SP2 using Fortran M, without the crossbar switch. Problem size is 800 by 800 grid points and 1000 steps.... In PAGE 35: ... Problem size is 800 by 800 grid points and 1000 steps. Times are in seconds. See Table 13 for the corresponding table.

Time (secs) on n nodes, not including host node:

| | n = 1 | n = 4 | n = 9 | n = 16 | n = 25 | n = 36 |
|---|---|---|---|---|---|---|
| Mesh-Spectral Expected Process Time | 508 | 142 | 75 | 54 | 49 | 55 |
| Mesh-Spectral Actual Process Time | 516 | 143 | 75 | 54 | 55 | 63 |
| Mesh Expected Process Time | 490 | 141 | 80 | 59 | 59 | 54 |
| Mesh Actual Process Time | 517 | 140 | 73 | 47 | 34 | 32 |

Table 14: Process times for the mesh and mesh-spectral Poisson solver applications, running on the specified number of nodes (plus a "host" node for the mesh version) on the IBM SP2 using Fortran M, without the crossbar switch. Problem size is 800 by 800 grid points and 1000 steps.... In PAGE 36: ...rams for this application appear in Appendix B.2. Applying our performance analysis for a particular implementation and architecture requires results from executing both benchmarks on an appropriate system. For the SP2, we can reuse the computational benchmark results shown in Table 10, since the target architecture is the same. We must, however, rerun the communication benchmark using the MPI-based archetype implementation....
In PAGE 37: ...

| Measurement | n = 1 | n = 4 | n = 9 | n = 16 | n = 25 | n = 36 |
|---|---|---|---|---|---|---|
| Toverhead | 3930 | 6030 | 10280 | 17130 | 26400 | 39130 |
| Tset mesh | 0 | 0 | 0 | 0 | 0 | 0 |
| Tblk to one | 3218 | 2235 | 1374 | 1276 | 1172 | 1026 |
| Tone to blk | 3243 | 1977 | 1829 | 1666 | 1750 | 1698 |
| Tbdry exchg | 0 | 4 | 4 | 4 | 4 | 5 |
| Tbcast | 0 | 0 | 1 | 1 | 2 | 3 |
| Tdata bounds | 0 | 0 | 0 | 0 | 0 | 0 |
| Tglobal max dp | 0 | 1 | 1 | 2 | 3 | 4 |
| Tintersect | 0 | 0 | 0 | 0 | 0 | 0 |
| Tlocal pos | 0 | 0 | 0 | 0 | 0 | 0 |
| Tlocal to global | 0 | 0 | 0 | 0 | 0 | 0 |
| Tpack | 0 | 0 | 0 | 0 | 0 | 0 |
| Tunpack | 0 | 0 | 0 | 0 | 0 | 0 |

Table 15: Results of communication benchmark for the mesh-spectral Poisson solver application, running on the specified number of nodes on the IBM SP2 using Fortran with MPI, without the crossbar switch. Grid size is 800 by 800 points.... In PAGE 37: ... Times are in milliseconds.

| Measurement | Time (msecs) |
|---|---|
| Toverhead | 150 |
| Tread const | 5 |
| Tinit | 271 |
| Tcomp | 941 |
| Tcheck converge | 265 |
| Tcopy values | 244 |
| Toutput | 40 |

Table 16: Results of computational benchmark for the mesh-spectral Poisson solver application, running on a single 166 MHz Pentium using Fortran. Grid size is 800 by 800 points.... In PAGE 38: ...

| Measurement | n = 1 | n = 2 | n = 4 | n = 6 | n = 8 | n = 9 |
|---|---|---|---|---|---|---|
| Toverhead | 3000 | 7530 | 9930 | 14830 | 18000 | 18150 |
| Tset mesh | 0 | 0 | 0 | 0 | 0 | 0 |
| Tblk to one | 3957 | 3805 | 3307 | 2310 | 2243 | 2133 |
| Tone to blk | 4115 | 3738 | 4563 | 4961 | 5230 | 5134 |
| Tbdry exchg | 0 | 11 | 15 | 19 | 209 | 296 |
| Tbcast | 0 | 0 | 2 | 3 | 5 | 5 |
| Tdata bounds | 0 | 0 | 0 | 0 | 0 | 0 |
| Tglobal max dp | 0 | 2 | 4 | 6 | 9 | 12 |
| Tintersect | 0 | 0 | 0 | 0 | 0 | 0 |
| Tlocal pos | 0 | 0 | 0 | 0 | 0 | 0 |
| Tlocal to global | 0 | 0 | 0 | 0 | 0 | 0 |
| Tpack | 0 | 0 | 0 | 0 | 0 | 0 |
| Tunpack | 0 | 0 | 0 | 0 | 0 | 0 |

Table 17: Results of communication benchmark for the mesh-spectral Poisson solver application, running on the specified number of nodes on a network of 166 MHz Pentiums using Fortran with MPI, communicating over 100 Mbps Ethernet.... In PAGE 39: ... For both architectures, our model predicts the scalability of the application pretty well, and it correctly predicts the expected performance difference between the two architectures.
Time (secs) on n nodes:

| | n = 1 | n = 2 | n = 4 | n = 6 | n = 8 | n = 9 | n = 16 | n = 25 | n = 36 |
|---|---|---|---|---|---|---|---|---|---|
| SP2 Expected Elapsed Time | 513 | – | 140 | – | – | 74 | 56 | 54 | 61 |
| SP2 Actual Elapsed Time | 520 | – | 142 | – | – | 75 | 52 | 52 | 56 |
| Pentium Expected Elapsed Time | 1222 | 632 | 337 | 243 | 387 | 457 | – | – | – |
| Pentium Actual Elapsed Time | 1308 | 712 | 362 | 288 | 342 | 379 | – | – | – |

Table 18: Elapsed times for the mesh-spectral Poisson solver application implemented in Fortran with MPI, running on the specified number of nodes on the IBM SP2 (without the crossbar switch) and a network of 166 MHz Pentiums (communicating over 100 Mbps Ethernet). Problem size is 800 by 800 grid points and 1000 steps.... In PAGE 41: ...

| | n = 1 | n = 2 | n = 4 | n = 6 | n = 8 | n = 9 | n = 16 | n = 25 | n = 36 |
|---|---|---|---|---|---|---|---|---|---|
| SP2 Expected Process Time | 509 | – | 134 | – | – | 64 | 38 | 28 | 22 |
| SP2 Actual Process Time | 516 | – | 135 | – | – | 64 | 39 | 27 | 24 |
| Pentium Expected Process Time | 1219 | 625 | 327 | 229 | 370 | 439 | – | – | – |
| Pentium Actual Process Time | 1305 | 632 | 354 | 276 | 328 | 363 | – | – | – |

Table 19: Process times for the mesh-spectral Poisson solver application implemented in Fortran with MPI, running on the specified number of nodes on the IBM SP2 (without the crossbar switch) and a network of 166 MHz Pentiums (communicating over 100 Mbps Ethernet). Problem size is 800 by 800 grid points and 1000 steps.... In PAGE 44: ...rams appear in Appendix B.2. Different choices of NXPROCS and NYPROCS (the dimensions of the process grid) imply different data distributions; for example, if NXPROCS = 1, data is distributed by columns. To model the effect of varying the data distribution in this way, we can reuse the computational benchmark results in Table 10, but we must rerun the communication benchmark for each choice of (NXPROCS, NYPROCS).
We ran the communication benchmark for the following configurations of (NXPROCS, NYPROCS): (1,16), (2,8), (4,4), (8,2), and (16,1).... ..."
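The excerpts above outline the overall method: per-node computational benchmark times (Table 10) and per-configuration communication benchmark times (Table 11) are combined into a predicted elapsed time. The snippet does not give the model's exact form, so the following is a deliberately simplified, hypothetical combination — it treats the communication entries as run totals (their units are not fully specified in the snippet) and ignores redistribution costs — only to show the shape of such a prediction:

```python
# Per-step computational costs from Table 10 (ms, one SP2 node).
T_COMP, T_CHECK, T_COPY = 427.0, 134.0, 62.0

# Selected communication-benchmark entries from Table 11 (ms), keyed by
# node count n.  Treating these as run totals is an assumption.
COMM = {
    1: dict(overhead=3850, bdry=0, bcast=0, gmax=0),
    4: dict(overhead=9630, bdry=11, bcast=4, gmax=12),
    36: dict(overhead=86150, bdry=20, bcast=145, gmax=171),
}

def predicted_elapsed(n, steps=1000):
    # Hypothetical simplified model: computation scales as 1/n and the
    # communication costs are added on top.  The paper's actual model is
    # more detailed than this sketch.
    c = COMM[n]
    comm_total = c["bdry"] + c["bcast"] + c["gmax"]
    per_step_compute = (T_COMP + T_CHECK + T_COPY) / n
    return (c["overhead"] + comm_total + steps * per_step_compute) / 1000.0  # s
```

Note that in Table 13 the measured elapsed time eventually rises again at large n; capturing that would require the redistribution and per-step communication terms this sketch omits.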

Cited by 3