### Table 2: Performance Profile on a 256 processor Intel Touchstone Delta system (time in seconds)

"... In PAGE 11: ... Figure 3 shows the performance of Algorithm 1 on the Intel Delta system as a function of matrix size for different numbers of processors. Table 2 gives details of the total CPU timing of the Newton iteration based algorithm, summarized in Table 1. It is clear that the Newton iteration (sign function) is most expensive, and takes about 90% of the total running time.... ..."

### TABLE 2 Performance profile on 256 node Intel Touchstone Delta system.

1997

Cited by 32

### Table 1: The SDC algorithm with Newton iteration on 256-node Intel Touchstone Delta system.

"... In PAGE 8: ...14 iteration to converge. From Table 1, we see that for matrices of order 4000, the algorithm reached 7.19/23.12 ≈ 31% efficiency with respect to PUMMA matrix multiplication, and 7.19/8.70 ≈ 82% efficiency with respect to the underlying ScaLAPACK matrix inversion subroutine. Table 2 is the profile of the CPU time.... ..."

### Table 1: Backward accuracy, timing in seconds and megaflops of Algorithm 1 on a 256 node Intel Touchstone Delta system.

"... In PAGE 10: ... All computations were performed in real double precision arithmetic. Table 1 lists the measured results of the backward error, the number of Newton iterations, the total CPU time and the megaflops rate. In particular, the second column of the table contains the backward errors and the number of the Newton iterations in parentheses.... In PAGE 10: ... We note that the convergence rate is problem-data dependent. From Table 1, we see that for a 4000-by-4000 matrix, the algorithm reached 7.19/23.... In PAGE 27: ...01 11570.10 Table 10: Performance of matrix inversion (LU + Triangular inversion), blocksize=30. Delta 16 × 16 PEs 16 × 32 PEs n LU time TRI time Mflops LU time TRI time Mflops (seconds) (seconds) (total) (seconds) (seconds) (total) 1000 2.... In PAGE 28: ... Table 11: Performance of QR decomposition method for solving the least squares problem (QR decomposition + Triangular solver), blocksize=6. Delta 16 × 16 PEs 16 × 32 PEs n QR time Solve time Mflops QR time Solve time Mflops (seconds) (seconds) (total) (seconds) (seconds) (total) 1000 4.... In PAGE 28: ...8376 8302.54 Table 12: Performance of QR decomposition with column pivoting. Delta 16 × 16 PEs 16 × 32 PEs n time Mflops time Mflops (seconds) (total) (seconds) (total) 1000 6.... In PAGE 29: ... Table 13: Backward accuracy, timing in seconds and megaflops of the SDC algorithm with Newton iteration on a 32 PEs with VUs CM-5. n ‖E21‖_1/‖A‖_1 Timing Mflops Mflops GEMM-Mflops INV-Mflops (iter) (seconds) (total) (per node) (per node) (per node) 256 4e-14 (18) 33.... In PAGE 29: ...72 18.87 Table 14: Performance of matrix multiplication and matrix inversion, and QRP, CMSSL version 3.2 CM-5 GEMM Inversion QRP n time Mflops time Mflops time Mflops (seconds) (per node) (seconds) (per node) (seconds) (per node) 256 0.... ..."

### Table 2: Performance Profile on 256-node Intel Touchstone Delta system. n Newton (%) QRP(%) QTAQ(%) Total CPU

"... In PAGE 8: ... From Table 1, we see that for matrices of order 4000, the algorithm reached 7.19/23.12 ≈ 31% efficiency with respect to PUMMA matrix multiplication, and 7.19/8.70 ≈ 82% efficiency with respect to the underlying ScaLAPACK matrix inversion subroutine. Table 2 is the profile of the CPU time. It is clear that the Newton iteration (i.... ..."

### Table 3: Load-Balancing Overhead for a 16 × 32 Mesh on the Intel Touchstone DELTA Time (msec) Section Algorithm Full Partial None Average

"... In PAGE 13: ... It should be noted that the latter value is not simply the average of the other three columns but is a weighted average, where the weights are based on the number of time steps executed for each type within a 24-hour period. The overhead data in Table 3 breaks down load-balancing costs into four categories. The first three correspond to the input exchange, output exchange, and state reorganization operations described in Section 5.... In PAGE 15: ... In principle, these data could be cached on each processor in the swapping algorithm, avoiding the need for the reorganization. As can be seen by examining the partial radiation times in Table 3, however, the time required to reorganize the state data is less than the time required to exchange the input and output data on every time step. Hence, this situation is not expected to have a significant impact on performance.... ..."

### Table 1: The SDC algorithm with Newton iteration on 256-node Intel Touchstone Delta

1997

"... In PAGE 8: ...14 iteration to converge. From Table 1, we see that for matrices of order 4000, the algorithm reached 7.19/23.12 ≈ 31% efficiency with respect to PUMMA matrix multiplication, and 7.19/8.70 ≈ 82% efficiency with respect to the underlying ScaLAPACK matrix inversion subroutine. Table 2 is the profile of the CPU time.... ..."

Cited by 32

### Table 2: Comparison of optimistic barrier and gsync() on Intel Touchstone Delta; the experiment measures

1995

Cited by 18