### Table 6 reports the performance of these Algorithms, referring to the single-program implementation. It is important to note that our Algorithms can be implemented with reasonable time and memory requirements, allowing them to run on virtually any machine and needing no dedicated hardware.

2007

"... In PAGE 15: ... Table 6: Algorithms 2 and 3 - Time and memory performance 6.3 Computation of Flow Changes Local rank variations on a single path can be further explored by examining how routes changed in a certain time interval.... ..."

### Table 1 Performance Numbers of single process and distributed programs

### Table 1 reports the time for performing a single execution of the fault-free program (a) and the time (b) required to perform the fault injection of the 3,000 faults defined above for each of the three programs. The experiments show that the analysis of each fault (c) requires from 70 to 96 times the time needed for a single execution of the same program in normal operation mode.

1999

"... In PAGE 5: ... Table 1: time requirements. Please note that the time for each fault includes the time for recovery from the effects of previous faults, the execution of the program and the injection of the fault, and the observation of the output values.... ..."

Cited by 1
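Given the reported per-fault overhead of 70 to 96 single executions, the total cost of such a campaign can be estimated directly. A minimal sketch; the base run time and the 80x factor below are illustrative assumptions, not values from the paper:

```python
# Back-of-the-envelope estimate of a fault-injection campaign's duration.
# Per-fault cost is modeled as a constant multiple of one fault-free run,
# covering recovery, re-execution, injection, and output observation.

def campaign_time(single_run_s, n_faults, overhead_factor):
    """Total campaign time in seconds when each injected fault costs
    overhead_factor times one fault-free execution."""
    return n_faults * overhead_factor * single_run_s

# Hypothetical example: a 0.5 s program, 3,000 faults, 80x overhead.
total_s = campaign_time(0.5, 3000, 80)   # 3000 * 80 * 0.5 = 120,000 s
print(f"{total_s / 3600:.1f} hours")     # about 33.3 hours
```

This kind of estimate makes clear why the 70-96x factor, not the number of faults alone, dominates the feasibility of an injection campaign.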

### Table 1: Results in Megaflops on parallel computers. In Table 1, it can be seen that the performance of the program on the Alliant FX/80 in double precision is better than that of the single-precision version. The reason for this is that the single-precision mathematical library routines are less optimized.

"... In PAGE 5: ... Finally, these codes have been used as a platform for the implementation of the uniprocessor version of Level 3 BLAS on the BBN TC2000 (see next Section). We show in Table 1 the Mflops rates of the parallel matrix-matrix multiplication, and in Table 2 the performance of the LU factorization (we use a blocked code similar to the LAPACK one) on the ALLIANT FX/80, the CRAY-2, and the IBM 3090-600J obtained using our parallel version of the Level 3 BLAS. Note that our parallel Level 3 BLAS uses the serial manufacturer-supplied versions of GEMM on all the computers.... In PAGE 6: ... This package is available without payment and will be sent to anyone who is interested. We show in Table 1 the performance of the single and double precision GEMM on different numbers of processors. Table 2 shows the performance of the LAPACK codes corresponding to the blocked LU factorization (GETRF, right-looking variant), and the blocked Cholesky factorization (POTRF, top-looking variant).... In PAGE 8: ... The second part concerned the performance we obtained by tuning and parallelizing these codes, and by introducing library kernels. We give in Table 1 a brief summary of the results we have obtained. One of the most important points to mention here is the great impact of the use of basic linear algebra kernels (Level 3 BLAS) and the LAPACK library. The conclusion involves recommendations for a methodology for both porting and developing codes on parallel computers, performance analysis of the target computers, and some comments relating to the numerical algorithms encountered.... In PAGE 12: ... Because of the depth-first search order, the contribution blocks required to build a new frontal matrix are always at the top of the stack. The minimum size of the LU area (see column 5 of Table 1) is computed during the symbolic factorization step.
The comparison between columns 4 and 5 of Table 1 shows that the size of the LU area is only slightly larger than the space required to store the LU factors.... In PAGE 12: ... The minimum size of the LU area (see column 5 of Table 1) is computed during the symbolic factorization step. The comparison between columns 4 and 5 of Table 1 shows that the size of the LU area is only slightly larger than the space required to store the LU factors. Frontal matrices are stored in a part of the global working space that will be referred to as the additional space.... In PAGE 12: ... In a uniprocessor environment, only one active frontal matrix need be stored at a time. Therefore, the minimum real space (see column 7 of Table 1) to run the numerical factorization is the sum of the LU area, the space to store the largest frontal matrix, and the space to store the original matrix. Matrix Order Nb of nonzeros in Min.... In PAGE 13: ... In this case the size of the LU area can be increased using a user-selectable parameter. On our largest matrix (BBMAT), by increasing the space required to run the factorization (see column 7 in Table 1) by less than 15 percent over the minimum, we could handle the fill-in due to numerical pivoting and run efficiently in a multiprocessor environment. We reached 1149 Mflops during numerical factorization with a speed-up of 4.... In PAGE 14: ...ack after computation. Interleaving and cacheability are also used for all shared data. Note that, to prevent cache inconsistency problems, cache flush instructions must be inserted in the code. We show, in Table 1, timings obtained for the numerical factorization of a medium-size (3948 x 3948) sparse matrix from the Harwell-Boeing set [3]. The minimum degree ordering is used during analysis.... In PAGE 14: ... In rows (1) we exploit only parallelism from the tree; in rows (2) we combine the two levels of parallelism. As expected, we first notice, in Table 1, that version 1 is much faster than version 2...
In PAGE 15: ... Results obtained on version 3 clearly illustrate the gain coming from the modifications of the code, both in terms of speed-up and performance. Furthermore, when only parallelism from the elimination tree is used (see rows (1) in Table 1), all frontal matrices can be allocated in the private area of memory. Most operations are then performed from the private memory and we obtain speedups comparable to those obtained on shared-memory computers with the same number of processors [1].... In PAGE 15: ... Most operations are then performed from the private memory and we obtain speedups comparable to those obtained on shared-memory computers with the same number of processors [1]. We finally notice, in Table 1, that although the second level of parallelism nicely supplements that from the elimination tree, it does not provide all the parallelism that could be expected [1]. The second level of parallelism can even introduce a small slowdown on a small number of processors, as shown in column 3 of Table 1.... In PAGE 15: ... We finally notice, in Table 1, that although the second level of parallelism nicely supplements that from the elimination tree, it does not provide all the parallelism that could be expected [1]. The second level of parallelism can even introduce a small slowdown on a small number of processors, as shown in column 3 of Table 1. The main reason is that frontal matrices must be allocated in the shared area when the second level of parallelism is enabled.... In PAGE 18: ... block diagonal) preconditioner appears to be very suitable and is superior to the Arnoldi-Chebyshev method. Table 1 shows the results of the computation on an Alliant FX/80 of the eight eigenpairs with largest real parts of a random sparse matrix of order 1000. The nonzero off-diagonal and the full diagonal entries are in the ranges [-1,+1] and [0,20] respectively.... In PAGE 19: ...
A comparison with the block preconditioned conjugate gradient is presently being investigated. In Table 1, we compare three partitioning strategies of the number of right-hand sides for solving the system of equations M^-1 AX = M^-1 B, where A is the matrix BCSSTK27 from the Harwell-Boeing collection, B is a rectangular matrix with 16 columns, and M is the ILU(0) preconditioner. Method 1 2 3 1 block.... In PAGE 26: ...111 2000 lapack code 0.559 Table 1: Results on matrices of bandwidth 9.... In PAGE 30: ... We call "global approach" the use of a direct solver on the entire linear system at each outer iteration, and we want to compare it with the use of our mixed solver, in the case of a simple splitting into 2 subdomains. We show the timings (in seconds) in Table 1 on 1 processor and in Table 2 on 2 processors, for the following operations: construction & assembly: Construction and Assembly, 14% of the elapsed time; factorization: Local Factorization (Dirichlet+Neumann), 23%; substitution/pcg: Iterations of the PCG on the Schur complement, 55%; total time. The same code is used for the global direct solver and the local direct solvers, which takes advantage of the block-tridiagonal structure due to the privileged direction. Moreover, there has been no special effort to parallelize the mono-domain approach.... ..."
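The excerpt's recurring theme, building parallel dense solvers on a tuned serial GEMM, rests on blocked matrix multiplication. A minimal NumPy sketch of the blocking pattern; block size and shapes are illustrative, and real Level 3 BLAS kernels are tuned per machine:

```python
import numpy as np

def blocked_gemm(A, B, bs=64):
    """Blocked C = A @ B, the update pattern behind Level 3 BLAS GEMM.
    Each (bs x bs) panel product is a dense kernel call, which is what
    lets a tuned serial GEMM be reused inside a parallel solver."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    for i in range(0, n, bs):
        for j in range(0, m, bs):
            for p in range(0, k, bs):
                # Rank-bs update of one C panel; slicing clamps at edges.
                C[i:i+bs, j:j+bs] += A[i:i+bs, p:p+bs] @ B[p:p+bs, j:j+bs]
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((128, 96)), rng.standard_normal((96, 80))
assert np.allclose(blocked_gemm(A, B, bs=32), A @ B)
```

The same panel-update structure underlies the blocked LU and Cholesky factorizations (GETRF, POTRF) mentioned in the excerpt: most of their flops are spent inside GEMM-shaped updates.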

### Table 1: Performance of the different algorithms as the minimum probability of success of answering a question varies. The average percentage of questions which can be answered at a single stage is fixed at 10%. The numbers reported are the percentage of the performance of the optimal dynamic programming solution achieved, averaged across 30 independent problems.

1998

"... In PAGE 11: ... We computed the average performance across the 30 problems, and compared this performance with the performance obtained using the stochastic dynamic programming algorithm. Table 1 shows the results of our experiments. The average performance of the greedy and index heuristics in each condition is expressed in terms of the percentage of the optimal performance.... In PAGE 11: ... The table also illustrates the improvement in performance obtained by both the one-step rollout and the selective two-step rollout algorithms, expressed in terms of percentage of the optimal performance. As an example, the first column of Table 1 gives the average performance across 30 problems with a lower bound of 0.2 on the probability of successfully answering a question.... In PAGE 11: ... Furthermore, the two-step selective rollout achieved on average 81% of the optimal performance. The results in Table 1 show that one-step rollouts significantly improve the performance of both the greedy and the index heuristics in these difficult stochastic combinatorial problems. In particular, the rollout algorithms recovered in all cases at least 50% of the loss of value due to... In PAGE 13: ... As before, the performance of the greedy and index heuristics improves as the experimental condition approaches the standard conditions of the quiz problem, where 100% of the questions can be answered at any time. The results confirm the trend seen in Table 1: even in cases where the heuristics achieve good performance, rollout strategies offer significant performance gains. Problem Density 0.... ..."

Cited by 39
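The one-step rollout idea evaluated in the excerpt, improving a base heuristic by simulating it to completion from each candidate next action, can be sketched on a simplified quiz problem where a wrong answer ends the quiz. The greedy ordering rule and the instance below are illustrative assumptions, not the paper's exact setup:

```python
def expected_value(order, p, v):
    """Expected total reward when questions are attempted in the given
    order and the first wrong answer ends the quiz."""
    e, surv = 0.0, 1.0
    for i in order:
        e += surv * p[i] * v[i]   # reached question i and answered it
        surv *= p[i]              # probability of surviving past i
    return e

def greedy(remaining, p, v):
    # Base heuristic: highest immediate expected reward first.
    return sorted(remaining, key=lambda i: p[i] * v[i], reverse=True)

def one_step_rollout(p, v):
    """Choose each next question by simulating the greedy base policy
    to completion from every candidate and keeping the best."""
    remaining, order = set(range(len(p))), []
    while remaining:
        best = max(remaining, key=lambda i: expected_value(
            [i] + greedy(remaining - {i}, p, v), p, v))
        order.append(best)
        remaining.remove(best)
    return order

p, v = [0.9, 0.5], [1.0, 10.0]
base = greedy(set(range(len(p))), p, v)
rolled = one_step_rollout(p, v)
# Rollout never scores below the base heuristic it simulates.
assert expected_value(rolled, p, v) >= expected_value(base, p, v) - 1e-9
```

The final assertion is the policy-improvement property that explains the excerpt's finding: rollouts recover part of the gap between a heuristic and the optimal dynamic programming solution.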

### Table 6.5 shows the performance of the render program for each of the compilation strategies on a single-processor system using the replace-preemptive scheme and an interpretation penalty of 10. The values in the "Speedup over Traditional"...

### Table 7 Comparison of compilation CPU time for three compilers. Times are in seconds. A single program containing 20,187 cards (14,716 excluding comments) which implement 101 subroutines was compiled. Times were taken on a Model 145 under VM/370 with compiler parameters appropriate for interactive computing.


### Table 3: Experimental results for SPECfp95. The slow-down factors given are for the single selected region R. The four rightmost columns report the impact of the slowed-down region R on the overall program performance, in terms of relative execution time T and relative energy savings (original = 100%).
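A useful sanity check on such numbers is the standard Amdahl-style relation between a region's share of the run and the overall slowdown. This is generic reasoning, not the paper's model, and the fraction and factor below are hypothetical:

```python
def relative_time(f, s):
    """Overall execution time, relative to the original (= 1.0), when a
    region covering fraction f of the original run is slowed down by
    factor s; the rest of the program is unaffected."""
    return (1.0 - f) + f * s

# Hypothetical example: region R is 20% of the run, slowed down 2x.
print(f"{relative_time(0.20, 2.0):.2f}")  # 1.20 -> a 20% overall slowdown
```

This is why slowing a single region can cost little overall time while still enabling energy savings in that region.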

### Table 4.2: Results of the experiments, given for the unoptimized (U) and optimized (O) programs (P1..P6). Optimization used the greedy algorithm and generated the system with single-rule transitions. To further compare the performance, we ran our optimization program on several randomly generated programs. Table 4.2 compares the results of the analysis of the original and optimized programs. We used the greedy optimization method and allowed only single-rule transitions. The optimization always resulted in a reduction of the number of rules to be fired to reach a fixed point. It was expected that the optimized programs would have more complex enabling conditions. This is particularly true for the bigger example programs.

1993
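The cost measure used in this excerpt, the number of rule firings needed to reach a fixed point, can be illustrated with naive forward chaining. The rule encoding below is a toy sketch, not the cited work's formalism:

```python
def fire_to_fixed_point(facts, rules):
    """Naive forward chaining: repeatedly fire any rule whose premises
    all hold until no rule can add a new fact. Returns the final fact
    set and the number of rule firings, the cost that rule-system
    optimization aims to reduce."""
    facts, firings = set(facts), 0
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                firings += 1
                changed = True
    return facts, firings

# Each rule is (set of premises, conclusion).
rules = [({"a"}, "b"), ({"b"}, "c"), ({"a", "c"}, "d")]
facts, n = fire_to_fixed_point({"a"}, rules)
print(facts, n)  # {'a', 'b', 'c', 'd'} reached after 3 firings
```

Merging rules into single-rule transitions, as the excerpt describes, trades fewer firings for more complex enabling conditions (the `premises <= facts` tests).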