George Cybenko, Lyle Kipp, Lynn Pointer, and David Kuck. Supercomputing Performance Evaluation and the Perfect Benchmarks. In Supercomputing '90, pages 254--266, 1990.
....is such that it should be considered with the same care. 2.2 An Example of a Real Numerical Loop Nest. The purpose of this section is to illustrate the influence of communications on a real numerical loop nest. This loop nest is extracted from the routine EFLUX of FLO52, a Perfect Club benchmark [4].
C parallel
      DO 3 N=1,4
      DO 3 J=2,JL
      DO 3 I=2,IL
      DW(I,J,N) = FS(I,J,N) - FS(I-1,J,N)
    3 CONTINUE
[Figure 5: Influence of Data Layout. Axes: # procs, m.]
C parallel
      DO 5 J=2,jl
      DO 5 I=2,IL
      XX = X(I,J,1) - X(I-1,J,1)
      YX = X(I,J,2) ....
....the same cache line is equal to L_S / C_S. So, such phenomena have a very low probability of occurring, but they might be very difficult to trace back in a program. Therefore they should be detected at compile time or run time. For instance, in subroutine HYD (and other subroutines) of the Perfect [10] benchmark ADM, ping-pong phenomena were detected. A 61% decrease of the average memory access time was obtained by slightly varying the base addresses of arrays P and PS in HYD (see figure 8.3.2.3f). Spatial self-interferences can also frequently occur because of the widespread use of ....
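The ping-pong effect described in this excerpt can be illustrated with a minimal sketch of a direct-mapped cache. Everything below is an assumption made for illustration: the cache size (1024 bytes), line size (32 bytes), 8-byte elements, the base addresses, and the function name `count_misses` are hypothetical and are not taken from ADM or from the citing paper. The sketch only shows why two arrays whose base addresses map to the same cache line evict each other on every access, and why shifting one base address by a single line removes the conflict.

```python
def count_misses(base_p, base_ps, n, cache_size=1024, line_size=32):
    """Count misses in a direct-mapped cache for the alternating access
    pattern P(i), PS(i), i = 0..n-1, assuming 8-byte elements."""
    num_lines = cache_size // line_size
    tags = [None] * num_lines  # memory line currently held by each cache line
    misses = 0
    for i in range(n):
        for base in (base_p, base_ps):
            mem_line = (base + 8 * i) // line_size
            slot = mem_line % num_lines  # direct-mapped placement
            if tags[slot] != mem_line:
                tags[slot] = mem_line    # evict whatever was there
                misses += 1
    return misses


# Hypothetical layout 1: both arrays start on addresses that map to the
# same cache line, so each access evicts the other array's line.
conflicting = count_misses(base_p=0, base_ps=4096, n=256)

# Hypothetical layout 2: PS is shifted by one cache line (32 bytes).
shifted = count_misses(base_p=0, base_ps=4096 + 32, n=256)

print(conflicting, shifted)
```

With these assumed parameters, every access misses in the first layout, while the second layout incurs only the compulsory miss for each distinct memory line.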
....For the first category, we selected Gcc, Compress, Simulator, Contour, Grep, Latex and Segmentation. These seven programs were only used to increase the load on the processor, and no specific performance evaluation was done on them. For numerical codes, we picked three of the Perfect Club [5] benchmarks: FLO52, ARC2D, MDG. Since this study is focused on analyzing the behavior of multiple shared-context threads, statistics are collected over parallel sections only, even though all instructions are executed. Considering our goal was to evaluate the best achievable performance with a ....
....information is incorrect. As a consequence, the AMAT metric (Average Memory Access Time) instead of the CPI metric (Cycles Per Instruction) was used. Benchmarks. The benchmarks used are all numerical codes: real application benchmarks ADM, MDG, BDN, DYF, ARC, FLO, TRF from the Perfect Club Suite [6], the Livermore Loops benchmark LIV, the NAS and Slalom benchmarks, and two numerical primitives, Matrix-Vector multiply (MV) and Sparse Matrix-Vector multiply (SpMV). Notations and Parameters. A baseline cache configuration called Standard or Stand. has been used on many graphs. It corresponds to the ....
....time of cache C2: A collection of 10 benchmarks has been used. Though it is difficult to choose a representative set of benchmarks, we tried to combine different types of codes to illustrate the maximum number of behaviors. Some numerical codes have been picked from the Perfect Club Suite [1] (benchmarks AP, ARC, BDNA, WS), some codes are Unix tools (CC is the GNU cc compiler, CPR is the compress utility, TEX is the LaTeX compiler), and other codes are numerical primitives (LL is a Lawrence Livermore loop, SPMV is a sparse matrix-vector multiply loop, and MM is matrix-matrix multiply ....
....any type of reuse can be disrupted. In general, the larger the reuse distance for a given datum, the higher the probability that it is flushed from cache, because more data are loaded before the reuse occurs. Consider loop 2a, which is a modified version of a loop in ARC2D, a Perfect Club benchmark [1]. The reuse distance associated with the spatial reuse of LDA(k) is equal to 1 iteration of loop k (unlikely to be disrupted), while the reuse distance associated with the temporal reuse of LDA(k) is equal to KU iterations of loop k (more likely to be disrupted, depending on the value of KU) ....
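The notion of reuse distance used in this excerpt can be sketched on a synthetic address trace. This is a minimal illustration under stated assumptions: the function name `reuse_distances`, the two-level loop, and the value KU = 4 are hypothetical, and the trace is not the actual loop 2a from ARC2D. Distance is measured here as the number of accesses between two consecutive references to the same element, which in this single-reference loop body equals the number of inner-loop iterations.

```python
def reuse_distances(trace):
    """Map each reused address to the list of distances (in accesses)
    between consecutive references to it."""
    last_seen = {}
    distances = {}
    for position, addr in enumerate(trace):
        if addr in last_seen:
            distances.setdefault(addr, []).append(position - last_seen[addr])
        last_seen[addr] = position
    return distances


# Hypothetical loop nest: an outer loop repeated twice around an inner
# loop over k = 0..KU-1 that references element k of one array.
KU = 4
trace = [k for _ in range(2) for k in range(KU)]

result = reuse_distances(trace)
print(result)
```

In this sketch each element is reused after exactly KU accesses, so a larger KU means more data are touched between the two references and the element is more likely to have been evicted by then.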
....2 (for the sake of clarity, the scale has been modified). 2.3.4 Putting it all together. In this section, the number of cache misses of loop 2.3.4a is computed, based on the techniques presented in the previous sections. This loop is a modified version of a loop in ARC2D, a Perfect Club benchmark [16]. In this loop, several interference phenomena occur simultaneously (internal cross-interferences, external cross-interferences of self dependences, group dependences). The leading dimension of arrays F, U is KU. One parameter, δ, is used to characterize the distance between the different DO ....
....L_S / C_S. So, such phenomena have a very low probability of occurring, but the associated performance degradation might be very difficult to trace back in a program. Therefore they should be detected at compile time or run time. For instance, in subroutine HYD (and other subroutines) of the Perfect [16] benchmark ADM, ping-pong phenomena were detected. The average memory access time was divided by 2, simply by slightly varying the base addresses of two arrays P and PS in subroutine HYD (see figure 3.2f). Spatial self-interferences can also frequently occur because of the widespread use ....
....Thanks to this tool, all array references in each code were instrumented. For more details on the issues of source code tracing, see [17]. Benchmarks. The benchmarks used are all numerical codes: real application benchmarks ADM, MDG, BDN, DYF, ARC, FLO, TRF from the Perfect Club Suite [4], the Livermore Loops benchmark LIV, the NAS and Slalom benchmarks, and two numerical primitives, Matrix-Vector multiply (MV) and Sparse Matrix-Vector multiply (SpMV). Notations and Parameters. A baseline cache configuration called Standard or Stand. has been used on many graphs. It corresponds to the ....
....were passed to the simulator. To come as close as possible to the conditions of a superscalar processor, we assumed that one load/store request is sent to the cache every cycle, without disruption (due to branches or otherwise). Benchmarks and traces. Seven benchmarks from the Perfect Club Suite [2] were used. For each benchmark, a 50-million-entry trace was extracted. In our case, a source code trace would have been the easiest solution, but because hardware implementation issues had to be finely studied, we decided against it and extracted object code traces to get all load/store references. ....