K. S. McKinley and O. Temam. A quantitative analysis of loop nest locality. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 94--104, Boston, MA, October 1996.
.... 25] Earlier models assumed fully associative caches, but more recent techniques take limited associativity into account [10, 22]. Researchers began reexamining conflict misses after a study showed conflict misses can cause half of all cache misses and most intra-nest misses in scientific codes [18]. Data layout transformations such as array transpose and padding have been shown to reduce conflict misses in the SPEC benchmarks when applied by hand [14]. Array transpose applied with loop permutation can improve parallelism and locality [5, 12, 19]. Array padding can also help eliminate ....
K. S. McKinley and O. Temam. A quantitative analysis of loop nest locality. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), Boston, MA, October 1996.
....the memory and I/O capabilities of the machine. Fortunately some of this speed gap can be filled by proper use of caches. When running most vector codes on cache machines with work set larger than the cache size, the low level algorithm decomposition gives a high inter-nest loop cache miss rate [12] (capacity misses between different loops doing calculations on the same data). Given the relative small caches of SMPs (typical 256-1024 KB per processor) this is the case for most large numerical problems. To improve locality (temporal and spatial) the algorithm have to be rescheduled ....
....of NAS FT that can benefit from the uniform shared memory available inside SMPs is presented in section 5.8. The paper will continue by introducing three different platforms used for the experiments (section 2). Furthermore, parallelizing techniques for SMPs (intra-node optimizations [12]) are detailed (section 3) and applied to the two benchmarks used, NPB EP and NPB FT [13] in section 4 and 5. Performance results for the two benchmarks are presented in section 4.3 and 6 respectively. Section 6.4 concludes with a performance and price comparison of two generations of Sparc based ....
K. S. McKinley and O. Temam. A Quantitative Analysis of Loop Nest Locality. In Proceedings of 7th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 94--104, 1996.
....strategy. It breaks down when groups become too large to handle, and we expect that we can significantly improve the analysis based on the of the ranges of loop variables. In this area we will be able to benefit from work which has been done on (nested) loop analysis, and loop restructuring, [11, 12]. When the compiler treats nested loops properly, we expect that we will be able to extend our partitioning to other parts of the memory hierarchy, for example multi level caches and virtual memory. The idea of exposing the cache to the compiler or programmer is not new. It has been proposed ....
K. S. McKinley and O. Temam. A Quantitative Analysis of Loop Nest Locality. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS-VII, Boston, MA, Oct. 1996. ACM.
....outer loop parallelism by tiling the nest to maintain locality on individual processors as the computation is divided among multiple processors. The tiling step strip mines and, if necessary, permutes the nest to position an outermost parallel loop. [Footnote 1] McKinley and Temam support this assumption [35], and McKinley et al. demonstrate that this memory model works well for uniprocessor caches [34]. Figure 3: Optimize: Data Locality and Parallelization Algorithm. INPUT: A loop nest L = {l_1, ..., l_k}. OUTPUT: An optimized loop nest P. ALGORITHM: procedure Optimize(L) compute RefGroup ....
K. S. McKinley and O. Temam. A quantitative analysis of loop nest locality. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 94--104, Boston, MA, October 1996.
....map to the same set of cache locations, causing cache lines to be flushed from cache before they may be reused, despite sufficient capacity in the overall cache. Conflict misses have been found to be a significant source of poor performance in scientific programs, particularly within loop nests [18]. We believe compiler transformations can be very effective in eliminating conflict misses for scientific programs with regular access patterns. We evaluate two compiler transformations to eliminate conflict misses: inter- and intra-variable padding [3, 21]. Unlike standard compiler transfor ....
.... and McKinley show how to select tile sizes which avoid conflict misses using the Euclidean algorithm [7]. McKinley and Temam perform a study of loop nest oriented cache behavior for scientific programs and conclude that conflict misses cause half of all cache misses and most intra-nest misses [18]. Researchers have examined methods to eliminate conflict misses using hardware [11, 13] or .... [Figure: Shal miss rate improvement; axis ticks 250 to 520] ....
K. S. McKinley and O. Temam. A quantitative analysis of loop nest locality. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), Boston, MA, October 1996.
....map to the same set of cache locations, causing cache lines to be flushed from cache before they may be reused, despite sufficient capacity in the overall cache. Conflict misses have been found to be a significant source of poor performance in scientific programs, particularly within loop nests [18]. 3.1 Eliminating Conflict Misses. Compiler transformations can be very effective in eliminating conflict misses for scientific programs with regular access patterns. We evaluate two compiler transformations to eliminate conflict misses: 1) inter-variable padding (modify variable base address); 2) ....
.... generated references as a means of discovering group reuse between references to the same array [9]. McKinley and Temam performed a study of loop nest oriented cache behavior for scientific programs and concluded that conflict misses cause half of all cache misses and most intra-nest misses [18]. Most researchers have concentrated on computation reordering transformations. Loop permutation and tiling are the primary optimization techniques [3, 9, 26], though loop fission (distribution) and loop fusion have also been found to be helpful [3]. Researchers have previously examined changing ....
K. S. McKinley and O. Temam. A quantitative analysis of loop nest locality. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), Boston, MA, October 1996.
....misses and false sharing [3, 9, 28]. Conflict misses are particularly important as cache sizes grow larger. In a recent study of uniprocessor cache behavior of scientific programs, McKinley and Temam found most cache misses within loop nests and half of all misses to be caused by cache conflicts [42]. Data layout optimization techniques include modifying the distance between variables, changing the size of array dimensions, transposing array dimensions, partially linearizing array dimensions, rearranging fields of structures, and modifying dynamic memory allocation policies. Most ....
.... generated references as a means of discovering group reuse between references to the same array [17]. McKinley and Temam performed a study of loop nest oriented cache behavior for scientific programs and concluded that conflict misses cause half of all cache misses and most intra-nest misses [42]. Researchers examining memory system effects for parallel machines have mostly focused on program transformations to eliminate false sharing and co-locate data and computation. Torrie et al. investigated cache behavior for compile parallelized programs and found memory costs were responsible for ....
K. S. McKinley and O. Temam. A quantitative analysis of loop nest locality. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), Boston, MA, October 1996.
....map to the same set of cache locations, causing cache lines to be flushed from cache before they may be reused, despite sufficient capacity in the overall cache. Conflict misses have been found to be a significant source of poor performance in scientific programs, particularly within loop nests [17]. In this paper, we show that compiler transformations can be very effective in eliminating conflict misses for scientific programs with regular access patterns. 1.1 Motivating Examples We begin by providing some simple examples to motivate the need for eliminating conflict misses. Consider the ....
.... generated references as a means of discovering group reuse between references to the same array [11]. McKinley and Temam performed a study of loop nest oriented cache behavior for scientific programs and concluded that conflict misses cause half of all cache misses and most intra-nest misses [17]. Most researchers exploring compiler optimizations to improve data locality have concentrated on computation-reordering transformations derived from shared memory parallelizing compilers [23, 24]. Loop permutation and loop tiling are the primary optimization techniques used [4, 6, 10, 14, 22] ....
K. S. McKinley and O. Temam. A quantitative analysis of loop nest locality. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), Boston, MA, October 1996.
....map to the same set of cache locations, causing cache lines to be flushed from cache before they may be reused, despite sufficient capacity in the overall cache. Conflict misses have been found to be a significant source of poor performance in scientific programs, particularly within loop nests [18]. We previously presented inter- and intra-variable padding, two compiler transformations to eliminate severe conflicts, misses which occur on every loop iteration [20]. Unlike standard compiler transformations which restructure the computation performed by the program, these two techniques modify ....
....its ability to choose paddings to exploit group reuse. 5 Related Work. Data locality has been recognized as a significant performance issue for both scalar and parallel architectures. In particular, conflict misses can cause half of all cache misses and most intra-nest misses in scientific codes [18]. Conflicts may be eliminated with hardware [9, 11] or operating systems support [2, 3]. For many scientific codes we can achieve similar or better results through inexpensive data layout transformations. Data layout transformations applied by hand has been shown to reduce conflict misses in the ....
K. S. McKinley and O. Temam. A quantitative analysis of loop nest locality. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), Boston, MA, October 1996.
....closer together: the miss rate of SPEC 95 is 9.9, and Perfect is 9.8. However, the large miss rate of ARC2D in Perfect skews the average a bit. In previous work, we present the average graphs and discuss in detail the Perfect Benchmarks on SuperSparc20 traces for the same cache organization [MT96]. Since the Sparc and Alpha results are similar, as we will discuss in Section 9, we do not present the average graphs for Perfect on the Alpha in this section. Appendix A contains these graphs and those for SPEC 95 for reference and comparison. Since on average both benchmarks yield similar ....
K. S. McKinley and O. Temam. A quantitative analysis of loop nest locality. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 94--104, Boston, MA, October 1996.
....of capacity misses only. Coleman et al. [9] have also proposed a model for computing certain types of cache conflicts and deducing near optimal block size value. The study is focused on self interferences though it also deals with cross interferences on a lesser extent. However, a recent study [10] shows that most conflict misses are cross interference misses and self interference misses do not occur so frequently. 1.2 Terminology. Array element: An array element denotes an array entry such as A(5) for example. Temporal and Spatial Locality [11]: A data exhibits temporal locality if it is ....
K. S. McKinley and O. Temam. A quantitative analysis of loop nest locality. In Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, October 1996.
....email: mckinley@cs.umass.edu; O. Temam, Laboratoire PRiSM, Universite de Versailles, 45 Avenue des Etats Unis, 78000 Versailles France; email: temam@prism.uvsq.fr. ....ories [Smith 1986; 1991] and compiler techniques that exploit cache memories [Coleman and McKinley 1995; McKinley et al. 1996; Lam et al. 1991; Mowry et al. 1992; Wolf and Lam 1991]. Most of this work depends on loop nests to provide predictable and regular data accesses. Techniques to improve data cache performance typically target and model locality characteristics found in loop nests. For example, software and ....
....model locality characteristics found in loop nests. For example, software and hardware prefetching exploit the spatial locality of regular accesses in loop nests [Chen and Baer 1995; Callahan et al. 1991; Drach 1995; Klaiber and Levy 1991; Mowry et al. 1992]. Many researchers [Li and Pingali 1992; McKinley et al. 1996; Mowry et al. 1992; Wolf and Lam 1991] model data locality by distinguishing four categories of locality which they use to drive loop optimizations: spatial reuse of adjacent locations in a cache block; temporal reuse of the same location; self-reuse from the same data reference; and ....
[Article contains additional citation context not shown here]
McKinley, K. S. and Temam, O. 1996. A quantitative analysis of loop nest locality. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems. Boston, MA, 94--104.