M. S. Lam, E. E. Rothberg and M. E. Wolf: The cache performance and optimizations of blocked algorithms. In Proc. 4th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, ASPLOS IV, pages 63-74, Palo Alto, California, April 1991.
....loop tiling. Let us modify the loop nest by introducing a size parameter p as follows: for i := 1 to 8 do for j := 1 to p do a(6i + 9j - 7) := a(6i + 9j - 7) + 5; An exact dependence analysis, as described in [5], yields the dependence vector (3, -2). This vector is also called a reuse vector in [11] and [12]. Hence, any iteration (i + 3, j - 2) uses the array element computed during iteration (i, j). A common objective, when mapping such a loop nest onto a distributed-memory multiprocessor, is to minimize the number of communications. Therefore, mapping on a single ....
....the processor cache. Moreover, this makes it possible to achieve temporal locality. A common method is to divide the iteration space into blocks, or tiles, and map them onto the processors. A good shape and size for the tiles allow communication-free blocks and optimized use of the processor cache. As shown in [11], the optimal tile size is difficult to predict. We show in this simple example that our method helps with this prediction. In our example, communication-free blocks are obtained by choosing a parallelogram shape for the tiles such that one side is generated by the dependence vector (3, -2) ....
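The dependence vector quoted in the excerpt above can be checked by brute force. The following is a minimal sketch, not the citing paper's method: it assumes the array subscript is 6i + 9j - 7 (the operators were lost in extraction, and the constant offset does not affect the result), and the function name `reuse_vectors` and parameter `n` are hypothetical. It enumerates the iteration space and collects the distance between consecutive iterations that touch the same array element.

```python
from collections import defaultdict

def reuse_vectors(p, n=8):
    """Distances between consecutive iterations that access the same element.

    Iteration space: i in 1..n, j in 1..p; subscript assumed to be 6i + 9j - 7.
    """
    touched = defaultdict(list)
    for i in range(1, n + 1):
        for j in range(1, p + 1):
            # Iterations are appended in execution (lexicographic) order.
            touched[6 * i + 9 * j - 7].append((i, j))
    vectors = set()
    for iters in touched.values():
        for (i0, j0), (i1, j1) in zip(iters, iters[1:]):
            vectors.add((i1 - i0, j1 - j0))
    return vectors
```

For p = 5 the only distance found is (3, -2), since 6·3 + 9·(-2) = 0; for p < 3 the set is empty because no iteration (i + 3, j - 2) falls inside the iteration space.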
M. S. Lam, E. E. Rothberg and M. E. Wolf: The cache performance and optimizations of blocked algorithms. In Proc. 4th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, ASPLOS IV, pages 63-74, Palo Alto, California, April 1991.
....4.5 the CVT cache simulator is changed to behave like a cache that is able to perform software prefetching, and a test is run with it. However, most software optimizations aim at decreasing the number of capacity misses. The most commonly used software optimization is blocking, as described in [3, 4]. When the amount of data to be reused does not fit in cache, blocking restructures the loop so that computations are performed on sub-blocks that do fit in cache. Though these techniques deal with capacity misses, researchers have become aware of the fact that they also induce more complex ....
....fit in cache, blocking restructures the loop so that computations are performed on sub-blocks that do fit in cache. Though these techniques deal with capacity misses, researchers have become aware of the fact that they also induce more complex cache phenomena (conflict misses), as discussed in [1, 3]. Blocking in relation to the CVT is described in section 4.3. Nonsingular loop transformations represent a more elaborate class of software optimizations for reducing the number of capacity misses. These transformations induce complex reference patterns that make cache behavior difficult to ....
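The loop restructuring that the excerpts above call blocking can be illustrated with matrix multiplication. This is a minimal sketch under stated assumptions, not code taken from any of the citing papers: the function names and the block-size parameter `b` are hypothetical, and pure Python only shows the reordering of iterations, not the cache behavior itself.

```python
def matmul_naive(A, B, n):
    """Unblocked n x n product: for each i, all of B is swept before any reuse."""
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            r = A[i][k]
            for j in range(n):
                C[i][j] += r * B[k][j]
    return C

def matmul_blocked(A, B, n, b):
    """Blocked product: the k and j loops are split into strips of width b,
    so a b x b sub-block of B is reused across all i before moving on."""
    C = [[0] * n for _ in range(n)]
    for kk in range(0, n, b):
        for jj in range(0, n, b):
            for i in range(n):
                for k in range(kk, min(kk + b, n)):
                    r = A[i][k]
                    for j in range(jj, min(jj + b, n)):
                        C[i][j] += r * B[k][j]
    return C
```

Both functions compute the same result; the blocked version only changes the order in which iterations are visited. The block size would be chosen so that the reused sub-block fits in cache, which, as the excerpts note, is where conflict misses complicate the choice.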
M. S. Lam, E. E. Rothberg and M. E. Wolf The cache performance and optimizations of blocked algorithms
....loop tiling. Let us modify the loop nest by introducing a size parameter p as follows: for i := 1 to 8 do for j := 1 to p do a(6i + 9j - 7) := a(6i + 9j - 7) + 5; An exact dependence analysis, as described in [5], yields the dependence vector (3, -2). This vector is also called a reuse vector in [11] and [12]. Hence, any iteration (i + 3, j - 2) uses the array element computed during iteration (i, j). A common objective, when mapping such a loop nest onto a distributed-memory multiprocessor, is to minimize the number of communications. Therefore, mapping on a single ....
....stored in the processor cache. Moreover, this makes it possible to achieve temporal locality. A common method is to divide the iteration space into tiles and map them onto the processors. A good shape and size for the tiles allow communication-free blocks and optimized use of the processor cache. As shown in [11], the optimal tile size is difficult to predict. We show in this simple example that our method helps with this prediction. In our example, communication-free blocks are obtained by choosing a parallelogram shape for the tiles such that one side is generated by the dependence vector (3, -2) ....
M. S. Lam, E. E. Rothberg and M. E. Wolf: The cache performance and optimizations of blocked algorithms. In Proc. 4th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, ASPLOS IV, pages 63-74, Palo Alto, California, April 1991.
....However, this latter method loses efficiency or fails when strides are too large, and cannot avoid periodic cache misses during a vector access. Besides, large cache lines induce interference phenomena which can degrade the performance of software methods used to exploit temporal locality [9]. A solution to these problems is to resort to prefetching, i.e., predicting which data will soon be used and loading them into upper levels of the memory hierarchy before they are referenced, thereby avoiding cache misses. Prefetching can be managed by hardware, by software, or by a combination of both. ....
....rely on hardware mechanisms. Besides, software prefetching relies on compiler performance for predicting prefetch request issue times, grouping requests and detecting regular accesses. Compiler performance: Several recent works on detection and exploitation of temporal and spatial locality [9, 4] may greatly improve compiler capacity for prefetching. However, techniques for managing numerous regular and concurrent streams of references are still under development. Besides, difficult and not infrequent cases are handled poorly or not at all, such as if statements within a loop nest, or ....
Monica S. Lam, Edward E. Rothberg and Michael E. Wolf: The cache performance and optimizations of blocked algorithms, Proc. ASPLOS'91, pp. 63-74.
....critical to improve the hit ratio. Numerical codes are now some of the most demanding programs in terms of execution time and memory usage. The existing literature on the behavior of numerical codes on cache memories focuses on regular do loops, i.e., loops with linear references to arrays [8, 2]. There is an important set of numerical codes, sparse codes, which do not belong to this category. Sparse numerical codes, like classic numerical codes, are made of a collection of simple numerical primitives. We chose to study Sparse Matrix Vector multiply (SpMxV) because it is among the most ....
Monica S. Lam, Edward E. Rothberg and Michael E. Wolf: The cache performance and optimizations of blocked algorithms, Proc. of ASPLOS, 1991.
No context found.
M. S. Lam, E. E. Rothberg, M. E. Wolf: The cache performance and optimizations of blocked algorithms, Proceedings of 4th ASPLOS, 1991.