26 citations found.
P. Panda, H. Nakamura, N. Dutt, and A. Nicolau. Augmenting loop tiling with data alignment for improved cache performance. IEEE transactions on computers, 48(2):142--149, Feb 1999.

This paper is cited in the following contexts:

Cache Behavior Analysis without Profiling - Beyls, D'Hollander   (Correct)

....cache misses, the programmer needs to do it itself, possibly by changing the algorithms. Clearly, it is desirable that the compiler can optimize a programs cache behavior maximally, without programmer intervention. A large number of program transformations have been proposed to reduce cache misses[9, 8, 13, 15, 1, 2]. However, before the compiler can decide which transformations are profitable, it needs to know the cache behavior of the program region it would apply them to. In this paper, a method is devised which calculates the cache behavior of sequences of loop regions in the program. 1.2 Cache ....

P. Panda, H. Nakamura, N. Dutt, and A. Nicolau. Augmenting loop tiling with data alignment for improved cache performance. IEEE transactions on computers, 48(2):142-149, Feb 1999.


Cache Behavior Analysis without Profiling - Beyls, D'Hollander   (Correct)

....cache misses, the programmer needs to do it itself, possibly by changing the algorithms. Clearly, it is desirable that the compiler can optimize a programs cache behavior maximally, without programmer intervention. A large number of program transformations have been proposed to reduce cache misses[9, 8, 13, 15, 1, 2]. However, before the compiler can decide which transformations are profitable, it needs to know the cache behavior of the program region it would apply them to. In this paper, a method is devised which calculates the cache behavior of sequences of loop regions in the program exactly. 1.2 Cache ....

P. Panda, H. Nakamura, N. Dutt, and A. Nicolau. Augmenting loop tiling with data alignment for improved cache performance. IEEE transactions on computers, 48(2):142--149, Feb 1999.


Source Code Transformation based on Software Cost Analysis - Chung, Benini, De Micheli (2001)   (1 citation)  (Correct)

....one iteration of loop i, namely intrinsic misses, are: [equation (12) lost in extraction] where CL is the cache line size in terms of words. Notice that the tile size B can be chosen such that there is no self interference using the tile size selection algorithms presented in [9, 10, 11]. Among them, we use the algorithm proposed in [10]. The misses due to cross interference can be estimated using the [term lost in extraction] of arrays in loop k. The ratio of the space occupied by array [symbol lost in extraction] over the cache size, CS, is B × B / CS. [Figure: (a) Original version] ....

P. Panda, H. Nakamura, N. Dutt, and A. Nicolau, "Augmenting Loop Tiling with Data Alignment for Improved Cache Performance", IEEE Trans. on Computers, vol. 48, no. 2, pp. 142-148, 1999


Evaluating the Impact of Memory System Performance on Software.. - Badawy, al. (2001)   (5 citations)  (Correct)

.... algorithms carefully select tile dimensions tailored to individual array dimensions so that no conflicts occur [11]. Array padding expands leading array dimensions, increasing the range of non-conflicting tile shapes [36] and improving the performance of tiled codes over a range of problem sizes [35, 38]. In this paper, we apply a combination of both algorithms to tile both 2D linear algebra and 3D PDE solvers [37, 38]. 4.2 Reordering for Indexed Accesses: Index arrays arise in scientific applications such as sparse mesh PDE solvers and molecular dynamics codes, where the access pattern is ....

R. Panda, H. Nakamura, N. Dutt, and A. Nicolau. Augmenting loop tiling with data alignment for improved cache performance. IEEE Transactions on Computers, 48(2):142--149, February 1999.


Reuse Distance as a Metric for Cache Behavior - Beyls, D'Hollander (2001)   (5 citations)  (Correct)

....role in minimizing the data access latency and main memory bandwidth demand. However, caches are not perfect and many techniques have been proposed to reduce the number of cache misses, both at the hardware and at the software level. Most of the proposed methods focus on eliminating conflict misses [3, 5, 9, 11, 13, 15]. Only some software techniques such as loop tiling, loop fusion and loop reversal focus on eliminating capacity misses. However, in real programs, such as spec95fp, it has been shown that up to 68% of the cache misses are capacity misses [8]. This is consistent with our own measurements, presented ....

P. Panda, H. Nakamura, N. Dutt, and A. Nicolau. Augmenting loop tiling with data alignment for improved cache performance. IEEE transactions on computers, 48(2):142--149, Feb 1999.


The Processor-Memory Gap: Cache Remapping and Related Techniques - Beyls   (Correct)

....improvement, multi module caches have been proposed in many different forms, such as skewed associative caches [10], victim caches [5], assist caches [2]. Software techniques to eliminate severe conflict misses have been discussed a lot in the literature. Examples of these are array padding [9] and data copying at run time [11]. 3.3 Hiding Memory Latency: When a cache miss is inevitable, the latency of the main memory access can be hidden by concurrently performing useful computations. One of the first proposals to do this were non blocking caches, where computations continue as long as ....

....improvement of up to 10% over the next best technique and 450% over the original matrix multiplication. [Figure 5 plot; x-axis: Matrix Dimension; series: cache remapping, padding, copying, LRW] Figure 5: The performance of cache remapping and a number of other techniques [6, 11, 9] to improve cache performance of the tiled matrix multiplication. Cache remapping is at least as good and 10% better at maximum than the best alternative technique. It is 450% better than the original matrix multiplication, which is not shown here. ....

P. Panda, H. Nakamura, N. Dutt, and A. Nicolau. Augmenting loop tiling with data alignment for improved cache performance. IEEE transactions on computers, 48(2):142-149, Feb 1999.


Compiler Generated Multithreading to Alleviate Memory Latency - Beyls, D'Hollander (2000)   (Correct)

....tiling transformation. The largest part of remaining cache misses are conflict misses. The data fetch thread is able to resolve or hide the remaining misses, resulting in improved performance. The resulting performance is compared with other techniques to reduce the conflict misses [Lam et al. 1991, Panda et al. 1999, Rivera et al. 1999, Temam et al. 1993] in tiled algorithms. The conflict reducing techniques do not hide the cold and left over capacity misses. The only way to eliminate these misses is using some sort of prefetching. If prefetching is used without relocating the data, no improvements can be ....

....access latency of the L2 cache is 20 clock cycles and the access latency of the main memory is 65 clock cycles. The threaded version of the matrix multiplication is compared with the original program, a naively tiled version and three optimized versions corresponding to existing tiling algorithms [Panda et al. 1999, Temam et al. 1993, Lam et al. 1991]. Each algorithm was coded, compiled and simulated for matrices with dimensions between 20 and 400. In [Fig. 9] the average number of floating point operations per clock cycle is plotted. The plot is smoothed because of the irregularity in the data. The figure ....

[Article contains additional citation context not shown here]

P. Panda, H. Nakamura, N. Dutt, and A. Nicolau. Augmenting loop tiling with data alignment for improved cache performance. IEEE transactions on computers, 48(2):142-149, Feb 1999.


Cache Remapping to Improve the Performance of Tiled Algorithms - Beyls, D'Hollander   (Correct)

....increase the cache hit ratio. However, the low associativity of caches may lead to a high number of conflict misses and slow down execution so that only a fraction of the attainable performance is obtained. Additional fine tuning of the tiling transformation is needed to reduce the conflict misses [2, 6, 8, 11, 13]. (Research financed by the Flemish government under contracts IWT SB 991147 and GOA 12.0508.95.) In this paper, cache remapping is offered as a new technique to eliminate conflict misses in tiled algorithms. In addition, cache remapping produces no capacity misses and also cold misses are avoided for ....

....of the L2 cache is 20 clock cycles and the access latency of the main memory is 65 clock cycles. The cache remapping technique was compared with the original algorithm, a naively tiled algorithm not considering limited cache associativity and three optimized tiling algorithms, namely padding [8], copying [13] and LRW [6]. Each algorithm was coded, compiled and simulated for matrix dimensions between 20 and 400. For the cache remapping algorithm, the tiles on the border of the iteration space were processed using the copying technique, because the pipelined ....

[Article contains additional citation context not shown here]

P. Panda, H. Nakamura, N. Dutt, and A. Nicolau. Augmenting loop tiling with data alignment for improved cache performance. IEEE transactions on computers, 48(2):142-149, Feb 1999.


Evaluating the Impact of Memory System Performance on.. - Badawy, Aggarwal.. (2001)   (5 citations)  (Correct)

.... algorithms carefully select tile dimensions tailored to individual array dimensions so that no conflicts occur [11]. Array padding expands leading array dimensions, increasing the range of non-conflicting tile shapes [36] and improving the performance of tiled codes over a range of problem sizes [35, 38]. In this paper, we apply a combination of both algorithms to tile both 2D linear algebra and 3D PDE solvers [37, 38]. 4.2 Reordering for Indexed Accesses: Index arrays are needed for scientific applications such as sparse mesh PDE solvers and molecular dynamics codes, where the access pattern is ....

R. Panda, H. Nakamura, N. Dutt, and A. Nicolau. Augmenting loop tiling with data alignment for improved cache performance. IEEE Transactions on Computers, 48(2):142--149, February 1999.


Software Support For Improving Locality in Advanced Scientific Codes - Tseng (2000)   (Correct)

....and engineers. Because of trends in computer architectures, lessons learned here are also likely to prove very useful for other application domains, including image processing and high performance databases. 6 Related Work: There has been much work on improving locality in scientific applications [3, 24, 25, 26, 39, 40, 57, 67, 69, 70, 79, 87]. Here we will focus on the work which is most relevant to our proposed research. A number of researchers have investigated tiling as a means of exploiting reuse. Lam, Rothberg, Wolf show conflict misses can severely degrade the performance of tiling [51]. Wolf and Lam analyze temporal and ....

R. Panda, H. Nakamura, N. Dutt, and A. Nicolau. Augmenting loop tiling with data alignment for improved cache performance. IEEE Transactions on Computers, 48(2):142--149, February 1999.


Evaluating the Impact of Memory System Performance on .. - Aggarwal, Badawy.. (2000)   (5 citations)  (Correct)

[Figure 3: Tiled 3D Jacobi Example] ....loop structure so that the innermost loops can fit in cache (due to fewer iterations), tiling allows reuse to be exploited on all the tiled dimensions. Tiling is very effective with linear algebra codes [12, 23, 24, 35, 37], and has been extended to handle stencil codes used in iterative PDE solvers as well [42, 38, 44]. A major problem with tiling is that limited cache associativity may cause data in a tile to be mapped onto the same cache lines, even though there is sufficient space in the cache. Conflict ....

....though there is sufficient space in the cache. Conflict misses will result, causing tile data to be evicted from cache before they may be reused [24]. This effect is shown in Figure 2. Previous research found tile size selection and array padding can be applied to avoid conflict misses in tiles [12, 35, 37]. Tile size selection algorithms carefully select tile dimensions tailored to individual array dimensions so that no conflicts occur. For 2D arrays, the Euclidean remainder algorithm may be used to quickly compute a sequence of nonconflicting tile dimensions through a simple recurrence [12, 37]. ....

R. Panda, H. Nakamura, N. Dutt, and A. Nicolau. Augmenting loop tiling with data alignment for improved cache performance. IEEE Transactions on Computers, 48(2):142--149, February 1999.


Tiling Optimizations for 3D Scientific Computations - Rivera, Tseng (2000)   (6 citations)  (Correct)

....5. These stencil codes can be found in benchmarks like the Livermore loops, and yield large improvements since they also perform O(n^3) computation over O(n^2) data. To exploit temporal reuse, compilers can simply skew the time step loop with respect to the inner stencil loops [33, 6, 3, 23]. Unfortunately, we believe these simple stencil kernels are in fact over simplified. A more realistic stencil code would actually have multiple loop nests within the timestep loop, in order to actually compute useful values. An example of a more realistic stencil code is shown in the middle of ....

....small to provide reuse for their necessarily large tiles. As explained earlier, their technique does not extend to multigrid solvers since these applications utilize a succession of smaller grid sizes. Panda et al. proposed applying padding in conjunction with 2D tiling to avoid conflict misses [23]. They first pick the largest tile size which fits in cache, then select pad sizes by exhaustively testing for conflicts within the tile, incrementing pads by one whenever a conflict is found. In comparison, our algorithm is more efficient because we generate non-conflicting tile sizes directly ....

R. Panda, H. Nakamura, N. Dutt, and A. Nicolau. Augmenting loop tiling with data alignment for improved cache performance. IEEE Transactions on Computers, 48(2):142--149, February 1999.


Tiling Optimizations for 3D Scientific Computations - Rivera, Tseng (2000)   (6 citations)  (Correct)

....in Figure 5. These stencil codes can be found in benchmarks like the Livermore loops, and yield large improvements since they also perform O(n^3) computation over O(n^2) data. To exploit temporal reuse, compilers can simply skew the time step loop with respect to the inner stencil loops [33, 6, 3, 23]. Unfortunately, we believe these simple stencil kernels are in fact over simplified. A more realistic stencil code would actually have multiple loop nests within the timestep loop, in order to actually compute useful values. An example of a more realistic stencil code is shown in the middle of ....

....small to provide reuse for their necessarily large tiles. As explained earlier, their technique does not extend to multigrid solvers since these applications utilize a succession of smaller grid sizes. Panda et al. proposed applying padding in conjunction with 2D tiling to avoid conflict misses [23]. They first pick the largest tile size which fits in cache, then select pad sizes by exhaustively testing for conflicts within the tile, incrementing pads by one whenever a conflict is found. In comparison, our algorithm is more efficient because we generate non-conflicting tile sizes directly ....

R. Panda, H. Nakamura, N. Dutt, and A. Nicolau. Augmenting loop tiling with data alignment for improved cache performance. IEEE Transactions on Computers, 48(2):142--149, February 1999.
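
The padding search summarized in the excerpt above (fix a tile that fits in cache, then grow the pad by one until no conflict remains) can be illustrated with a minimal sketch. This is not the published algorithm of Panda et al.; it assumes a direct-mapped cache with one array element per line, a single row-major array, and the hypothetical helper names `tile_has_conflict` and `find_pad`.

```python
def tile_has_conflict(leading_dim, tile_rows, tile_cols, cache_elems):
    """Return True if two elements of the tile map to the same cache slot.

    Illustrative assumptions: direct-mapped cache holding cache_elems
    array elements, one element per line, row-major array whose rows
    are leading_dim elements apart.
    """
    seen = set()
    for i in range(tile_rows):
        for j in range(tile_cols):
            slot = (i * leading_dim + j) % cache_elems
            if slot in seen:
                return True
            seen.add(slot)
    return False


def find_pad(row_len, tile_rows, tile_cols, cache_elems):
    """Smallest per-row pad (in elements) that leaves the tile conflict-free."""
    if tile_rows * tile_cols > cache_elems:
        raise ValueError("tile does not fit in cache")
    # Pads 0..cache_elems cover every residue of the leading dimension
    # modulo the cache size, so a conflict-free pad is always found.
    for pad in range(cache_elems + 1):
        if not tile_has_conflict(row_len + pad, tile_rows, tile_cols, cache_elems):
            return pad
    return None  # not reached when the tile fits in cache
```

For example, `find_pad(256, 16, 16, 1024)` tests pads 0, 1, 2, ... for a 16 x 16 tile of an array with 256-element rows in a 1024-element cache. The exhaustive test costs one pass over the tile per candidate pad, which is the inefficiency the citing authors contrast with generating non-conflicting tile sizes directly.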


A Stable and Efficient Loop Tiling Algorithm - Hsu, Kremer (2000)   (1 citation)  (Correct)

.... no self conflicts, and cross conflicts are minimized [22, 9, 8, 37, 32, 6]. Some work has been done to quantify the total number of conflict misses [35, 14, 15, 12, 11]. Unfortunately, the performance of a tiled program resulting from existing tiling heuristics shows a large amount of instability [32, 28]. Instability comes from the so called pathological array sizes [4, 10, 22, 2] which result in poor choices of tile sizes. Array padding [1, 23, 24, 30, 31] is a compiler optimization that increases the array sizes and initial locations to avoid the pathological cases. It introduces space overhead ....

....introduces space overhead but effectively stabilizes program performance. More recent research efforts have investigated the combination of both loop tiling and array padding in the hope that both magnitude and stability of performance improvements of tiled programs can be achieved at the same time [32, 28, 21]. In this paper we discuss a new tile selection algorithm. Unlike some other tiling algorithms, our new algorithm does not require the number of padding choices to be fixed a priori. Instead, all pad sizes are evaluated in increasing order until a good tile has been found. Most previous ....

[Article contains additional citation context not shown here]

P. Panda, H. Nakamura, and A. Nicolau. Augmenting loop tiling with data alignment for improved cache performance. IEEE Transactions on Computers, 48(2), February 1999.


A Data Cache with Dynamic Mapping - D'Alberto, Nicolau, Veidenbaum   Self-citation (Nicolau)   (Correct)

No context found.

P.R. Panda, H. Nakamura, N.D. Dutt, and A. Nicolau. Augmenting loop tiling with data alignment for improved cache performance. IEEE Transactions on Computers, 48(2):142--9, Feb 1999.


SCIMA: A Novel Processor Architecture for High Performance .. - Masaaki Kondo Institute (2000)   Self-citation (Nakamura)   (Correct)

....the location and replacement of data is controlled by hardware in the cache, the required data are sometimes flushed out from the cache unfortunately due to line conflicts, which leads to performance degradation. To solve this problem, good tile size selection algorithm[6] and padding technique[8] have been proposed so far. One of the major disadvantages of these technique, however, is that programs should be rewritten depending on the detail of the cache structure, such as the cache size and the line size, and the data array sizes in the programs. The other disadvantage is that cache ....

P. Panda, H. Nakamura, N. Dutt, and A. Nicolau. Augmenting loop tiling with data alignment for improved cache performance. IEEE Transactions on Computers, 48(2), February 1999.


SCIMA: A Novel Architecture for High Performance Computing - Nakamura, Okawara   Self-citation (Nakamura)   (Correct)

....the location and replacement of data is controlled by hardware in the cache, the required data are sometimes flushed out unfortunately from the cache due to line conflicts, which leads to performance degradation. To solve this problem, good tile size selection algorithm[6] and padding technique[7] have been proposed so far. One of the major disadvantages of these technique, however, is that programs should be rewritten depending on the detail of the cache structure, such as the cache size and the line size, and the data array sizes in the programs. The other disadvantage is that cache ....

P. Panda, H. Nakamura, N. Dutt, and A. Nicolau. Augmenting loop tiling with data alignment for improved cache performance. IEEE Transactions on Computers, February 1999.


Software Methods to Improve Data Locality and Cache Behavior - Beyls (2004)   (Correct)

No context found.

P. Panda, H. Nakamura, N. Dutt, and A. Nicolau. Augmenting loop tiling with data alignment for improved cache performance. IEEE transactions on computers, 48(2):142--149, Feb 1999.


Optimizing Matrix Multiplication with a Classifier Learning.. - Li, Garzaran   (Correct)

No context found.

P. Panda, H. Nakamura, N. Dutt, and A. Nicolau. Augmenting Loop Tiling with Data Alignment for Improved Cache Performance. IEEE Trans. on Computers, 48(2):142--149, February 1999.


Analysis of Memory Access Behavior of DIS Stressmark - Suite And Optimization   (Correct)

No context found.

Panda, P.R., Nakamura, H., Dutt, N.D., Nicolau, A. Augmenting Loop Tiling With Data Alignment For Improved Cache Performance. IEEE Transactions on Computers, vol.48, (no.2), IEEE, Feb. 1999. p.142-9.


A Quantitative Analysis of Tile Size Selection Algorithms - Hsu, Kremer   (Correct)

No context found.

P. Panda, H. Nakamura, and A. Nicolau. Augmenting loop tiling with data alignment for improved cache performance. IEEE Transactions on Computers, 48(2), February 1999.


A Stable and Efficient Loop Tiling Algorithm - Hsu, Kremer (2000)   (1 citation)  (Correct)

No context found.

P. Panda, H. Nakamura, and A. Nicolau. Augmenting loop tiling with data alignment for improved cache performance. IEEE Transactions on Computers, 48(2), February 1999.


Analysis of Memory Behavior of DIS Stressmark Suite and.. - Haitao Du November   (Correct)

No context found.

Panda, P.R., Nakamura, H., Dutt, N.D., Nicolau, A. Augmenting Loop Tiling With Data Alignment For Improved Cache Performance. IEEE Transactions on Computers, vol.48, (no.2), IEEE, Feb. 1999. p.142-9.


Code placement in Hardware Software Co synthesis to improve.. - Parameswaran (2001)   (Correct)

No context found.

P. R. Panda, H. Nakamura, N. D. Dutt, and A. Nicolau, "Augmenting loop tiling with data alignment for improved cache performance," IEEE Transactions on Computers, vol. 48, pp. 142-149, 1999.

CiteSeer.IST - Copyright Penn State and NEC