## Tiling Optimizations for 3D Scientific Computations (2000)

Citations: 68 (4 self)

### Citations

795 | A data locality optimizing algorithm
- Wolf, Lam
- 1991
Citation Context ...in hiding the complexities of the memory hierarchy from scientists and engineers. Compilers may either rearrange the computation through loop transformations (e.g., loop permutation, fusion, fission) =-=[21, 33]-=-, or change the layout of data through data transformations (e.g., padding, transpose) [1, 16, 24]. Experiments show compilers can improve the performance of many benchmark programs, sometimes dramat... |

567 | The cache performance and optimizations of blocked algorithms
- Lam, Rothberg, et al.
- 1991
Citation Context ...ng cache lines to be flushed from cache before they may be reused, despite overall sufficient capacity in the cache. Conflict misses have been shown to severely degrade the performance of tiled codes =-=[20]-=-. In Section 2 we found that to improve reuse, the array tile must fit in cache and the tile size should be favorable according to the cost function. In this section we also show that the array tile s... |

329 | Improving data locality with loop transformations
- McKinley, Carr, et al.
- 1996
Citation Context ...in hiding the complexities of the memory hierarchy from scientists and engineers. Compilers may either rearrange the computation through loop transformations (e.g., loop permutation, fusion, fission) =-=[21, 33]-=-, or change the layout of data through data transformations (e.g., padding, transpose) [1, 16, 24]. Experiments show compilers can improve the performance of many benchmark programs, sometimes dramat... |

263 | Supernode partitioning
- Irigoin, Triolet
- 1988
Citation Context ...ns for 3D stencils. Tiling (blocking) is a transformation which combines strip-mining with loop permutation to form small tiles of loop iterations which are executed together to exploit data locality =-=[2, 15, 36]-=-. 2.1 Tiling for stencil codes Tiling has been shown to be very effective for linear algebra codes. Because they perform O(n^3) computations over O(n^2) data, tiling can exploit O(n) temporal reuse ... |

250 | Strategies for cache and local memory management by global program transformation
- Gannon, Jalby, et al.
- 1988
Citation Context ...m provide a concise definition and summary of important types of data locality [33]. Computation-reordering transformations such as loop permutation and tiling are the primary optimization techniques =-=[9, 21, 33]-=-, though loop fission (distribution) and loop fusion have also been found to be helpful [21]. Data layout optimizations such as padding and transpose have been shown to be useful in eliminating confli... |

231 | Tile size selection using cache organization and data layout
- Coleman, McKinley
- 1995
Citation Context ...yield large improvements since they also perform O(n^3) computation over O(n^2) data. To exploit temporal reuse, compilers can simply skew the time-step loop with respect to the inner stencil loops =-=[33, 6, 3, 23]-=-. Unfortunately, we believe these simple stencil kernels are in fact over-simplified. A more realistic stencil code would actually have multiple loop nests within the timestep loop, in order to actual... |

205 | More iteration space tiling
- Wolfe
- 1989
Citation Context ...ns for 3D stencils. Tiling (blocking) is a transformation which combines strip-mining with loop permutation to form small tiles of loop iterations which are executed together to exploit data locality =-=[2, 15, 36]-=-. 2.1 Tiling for stencil codes Tiling has been shown to be very effective for linear algebra codes. Because they perform O(n^3) computations over O(n^2) data, tiling can exploit O(n) temporal reuse ... |

176 | Unifying Data and Control Transformations for Distributed Shared Memory
- Cierniak, Li
- 1995
Citation Context ...adding and transpose have been shown to be useful in eliminating conflict misses and improving spatial locality [1, 17, 24, 25]. Data transformations have also been combined with loop transformations =-=[5, 16]-=-. Several cache capacity estimation techniques have been proposed to help guide data locality optimizations [9, 33]. These techniques can also be enhanced to take into account limited cache associativ... |

164 | Cache miss equations: a compiler framework for analyzing and tuning memory behavior
- Ghosh, Martonosi, et al.
- 1999
Citation Context ...can also be enhanced to take into account limited cache associativity [8, 30]. More recently, Ghosh et al. developed symbolic cache representation which are highly accurate in predicting cache misses =-=[11, 12, 13]-=-. Their cache miss equations can be used to predict the number of cache misses for a computation, and also be used to guide compiler transformations such as tiling [14]. A number of researchers have i... |

162 | Combining loop transformations considering caches and scheduling
- Wolf, Maydan, et al.
- 1996
Citation Context ... of the cache A second method called effective cache size estimates conflicts which may arise in cache, then computes the size of a subset of cache, which is a small fraction of the actual cache size =-=[28, 34]-=-. The compiler then simply chooses smaller tiles that target the effective cache size. Experimental evaluations seem to indicate the effective cache size is close to 10% for tiled codes [26, 34]. This... |

154 | Data-centric multi-level blocking
- Kodukula, Ahmed, et al.
- 1997
Citation Context ...her a tile should be copied to a contiguous buffer [31]. Kodukula et al. present a technique called data shackling which is very effective at generating tiled code for 2D complex linear algebra codes =-=[18]-=-. They implemented it in the SGI compiler and demonstrated its effectiveness compared to the commercial compiler [19]. Sarkar describes data locality optimizations used in the IBM XL Fortran compilers... |

133 | Data transformation for eliminating conflict misses
- Rivera, Tseng
- 1998
Citation Context ... either rearrange the computation through loop transformations (e.g., loop permutation, fusion, fission) [21, 33], or change the layout of data through data transformations (e.g., padding, transpose) =-=[1, 16, 24]-=-. Experiments show compilers can improve the performance of many benchmark programs, sometimes dramatically. An important class of scientific programs attempt to compute solutions to partial differen... |

132 | Iteration space tiling for memory hierarchies
- Wolfe
- 1987
Citation Context ...guide compiler transformations such as tiling [14]. A number of researchers have investigated tiling as a means of exploiting reuse. Tiling was first proposed by Irigoin and Triolet [15] and Wolfe =-=[35, 36]-=-. Lam, Rothberg, and Wolf show conflict misses can severely degrade the performance of 2D tiling [20]. Wolf and Lam analyze temporal and spatial reuse, and apply tiling when necessary to capture outer... |

121 | New Tiling Techniques to Improve Cache Temporal Locality
- Song, Li
- 1999
Citation Context ...later sections, existing methods such as the algorithm of Wolf and Lam [33] can have only a minor impact on 3D stencils. Approaches devised specifically for stencils such as the method of Song and Li =-=[29]-=- are neither directly applicable in the 3D case nor extensible to more specialized multigrid applications. In this paper we examine the problems inhibiting locality for 3D stencil codes. We show that ... |

115 | Cache miss equations: An analytical representation of cache misses
- Ghosh, Martonosi, et al.
- 1997
Citation Context ...can also be enhanced to take into account limited cache associativity [8, 30]. More recently, Ghosh et al. developed symbolic cache representation which are highly accurate in predicting cache misses =-=[11, 12, 13]-=-. Their cache miss equations can be used to predict the number of cache misses for a computation, and also be used to guide compiler transformations such as tiling [14]. A number of researchers have i... |

113 | To Copy or Not to Copy: A Compile-Time Technique for Assessing When Data Copying Should Be Used to Eliminate Cache Conflicts, Supercomputing '93
- Temam, et al.
- 1993
Citation Context ...sider each approach, developing several strategies and heuristics which we then evaluate in Section 4. 3.1 Copy optimization Copying tiles into contiguous buffers is one method for avoiding conflicts =-=[20, 26, 31]-=-. It works well for linear algebra codes because each tile can be reused a large number of times, amortizing the overhead of performing the copy. In matrix multiplication for instance, copying costs a... |

104 | On Estimating and Enhancing Cache Effectiveness
- Ferrante, Sarkar, et al.
- 1991
Citation Context ...veral cache capacity estimation techniques have been proposed to help guide data locality optimizations [9, 33]. These techniques can also be enhanced to take into account limited cache associativity =-=[8, 30]-=-. More recently, Ghosh et al. developed symbolic cache representation which are highly accurate in predicting cache misses [11, 12, 13]. Their cache miss equations can be used to predict the number of... |

99 | Compiler Blockability of Numerical Algorithms
- Carr, Kennedy
Citation Context ...ns for 3D stencils. Tiling (blocking) is a transformation which combines strip-mining with loop permutation to form small tiles of loop iterations which are executed together to exploit data locality =-=[2, 15, 36]-=-. 2.1 Tiling for stencil codes Tiling has been shown to be very effective for linear algebra codes. Because they perform O(n^3) computations over O(n^2) data, tiling can exploit O(n) temporal reuse ... |

87 | Precise miss analysis for program transformations with caches of arbitrary associativity
- Ghosh, Martonosi, et al.
- 1998
Citation Context ...of non-conflicting tile dimensions [6, 26]. A cost function is used to select the tiles preserving the most reuse. A search space algorithm using a very precise cache model can obtain similar results =-=[12, 14]-=-. To efficiently compute non-conflicting tile dimensions for 3D arrays we introduce Euc3D, an extension to the Euc algorithm given in [26]. The pseudocode in Figure 9 presents an overview of Euc3D. Li... |

84 | Cache interference phenomena
- Temam, Fricker, et al.
- 1994

64 | Quantifying the multi-level nature of tiling interactions
- Mitchell, Högstedt, et al.
- 1998
Citation Context ...based on cost models for estimating capacity and cross-interference misses [3]. Mitchell et al. discussed the interactions of multi-level tiling for several goals, such as cache, TLB, and parallelism =-=[22]-=-. They found explicitly considering multiple levels of the memory hierarchy (cache and TLB) led to the choice of compromise tile sizes which can yield significant improvements in performance. Followin... |

58 | Improving locality using loop and data transformations in an integrated framework
- Kandemir, Choudhary, et al.
- 1998
Citation Context ... either rearrange the computation through loop transformations (e.g., loop permutation, fusion, fission) [21, 33], or change the layout of data through data transformations (e.g., padding, transpose) =-=[1, 16, 24]-=-. Experiments show compilers can improve the performance of many benchmark programs, sometimes dramatically. An important class of scientific programs attempt to compute solutions to partial differen... |

55 | A comparison of compiler tiling algorithms
- Rivera, Tseng
- 1999
Citation Context ...sider each approach, developing several strategies and heuristics which we then evaluate in Section 4. 3.1 Copy optimization Copying tiles into contiguous buffers is one method for avoiding conflicts =-=[20, 26, 31]-=-. It works well for linear algebra codes because each tile can be reused a large number of times, amortizing the overhead of performing the copy. In matrix multiplication for instance, copying costs a... |

48 | Augmenting loop tiling with data alignment for improved cache performance
- Panda, Nakamura, et al.
- 1999
Citation Context ...yield large improvements since they also perform O(n^3) computation over O(n^2) data. To exploit temporal reuse, compilers can simply skew the time-step loop with respect to the inner stencil loops =-=[33, 6, 3, 23]-=-. Unfortunately, we believe these simple stencil kernels are in fact over-simplified. A more realistic stencil code would actually have multiple loop nests within the timestep loop, in order to actual... |

36 | Eliminating conflict misses for high performance architectures
- Rivera, Tseng
- 1998
Citation Context ...individual loop nests [21, 33]. Tiling is usually not needed, since most locality can be obtained through loop permutation, though in some cases array padding may be necessary to preserve group reuse =-=[25]-=-. Finally, we consider multigrid codes, a third type of stencil code where even these locality transformations are not possible. Multigrid codes are also iterative PDE solvers, but speed up convergenc... |

35 | A compiler algorithm for optimizing locality in loop nests
- Kandemir, Ramanujam, et al.
- 1997
Citation Context ...loop fusion have also been found to be helpful [21]. Data layout optimizations such as padding and transpose have been shown to be useful in eliminating conflict misses and improving spatial locality =-=[1, 17, 24, 25]-=-. Data transformations have also been combined with loop transformations [5, 16]. Several cache capacity estimation techniques have been proposed to help guide data locality optimizations [9, 33]. The... |

34 | A tile selection algorithm for data locality and cache interference
- Chame, Moon
- 1999
Citation Context ...yield large improvements since they also perform O(n^3) computation over O(n^2) data. To exploit temporal reuse, compilers can simply skew the time-step loop with respect to the inner stencil loops =-=[33, 6, 3, 23]-=-. Unfortunately, we believe these simple stencil kernels are in fact over-simplified. A more realistic stencil code would actually have multiple loop nests within the timestep loop, in order to actual... |

33 | Transforming loops to recursion for multi-level memory hierarchies
- Yi, Adve, et al.
- 2000
Citation Context ...natively, recursive divide-and-conquer algorithms can be used to obtain many of the benefits of tiling [10]. Compiler analysis can even automatically transform loop nests and generate recursive codes =-=[38]-=-. Finally, Wei et al. explored program transformations for improving the performance numerical algorithms on hierarchical memories [32]. They applied tiling and padding transformations by hand based o... |

30 | Locality optimizations for multi-level caches
- Rivera, Tseng
- 1999
Citation Context ... actual program performance on a 360 MHz Sun UltraSparc2 platform. Though tiling only targeted the L1 cache, we also expect indirect improvements in L2 cache performance as shown by previous research =-=[27]-=-. Cache miss rates were simulated for the 16K L1 and 2M L2 direct-mapped caches present in this architecture. To compare original and optimized performance thoroughly, we varied problem sizes over a r... |

28 | A compiler framework for restructuring data declarations to enhance cache and TLB effectiveness
- Bacon, Chow, et al.
- 1994
Citation Context ... either rearrange the computation through loop transformations (e.g., loop permutation, fusion, fission) [21, 33], or change the layout of data through data transformations (e.g., padding, transpose) =-=[1, 16, 24]-=-. Experiments show compilers can improve the performance of many benchmark programs, sometimes dramatically. An important class of scientific programs attempt to compute solutions to partial differen... |

25 | Architecture-Cognizant Divide and Conquer Algorithms
- Gatlin, Carter
- 1999
Citation Context ...misses in tiled codes by storing data accessed in each tile using space-filling curves [4]. Alternatively, recursive divide-and-conquer algorithms can be used to obtain many of the benefits of tiling =-=[10]-=-. Compiler analysis can even automatically transform loop nests and generate recursive codes [38]. Finally, Wei et al. explored program transformations for improving the performance numerical algorith... |

24 | Memory characteristics of iterative methods
- Weiss, Karl, et al.
- 1999
Citation Context ...[Figure 4: Access pattern for 3D Jacobi] As 3D stencil codes become widespread, numerical analysts discover that they have particularly poor memory behavior with respect to microprocessor caches =-=[32]-=-. Ideally, applications can minimize cache misses by bringing data into cache just once for all of its multiple accesses. 3D PDE solvers suffer poor cache performance because accesses to the same data... |

15 | Time skewing for parallel computers
- Wonnacott
- 2000
Citation Context ...om the time-step loop, the compiler must perform much more complex analyses and transformations to match and fuse tiles from the multiple loop nests. Such techniques have been developed for 2D arrays =-=[29, 37]-=-, but not for 3D. Instead, most optimizations have focused on exploiting temporal and spatial reuse within individual loop nests [21, 33]. Tiling is usually not needed, since most locality can be obta... |

11 | Automated cache optimizations using CME driven diagnosis
- Ghosh, Martonosi, et al.
Citation Context ...of non-conflicting tile dimensions [6, 26]. A cost function is used to select the tiles preserving the most reuse. A search space algorithm using a very precise cache model can obtain similar results =-=[12, 14]-=-. To efficiently compute non-conflicting tile dimensions for 3D arrays we introduce Euc3D, an extension to the Euc algorithm given in [26]. The pseudocode in Figure 9 presents an overview of Euc3D. Li... |

6 | An experimental evaluation of tiling and shackling for memory hierarchy management
- Kodukula, Pingali
- 1999
Citation Context ...ich is very effective at generating tiled code for 2D complex linear algebra codes [18]. They implemented it in the SGI compiler and demonstrated its effectiveness compared to the commercial compiler =-=[19]-=-. Sarkar describes data locality optimizations used in the IBM XL Fortran compilers, including loop transformations and tiling [28]. Chame and Moon propose tiling algorithms for choosing tile sizes ba... |

6 | Automatic selection of higher order transformations in the IBM XL Fortran compilers
- Sarkar
- 1997
Citation Context ... of the cache A second method called effective cache size estimates conflicts which may arise in cache, then computes the size of a subset of cache, which is a small fraction of the actual cache size =-=[28, 34]-=-. The compiler then simply chooses smaller tiles that target the effective cache size. Experimental evaluations seem to indicate the effective cache size is close to 10% for tiled codes [26, 34]. This... |

1 | Nonlinear array layouts for hierarchical memory systems
- Chatterjee, Jain, et al.
- 1999
Citation Context ...sizes directly for different pads. Chatterjee et al. demonstrate that nonlinear array layouts can avoid conflict misses in tiled codes by storing data accessed in each tile using space-filling curves =-=[4]-=-. Alternatively, recursive divide-and-conquer algorithms can be used to obtain many of the benefits of tiling [10]. Compiler analysis can even automatically transform loop nests and generate recursive... |

1 | Improving data locality for caches. Master's thesis
- Esseghir
- 1993
Citation Context ...e temporal and spatial reuse, and apply tiling when necessary to capture outer loop reuse [33], Esseghir proposed using tall tiles consisting of the maximum number of array columns which fit in cache =-=[7]-=-. Coleman and McKinley select rectangular non-conflicting tile sizes [6] while others focus on using a portion of cache [34]. Temam et al. analyze the program to determine whether a tile should be cop... |
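
Several of the citation contexts above describe tiling as strip-mining combined with loop permutation, and note that it works well for linear algebra codes that perform O(n^3) computation over O(n^2) data. The following is a minimal sketch of that idea for matrix multiplication, not the method of the paper or of any cited work. The function names `matmul_naive` and `matmul_tiled` and the `tile` parameter are hypothetical, and the sketch tiles only the `j` and `k` loops with a single square tile size, ignoring the conflict-miss and tile-size-selection issues the cited papers address.

```python
def matmul_naive(a, b, n):
    """Reference triple loop: c = a * b for n x n lists of lists."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile):
    """Same computation with the j and k loops strip-mined by `tile`.

    The two tile-controlling loops (jj, kk) are permuted outward, so each
    tile-by-tile block of b is reused across all rows i before moving on.
    `tile` is an illustrative parameter, not a recommended value.
    """
    c = [[0.0] * n for _ in range(n)]
    for jj in range(0, n, tile):          # strip-mined j loop (tile origin)
        for kk in range(0, n, tile):      # strip-mined k loop (tile origin)
            for i in range(n):
                # min() handles a partial tile when tile does not divide n
                for j in range(jj, min(jj + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        c[i][j] += a[i][k] * b[k][j]
    return c
```

Both functions accumulate each `c[i][j]` over `k` in the same increasing order, so the tiled version only changes the order in which array elements are visited, which is the property tiling relies on to improve reuse.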
