## Nonlinear Array Layouts for Hierarchical Memory Systems (1999)

Citations: 74 (5 self)

### Citations

1153 |
Advanced Compiler Design and Implementation
- Muchnick
- 1997
Citation Context ... loop nest. Loop tiling is usually accompanied by loop unrolling and software pipelining to enable better use of registers and the processor pipeline. The theory of these techniques is well-developed [42], and implementations This work supported in part by DARPA Grant DABT63-98-1-0001, NSF Grants CDA-97-2637 and CDA-95-12356, NSF Career Award MIP-97-02547, The University of North Carolina at Chapel Hi... |

861 |
A set of level 3 basic linear algebra subprograms
- Dongarra, Croz, et al.
- 1990
Citation Context ... techniques are available as optimization options in several high-performance compilers. Tiling techniques are also often applied at the source level in numerical libraries, e.g., in the level-3 BLAS [15] and in LAPACK [1]. In this paper, we assume that loop tiling has been performed either by the programmer or by the compiler, and examine the additional performance gains achievable using nonlinear ar... |

795 | A data locality optimizing algorithm
- Wolf, Lam
- 1991
Citation Context ...te in Section 3.4 that such conversions can indeed be performed efficiently. Tile size selection The tile size chosen for loop tiling has a significant impact on performance, and previous work (e.g., [2, 14, 16, 18, 60]) has addressed various questions related to tile size selection. The nonlinear layout schemes are based on the result that a small contiguous tile exhibits no self-interference misses. This makes the... |

782 | ATOM: A System for Building Customized Program Analysis Tools
- Srivastava, Eustace
- 1994
Citation Context ...excluding such things as allocation and initialization) using the getrusage() system call; and we execute multiple trials to further reduce measurement error. We evaluate cache performance using ATOM [50] and TLB performance using fast-cache [39]. The following sections highlight our major results. The data presented below is a highly condensed version of our complete data, due to length limitations. ... |

600 |
Computer Solution of Large Sparse Positive Definite Systems
- George, Liu
- 1981
Citation Context ...based data structures, such as trees [12] and heaps [34, 36, 37]; for profile-driven object placement [8]; for matrices with special structure (e.g., banded matrices in LAPACK [1], or sparse matrices [19]); and in parallel computing [5, 27, 28, 44, 49, 59]. But when working with general dense matrices in a uniprocessor environment, most programmers are reluctant to alter the default row-major or colum... |

567 | The cache performance and optimizations of blocked algorithms
- Lam, Rothberg, et al.
- 1991
Citation Context ...d non-smooth function of the array size, the tile size, and the cache parameters [18]. These considerations lead us to investigate other array layout functions. We begin with the result of Lam et al. [35] that a $t_R \times t_C$ array that is contiguous in memory and fits in cache causes no self-interference misses. If we fix values for $t_R$ and $t_C$, we can now conceptually view our original $m \times n$ ... |

514 |
Gaussian elimination is not optimal
- Strassen
- 1969
Citation Context ...situation. Name Description BLAS? Layouts used LCM L 4D LMO BMXM Tiled 6-loop matrix multiplication p p p p RECMXM Recursive matrix multiplication [17] p p p STRASSEN Strassen's matrix multiplication [53] p p p CHOL Right-looking Cholesky factorization [32] p p STDHAAR Standard wavelet compression of image (Haar basis) [52] p p p NONHAAR Non-standard wavelet compression of image (Haar basis) [52] p p ... |

503 |
LAPACK User’s Guide
- Anderson, Bai, et al.
- 1992
Citation Context ...ilable as optimization options in several high-performance compilers. Tiling techniques are also often applied at the source level in numerical libraries, e.g., in the level-3 BLAS [15] and in LAPACK [1]. In this paper, we assume that loop tiling has been performed either by the programmer or by the compiler, and examine the additional performance gains achievable using nonlinear array layout functio... |

357 |
Space-Filling curves
- Sagan
- 1994
Citation Context ... $t'_R),\ U(t_j, k_C, t_C, t'_C))$ 2.2 The Morton layout, $L_{MO}$ Our second nonlinear layout function has been variously described as being based either on quadtrees [17] or on space-filling curves [25, 43, 48]. This layout is known in parallel computing as the Morton ordering and has been used for load balancing purposes [5, 27, 28, 44, 49, 59]. It has also been applied for bandwidth reduction in informati... |

320 | Evaluating Associativity in CPU Caches - Hill, Smith - 1989 |

277 |
Wavelets for Computer Graphics: Theory and Applications
- Stollnitz, DeRose, et al.
- 1996
Citation Context ...ive matrix multiplication [17] p p p STRASSEN Strassen's matrix multiplication [53] p p p CHOL Right-looking Cholesky factorization [32] p p STDHAAR Standard wavelet compression of image (Haar basis) [52] p p p NONHAAR Non-standard wavelet compression of image (Haar basis) [52] p p p Table 1: Description of benchmark suite. Parameter Ultra 10 Ultra 60 Miata CPU UltraSPARC-IIi UltraSPARC-II Alpha 21164... |

261 | Optimizing matrix multiply using PHiPAC: A portable high-performance ANSI C methodology
- Bilmes, Asanovic, et al.
- 1997
Citation Context ...layout in scientific libraries and applications, and work in the compiler community related to tiling for parallelism and cache optimizations. Scientific libraries and applications The PHiPAC project [7] aims at producing highly tuned code for specific BLAS 3 [15] kernels such as matrix multiplication that are tiled for multiple levels of the memory hierarchy. Their approach to generating an efficien... |

231 | Tile size selection using cache organization and data layout
- Coleman, McKinley
- 1995
Citation Context ...te in Section 3.4 that such conversions can indeed be performed efficiently. Tile size selection The tile size chosen for loop tiling has a significant impact on performance, and previous work (e.g., [2, 14, 16, 18, 60]) has addressed various questions related to tile size selection. The nonlinear layout schemes are based on the result that a small contiguous tile exhibits no self-interference misses. This makes the... |

218 | Compiler optimizations for improving data locality
- Carr, McKinley, et al.
- 1994
Citation Context ...mputations. Tiling and related work The compiler literature contains much work on iteration space tiling. Some authors aim at gaining parallelism [61], while others target improving cache performance [9, 60]. Kodukula et al. [32] present a data-centric approach to loop tiling called "shackling" that handles imperfect loop nests and can be composed to tile for multiple levels of the memory hierarchy. Cart... |

216 |
Linear clustering of objects with multiple attributes
- Jagadish
- 1990
Citation Context ...used for load balancing purposes [5, 27, 28, 44, 49, 59]. It has also been applied for bandwidth reduction in information theory [6], for graphics applications [23, 38], and for database applications [29]. Figure 1(d) illustrates this layout. Morton ordering has the following operational interpretation. Divide the original matrix into four quadrants, and lay out these submatrices in memory in the orde... |

205 | More iteration space tiling
- Wolfe
- 1989
Citation Context ...uring of the control flow of a program and of the data structures it uses results in the highest level of performance. We focus on dense matrix codes for which loop tiling (also called loop blocking) [61] is an appropriate means of high-level control flow restructuring to improve locality. Loop tiling is a program transformation that tessellates the iteration space of a loop nest with uniform tiles of... |

197 | A parallel hashed oct-tree n-body algorithm
- Warren, Salmon
- 1993
Citation Context ...ees [12] and heaps [34, 36, 37]; for profile-driven object placement [8]; for matrices with special structure (e.g., banded matrices in LAPACK [1], or sparse matrices [19]); and in parallel computing [5, 27, 28, 44, 49, 59]. But when working with general dense matrices in a uniprocessor environment, most programmers are reluctant to alter the default row-major or column-major linearization of multidimensional arrays tha... |

177 | Data and computation transformations for multiprocessors
- Anderson, Amarasinghe, et al.
- 1995
Citation Context ...te in Section 3.4 that such conversions can indeed be performed efficiently. Tile size selection The tile size chosen for loop tiling has a significant impact on performance, and previous work (e.g., [2, 14, 16, 18, 60]) has addressed various questions related to tile size selection. The nonlinear layout schemes are based on the result that a small contiguous tile exhibits no self-interference misses. This makes the... |

176 | Unifying Data and Control Transformations for Distributed Shared Memory
- Cierniak, Li
- 1995
Citation Context ...out functions that are linear and monotonically increasing in the arguments i and j (such functions certainly being easy to compute), it is easy to prove that there are only two such layout functions [13]: the row-major layout $L_{RM}$ as used in Pascal, given by $L_{RM}(i, j, m, n) = n \cdot i + j$; and the column-major layout $L_{CM}$ as used in Fortran, given by $L_{CM}(i, j, m, n) = m \cdot j + i$. We refer to thes... |

161 | Cache-conscious data placement
- Calder, Krintz, et al.
- 1998
Citation Context ...ache-conscious" or "memory-friendly". Such restructuring techniques have been studied for pointer-based data structures, such as trees [12] and heaps [34, 36, 37]; for profile-driven object placement [8]; for matrices with special structure (e.g., banded matrices in LAPACK [1], or sparse matrices [19]); and in parallel computing [5, 27, 28, 44, 49, 59]. But when working with general dense matrices in... |

154 | Data-centric multi-level blocking
- Kodukula, Ahmed, et al.
- 1997
Citation Context ...4D LMO BMXM Tiled 6-loop matrix multiplication p p p p RECMXM Recursive matrix multiplication [17] p p p STRASSEN Strassen's matrix multiplication [53] p p p CHOL Right-looking Cholesky factorization [32] p p STDHAAR Standard wavelet compression of image (Haar basis) [52] p p p NONHAAR Non-standard wavelet compression of image (Haar basis) [52] p p p Table 1: Description of benchmark suite. Parameter ... |

135 |
Software Methods for Improvement of Cache Performance
- Porterfield
- 1989
Citation Context ...ests and can be composed to tile for multiple levels of the memory hierarchy. Carter et al. [10] discuss hierarchical tiling schemes for a hierarchical shared memory model. Porterfield's dissertation [45] discusses program transformations and software pre-fetching techniques to improve the cache behavior of scientific codes. Lam, Rothberg, and Wolf [35] discuss the importance of cache optimizations fo... |

133 | Data transformation for eliminating conflict misses
- Rivera, Tseng
- 1998
Citation Context ...esulting in non-square tiles. They claim that their method outperforms the copy optimization recommended by Lam et al. [35]. We have compared our approach with theirs in Section 3.5. Rivera and Tseng [46, 47] discuss intra- and inter-array padding as a means of reducing conflict misses. Ghosh et al. [20, 21] present an analytical model for estimating cache misses for perfect loop nests. A substantial body... |

132 |
Sur une courbe qui remplit toute une aire plane
- Peano
Citation Context ... $t'_R),\ U(t_j, k_C, t_C, t'_C))$ 2.2 The Morton layout, $L_{MO}$ Our second nonlinear layout function has been variously described as being based either on quadtrees [17] or on space-filling curves [25, 43, 48]. This layout is known in parallel computing as the Morton ordering and has been used for load balancing purposes [5, 27, 28, 44, 49, 59]. It has also been applied for bandwidth reduction in informati... |

131 |
Loop Transformations for Restructuring Compilers: The Foundations
- Banerjee
- 1992
Citation Context ...essive elements of an array row. Such low spatial locality can usually be corrected by appropriate loop transformations (such as interchange, reversal, or skewing) when such transformations are legal [4]. Second, for large matrix sizes, it may even reduce the effectiveness of translation lookaside buffers (TLBs), because the dilation effect extends to virtual memory pages [3, 54]. Finally, it may cau... |

115 | Cache miss equations: An analytical representation of cache misses
- Ghosh, Martonosi, et al.
- 1997
Citation Context ...ded by Lam et al. [35]. We have compared our approach with theirs in Section 3.5. Rivera and Tseng [46, 47] discuss intra- and inter-array padding as a means of reducing conflict misses. Ghosh et al. [20, 21] present an analytical model for estimating cache misses for perfect loop nests. A substantial body of work in the parallel computing literature deals with layout optimization of arrays. Representativ... |

113 |
Surpassing the TLB Performance of Superpages with Less Operating System Support
- Talluri, Hill
- 1994
Citation Context ...ry fully-associative TLB reveal that nonlinear layout reduces the TLB miss ratio to 0.01% compared to 2.5% for the canonical layout. We note that TLB performance could be improved by using superpages [55] to map the entire array with a single TLB entry. 3.8 Summary The experimental data support our claim that nonlinear layouts provide significant performance benefits for dense matrix codes. By making ... |

107 | Automatic Data Partitioning on Distributed Memory Multicomputers
- Gupta
- 1992
Citation Context ...oosing good data decompositions in languages such as High Performance Fortran [33]. We expect that the techniques for layout optimization developed in the vector and parallel computing context (e.g., [11, 24, 30, 31, 40]) can be adapted to the hierarchical memory situation. Name Description BLAS? Layouts used LCM L 4D LMO BMXM Tiled 6-loop matrix multiplication p p p p RECMXM Recursive matrix multiplication [17] p p ... |

87 | Precise miss analysis for program transformations with caches of arbitrary associativity
- Ghosh, Martonosi, et al.
- 1998
Citation Context ...ded by Lam et al. [35]. We have compared our approach with theirs in Section 3.5. Rivera and Tseng [46, 47] discuss intra- and inter-array padding as a means of reducing conflict misses. Ghosh et al. [20, 21] present an analytical model for estimating cache misses for perfect loop nests. A substantial body of work in the parallel computing literature deals with layout optimization of arrays. Representativ... |

85 | Auto-blocking matrix-multiplication or tracking blas3 performance with source code
- Frens, Wise
- 1997
Citation Context ...CM$(f_i, f_j, U(t_i, k_R, t_R, t'_R),\ U(t_j, k_C, t_C, t'_C))$ 2.2 The Morton layout, $L_{MO}$ Our second nonlinear layout function has been variously described as being based either on quadtrees [17] or on space-filling curves [25, 43, 48]. This layout is known in parallel computing as the Morton ordering and has been used for load balancing purposes [5, 27, 28, 44, 49, 59]. It has also been appl... |

79 | Footprints in the cache - Stone, Thiebaut - 1986 |

72 |
Space-Filling Curves: Their Generation and Their Application to Bandwidth Reduction
- Bially
- 1969
Citation Context ...out is known in parallel computing as the Morton ordering and has been used for load balancing purposes [5, 27, 28, 44, 49, 59]. It has also been applied for bandwidth reduction in information theory [6], for graphics applications [23, 38], and for database applications [29]. Figure 1(d) illustrates this layout. Morton ordering has the following operational interpretation. Divide the original matrix ... |

65 | The influence of caches on the performance of heaps
- LaMarca, Ladner
- 1996
Citation Context ...ly to restructure multidimensional arrays to be "cache-conscious" or "memory-friendly". Such restructuring techniques have been studied for pointer-based data structures, such as trees [12] and heaps [34, 36, 37]; for profile-driven object placement [8]; for matrices with special structure (e.g., banded matrices in LAPACK [1], or sparse matrices [19]); and in parallel computing [5, 27, 28, 44, 49, 59]. But wh... |

61 | Dynamic partitioning of non-uniform structured workloads with space-filling curves. (submitted to
- Pilkington, Baden
- 1995
Citation Context ...ees [12] and heaps [34, 36, 37]; for profile-driven object placement [8]; for matrices with special structure (e.g., banded matrices in LAPACK [1], or sparse matrices [19]); and in parallel computing [5, 27, 28, 44, 49, 59]. But when working with general dense matrices in a uniprocessor environment, most programmers are reluctant to alter the default row-major or column-major linearization of multidimensional arrays tha... |

57 | Empirical Evaluation of the CRAY-T3D: A Compiler Perspective
- Arpaci, Culler, et al.
- 1995
Citation Context ...ransformations are legal [4]. Second, for large matrix sizes, it may even reduce the effectiveness of translation lookaside buffers (TLBs), because the dilation effect extends to virtual memory pages [3, 54]. Finally, it may cause cache misses due to self-interference even when a tiled loop repeatedly accesses a small tile in the array index space, because the canonical layout depends on the matrix size ... |

53 | Recursive array layouts and fast parallel matrix multiplication
- Chatterjee, Lebeck, et al.
- 1999
Citation Context ... There is, in fact, a family of Morton and Morton-like layout functions, of varying degrees of complexity. The variant described above is more accurately called the Z-Morton layout. Chatterjee et al. [12] discuss the details of this family in greater detail. 2.3 Practical issues The mathematical description of $L_{4D}$ and $L_{MO}$ above glosses over several practical details that are critical for efficient imp... |

51 | Towards a theory of cache-efficient algorithms
- Sen, Chatterjee, et al.
Citation Context ...mpiler support. The observed performance is nonetheless quite respectable. It is our position that the ability to directly manipulate array layout has ramifications all the way up to algorithm design [50, 60], and is not something that compilers alone should manipulate. Replacing one layout by another is simple and easily mechanizable, but determining matching control-flow changes is significantly more com... |

50 | Optimal Evaluation of Array Expressions on Massively Parallel Machines
- Chatterjee, Gilbert, et al.
- 1995
Citation Context ...oosing good data decompositions in languages such as High Performance Fortran [33]. We expect that the techniques for layout optimization developed in the vector and parallel computing context (e.g., [11, 24, 30, 31, 40]) can be adapted to the hierarchical memory situation. Name Description BLAS? Layouts used LCM L 4D LMO BMXM Tiled 6-loop matrix multiplication p p p p RECMXM Recursive matrix multiplication [17] p p ... |

49 | Increasing TLB reach using superpages backed by shadow memory
- Swanson, Stoller, et al.
- 1998
Citation Context ...ransformations are legal [4]. Second, for large matrix sizes, it may even reduce the effectiveness of translation lookaside buffers (TLBs), because the dilation effect extends to virtual memory pages [3, 54]. Finally, it may cause cache misses due to self-interference even when a tiled loop repeatedly accesses a small tile in the array index space, because the canonical layout depends on the matrix size ... |

45 |
Memory Storage Patterns in Parallel Processing. Kluwer international series in engineering and computer science
- Mace
- 1987
Citation Context ...oosing good data decompositions in languages such as High Performance Fortran [33]. We expect that the techniques for layout optimization developed in the vector and parallel computing context (e.g., [11, 24, 30, 31, 40]) can be adapted to the hierarchical memory situation. Name Description BLAS? Layouts used LCM L 4D LMO BMXM Tiled 6-loop matrix multiplication p p p p RECMXM Recursive matrix multiplication [17] p p ... |

41 | Hierarchical Tiling for Improved Superscalar Performance
- Carter, Ferrante, et al.
- 1995
Citation Context ... et al. [32] present a data-centric approach to loop tiling called "shackling" that handles imperfect loop nests and can be composed to tile for multiple levels of the memory hierarchy. Carter et al. [10] discuss hierarchical tiling schemes for a hierarchical shared memory model. Porterfield's dissertation [45] discusses program transformations and software pre-fetching techniques to improve the cache... |

41 |
Über stetige Abbildung einer Linie auf ein Flächenstück
- Hilbert
Citation Context ... $t'_R),\ U(t_j, k_C, t_C, t'_C))$ 2.2 The Morton layout, $L_{MO}$ Our second nonlinear layout function has been variously described as being based either on quadtrees [17] or on space-filling curves [25, 43, 48]. This layout is known in parallel computing as the Morton ordering and has been used for load balancing purposes [5, 27, 28, 44, 49, 59]. It has also been applied for bandwidth reduction in informati... |

41 | Tuning Strassen’s matrix multiplication for memory efficiency
- Thottethodi, Chatterjee, et al.
- 1998
Citation Context ...hich would result in capacity misses). Since tiles are contiguous, there are no self-interference misses. This makes the performance of the leaf-level computations almost insensitive to the tile size [57]. Our scheme is very sensitive to the amount of padding, since it performs redundant computations on the padded portions of the matrices. However, if we choose tile sizes from the range [$T_{\min}$, $T_{\max}$ ... |

36 | Eliminating conflict misses for high performance architectures
- Rivera, Tseng
- 1998
Citation Context ...esulting in non-square tiles. They claim that their method outperforms the copy optimization recommended by Lam et al. [35]. We have compared our approach with theirs in Section 3.5. Rivera and Tseng [46, 47] discuss intra- and inter-array padding as a means of reducing conflict misses. Ghosh et al. [20, 21] present an analytical model for estimating cache misses for perfect loop nests. A substantial body... |

35 | Balancing processor loads and exploiting data locality in N-body simulations
- Banicescu, Hummel
- 1995
Citation Context ...ees [12] and heaps [34, 36, 37]; for profile-driven object placement [8]; for matrices with special structure (e.g., banded matrices in LAPACK [1], or sparse matrices [19]); and in parallel computing [5, 27, 28, 44, 49, 59]. But when working with general dense matrices in a uniprocessor environment, most programmers are reluctant to alter the default row-major or column-major linearization of multidimensional arrays tha... |

34 | High Performance Fortran for highly irregular problems
- Hu, Johnsson, et al.
- 1997

30 |
Active memory: A new abstraction for memory system simulation
- Lebeck, Wood
- 1997
Citation Context ...itialization) using the getrusage() system call; and we execute multiple trials to further reduce measurement error. We evaluate cache performance using ATOM [50] and TLB performance using fast-cache [39]. The following sections highlight our major results. The data presented below is a highly condensed version of our complete data, due to length limitations. 3.1 Performance improvement over canonical... |

30 | An empirical comparison of the Kendall Square Research KSR-1 and Stanford DASH Multiprocessors
- Singh, Joe, et al.
- 1993

29 |
The High Performance Fortran Handbook. Scientific and Engineering Computation
- Koelbel, Loveman, et al.
- 1994
Citation Context ...mmable does raise the obvious question of how one chooses the desired layout. This problem is related to the problem of choosing good data decompositions in languages such as High Performance Fortran [33]. We expect that the techniques for layout optimization developed in the vector and parallel computing context (e.g., [11, 24, 30, 31, 40]) can be adapted to the hierarchical memory situation. Name De... |

25 |
Improving data locality for caches
- Esseghir
- 1993

23 |
Automatic data layout for distributed memory machines
- Kennedy, Kremer
- 1998

21 | Influence of cross-interference on blocked loops: A case study with matrix-vector multiply
- Fricker, Temam, et al.
- 1995
Citation Context ...he canonical layout depends on the matrix size rather than the tile size. Such interference misses are a complicated and non-smooth function of the array size, the tile size, and the cache parameters [18]. These considerations lead us to investigate other array layout functions. We begin with the result of Lam et al. [35] that a $t_R \times t_C$ array that is contiguous in memory and fits in cache cause... |

19 |
Data optimization: Allocation of arrays to reduce communication on SIMD machines
- Knobe, Lukas, Steele Jr.
- 1990

18 | Load balancing and data locality via fractiling: An experimental study
- Hummel, Banicescu, et al.
- 1995

17 |
Optimizing raster storage: an examination of four alternatives
- Goodchild, Grandfield
- 1983
Citation Context ...ing as the Morton ordering and has been used for load balancing purposes [5, 27, 28, 44, 49, 59]. It has also been applied for bandwidth reduction in information theory [6], for graphics applications [23, 38], and for database applications [29]. Figure 1(d) illustrates this layout. Morton ordering has the following operational interpretation. Divide the original matrix into four quadrants, and lay out the... |

17 | Caches and Algorithms
- LaMarca
- 1996
Citation Context ...ly to restructure multidimensional arrays to be "cache-conscious" or "memory-friendly". Such restructuring techniques have been studied for pointer-based data structures, such as trees [12] and heaps [34, 36, 37]; for profile-driven object placement [8]; for matrices with special structure (e.g., banded matrices in LAPACK [1], or sparse matrices [19]); and in parallel computing [5, 27, 28, 44, 49, 59]. But wh... |

16 | Can parallel algorithms enhance serial implementation
- Vishkin
- 1996
Citation Context ...mpiler support. The observed performance is nonetheless quite respectable. It is our position that the ability to directly manipulate array layout has ramifications all the way up to algorithm design [58], and is not something that compilers alone should manipulate. Replacing one layout by another is simple and easily mechanizable, but determining matching control-flow changes is significantly more com... |

15 | Improving pointer-based codes through cache-conscious data placement
- Chilimbi, Larus, et al.
- 1998
Citation Context ...y are less likely to restructure multidimensional arrays to be "cache-conscious" or "memory-friendly". Such restructuring techniques have been studied for pointer-based data structures, such as trees [12] and heaps [34, 36, 37]; for profile-driven object placement [8]; for matrices with special structure (e.g., banded matrices in LAPACK [1], or sparse matrices [19]); and in parallel computing [5, 27, ... |

10 |
Cache Performance Analysis of Algorithms
- LaMarca, Ladner
- 1997
Citation Context ...ly to restructure multidimensional arrays to be "cache-conscious" or "memory-friendly". Such restructuring techniques have been studied for pointer-based data structures, such as trees [12] and heaps [34, 36, 37]; for profile-driven object placement [8]; for matrices with special structure (e.g., banded matrices in LAPACK [1], or sparse matrices [19]); and in parallel computing [5, 27, 28, 44, 49, 59]. But wh... |

10 | Techniques for improving the data locality of iterative methods
- Stals, Rüde
- 1997
Citation Context ... and selecting the code with highest performance. It appears that the code they generate is specialized not only for a specific memory architecture but also for a specific matrix size. Stals and Rüde [51] investigate algorithmic restructuring techniques for improving the cache behavior of iterative methods. They do not investigate nonlinear data reorganization. Frens and Wise [17] provide an implement... |

7 | Report of the working group on storage I/O for large-scale computing
- Gibson, Vitter, et al.
- 1996
Citation Context ...on restructuring by recursion unfolding. They appear to carry the recursion down to the level of single array elements, which causes a dramatic loss of performance. The goal of out-of-core algorithms [22] is related to ours. However, the problems differ in two fundamental ways: the limited associativity of caches and their fixed replacement policies are not relevant for virtual memory systems; and the... |

6 |
Graphical data bases built on Peano space-filling curves
- Laurini
- 1985
Citation Context ...ing as the Morton ordering and has been used for load balancing purposes [5, 27, 28, 44, 49, 59]. It has also been applied for bandwidth reduction in information theory [6], for graphics applications [23, 38], and for database applications [29]. Figure 1(d) illustrates this layout. Morton ordering has the following operational interpretation. Divide the original matrix into four quadrants, and lay out the... |

5 |
Analysis of the clustering properties of Hilbert space-filling curve
- Moon, Jagadish, et al.
- 1996
Citation Context ... layout, the Morton layout function is expensive to compute naively. It requires bit manipulation operations, or can alternatively be computed in O(d) integer operations without any bit manipulations [41]. Yet another option is to pre-compute the required $M(i, j)$ values in a lookup table. In our implementation, we use a combination of both techniques. We process the inputs four bits at a time, genera... |

3 |
Unifying data and control transformations for distributed shared-memory machines
- Cierniak, Li
- 1995
Citation Context ...out functions that are linear and monotonically increasing in the arguments i and j (such functions certainly being easy to compute), it is easy to prove that there are only two such layout functions [14]: the row-major layout $L_{RM}$ as used in Pascal, given by $L_{RM}(i, j, m, n) = n \cdot i + j$; and the column-major layout $L_{CM}$ as used in Fortran, given by $L_{CM}(i, j, m, n) = m \cdot j + i$. We refer to these two layouts as ca... |

3 | Recursion leads to automatic variable blocking for dense linear algebra algorithms - Gustavson - 1997 |

2 |
Recursion leads to automatic variable blocking for dense linear algebra algorithms
- Gustavson
- 1997
Citation Context ...ile $L_{MO}$ actually increases execution time. In contrast, the recursive control structure of RECMXM matches the recursive nature of $L_{MO}$, producing commensurate improvements in execution time. Gustavson [26] and Chatterjee et al. [12] discuss this issue in greater detail. 3.7 Cache and TLB simulations To gain further insight into the memory system behavior of nonlinear layouts, we simulated various c... |

2 |
Data transformations for eliminating conflict misses
- Rivera, Tseng
- 1998
Citation Context ...sually resulting in nonsquare tiles. They claim that their method outperforms the copy optimization recommended by Lam et al. [36]. We have compared our approach with theirs in §3.5. Rivera and Tseng [47, 48] discuss intra- and inter-array padding as a means of reducing conflict misses. Ghosh et al. [21, 22] present an analytical model for estimating cache misses for perfect loop nests. A substantial body... |

1 | ATOM: A system for building customized program analysis tools - Eustace - 1994 |
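
Several of the citation contexts above quote the paper's canonical layout functions (row-major and column-major) and describe the Z-Morton ordering as a recursive, quadrant-by-quadrant layout. The following is a minimal sketch of those index computations, not the authors' implementation. The function names, zero-based indexing, the `bits` parameter, and the choice to place row bits above column bits when interleaving are all assumptions made for illustration; the quoted contexts are truncated before they state the quadrant order.

```python
def row_major(i, j, m, n):
    """Canonical row-major offset of element (i, j) in an m x n array."""
    return n * i + j


def col_major(i, j, m, n):
    """Canonical column-major offset of element (i, j) in an m x n array."""
    return m * j + i


def z_morton(i, j, bits):
    """Interleave the low `bits` bits of i and j into one offset.

    Assumption: column bits go to even positions and row bits to odd
    positions, which visits the four quadrants of a 2^bits x 2^bits
    matrix in the order top-left, top-right, bottom-left, bottom-right.
    """
    index = 0
    for b in range(bits):
        index |= ((j >> b) & 1) << (2 * b)
        index |= ((i >> b) & 1) << (2 * b + 1)
    return index


if __name__ == "__main__":
    # Print the Z-Morton offsets of a 4 x 4 matrix, one row per line.
    for i in range(4):
        print([z_morton(i, j, 2) for j in range(4)])
```

The bit-by-bit loop is the naive computation that the quoted text calls expensive; the lookup-table variant mentioned there would precompute these interleavings for several bits at a time.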