Results 1 - 5 of 5
Multi-level tiling: M for the price of one
- in Proceedings of the ACM/IEEE Conference on Supercomputing, 2007
Abstract - Cited by 21 (0 self)
Tiling is a widely used loop transformation for exposing/exploiting parallelism and data locality. High-performance implementations use multiple levels of tiling to exploit the hierarchy of parallelism and cache/register locality. Efficient generation of multi-level tiled code is essential for effective use of multi-level tiling. Parameterized tiled code, where tile sizes are not fixed but left as symbolic parameters, can enable several dynamic and run-time optimizations. Previous solutions to multi-level tiled loop generation are limited to the case where tile sizes are fixed at compile time. We present an algorithm that can generate multi-level parameterized tiled loops at the same cost as generating single-level tiled loops. The efficiency of our method is demonstrated on several benchmarks. We also present a method, useful in register tiling, for separating partial and full tiles at any arbitrary level of tiling. The code generator we have implemented is available as an open source tool.
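To make the idea of multi-level parameterized tiling concrete, here is a minimal sketch (not taken from the paper; the function name `tiled_sum` and the 1-D reduction are illustrative assumptions). Both tile sizes `t1` and `t2` remain runtime parameters rather than compile-time constants:

```c
#include <stddef.h>

/* Hypothetical sketch: two-level parameterized tiling of a 1-D loop.
 * t1 is the outer (e.g. cache) tile size, t2 the inner (e.g. register)
 * tile size; both stay symbolic until runtime. */
double tiled_sum(const double *a, size_t n, size_t t1, size_t t2)
{
    double s = 0.0;
    for (size_t ii = 0; ii < n; ii += t1)                       /* level-1 tiles */
        for (size_t jj = ii; jj < ii + t1 && jj < n; jj += t2)  /* level-2 tiles */
            for (size_t j = jj; j < jj + t2 && j < ii + t1 && j < n; j++)
                s += a[j];                                      /* point loop */
    return s;
}
```

The inner bounds use `min`-style guards (`j < jj + t2 && j < ii + t1 && j < n`) so partial tiles at both levels are handled without separate cleanup code; separating those partial tiles from full tiles is exactly the register-tiling refinement the abstract mentions.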
Parameterized Tiled Loops for Free
, 2007
Abstract - Cited by 10 (3 self)
Parameterized tiled loops—where the tile sizes are not fixed at compile time, but remain symbolic parameters until later—are quite useful for iterative compilers and “auto-tuners” that produce highly optimized libraries and codes. Tile size parameterization could also enable optimizations such as register tiling to become dynamic optimizations. Although it is easy to generate such loops for (hyper) rectangular iteration spaces tiled with (hyper) rectangular tiles, many important computations do not fall into this restricted domain. Parameterized tiled code generation for the general case of convex iteration spaces tiled by (hyper) rectangular tiles has in the past been solved with bounding-box approaches or symbolic Fourier-Motzkin approaches. However, both approaches have less than ideal code generation efficiency and resulting code quality. We present the theoretical foundations, implementation, and experimental validation of a simple, unified technique for generating parameterized tiled code. Our code generation efficiency is comparable to all existing code generation techniques, including those for fixed tile sizes, and the resulting code is as efficient as, if not more efficient than, that of all previous techniques. Thus the technique provides parameterized tiled loops for free! Our “one-size-fits-all” solution, which is available as open source software, can be adapted for use in production compilers.
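The non-rectangular case the abstract refers to can be sketched on a triangular iteration space (j ≤ i), tiled with rectangular tiles of symbolic size `t`. This is an illustrative sketch, not the paper's algorithm; the function name `tiled_triangle_count` is hypothetical. The outer loops enumerate only tile origins that can intersect the triangle, and the point loops clamp to its boundary:

```c
/* Count points of the triangle {0 <= j <= i < n} visited by a
 * parameterized tiling with rectangular t-by-t tiles. Each point is
 * visited exactly once, so the result must equal n*(n+1)/2. */
long tiled_triangle_count(long n, long t)
{
    long count = 0;
    for (long it = 0; it < n; it += t)                        /* tile rows */
        /* only tile columns that can contain some j <= i for i in this row */
        for (long jt = 0; jt <= it + t - 1 && jt < n; jt += t)
            for (long i = it; i < it + t && i < n; i++)       /* points */
                for (long j = jt; j < jt + t && j <= i; j++)
                    count++;
    return count;
}
```

Because `t` appears only in loop bounds and strides, the same compiled code works for any tile size chosen at runtime, which is what makes such loops usable by auto-tuners.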
On Parameterized Tiled Loop Generation and Its Parallelization
, 2010
Abstract
Tiling is a loop transformation that decomposes computations into a set of smaller computation blocks. The transformation has proved to be useful for many high-level program optimizations, such as data locality optimization and exploiting coarse-grained parallelism, and crucial for architectures with limited resources, such as embedded systems, GPUs, and the Cell. Data locality and parallelism will continue to serve as major vehicles for achieving high performance on modern architectures. Parameterized tiling is tiling where the size of blocks is not fixed at compile time but remains a symbolic constant that can be selected/changed even at runtime. Parameterized tiled loops facilitate iterative and runtime optimizations, such as iterative compilation, auto-tuning, and dynamic program adaptation. Existing solutions to parameterized tiled loop generation are either restricted to perfectly nested loops or difficult to parallelize on distributed memory systems, and even on shared memory systems when a program does not have synchronization-free parallelism. We present an approach for parameterized tiled loop generation for imperfectly nested loops. We employ a direct extension of the tiled code generation technique for perfectly nested loops and three simple optimizations on the resulting parameterized tiled loops. The generation as well as the optimizations are achieved purely by syntactic processing, hence loop generation time remains negligible. Our code generation technique provides comparable ...
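An imperfectly nested loop is one where a statement sits between loop levels, e.g. an initialization before an inner reduction loop. One simple way such nests can still be tiled with a symbolic tile size is to guard the stray statement so it executes only in the first tile. This sketch is illustrative only (the name `row_sums_tiled` and the row-sum kernel are assumptions, not the paper's benchmark):

```c
/* Imperfect nest: S1 (b[i] = 0) sits at the i level, S2 inside the j loop.
 * The j loop is tiled with runtime tile size t; the guard on jj makes S1
 * execute exactly once per row, with the first tile. A is n-by-m, row-major. */
void row_sums_tiled(int n, int m, const double *A, double *b, int t)
{
    for (int jj = 0; jj < m; jj += t)        /* parameterized j-tiles */
        for (int i = 0; i < n; i++) {
            if (jj == 0)                     /* S1: only in the first tile */
                b[i] = 0.0;
            for (int j = jj; j < jj + t && j < m; j++)
                b[i] += A[i * m + j];        /* S2 */
        }
}
```

The transformation is purely syntactic (a guard plus clamped bounds), which matches the abstract's point that generation time stays negligible.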
Architecture Aware Programming on Multi-Core Systems
Abstract
In order to improve processor performance, the industry's response has been to increase the number of cores on the die. One salient feature of multi-core architectures is that they have a varying degree of sharing of caches at different levels. With the advent of multi-core architectures, we face a problem that is new to parallel computing, namely the management of hierarchical caches. Data locality must be considered in order to reduce the variance in performance across different data sizes. In this paper, we propose a programming approach for algorithms running on shared-memory multi-core systems using blocking, a well-known optimization technique, coupled with the parallel programming paradigm OpenMP. We have chosen the sizes of the various problems based on architectural parameters of the system such as cache level, cache size, and cache line size. We studied the cache optimization scheme on commonly used linear algebra applications: matrix multiplication (MM), Gauss elimination (GE), and LU decomposition (LUD).
Keywords: multi-core architecture; parallel programming; cache miss; blocking; OpenMP; linear algebra.
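The combination of blocking with OpenMP that the abstract describes can be sketched on matrix multiplication. This is a generic illustration, not the paper's code: the function name `matmul_blocked` is hypothetical, and the block size `bs` stands in for a value derived from cache parameters. Without OpenMP enabled, the pragma is ignored and the code runs correctly in serial:

```c
/* Blocked (tiled) matrix multiply, C = A * B, all n-by-n row-major.
 * bs is the block size, which the paper would pick from cache level,
 * cache size, and cache line size. Distinct threads own distinct
 * (ii, jj) blocks of C, so the parallel loop is race-free. */
void matmul_blocked(int n, const double *A, const double *B, double *C, int bs)
{
    for (int i = 0; i < n * n; i++) C[i] = 0.0;
    #pragma omp parallel for collapse(2) schedule(static)
    for (int ii = 0; ii < n; ii += bs)
        for (int jj = 0; jj < n; jj += bs)
            for (int kk = 0; kk < n; kk += bs)
                for (int i = ii; i < ii + bs && i < n; i++)
                    for (int k = kk; k < kk + bs && k < n; k++)
                        for (int j = jj; j < jj + bs && j < n; j++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

Choosing `bs` so that three `bs`-by-`bs` blocks fit in a given cache level is the usual rule of thumb behind the architecture-aware sizing the abstract refers to.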