Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation
, 2000
Cited by 108 (9 self)
Loop tiling and unrolling are two important program transformations to exploit locality and expose instruction level parallelism, respectively. However, these transformations are not independent and each can adversely affect the goal of the other. Furthermore, the best combination will vary dramatically from one processor to the next. In this paper, we therefore address the problem of how to select tile sizes and unroll factors simultaneously. We approach this problem in an architecturally adaptive manner by means of iterative compilation, where we generate many versions of a program and decide upon the best by actually executing them and measuring their execution time. We evaluate several iterative strategies based on genetic algorithms, random sampling and simulated annealing. We compare the levels of optimization obtained by iterative compilation to several well-known static techniques and show that we outperform each of them on a range of benchmarks across a variety of ar...
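The iterative-compilation idea in this abstract (generate many variants, run each, keep the fastest) can be sketched with the random-sampling strategy it mentions. This is only an illustration under stated assumptions: `matmul_tiled`, `random_search`, the candidate tile sizes and unroll factors, and the matrix-multiply kernel are all hypothetical choices, not taken from the paper, and Python timings will not reflect the cache and instruction-level effects that matter for compiled code.

```python
import random
import time


def matmul_tiled(a, b, n, tile, unroll):
    """Multiply two n x n matrices (lists of lists) with tiled j/k loops.

    The inner k loop advances `unroll` elements per trip as a stand-in for
    compiler unrolling; a remainder loop handles leftover iterations.
    """
    c = [[0.0] * n for _ in range(n)]
    for jj in range(0, n, tile):
        j_end = min(jj + tile, n)
        for kk in range(0, n, tile):
            k_end = min(kk + tile, n)
            for i in range(n):
                a_i, c_i = a[i], c[i]
                for j in range(jj, j_end):
                    acc = c_i[j]
                    k = kk
                    while k + unroll <= k_end:
                        for u in range(unroll):
                            acc += a_i[k + u] * b[k + u][j]
                        k += unroll
                    while k < k_end:  # remainder iterations
                        acc += a_i[k] * b[k][j]
                        k += 1
                    c_i[j] = acc
    return c


def random_search(n=24, trials=6, seed=0):
    """Randomly sample (tile, unroll) pairs and return the fastest measured one.

    Returns a tuple (elapsed_seconds, tile, unroll).
    """
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(n)] for _ in range(n)]
    b = [[rng.random() for _ in range(n)] for _ in range(n)]
    tiles, unrolls = [4, 8, 12, 24], [1, 2, 4]  # hypothetical search space
    best = None
    for _ in range(trials):
        tile, unroll = rng.choice(tiles), rng.choice(unrolls)
        start = time.perf_counter()
        matmul_tiled(a, b, n, tile, unroll)
        elapsed = time.perf_counter() - start
        if best is None or elapsed < best[0]:
            best = (elapsed, tile, unroll)
    return best
```

Calling `random_search()` returns the best sampled configuration for the current machine; a genetic algorithm or simulated annealing would replace only the sampling loop.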
Tiling Optimizations for 3D Scientific Computations
, 2000
Cited by 69 (4 self)
Compiler transformations can significantly improve data locality for many scientific programs. In this paper, we show iterative solvers for partial differential equations (PDEs) in three dimensions require new compiler optimizations not needed for 2D codes, since reuse along the third dimension cannot fit in cache for larger problem sizes. Tiling is a program transformation compilers can apply to capture this reuse, but successful application of tiling requires selection of non-conflicting tiles and/or padding array dimensions to eliminate conflicts. We present new algorithms and cost models for selecting tiling shapes and array pads. We explain why tiling is rarely needed for 2D PDE solvers, but can be helpful for 3D stencil codes. Experimental results show tiling 3D codes can reduce miss rates and achieve performance improvements of 17-121% for key scientific kernels, including a 27% average improvement for the key computational loop nest in the SPEC/NAS benchmark MGRID.
A Framework for Performance Modeling and Prediction
- IN SC 2002
, 2002
Cited by 60 (7 self)
Cycle-accurate simulation is far too slow for modeling the expected performance of full parallel applications on large HPC systems. And just running an application on a system and observing wallclock time tells you nothing about why the application performs as it does (and is anyway impossible on yet-to-be-built systems). Here we present a framework for performance modeling and prediction that is faster than cycle-accurate simulation, more informative than simple benchmarking, and is shown useful for performance investigations in several dimensions.
StatCache: A probabilistic approach to efficient and accurate data locality analysis
- In Proceedings of the International Symposium on Performance Analysis of Systems and Software
, 2004
Cited by 59 (7 self)
The widening memory gap reduces performance of applications with poor data locality. Therefore, there is a need for methods to analyze data locality and help application optimization. In this paper we present StatCache, a novel sampling-based method for performing data-locality analysis on realistic workloads. StatCache is based on a probabilistic model of the cache, rather than a functional cache simulator. It uses statistics from a single run to accurately estimate miss ratios of fully-associative caches of arbitrary sizes and generate working-set graphs. We evaluate StatCache using the SPEC CPU2000 benchmarks and show that StatCache gives accurate results with a sampling rate as low as [value lost in extraction]. We also provide a proof-of-concept implementation, and discuss potentially very fast implementation alternatives.
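To make concrete what "miss ratios of fully-associative caches of arbitrary sizes from a single run" means, here is a minimal sketch that uses a different, simpler technique: exact LRU stack distances computed over a complete trace. It is not StatCache's sampling-based probabilistic model, and the function names and the trace format (a list of cache-line identifiers) are assumptions made for illustration.

```python
from collections import OrderedDict


def stack_distances(trace):
    """Return the LRU stack distance of each access (None for a first touch).

    The distance is the number of distinct lines touched since the previous
    access to the same line.
    """
    stack = OrderedDict()  # most recently used line is last
    distances = []
    for line in trace:
        if line in stack:
            keys = list(stack)
            distances.append(len(keys) - 1 - keys.index(line))
            stack.move_to_end(line)
        else:
            distances.append(None)
            stack[line] = True
    return distances


def miss_ratio(trace, cache_lines):
    """Miss ratio of a fully associative LRU cache holding `cache_lines` lines."""
    misses = sum(
        1 for d in stack_distances(trace) if d is None or d >= cache_lines
    )
    return misses / len(trace)
```

Evaluating `miss_ratio` for a range of `cache_lines` values on one trace yields a working-set style curve; this version rescans the stack on every access and is far slower than a sampling approach would be.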
Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply
- In Proceedings of Supercomputing
, 2002
Cited by 57 (10 self)
We consider performance tuning, by code and data structure reorganization, of sparse matrix-vector multiply (SpMV), one of the most important computational kernels in scientific applications. This paper addresses the fundamental questions of what limits exist on such performance tuning, and how closely tuned code approaches these limits.
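For readers unfamiliar with the kernel, a baseline SpMV can be sketched as follows. The compressed sparse row (CSR) layout and the name `spmv_csr` are assumptions for illustration; the abstract does not say which data structures the paper tunes, and this untuned loop is only the starting point that such reorganization would improve.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    row_ptr[i]:row_ptr[i + 1] delimits the nonzeros of row i inside
    col_idx (column indices) and values (nonzero entries).
    """
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for p in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect access to x through col_idx is the usual locality problem.
            acc += values[p] * x[col_idx[p]]
        y[i] = acc
    return y
```

For the 3 x 3 matrix with rows `[1, 0, 2]`, `[0, 3, 0]`, `[4, 0, 5]`, the CSR arrays are `row_ptr = [0, 2, 3, 5]`, `col_idx = [0, 2, 1, 0, 2]`, and `values = [1, 2, 3, 4, 5]`.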
Modeling Application Performance by Convolving Machine Signatures with Application Profiles
, 2001
Cited by 50 (5 self)
This paper presents a performance modeling methodology that is faster than traditional cycle-accurate simulation, more sophisticated than performance estimation based on system peak-performance metrics, and is shown to be effective on a class of High Performance Computing benchmarks. The method yields insight into the factors that affect performance on single-processor and parallel computers.
Data Cache Locking for Higher Program Predictability
- In SIGMETRICS ’03: Proceedings of the 2003 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems
, 2003
Cited by 48 (3 self)
Caches have become increasingly important with the widening gap between main memory and processor speeds. However, they are a source of unpredictability due to their characteristics, resulting in programs behaving in a different way than expected. Cache locking mechanisms adapt caches to the needs of real-time systems. Locking the cache is a solution that trades performance for predictability: at a cost of generally lower performance, the time of accessing the memory becomes predictable. This paper combines compile-time cache analysis with data cache locking to estimate the worst-case memory performance (WCMP) in a safe, tight and fast way. In order to get predictable cache behavior, we first lock the cache for those parts of the code where the static analysis fails. To minimize the performance degradation, our method loads the cache, if necessary, with data likely to be accessed. Experimental results show that this scheme is fully predictable, without compromising the performance of the transformed program. When compared to an algorithm that assumes compulsory misses when the state of the cache is unknown, our approach eliminates all overestimation for the set of benchmarks, giving an exact WCMP of the transformed program without any significant decrease in performance.
Counting integer points in parametric polytopes using Barvinok’s rational functions
- Algorithmica
, 2007
Cited by 44 (9 self)
Many compiler optimization techniques depend on the ability to calculate the number of elements that satisfy certain conditions. If these conditions can be represented by linear constraints, then such problems are equivalent to counting the number of integer points in (possibly) parametric polytopes. It is well known that the enumerator of such a set can be represented by an explicit function consisting of a set of quasi-polynomials each associated with a chamber in the parameter space. Previously, interpolation was used to obtain these quasi-polynomials, but this technique has several disadvantages. Its worst-case computation time for a single quasi-polynomial is exponential in the input size, even for fixed dimensions. The worst-case size of such a quasi-polynomial (measured in bits needed to represent the quasi-polynomial) is also exponential in the input size. Under certain conditions this technique even fails to produce a solution. Our main contribution is a novel method for calculating the required quasi-polynomials analytically. It extends an existing method, based on Barvinok’s decomposition,
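A small worked instance may help show what a quasi-polynomial enumerator looks like. The polytope below, {(i, j) : 0 <= i <= n, 0 <= 2j <= i}, is a hypothetical example chosen for illustration and does not come from the paper; the closed form was derived by hand rather than by Barvinok's decomposition, and the brute-force count serves only as a check.

```python
def count_brute_force(n):
    """Count integer points (i, j) with 0 <= i <= n and 0 <= 2*j <= i."""
    return sum(1 for i in range(n + 1) for j in range(i + 1) if 2 * j <= i)


def count_closed_form(n):
    """Evaluate the quasi-polynomial n^2/4 + n + c(n) for n >= 0.

    The periodic term c(n) is 1 for even n and 3/4 for odd n, so the
    numerator below is always divisible by 4.
    """
    return (n * n + 4 * n + 4 - (n % 2)) // 4
```

The dependence of the constant term on the parity of `n` is what makes the count a quasi-polynomial rather than an ordinary polynomial; with a single parameter and the constraint n >= 0 there is only one chamber.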
Data caches in multitasking hard real-time systems
- IN IEEE REAL-TIME SYSTEMS SYMPOSIUM
, 2003
Cited by 41 (3 self)
Data caches are essential in modern processors, bridging the widening gap between main memory and processor speeds. However, they yield very complex performance models, which makes it hard to bound execution times tightly. This paper contributes a new technique to obtain predictability in preemptive multitasking systems in the presence of data caches. We explore the use of cache partitioning, dynamic cache locking and static cache analysis to provide worst-case performance estimates in a safe and tight way. Cache partitioning divides the cache among tasks to eliminate inter-task cache interferences. We combine static cache analysis and cache locking mechanisms to ensure that all intra-task conflicts, and consequently, memory access times, are exactly predictable. To minimize the performance degradation due to cache partitioning and locking, two strategies are employed. First, the cache is loaded with data likely to be accessed so that their cache utilization is maximized. Second, compiler optimizations such as tiling and padding are applied in order to reduce cache replacement misses. Experimental results show that this scheme is fully predictable, without compromising the performance of the transformed programs. Our method outperforms static cache locking for all analyzed task sets under various cache architectures, with a CPU utilization reduction ranging between 3.8 and 20.0 times for a high performance system.
Let’s Study Whole-Program Cache Behaviour Analytically
- In Proceedings of International Symposium on High-Performance Computer Architecture (HPCA 8
, 2002