Results 1 - 10 of 37
Statistical models for empirical search-based performance tuning
- INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2004
"... Achieving peak performance from the computational kernels that dominate application performance often requires extensive machine-dependent tuning by hand. Automatic tuning systems have emerged in response, and they typically operate by (1) generating a large number of possible, reasonable implementa ..."
Cited by 33 (2 self)
Achieving peak performance from the computational kernels that dominate application performance often requires extensive machine-dependent tuning by hand. Automatic tuning systems have emerged in response, and they typically operate by (1) generating a large number of possible, reasonable implementations of a kernel, and (2) selecting the fastest implementation by a combination of heuristic modeling, heuristic pruning, and empirical search (i.e., actually running the code). This paper presents quantitative data that motivates the development of such a search-based system, using dense matrix multiply as a case study. The statistical distributions of performance within spaces of reasonable implementations, when observed on a variety of hardware platforms, lead us to pose and address two general problems which arise during the search process. First, we develop a heuristic for stopping an exhaustive compile-time search early if a near-optimal implementation is found. Second, we show how to construct ...
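As a rough illustration of the early-stopping idea, the sketch below halts an empirical search once many consecutive samples fail to improve on the best timing; the measure() callback is hypothetical (it would benchmark one candidate and return its performance, e.g. Mflop/s). This is a simplification of the paper's statistical rule, which instead bounds the probability that a substantially faster variant remains unsampled.

```python
import random

def empirical_search(variants, measure, eps=0.02, patience=50):
    """Sample implementation variants at random, timing each one, and stop
    once `patience` consecutive samples fail to beat the best performance
    seen so far by more than a factor of (1 + eps)."""
    pool = list(variants)
    random.shuffle(pool)
    best_perf, best_variant, stale = 0.0, None, 0
    for v in pool:
        perf = measure(v)              # hypothetical: runs the candidate kernel
        if perf > best_perf * (1.0 + eps):
            best_perf, best_variant, stale = perf, v, 0
        else:
            stale += 1
            if stale >= patience:      # further large gains look unlikely
                break
    return best_variant, best_perf
```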
Predictive Modeling in a Polyhedral Optimization Space
- INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION (CGO'11), 2011
"... Significant advances in compiler optimization have been made in recent years, enabling many transformations such as tiling, fusion, parallelization and vectorization on imperfectly nested loops. Nevertheless, the problem of finding the best combination of loop transformations remains a major challen ..."
Cited by 18 (3 self)
Significant advances in compiler optimization have been made in recent years, enabling many transformations such as tiling, fusion, parallelization and vectorization on imperfectly nested loops. Nevertheless, the problem of finding the best combination of loop transformations remains a major challenge. Polyhedral models for compiler optimization have demonstrated strong potential for enhancing program performance, in particular for compute-intensive applications. But existing static cost models to optimize polyhedral transformations have significant limitations, and iterative compilation has become a very promising alternative for finding the most effective transformations. Since the number of polyhedral optimization alternatives can be enormous, however, it is often impractical to iterate over a significant fraction of the entire space of polyhedrally transformed variants. Recent research has focused on iterating over this search space either with manually constructed heuristics or with automatic but very expensive search algorithms (e.g., genetic algorithms) that can eventually find good points in the polyhedral space. In this paper, we propose the use of machine learning to address the problem of selecting the best polyhedral optimizations. We show that these models can quickly find high-performance program variants in the polyhedral space, without resorting to extensive empirical search. We introduce models that take as input a characterization of a program based on its dynamic behavior, and predict the performance of aggressive high-level polyhedral transformations that include tiling, parallelization and vectorization. We allow for a minimal empirical search on the target machine, discovering on average 83% of the search-space-optimal combinations in at most 5 runs. Our end-to-end framework is validated using numerous benchmarks on two multi-core platforms.
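A minimal sketch of the predict-then-verify workflow this abstract describes, assuming a placeholder feature layout (16 dynamic-behavior counters per program plus an 8-element transformation descriptor) and random stand-in training data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder training set: each row is [16 dynamic program features |
# 8-element polyhedral transform descriptor]; the target is the measured
# speedup of that transformed variant over the baseline.
X_train = np.random.rand(500, 24)
y_train = np.random.rand(500)
model = RandomForestRegressor(n_estimators=100).fit(X_train, y_train)

def top_k_variants(prog_features, descriptors, k=5):
    """Rank candidate polyhedral variants by predicted speedup; only the
    k best are actually run on the target machine."""
    rows = np.array([np.concatenate([prog_features, d]) for d in descriptors])
    order = np.argsort(model.predict(rows))[::-1]
    return [descriptors[i] for i in order[:k]]
```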
Data locality optimization for synthesis of efficient out-of-core algorithms
- In Proc. of the Intl. Conf. on High Performance Computing, 2003
"... Abstract. This paper describes an approach to synthesis of efficient out-of-core code for a class of imperfectly nested loops that represent tensor contraction com-putations. Tensor contraction expressions arise in many accurate computational models of electronic structure. The developed approach co ..."
Cited by 14 (12 self)
This paper describes an approach to synthesis of efficient out-of-core code for a class of imperfectly nested loops that represent tensor contraction computations. Tensor contraction expressions arise in many accurate computational models of electronic structure. The developed approach combines loop fusion with loop tiling and uses a performance-model-driven approach to loop tiling for the generation of out-of-core code. Experimental measurements are provided that show a good match with model-based predictions and demonstrate the effectiveness of the proposed algorithm.
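A toy sketch of what model-driven tiling for an out-of-core computation looks like, reduced here to a dense matrix product C = A * B; read_tile/write_tile are hypothetical wrappers over file I/O, and the memory model is simply that three tiles must fit in memory at once:

```python
import numpy as np

def tile_size(mem_bytes, dtype_bytes=8):
    # Largest square tile T such that three T x T tiles fit in memory.
    return int((mem_bytes / (3 * dtype_bytes)) ** 0.5)

def out_of_core_contract(read_tile, write_tile, N, T):
    """Tiled C[i,j] = sum_k A[i,k] * B[k,j] over disk-resident N x N arrays.
    Each C tile stays in memory for the whole k loop, so it is read from
    and written to disk exactly once."""
    for bi in range(0, N, T):
        for bj in range(0, N, T):
            c = np.zeros((min(T, N - bi), min(T, N - bj)))
            for bk in range(0, N, T):
                a = read_tile("A", bi, bk)   # hypothetical disk read
                b = read_tile("B", bk, bj)
                c += a @ b
            write_tile("C", bi, bj, c)       # hypothetical disk write
```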
Raising the Level of Programming Abstraction in Scalable Programming Models
- In IEEE International Conference on High Performance Computer Architecture (HPCA), Workshop on Productivity and Performance in High-End Computing (P-PHEC), 2004
"... The complexity of modern scientific simulations combined with the complexity of the high-performance computer hardware on which they run place an everincreasing burden on scientific software developers, with clear impacts on both productivity and performance. We argue that raising the level of abstr ..."
Cited by 12 (3 self)
The complexity of modern scientific simulations, combined with the complexity of the high-performance computer hardware on which they run, places an ever-increasing burden on scientific software developers, with clear impacts on both productivity and performance. We argue that raising the level of abstraction of the programming model/environment is a key element of addressing this situation. We present examples of two distinctly different approaches to raising the level of abstraction of the programming model while maintaining or increasing performance: the Tensor Contraction Engine, a narrowly focused domain-specific language together with an optimizing compiler; and Extended Global Arrays, a programming framework that integrates programming models dealing with different layers of the memory/storage hierarchy using compiler analysis and code transformation techniques.
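To make the abstraction-raising argument concrete: in a TCE-style specification a contraction is written as a single equation rather than as nested loops. np.einsum plays that role in this illustrative snippet (the shapes and index names are arbitrary):

```python
import numpy as np

A = np.random.rand(8, 8, 8, 8)
B = np.random.rand(8, 8, 8, 8)
# One line at the level of the mathematics, in place of six nested loops:
# C[i,j,m,n] = sum over k,l of A[i,j,k,l] * B[k,l,m,n]
C = np.einsum('ijkl,klmn->ijmn', A, B)
```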
Hypergraph partitioning for automatic memory hierarchy management
- In Supercomputing (SC06), 2006
"... In this paper, we present a mechanism for automatic management of the memory hierarchy, including secondary storage, in the context of a global address space parallel programming framework. The programmer specifies the parallelism and locality in the computation. The scheduling of the computation in ..."
Cited by 5 (3 self)
In this paper, we present a mechanism for automatic management of the memory hierarchy, including secondary storage, in the context of a global address space parallel programming framework. The programmer specifies the parallelism and locality in the computation. The scheduling of the computation into stages, together with the movement of the associated data between secondary storage and global memory, and between global memory and local memory, is automatically managed. A novel formulation of hypergraph partitioning is used to model the optimization problem of minimizing disk I/O. Experimental evaluation of the proposed approach using a sub-computation from the quantum chemistry domain shows a reduction in the disk I/O cost by up to a factor of 11, and a reduction in turnaround time by up to 49%, as compared to alternative approaches used in state-of-the-art quantum chemistry codes.
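The I/O objective behind such a hypergraph formulation can be sketched compactly: tasks are vertices, each disk-resident data block is a hyperedge over the tasks that touch it, and a block must be read once for every stage its hyperedge spans. The connectivity-style cost function below is a generic rendering of that model, not the paper's exact formulation:

```python
def io_cost(stage_of, hyperedges, block_bytes):
    """stage_of: dict task -> stage id (the partition);
    hyperedges: dict block -> set of tasks touching that block.
    Each block is read from disk once per distinct stage it appears in."""
    return sum(len({stage_of[t] for t in tasks}) * block_bytes
               for tasks in hyperedges.values())

stages = {"t1": 0, "t2": 0, "t3": 1}
blocks = {"A0": {"t1", "t2"}, "B0": {"t2", "t3"}}
print(io_cost(stages, blocks, block_bytes=1))  # A0: 1 stage, B0: 2 -> cost 3
```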
On efficient out-of-core matrix transposition
2003
"... This paper addresses the problem of transposition of large out-of-core arrays. Although algorithms for out-of-core matrix transposition have been widely studied, previously proposed algorithms have sought to minimize the number of I/O operations and the in-memory permu-tation time. We propose an alg ..."
Cited by 5 (3 self)
This paper addresses the problem of transposition of large out-of-core arrays. Although algorithms for out-of-core matrix transposition have been widely studied, previously proposed algorithms have sought to minimize the number of I/O operations and the in-memory permutation time. We propose an algorithm that directly targets the improvement of overall transposition time. The proposed algorithm decouples the transposition procedure from the matrix dimensions and instead ties it to the I/O characteristics of the system, which are used to determine the read and write block sizes. These I/O block sizes are chosen in order to optimize the total execution time. Experimental results are provided that demonstrate the ...
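A hedged sketch of the block-size selection this abstract describes: given bandwidth curves measured by I/O microbenchmarks, brute-force the candidate read/write block sizes against a modeled total time. The pass-count formula below is a stand-in cost model (radix-style multi-pass permutation), not the paper's:

```python
import math

def total_time(n_bytes, rb, wb, read_bw, write_bw, mem_bytes):
    # Every pass reads and writes the whole array. Smaller blocks raise the
    # radix (more blocks fit in memory, so fewer passes) while larger blocks
    # improve per-block bandwidth; the model captures that tension.
    radix = max(2, mem_bytes // max(rb, wb))
    passes = max(1, math.ceil(math.log(n_bytes / rb, radix)))
    return passes * (n_bytes / read_bw(rb) + n_bytes / write_bw(wb))

def best_blocks(n_bytes, mem_bytes, read_bw, write_bw, candidates):
    # read_bw/write_bw: measured bandwidth (bytes/s) as a function of block
    # size, assumed to come from I/O microbenchmarks on the target system.
    return min(((rb, wb) for rb in candidates for wb in candidates),
               key=lambda p: total_time(n_bytes, p[0], p[1],
                                        read_bw, write_bw, mem_bytes))
```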
Efficient parallel out-of-core matrix transposition
- In: Proceedings of the International Conference on Cluster Computing, IEEE Computer Society Press, 2003
"... This paper addresses the problem of parallel transposition of large out-of-core arrays. Although algorithms for out-of-core matrix transposition have been widely studied, previously proposed algorithms have sought to minimize the number of I/O operations and the inmemory permutation time. We propose ..."
Cited by 5 (3 self)
This paper addresses the problem of parallel transposition of large out-of-core arrays. Although algorithms for out-of-core matrix transposition have been widely studied, previously proposed algorithms have sought to minimize the number of I/O operations and the in-memory permutation time. We propose an algorithm that directly targets the improvement of overall transposition time. The I/O characteristics of the system are used to determine the read, write and communication block sizes such that the total execution time is minimized. We also provide a solution to the array redistribution problem for arrays on disk. The solutions to the sequential transposition problem and the parallel array redistribution problem are then combined to obtain an algorithm for the parallel out-of-core transposition problem.
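The on-disk redistribution subproblem reduces to index arithmetic over blocks. This sketch computes, for each destination block, the runs of source blocks that must be read to assemble it; it is one simple rendering of the problem, not the paper's algorithm:

```python
def redistribution_plan(n, src_block, dst_block):
    """For an n-element disk array stored in blocks of src_block elements,
    list the (src_block_id, offset, length) runs needed to assemble each
    destination block of dst_block elements."""
    plan = {}
    for dst_id in range((n + dst_block - 1) // dst_block):
        lo, hi = dst_id * dst_block, min(n, (dst_id + 1) * dst_block)
        runs, pos = [], lo
        while pos < hi:
            sid = pos // src_block                       # source block hit
            run = min(hi, (sid + 1) * src_block) - pos   # contiguous span
            runs.append((sid, pos - sid * src_block, run))
            pos += run
        plan[dst_id] = runs
    return plan

# e.g. re-block 10 elements from blocks of 4 into blocks of 3:
print(redistribution_plan(10, 4, 3))
```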
Automated operation minimization of tensor contraction expressions in electronic structure calculations
- In Proc. ICCS 2005, 5th International Conference, volume 3514 of Lecture Notes in Computer Science, 2005
"... Abstract. Complex tensor contraction expressions arise in accurate electronic structure models in quantum chemistry, such as the Coupled Cluster method. Transformations using algebraic properties of commutativity and associativity can be used to significantly decrease the number of arithmetic operat ..."
Cited by 5 (2 self)
Complex tensor contraction expressions arise in accurate electronic structure models in quantum chemistry, such as the Coupled Cluster method. Transformations using algebraic properties of commutativity and associativity can be used to significantly decrease the number of arithmetic operations required for evaluation of these expressions, but the optimization problem is NP-hard. Operation minimization is an important optimization step for the Tensor Contraction Engine, a tool being developed for the automatic transformation of high-level tensor contraction expressions into efficient programs. In this paper, we develop an effective heuristic approach to the operation minimization problem, and demonstrate its effectiveness on tensor contraction expressions for coupled cluster equations.
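The operation-minimization problem itself is easy to state in code. The exhaustive search below uses the standard cost model (contracting two tensors costs the product of the extents of the union of their indices, and indices shared by exactly two tensors are summed away); it is exponential in the number of tensors, so it only stands in for the scalable heuristic the paper develops:

```python
from itertools import combinations

def min_ops(tensors, extent):
    """tensors: list of index sets, e.g. [{'i','k'}, {'k','j'}];
    extent: dict index -> range size. Returns the minimum operation count
    over all pairwise contraction orders."""
    ts = [frozenset(t) for t in tensors]
    if len(ts) == 1:
        return 0
    best = float('inf')
    for a, b in combinations(range(len(ts)), 2):
        ops = 1
        for idx in ts[a] | ts[b]:          # loop over all indices involved
            ops *= extent[idx]
        rest = [t for i, t in enumerate(ts) if i not in (a, b)]
        best = min(best, ops + min_ops(rest + [ts[a] ^ ts[b]], extent))
    return best

# A[i,k] * B[k,j] * C[j,l] with all extents 100: contracting (A,B) first
# costs 2e6 operations versus 2e8 for the worst order.
print(min_ops([{'i', 'k'}, {'k', 'j'}, {'j', 'l'}], dict.fromkeys('ijkl', 100)))
```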
Efficient Search-Space Pruning for Integrated Fusion and Tiling Transformations
"... Abstract. Compile-time optimizations involve a number of transformations such as loop permutation, fusion, tiling, array contraction, etc. Determination of the choice of these transformations that minimizes the execution time is a challenging task. We address this problem in the context of tensor co ..."
Cited by 4 (1 self)
Compile-time optimizations involve a number of transformations such as loop permutation, fusion, tiling, and array contraction. Determining the combination of these transformations that minimizes execution time is a challenging task. We address this problem in the context of tensor contraction expressions involving arrays too large to fit in main memory. Domain-specific features of the computation are exploited to develop an integrated framework that facilitates the exploration of the entire search space of optimizations. In this paper, we discuss the exploration of the space of loop fusion and tiling transformations in order to minimize the disk I/O cost. These two transformations are integrated, and pruning strategies are presented that significantly reduce the number of loop structures to be evaluated for subsequent transformations. The evaluation of the framework using representative contraction expressions from quantum chemistry shows a dramatic reduction in the size of the search space using the strategies presented.
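One pruning strategy that fits this description is dominance filtering: a candidate loop structure is dropped if another uses no more memory and incurs no more disk I/O. The following is a generic Pareto filter in that spirit, not the paper's exact strategy:

```python
def pareto_prune(candidates):
    """candidates: list of (structure, mem_bytes, io_bytes). Keeps only
    structures not dominated on both metrics, shrinking the set handed to
    subsequent transformation passes."""
    keep = []
    for s, m, io in candidates:
        dominated = any(m2 <= m and io2 <= io and (m2 < m or io2 < io)
                        for _, m2, io2 in candidates)
        if not dominated:
            keep.append((s, m, io))
    return keep

print(pareto_prune([("fuse-ij", 10, 100), ("fuse-k", 20, 100),
                    ("no-fuse", 5, 400)]))  # "fuse-k" is dominated
```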
Using Machine Learning to Improve Automatic Vectorization
"... Automatic vectorization is critical to enhancing performance of compute-intensive programs on modern processors. How-ever, there is much room for improvement over the auto-vectorization capabilities of current production compilers, through careful vector-code synthesis that utilizes a variety of loo ..."
Cited by 4 (1 self)
Automatic vectorization is critical to enhancing the performance of compute-intensive programs on modern processors. However, there is much room for improvement over the auto-vectorization capabilities of current production compilers, through careful vector-code synthesis that utilizes a variety of loop transformations (e.g., unroll-and-jam, interchange, etc.). As the set of transformations considered is increased, the selection of the most effective combination of transformations becomes a significant challenge: the cost models currently used in vectorizing compilers are often unable to identify the best choices. In this paper, we address this problem using machine learning models to predict the performance of SIMD codes. In contrast to existing approaches that have used high-level features of the program, we develop machine learning models based on features extracted from the generated assembly code. The models are trained off-line on a number of benchmarks, and used at compile time to discriminate between numerous possible vectorized variants generated from the input code. We demonstrate the effectiveness of the machine learning model by using it to guide automatic vectorization on a variety of tensor contraction kernels, with improvements ranging from 2× to 8× over Intel ICC's auto-vectorized code. We also evaluate the effectiveness of the model on a number of stencil computations and show good improvement over auto-vectorized code.
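A sketch of the assembly-feature approach, with a toy feature extractor (instruction-mix counts via regex; the paper's real feature set is richer) and any fitted regressor standing in for the model trained off-line:

```python
import re
import numpy as np

VEC = re.compile(r'\b(vmul|vadd|vfmadd|vmov)[a-z]*', re.I)  # vector ops (toy)
MEM = re.compile(r'\(%r[a-z0-9]+\)')                        # memory operands

def asm_features(asm_text):
    """Counts of vector instructions, memory operands, and total lines --
    a toy stand-in for the paper's assembly-level feature set."""
    return np.array([len(VEC.findall(asm_text)),
                     len(MEM.findall(asm_text)),
                     len(asm_text.splitlines())], dtype=float)

def pick_variant(variants, model):
    """variants: list of (name, asm_text) for candidate vectorized codes;
    model: any regressor with .predict, trained off-line on (features,
    measured performance) pairs from benchmark kernels. Returns the variant
    predicted fastest, so no candidate has to be run at compile time."""
    X = np.array([asm_features(asm) for _, asm in variants])
    return variants[int(np.argmax(model.predict(X)))]
```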