The design and implementation of FFTW3
Proceedings of the IEEE, 2005
"... FFTW is an implementation of the discrete Fourier transform (DFT) that adapts to the hardware in order to maximize performance. This paper shows that such an approach can yield an implementation that is competitive with hand-optimized libraries, and describes the software structure that makes our cu ..."
Cited by 726 (3 self)
FFTW is an implementation of the discrete Fourier transform (DFT) that adapts to the hardware in order to maximize performance. This paper shows that such an approach can yield an implementation that is competitive with hand-optimized libraries, and describes the software structure that makes our current FFTW3 version flexible and adaptive. We further discuss a new algorithm for real-data DFTs of prime size, a new way of implementing DFTs by means of machine-specific single-instruction, multiple-data (SIMD) instructions, and how a special-purpose compiler can derive optimized implementations of the discrete cosine and sine transforms automatically from a DFT algorithm.
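To make the adaptation concrete: FFTW's public C API separates planning from execution, and passing FFTW_MEASURE asks the planner to time candidate algorithms on the current machine and keep the fastest. A minimal sketch using that real API (error handling omitted; link with -lfftw3 -lm):

    #include <complex.h>
    #include <fftw3.h>

    int main(void)
    {
        const int n = 1024;
        fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

        /* Planning is where the hardware adaptation happens; the plan
         * is then reused for many transforms of the same size. */
        fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);

        /* ... fill `in` with data ... */
        fftw_execute(p);

        fftw_destroy_plan(p);
        fftw_free(in);
        fftw_free(out);
        return 0;
    }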
Rapidly selecting good compiler optimizations using performance counters
In Proceedings of the 5th Annual International Symposium on Code Generation and Optimization (CGO), 2007
"... Applying the right compiler optimizations to a particular program can have a significant impact on program performance. Due to the non-linear interaction of compiler optimizations, however, determining the best setting is nontrivial. There have been several proposed techniques that search the space ..."
Cited by 66 (24 self)
Applying the right compiler optimizations to a particular program can have a significant impact on program performance. Due to the non-linear interaction of compiler optimizations, however, determining the best setting is non-trivial. Several techniques have been proposed that search the space of compiler options for good solutions, but such approaches can be expensive. This paper proposes a different approach: using performance counters to determine good compiler optimization settings. A model is learned off-line and then used to determine good settings for any new program. We show that this approach outperforms the state of the art and is two orders of magnitude faster on average. Furthermore, our performance-counter-based approach outperforms techniques based on static code features, and the improvements are stable across varying input data sets. Using our technique we achieve a 10% improvement over the highest optimization setting of the commercial PathScale EKOPath 2.3.1 optimizing compiler on the SPEC benchmark suite, on a recent AMD Athlon 64 3700+ platform, in just three evaluations.
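A minimal sketch of the general idea (not the paper's actual code; the counter features, training data, and flag strings below are all hypothetical): profile a new program once, then predict flags from the nearest neighbor among previously characterized programs:

    #include <math.h>
    #include <stdio.h>

    #define NFEAT  4   /* e.g. cache-miss rate, branch-miss rate, IPC, TLB-miss rate */
    #define NTRAIN 3

    typedef struct {
        double feat[NFEAT];      /* normalized counter readings */
        const char *best_flags;  /* best flags found offline for this program */
    } sample_t;

    /* Hypothetical offline training set. */
    static const sample_t train[NTRAIN] = {
        {{0.8, 0.1, 1.2, 0.05}, "-O3 -funroll-loops"},
        {{0.2, 0.4, 0.7, 0.01}, "-O2 -fno-strict-aliasing"},
        {{0.5, 0.2, 2.0, 0.02}, "-O3 -ftree-vectorize"},
    };

    /* Nearest neighbor in counter-feature space stands in for the
     * learned model. */
    const char *predict_flags(const double feat[NFEAT])
    {
        int best = 0;
        double bestd = INFINITY;
        for (int i = 0; i < NTRAIN; i++) {
            double d = 0.0;
            for (int j = 0; j < NFEAT; j++) {
                double diff = feat[j] - train[i].feat[j];
                d += diff * diff;
            }
            if (d < bestd) { bestd = d; best = i; }
        }
        return train[best].best_flags;
    }

    int main(void)
    {
        double new_prog[NFEAT] = {0.75, 0.12, 1.1, 0.04}; /* one profiling run */
        printf("suggested flags: %s\n", predict_flags(new_prog));
        return 0;
    }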
Is Search Really Necessary to Generate High-Performance BLAS?
2005
"... Abstract — A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and loop unrolling factors. Traditional compilers use simple analytical models to compute these values. In contrast, library generators like ATLAS use global search over the space of p ..."
Cited by 64 (13 self)
A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and loop unrolling factors. Traditional compilers use simple analytical models to compute these values. In contrast, library generators like ATLAS use global search over the space of parameter values: they generate programs with many different combinations of parameter values and run them on the actual hardware to determine which values give the best performance. It is widely believed that traditional model-driven optimization cannot compete with search-based empirical optimization because tractable analytical models cannot capture all the complexities of modern high-performance architectures, but few quantitative comparisons have been done to date. To make such a comparison, we replaced the global search engine in ATLAS with a model-driven optimization engine and measured the relative performance of the code produced by the two systems on a variety of architectures. Since both systems use the same code generator, any differences in the performance of the code produced by the two systems can come only from differences in optimization parameter values. Our experiments show that model-driven optimization can be surprisingly effective, and can generate code with performance comparable to that of code generated by ATLAS using global search.
Index Terms: program optimization, empirical optimization, model-driven optimization, compilers, library generators, BLAS, high-performance computing
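To illustrate the model-driven side of the comparison, a toy analytical model for one ATLAS-style parameter, the tile size NB (the three-tiles-in-L1 working-set assumption is a deliberate simplification for illustration, not the paper's actual model):

    #include <math.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Pick NB so that three NB x NB tiles of doubles (blocks of A, B,
     * and C in a tiled matrix multiply) fit in the L1 data cache. */
    int tile_size_from_model(size_t l1_bytes)
    {
        return (int)floor(sqrt((double)l1_bytes / (3.0 * sizeof(double))));
    }

    int main(void)
    {
        size_t l1 = 32 * 1024; /* 32 KiB L1 data cache */
        printf("model-selected tile size NB = %d\n", tile_size_from_model(l1));
        return 0;
    }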
A modified split-radix FFT with fewer arithmetic operations
IEEE Transactions on Signal Processing, 2006
"... ..."
Synthesis of High-Performance Parallel Programs for a Class of Ab Initio Quantum Chemistry Models
Proceedings of the IEEE, 2005
"... ..."
(Show Context)
A Cross-Input Adaptive Framework for GPU Programs Optimization
2008
"... Recent years have seen a trend in using graphic processing units (GPU) as accelerators for generalpurpose computing. The inexpensive, single-chip, massively parallel architecture of GPU has evidentially brought factors of speedup to many numerical applications. However, the development of a high-qua ..."
Cited by 42 (7 self)
Recent years have seen a trend toward using graphics processing units (GPUs) as accelerators for general-purpose computing. The inexpensive, single-chip, massively parallel architecture of the GPU has evidently brought significant speedups to many numerical applications. However, developing a high-quality GPU application is challenging, owing to the large optimization space and the complex, unpredictable effects of optimizations on GPU program performance. Recently, several studies have attempted to use empirical search to help with the optimization. Although those studies have shown promising results, one important factor in the optimization, program inputs, has remained unexplored. In this work, we initiate the exploration of this new dimension. By conducting a series of measurements, we find that the ability to adapt to program inputs is important for some applications to achieve their best performance on the GPU. In light of these findings, we develop an input-adaptive optimization framework, named G-ADAPT, which constructs cross-input predictive models that automatically predict (near-)optimal configurations for an arbitrary input to a GPU program. The results demonstrate the promise of the framework as a tool to alleviate the productivity bottleneck in GPU programming.
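A minimal sketch of the cross-input idea (names and thresholds are hypothetical, not G-ADAPT's actual model): the launch configuration is chosen per input rather than fixed at development time:

    #include <stdio.h>

    typedef struct { int threads_per_block; int unroll; } gpu_config;

    /* Stand-in for a learned cross-input predictive model; the size
     * thresholds would come from offline profiling over many inputs. */
    gpu_config predict_config(long n)
    {
        if (n < (1L << 14)) return (gpu_config){ 64, 1}; /* small input */
        if (n < (1L << 20)) return (gpu_config){128, 2};
        return (gpu_config){256, 4};                     /* large input */
    }

    int main(void)
    {
        long n = 1L << 18;
        gpu_config c = predict_config(n);
        printf("n=%ld -> threads/block=%d, unroll=%d\n",
               n, c.threads_per_block, c.unroll);
        /* a real system would feed c into the GPU kernel launch */
        return 0;
    }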
Online performance auditing: using hot optimizations without getting burned
In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, 2006
"... As hardware complexity increases and virtualization is added at more layers of the execution stack, predicting the performance impact of optimizations becomes increasingly difficult. Production compilers and virtual machines invest substantial development effort in performance tuning to achieve good ..."
Cited by 38 (3 self)
As hardware complexity increases and virtualization is added at more layers of the execution stack, predicting the performance impact of optimizations becomes increasingly difficult. Production compilers and virtual machines invest substantial development effort in performance tuning to achieve good performance for a range of benchmarks. Although optimizations typically perform well on average, they often have unpredictable impact on running time, sometimes degrading performance significantly. Today's VMs perform sophisticated feedback-directed optimizations, but these techniques do not address performance degradations, and they actually make the situation worse by making the system more unpredictable. This paper presents an online framework for evaluating the effectiveness of optimizations, enabling an online system to automatically identify and correct performance anomalies that occur at runtime. This work opens the door for a fundamental shift in the way optimizations are developed and tuned for online systems, and may allow the body of work in offline empirical optimization search to be applied automatically at runtime. We present our implementation and evaluation of this system in a production Java VM.
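A minimal sketch of the auditing idea, assuming a simple interleaved-timing scheme (the paper's system works inside a JIT and adds statistical rigor and low-overhead sampling; this C sketch only shows the pick-the-winner structure):

    #include <stdio.h>
    #include <time.h>

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    typedef void (*impl_fn)(void);

    void baseline(void)  { /* ... unoptimized version ... */ }
    void optimized(void) { /* ... aggressively optimized version ... */ }

    /* Time both versions on live work, interleaved to reduce noise,
     * and return whichever was faster overall. */
    impl_fn audit(impl_fn a, impl_fn b, int trials)
    {
        double ta = 0.0, tb = 0.0;
        for (int i = 0; i < trials; i++) {
            double t0 = now_sec(); a(); ta += now_sec() - t0;
            double t1 = now_sec(); b(); tb += now_sec() - t1;
        }
        return (ta <= tb) ? a : b;
    }

    int main(void)
    {
        impl_fn chosen = audit(baseline, optimized, 100);
        chosen(); /* dispatch future calls through the audited winner */
        printf("auditing done\n");
        return 0;
    }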
Formal Loop Merging for Signal Transforms
2005
"... A critical optimization in the domain of linear signal transforms, such as the discrete Fourier transform (DFT), is loop merging, which increases data locality and reuse and thus performance. In particular, this includes the conversion of shuffle operations into array reindexings. To date, loop merg ..."
Cited by 33 (24 self)
A critical optimization in the domain of linear signal transforms, such as the discrete Fourier transform (DFT), is loop merging, which increases data locality and reuse and thus performance. In particular, this includes the conversion of shuffle operations into array reindexings. To date, loop merging is well understood only for the DFT, and only for algorithms based on the Cooley-Tukey FFT, which excludes DFT sizes divisible by large primes. In this paper, we present a formal loop-merging framework for general signal transforms and its implementation within the SPIRAL code generator. The framework consists of Σ-SPL, a mathematical language to express loops and index mappings; a rewriting system to merge loops in Σ-SPL; and a compiler that translates Σ-SPL into code. We apply the framework to DFT sizes that cannot be handled using only the Cooley-Tukey FFT and compare our method to FFTW 3.0.1 and the vendor library Intel MKL 7.2.1. Compared to FFTW, our generated code is a factor of 2–4 faster under equal implementation conditions (same algorithms, same unrolling threshold). For some sizes we show a speedup of a factor of 9 using Bluestein's algorithm. Further, in a detailed comparison against the Intel MKL, our generated code is between 2 times faster and 4.5 times slower.
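To illustrate what converting a shuffle into an array reindexing looks like at the C level (a hand-written toy, not SPIRAL output; the stride-permutation index map is one standard example of such a shuffle):

    #define N 8
    #define S 2   /* stride of the permutation */

    /* Before merging: an explicit shuffle pass into a temporary array,
     * followed by a separate compute pass. */
    void unmerged(const double x[N], double y[N])
    {
        double t[N];
        for (int i = 0; i < N; i++)          /* shuffle pass */
            t[i] = x[(i % S) * (N / S) + i / S];
        for (int i = 0; i < N; i++)          /* compute pass */
            y[i] = 2.0 * t[i];
    }

    /* After merging: the shuffle is folded into the load index, so one
     * loop does the work with no temporary array. */
    void merged(const double x[N], double y[N])
    {
        for (int i = 0; i < N; i++)
            y[i] = 2.0 * x[(i % S) * (N / S) + i / S];
    }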
Automatic performance model construction for the fast software exploration of new hardware designs
In ACM International Conference on Compilers, Architecture and Synthesis for Embedded Systems, 2006
"... ..."
(Show Context)
Feldspar: A Domain Specific Language for Digital Signal Processing Algorithms
In Proceedings of the 8th ACM/IEEE International Conference on Formal Methods and Models for Codesign. IEEE, 2010
"... ... high-level and platform-independent description of digital signal processing (DSP) algorithms. Feldspar is a pure functional language embedded in Haskell. It offers a high-level dataflow style of programming, as well as a more mathematical style based on vector indices. The key to generating eff ..."
Cited by 32 (10 self)
... high-level and platform-independent description of digital signal processing (DSP) algorithms. Feldspar is a pure functional language embedded in Haskell. It offers a high-level dataflow style of programming, as well as a more mathematical style based on vector indices. The key to generating efficient code from such descriptions is a high-level optimization technique called vector fusion. Feldspar is based on a low-level, functional core language which has a relatively small semantic gap to machine-oriented languages like C. The core language serves as the interface to the back-end code generator, which produces C. For very small examples, the generated code performs comparably to hand-written C code when run on a DSP target. While initial results are promising, to achieve good performance on larger examples, issues related to memory access patterns and array copying will have to be addressed.
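A C-level illustration of what vector fusion buys in the generated code (a hand-written sketch, not actual Feldspar output; f and g are placeholder kernels): a pipeline such as map f (map g v) naively produces an intermediate array, and fusion collapses it into a single loop:

    #define N 1024

    static double g(double x) { return x * x; }
    static double f(double x) { return x + 1.0; }

    /* Unfused: one loop per vector operation, plus intermediate storage. */
    void pipeline_unfused(const double in[N], double out[N])
    {
        double tmp[N];
        for (int i = 0; i < N; i++) tmp[i] = g(in[i]);
        for (int i = 0; i < N; i++) out[i] = f(tmp[i]);
    }

    /* Fused: the composition runs in one loop with no intermediate array,
     * which is the kind of code a fusing back-end aims to emit. */
    void pipeline_fused(const double in[N], double out[N])
    {
        for (int i = 0; i < N; i++) out[i] = f(g(in[i]));
    }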