Results 1-10 of 19
SPIRAL: Code Generation for DSP Transforms
 PROCEEDINGS OF THE IEEE SPECIAL ISSUE ON PROGRAM GENERATION, OPTIMIZATION, AND ADAPTATION
Abstract

Cited by 212 (39 self)
Fast changing, increasingly complex, and diverse computing platforms pose central problems in scientific computing: How to achieve, with reasonable effort, portable optimal performance? We present SPIRAL that considers this problem for the performance-critical domain of linear digital signal processing (DSP) transforms. For a specified transform, SPIRAL automatically generates high performance code that is tuned to the given platform. SPIRAL formulates the tuning as an optimization problem, and exploits the domain-specific mathematical structure of transform algorithms to implement a feedback-driven optimizer. Similar to a human expert, for a specified transform, SPIRAL “intelligently” generates and explores algorithmic and implementation choices to find the best match to the computer’s microarchitecture. The “intelligence” is provided by search and learning techniques that exploit the structure of the algorithm and implementation space to guide the exploration and optimization. SPIRAL generates high performance code for a broad set of DSP transforms including the discrete Fourier transform, other trigonometric transforms, filter transforms, and discrete wavelet transforms. Experimental results show that the code generated by SPIRAL competes with, and sometimes outperforms, the best available human-tuned transform library code.
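A minimal sketch of what a feedback-driven search over algorithmic choices can look like, assuming the only choice is how a mixed-radix Cooley-Tukey DFT factors its size. The helper names (`plans`, `ct_dft`, `search`) are hypothetical, and the exhaustive timing loop stands in for the search and learning techniques the abstract mentions; it is not SPIRAL's actual optimizer.

```python
import cmath
import time


def naive_dft(x):
    """Reference O(n^2) DFT, used as the base case and for checking."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def plans(n):
    """Yield every ordered factorization of n into factors >= 2 (hypothetical 'plan')."""
    yield (n,)
    for r in range(2, n):
        if n % r == 0:
            for rest in plans(n // r):
                yield (r,) + rest


def ct_dft(x, plan):
    """Cooley-Tukey DFT; plan is a tuple of radices whose product is len(x)."""
    n = len(x)
    if len(plan) <= 1:
        return naive_dft(x)
    r = plan[0]          # radix used at this level
    k = n // r           # size of each sub-transform
    # r sub-transforms of size k on the strided subsequences x[b], x[b+r], ...
    subs = [ct_dft(x[b::r], plan[1:]) for b in range(r)]
    out = [0j] * n
    for c in range(k):
        # Apply twiddle factors, then combine with a size-r DFT.
        z = [cmath.exp(-2j * cmath.pi * b * c / n) * subs[b][c] for b in range(r)]
        col = naive_dft(z)
        for d in range(r):
            out[c + k * d] = col[d]
    return out


def search(n, repeats=3):
    """Time every plan on this machine and return the fastest one."""
    x = [complex(i % 7, -(i % 3)) for i in range(n)]
    best_plan, best_time = None, float("inf")
    for plan in plans(n):
        start = time.perf_counter()
        for _ in range(repeats):
            ct_dft(x, plan)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_plan, best_time = plan, elapsed
    return best_plan
```

Calling `search(16)` returns whichever of the eight factorizations of 16 ran fastest during that run; the answer can differ between machines and between runs, which is the point of measuring rather than predicting.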
SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms
 Journal of High Performance Computing and Applications
, 2004
Abstract

Cited by 82 (20 self)
SPIRAL is a generator for libraries of fast software implementations of linear signal processing transforms. These libraries are adapted to the computing platform and can be reoptimized as the hardware is upgraded or replaced. This paper describes the main components of SPIRAL: the mathematical framework that concisely describes signal transforms and their fast algorithms; the formula generator that captures at the algorithmic level the degrees of freedom in expressing a particular signal processing transform; the formula translator that encapsulates the compilation degrees of freedom when translating a specific algorithm into an actual code implementation; and, finally, an intelligent search engine that finds within the large space of alternative formulas and implementations
Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy
 In International Symposium on Code Generation and Optimization
, 2005
Abstract

Cited by 57 (10 self)
This paper describes an algorithm for simultaneously optimizing across multiple levels of the memory hierarchy for dense-matrix computations. Our approach combines compiler models and heuristics with guided empirical search to take advantage of their complementary strengths. The models and heuristics limit the search to a small number of candidate implementations, and the empirical results provide the most accurate information to the compiler to select among candidates and tune optimization parameter values. We have developed an initial implementation and applied this approach to two case studies, Matrix Multiply and Jacobi Relaxation. For Matrix Multiply, our results on two architectures, SGI R10000 and Sun UltraSparc IIe, outperform the native compiler, and either outperform or achieve performance comparable to the ATLAS self-tuning library and the hand-tuned vendor BLAS library. Jacobi results also substantially outperform the native compilers.
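As a rough illustration of combining a model with empirical search, the sketch below assumes a single tuning parameter (the tile size of a blocked matrix multiply) and a deliberately crude model: keep only power-of-two tiles for which three tiles fit in a cache of a given size. The function names and the `cache_bytes` parameter are hypothetical and are not taken from the paper.

```python
import time


def matmul_tiled(a, b, n, tile):
    """Blocked n-by-n matrix multiply on lists of lists."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + tile, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + tile, n)):
                            row_c[j] += aik * row_b[j]
    return c


def model_candidates(n, cache_bytes, elem_bytes=8):
    """Model step: keep power-of-two tiles whose three tiles fit in the cache."""
    cands = []
    t = 2
    while t <= n:
        if 3 * t * t * elem_bytes <= cache_bytes:
            cands.append(t)
        t *= 2
    return cands


def empirical_search(n, cache_bytes):
    """Empirical step: time only the candidates the model kept; return the fastest."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    best_tile, best_time = None, float("inf")
    for tile in model_candidates(n, cache_bytes):
        start = time.perf_counter()
        matmul_tiled(a, b, n, tile)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_tile, best_time = tile, elapsed
    return best_tile
```

The model shrinks the search space cheaply; the timing loop then decides among the survivors using measurements from the actual machine.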
Optimizing Sorting with Genetic Algorithms
 In The International Symposium on Code Generation and Optimization
, 2005
Short Vector Code Generation for the Discrete Fourier Transform
 In Proc. IEEE Int’l Parallel and Distributed Processing Symposium (IPDPS)
Abstract

Cited by 28 (16 self)
In this paper we use a mathematical approach to automatically generate high performance short vector code for the discrete Fourier transform (DFT). We represent the well-known Cooley-Tukey fast Fourier transform in a mathematical notation and formally derive a "short vector variant". Using this recursion we generate for a given DFT a large number of different algorithms, represented as formulas, and translate them into short vector code. Then we present a vector-code-specific dynamic programming method that searches in the space of different implementations for the fastest on the given architecture. We implemented this approach as part of the SPIRAL library generator. On Pentium III and 4, our automatically generated SSE and SSE2 vector code compares favorably with the hand-tuned Intel vendor library.
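For orientation, the Cooley-Tukey recursion is commonly written in Kronecker-product notation as shown below. This is the standard general-radix form, not the paper's short vector variant, and the symbol names are the conventional ones rather than anything quoted from the abstract: $I_n$ is the $n \times n$ identity matrix, $T^{mn}_n$ is a diagonal matrix of twiddle factors, and $L^{mn}_m$ is a stride permutation.

```latex
\[
\mathrm{DFT}_{mn} \;=\; (\mathrm{DFT}_m \otimes I_n)\; T^{mn}_n\; (I_m \otimes \mathrm{DFT}_n)\; L^{mn}_m
\]
```

Applying the rule recursively to $\mathrm{DFT}_m$ and $\mathrm{DFT}_n$, with different choices of $m$ and $n$ at each level, is what produces the large space of candidate formulas.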
A SIMD Vectorizing Compiler for Digital Signal Processing Algorithms
 In Proc. IPDPS
, 2002
Abstract

Cited by 28 (20 self)
Short vector SIMD instructions on recent microprocessors, such as SSE on Pentium III and 4, speed up code but are a major challenge to software developers. We present a compiler that automatically generates C code enhanced with short vector instructions for digital signal processing (DSP) transforms, such as the fast Fourier transform (FFT). The input to our compiler is a concise mathematical description of a DSP algorithm in the language SPL. SPL is used in the SPIRAL system (http://www.ece.cmu.edu/spiral) to generate highly optimized architecture-adapted implementations of DSP transforms. Interfacing our compiler with SPIRAL yields speedups of more than a factor of 2 in several important cases including the FFT and the discrete cosine transform (DCT) used in the JPEG compression standard. For the FFT our automatically generated code is competitive with the hand-coded Intel Math Kernel Library.
FFT program generation for shared memory: SMP and multicore
 In Proc. Supercomputing
, 2006
Abstract

Cited by 26 (13 self)
Chip makers’ response to the approaching end of CPU frequency scaling is multicore systems, which offer the same programming paradigm as traditional shared memory platforms but different performance characteristics. This situation considerably increases the burden on library developers and strengthens the case for automatic performance tuning frameworks such as Spiral, a program generator and optimizer for linear transforms such as the discrete Fourier transform (DFT). We present a shared memory extension of Spiral. The extension within Spiral consists of a rewriting system that manipulates the structure of transform algorithms to achieve load balancing and avoid false sharing, and of a backend to generate multithreaded code. Application to the DFT produces a novel class of algorithms suitable for multicore systems as validated by experimental results: we demonstrate parallelization speedup for sizes that fit into L1 cache and compare favorably to other DFT libraries across all small and midsize DFTs and considered platforms.
Fast Automatic Generation of DSP Algorithms
, 2001
Abstract

Cited by 21 (9 self)
SPIRAL is a generator of optimized, platform-adapted libraries for digital signal processing algorithms. SPIRAL's strategy translates the implementation task into a search in an expanded space of alternatives.
Adaptive Java Optimisation Using Instance-Based Learning
, 2004
Abstract

Cited by 12 (1 self)
This paper describes a portable, machine learning-based approach to Java optimisation. This approach uses an instance-based learning scheme to select good transformations drawn from Pugh's Unified Transformation Framework [11]. This approach was implemented and applied to a number of numerical Java benchmarks on two platforms. Using this scheme, we are able to gain over 70% of the performance improvement found when using an exhaustive iterative search for the best compiler optimisations. Thus we have a scheme that gives a high level of portable performance without any excessive compilations.
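A minimal sketch of instance-based selection, assuming a 1-nearest-neighbour rule over hand-picked program features. The feature vectors, the transformation labels, and the `TRAINING` table are all hypothetical placeholders; the abstract does not say which features or distance measure the paper uses.

```python
import math

# Hypothetical stored instances: (program features, best transformation found earlier).
# The three features might be, for example, loop depth, array count, and a branch flag.
TRAINING = [
    ((2.0, 1.0, 0.0), "unroll_4"),
    ((3.0, 3.0, 1.0), "tile_32_then_unroll_2"),
    ((1.0, 0.0, 0.0), "none"),
]


def nearest_instance(query, instances):
    """Return the stored (features, transformation) pair closest to the query."""
    return min(instances, key=lambda inst: math.dist(query, inst[0]))


def select_transformation(query, instances=TRAINING):
    """Reuse the transformation that worked best for the most similar known program."""
    return nearest_instance(query, instances)[1]
```

For instance, `select_transformation((3.0, 2.5, 1.0))` picks the label stored with the second instance, because that instance is the closest in Euclidean distance. No new compilations are needed at selection time, which fits the abstract's claim of avoiding excessive compilations.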
Optimizing Sorting with Machine Learning Algorithms
Abstract

Cited by 8 (0 self)
The growing complexity of modern processors has made the development of highly efficient code increasingly difficult. Manually developing highly efficient code is usually expensive but necessary due to the limitations of today’s compilers. A promising automatic code generation strategy, implemented by library generators such as ATLAS, FFTW, and SPIRAL, relies on empirical search to identify, for each target machine, the code characteristics, such as the tile size and instruction schedules, that deliver the best performance. This approach has mainly been applied to scientific codes which can be optimized by identifying code characteristics that depend only on the target machine. In this paper, we study the generation of sorting routines whose performance also depends on the characteristics of the input