Results 11-20 of 222
Decoupling algorithms from schedules for easy ...
Abstract

Cited by 27 (9 self)
Using existing programming tools, writing high-performance image processing code requires sacrificing readability, portability, and modularity. We argue that this is a consequence of conflating what computations define the algorithm with decisions about storage and the order of computation. We refer to these latter two concerns as the schedule, including choices of tiling, fusion, recomputation vs. storage, vectorization, and parallelism. We propose a representation for feed-forward imaging pipelines that separates the algorithm from its schedule, enabling high performance without sacrificing code clarity. This decoupling simplifies the algorithm specification: images and intermediate buffers become functions over an infinite integer domain, with no explicit storage or boundary conditions. Imaging pipelines are compositions of functions. Programmers separately specify scheduling strategies for the various functions composing the algorithm, which allows them to efficiently explore different optimizations without changing the algorithmic code. We demonstrate the power of this representation by expressing a range of recent image processing applications in an embedded domain-specific language called Halide, and compiling them for ARM, x86, and GPUs. Our compiler targets SIMD units, multiple cores, and complex memory hierarchies. We demonstrate that it can handle algorithms such as a camera raw pipeline, the bilateral grid, fast local Laplacian filtering, and image segmentation. The algorithms expressed in our language are both shorter and faster than state-of-the-art implementations.
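The algorithm/schedule split described in this abstract can be sketched in miniature in plain Python. This is illustrative only: `input_im`, `realize`, and the schedule helpers below are invented for this sketch and are not Halide's API.

```python
# A pure-Python sketch of the algorithm/schedule split (not Halide's API).

def input_im(x, y):
    # Images are functions over an infinite integer domain:
    # no explicit storage or boundary conditions.
    return (x * 7 + y * 3) % 256

def blur_x(x, y):
    # Algorithm: what is computed, as a composition of functions.
    return (input_im(x - 1, y) + input_im(x, y) + input_im(x + 1, y)) // 3

def realize(f, w, h, schedule):
    """Evaluate f over [0,w) x [0,h); the schedule fixes the loop order."""
    out = [[0] * w for _ in range(h)]
    for x, y in schedule(w, h):
        out[y][x] = f(x, y)
    return out

def row_major(w, h):
    for y in range(h):
        for x in range(w):
            yield x, y

def tiled(tw, th):
    # A scheduling choice (tiling) expressed separately from the algorithm.
    def order(w, h):
        for ty in range(0, h, th):
            for tx in range(0, w, tw):
                for y in range(ty, min(ty + th, h)):
                    for x in range(tx, min(tx + tw, w)):
                        yield x, y
    return order

# Changing the schedule never changes the result, only the traversal order.
assert realize(blur_x, 8, 8, row_major) == realize(blur_x, 8, 8, tiled(4, 2))
```

Exploring a new schedule here means swapping one argument; the algorithmic code (`blur_x`) is untouched, which is the point the paper makes.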
Bandit-Based Optimization on Graphs with Application to Library Performance Tuning
, 2009
Abstract

Cited by 27 (6 self)
The problem of choosing fast implementations for a class of recursive algorithms such as the fast Fourier transforms can be formulated as an optimization problem over the language generated by a suitably defined grammar. We propose a novel algorithm that solves this problem by reducing it to maximizing an objective function over the sinks of a directed acyclic graph. This algorithm evaluates nodes using Monte Carlo techniques and grows a subgraph in the most promising directions by considering local maximum k-armed bandits. When used inside an adaptive linear transform library, it cuts down the search time by an order of magnitude compared to the existing algorithm. In some cases, the performance of the implementations found is also increased by up to 10%, which is of considerable practical importance since it consequently improves the performance of all applications using the library.
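The paper's graph-structured search is specialized, but its core ingredient, a k-armed bandit that trades exploration against exploitation, can be sketched with the classic UCB1 rule. The reward function and arm count below are toy stand-ins, not the paper's setup.

```python
import math
import random

def ucb1(pull, k, budget, c=2.0):
    """Choose among k arms with UCB1. `pull(i)` returns a stochastic
    reward for arm i, e.g. a normalized score of implementation i."""
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, budget + 1):
        if t <= k:                      # play each arm once first
            i = t - 1
        else:
            # exploitation term + exploration bonus
            i = max(range(k), key=lambda a: sums[a] / counts[a]
                    + math.sqrt(c * math.log(t) / counts[a]))
        sums[i] += pull(i)
        counts[i] += 1
    # final recommendation: best empirical mean
    return max(range(k), key=lambda a: sums[a] / counts[a])

random.seed(0)
means = [0.2, 0.5, 0.9]  # arm 2 corresponds to the fastest implementation
best = ucb1(lambda i: means[i] + random.gauss(0, 0.1), k=3, budget=300)
```

In the paper's setting, each "arm" is a local expansion direction in the DAG of grammar derivations, and the reward is a Monte Carlo performance estimate.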
Fast arithmetic for triangular sets: from theory to practice
 ISSAC'07
, 2007
Abstract

Cited by 26 (21 self)
We study arithmetic operations for triangular families of polynomials, concentrating on multiplication in dimension zero. By a suitable extension of fast univariate Euclidean division, we obtain theoretical and practical improvements over a direct recursive approach; for a family of special cases, we reach quasi-linear complexity. The main outcome we have in mind is the acceleration of higher-level algorithms, by interfacing our low-level implementation with languages such as AXIOM or Maple. We show the potential for huge speedups, by comparing two AXIOM implementations of van Hoeij and Monagan's modular GCD algorithm.
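In dimension zero the workhorse operation is multiplication modulo a monic polynomial (one variable of a triangular set). The sketch below shows the plain specification that the paper's fast Euclidean division accelerates; it is a naive quadratic implementation, not the paper's algorithm.

```python
def poly_mul(a, b):
    """Multiply polynomials given as coefficient lists, low degree first."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def poly_rem(a, t):
    """Remainder of a modulo the monic polynomial t (Euclidean division)."""
    a = a[:]
    d = len(t) - 1                     # degree of t
    while len(a) > d:
        c = a[-1]                      # leading coefficient to eliminate
        for k in range(d + 1):
            a[len(a) - 1 - k] -= c * t[d - k]
        a.pop()
    return a + [0] * (d - len(a))

def mulmod(a, b, t):
    """Multiplication in the quotient ring k[x]/(t): the basic operation
    for triangular sets in dimension zero (one variable shown here)."""
    return poly_rem(poly_mul(a, b), t)

# Example: in Z[x]/(x^2 + 1), x * x = -1.
t = [1, 0, 1]                          # x^2 + 1, low degree first
```

For a triangular family T1(x1), T2(x1, x2), ..., the multivariate case applies this reduction recursively, one variable at a time; that recursion is where the paper's improvements over the direct approach come in.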
Annotation-Based Empirical Performance Tuning Using Orio
Abstract

Cited by 24 (6 self)
In many scientific applications, significant time is spent tuning codes for a particular high-performance architecture. Tuning approaches range from the relatively non-intrusive (e.g., using compiler options) to extensive code modifications that attempt to exploit specific architecture features. Intrusive techniques often result in code changes that are not easily reversible, which can negatively impact readability, maintainability, and performance on different architectures. We introduce an extensible annotation-based empirical tuning system called Orio, which is aimed at improving both performance and productivity by enabling software developers to insert annotations, in the form of structured comments, into their source code that trigger a number of low-level performance optimizations on a specified code fragment. To maximize the performance tuning opportunities, we have designed the annotation processing infrastructure to support both architecture-independent and architecture-specific code optimizations. Given the annotated code as input, Orio generates many tuned versions of the same operation and empirically evaluates them to select the best-performing one for production use. We have also enabled the use of the PLuTo automatic parallelization tool in conjunction with Orio to generate efficient OpenMP-based parallel code. We describe our experimental results involving a number of computational kernels, including dense array and sparse matrix operations.
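The annotation-driven workflow, parse a structured comment, generate variants, time them, keep the fastest, can be sketched as follows. The annotation syntax and helper names here are invented for illustration and are not Orio's actual grammar.

```python
import re
import time

# A toy annotation in a structured comment (illustrative, not Orio syntax).
ANNOTATED = """\
/*@ unroll factors=(1,2,4) @*/
for i in range(n): s += a[i]
"""

def parse_annotation(src):
    """Extract the tuning parameter values from the structured comment."""
    m = re.search(r"/\*@ unroll factors=\(([\d,]+)\) @\*/", src)
    return [int(f) for f in m.group(1).split(",")]

def make_variant(factor):
    """Generate one tuned version of the annotated summation loop."""
    def variant(a):
        s, i, n = 0, 0, len(a)
        while i + factor <= n:
            for k in range(factor):    # the 'unrolled' body
                s += a[i + k]
            i += factor
        while i < n:                   # remainder loop
            s += a[i]
            i += 1
        return s
    return variant

def autotune(src, data):
    """Empirically evaluate each variant and keep the fastest."""
    best, best_t = None, float("inf")
    for f in parse_annotation(src):
        v = make_variant(f)
        t0 = time.perf_counter()
        v(data)
        t = time.perf_counter() - t0
        if t < best_t:
            best, best_t = v, t
    return best

best = autotune(ANNOTATED, list(range(1000)))
```

Whichever variant wins the timing race, all variants compute the same result, which is the invariant any such tuner must preserve.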
A rewriting system for the vectorization of signal transforms
 In Proc. High Performance Computing for Computational Science (VECPAR
, 2006
Abstract

Cited by 23 (18 self)
We present a rewriting system that automatically vectorizes signal transform algorithms at a high level of abstraction. The input to the system is a transform algorithm given as a formula in the well-known Kronecker product formalism. The output is a "vectorized" formula, which means it consists exclusively of constructs that can be directly mapped into short vector code. This approach obviates compiler vectorization, which is known to be limited in this domain. We included formula vectorization in the Spiral program generator for signal transforms, which enables us to generate vectorized code and optimize through search over alternative algorithms. Benchmarks for the discrete Fourier transform (DFT) show that our generated floating-point code is competitive with, and our fixed-point code clearly outperforms, the best available libraries.
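The central "vectorizable" construct in such formulas is I_m ⊗ A, which applies A independently to m contiguous blocks of the input. A minimal pure-Python illustration of the formalism (not Spiral's implementation):

```python
def mat_vec(A, x):
    """Apply a matrix (list of rows) to a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in ra for b in rb]
            for ra in A for rb in B]

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def blockwise(A, x, m):
    """(I_m kron A) x computed as m independent applications of A:
    exactly the structure that maps onto short-vector/SIMD code."""
    n = len(A[0])
    out = []
    for b in range(m):
        out.extend(mat_vec(A, x[b * n:(b + 1) * n]))
    return out

F2 = [[1, 1], [1, -1]]               # the 2-point DFT (butterfly)
x = [1, 2, 3, 4]
# The blockwise form agrees with the explicit Kronecker matrix.
assert blockwise(F2, x, 2) == mat_vec(kron(identity(2), F2), x)
```

Rewriting a transform formula until it consists only of such I_m ⊗ A factors (plus permutations) is what makes the direct mapping to vector instructions possible.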
POET: Parameterized Optimizations for Empirical Tuning
Abstract

Cited by 22 (7 self)
The excessive complexity of both machine architectures and applications has made it difficult for compilers to statically model and predict application behavior. This observation motivates the recent interest in performance tuning using empirical techniques. We present a new embedded scripting language, POET (Parameterized Optimization for Empirical Tuning), for parameterizing complex code transformations so that they can be empirically tuned. The POET language aims to significantly improve the generality, flexibility, and efficiency of existing empirical tuning systems. We have used the language to parameterize and empirically tune three loop optimizations (interchange, blocking, and unrolling) for two linear algebra kernels. We show experimentally that the time required to tune these optimizations using POET, which does not require any program analysis, is significantly shorter than when using a full compiler-based source-code optimizer that performs sophisticated program analysis and optimizations.
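A "parameterized transformation" in this sense is a code generator whose parameters are left open until tuning time. The toy generator below parameterizes loop unrolling as text-level code generation; it is a sketch of the idea, not POET's scripting syntax.

```python
# A toy parameterized code transformation (illustrative; POET's real
# notation is its own scripting language).
TEMPLATE_BODY = "acc += data[{i}]"

def gen_unrolled(n, factor):
    """Emit source for a summation loop with a given unroll factor."""
    lines = ["def kernel(data):",
             "    acc = 0",
             "    for i in range(0, {}, {}):".format(n - n % factor, factor)]
    for k in range(factor):
        lines.append("        " + TEMPLATE_BODY.format(i="i + %d" % k))
    lines.append("    for i in range({}, {}):".format(n - n % factor, n))
    lines.append("        " + TEMPLATE_BODY.format(i="i"))
    lines.append("    return acc")
    return "\n".join(lines)

def build(src):
    """Compile generated source into a callable kernel."""
    ns = {}
    exec(src, ns)
    return ns["kernel"]

# Every parameter setting yields semantically identical code; an empirical
# tuner would time these variants and keep the fastest.
variants = {f: build(gen_unrolled(10, f)) for f in (1, 2, 4)}
```

No program analysis is needed to generate the variants, which mirrors the paper's point: correctness is guaranteed by construction of the parameterized transformation, and the choice among variants is purely empirical.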
Multilevel tiling: M for the price of one
 in Proceedings of the ACM/IEEE Conference on Supercomputing, 2007
Abstract

Cited by 21 (0 self)
Tiling is a widely used loop transformation for exposing and exploiting parallelism and data locality. High-performance implementations use multiple levels of tiling to exploit the hierarchy of parallelism and cache/register locality. Efficient generation of multilevel tiled code is essential for effective use of multilevel tiling. Parameterized tiled code, where tile sizes are not fixed but left as symbolic parameters, can enable several dynamic and run-time optimizations. Previous solutions to multilevel tiled loop generation are limited to the case where tile sizes are fixed at compile time. We present an algorithm that can generate multilevel parameterized tiled loops at the same cost as generating single-level tiled loops. The efficiency of our method is demonstrated on several benchmarks. We also present a method, useful in register tiling, for separating partial and full tiles at any arbitrary level of tiling. The code generator we have implemented is available as an open source tool.
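The shape of parameterized multilevel tiled loops, and the full/partial tile separation, can be illustrated on a one-dimensional iteration space. The tile sizes `T1` and `T2` below are ordinary runtime values, not compile-time constants; this is a sketch of the loop structure, not the paper's generation algorithm.

```python
def tiled_2level(n, T1, T2):
    """Enumerate 0..n-1 with two levels of parameterized tiling:
    outer tiles of size T1, inner tiles of size T2 (both runtime
    parameters, as in parameterized tiled code)."""
    for t1 in range(0, n, T1):
        hi1 = min(t1 + T1, n)            # outer tile may be partial
        for t2 in range(t1, hi1, T2):
            for i in range(t2, min(t2 + T2, hi1)):
                yield i

def full_and_partial(n, T):
    """Separate full tiles from the trailing partial tile at one level
    (the separation that matters for register tiling)."""
    full = [range(t, t + T) for t in range(0, n - T + 1, T)]
    start = len(full) * T
    partial = [range(start, n)] if start < n else []
    return full, partial
```

Whatever tile sizes are chosen at run time, the tiled enumeration visits exactly the original iteration space, which is the correctness condition tiled code generation must maintain.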
Autotuning a high-level language targeted to GPU codes
 In Innovative Parallel Computing Conference. IEEE
, 2012
Abstract

Cited by 20 (2 self)
Determining the best set of optimizations to apply to a kernel to be executed on the graphics processing unit (GPU) is a challenging problem. There are large sets of possible optimization configurations that can be applied, and many applications have multiple kernels. Each kernel may require a specific configuration to achieve the best performance, and moving an application to new hardware often requires a new optimization configuration for each kernel. In this work, we apply optimizations to GPU code using HMPP, a high-level directive-based language and source-to-source compiler that can generate CUDA/OpenCL code. However, programming with high-level languages may mean a loss of performance compared to using low-level languages. Our work shows that it is possible to improve the performance of a high-level language by using autotuning. We perform autotuning on a large optimization space on GPU kernels, focusing on loop permutation, loop unrolling, tiling, and specifying which loop(s) to parallelize, and show results on convolution kernels, codes in the PolyBench suite, and an implementation of belief propagation for stereo vision. The results show that our autotuned HMPP-generated implementations are significantly faster than the default HMPP implementation and can meet or exceed the performance of manually coded CUDA/OpenCL implementations.
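Per-kernel configuration search over a cross-product space of optimization knobs is the core mechanism here. The space and the cost function below are toy stand-ins (a real tuner times GPU runs; HMPP's directive space is far richer), but the search skeleton is the same.

```python
import itertools

# A toy per-kernel optimization space (illustrative values only).
SPACE = {
    "unroll": (1, 2, 4, 8),
    "tile": (8, 16, 32),
    "permute": ("ij", "ji"),
}

def configurations(space):
    """Enumerate the cross product of all optimization settings."""
    keys = sorted(space)
    for vals in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, vals))

def autotune_kernel(measure, space):
    """Pick the configuration with the lowest measured cost for one
    kernel; each kernel in an application is tuned independently."""
    return min(configurations(space), key=measure)

# A stand-in cost model for demonstration (a real tuner would compile
# and time the generated GPU code for each configuration).
def fake_time(cfg):
    return (abs(cfg["unroll"] - 4) + abs(cfg["tile"] - 16)
            + (cfg["permute"] == "ij"))

best = autotune_kernel(fake_time, SPACE)
```

Because the best configuration differs per kernel and per device, rerunning this loop after a hardware change is exactly the "new configuration for each kernel" cost the abstract describes; autotuning automates it.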
Predictive Modeling in a Polyhedral Optimization Space
 INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION (CGO'11)
, 2011
Abstract

Cited by 18 (3 self)
Significant advances in compiler optimization have been made in recent years, enabling many transformations such as tiling, fusion, parallelization, and vectorization on imperfectly nested loops. Nevertheless, the problem of finding the best combination of loop transformations remains a major challenge. Polyhedral models for compiler optimization have demonstrated strong potential for enhancing program performance, in particular for compute-intensive applications. But existing static cost models to optimize polyhedral transformations have significant limitations, and iterative compilation has become a very promising alternative for finding the most effective transformations. Since the number of polyhedral optimization alternatives can be enormous, however, it is often impractical to iterate over a significant fraction of the entire space of polyhedrally transformed variants. Recent research has focused on iterating over this search space either with manually constructed heuristics or with automatic but very expensive search algorithms (e.g., genetic algorithms) that can eventually find good points in the polyhedral space. In this paper, we propose the use of machine learning to address the problem of selecting the best polyhedral optimizations. We show that these models can quickly find high-performance program variants in the polyhedral space, without resorting to extensive empirical search. We introduce models that take as input a characterization of a program based on its dynamic behavior, and predict the performance of aggressive high-level polyhedral transformations that include tiling, parallelization, and vectorization. We allow for a minimal empirical search on the target machine, discovering on average 83% of the search-space-optimal combinations in at most 5 runs. Our end-to-end framework is validated using numerous benchmarks on two multicore platforms.
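The "predict, then minimally verify" loop can be sketched with a nearest-neighbor stand-in for the paper's learned models. The feature vectors, transformation names, and speedup numbers below are toy values invented for illustration.

```python
# A minimal sketch of model-guided search: rank transformations by what
# worked on the nearest known program, then empirically run only the
# top few candidates on the target machine.

def dist(u, v):
    """Squared Euclidean distance between feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def predict_ranking(features, training):
    """Rank candidate transformation sequences by the performance they
    achieved on the nearest known program (1-NN style stand-in)."""
    nearest = min(training, key=lambda t: dist(features, t["features"]))
    return sorted(nearest["speedups"], key=nearest["speedups"].get,
                  reverse=True)

def tune(features, training, measure, runs=2):
    """Minimal empirical search: actually measure only `runs` variants."""
    ranked = predict_ranking(features, training)[:runs]
    return max(ranked, key=measure)

training = [
    {"features": (0.9, 0.1),   # e.g. high locality, low branching (toy)
     "speedups": {"tile+vectorize": 3.1, "parallelize": 1.4, "none": 1.0}},
    {"features": (0.2, 0.8),
     "speedups": {"parallelize": 2.7, "tile+vectorize": 1.2, "none": 1.0}},
]
```

The structure mirrors the abstract: a model trained on dynamic program characterizations narrows the enormous polyhedral space to a handful of candidates, and only those are run on the target machine.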
Operator Language: A Program Generation Framework for Fast Kernels
, 2009
Abstract

Cited by 18 (4 self)
We present the Operator Language (OL), a framework to automatically generate fast numerical kernels. OL provides the structure to extend the program generation system Spiral beyond the transform domain. Using OL, we show how to automatically generate library functionality for the fast Fourier transform and multiple non-transform kernels, including matrix-matrix multiplication, synthetic aperture radar (SAR), circular convolution, sorting networks, and Viterbi decoding. The control flow of the kernels is data-independent, which allows us to cast their algorithms as operator expressions. Using rewriting systems, a structural architecture model, and empirical search, we automatically generate very fast C implementations for state-of-the-art multicore CPUs that rival hand-tuned implementations.
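Casting an algorithm as an operator expression and expanding it with rewrite rules can be shown on a tiny symbolic example. The rule below is a Cooley-Tukey-style breakdown written purely symbolically; the tuple encoding and rule set are invented for this sketch, not OL's notation.

```python
# Toy rewriting of operator expressions. Expressions are nested tuples:
# ("compose", f, g) means "apply g, then f"; ("tensor", f, g) is f kron g;
# bare strings ("DFT_4", "I_2", ...) are symbolic operators.

RULES = {
    # A Cooley-Tukey-style breakdown rule, written symbolically:
    # DFT_4 -> (DFT_2 kron I_2) . T_4 . (I_2 kron DFT_2) . L_4
    "DFT_4": ("compose", ("tensor", "DFT_2", "I_2"),
              ("compose", "T_4",
               ("compose", ("tensor", "I_2", "DFT_2"), "L_4"))),
}

def rewrite(expr):
    """Exhaustively apply rules until the expression is fully expanded."""
    if isinstance(expr, str):
        if expr in RULES:
            return rewrite(RULES[expr])
        return expr                      # terminal operator, keep as-is
    # keep the operator head, rewrite the operands
    return (expr[0],) + tuple(rewrite(s) for s in expr[1:])

expanded = rewrite("DFT_4")
```

A generator like Spiral pairs such rewriting with an architecture model and empirical search to decide which of the many possible expansions to emit as code.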