A Cross-Input Adaptive Framework for GPU Program Optimizations
, 2008
Abstract

Cited by 41 (7 self)
Recent years have seen a trend in using graphics processing units (GPUs) as accelerators for general-purpose computing. The inexpensive, single-chip, massively parallel architecture of the GPU has evidently brought factors of speedup to many numerical applications. However, the development of a high-quality GPU application is challenging, due to the large optimization space and the complex, unpredictable effects of optimizations on GPU program performance. Recently, several studies have attempted to use empirical search to help the optimization. Although those studies have shown promising results, one important factor in the optimization, program inputs, has remained unexplored. In this work, we initiate the exploration of this new dimension. By conducting a series of measurements, we find that the ability to adapt to program inputs is important for some applications to achieve their best performance on the GPU. In light of these findings, we develop an input-adaptive optimization framework, named G-ADAPT, to address this influence by constructing cross-input predictive models that automatically predict the (near-)optimal configurations for an arbitrary input to a GPU program. The results demonstrate the promise of the framework as a tool to alleviate the productivity bottleneck in GPU programming.
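The cross-input idea in this abstract, learning a mapping from input characteristics to a well-performing configuration, can be sketched as a nearest-neighbor lookup over previously tuned inputs. All function names, features, and numbers below are illustrative assumptions, not the actual G-ADAPT framework:

```python
# Hedged sketch of cross-input adaptive tuning: learn a mapping from an
# input feature (here, problem size) to the best-performing configuration
# found offline (here, a thread-block size), then predict a (near-)optimal
# configuration for an unseen input. Illustrative only, not G-ADAPT's API.

def train_predictor(samples):
    """samples: list of (input_size, best_config) pairs from offline runs."""
    return sorted(samples)  # keep ordered by input size

def predict_config(model, input_size):
    """Predict a configuration: the nearest trained input wins."""
    return min(model, key=lambda s: abs(s[0] - input_size))[1]

# Hypothetical offline-tuning results: small inputs prefer small blocks.
model = train_predictor([(1_000, 64), (100_000, 128), (10_000_000, 256)])
```

A real system would use richer input features and a learned model rather than a single-feature nearest neighbor, but the prediction-at-run-time structure is the same.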
Online performance auditing: using hot optimizations without getting burned
 In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation
, 2006
Abstract

Cited by 39 (3 self)
As hardware complexity increases and virtualization is added at more layers of the execution stack, predicting the performance impact of optimizations becomes increasingly difficult. Production compilers and virtual machines invest substantial development effort in performance tuning to achieve good performance for a range of benchmarks. Although optimizations typically perform well on average, they often have unpredictable impact on running time, sometimes degrading performance significantly. Today's VMs perform sophisticated feedback-directed optimizations, but these techniques do not address performance degradations, and they actually make the situation worse by making the system more unpredictable. This paper presents an online framework for evaluating the effectiveness of optimizations, enabling an online system to automatically identify and correct performance anomalies that occur at runtime. This work opens the door for a fundamental shift in the way optimizations are developed and tuned for online systems, and may allow the body of work in offline empirical optimization search to be applied automatically at runtime. We present our implementation and evaluation of this system in a product Java VM.
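The auditing idea, measure both versions at runtime and back out of an optimization that actually hurts, can be sketched in a few lines. This is a minimal stand-in for the paper's statistical framework, not its actual decision procedure:

```python
import time

def best_time(fn, runs=5):
    """Best-of-N wall-clock time for one version of a function."""
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return min(times)

def audit(baseline, optimized, runs=5):
    """Online audit (simplified): keep the optimized version only if it is
    actually faster on this workload; otherwise revert to the baseline."""
    if best_time(optimized, runs) < best_time(baseline, runs):
        return optimized
    return baseline
```

A production VM would amortize the measurement cost and use confidence tests rather than a raw best-of-N comparison, but the revert-on-regression policy is the core of the idea.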
A framework for adaptive algorithm selection in STAPL
 In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 277–288
, 2005
Abstract

Cited by 39 (8 self)
Writing portable programs that perform well on multiple platforms or for varying input sizes and types can be very difficult because performance is often sensitive to the system architecture, the runtime environment, and input data characteristics. This is even more challenging on parallel and distributed systems due to the wide variety of system architectures. One way to address this problem is to adaptively select the best parallel algorithm for the current input data and system from a set of functionally equivalent algorithmic options. Toward this goal, we have developed a general framework for adaptive algorithm selection for use in the Standard Template Adaptive Parallel Library (STAPL). Our framework uses machine learning techniques to analyze data collected by STAPL installation benchmarks and to determine tests that will select among algorithmic options at runtime. We apply a prototype implementation of our framework to two important parallel operations, sorting and matrix multiplication, on multiple platforms and show that the framework determines runtime tests that correctly select the best performing algorithm from among several competing algorithmic options in 86–100% of the cases studied, depending on the operation and the system.
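The runtime-test idea can be sketched for the sorting case: a cheap test on input characteristics dispatches to one of several functionally equivalent implementations. The feature and cutoff below are hand-written illustrations, not STAPL's learned model:

```python
# Hedged sketch of adaptive algorithm selection: a runtime test on the
# input (here, just its length) picks among functionally equivalent sort
# implementations. The cutoff of 32 is illustrative, not a tuned value.

def insertion_sort(a):
    """Low-overhead O(n^2) sort, competitive only for small inputs."""
    a = list(a)
    for i in range(1, len(a)):
        x, j = a[i], i - 1
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
    return a

def adaptive_sort(a, small_cutoff=32):
    # Runtime test: small inputs favor insertion sort's low constant
    # factors; larger inputs go to the O(n log n) library sort.
    return insertion_sort(a) if len(a) < small_cutoff else sorted(a)
```

In the framework described by the abstract, both the choice of features and the decision thresholds would come from machine learning over installation benchmarks rather than being hard-coded.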
Cache and bandwidth aware matrix multiplication on the GPU
, 2003
Abstract

Cited by 38 (4 self)
Recent advances in the speed and programmability of consumer-level graphics hardware have sparked a flurry of research that goes beyond the realm of image synthesis and computer graphics. We examine the use of the GPU (graphics processing unit) as a tool for scientific computing by analyzing techniques for performing large matrix multiplies in GPU hardware. An earlier method for multiplying matrices on the GPU suffered from problems of memory bandwidth. This paper examines more efficient algorithms that make the implementation of large matrix multiplication on upcoming GPU architectures more competitive, using only 25% of the memory bandwidth and instructions of previous GPU algorithms.
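The bandwidth-saving principle behind such algorithms, reuse each loaded tile of the inputs many times, is the classic blocked matrix multiply. A pure-Python sketch of the tiling structure (not the paper's GPU shader implementation, and with an illustrative tile size) looks like this:

```python
# Hedged sketch of blocked (tiled) matrix multiplication: each tile of A
# and B is loaded once and reused across a whole tile of C, which is what
# cuts memory traffic relative to a naive triple loop. CPU illustration
# of the idea only, not the paper's GPU algorithm.

def blocked_matmul(A, B, tile=2):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # One tile-pair product, accumulated into a tile of C.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        aik = A[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            C[i][j] += aik * B[k][j]
    return C
```

On a GPU the tiles live in on-chip storage (texture cache or shared memory, depending on the architecture generation), but the loop restructuring is the same.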
Fast SVM training algorithm with decomposition on very large data sets
 IEEE Transactions on Pattern Analysis and Machine Intelligence
, 2005
Abstract

Cited by 38 (2 self)
Training a support vector machine on a data set of huge size with thousands of classes is a challenging problem. This paper proposes an efficient algorithm to solve this problem. The key idea is to introduce a parallel optimization step to quickly remove most of the non-support vectors, where block-diagonal matrices are used to approximate the original kernel matrix so that the original problem can be split into hundreds of subproblems that can be solved more efficiently. In addition, effective strategies such as kernel caching and efficient computation of the kernel matrix are integrated to speed up the training process. Our analysis of the proposed algorithm shows that its time complexity grows linearly with the number of classes and the size of the data set. In the experiments, many appealing properties of the proposed algorithm have been investigated, and the results show that the proposed algorithm has much better scaling capability than LIBSVM, SVM-light, and SVMTorch. Moreover, good generalization performance has also been achieved on several large databases.
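The key structural move, approximating the kernel matrix by its block-diagonal part so the problem decomposes into independent subproblems, can be shown directly on a small matrix. This only illustrates the decomposition step, not the paper's full training algorithm:

```python
# Hedged sketch of the block-diagonal kernel approximation: zeroing the
# off-diagonal blocks makes the blocks independent, so each diagonal
# block becomes its own subproblem. Illustrative only; the paper embeds
# this inside a parallel SVM decomposition procedure.

def block_diagonal(K, block):
    """Zero out entries of K outside its diagonal blocks."""
    n = len(K)
    return [[K[i][j] if i // block == j // block else 0.0
             for j in range(n)] for i in range(n)]

def split_blocks(K, block):
    """Extract the diagonal blocks, i.e. the independent subproblems."""
    n = len(K)
    return [[row[s:s + block] for row in K[s:s + block]]
            for s in range(0, n, block)]
```

Because the subproblems share no off-diagonal coupling, they can be solved in parallel, which is what makes the pre-filtering of non-support vectors cheap.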
Policy-Gradient Algorithms for Partially Observable Markov Decision Processes
, 2003
Abstract

Cited by 36 (2 self)
Partially observable Markov decision processes are interesting because of their ability to model most conceivable real-world learning problems, for example, robot navigation, driving a car, speech recognition, stock trading, and playing games. The downside of this generality is that exact algorithms are computationally intractable. Such computational complexity motivates approximate approaches. One such class of algorithms comprises the so-called policy-gradient methods from reinforcement learning. They seek to adjust the parameters of an agent in the direction that maximises the long-term average of a reward signal. Policy-gradient methods are attractive as a scalable approach for controlling partially observable Markov decision processes (POMDPs). In the most
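The "adjust parameters in the direction that increases reward" step can be shown on the simplest possible case, a REINFORCE-style update of a one-parameter logistic policy on a two-armed bandit. This is a toy illustration of the gradient update, not the POMDP algorithms the paper develops:

```python
import math
import random

def policy(theta):
    """Probability of choosing arm 1 under a logistic policy."""
    return 1.0 / (1.0 + math.exp(-theta))

def reinforce(steps=2000, lr=0.1, seed=0):
    """Hedged sketch of a policy-gradient loop: sample an action, observe
    its reward, and step theta along reward * grad log pi(action).
    Two-armed bandit where arm 1 always pays 1 and arm 0 pays 0."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        p = policy(theta)
        action = 1 if rng.random() < p else 0
        reward = 1.0 if action == 1 else 0.0
        # Gradient of log pi(action) for the logistic policy:
        # (1 - p) if action == 1, else -p.
        grad_logp = (1 - p) if action == 1 else -p
        theta += lr * reward * grad_logp
    return theta
```

After training, the policy should strongly prefer the better arm; in a POMDP the same update is applied to the parameters of a controller driven by observations rather than states.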
Automatic online tuning for fast Gaussian summation
Abstract

Cited by 35 (13 self)
Many machine learning algorithms require the summation of Gaussian kernel functions, an expensive operation if implemented straightforwardly. Several methods have been proposed to reduce the computational complexity of evaluating such sums, including tree- and analysis-based methods. These achieve varying speedups depending on the bandwidth, dimension, and prescribed error, making the choice between methods difficult for machine learning tasks. We provide an algorithm that combines tree methods with the Improved Fast Gauss Transform (IFGT). As originally proposed, the IFGT suffers from two problems: (1) the Taylor series expansion does not perform well for very low bandwidths, and (2) parameter selection is not trivial and can drastically affect performance and ease of use. We address the first problem by employing a tree data structure, resulting in four evaluation methods whose performance varies based on the distribution of sources and targets and input parameters such as desired accuracy and bandwidth. To solve the second problem, we present an online tuning approach that results in a black-box method that automatically chooses the evaluation method and its parameters to yield the best performance for the input data, desired accuracy, and bandwidth. In addition, the new IFGT parameter selection approach allows for tighter error bounds. Our approach chooses the fastest method at negligible additional cost, and has superior performance in comparisons with previous approaches.
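The "automatically chooses the evaluation method" skeleton can be sketched as timing each candidate on the actual input and committing to the fastest. This omits the IFGT error-bound machinery entirely and is only the online-selection shell, with hypothetical method names:

```python
import time

def tune_and_run(methods, data):
    """Hedged sketch of online tuning: time each candidate evaluation
    method once on the real input, then run the fastest one.
    methods: dict of name -> callable(data), all functionally equivalent."""
    timings = {}
    for name, fn in methods.items():
        t0 = time.perf_counter()
        fn(data)
        timings[name] = time.perf_counter() - t0
    best = min(timings, key=timings.get)
    return best, methods[best](data)
```

A real tuner would amortize the trial runs (e.g. time each method on a small subsample) so the selection cost stays negligible, as the abstract claims for the full system.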
MiDataSets: Creating the conditions for a more realistic evaluation of iterative optimization
 In Proceedings of the International Conference on High Performance Embedded Architectures & Compilers (HiPEAC)
, 2007
Abstract

Cited by 35 (17 self)
Iterative optimization has become a popular technique to obtain improvements over the default settings in a compiler for performance-critical applications, such as embedded applications. An implicit assumption, however, is that the best configuration found for any arbitrary data set will work well with other data sets that a program uses. In this article, we evaluate that assumption based on 20 data sets per benchmark of the MiBench suite. We find that, though a majority of programs exhibit stable performance across data sets, the variability can significantly increase with many optimizations. However, for the best optimization configurations, we find that this variability is in fact small. Furthermore, we show that it is possible to find a compromise optimization configuration across data sets which is often within 5% of the best possible configuration for most data sets, and that the iterative process can converge in less than 20 iterations (for a population of 200 optimization configurations). All these conclusions have significant and positive implications for the practical utilization of iterative optimization.
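One way to make the "compromise configuration across data sets" concrete: given measured runtimes for each configuration on each data set, pick the configuration whose worst slowdown relative to each data set's own best is smallest. The minimax selection metric below is an illustrative choice, not necessarily the one used in the article:

```python
# Hedged sketch of selecting a compromise optimization configuration.
# runtimes: dict config_name -> dict dataset_name -> measured runtime.
# We pick the configuration minimizing its worst-case slowdown versus
# each dataset's per-dataset best. Metric is illustrative.

def compromise_config(runtimes):
    datasets = next(iter(runtimes.values())).keys()
    # Best achievable runtime on each dataset across all configurations.
    best_per_ds = {d: min(cfg[d] for cfg in runtimes.values())
                   for d in datasets}

    def worst_slowdown(cfg):
        return max(runtimes[cfg][d] / best_per_ds[d] for d in datasets)

    return min(runtimes, key=worst_slowdown)
```

A configuration chosen this way is never the best on every data set, but it bounds how far it can fall behind on any of them, which matches the "within 5% for most data sets" style of result the abstract reports.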
Automating the Finite Element Method
, 2006
Abstract

Cited by 35 (10 self)
The finite element method can be viewed as a machine that automates the discretization of differential equations, taking as input a variational problem, a finite element, and a mesh, and producing as output a system of discrete equations. However, the generality of the framework provided by the finite element method is seldom reflected in implementations (realizations), which are often specialized and can handle only a small set of variational problems and finite elements (but are typically parametrized over the choice of mesh). This paper reviews ongoing research in the direction of a complete automation of the finite element method. In particular, this work discusses algorithms for the efficient and automatic computation of a system of discrete equations from a given variational problem, finite element, and mesh. It is demonstrated that by automatically generating and compiling efficient low-level code, it is possible to parametrize a finite element code over the variational problem and finite element in addition to the mesh.
Statistical models for empirical searchbased performance tuning
 International Journal of High Performance Computing Applications
, 2004
Abstract

Cited by 35 (2 self)
Achieving peak performance from the computational kernels that dominate application performance often requires extensive machine-dependent tuning by hand. Automatic tuning systems have emerged in response, and they typically operate by (1) generating a large number of possible, reasonable implementations of a kernel, and (2) selecting the fastest implementation by a combination of heuristic modeling, heuristic pruning, and empirical search (i.e., actually running the code). This paper presents quantitative data that motivates the development of such a search-based system, using dense matrix multiply as a case study. The statistical distributions of performance within spaces of reasonable implementations, when observed on a variety of hardware platforms, lead us to pose and address two general problems which arise during the search process. First, we develop a heuristic for stopping an exhaustive compile-time search early if a near-optimal implementation is found. Second, we show how to construct
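The early-stopping search described in the abstract can be sketched with a simple stand-in criterion: sample candidate implementations at random and stop once the best found has not been beaten for a fixed number of consecutive samples. The paper's actual heuristic is statistical (based on the observed performance distribution); this only shows the search-loop shape:

```python
import random

def search(candidates, measure, patience=50, seed=0):
    """Hedged sketch of empirical search with early stopping: measure
    randomly chosen candidates, stop after `patience` consecutive samples
    without improvement (a crude stand-in for a statistical stopping
    rule). measure(c) returns a performance score, higher is better."""
    rng = random.Random(seed)
    pool = list(candidates)
    best, best_perf, stale = None, float("-inf"), 0
    while pool and stale < patience:
        c = pool.pop(rng.randrange(len(pool)))
        perf = measure(c)
        if perf > best_perf:
            best, best_perf, stale = c, perf, 0
        else:
            stale += 1
    return best, best_perf
```

With `patience` larger than the pool, this degenerates to exhaustive search; the interesting regime is a large implementation space where most of the measurement budget is saved once a near-optimal candidate has been found.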