Results 1–8 of 8
A Note on Autotuning GEMM for GPUs
, 2009
"... Abstract. The development of high performance dense linear algebra (DLA) critically depends on highly optimized BLAS, and especially on the matrix multiplication routine (GEMM). This is especially true for Graphics Processing Units (GPUs), as evidenced by recently published results on DLA for GPUs t ..."
Cited by 32 (15 self)
Abstract. The development of high performance dense linear algebra (DLA) critically depends on highly optimized BLAS, and especially on the matrix multiplication routine (GEMM). This is especially true for Graphics Processing Units (GPUs), as evidenced by recently published results on DLA for GPUs that rely on highly optimized GEMM [13, 11]. However, the current best GEMM performance, e.g. of up to 375 GFlop/s in single precision and up to 75 GFlop/s in double precision arithmetic on NVIDIA's GTX 280, is difficult to achieve. The development involves extensive GPU knowledge and even reverse engineering to understand undocumented internals of the architecture that have been of key importance in the development [12]. In this paper, we describe GPU GEMM autotuning optimization techniques that allow us to keep up with changing hardware by rapidly reusing, rather than reinventing, existing ideas. Autotuning, as we show in this paper, is a very practical solution that, in addition to easy portability, often yields substantial speedups even on current GPUs (e.g. up to 27% in certain cases for both single and double precision GEMMs on the GTX 280). Keywords: Autotuning, matrix multiply, dense linear algebra, GPUs.
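The autotuning approach this abstract describes, generating kernel variants over a parameter space and benchmarking each one, can be illustrated in miniature. The sketch below sweeps tile sizes for a blocked matrix multiply in NumPy; the paper's actual work targets hand-written CUDA GEMM kernels, and the tile values and function names here are illustrative stand-ins, not the authors' code.

```python
import time
import numpy as np

def blocked_gemm(A, B, tile):
    """Blocked matrix multiply; the tile size is the tunable parameter."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                # NumPy slicing clamps at the matrix edge, so ragged
                # trailing tiles are handled automatically.
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

def autotune(n=256, tiles=(16, 32, 64, 128)):
    """Sweep the candidate tile sizes and keep the fastest variant."""
    rng = np.random.default_rng(0)
    A, B = rng.random((n, n)), rng.random((n, n))
    best_tile, best_t = None, float("inf")
    for tile in tiles:
        t0 = time.perf_counter()
        C = blocked_gemm(A, B, tile)
        t = time.perf_counter() - t0
        # every candidate must stay numerically correct
        assert np.allclose(C, A @ B)
        if t < best_t:
            best_tile, best_t = tile, t
    return best_tile
```

On a real GPU the search space would also cover thread-block shapes, register blocking, and memory-layout choices, which is what makes exhaustive manual tuning impractical and autotuning attractive.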
Dynamic Load Balancing on Dedicated Heterogeneous Systems
"... Abstract. Parallel computing in heterogeneous environments is drawing considerable attention due to the growing number of these kind of systems. Adapting existing code and libraries to such systems is a fundamental problem. The performance of this code is affected by the large interdependence betw ..."
Cited by 2 (0 self)
Abstract. Parallel computing in heterogeneous environments is drawing considerable attention due to the growing number of these kinds of systems. Adapting existing code and libraries to such systems is a fundamental problem. The performance of this code is affected by the large interdependence between the code and these parallel architectures. We have developed a dynamic load balancing library that allows parallel code to be adapted to heterogeneous systems for a wide variety of problems. The overhead introduced by our system is minimal and the cost to the programmer negligible. The strategy was validated on several problems to confirm the soundness of our proposal.
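The abstract does not describe the library's mechanism, but one common dynamic strategy on dedicated heterogeneous nodes is to resize each processor's share of work in proportion to its measured speed from the previous iteration. A generic sketch of that idea (not this library's actual API; the function name is invented):

```python
def rebalance(chunk_sizes, times):
    """Resize each worker's chunk in proportion to its measured speed
    (items processed per second), keeping the total workload constant."""
    total = sum(chunk_sizes)
    speeds = [c / t for c, t in zip(chunk_sizes, times)]
    share = sum(speeds)
    new = [max(1, round(total * s / share)) for s in speeds]
    # absorb rounding drift into the first chunk so totals still match
    new[0] += total - sum(new)
    return new
```

For example, if two workers each got 100 items but took 1 s and 3 s, the faster worker's next chunk triples relative to the slower one's, so both should finish at roughly the same time.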
Transistor Scaled HPC Application Performance
, 2012
"... We propose a radically new, biologically inspired, model of extreme scale computer on which application performance automatically scales with the transistor count even in the face of component failures. Today high performance computers are massively parallel systems composed of potentially hundreds ..."
We propose a radically new, biologically inspired, model of extreme scale computer on which application performance automatically scales with the transistor count even in the face of component failures. Today high performance computers are massively parallel systems composed of potentially hundreds of thousands of traditional processor cores, formed from trillions of transistors, consuming megawatts of power. Unfortunately, increasing the number of cores in a system, unlike increasing clock frequencies, does not automatically translate to application level improvements. No general auto-parallelization techniques or tools exist for HPC systems. To obtain application improvements, HPC application programmers must manually cope with the challenge of multicore programming and the significant drop in reliability associated with the sheer number of transistors. Drawing on biological inspiration, the basic premise behind this work is that computation can be dramatically accelerated by integrating a very large-scale, system-wide, predictive associative memory into the operation of the computer. The memory effectively turns computation into a form of pattern recognition and prediction whose result can be used to avoid significant fractions of computation. To be effective the expectation is that the memory will require billions of concurrent devices akin to biological cortical systems, where each device implements a small amount of storage,
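The core premise, that a large associative memory can answer many "computations" by recall rather than execution, can be shown with a toy exact-match stand-in. The proposed system uses approximate, predictive matching across billions of devices; this sketch only illustrates how recall substitutes for recomputation:

```python
class AssociativeMemory:
    """Toy associative memory: stores (input, result) pairs and answers
    repeated queries by lookup, skipping the computation entirely."""

    def __init__(self):
        self.table = {}
        self.hits = 0      # queries answered by recall
        self.computed = 0  # queries that required real computation

    def compute(self, x, fn):
        if x in self.table:
            self.hits += 1
            return self.table[x]
        self.computed += 1
        result = fn(x)
        self.table[x] = result
        return result
```

The fraction of work avoided is exactly the hit rate; the paper's bet is that at cortical scale, with predictive rather than exact matching, that fraction becomes large for real HPC workloads.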
Foundation award #1012798. Transistor Scaled HPC Application Performance
"... We propose a radically new, biologically inspired, model of extreme scale computer on which application performance automatically scales with the transistor count even in the face of component failures. Today high performance computers are massively parallel systems composed of potentially hundreds ..."
Abstract
 Add to MetaCart
(Show Context)
We propose a radically new, biologically inspired, model of extreme scale computer on which application performance automatically scales with the transistor count even in the face of component failures. Today high performance computers are massively parallel systems composed of potentially hundreds of thousands of traditional processor cores, formed from trillions of transistors, consuming megawatts of power. Unfortunately, increasing the number of cores in a system, unlike increasing clock frequencies, does not automatically translate to application level improvements. No general autoparallelization techniques or tools exist for HPC systems. To obtain application improvements, HPC application programmers must manually cope with the challenge of multicore programming and the significant drop in reliability associated with the sheer number of transistors. Drawing on biological inspiration, the basic premise behind this work is that computation can be dramatically accelerated by integrating a very largescale, systemwide, predictive associative memory into the operation of the computer. The memory effectively turns computation into a form of pattern recognition and prediction whose result can be used to avoid significant fractions of computation. To be effective the expectation is that the memory will require billions of concurrent devices akin to biological cortical systems, where each device implements a small amount of storage,
Characterizing the Relationship between ILU-type Preconditioners and the Storage Hierarchy
"... Many problems in high performance applications involve the solution of large sparse linear systems. A barrier encountered in this class of applications is the computational time required. Optimization techniques that focus on reducing internode communication and improving data partitioning/layout ca ..."
Many problems in high performance applications involve the solution of large sparse linear systems. A barrier encountered in this class of applications is the computational time required. Optimization techniques that focus on reducing inter-node communication and improving data partitioning/layout can significantly lower this computational barrier [9], [10], [3], [4], [2]. While parallelization can also be used to improve performance, the sparsity of the data reduces the effectiveness of direct parallel computation. ILU-type preconditioning techniques are widely recognized as being an extremely effective approach to providing efficient solvers [7]. These techniques have been used to increase the performance and reliability of Krylov subspace methods. However, a drawback of these approaches is that it is difficult to choose appropriate values for the preconditioner tuning parameters [1]. Usually, parameter selection is done through trial-and-error for a few sample matrices from a given application. For instance, Figure 1 shows the percentage of duples (a duple is a set of parameter values) providing convergence in less than 500 iterations using the preconditioner ILUD and the Krylov method GMRES. Several matrices from different scientific applications were tested [5] and fourteen possible values for each parameter were evaluated, giving us 378 possible combinations to try for each matrix.
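The trial-and-error sweep the authors describe can be mimicked on a small scale. Reproducing ILUD plus GMRES would require a sparse solver stack, so the sketch below substitutes a weighted-Jacobi relaxation weight omega as a hypothetical stand-in tuning parameter and applies the same "converges in under 500 iterations" criterion to each candidate value:

```python
import numpy as np

def weighted_jacobi_iters(A, b, omega, tol=1e-8, maxiter=500):
    """Iterations needed by weighted Jacobi with relaxation weight omega,
    or maxiter if it fails to converge within the budget."""
    D = np.diag(A)           # diagonal of A, used as the preconditioner
    x = np.zeros_like(b)
    for k in range(1, maxiter + 1):
        x = x + omega * (b - A @ x) / D
        if np.linalg.norm(b - A @ x) < tol * np.linalg.norm(b):
            return k
    return maxiter

def sweep(A, b, omegas):
    """Trial-and-error sweep: iteration count for each parameter value;
    a 'good' value is one whose count stays below the 500 cap."""
    return {w: weighted_jacobi_iters(A, b, w) for w in omegas}
```

In the paper's setting the tunable parameters are ILUD's drop tolerance and fill level rather than a relaxation weight, and the sweep runs over all 378 combinations per matrix; the principle, grading each parameter setting by whether the Krylov iteration converges within a budget, is the same.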
Environments
"... The general goal of this Thesis is the design of high performance tools for distributed image processing. Since the image processing demand of resources is increasing, parallel and distributed technologies represent an attractive solution for image processing computing intensive applications. To ach ..."
The general goal of this Thesis is the design of high performance tools for distributed image processing. Since the resource demands of image processing are increasing, parallel and distributed technologies represent an attractive solution for compute-intensive image processing applications. To achieve this goal we identify two main tasks: the first task is the development of a parallel image processing library, in order to satisfy high performance requirements using available facilities such as a local parallel cluster. The second is to enable the effective use of the library in different environments, i.e. distributed architectures and in particular the Grid. In this phase our scope is to plug the library in
A Standard and Software for Numerical Metadata submitted to ACM TOMS
, 2007
"... proceedings. Since changes may be made before publication, this preprint is made available with the understanding that anyone wanting to cite or reproduce it ascertains that no published version in journal or proceedings exists. This work was funded in part by the Los Alamos Computer Science Institu ..."
proceedings. Since changes may be made before publication, this preprint is made available with the understanding that anyone wanting to cite or reproduce it ascertains that no published version in journal or proceedings exists. This work was funded in part by the Los Alamos Computer Science Institute through the subcontract
Machine Learning for Multi-stage Selection of Numerical Methods
"... In various areas of numerical analysis, there are several possible algorithms for solving a problem. In such cases, each method potentially solves the problem, but the runtimes can widely differ, and breakdown is possible. Also, there is typically no governing theory for finding the best method, or ..."
In various areas of numerical analysis, there are several possible algorithms for solving a problem. In such cases, each method potentially solves the problem, but the runtimes can differ widely, and breakdown is possible. Also, there is typically no governing theory for finding the best method, or the theory is in essence uncomputable. Thus, the choice of the optimal method is in practice determined by experimentation and ‘numerical folklore’. However, a more systematic approach is needed, for instance since such choices may need to be made in a dynamic context such as a time-evolving system. Thus we formulate this as a classification problem: assign each numerical problem to a class corresponding to the best method for solving that problem. What makes this an interesting problem for Machine Learning is the large number of classes and their relationships. A method is a combination of (at least) a preconditioner and an iterative scheme, making the total number of methods the product of these individual cardinalities. Since this can be a very large number, we want to exploit this structure of the set of classes and find a way to classify the components of a method separately. We have developed various techniques for such multi-stage recommendations, using automatic recognition of superclasses. These techniques are shown to pay off very well in our application area of iterative linear system solvers. We present the basic concepts of our recommendation strategy, and give an overview of the software libraries that make up the Salsa (Self-Adapting Large-scale Solver Architecture) project.
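The key structural idea, classifying the preconditioner and the iterative scheme separately instead of over their cross-product, can be sketched with a toy 1-nearest-neighbor recommender. The feature vectors and labels below are invented for illustration; Salsa uses learned classifiers over real problem features:

```python
def nearest(features, examples):
    """1-nearest-neighbor over (feature_vector, label) training pairs;
    a toy stand-in for a learned per-component classifier."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(examples, key=lambda e: dist(e[0], features))[1]

def recommend(features, precond_examples, solver_examples):
    """Stage 1 picks a preconditioner, stage 2 an iterative scheme.
    The full method is their combination, so we classify over
    |preconditioners| + |solvers| classes instead of their product."""
    return (nearest(features, precond_examples),
            nearest(features, solver_examples))
```

With 14 preconditioners and 14 solvers, a joint classifier would face 196 classes while the staged one faces two 14-class problems; that reduction is what makes the multi-stage decomposition pay off as the component sets grow.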