Results 1–10 of 20
Towards dense linear algebra for hybrid GPU accelerated manycore systems
 Parallel Computing
Cited by 67 (20 self)
We highlight the trends leading to the increased appeal of using hybrid multicore + GPU systems for high performance computing. We present a set of techniques that can be used to develop efficient dense linear algebra algorithms for these systems. We illustrate the main ideas with the development of a hybrid LU factorization algorithm where we split the computation over a multicore and a graphics processor, and use particular techniques to reduce the amount of pivoting and communication between the hybrid components. This results in an efficient algorithm with balanced use of a multicore processor and a graphics processor.
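The split described above follows the structure of a blocked right-looking LU factorization: a narrow panel is factored on the multicore CPU, while the large trailing-matrix update, which is a matrix-matrix product, is what gets offloaded to the GPU. A minimal NumPy sketch of that structure (single device, no pivoting, and none of the paper's communication-reducing techniques):

```python
import numpy as np

def blocked_lu(A, nb=2):
    """Right-looking blocked LU without pivoting (illustration only).
    In the hybrid scheme, the narrow panel factorization would run on the
    multicore CPU and the trailing-matrix GEMM would run on the GPU."""
    A = np.array(A, dtype=np.float64)
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # "CPU" part: unblocked LU of the panel A[k:, k:e]
        for j in range(k, e):
            A[j + 1:, j] /= A[j, j]
            A[j + 1:, j + 1:e] -= np.outer(A[j + 1:, j], A[j, j + 1:e])
        if e < n:
            # "GPU" part: triangular solve for U12, then GEMM update of A22
            L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
            A[k:e, e:] = np.linalg.solve(L11, A[k:e, e:])
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A  # unit-lower L and upper U packed in one matrix
```

The GEMM in the last line dominates the flop count for large matrices, which is why offloading it to the GPU pays off.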
Accelerating Scientific Computations with Mixed Precision Algorithms
, 2008
Cited by 19 (3 self)
On modern architectures, 32-bit floating-point operations often run at least twice as fast as 64-bit operations. By using a combination of 32-bit and 64-bit floating-point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here applies not only to conventional processors but also to other technologies such as Field Programmable Gate Arrays (FPGAs), Graphics Processing Units (GPUs), and the STI Cell BE processor. Results on modern processor architectures and the STI Cell BE are presented.
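For dense systems the core pattern is classic mixed-precision iterative refinement: solve cheaply in single precision, compute residuals and corrections in double. A minimal NumPy sketch (for brevity the float32 system is re-solved each iteration; a real code would factor it once and reuse the factors):

```python
import numpy as np

def mixed_precision_solve(A, b, iters=10):
    """Iterative refinement: cheap float32 solves, float64 residuals."""
    A32 = A.astype(np.float32)
    # Initial solve entirely in single precision
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x  # residual computed in double precision
        # Correction from another single-precision solve
        x += np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
    return x
```

For a reasonably conditioned matrix, each pass multiplies the error by roughly the single-precision backward error, so a handful of iterations recovers full double-precision accuracy.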
Precimonious: Tuning assistant for floating-point precision
 Proc. of SC13
Cited by 15 (1 self)
Given the variety of numerical errors that can occur, floating-point programs are difficult to write, test, and debug. One common practice employed by developers without an advanced background in numerical analysis is using the highest available precision. While more robust, this can degrade program performance significantly. In this paper we present Precimonious, a dynamic program analysis tool to assist developers in tuning the precision of floating-point programs. Precimonious performs a search on the types of the floating-point program variables, trying to lower their precision subject to accuracy constraints and performance goals. Our tool recommends a type instantiation that uses lower precision while producing an accurate enough answer without causing exceptions. We evaluate Precimonious on several widely used functions from the GNU Scientific Library, two NAS Parallel Benchmarks, and three other numerical programs. For most of the programs analyzed, Precimonious reduces precision, which results in performance improvements as high as 41%.
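The flavor of such a search can be illustrated with a toy stand-in. Precimonious itself uses a delta-debugging search over a program's real variables and runs the instrumented program; the exhaustive search and the `harmonic_sum` example below are purely hypothetical illustrations, not the tool's algorithm:

```python
import numpy as np
from itertools import product

def tune_precision(run, reference, nvars, rel_err=1e-6):
    """Toy precision-tuning search: try every float32/float64 assignment
    for `nvars` variables and keep the one with the most float32 variables
    that still stays within `rel_err` of the reference answer."""
    best = None
    for dtypes in product((np.float32, np.float64), repeat=nvars):
        result = run(dtypes)
        if abs(result - reference) <= rel_err * abs(reference):
            n32 = sum(dt is np.float32 for dt in dtypes)
            if best is None or n32 > best[0]:
                best = (n32, dtypes)
    return None if best is None else best[1]

# A tiny "program" with two tunable variables: the accumulator
# precision and the precision of each summed term.
def harmonic_sum(dtypes):
    acc = dtypes[0](0.0)
    for k in range(1, 10001):
        acc = dtypes[0](acc + dtypes[1](1.0 / k))
    return float(acc)
```

Running `tune_precision(harmonic_sum, harmonic_sum((np.float64, np.float64)), 2)` reports which of the two variables can be demoted to float32 without exceeding the error threshold.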
FGMRES to obtain backward stability in mixed precision
, 2008
Cited by 11 (2 self)
Dedicated to Gérard Meurant on the occasion of his 60th birthday. We consider the triangular factorization of matrices in single-precision arithmetic and show how these factors can be used to obtain a backward stable solution. Our aim is to obtain double-precision accuracy even when the system is ill-conditioned. We examine the use of iterative refinement and show by example that it may not converge. We then show both theoretically and practically that the use of FGMRES will give us the result that we desire with fairly mild conditions on the matrix and the direct factorization. We perform extensive experiments on dense matrices using MATLAB and indicate how our work extends to sparse matrix factorization and solution.
Fast conjugate gradients with multiple GPUs
 In ICCS ’09: Proceedings of the 9th International Conference on Computational Science
, 2009
Cited by 11 (0 self)
The limiting factor for the efficiency of sparse linear solvers is memory bandwidth. In this work, we utilize the GPU's high memory bandwidth to implement a sparse iterative solver for unstructured problems. We describe a fast Conjugate Gradient solver which runs on multiple GPUs installed on a single mainboard. The solver achieves double-precision accuracy with single-precision GPUs, using a mixed-precision iterative refinement algorithm. To achieve high computation speed, we propose a fast sparse matrix-vector multiplication algorithm, which is the core operation of iterative solvers. The proposed multiplication algorithm efficiently utilizes GPU resources via caching, coalesced memory accesses, and load balance between running threads. Experiments on a wide range of matrices show that our matrix-vector multiplication algorithm achieves up to 9.9 Gflops on a single GeForce 8800 GTS card and our CG implementation achieves up to 22.6 Gflops with four GPUs.
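The mixed-precision structure here, an inner CG run in float32 wrapped in double-precision iterative refinement, can be sketched on the CPU with NumPy (the paper's multi-GPU SpMV and load balancing are not shown):

```python
import numpy as np

def cg32(A32, b32, iters=200, tol=1e-5):
    """Plain conjugate gradient entirely in float32 (SPD matrix assumed)."""
    x = np.zeros_like(b32)
    r = b32 - A32 @ x
    p = r.copy()
    rs = r @ r
    bnorm = np.sqrt(b32 @ b32)
    for _ in range(iters):
        Ap = A32 @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * bnorm:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def cg_mixed(A, b, outer=5):
    """Outer double-precision refinement around the single-precision CG."""
    A32 = A.astype(np.float32)
    x = np.zeros_like(b)
    for _ in range(outer):
        r = b - A @ x  # residual in double precision
        x += cg32(A32, r.astype(np.float32)).astype(np.float64)
    return x
```

Each outer pass multiplies the error by roughly the inner solve tolerance, so a few passes of cheap float32 work recover double-precision accuracy.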
Amesos2 and Belos: Direct and iterative solvers for large sparse linear systems
 Scientific Programming
, 2012
Cited by 7 (6 self)
Solvers for large sparse linear systems come in two categories: direct and iterative. Amesos2, a package in the Trilinos software project, provides direct methods, and Belos, another Trilinos package, provides iterative methods. Amesos2 offers a common interface to many different sparse matrix factorization codes, and can handle any implementation of sparse matrices and vectors via an easy-to-extend C++ traits interface. It can also factor matrices whose entries have arbitrary “Scalar” type, enabling extended-precision and mixed-precision algorithms. Belos includes many different iterative methods for solving large sparse linear systems and least-squares problems. Unlike competing iterative solver libraries, Belos completely decouples the algorithms from the implementations of the underlying linear algebra objects. This lets Belos exploit the latest hardware without changes to the code. Belos favors algorithms that solve higher-level problems, such as multiple simultaneous linear systems and sequences of related linear systems, faster than standard algorithms. The package also supports extended-precision and mixed-precision algorithms. Together, Amesos2 and Belos form a complete suite of sparse linear solvers.
A fast and robust mixed-precision solver for the solution of sparse symmetric linear systems
, 2010
Cited by 3 (0 self)
On many current and emerging computing architectures, single-precision calculations are at least twice as fast as double-precision calculations. In addition, the use of single precision may reduce pressure on memory bandwidth. The penalty for using single precision in the solution of linear systems is a potential loss of accuracy in the computed solutions. For sparse linear systems, the use of mixed precision, in which double-precision iterative methods are preconditioned by a single-precision factorization, can enable the recovery of high-precision solutions more quickly and use less memory than a sparse direct solver run using double-precision arithmetic. In this article, we consider the use of single precision within direct solvers for sparse symmetric linear systems, exploiting both the reduction in memory requirements and the performance gains. We develop a practical algorithm to apply a mixed-precision approach and suggest parameters and techniques to minimize the number of solves required by the iterative recovery process. These experiments provide the basis for our new code HSL MA79, a fast, robust, mixed-precision sparse symmetric solver that is included in the mathematical software library HSL. Numerical results for a wide range of problems from practical applications are presented.
Recent algorithm and machine developments for lattice QCD
, 2008
Cited by 3 (0 self)
I review recent machine trends and algorithmic developments for dynamical lattice QCD simulations with the HMC algorithm for Wilson-type fermions. The topics include the trend toward multicore processors and general purpose GPU (GPGPU) computing, and improvements on the quark determinant preconditioning, molecular dynamics integrator, and quark solvers. I also discuss the prospects for using these techniques on the forthcoming petaflops machines.
Hierarchical Diagonal Blocking and Precision Reduction Applied to Combinatorial Multigrid
Cited by 2 (0 self)
Memory bandwidth is a major limiting factor in the scalability of parallel iterative algorithms that rely on sparse matrix-vector multiplication (SpMV). This paper introduces Hierarchical Diagonal Blocking (HDB), an approach which we believe captures many of the existing optimization techniques for SpMV in a common representation. Using this representation in conjunction with precision-reduction techniques, we develop and evaluate high-performance SpMV kernels. We also study the implications of using our SpMV kernels in a complete iterative solver. Our method of choice is a Combinatorial Multigrid solver that can fully utilize our fastest reduced-precision SpMV kernel without sacrificing the quality of the solution. We provide extensive empirical evaluation of the effectiveness of the approach on a variety of benchmark matrices, demonstrating substantial speedups on all matrices considered.
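The precision-reduction aspect alone can be sketched as a CSR SpMV whose stored values are float32, halving their memory traffic, while each row's dot product accumulates in float64. The hierarchical diagonal blocking reordering itself is not shown:

```python
import numpy as np

def spmv_float32_vals(indptr, indices, vals32, x):
    """CSR sparse matrix-vector product. Matrix values are stored in
    float32 (halving the bandwidth they consume), but each row's dot
    product is accumulated in float64 to limit the accuracy loss."""
    n = len(indptr) - 1
    y = np.empty(n, dtype=np.float64)
    for i in range(n):
        lo, hi = indptr[i], indptr[i + 1]
        # Promote the stored values to float64 only for the accumulation.
        y[i] = np.dot(vals32[lo:hi].astype(np.float64), x[indices[lo:hi]])
    return y
```

In a real kernel the loop would be vectorized or run on the GPU; the point here is only that the memory-resident values are half-width while the arithmetic is full-width.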