Results 1–10 of 36
Performance and accuracy of hardware-oriented native, emulated and mixed-precision solvers in FEM simulations
 International Journal of Parallel, Emergent and Distributed Systems
Abstract

Cited by 54 (12 self)
In a previous publication, we have examined the fundamental difference between computational precision and result accuracy in the context of the iterative solution of linear systems as they typically arise in the Finite Element discretization of Partial Differential Equations (PDEs) [1]. In particular, we evaluated mixed- and emulated-precision schemes on commodity graphics processors (GPUs), which at that time only supported computations in single precision. With the advent of graphics cards that natively provide double precision, this report updates our previous results. We demonstrate that with new coprocessor hardware supporting native double precision, such as NVIDIA's G200 architecture, the situation does not change qualitatively for PDEs, and the previously introduced mixed-precision schemes are still preferable to double precision alone. But the schemes achieve significant quantitative performance improvements with the more powerful hardware. In particular, we demonstrate that a Multigrid scheme can accurately solve a common test problem in Finite Element settings with one million unknowns in less than 0.1 seconds, which is truly outstanding performance. We support these conclusions by exploring the algorithmic design space enlarged by the availability of double precision directly in the hardware.
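The multigrid solver at the heart of this abstract can be illustrated with a minimal two-grid V-cycle for the 1D Poisson model problem. This is a hypothetical NumPy sketch of the general technique, not the paper's GPU implementation; all function names and parameter choices (weighted Jacobi with omega = 2/3, three smoothing sweeps) are ours.

```python
import numpy as np

def poisson_matrix(n):
    # 1D Poisson model problem -u'' = f on (0,1), zero Dirichlet boundary
    # values, n interior points (a stand-in for the FEM systems above).
    h = 1.0 / (n + 1)
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

def two_grid_cycle(A, b, x, nu=3, omega=2.0 / 3.0):
    # One V-cycle of a two-grid method: weighted-Jacobi smoothing,
    # full-weighting restriction, exact coarse solve, linear interpolation.
    n = len(b)
    d = np.diag(A)
    for _ in range(nu):                                     # pre-smoothing
        x = x + omega * (b - A @ x) / d
    r = b - A @ x                                           # fine-grid residual
    rc = 0.25 * (r[0:-2:2] + 2.0 * r[1:-1:2] + r[2::2])     # restrict
    ec = np.linalg.solve(poisson_matrix((n - 1) // 2), rc)  # coarse solve
    e = np.zeros(n)                                         # interpolate back
    e[1::2] = ec
    padded = np.concatenate(([0.0], ec, [0.0]))
    e[0::2] = 0.5 * (padded[:-1] + padded[1:])
    x = x + e
    for _ in range(nu):                                     # post-smoothing
        x = x + omega * (b - A @ x) / d
    return x

n = 63                  # 2**6 - 1 interior points, so coarsening divides evenly
A = poisson_matrix(n)
b = np.ones(n)
x = np.zeros(n)
for _ in range(8):
    x = two_grid_cycle(A, b, x)
print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))  # shrinks by orders of magnitude
```

The grid-size-independent convergence rate of such cycles is what makes the sub-0.1-second solve times quoted above plausible once each cycle runs at hardware speed.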
Using GPUs to improve multigrid solver performance on a cluster
 J. OF COMPUTATIONAL SCIENCE AND ENGINEERING
, 2008
Abstract

Cited by 31 (7 self)
This article explores the coupling of coarse- and fine-grained parallelism for Finite Element simulations based on efficient parallel multigrid solvers. The focus lies on both system performance and a minimally invasive integration of hardware acceleration into an existing software package, requiring no changes to application code. Because of their excellent price/performance ratio, we demonstrate the viability of our approach by using commodity graphics processors (GPUs) as efficient multigrid preconditioners. We address the issue of limited precision on GPUs by applying a mixed-precision iterative refinement technique. Other restrictions are also handled by a close interplay between the GPU and CPU. From a software perspective, we integrate the GPU solvers into the existing MPI-based Finite Element package by implementing the same interfaces as the CPU solvers, so that for the application programmer they are easily interchangeable. Our results show that we do not compromise any software functionality and gain speedups of two and more for large problems. Equipped with this additional option of hardware acceleration, we compare different choices in increasing the performance of a conventional, commodity-based cluster by increasing the number
Accelerating Scientific Computations with Mixed Precision Algorithms
, 2008
Abstract

Cited by 19 (3 self)
On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here applies not only to conventional processors but also to other technologies such as Field Programmable Gate Arrays (FPGAs), Graphics Processing Units (GPUs), and the STI Cell BE processor. Results on modern processor architectures and the STI Cell BE are presented.
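The core mixed-precision idea described above can be sketched in a few lines of NumPy: do the expensive solve in float32, then recover full float64 accuracy by iterating on the residual. This is an illustrative sketch under our own naming, not code from the paper; a real implementation would factor the single-precision matrix once and reuse the factors.

```python
import numpy as np

def refine_solve(A, b, sweeps=5):
    # Mixed-precision iterative refinement: all solves run in float32,
    # while residuals and corrections are accumulated in float64.
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(sweeps):
        r = b - A @ x                      # residual in double precision
        x = x + np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
    return x

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
x_true = rng.standard_normal(n)
b = A @ x_true
x = refine_solve(A, b)
print(np.linalg.norm(x - x_true) / np.linalg.norm(x_true))  # near double-precision accuracy
```

For well-conditioned systems each sweep multiplies the error by roughly cond(A) times single-precision unit roundoff, so a handful of cheap float32 solves reaches the float64 error floor.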
Pipelined mixed precision algorithms on FPGAs for fast and accurate PDE solvers from low precision components
 In IEEE Proceedings on Field-Programmable Custom Computing Machines (FCCM)
, 2006
Abstract

Cited by 16 (2 self)
FPGAs are becoming more and more attractive for high-precision scientific computations. One of the main problems in efficient resource utilization is the quadratically growing resource usage of multipliers with increasing operand size. Many research efforts have been devoted to the optimization of individual arithmetic and linear algebra operations. In this paper we take a higher-level approach and seek to reduce the intermediate computational precision on the algorithmic level by optimizing the accuracy towards the final result of an algorithm. In our case this is the accurate solution of partial differential equations (PDEs). Using the Poisson problem as a typical PDE example, we show that most intermediate operations can be computed with floats or even smaller formats, and only very few operations (e.g. 1%) must be performed in double precision to obtain the same accuracy as a full double-precision solver. Thus the FPGA can be configured with many parallel float rather than few resource-hungry double operations. To achieve this, we adapt the general concept of mixed-precision iterative refinement methods to FPGAs and develop a fully pipelined version of the Conjugate Gradient solver. We combine this solver with different iterative refinement schemes and precision combinations to obtain resource-efficient mappings of the pipelined algorithm core onto the FPGA.
MODIFIED GRAM–SCHMIDT (MGS), LEAST SQUARES, AND BACKWARD STABILITY OF MGS-GMRES
, 2006
Abstract

Cited by 15 (1 self)
The generalized minimum residual method (GMRES) [Y. Saad and M. Schultz, SIAM J. Sci. Statist. Comput., 7 (1986), pp. 856–869] for solving linear systems Ax = b is implemented as a sequence of least squares problems involving Krylov subspaces of increasing dimensions. The most usual implementation is modified Gram–Schmidt GMRES (MGS-GMRES). Here we show that MGS-GMRES is backward stable. The result depends on a more general result on the backward stability of a variant of the MGS algorithm applied to solving a linear least squares problem, and uses other new results on MGS and its loss of orthogonality, together with an important but neglected condition number, and a relation between residual norms and certain singular values.
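The MGS kernel whose stability this abstract analyzes is short enough to show directly. A minimal sketch of a modified Gram–Schmidt QR factorization (our own naming, not the paper's code):

```python
import numpy as np

def mgs_qr(A):
    # Modified Gram-Schmidt QR: each remaining column is orthogonalized
    # against q_k immediately after q_k is formed. This "modified" update
    # order is what keeps the loss of orthogonality bounded in terms of
    # the condition number, unlike classical Gram-Schmidt.
    m, n = A.shape
    Q = np.array(A, dtype=np.float64)
    R = np.zeros((n, n))
    for k in range(n):
        R[k, k] = np.linalg.norm(Q[:, k])
        Q[:, k] /= R[k, k]
        for j in range(k + 1, n):
            R[k, j] = Q[:, k] @ Q[:, j]
            Q[:, j] -= R[k, j] * Q[:, k]
    return Q, R

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 10))
Q, R = mgs_qr(A)
print(np.linalg.norm(A - Q @ R))   # reconstruction error at roundoff level
```

Inside GMRES, the columns fed to this loop are the Krylov basis vectors; the paper's result is that the modest orthogonality loss of this loop does not spoil the backward stability of the overall solver.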
Extra-precise iterative refinement for overdetermined least squares problems
, 2007
Abstract

Cited by 13 (1 self)
We present the algorithm, error bounds, and numerical results for extra-precise iterative refinement applied to overdetermined linear least squares (LLS) problems. We apply our linear system refinement algorithm to Björck's augmented linear system formulation of an LLS problem. Our algorithm reduces the forward normwise and componentwise errors to O(ε) unless the system is too ill-conditioned. In contrast to linear systems, we provide two separate error bounds for the solution x and the residual r. The refinement algorithm requires only limited use of extra precision and adds only O(mn) work to the O(mn²) cost of QR factorization for problems of size m-by-n. The extra-precision calculation is facilitated in a portable way by the new extended-precision BLAS standard, and the refinement algorithm will be included in a future release of LAPACK and can be extended to other types of least squares problems.
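Björck's augmented formulation mentioned above is what lets linear-system refinement machinery apply to least squares: the LLS solution x and its residual r = b - Ax jointly satisfy one square linear system. A small NumPy sketch (illustrative, with our own variable names):

```python
import numpy as np

# min ||b - A x||_2 is equivalent to the square (m+n) x (m+n) system
#   [[ I,   A ],     [[ r ],     [[ b ],
#    [ A^T, 0 ]]  @   [ x ]]  =   [ 0 ]]
# since the second block row enforces the normal equations A^T (b - A x) = 0.
rng = np.random.default_rng(2)
m, n = 30, 8
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

K = np.block([[np.eye(m), A],
              [A.T, np.zeros((n, n))]])
rhs = np.concatenate([b, np.zeros(n)])
rx = np.linalg.solve(K, rhs)
r, x = rx[:m], rx[m:]

x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x - x_ref))   # matches the standard LLS solution
```

Refining this square system yields improved values for both x and r at once, which is why the paper can give separate error bounds for the solution and the residual.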
Reducing the influence of tiny normwise relative errors on performance profiles
 Manchester Institute for Mathematical Sciences, The University of Manchester
, 2011
REDUCING FLOATING POINT ERROR IN DOT PRODUCT USING THE SUPERBLOCK FAMILY OF ALGORITHMS
, 2008
Abstract

Cited by 6 (1 self)
This paper discusses both the theoretical and statistical errors obtained by various well-known dot products, from the canonical to pairwise algorithms, and introduces a new and more general framework that we have named superblock which subsumes them and permits a practitioner to make tradeoffs between computational performance, memory usage, and error behavior. We show that algorithms with lower error bounds tend to behave noticeably better in practice. Unlike many such error-reducing algorithms, superblock requires no additional floating point operations and should be implementable with little to no performance loss, making it suitable for use as a performance-critical building block of a linear algebra kernel.
Prospectus for the Next LAPACK and ScaLAPACK Libraries
Abstract

Cited by 5 (0 self)
Dense linear algebra (DLA) forms the core of many scientific computing applications. Consequently, there is continuous interest and demand for the development of increasingly better algorithms in the field. Here 'better' has a broad meaning, and includes improved reliability, accuracy, robustness, ease of use, and
A fast and robust mixed-precision solver for the solution of sparse symmetric linear systems
 ISSN
, 2010
Abstract

Cited by 3 (0 self)
On many current and emerging computing architectures, single-precision calculations are at least twice as fast as double-precision calculations. In addition, the use of single precision may reduce pressure on memory bandwidth. The penalty for using single precision for the solution of linear systems is a potential loss of accuracy in the computed solutions. For sparse linear systems, the use of mixed precision in which double-precision iterative methods are preconditioned by a single-precision factorization can enable the recovery of high-precision solutions more quickly and use less memory than a sparse direct solver run using double-precision arithmetic. In this article, we consider the use of single precision within direct solvers for sparse symmetric linear systems, exploiting both the reduction in memory requirements and the performance gains. We develop a practical algorithm to apply a mixed-precision approach and suggest parameters and techniques to minimize the number of solves required by the iterative recovery process. These experiments provide the basis for our new code HSL MA79, a fast, robust, mixed-precision sparse symmetric solver that is included in the mathematical software library HSL. Numerical results for a wide range of problems from practical applications are presented.