Results 1–10 of 144
Error bounds from extra-precise iterative refinement
ACM Transactions on Mathematical Software, 2006
Abstract

Cited by 36 (5 self)
We present the design and testing of an algorithm for iterative refinement of the solution of linear equations, where the residual is computed with extra precision. This algorithm was originally proposed in the 1960s [6, 22] as a means to compute very accurate solutions to all but the most ill-conditioned linear systems of equations. However, two obstacles have until now prevented its adoption in standard subroutine libraries like LAPACK: (1) there was no standard way to access the higher-precision arithmetic needed to compute residuals, and (2) it was unclear how to compute a reliable error bound for the computed solution. The completion of the new BLAS Technical Forum Standard [5] has recently removed the first obstacle. To overcome the second obstacle, we show how a single application of iterative refinement can be used to compute an error bound in any norm at small cost, and use this to compute both an error bound in the usual infinity norm and a componentwise relative error bound. We report extensive test results on over 6.2 million matrices of dimension 5, 10, 100, and 1000. As long as a normwise (resp. componentwise) condition number computed by the algorithm is less than 1/(max{10, √n}·εw), the computed normwise (resp. componentwise) error bound is at most …
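The core loop the abstract describes, solving in working precision while forming the residual in extra precision, can be sketched as follows. This is a minimal illustration (float32 as the working precision, float64 as the "extra" precision); the function name and the random test matrix are ours, not the paper's, and a real code would reuse one LU factorization for every solve:

```python
import numpy as np

def refine(A, b, steps=3):
    """Iterative refinement: factor/solve in float32 (working precision),
    but form the residual r = b - A x in float64 (extra precision)."""
    A32 = A.astype(np.float32)
    # Initial solve entirely in working precision.
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(steps):
        r = b - A @ x                                   # extra-precise residual
        d = np.linalg.solve(A32, r.astype(np.float32))  # correction in working precision
        x = x + d.astype(np.float64)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
x_true = rng.standard_normal(50)
b = A @ x_true
x = refine(A, b)
```

Each refinement step shrinks the error by roughly a factor of cond(A)·ε32, so a well-conditioned system reaches near-double accuracy in a few steps even though every factorization and solve runs in single precision.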
Fast Parallel PageRank: A Linear System Approach
2004
Abstract

Cited by 35 (2 self)
In this paper we investigate the convergence of iterative stationary and Krylov subspace methods for the PageRank linear system, including the dependence of convergence on teleportation. We demonstrate that linear system iterations converge faster than the simple power method and are less sensitive to changes in teleportation. In order to perform this study, we developed a framework for parallel PageRank computing. We describe the details of the parallel implementation and provide experimental results obtained on a 70-node Beowulf cluster.
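The equivalence this paper exploits, PageRank as a fixed point of the power method versus the solution of a linear system, can be shown on a toy graph. The 4-page link matrix below is a made-up example, not from the paper:

```python
import numpy as np

# Hypothetical 4-page web graph; P is column-stochastic (column j = out-links of page j).
P = np.array([[0.0, 0.0, 1.0, 0.5],
              [1/3, 0.0, 0.0, 0.0],
              [1/3, 0.5, 0.0, 0.5],
              [1/3, 0.5, 0.0, 0.0]])
alpha = 0.85              # teleportation (damping) factor
v = np.full(4, 0.25)      # uniform teleportation vector

# Power method on the Google matrix: x <- alpha * P x + (1 - alpha) * v
x_pow = v.copy()
for _ in range(100):
    x_pow = alpha * (P @ x_pow) + (1 - alpha) * v

# Equivalent linear system: (I - alpha * P) x = (1 - alpha) * v
x_lin = np.linalg.solve(np.eye(4) - alpha * P, (1 - alpha) * v)
```

Any stationary or Krylov solver applied to (I − αP) x = (1 − α)v yields the same vector; the paper's point is that such solvers can converge in fewer matrix-vector products than the power method, especially as α approaches 1.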
Analyzing Ultra-Scale Application Communication Requirements for a Reconfigurable Hybrid Interconnect
in Proceedings of the 2005 ACM/IEEE Conference on Supercomputing (SC ’05), 2005
Abstract

Cited by 29 (9 self)
The path towards realizing petascale computing is increasingly dependent on scaling up to unprecedented numbers of processors. To prevent the interconnect architecture between processors from dominating the overall cost of such systems, there is a critical need for interconnect solutions that both provide performance to ultra-scale applications and have costs that scale linearly with system size. In this work we propose the Hybrid Flexibly Assignable Switch Topology (HFAST) infrastructure. The HFAST approach uses both passive (circuit switch) and active (packet switch) commodity switch components to deliver all of the flexibility and fault-tolerance of a fully interconnected network (such as a fat-tree), while preserving the nearly linear cost scaling associated with traditional low-degree interconnect networks. To understand the applicability of this technology, we perform an in-depth study of the communication requirements across a broad spectrum of important scientific applications, whose computational methods include finite-difference, lattice-Boltzmann, particle-in-cell, sparse linear algebra, particle-mesh Ewald, and FFT-based solvers. We use the IPM (Integrated Performance Monitoring) profiling layer to gather detailed messaging statistics with minimal impact on code performance. This profiling provides us sufficiently detailed communication topology and message volume data to evaluate these applications in the context of the proposed hybrid interconnect. Overall results show that HFAST is a promising approach for practically addressing the interconnect requirements of future petascale systems.
Multi-Threading and One-Sided Communication in Parallel LU Factorization
Abstract

Cited by 24 (0 self)
Dense LU factorization has a high ratio of computation to communication and, as evidenced by the High Performance Linpack (HPL) benchmark, this property makes it scale well on most parallel machines. Nevertheless, the standard algorithm for this problem has nontrivial dependence patterns that limit parallelism, and local computations require large matrices in order to achieve good single-processor performance. We present an alternative programming model for this type of problem, which combines UPC's global address space with lightweight multithreading. We introduce the concept of memory-constrained lookahead, where the amount of concurrency managed by each processor is controlled by the amount of memory available. We implement novel techniques for steering the computation to optimize for high performance and demonstrate the scalability and portability of UPC with Teraflop-level performance on some machines, comparing favourably to other state-of-the-art MPI codes.
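The dependence pattern the abstract refers to is visible in a plain right-looking blocked LU factorization: the panel factorization at each step must finish before the trailing-matrix update that produces the next panel, and lookahead schemes try to overlap the two. A minimal dense sketch (no pivoting, illustration only; this is not the paper's UPC implementation):

```python
import numpy as np

def blocked_lu(A, nb=2):
    """Right-looking blocked LU without pivoting; returns L and U packed in one matrix."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # Panel: unblocked LU of the current block column (a serial bottleneck).
        for j in range(k, e):
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:e] -= np.outer(A[j+1:, j], A[j, j+1:e])
        if e < n:
            # Triangular solve for the U block row, then the trailing update.
            L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
            A[k:e, e:] = np.linalg.solve(L11, A[k:e, e:])
            # The next panel lives inside this trailing submatrix, so step k+1
            # cannot begin until (at least part of) this update is done;
            # lookahead overlaps the rest of the update with the next panel.
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A

A = np.array([[4., 3, 2, 1], [3, 4, 3, 2], [2, 3, 4, 3], [1, 2, 3, 4]])
LU = blocked_lu(A)
L = np.tril(LU, -1) + np.eye(4)
U = np.triu(LU)
```

Memory-constrained lookahead, as the abstract describes it, bounds how many of these panel/update tasks may be in flight at once by the memory available to buffer their blocks.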
Scalable Parallel Approach for High-Fidelity Steady-State Aeroelastic Analysis and Derivative Computations
AIAA Journal, 2014
Abstract

Cited by 23 (19 self)
In this paper we present several significant improvements to both coupled solution methods and sensitivity analysis techniques for high-fidelity aerostructural systems. We consider the analysis of full aircraft configurations using Euler CFD models with more than 80 million state variables and structural finite-element models with more than 1 million degrees of freedom. A coupled Newton–Krylov solution method for the aerostructural system is presented that accelerates the convergence rate for highly flexible structures. A coupled adjoint technique is presented that can compute gradients with respect to thousands of design variables accurately and efficiently. The efficiency of the presented methods is assessed on a high-performance parallel computing cluster for up to 544 processors. To demonstrate the effectiveness of the proposed approach and the developed framework, an aerostructural model based on the Common Research Model is optimized with respect to hundreds of variables representing the wing outer mold line and the structural sizing. Two separate problems are solved: one where fuel burn is minimized, and another where the maximum takeoff weight is minimized. Multipoint optimizations with 5 cruise conditions and 2 maneuver conditions are performed with a 2-million-cell CFD mesh and a 300,000-DOF structural mesh. The optima for problems with 476 design variables are obtained within 36 hours of wall time on 435 processors. The resulting optimal aircraft are discussed, and the aerostructural tradeoffs for each objective are analyzed.
Nomenclature: α, angle of attack; convergence tolerances for the aerostructural solution, aerostructural adjoint solution, aerodynamic solution, and structural solution; A, aerodynamic residuals; R, all residuals.
Using mixed precision for sparse matrix computations to enhance the performance while achieving 64-bit accuracy
ACM Trans. Math. Softw.
Abstract

Cited by 20 (1 self)
By using a combination of 32-bit and 64-bit floating-point arithmetic, the performance of many sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. These ideas can be applied to sparse multifrontal and supernodal direct techniques and to sparse iterative techniques such as Krylov subspace methods. The approach presented here can apply not only to conventional processors but also to exotic technologies such as …
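One way the 32-bit/64-bit combination plays out for Krylov subspace methods is to run the outer iteration in double precision while applying the preconditioner in single precision. The dense toy example below is our own sketch under that assumption (the paper targets sparse multifrontal/supernodal factorizations):

```python
import numpy as np

def pcg(A, b, M_solve, tol=1e-10, maxiter=200):
    """Preconditioned conjugate gradients in float64; the preconditioner
    solve M_solve may run in float32 without spoiling the final accuracy."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_solve(r)
    p = z.copy()
    rz = r @ z
    for _ in range(maxiter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = M_solve(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

rng = np.random.default_rng(1)
n = 100
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)          # SPD test matrix (made up)
b = rng.standard_normal(n)

# Preconditioner: Cholesky factorization computed and applied in float32.
L32 = np.linalg.cholesky(A.astype(np.float32))

def M_solve(r):
    y = np.linalg.solve(L32, r.astype(np.float32))      # forward solve in single
    return np.linalg.solve(L32.T, y).astype(np.float64)  # backward solve, promote

x = pcg(A, b, M_solve)
```

The single-precision factorization costs roughly half the memory traffic of a double one, while the double-precision outer iteration recovers full 64-bit accuracy in the solution.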
Autotuning a high-level language targeted to GPU codes
In Innovative Parallel Computing Conference, IEEE, 2012
Abstract

Cited by 19 (2 self)
Determining the best set of optimizations to apply to a kernel to be executed on the graphics processing unit (GPU) is a challenging problem. There are large sets of possible optimization configurations that can be applied, and many applications have multiple kernels. Each kernel may require a specific configuration to achieve the best performance, and moving an application to new hardware often requires a new optimization configuration for each kernel. In this work, we apply optimizations to GPU code using HMPP, a high-level directive-based language and source-to-source compiler that can generate CUDA / OpenCL code. However, programming with high-level languages may mean a loss of performance compared to using low-level languages. Our work shows that it is possible to improve the performance of a high-level language by using autotuning. We perform autotuning on a large optimization space on GPU kernels, focusing on loop permutation, loop unrolling, tiling, and specifying which loop(s) to parallelize, and show results on convolution kernels, codes in the PolyBench suite, and an implementation of belief propagation for stereo vision. The results show that our autotuned HMPP-generated implementations are significantly faster than the default HMPP implementation and can meet or exceed the performance of manually coded CUDA / OpenCL implementations.
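At its simplest, the empirical search described here times every point in an optimization space and keeps the fastest configuration per kernel. The sketch below is a hypothetical stand-in (a NumPy "kernel" with tile size as the only tuning knob); the paper's tuner searches loop permutation, unrolling, tiling, and parallelization choices for HMPP-generated CUDA/OpenCL kernels:

```python
import timeit
import numpy as np

def kernel(A, tile):
    """Toy blocked row-sum kernel; 'tile' is the tunable configuration."""
    n = A.shape[0]
    out = np.zeros(n)
    for i in range(0, n, tile):
        out[i:i+tile] = A[i:i+tile].sum(axis=1)
    return out

def autotune(A, space):
    """Exhaustively time every configuration and return the fastest one."""
    return min(space, key=lambda t: timeit.timeit(lambda: kernel(A, t), number=5))

A = np.ones((256, 256))
space = [8, 16, 32, 64, 128, 256]   # the (tiny, made-up) optimization space
best_tile = autotune(A, space)
```

Real autotuners prune or sample the space rather than enumerating it, since the cross product of optimizations per kernel quickly grows too large for exhaustive timing.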
Understanding Ultra-Scale Application Communication Requirements
in Proceedings of the 2005 IEEE International Symposium on Workload Characterization (IISWC), 2005
Abstract

Cited by 16 (5 self)
As thermal constraints reduce the pace of CPU performance improvements, the cost and scalability of future HPC architectures will be increasingly dominated by the interconnect. In this work we perform an in-depth study of the communication requirements across a broad spectrum of important scientific applications, whose computational methods include finite-difference, lattice-Boltzmann, particle-in-cell, sparse linear algebra, particle-mesh Ewald, and FFT-based solvers. We use the IPM (Integrated Performance Monitoring) profiling framework to collect detailed statistics on communication topology and message volume with minimal impact on code performance. By characterizing the parallelism and communication requirements of such a diverse set of applications, we hope to guide architectural choices for the design and implementation of interconnects for future HPC systems.