Results 1–10 of 35
Minimizing Communication in Sparse Matrix Solvers
Abstract

Cited by 36 (10 self)
Data communication within the memory system of a single processor node and between multiple nodes in a system is the bottleneck in many iterative sparse matrix solvers like CG and GMRES. Here k iterations of a conventional implementation perform k sparse matrix-vector multiplications and Ω(k) vector operations like dot products, resulting in communication that grows by a factor of Ω(k) in both the memory and network. By reorganizing the sparse-matrix kernel to compute a set of matrix-vector products at once and reorganizing the rest of the algorithm accordingly, we can perform k iterations by sending O(log P) messages instead of O(k · log P) messages on a parallel machine, and reading the matrix A from DRAM to cache just once, instead of k times on a sequential machine. This reduces communication to the minimum possible. We combine these techniques to form a new variant of GMRES. Our shared-memory implementation on an 8-core Intel Clovertown gets speedups of up to 4.3× over standard GMRES, without sacrificing convergence rate or numerical stability.
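The core reorganization described in this abstract can be sketched in a few lines: compute the basis vectors A x, A² x, …, A^k x together so that, conceptually, the matrix is streamed through fast memory once rather than k times. The sketch below is a simplified, hypothetical illustration; the real matrix powers kernel tiles A and exchanges ghost zones rather than issuing k independent library SpMVs.

```python
# Sketch of the "matrix powers" idea behind communication-avoiding GMRES:
# produce the k vectors A x, A^2 x, ..., A^k x in one pass, so that A would
# be read from slow memory once instead of k times.
import numpy as np
from scipy.sparse import diags

def matrix_powers(A, x, k):
    """Return A x, A^2 x, ..., A^k x as the columns of V."""
    n = x.shape[0]
    V = np.empty((n, k))
    v = x
    for j in range(k):
        v = A @ v          # in the real kernel these products share one read of A
        V[:, j] = v
    return V

# Tiny tridiagonal test matrix (1-D Laplacian), purely illustrative.
n, k = 8, 3
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)).tocsr()
x = np.ones(n)
V = matrix_powers(A, x, k)
```

In practice the columns of V are then orthogonalized in a block, which is what removes the Ω(k) separate reductions.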
Hiding global communication latency in the GMRES algorithm on massively parallel machines
, 2012
Abstract

Cited by 7 (0 self)
In the Generalized Minimal Residual Method (GMRES), the global all-to-all communication required in each iteration for orthogonalization and normalization of the Krylov basis vectors is becoming a performance bottleneck on massively parallel machines. Long latencies, system noise and load imbalance cause these global reductions to become very costly global synchronizations. In this work, we propose the use of non-blocking or asynchronous global reductions to hide these global communication latencies by overlapping them with other communications and calculations. A pipelined variation of GMRES is presented in which the result of a global reduction is only used one or more iterations after the communication phase has started. This way, global synchronization is relaxed and scalability is much improved at the expense of some extra computations. The numerical instabilities that inevitably arise with the typical monomial basis, formed by repeatedly applying the matrix, are reduced and often annihilated by using Newton or Chebyshev bases instead. We model the performance on massively parallel machines with an analytical model. Key words: GMRES, Gram–Schmidt, parallel computing, latency hiding, global communication. AMS subject classifications: 65F10.
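The instability this abstract attributes to the monomial basis can be observed directly by comparing the conditioning of x, A x, …, A^s x with a Chebyshev basis for the same Krylov subspace. The sketch below uses an assumed diagonal test matrix with spectrum in [1, 100]; the matrix, seed and interval are illustrative choices, not taken from the paper.

```python
# Compare the conditioning of a monomial Krylov basis with a Chebyshev basis
# built by the three-term recurrence T_{j+1}(B) x = 2 B T_j(B) x - T_{j-1}(B) x,
# where B maps the (assumed) spectral interval [lo, hi] onto [-1, 1].
import numpy as np

def basis_cond(vectors):
    """Condition number of the matrix whose columns are the normalized vectors."""
    V = np.column_stack([v / np.linalg.norm(v) for v in vectors])
    return np.linalg.cond(V)

def monomial_basis(A, x, s):
    vecs = [x]
    for _ in range(s):
        vecs.append(A @ vecs[-1])           # powers of A: prone to ill-conditioning
    return vecs

def chebyshev_basis(A, x, s, lo, hi):
    B = lambda v: (2.0 * (A @ v) - (hi + lo) * v) / (hi - lo)
    vecs = [x, B(x)]
    for _ in range(s - 1):
        vecs.append(2.0 * B(vecs[-1]) - vecs[-2])   # Chebyshev recurrence
    return vecs

rng = np.random.default_rng(0)
n, s = 50, 10
A = np.diag(np.linspace(1.0, 100.0, n))     # SPD test matrix, spectrum in [1, 100]
x = rng.standard_normal(n)
mono = basis_cond(monomial_basis(A, x, s))
cheb = basis_cond(chebyshev_basis(A, x, s, 1.0, 100.0))
```

The Chebyshev columns oscillate in [−1, 1] over the spectrum rather than collapsing onto the dominant eigenvector, which is why the paper's pipelined GMRES prefers Newton or Chebyshev bases.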
Amesos2 and Belos: Direct and iterative solvers for large sparse linear systems
 Scientific Programming
, 2012
Abstract

Cited by 7 (6 self)
Solvers for large sparse linear systems come in two categories: direct and iterative. Amesos2, a package in the Trilinos software project, provides direct methods, and Belos, another Trilinos package, provides iterative methods. Amesos2 offers a common interface to many different sparse matrix factorization codes, and can handle any implementation of sparse matrices and vectors, via an easy-to-extend C++ traits interface. It can also factor matrices whose entries have arbitrary “Scalar” type, enabling extended-precision and mixed-precision algorithms. Belos includes many different iterative methods for solving large sparse linear systems and least-squares problems. Unlike competing iterative solver libraries, Belos completely decouples the algorithms from the implementations of the underlying linear algebra objects. This lets Belos exploit the latest hardware without changes to the code. Belos favors algorithms that solve higher-level problems, such as multiple simultaneous linear systems and sequences of related linear systems, faster than standard algorithms. The package also supports extended-precision and mixed-precision algorithms. Together, Amesos2 and Belos form a complete suite of sparse linear solvers.
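The decoupling of algorithms from linear-algebra object implementations that this abstract describes can be mimicked in a few lines of Python duck typing: the solver below only assumes the operator supports `@` and the vectors support NumPy-style arithmetic, so dense and sparse backends are interchangeable. This is a loose analogue of the traits idea, not the Belos API; all names are illustrative.

```python
# A backend-agnostic iterative solver: the algorithm never inspects how the
# operator or vectors are stored, only that "op @ x" and vector arithmetic work.
import numpy as np
from scipy.sparse import csr_matrix

def richardson(op, b, omega=0.2, iters=300):
    """Damped Richardson iteration x <- x + omega (b - op x)."""
    x = np.zeros_like(b)
    for _ in range(iters):
        x = x + omega * (b - op @ x)
    return x

b = np.ones(4)
A_dense = np.array([[3.0, 1, 0, 0], [1, 3, 1, 0], [0, 1, 3, 1], [0, 0, 1, 3]])
x_dense = richardson(A_dense, b)                 # dense backend
x_sparse = richardson(csr_matrix(A_dense), b)    # sparse backend, same algorithm
```

Belos achieves the same separation in C++ via compile-time traits rather than duck typing, which preserves this flexibility without runtime dispatch overhead.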
Numerical evaluation of the Communication-Avoiding Lanczos
, 2012
Abstract

Cited by 3 (1 self)
The Lanczos algorithm is widely used for solving large sparse symmetric eigenvalue problems when only a few eigenvalues from the spectrum are needed. Due to sparse matrix-vector multiplications and frequent synchronization, the algorithm is communication intensive, leading to poor performance on parallel computers and modern cache-based processors. The Communication-Avoiding Lanczos algorithm [Hoemmen; 2010] attempts to improve performance by performing the equivalent of s steps of the original algorithm at a time. The scheme is equivalent to the original algorithm in exact arithmetic, but as the value of s grows larger, numerical round-off errors are expected to have a greater impact. In this paper, we investigate the numerical properties of the Communication-Avoiding Lanczos (CA-Lanczos) algorithm and how well it works in practical computations. Apart from the algorithm itself, we have implemented techniques that are commonly used with the Lanczos algorithm to improve its numerical performance, such as semi-orthogonal schemes and restarting. We present results that show that CA-Lanczos is often as accurate as the original algorithm. In many cases, if the parameters of the s-step basis are chosen appropriately, the numerical behaviour of CA-Lanczos is close to that of the standard algorithm, even though it is somewhat more sensitive to losing mutual orthogonality among the basis vectors.
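For reference, the classical three-term recurrence that CA-Lanczos reproduces s steps at a time is sketched below. This is a textbook version with full reorthogonalization, not the paper's implementation; the test matrix is an assumed random symmetric example.

```python
# Classical Lanczos: builds an orthonormal basis V of the Krylov subspace and
# a tridiagonal T with A V ~ V T (plus a rank-one remainder in the last column).
import numpy as np

def lanczos(A, v0, m):
    """Return orthonormal V (n x m) and tridiagonal T (m x m)."""
    n = v0.shape[0]
    V = np.zeros((n, m))
    alpha = np.zeros(m)
    beta = np.zeros(m - 1)
    V[:, 0] = v0 / np.linalg.norm(v0)
    for j in range(m):
        w = A @ V[:, j]                    # the SpMV: one synchronization point
        alpha[j] = V[:, j] @ w             # dot product: another reduction
        w -= alpha[j] * V[:, j]
        if j > 0:
            w -= beta[j - 1] * V[:, j - 1]
        # Full reorthogonalization against all previous vectors (for stability).
        w -= V[:, : j + 1] @ (V[:, : j + 1].T @ w)
        if j < m - 1:
            beta[j] = np.linalg.norm(w)
            V[:, j + 1] = w / beta[j]
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    return V, T

rng = np.random.default_rng(1)
n, m = 30, 10
M = rng.standard_normal((n, n))
A = M + M.T                                # symmetric test matrix
V, T = lanczos(A, rng.standard_normal(n), m)
```

Every iteration above contains reductions (the dot product and the norm), which is exactly the synchronization cost that computing s steps at once amortizes.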
Performance analysis of asynchronous Jacobi’s method implemented in MPI, SHMEM and OpenMP
, 2013
Abstract

Cited by 3 (1 self)
Ever-increasing core counts create the need to develop parallel algorithms that avoid closely-coupled execution across all cores. In this paper we present performance analysis of several parallel asynchronous implementations of Jacobi’s method for solving systems of linear equations, using MPI, SHMEM and OpenMP. In particular we have solved systems of over 4 billion unknowns using up to 32,768 processes on a Cray XE6 supercomputer. We show that the precise implementation details of asynchronous algorithms can strongly affect the resulting performance and convergence behaviour of our solvers in unexpected ways, discuss how our specific implementations could be generalised to other classes of problem, and how existing parallel programming models might be extended to allow asynchronous algorithms to be expressed more easily.
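What makes Jacobi's method amenable to the asynchronous execution studied here is that each update x_new = D⁻¹(b − (A − D) x) depends only on the previous iterate, so stale neighbour values merely slow convergence rather than break the recurrence. Below is a minimal synchronous single-process sketch (illustrative names and test problem, not the paper's code):

```python
# Synchronous Jacobi iteration; the asynchronous variants in the paper remove
# the implicit barrier between sweeps and let each process use whatever halo
# values it last received.
import numpy as np

def jacobi(A, b, iters=200):
    D = np.diag(A)                     # diagonal of A
    R = A - np.diag(D)                 # off-diagonal part
    x = np.zeros_like(b)
    for _ in range(iters):
        x = (b - R @ x) / D            # every update reads only the old x
    return x

# Diagonally dominant tridiagonal test system, so Jacobi converges.
n = 20
A = 4.0 * np.eye(n) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
b = np.ones(n)
x = jacobi(A, b)
```

In a distributed setting the sweep boundary is where MPI, SHMEM or OpenMP synchronization would sit, which is precisely what the asynchronous implementations relax.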
A Residual Replacement Strategy for Improving the Maximum Attainable Accuracy of Communication-Avoiding Krylov Subspace Methods
, 2012
Implementation and Performance Evaluation of a Distributed Conjugate Gradient Method in a Cloud Computing Environment
Abstract

Cited by 1 (1 self)
Cloud computing is an emerging technology where IT resources are provisioned to users as a set of unified computing resources on a pay-per-use basis. The resources are dynamically chosen to satisfy a user’s Service Level Agreement and a required level of performance. A Cloud is seen as a computing platform for heavy-load applications. The Conjugate Gradient (CG) method is an iterative linear solver used by many scientific and engineering applications to solve a linear system of algebraic equations. CG generates a heavy computational load, and therefore it slows the performance of the applications using it. Distributing CG is considered as a way to increase its performance. However, running a distributed CG based on a standard API, such as MPI, in a Cloud faces many challenges, such as the Cloud’s processing and networking capabilities. In this work, we present an in-depth analysis of the CG algorithm and its complexity in order to develop adequate distributed algorithms. The implementation of these algorithms and their evaluation in our Cloud environment reveal the gains and losses achieved by distributing the CG. The performance results show that despite the complexity of the CG processing and communication, a speedup gain of at least 1,157.7 is obtained using 128 cores compared to the NAS sequential execution. Given the emergence of Clouds, the results in this paper analyze the performance issues that arise when a generic public Cloud, along with a standard development library such as MPI, is used for High Performance applications, without the need for specialized hardware and software.
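For reference, a plain CG iteration is only a handful of lines: one matrix-vector product and two dot products per step, the dot products being the global reductions that make a distributed CG sensitive to Cloud network latency. A minimal single-process sketch (not the paper's distributed implementation; the test system is an assumed SPD example):

```python
# Textbook conjugate gradient for a symmetric positive definite system A x = b.
import numpy as np

def cg(A, b, tol=1e-10, maxiter=500):
    x = np.zeros_like(b)
    r = b - A @ x                       # initial residual
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        Ap = A @ p                      # the SpMV a distributed CG partitions
        alpha = rs / (p @ Ap)           # dot product: a global reduction
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r                  # second dot product per iteration
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p       # new search direction
        rs = rs_new
    return x

# SPD tridiagonal test system (1-D Laplacian plus identity shift).
n = 50
A = 2.0 * np.eye(n) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
b = np.ones(n)
x = cg(A, b)
```

In an MPI implementation the two dot products become allreduce calls each iteration, which is the communication pattern whose cost the paper's Cloud evaluation measures.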