
## STENCIL-AWARE GPU OPTIMIZATION OF ITERATIVE SOLVERS∗

### Citations

258 | Efficient management of parallelism in object-oriented numerical software libraries - Balay, Gropp, et al. - 1997 |

153 | Optimization of sparse matrix-vector multiplication on emerging multicore platforms.
- Williams, Oliker, et al.
- 2009
Citation Context ...totuning the kernels that typically dominate the runtime of large-scale applications involving nonlinear PDE solutions using finite-difference or finite-volume approximations. Many other works (e.g., [13, 30, 34]) explore this structure, but address only a portion of the relevant kernels and typically rely on general sparse matrix formats such as those described in Section 2.1. ... |
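The stencil structure this context refers to can be illustrated with a small, hypothetical 5-point Laplacian sweep on a structured 2-D grid. This is plain Python, not code from the paper; the function name and flat-array layout are assumptions for illustration only:

```python
# Illustrative 5-point Laplacian stencil sweep on an n x n structured
# grid, the kind of kernel such stencil-aware optimizations target.
def apply_5pt_stencil(u, n):
    """u is a flat list representing an n x n grid; interior points get
    the standard 5-point Laplacian, boundary values are copied through."""
    v = u[:]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            k = i * n + j
            v[k] = 4.0 * u[k] - u[k - 1] - u[k + 1] - u[k - n] - u[k + n]
    return v

u = [1.0] * 16  # 4x4 grid of ones
v = apply_5pt_stencil(u, 4)  # interior entries become 0.0 for a constant field
```

Each interior point touches only its four neighbors, which is why the matrices such problems produce are confined to a handful of diagonals.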

143 | SuperLU DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems
- Li, Demmel
Citation Context ...GPU, PETSc AMS subject classifications. 65Y10, 65F50, 15A06, 68N19 1. Introduction. Many scientific applications rely on high-performance numerical libraries, such as Hypre [21], PETSc [6–8], SuperLU [24], and Trilinos [32], for providing accurate and fast solutions to problems modeled by using nonlinear partial differential equations (PDEs). Thus, the bulk of the burden in achieving good performance ... |

135 | Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures
- Datta, Murphy, et al.
- 2008
Citation Context ...totuning the kernels that typically dominate the runtime of large-scale applications involving nonlinear PDE solutions using finite-difference or finite-volume approximations. Many other works (e.g., [13, 30, 34]) explore this structure, but address only a portion of the relevant kernels and typically rely on general sparse matrix formats such as those described in Section 2.1. ... |

113 | Efficient Sparse Matrix-Vector Multiplication on CUDA.
- Bell, Garland
- 2008
Citation Context ... in Section 6. 2. Background. We begin our discussion with a brief contextual overview of storage formats and software packages used in this work. 2.1. Sparse Matrix Storage Formats. Bell and Garland [9] proposed several sparse matrix storage formats, each optimized for different use cases. In this section, we briefly describe three prominent formats: compressed sparse row (CSR), blocked CSR, diagona... |
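As a concrete reference for the CSR format named in this context, here is a minimal sketch of a CSR matrix-vector product in plain Python. It is illustrative only; the array names `values`, `col_idx`, and `row_ptr` follow the conventional CSR layout and are not identifiers from the paper:

```python
# Minimal CSR sparse matrix-vector product, y = A * x.
def csr_matvec(values, col_idx, row_ptr, x):
    n = len(row_ptr) - 1          # number of rows
    y = [0.0] * n
    for i in range(n):
        # nonzeros of row i live in values[row_ptr[i]:row_ptr[i+1]]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# 3x3 example:  [[4, 0, 1],
#                [0, 3, 0],
#                [2, 0, 5]]
values  = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = csr_matvec(values, col_idx, row_ptr, [1.0, 1.0, 1.0])  # -> [5.0, 3.0, 7.0]
```

The indirect access `x[col_idx[k]]` is the source of the irregular memory traffic that the stencil-aware formats discussed in the paper are designed to avoid.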

67 | Towards dense linear algebra for hybrid GPU accelerated manycore systems.
- Tomov, Dongarra, et al.
- 2010
Citation Context ...revious work [18], we demonstrated the performance scaling of our SG-DIA format across different degrees of freedom and efficient bandwidth utilization on various GPU architectures. The MAGMA project [26,31] aims to develop a library similar to LAPACK [4], but for heterogeneous architectures, initially focusing on CPU+GPU systems. MAGMA supports dense (full) matrix algebra, unlike our approach, which foc... |

46 | Cusp: Generic parallel algorithms for sparse matrix and graph computations
- Bell, Garland
- 2010
Citation Context ...in these plots) on an 8-core E5462 Xeon/Tesla C2070 (left) and an 8-core Xeon E5430/Tesla C1060 (right). 5. Related Work. Libraries such as Cusp [10,16], Thrust [11], and cuBLAS [28] provide optimized CUDA implementations of many numerical kernels used in scientific computing. These implementations, however, are not tunable for specific problem chara... |

32 | Pseudotransient continuation and differential-algebraic equations
- Coffey, Kelley, et al.
- 2003
Citation Context ... arising in structured grid problems with degrees of freedom greater than or equal to one. As an example of a problem with multiple degrees of freedom, consider a 2-D simulation of driven cavity flow [12], represented by a system of nonlinear PDEs of the form f(u) = 0, where f : Rn → Rn, and: −∆U − ∇yΩ = 0, −∆V + ∇xΩ = 0, −∆Ω + ∇·([U∗Ω, V∗Ω]) − GR∗∇xT = 0, −∆T + PR∗∇·([U∗T, V∗T]) = 0, where (... |

27 | Thrust: A productivity-oriented library for CUDA
- Bell, Hoberock
- 2012
Citation Context ... kernels in Table 4.2 tuned with OrCuda with that of different library-based implementations. PETSc already includes vector and matrix types with GPU implementations that rely on Cusp [16] and Thrust [11]. While PETSc does not use cuBLAS [28], we use it as a baseline for comparison with the different vector operation implementations because it is the best-performing among the available library options... |

25 | An Improved MAGMA GEMM For Fermi Graphics Processing Units
- Nath, Tomov, et al.
Citation Context ...revious work [18], we demonstrated the performance scaling of our SG-DIA format across different degrees of freedom and efficient bandwidth utilization on various GPU architectures. The MAGMA project [26,31] aims to develop a library similar to LAPACK [4], but for heterogeneous architectures, initially focusing on CPU+GPU systems. MAGMA supports dense (full) matrix algebra, unlike our approach, which foc... |

24 | PETSc users manual - Buschelman, Eijkhout, et al. - 2012 |

24 |
Accessed
- gov
- 2013
Citation Context ...ect classifications. 65Y10, 65F50, 15A06, 68N19 1. Introduction. Many scientific applications rely on high-performance numerical libraries, such as Hypre [21], PETSc [6–8], SuperLU [24], and Trilinos [32], for providing accurate and fast solutions to problems modeled by using nonlinear partial differential equations (PDEs). Thus, the bulk of the burden in achieving good performance and portability is ... |

19 | Top 500 supercomputer sites. http://www.top500.org
- Top
- 2004
Citation Context ...ility of hybrid CPU/accelerator architectures is making the task of providing both portability and high performance in both libraries and applications increasingly challenging. The latest Top500 list [3] contains thirty-nine supercomputing systems with GPGPUs. Amazon has announced the availability of Cluster GPU Instances for Amazon EC2. More and more researchers have access to GPU clusters instead of... |

18 | GCC, the GNU Compiler Collection. http://gcc.gnu.org
- GCC
- 2012
Citation Context ...t functionality is limited to vendor libraries and is not generally available to arbitrary codes. The significant exception among mainstream compilers is the open-source GNU Compiler Collection (GCC) [17] and specifically the Milepost GCC component [15], which employs a machine-learning-based approach that performs optimizations based on a set of code features. To our knowledge, GCC does not generate a... |

13 | Milepost GCC: Machine learning enabled self-tuning compiler
- Fursin, Kashnikov, et al.
- 2011
Citation Context ...d is not generally available to arbitrary codes. The significant exception among mainstream compilers is the open-source GNU Compiler Collection (GCC) [17] and specifically the Milepost GCC component [15], which employs a machine-learning-based approach that performs optimizations based on a set of code features. To our knowledge, GCC does not generate and optimize CUDA (or other GPU) code at this time... |

7 | SPAPT: Search problems in automatic performance tuning
- Balaprakash, Wild, et al.
- 2012
Citation Context ...ts[j]; if (col >= 0 && col < nrows) y[i] += A[i+j*nrows] * x[col]; } } /*@ end @*/ /*@ end @*/ } Fig. 3.2: Annotated matrix-vector multiplication. requiring fewer runs are also available in Orio [5,20]. The highest-performing version replaces the annotated code in the final output of autotuning. Orio also optionally performs validation of tuned kernels by comparing their results with those of the o... |
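The empirical search described in this context can be sketched, in spirit, as a brute-force loop over a small parameter grid, timing each variant and keeping the fastest. Everything below (the toy kernel, the `unroll` parameter, the `autotune` helper) is a hypothetical illustration, not Orio's actual API:

```python
# Sketch of an exhaustive-search autotuning strategy: time a tunable
# kernel under every point of a parameter grid, keep the fastest config.
import itertools
import timeit

def kernel(unroll, n=10000):
    # Toy "tunable" computation: a sum written with a given unroll factor.
    total = 0
    for i in range(0, n - n % unroll, unroll):
        for u in range(unroll):
            total += i + u
    return total

def autotune(param_space):
    best_cfg, best_time = None, None
    for point in itertools.product(*param_space.values()):
        cfg = dict(zip(param_space.keys(), point))
        t = timeit.timeit(lambda: kernel(**cfg), number=5)
        if best_time is None or t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg

best_cfg = autotune({"unroll": [1, 2, 4, 8]})
```

Real autotuners add code generation and validation around this loop, and, as the context notes, offer search strategies that require far fewer runs than exhaustive enumeration.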

7 | Madpack: A family of abstract multigrid or multilevel solvers
- Douglas
- 1995
Citation Context ... our approach, which focuses on sparse matrix algebra in support of stencil-based PDE discretizations. The MADPACK package provides a highly memory-efficient storage format for multigrid formulations [14]. Williams et al. [34] introduce an autotuning approach for optimizing the performance of sparse matrix-vector products on multicore processors, considering many parameters such as loop optimizations,... |

7 | Generating empirically optimized composed matrix kernels from MATLAB prototypes
- Norris, Hartono, et al.
- 2009
Citation Context ...ted code. In previous work, we showed that high-level computation specifications can be embedded in existing C or Fortran codes by expressing them through annotations specified as structured comments [20, 27], as illustrated in Figure 2.2. The performance of code generated from such high-level specifications is almost always significantly better than that of compiled C or Fortran code, and for composed op... |

5 | High-performance sparse matrix-vector multiplication on GPUs for structured grid computations
- Godwin, Holewinski, et al.
- 2012
Citation Context ...cally extensible library is typically much more difficult to optimize through traditional compiler approaches. ∗This work builds on and significantly extends previous work by the authors described in [18,25]. Choudary, Godwin, Holewinski, Karthik, Lowell, Mametjanov, Norris, Sabin, Sadayappan. Our goal is to tackle the challenges in achieving the best possible performance on hybrid CPU/ GPGPU architec... |

5 | PrimeTile: A parametric multi-level tiler for imperfect loop nests
- Hartono, Baskaran, et al.
- 2009
Citation Context ...ted code. In previous work, we showed that high-level computation specifications can be embedded in existing C or Fortran codes by expressing them through annotations specified as structured comments [20, 27], as illustrated in Figure 2.2. The performance of code generated from such high-level specifications is almost always significantly better than that of compiled C or Fortran code, and for composed op... |

4 | Accelerating the solution of families of shifted linear systems with CUDA, arXiv:1102.2143 [hep-lat]
- Galvez, Anders
Citation Context ..., int n, int nos, int dof) { int i, j, col; /*@ begin PerfTuning ( def performance_params { param TC[] = range(32,1025,32); param BC[] = range(14,113,14); param UIF[] = range(1,6); param PL[] = [16,48]; param CFLAGS[] = map(join, product(['','-use_fast_math'], ['','-O1','-O2','-O3'])); } def input_params { param m[] = [32,64,128,256,512]; param n[] = [32,64,128,256,512]; param nos = 5; param dof ... |

4 | CUDA Basic Linear Algebra Subroutines (cuBLAS) Library. http://developer.nvidia.com/cublas. Last accessed April 28, 2012
- NVIDIA
- 2012
Citation Context ...a with that of different library-based implementations. PETSc already includes vector and matrix types with GPU implementations that rely on Cusp [16] and Thrust [11]. While PETSc does not use cuBLAS [28], we use it as a baseline for comparison with the different vector operation implementations because it is the best-performing among the available library options. ... |

3 | Automatic high-performance GPU code generation using CUDA-CHiLL
- Khan, Chame, et al.
- 2011
Citation Context ...autotuning systems are also beginning to target hybrid architectures. For example, the combination of the CHiLL and ActiveHarmony tools can process C code and empirically tune the generated CUDA code [22, 30]. The goals of this approach are similar to ours. Because the existing CPU code itself is used as input in CHiLL, the complexity of the CPU implementation may prevent full optimization of CUDA code. U... |

3 | Autotuning stencil-based computations on GPUs
- Mametjanov, Lowell, et al.
Citation Context ...cally extensible library is typically much more difficult to optimize through traditional compiler approaches. ∗This work builds on and significantly extends previous work by the authors described in [18,25]. Our goal is to tackle the challenges in achieving the best possible performance on hybrid CPU/ GPGPU architec... |

3 | CUDA-CHiLL: A Programming Language Interface for GPGPU Optimizations and Code Generation
- Rudy
- 2010
Citation Context ...totuning the kernels that typically dominate the runtime of large-scale applications involving nonlinear PDE solutions using finite-difference or finite-volume approximations. Many other works (e.g., [13, 30, 34]) explore this structure, but address only a portion of the relevant kernels and typically rely on general sparse matrix formats such as those described in Section 2.1. ... |

2 |
Last accessed
- govhypre
- 2012
Citation Context ...ive solvers, autotuning, GPGPU, PETSc AMS subject classifications. 65Y10, 65F50, 15A06, 68N19 1. Introduction. Many scientific applications rely on high-performance numerical libraries, such as Hypre [21], PETSc [6–8], SuperLU [24], and Trilinos [32], for providing accurate and fast solutions to problems modeled by using nonlinear partial differential equations (PDEs). Thus, the bulk of the burden in ... |

2 | Performance models for blocked sparse matrix-vector multiplication kernels
- Karakasis, Goumas, et al.
- 2009
Citation Context ...egister blocking, reduced indirection compared to CSR, and reduced storage space. The optimal block size is matrix dependent and machine dependent and is usually obtained by using a performance model [33]. 2.1.3. Diagonal Format. DIA is specifically suitable for storing sparse matrices that contain nonzero elements only along the matrix diagonals. In this format, the diagonals with nonzero elements ar... |
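The DIA layout described in this context can be sketched in plain Python; the `data`/`offsets` naming mirrors the common diagonal-format convention (as in SciPy's `dia_matrix`, for example) and is not taken from the paper:

```python
# Minimal DIA (diagonal format) sparse matrix-vector product, y = A * x.
def dia_matvec(data, offsets, x):
    # data[d][i] holds A[i, i + offsets[d]]; positions that fall outside
    # the matrix are padding and must be stored as zero.
    n = len(x)
    y = [0.0] * n
    for d, off in enumerate(offsets):
        for i in range(n):
            j = i + off
            if 0 <= j < n:
                y[i] += data[d][i] * x[j]
    return y

# Tridiagonal 3x3 example:  [[4, 1, 0],
#                            [2, 3, 2],
#                            [0, 1, 5]]
offsets = [-1, 0, 1]
data = [[0.0, 2.0, 1.0],   # sub-diagonal, padded at the start
        [4.0, 3.0, 5.0],   # main diagonal
        [1.0, 2.0, 0.0]]   # super-diagonal, padded at the end
y = dia_matvec(data, offsets, [1.0, 1.0, 1.0])  # -> [5.0, 7.0, 6.0]
```

Because the column index is computed as `i + offset` rather than read from an index array, DIA needs no per-nonzero indirection, which is what makes it attractive for the banded matrices produced by stencil discretizations.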

1 | hiCUDA: High-level GPGPU programming. IEEE Transactions on Parallel and Distributed Systems
- Han, Abdelrahman
Citation Context ... resulting performance gains at that time were limited by the GTX280 double-precision features and required CPU-GPU transfers. Recent directive-based GPU programming models (e.g., hiCUDA and OpenACC) [1, 19, 23] provide a higher-level interface to the programmers for expressing GPU code fragments and automatically generating the GPU-specific code. However, these models support limited features and can only a... |

1 | Moving heterogeneous GPU computing into the mainstream with directive-based, high-level programming models
- Lee, Vetter
- 2012
Citation Context ... resulting performance gains at that time were limited by the GTX280 double-precision features and required CPU-GPU transfers. Recent directive-based GPU programming models (e.g., hiCUDA and OpenACC) [1, 19, 23] provide a higher-level interface to the programmers for expressing GPU code fragments and automatically generating the GPU-specific code. However, these models support limited features and can only a... |