
## Level-3 BLAS on a GPU: Picking the Low Hanging Fruit

### Citations

861 | A set of level 3 basic linear algebra subprograms
- Dongarra, Du Croz, et al.
- 1990
Citation Context: ...iderable performance gains to be attained with minimal effort. We do so by focusing on the familiar and important matrix-matrix operations that are part of the Basic Linear Algebra Subprograms (BLAS) [4] and targeting the NVIDIA family of GPUs. The arrival of NVIDIA’s GPUs and IBM’s Cell Broadband Engine and the recognition that they can be used for computation outside of the field of graphics has cr...

103 | GEMM-Based Level 3 BLAS: High-Performance Model Implementations and Performance Evaluation Benchmark
- Kågström, Ling, et al.
- 1998
Citation Context: ...It is well-known that for each operation there are algorithms that cast most computation in terms of matrix-matrix multiplication, as was pioneered in [5]. Moreover, as part of the FLAME project we have long advocated that it is important to have multiple algorithmic variants at our disposal so that the best algorithm can be chosen for each situation [...
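The idea this context describes, casting a level-3 operation such as symm in terms of gemm, can be sketched in plain Python. The names `gemm` and `symm_via_gemm` are hypothetical and the triple loop is only a stand-in for an optimized gemm; this is not code from the cited paper. A real GEMM-based BLAS does this blockwise (small symm on diagonal blocks, gemm everywhere else) so that almost all flops land in the fast gemm kernel.

```python
def gemm(alpha, A, B, beta, C):
    # C := alpha*A*B + beta*C; a plain triple loop standing in for an
    # optimized gemm kernel.
    m, k, n = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(n):
            s = sum(A[i][p] * B[p][j] for p in range(k))
            C[i][j] = alpha * s + beta * C[i][j]
    return C

def symm_via_gemm(A, B):
    # symm: C := A*B where A is symmetric with only its lower triangle
    # stored. Here we simply rebuild the full operand and fall back on
    # gemm, illustrating (in the crudest way) how a level-3 operation is
    # cast in terms of matrix-matrix multiplication.
    n = len(A)
    full = [[A[i][j] if i >= j else A[j][i] for j in range(n)]
            for i in range(n)]
    C = [[0.0] * len(B[0]) for _ in range(n)]
    return gemm(1.0, full, B, 0.0, C)
```

Multiplying by the identity recovers the full symmetric matrix, regardless of what sits in the (unreferenced) strict upper triangle of the stored operand.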

74 | The science of deriving dense linear algebra algorithms
- Bientinesi, Gunnels, et al.
Citation Context: ...mportant to have multiple algorithmic variants at our disposal so that the best algorithm can be chosen for each situation [3]. The FLAME methodology advocates systematic derivation of these variants [2, 9]. In Section 4 we will show that this is again the case for GPUs. We view our ability to rapidly develop different algorithms as a way of performing software acceleration, the natural (and much needed...

31 | LU, QR and Cholesky factorizations using vector capabilities of GPUs
- Volkov, Demmel
- 2008
Citation Context: ...ntific computing this has meant that considerable effort has been expended on implementing the most important kernel: matrix-matrix multiplication (gemm). Very admirable performance has been achieved [11]. Yet, even operations that are very similar to gemm, e.g., the other level-3 BLAS, did not achieve decent performance in the CUBLAS library for the NVIDIA GPUs when we started this study. Worse, the ...

30 | Families of algorithms related to the inversion of a symmetric positive definite matrix
- Bientinesi, Gunter, et al.
Citation Context: ...]. Moreover, as part of the FLAME project we have long advocated that it is important to have multiple algorithmic variants at our disposal so that the best algorithm can be chosen for each situation [3]. The FLAME methodology advocates systematic derivation of these variants [2, 9]. In Section 4 we will show that this is again the case for GPUs. We view our ability to rapidly develop different algor...

17 | The Science of Programming Matrix Computations
- van de Geijn, Quintana-Ortí
- 2008
Citation Context: ...mportant to have multiple algorithmic variants at our disposal so that the best algorithm can be chosen for each situation [3]. The FLAME methodology advocates systematic derivation of these variants [2, 9]. In Section 4 we will show that this is again the case for GPUs. We view our ability to rapidly develop different algorithms as a way of performing software acceleration, the natural (and much needed...

16 | libflame: The Complete Reference. www.lulu.com
- Van Zee
- 2009
Citation Context: ...n. Thus the quote from Einstein. We are working on a tool, FLAMES2S [10], that can automatically translate algorithms represented in code with the FLAME/C API, used to implement our libflame library [12], to low-level code that uses loops and indexing. This tool could easily generate the code that was created manually for the experiments in this paper. With that, we will make further progress towards...

9 | Solving dense linear algebra problems on platforms with multiple hardware accelerators
- Quintana-Ortí, Igual, et al.
- 2009
Citation Context: ...cusing our efforts on using the gemm implementation for high-level operations like Cholesky factorization by using the accelerators only to compute subproblems that were matrix-matrix multiplications [8, 7, 6]. We all hoped that other functionality would soon be ported to the GPUs, but that some other poor soul would do it for us. In this paper we once again show that as new functionality and optimizations ...
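The approach this context describes, offloading only the matrix-matrix multiplications inside a higher-level factorization, can be illustrated with a right-looking blocked Cholesky factorization. This is a plain-Python sketch of the textbook algorithm, not the cited authors' implementation; the point is that step 3 below, the trailing update, is a matrix-matrix product (syrk, expressible via gemm) and is exactly the subproblem the cited papers send to the accelerator.

```python
import math

def chol_blocked(A, nb=2):
    # Right-looking blocked Cholesky, lower triangular, in place: A = L*L^T.
    # Illustrative sketch only; not the authors' code.
    n = len(A)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # 1) Unblocked Cholesky of the small diagonal block A11 (kept on the CPU).
        for j in range(k, k + kb):
            A[j][j] = math.sqrt(A[j][j] - sum(A[j][p] ** 2 for p in range(k, j)))
            for i in range(j + 1, k + kb):
                A[i][j] = (A[i][j] - sum(A[i][p] * A[j][p]
                                         for p in range(k, j))) / A[j][j]
        # 2) Triangular solve for the panel: A21 := A21 * L11^{-T}.
        for i in range(k + kb, n):
            for j in range(k, k + kb):
                A[i][j] = (A[i][j] - sum(A[i][p] * A[j][p]
                                         for p in range(k, j))) / A[j][j]
        # 3) Trailing update A22 := A22 - A21 * A21^T — a matrix-matrix
        #    multiply; this is the bulk of the flops and the part offloaded
        #    to the GPU in the gemm-only offload strategy.
        for i in range(k + kb, n):
            for j in range(k + kb, i + 1):
                A[i][j] -= sum(A[i][p] * A[j][p] for p in range(k, k + kb))
    # Zero the strict upper triangle so A holds L.
    for i in range(n):
        for j in range(i + 1, n):
            A[i][j] = 0.0
    return A
```

Because the update in step 3 dominates the flop count for large matrices, routing just that product through a fast gemm captures most of the available speedup while the small diagonal factorizations stay on the host.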

4 | Solving Large Dense Matrix Problems on Multi-Core Processors and GPUs. FLAME Working Note #36
- Marqués, Quintana-Ortí, et al.
Citation Context: ...cusing our efforts on using the gemm implementation for high-level operations like Cholesky factorization by using the accelerators only to compute subproblems that were matrix-matrix multiplications [8, 7, 6]. We all hoped that other functionality would soon be ported to the GPUs, but that some other poor soul would do it for us. In this paper we once again show that as new functionality and optimizations ...

2 | Using graphics processors to accelerate the solution of out-of-core linear systems
- Marqués, Quintana-Ortí, et al.
- 2009
Citation Context: ...cusing our efforts on using the gemm implementation for high-level operations like Cholesky factorization by using the accelerators only to compute subproblems that were matrix-matrix multiplications [8, 7, 6]. We all hoped that other functionality would soon be ported to the GPUs, but that some other poor soul would do it for us. In this paper we once again show that as new functionality and optimizations ...