Results 1 - 10 of 16
“Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation.” MICRO, 2012
Cited by 11 (1 self)
Data warehousing applications represent an emerging application arena that requires the processing of relational queries and computations over massive amounts of data. Modern general-purpose GPUs are high-bandwidth architectures that potentially offer substantial improvements in throughput for these applications. However, there are significant challenges that arise due to the overheads of data movement through the memory hierarchy and between the GPU and host CPU. This paper proposes data movement optimizations to address these challenges. Inspired in part by loop fusion optimizations in the scientific computing community, we propose kernel fusion as a basis for data movement optimizations. Kernel fusion fuses the code bodies of two GPU kernels to i) reduce the data footprint, cutting down data movement throughout the GPU and CPU memory hierarchy, and ii) enlarge the compiler optimization scope. We classify producer-consumer dependences between compute kernels into three types: i) fine-grained thread-to-thread dependences, ii) medium-grained thread block dependences, and iii) coarse-grained kernel dependences. Based on this classification, we propose a compiler framework, Kernel Weaver, that can automatically fuse relational algebra operators, thereby eliminating redundant data movement. Experiments on NVIDIA Fermi platforms demonstrate that kernel fusion achieves a 2.89x speedup in GPU computation and a 2.35x speedup in PCIe transfer time on average across the microbenchmarks tested. We present key insights, lessons learned, measurements from our compiler implementation, and opportunities for further improvements.
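The idea of fusing a producer kernel with its consumer can be illustrated with a minimal sketch. This is not Kernel Weaver's generated code: it is plain Python standing in for GPU kernels, and the function names (`select_kernel`, `scale_kernel`, `fused`) and the choice of a SELECT followed by an arithmetic step are assumptions made for illustration only.

```python
def select_kernel(rows, threshold):
    """Unfused step 1: a SELECT that materializes an intermediate buffer."""
    return [r for r in rows if r > threshold]


def scale_kernel(rows, factor):
    """Unfused step 2: arithmetic over the intermediate buffer."""
    return [r * factor for r in rows]


def unfused(rows, threshold, factor):
    # The intermediate list is written out by one kernel and read back by the next.
    intermediate = select_kernel(rows, threshold)
    return scale_kernel(intermediate, factor)


def fused(rows, threshold, factor):
    """Fused version: both operator bodies run in one pass, with no intermediate buffer."""
    out = []
    for r in rows:
        if r > threshold:
            out.append(r * factor)
    return out
```

Both versions return the same rows; the fused one simply never stores the result of the SELECT, which is the data-footprint reduction the abstract describes.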
Absorption Reconstruction Improves Biodistribution Assessment of Fluorescent Nanoprobes Using Hybrid Fluorescence-mediated Tomography. Theranostics, 2014
Cited by 2 (2 self)
licenses/by-nc-nd/3.0/). Reproduction is permitted for personal, noncommercial use, provided that the article is in whole, unmodified, and properly cited. Received: 2014.04.03; Accepted: 2014.05.27; Published: 2014.07.26
Aim: Fluorescence-mediated tomography (FMT) holds potential for accelerating diagnostic and theranostic drug development. However, proper quantitative fluorescence reconstruction requires knowledge of optical scattering and absorption, which are highly heterogeneous across different (mouse) tissues. Here we describe methods to assess these parameters using co-registered micro Computed Tomography (µCT) data and nonlinear whole-animal absorption reconstruction, and evaluate their importance for assessment of the biodistribution and target site accumulation of fluorophore-labeled drug delivery systems.
PARALLEL UNSMOOTHED AGGREGATION ALGEBRAIC MULTIGRID ALGORITHMS ON GPUS
Cited by 1 (0 self)
Abstract. We design and implement a parallel algebraic multigrid method for isotropic graph Laplacian problems on multicore Graphical Processing Units (GPUs). The proposed AMG method is based on the aggregation framework. The setup phase of the algorithm uses a parallel maximal independent set algorithm in forming aggregates and the resulting coarse level hierarchy is then used in a K-cycle iteration solve phase with an ℓ1-Jacobi smoother. Numerical tests of a parallel implementation of the method for graphics processors are presented to demonstrate its effectiveness.
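As a rough illustration of the smoother named in this abstract, the sketch below performs one sweep of a commonly used ℓ1-Jacobi variant, in which each residual entry is divided by the ℓ1 norm of the corresponding matrix row. The abstract does not give the authors' exact definition, so this scaling is an assumption; the dense list-of-lists storage and the function name `l1_jacobi_sweep` are also illustrative choices, not the paper's GPU implementation.

```python
def l1_jacobi_sweep(A, x, b):
    """One sweep of x <- x + D^{-1} (b - A x), with D[i] = sum_j |A[i][j]|.

    Assumption: D is the row-wise l1 norm; other l1-Jacobi variants exist.
    Every entry of the new x depends only on the old x, so the loop body
    is independent per row (the property that makes it attractive on GPUs).
    """
    n = len(A)
    residual = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    return [x[i] + residual[i] / sum(abs(a) for a in A[i]) for i in range(n)]
```

Repeated sweeps on a small symmetric positive definite system, for example `A = [[2, -1], [-1, 2]]` with `b = [1, 1]`, move the iterate toward the exact solution `[1, 1]`.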
ACCELERATING PRECONDITIONED ITERATIVE LINEAR SOLVERS ON GPU
Abstract. Linear systems must be solved in many scientific applications, and the solution of these systems often dominates the total running time. In this paper, we introduce our work on developing parallel linear solvers and preconditioners for solving large sparse linear systems using NVIDIA GPUs. We develop a new sparse matrix-vector multiplication kernel and a sparse BLAS library for GPUs. Based on the BLAS library, several Krylov subspace linear solvers, algebraic multi-grid (AMG) solvers, and commonly used preconditioners are developed, including GMRES, CG, BICGSTAB, ORTHOMIN, a classical AMG solver, a polynomial preconditioner, ILU(k) and ILUT preconditioners, and a domain decomposition preconditioner. Numerical experiments show that these linear solvers and preconditioners are efficient for solving large linear systems. Key words. Krylov subspace solver, algebraic multi-grid solver, parallel preconditioner, GPU computing, sparse matrix-vector multiplication, HEC
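Sparse matrix-vector multiplication is the building block the solvers above rest on. The sketch below shows the operation for the widely used compressed sparse row (CSR) layout. CSR is an assumption chosen for familiarity; it is not the storage format or kernel developed in the paper, and the function name `csr_spmv` is hypothetical.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[i]:row_ptr[i + 1] delimits the nonzeros of row i inside
    col_idx (column numbers) and values (the nonzero entries).
    """
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for i in range(num_rows):  # rows are independent, so they can run in parallel
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y
```

For the 3x3 matrix with rows `[4, 0, 1]`, `[0, 3, 0]`, `[2, 0, 5]`, the CSR arrays are `row_ptr = [0, 2, 3, 5]`, `col_idx = [0, 2, 1, 0, 2]`, and `values = [4, 1, 3, 2, 5]`.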
EXPLOITING MULTIPLE LEVELS OF PARALLELISM IN SPARSE MATRIX-MATRIX MULTIPLICATION
Abstract. Sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. The scaling of existing parallel implementations of SpGEMM is heavily bound by communication. Even though 3D (or 2.5D) algorithms have been proposed and theoretically analyzed in the flat MPI model on Erdős-Rényi matrices, those algorithms had not been implemented in practice and their complexities had not been analyzed for the general case. In this work, we present the first ever implementation of the 3D SpGEMM formulation that also exploits multiple (intra-node and inter-node) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrency. We extensively evaluate our implementation and identify bottlenecks that should be subject to further research. Key words. Parallel computing, numerical linear algebra, sparse matrix-matrix multiplication, 2.5D algorithms, 3D algorithms, multithreading, SpGEMM, 2D decomposition, graph algorithms.
Parallelization Strategies of the Canny Edge Detector for Multi-core CPUs and Many-core GPUs
Abstract—In this paper we study two parallelization strategies (loop-level parallelism and domain decomposition), and we investigate their impact in terms of performance and scalability on two different parallel architectures. As a test application, we use the Canny Edge Detector due to its wide range of parallelization opportunities, and its frequent use in computer vision applications. Different parallel implementations of the Canny Edge Detector are run on two distinct hardware platforms, namely a multi-core CPU, and a many-core GPU. Our experiments uncover design rules that, depending on a set of applications and platform factors (parallel features, data size, and architecture), indicate which parallelization scheme is more suitable.
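Domain decomposition, one of the two strategies named in this abstract, can be sketched on a single stencil stage. The example below is not the paper's implementation and does not cover the full Canny pipeline: it applies only a vertical central difference, splits the image into row bands, and gives each band a private copy with one halo row on each side. The function names, the band layout, and the use of `ThreadPoolExecutor` are all assumptions for illustration, and Python threads here show the structure rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def vertical_gradient(rows):
    """Central difference down each column; border rows are clamped."""
    last = len(rows) - 1
    return [
        [rows[min(i + 1, last)][j] - rows[max(i - 1, 0)][j]
         for j in range(len(rows[i]))]
        for i in range(len(rows))
    ]


def process_band(img, start, end):
    """Compute output rows [start, end) from a private copy with halo rows."""
    lo, hi = max(start - 1, 0), min(end + 1, len(img))
    local = vertical_gradient(img[lo:hi])   # the slice copies band plus halo
    offset = start - lo                     # skip the upper halo row, if any
    return local[offset:offset + (end - start)]


def gradient_by_bands(img, num_bands):
    """Split the image into row bands and process each band independently."""
    height = len(img)
    size = -(-height // num_bands)          # ceiling division
    bands = [(s, min(s + size, height)) for s in range(0, height, size)]
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda band: process_band(img, *band), bands)
        return [row for part in parts for row in part]
```

The banded result matches `vertical_gradient(img)` applied to the whole image, because each band's halo rows supply exactly the neighbours its stencil needs.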
Cray XE6, Intel® Xeon® E5-2670 and X5550 processor-based
Abstract—Multigrid methods are widely used to accelerate the convergence of iterative solvers for linear systems used in a number of different application areas. In this paper, we explore optimization techniques for geometric multigrid on existing and emerging multicore systems including the Opteron-based
Communication-Avoiding Optimization of Geometric Multigrid on GPUs
personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.
Acknowledgement: I would like to thank Samuel Williams, Research Scientist at Lawrence Berkeley National Laboratory (LBNL), for guiding me throughout the project and giving valuable feedback. I would also like to thank my research adviser, Professor James Demmel, for his suggestions and feedback for my
Unstructured Forests of Octrees
Abstract—We present a parallel multigrid method for solving variable-coefficient elliptic partial differential equations on arbitrary geometries using highly adapted meshes. Our method is designed for meshes that are built from an unstructured hexahedral macro mesh, in which each macro element is adaptively refined as an octree. This forest-of-octrees approach enables us to generate meshes for complex geometries with arbitrary levels of local refinement. We use geometric multigrid (GMG) for each of the octrees and algebraic multigrid (AMG) as the coarse grid solver. We designed our GMG sweeps to entirely avoid collectives, thus minimizing communication cost. We present weak and strong scaling results for the 3D variable-coefficient Poisson problem that demonstrate high parallel scalability. As a highlight, the largest problem we solve is on a non-uniform mesh with 100 billion unknowns on 262,144 cores of NCCS’s Cray XK6 “Jaguar”; in this solve we sustain 272 TFlops/s.