CiteSeerX

Exposing fine-grained parallelism in algebraic multigrid methods (2012)

by Nathan Bell, Steven Dalton, Luke N. Olson

Results 1 - 10 of 16

Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

by Haicheng Wu, Gregory Diamos, Srihari Cadambi, Sudhakar Yalamanchili - MICRO, 2012
Abstract - Cited by 11 (1 self)
Data warehousing applications represent an emerging application arena that requires the processing of relational queries and computations over massive amounts of data. Modern general-purpose GPUs are high-bandwidth architectures that potentially offer substantial improvements in throughput for these applications. However, there are significant challenges that arise due to the overheads of data movement through the memory hierarchy and between the GPU and host CPU. This paper proposes data movement optimizations to address these challenges. Inspired in part by loop fusion optimizations in the scientific computing community, we propose kernel fusion as a basis for data movement optimizations. Kernel fusion fuses the code bodies of two GPU kernels to (i) reduce the data footprint, cutting data movement through the GPU and CPU memory hierarchy, and (ii) enlarge the compiler's optimization scope. We classify producer-consumer dependences between compute kernels into three types: (i) fine-grained thread-to-thread dependences, (ii) medium-grained thread-block dependences, and (iii) coarse-grained kernel dependences. Based on this classification, we propose a compiler framework, Kernel Weaver, that can automatically fuse relational algebra operators, thereby eliminating redundant data movement. Experiments on NVIDIA Fermi platforms demonstrate that kernel fusion achieves a 2.89x speedup in GPU computation and a 2.35x speedup in PCIe transfer time on average across the microbenchmarks tested. We present key insights, lessons learned, measurements from our compiler implementation, and opportunities for further improvements.
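The fusion the abstract describes can be sketched outside the GPU setting. Below is a minimal Python illustration (the operator names are hypothetical, not Kernel Weaver's API) of why fusing a producer kernel with its consumer removes the materialized intermediate:

```python
# Two toy relational "kernels" over (key, value) tuples. The unfused version
# writes an intermediate result that the second kernel must read back --
# exactly the data movement that kernel fusion eliminates.

def select_then_project(rows):
    """Unfused: selection materializes a temporary, projection re-reads it."""
    selected = [r for r in rows if r[1] > 10]      # kernel 1: selection
    return [r[0] for r in selected]                # kernel 2: projection

def fused_select_project(rows):
    """Fused: one pass over the input, no intermediate buffer."""
    return [r[0] for r in rows if r[1] > 10]

rows = [("a", 5), ("b", 20), ("c", 15)]
assert select_then_project(rows) == fused_select_project(rows) == ["b", "c"]
```

On a GPU the same transformation also presents the compiler with a single kernel body, which is what enlarges its optimization scope.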

Citation Context

...tation Kernel fusion is based on the multi-stage formulation of algorithms for the RA operators. Multi-stage algorithms are common to sorting [31], pattern matching [39], algebraic multi-grid solvers [5], or compression [17]. This formulation is popular for GPU algorithms in particular since it enables one to separate the structured components of the algorithm from the irregular or unstructured compo...

Absorption Reconstruction Improves Biodistribution Assessment of Fluorescent Nanoprobes Using Hybrid Fluorescence-mediated Tomography. Theranostics

by Felix Gremse, Benjamin Theek, Sijumon Kunjachan, Wiltrud Lederle, Alessa Pardo, Stefan Barth, Twan Lammers, Uwe Naumann, Fabian Kiessling , 2014
Abstract - Cited by 2 (2 self)
licenses/by-nc-nd/3.0/). Reproduction is permitted for personal, noncommercial use, provided that the article is in whole, unmodified, and properly cited. Received: 2014.04.03; Accepted: 2014.05.27; Published: 2014.07.26 Aim: Fluorescence-mediated tomography (FMT) holds potential for accelerating diagnostic and theranostic drug development. However, for proper quantitative fluorescence reconstruction, knowledge on optical scattering and absorption, which are highly heterogeneous in different (mouse) tissues, is required. We here describe methods to assess these parameters using co-registered micro Computed Tomography (µCT) data and nonlinear whole-animal absorption reconstruction, and evaluate their importance for assessment of the biodistribution and target site accumulation of fluorophore-labeled drug delivery systems.

Citation Context

... using the nonlinear conjugate gradient method and the required gradient computations were performed using algorithmic differentiation [31,32] with GPU-accelerated sparse vector and matrix operations [33]. The unknown scale factor between source power and pixel intensities was calibrated using a phantom with known homogeneous scattering (8 cm⁻¹) and negligible absorption (0.1 cm⁻¹). Minimization of th...

PARALLEL UNSMOOTHED AGGREGATION ALGEBRAIC MULTIGRID ALGORITHMS ON GPUS

by James Brannick, Yao Chen, Xiaozhe Hu, Ludmil Zikatanov
Abstract - Cited by 1 (0 self)
Abstract. We design and implement a parallel algebraic multigrid method for isotropic graph Laplacian problems on multicore Graphics Processing Units (GPUs). The proposed AMG method is based on the aggregation framework. The setup phase of the algorithm uses a parallel maximal independent set algorithm in forming aggregates, and the resulting coarse-level hierarchy is then used in a K-cycle iteration solve phase with an ℓ1-Jacobi smoother. Numerical tests of a parallel implementation of the method for graphics processors are presented to demonstrate its effectiveness.
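As a rough, sequential illustration of the aggregation idea in the abstract (the paper computes the independent set in parallel; this greedy Python sketch only shows how independent-set seeds become aggregates):

```python
def greedy_aggregate(adj):
    """Visit vertices in order; a vertex not yet aggregated is independent
    of all previous seeds, so it starts a new aggregate and absorbs its
    still-free neighbors. `adj` maps a vertex to its neighbor set."""
    aggregate = {}                    # vertex -> aggregate id
    next_id = 0
    for v in adj:
        if v in aggregate:
            continue                  # already absorbed by an earlier seed
        aggregate[v] = next_id        # v joins the independent set
        for w in adj[v]:
            aggregate.setdefault(w, next_id)
        next_id += 1
    return aggregate

# Path graph 0-1-2-3-4: seeds 0, 2, 4 produce aggregates {0,1}, {2,3}, {4}.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
assert greedy_aggregate(path) == {0: 0, 1: 0, 2: 1, 3: 1, 4: 2}
```

The vertex-to-aggregate map is what defines the piecewise-constant prolongation used by unsmoothed aggregation.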

Citation Context

...xtensive research has been devoted to improving the performance of parallel coarsening algorithms, leading to notable improvements on CPU architectures [9, 28, 27, 21, 11, 8, 22, 28], on a single GPU [3, 26, 19], and on multiple GPUs [12], the setup phase is still considered a bottleneck in parallel AMG methods. We mention the work in [3], where a smoothed aggregation setup is developed in CUDA for GPUs. Key...

ACCELERATING PRECONDITIONED ITERATIVE LINEAR SOLVERS ON GPU

by Hui Liu, Zhangxin Chen, Bo Yang
Abstract
Abstract. Linear systems must be solved in many scientific applications, and their solution often dominates the total running time. In this paper, we introduce our work on developing parallel linear solvers and preconditioners for solving large sparse linear systems using NVIDIA GPUs. We develop a new sparse matrix-vector multiplication kernel and a sparse BLAS library for GPUs. Based on this BLAS library, several Krylov subspace solvers, algebraic multigrid (AMG) solvers, and commonly used preconditioners are developed, including GMRES, CG, BICGSTAB, ORTHOMIN, a classical AMG solver, a polynomial preconditioner, ILU(k) and ILUT preconditioners, and a domain decomposition preconditioner. Numerical experiments show that these linear solvers and preconditioners are efficient for solving large linear systems. Key words. Krylov subspace solver, algebraic multigrid solver, parallel preconditioner, GPU computing, sparse matrix-vector multiplication, HEC
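The SpMV kernel at the core of such a solver stack is easy to state over the standard CSR layout; the sketch below is didactic Python, not the paper's HEC format:

```python
def csr_spmv(row_ptr, col_idx, vals, x):
    """Compute y = A @ x with A in CSR form: the nonzeros of row i live in
    vals[row_ptr[i]:row_ptr[i+1]], with column indices in col_idx."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

# A = [[4, 1],
#      [0, 3]]  multiplied by x = [1, 2]
row_ptr, col_idx, vals = [0, 2, 3], [0, 1, 1], [4.0, 1.0, 3.0]
assert csr_spmv(row_ptr, col_idx, vals, [1.0, 2.0]) == [6.0, 6.0]
```

On a GPU the outer loop is distributed across threads or warps; irregular row lengths are the load-balancing problem that hybrid formats such as HEC are designed to address.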

Citation Context

...eral Krylov solvers and ILU preconditioner were implemented on GPU. Haase et al. developed a parallel AMG solver using a GPU cluster [12]. Researchers from NVIDIA also developed an AMG solver on GPU [9, 13]. The setup and solving phases were both run on GPU, which made their AMG very fast. Chen et al. from University of Calgary designed a new matrix format HEC (Hybrid of Ell and CSR), fast SpMV kernel [...

EXPLOITING MULTIPLE LEVELS OF PARALLELISM IN SPARSE MATRIX-MATRIX MULTIPLICATION

by Ariful Azad, Grey Ballard, James Demmel
Abstract
Abstract. Sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. The scaling of existing parallel implementations of SpGEMM is heavily bound by communication. Even though 3D (or 2.5D) algorithms have been proposed and theoretically analyzed in the flat MPI model on Erdős-Rényi matrices, those algorithms had not been implemented in practice and their complexities had not been analyzed for the general case. In this work, we present the first implementation of the 3D SpGEMM formulation that also exploits multiple (intra-node and inter-node) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrency. We extensively evaluate our implementation and identify bottlenecks that should be subject to further research. Key words. Parallel computing, numerical linear algebra, sparse matrix-matrix multiplication, 2.5D algorithms, 3D algorithms, multithreading, SpGEMM, 2D decomposition, graph algorithms.
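Within each process, SpGEMM is commonly organized row by row (Gustavson's algorithm). The Python sketch below uses dict-of-dicts storage for clarity; real implementations use compressed formats and dense or hash accumulators:

```python
def spgemm(A, B):
    """C = A @ B for sparse matrices stored as {row: {col: val}}.
    Row i of C is a sparse accumulation of rows of B, one scaled row per
    nonzero A[i][k] -- Gustavson's row-by-row formulation."""
    C = {}
    for i, Ai in A.items():
        acc = {}                                  # sparse accumulator for row i
        for k, a_ik in Ai.items():
            for j, b_kj in B.get(k, {}).items():
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
        C[i] = acc
    return C

A = {0: {0: 2.0}, 1: {0: 1.0, 1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0}}
assert spgemm(A, B) == {0: {1: 8.0}, 1: {0: 15.0, 1: 4.0}}
```

The 2D and 3D algorithms in the paper distribute exactly this computation across a process grid, trading data replication for reduced communication.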

Citation Context

... is also used in scientific computing. For instance, it is often a performance bottleneck for Algebraic Multigrid (AMG), where it is used in the setup phase for restricting and interpolating matrices [7]. Schur complement methods in hybrid linear solvers [49] also require fast SpGEMM. In electronic structure calculations, linear-scaling methods exploit Kohn’s “nearsightedness” principle of electrons ...

Parallelization Strategies of the Canny Edge Detector for Multi-core CPUs and Many-core GPUs

by Taieb Lamine Ben Cheikh, Giovanni Beltrame, Gabriela Nicolescu, Farida Cheriet
Abstract
Abstract—In this paper we study two parallelization strategies (loop-level parallelism and domain decomposition), and we investigate their impact in terms of performance and scalability on two different parallel architectures. As a test application, we use the Canny Edge Detector due to its wide range of parallelization opportunities, and its frequent use in computer vision applications. Different parallel implementations of the Canny Edge Detector are run on two distinct hardware platforms, namely a multi-core CPU and a many-core GPU. Our experiments uncover design rules that, depending on application and platform factors (parallel features, data size, and architecture), indicate which parallelization scheme is more suitable.
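The domain-decomposition strategy can be contrasted with a serial sweep on a stand-in stencil (a horizontal 3-point average; the real Canny stages are more involved). In this Python sketch, the image is split into row strips processed as independent tasks:

```python
from concurrent.futures import ThreadPoolExecutor

def blur_rows(img, r0, r1):
    """Horizontal 3-point average over rows r0..r1-1, with edge clamping.
    A stand-in for one image-processing stage, not an actual Canny step."""
    out = []
    for row in img[r0:r1]:
        n = len(row)
        out.append([(row[max(j - 1, 0)] + row[j] + row[min(j + 1, n - 1)]) / 3.0
                    for j in range(n)])
    return out

def blur_domain_decomposed(img, workers=2):
    """Domain decomposition: cut the image into horizontal strips and give
    each strip to its own task. A row-wise stencil needs no halo exchange."""
    h = len(img)
    step = (h + workers - 1) // workers
    chunks = [(r, min(r + step, h)) for r in range(0, h, step)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        parts = list(ex.map(lambda c: blur_rows(img, *c), chunks))
    return [row for part in parts for row in part]

img = [[0.0, 3.0, 0.0]] * 4
assert blur_domain_decomposed(img) == blur_rows(img, 0, 4)
```

Loop-level parallelism would instead parallelize the inner per-pixel loop of each stage; which of the two wins is precisely the platform-dependent question the paper studies.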

Citation Context

...cation has a lower execution time for Section 1, as shown in Fig. 6. This is due to the fact that the GPU architecture is more suitable for fine-grained data parallelism, as opposed to multicore CPUs [12]. However, Fig. 6 also shows that coarse-grained parallelism offers a lower execution time for Section 3. This is essentially due to the structure of the last section, which consists of an outer loop nest...

Cray XE6, Intel® Xeon® E5-2670 and X5550 processor-based

by Samuel Williams, Dhiraj D. Kalamkar, Amik Singh, M. Deshp, Brian Van Straalen, Mikhail Smelyanskiy, Ann Almgren, Pradeep Dubey, John Shalf, Leonid Oliker
Abstract
Abstract—Multigrid methods are widely used to accelerate the convergence of iterative solvers for linear systems used in a number of different application areas. In this paper, we explore optimization techniques for geometric multigrid on existing and emerging multicore systems, including the Opteron-based

Citation Context

...lacian [5]. This approach is orthogonal to our implemented optimizations and their technique could be incorporated in future work. Studies have explored the performance of algebraic multigrid on GPUs [1], [2], while Sturmer et al. examined geometric multigrid [25]. Perhaps the most closely related work is that of Treibig, which implements a 2D GSRB on SIMD architectures by separating and ...

Communication-Avoiding Optimization of Geometric Multigrid on GPUs

by Amik Singh, James Demmel
Abstract
personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission. Acknowledgement I would like to thank Samuel Williams, Research Scientist at Lawrence Berkeley National Laboratory (LBNL), for guiding me throughout the project and giving valuable feedback. Also, I would like to thank my research adviser Professor James Demmel for his suggestions and feedback for my

Citation Context

...amount of algorithmic innovation. 2. Related Work We explore Geometric Multigrid on GPUs. Bell et al. have explored the performance of algebraic (sparse rather than structured) multigrid on GPUs [9]. Sellapa et al. explore constant-coefficient elliptic partial differential equations on structured grids [11]. Perhaps the most closely related work is that of Treibig, which implements...

Parallel Geometric-Algebraic Multigrid on Unstructured Forests of Octrees

by Hari Sundar, George Biros, Carsten Burstedde, Johann Rudi, Omar Ghattas, Georg Stadler
Abstract
Abstract—We present a parallel multigrid method for solving variable-coefficient elliptic partial differential equations on arbitrary geometries using highly adapted meshes. Our method is designed for meshes that are built from an unstructured hexahedral macro mesh, in which each macro element is adaptively refined as an octree. This forest-of-octrees approach enables us to generate meshes for complex geometries with arbitrary levels of local refinement. We use geometric multigrid (GMG) for each of the octrees and algebraic multigrid (AMG) as the coarse grid solver. We designed our GMG sweeps to entirely avoid collectives, thus minimizing communication cost. We present weak and strong scaling results for the 3D variable-coefficient Poisson problem that demonstrate high parallel scalability. As a highlight, the largest problem we solve is on a non-uniform mesh with 100 billion unknowns on 262,144 cores of NCCS’s Cray XK6 “Jaguar”; in this solve we sustain 272 TFlops/s.
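The GMG component can be illustrated with the textbook 1D V-cycle below (a didactic Python sketch of standard multigrid pieces, not the paper's octree implementation): damped-Jacobi smoothing, full-weighting restriction, linear interpolation, and a recursive coarse-grid correction for -u'' = f with zero boundary values.

```python
def apply_A(u, h):
    """Matrix-free application of the 1D Poisson operator
    A_h = (1/h^2) * tridiag(-1, 2, -1) with zero Dirichlet boundaries."""
    n = len(u)
    return [(2 * u[i]
             - (u[i - 1] if i > 0 else 0.0)
             - (u[i + 1] if i < n - 1 else 0.0)) / (h * h) for i in range(n)]

def jacobi(u, f, h, sweeps=2, w=2.0 / 3.0):
    """Damped Jacobi smoothing for A_h u = f."""
    for _ in range(sweeps):
        u = [(1 - w) * u[i] + w * 0.5 * (h * h * f[i]
             + (u[i - 1] if i > 0 else 0.0)
             + (u[i + 1] if i < len(u) - 1 else 0.0)) for i in range(len(u))]
    return u

def restrict(r):
    """Full weighting: coarse point i sits at fine point 2i+1."""
    return [0.25 * r[2 * i] + 0.5 * r[2 * i + 1] + 0.25 * r[2 * i + 2]
            for i in range((len(r) - 1) // 2)]

def prolong(e, n):
    """Linear interpolation of a coarse correction back to n fine points."""
    u = [0.0] * n
    for i, v in enumerate(e):
        u[2 * i] += 0.5 * v
        u[2 * i + 1] += v
        u[2 * i + 2] += 0.5 * v
    return u

def v_cycle(u, f, h):
    if len(u) == 1:                          # coarsest level: direct solve
        return [f[0] * h * h / 2.0]
    u = jacobi(u, f, h)                      # pre-smooth
    r = [fi - ai for fi, ai in zip(f, apply_A(u, h))]
    e = v_cycle([0.0] * ((len(u) - 1) // 2), restrict(r), 2 * h)
    u = [ui + ei for ui, ei in zip(u, prolong(e, len(u)))]
    return jacobi(u, f, h)                   # post-smooth

# 15 interior points on [0, 1]; a handful of V-cycles drives the residual down.
u, f, h = [0.0] * 15, [1.0] * 15, 1.0 / 16
for _ in range(10):
    u = v_cycle(u, f, h)
```

AMG replaces the fixed `restrict`/`prolong` pair with operators built from the matrix itself, which is why the paper can use it as a black-box coarse solver below the geometric hierarchy.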

Citation Context

... parallel implementations of algebraic multigrid target fine-scale parallelism as required on clusters of GPUs, where the problem must be split into many independent threads to obtain good performance [26], [27], [28]. Finally, we briefly contrast AMG and GMG approaches. The advantages of AMG are that it can be used as a black-box algorithm and does not require geometry or mesh information (except for t...

PIPELINED ITERATIVE SOLVERS WITH KERNEL FUSION FOR GRAPHICS PROCESSING UNITS

by K. Rupp, J. Weinbub, T. Grasser
Abstract
Abstract not found

Citation Context

...le Paralution [33], VexCL [44], or ViennaCL [45]. A substantial amount of research has been conducted on various preconditioning techniques for iterative solvers on GPUs including algebraic multigrid [6, 13, 34, 46], incomplete factorizations [23, 30], or sparse approximate inverses [10, 25, 40]. Nevertheless, hardware-efficient and scalable black-box preconditioners for GPUs are not available, but instead the u...
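Kernel fusion in this solver context means collapsing several vector operations of one iteration into a single pass, cutting kernel launches and synchronization points. A CPU-side Python sketch (the step names are hypothetical, not ViennaCL's API):

```python
def unfused_step(x, p, r, alpha):
    """Two separate 'kernels': an axpy update, then a reduction over r."""
    x = [xi + alpha * pi for xi, pi in zip(x, p)]   # kernel 1: x += alpha * p
    rr = sum(ri * ri for ri in r)                   # kernel 2: dot(r, r)
    return x, rr

def fused_step(x, p, r, alpha):
    """One fused sweep: update and reduce in a single pass over the data,
    the analogue of fusing two GPU kernels into one launch."""
    out, rr = [], 0.0
    for xi, pi, ri in zip(x, p, r):
        out.append(xi + alpha * pi)
        rr += ri * ri
    return out, rr

assert unfused_step([1.0], [2.0], [3.0], 0.5) == fused_step([1.0], [2.0], [3.0], 0.5)
```

Pipelined solver formulations go one step further and reorder the recurrences so that such fused sweeps need fewer global reductions per iteration.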


Developed at and hosted by The College of Information Sciences and Technology

© 2007-2019 The Pennsylvania State University