
## Exposing fine-grained parallelism in algebraic multigrid methods (2012)

Citations: 17 (0 self)

### Citations

639 | Basic linear algebra subprograms for Fortran usage
- Lawson, Hanson, et al.
- 1979
Citation Context ...ution} 6 xk ← postsmooth(Ak, xk, bk, µ2) {smooth µ2 times on Akxk = bk} reduction, parallel prefix-sum (or scan), and sorting. In short, these primitives are to general-purpose computations what BLAS [26] is to computations in linear algebra. Given the broad scope of their usage, special emphasis has been placed on the performance of primitives and very highly-optimized implementations are readily ava...

621 | Multilevel k-way partitioning scheme for irregular graphs
- Karypis, Kumar
- 1995
Citation Context ...lel is challenging, but several methods exist. With k = 1, our parallel version in Algorithm 5 can be considered a variant of Luby’s method [27] which has been employed in many codes such as ParMETIS [24]. A common characteristic of such schemes is the use of randomization to select independent set nodes in parallel. As with the serial method, all nodes are initially labeled (with a 0) as a candidate ...

608 | MapReduce: simplified data processing on large clusters
- Dean, Ghemawat
Citation Context ... fundamental component of common algorithms such as stream compaction. In Thrust, the inclusive scan algorithm computes the “inclusive” variant of the scan primitive, inclusive scan([3, 4, 1, 5, 2])→ [3, 7, 8, 13, 15], while the exclusive scan algorithm computes the “exclusive” variant, exclusive scan([3, 4, 1, 5, 2], 10)→ [10, 13, 17, 18, 23], which incorporates a user-specified starting value and excludes the fi...

448 | A Simple Parallel Algorithm for the Maximal Independent Set Problem
- Luby
- 1985
Citation Context ...st of MIS nodes} Computing maximal independent sets in parallel is challenging, but several methods exist. With k = 1, our parallel version in Algorithm 5 can be considered a variant of Luby’s method [27] which has been employed in many codes such as ParMETIS [24]. A common characteristic of such schemes is the use of randomization to select independent set nodes in parallel. As with the serial method...

300 | Vector Models for Data-Parallel Computing
- Blelloch
- 1990

286 | Sparse matrix solvers on the gpu: Conjugate gradients and multigrid
- Bolz, Farmer, et al.
Citation Context ... partitioning and expose parallelism to the finest granularity — i.e., one thread per matrix row or one thread per nonzero entry. Geometric multigrid methods were the first to be parallelized on GPUs [19, 9, 34]. These “GPGPU” approaches, which preceded the introduction of the CUDA and OpenCL programming interfaces, programmed the GPU through existing graphics application programming interfaces (APIs) such a...

151 | Optimization of sparse matrix-vector multiplication on emerging multicore platforms
- Williams, Oliker, et al.
- 2007
Citation Context ...s and memory access patterns, is more challenging to implement than the aforementioned vector operations. Nevertheless efficient techniques exist for matrices with a wide variety of sparsity patterns [11, 9, 38, 15, 39, 4, 5]. Our implementations of sparse matrix-vector multiplication are described in [5, 7]. In Algorithm 3 sparse matrix-vector multiplication is used to compute the residual, to restrict the fine-level res...

144 | Algebraic multigrid by smoothed aggregation for second and fourth order elliptic problems
- Vaněk, Mandel, et al.
- 1996
Citation Context ...AMG) attempt to automatically construct a hierarchy of grids and intergrid transfer operators without explicit knowledge of the underlying problem — i.e., directly from the linear system of equations [32, 37]. In the remainder of this section, we outline the basic components of AMG in an aggregation context [37] and highlight the necessary sparse matrix computations used in the process. We restrict our at...

135 | Implementing sparse matrix-vector multiplication on throughput-oriented processors
- Bell, Garland
- 2009
Citation Context ... fundamental component of common algorithms such as stream compaction. In Thrust, the inclusive scan algorithm computes the “inclusive” variant of the scan primitive, inclusive scan([3, 4, 1, 5, 2])→ [3, 7, 8, 13, 15], while the exclusive scan algorithm computes the “exclusive” variant, exclusive scan([3, 4, 1, 5, 2], 10)→ [10, 13, 17, 18, 23], which incorporates a user-specified starting value and excludes the fi...

120 | BoomerAMG: a parallel algebraic multigrid solver and preconditioner, Applied Numerical Mathematics 41
- Henson, Yang
- 2000
Citation Context ...Urbana-Champaign, Urbana, IL 61801, lukeo@illinois.edu, http://www.cs.illinois.edu/homes/lukeo 1 2 Bell, Dalton, Olson by a less-effective but parallel heuristic to the interfaces between sub-domains [22]. An implicit assumption in this strategy is that the interiors of the partitions (collectively) contain the vast majority of the entire domain, otherwise the serial heuristic has little impact on the...

117 | A multigrid solver for boundary value problems using programmable graphics hardware
- Goodnight, Woolley, et al.
- 2003
Citation Context ... partitioning and expose parallelism to the finest granularity — i.e., one thread per matrix row or one thread per nonzero entry. Geometric multigrid methods were the first to be parallelized on GPUs [19, 9, 34]. These “GPGPU” approaches, which preceded the introduction of the CUDA and OpenCL programming interfaces, programmed the GPU through existing graphics application programming interfaces (APIs) such a...

109 | Efficient sparse matrix-vector multiplication on CUDA, NVIDIA
- Bell, Garland
- 2008
Citation Context ...mponent is that of simplifying an array to a single value, or a reduction. In Thrust, the reduce algorithm reduces a range of numbers to a single value by successively summing values together: reduce([3, 4, 1, 5, 2])→ 15. The same algorithm can be used to determine the maximum entry, by specifying maximum for the reduction operator: reduce([3, 4, 1, 5, 2], maximum)→ 5. In general, any function that is both commu...

54 | Algebraic multigrid, in Multigrid methods
- Ruge, Stüben
- 1987
Citation Context ...AMG) attempt to automatically construct a hierarchy of grids and intergrid transfer operators without explicit knowledge of the underlying problem — i.e., directly from the linear system of equations [32, 37]. In the remainder of this section, we outline the basic components of AMG in an aggregation context [37] and highlight the necessary sparse matrix computations used in the process. We restrict our at...

50 | Robustness and scalability of algebraic multigrid
- Cleary, Falgout, et al.
- 1998
Citation Context ...omposed into scalable parallel primitives. Parallel approaches to multigrid are plentiful. Algebraic multigrid methods have been successfully parallelized on distributed-memory CPU clusters using MPI [12, 10] and more recently with a combination of MPI and OpenMP [2], to better utilize multi-core CPU nodes. While such techniques have demonstrated scalability to large numbers of processors, they are not im...

48 | ML 5.0 smoothed aggregation user’s guide
- Gee, Siefert, et al.
- 2006
Citation Context ...inclusive” variant of the scan primitive, inclusive scan([3, 4, 1, 5, 2])→ [3, 7, 8, 13, 15], while the exclusive scan algorithm computes the “exclusive” variant, exclusive scan([3, 4, 1, 5, 2], 10)→ [10, 13, 17, 18, 23], which incorporates a user-specified starting value and excludes the final sum. As with reduction, the scan algorithms accept other binary operations such as maximum, inclusive scan([3, 4, 1, 5, 2], ...

46 | Parallel multigrid smoothing: polynomial versus Gauss-Seidel
- Adams, Brezina, et al.
Citation Context ... multigrid on GPUs [18, 21], however hierarchy construction remained on the CPU. A parallel aggregation scheme is described in [35] that is similar to ours based on maximal independent sets, while in [1] the effectiveness of parallel smoothers based on sparse matrix-vector products is demonstrated. Although these works were implemented for distributed CPU clusters, they are amenable to fine-grained p...

45 | Thrust: A parallel template library, 2010. Version 1.3.0
- Hoberock, Bell
Citation Context ...of our solver, and hence the underlying parallel primitives, is demonstrated in Section 5. Our AMG solver is implemented almost exclusively with the parallel primitives provided by the Thrust library [23]. In the remaining part of this section we identify a few of the most important Thrust algorithms and illustrate their usage. For ease of exposition we omit some of the precise usage details, however ...

41 | Two fast algorithms for sparse matrices: Multiplication and permuted transposition
- Gustavson
- 1978
Citation Context ...duct is of the form [n×n] ∗ [n×nc] (or the transpose), while the second product is of the form [nc × n] ∗ [n× nc]. Efficient sequential sparse matrix-matrix multiplication algorithms are described in [20, 3]. In these methods the Compressed Sparse Row (CSR) format is used, which provides O(1) indexing of the matrix rows. As a result, the restriction matrix Rk = P T k is formed explicitly in CSR format be...

41 | Streaming multigrid for gradient-domain operations on large images
- Kazhdan, Hoppe
- 2008
Citation Context ...ammed the GPU through existing graphics application programming interfaces (APIs) such as OpenGL and Direct3d. Subsequent works demonstrated GPU-accelerated geometric multigrid for image manipulation [25] and CFD [13] problems. Previous works have implemented the cycling stage of algebraic multigrid on GPUs [18, 21], however hierarchy construction remained on the CPU. A parallel aggregation scheme is ...

36 | Understanding throughput-oriented architectures
- Garland, Kirk
- 2010
Citation Context ...n contrast to traditional CPU architectures, which are optimized for completing scalar tasks with minimal latency, modern GPUs are tailored for parallel workloads that emphasize total task throughput [16]. Therefore, harnessing the computational resources of such processors requires programmers to decompose algorithms into thousands or tens of thousands of separate, fine-grained threads of executi...

32 | Parallel smoothed aggregation multigrid: Aggregation strategies on massively parallel machines, Report, Sandia National Laboratories
- Tuminaro, Tong
- 2000
Citation Context ... problems. Previous works have implemented the cycling stage of algebraic multigrid on GPUs [18, 21], however hierarchy construction remained on the CPU. A parallel aggregation scheme is described in [35] that is similar to ours based on maximal independent sets, while in [1] the effectiveness of parallel smoothers based on sparse matrix-vector products is demonstrated. Although these works were imple...

31 | Using GPUs to improve multigrid solver performance on a cluster
- Göddeke, Strzodka, et al.
- 2008
Citation Context .... Subsequent works demonstrated GPU-accelerated geometric multigrid for image manipulation [25] and CFD [13] problems. Previous works have implemented the cycling stage of algebraic multigrid on GPUs [18, 21], however hierarchy construction remained on the CPU. A parallel aggregation scheme is described in [35] that is similar to ours based on maximal independent sets, while in [1] the effectiveness of pa...

30 | A survey of parallelization techniques for multigrid solvers
- Chow, Falgout, et al.
- 2006
Citation Context ...omposed into scalable parallel primitives. Parallel approaches to multigrid are plentiful. Algebraic multigrid methods have been successfully parallelized on distributed-memory CPU clusters using MPI [12, 10] and more recently with a combination of MPI and OpenMP [2], to better utilize multi-core CPU nodes. While such techniques have demonstrated scalability to large numbers of processors, they are not im...

27 | Parallel white noise generation on a GPU via cryptographic hash
- Tzeng, Wei
- 2008
Citation Context ...integer hash function. Although not a source of high quality random numbers, the resulting values are adequate for our purpose. More sophisticated hash-based random number generators are discussed in [36, 40]. ...

25 | Revisiting sorting for GPGPU stream architectures
- Merrill, Grimshaw
- 2010
Citation Context ... in linear algebra. Given the broad scope of their usage, special emphasis has been placed on the performance of primitives and very highly-optimized implementations are readily available for the GPU [33, 28, 29]. The efficiency of our solver, and hence the underlying parallel primitives, is demonstrated in Section 5. Our AMG solver is implemented almost exclusively with the parallel primitives provided by th...

24 | Fast sparse matrix-vector multiplication by exploiting variable block structure
- Vuduc, Moon
- 2005
Citation Context ...s and memory access patterns, is more challenging to implement than the aforementioned vector operations. Nevertheless efficient techniques exist for matrices with a wide variety of sparsity patterns [11, 9, 38, 16, 39, 4, 5]. Our implementations of sparse matrix-vector multiplication are described in [5, 7]. In Algorithm 3 sparse matrix-vector multiplication is used to compute the residu...

17 | Sparse matrix multiplication package
- Bank, Douglas
- 1993
Citation Context ...ve whenever disordered data must be binned or easily indexed. This is helpful in many of our transformations in the AMG setup phase. By default, the sort algorithm sorts data in ascending order, sort([3, 4, 1, 5, 2])→ [1, 2, 3, 4, 5], which is equivalent to specifying that elements should be compared using the standard less comparison functor. Thrust also provides the sort by key algorithm for sorting (logical) ...

17 | Vectorized sparse matrix multiply for compressed row storage format
- D’Azevedo, Fahey, et al.
- 2005
Citation Context ... fundamental component of common algorithms such as stream compaction. In Thrust, the inclusive scan algorithm computes the “inclusive” variant of the scan primitive, inclusive scan([3, 4, 1, 5, 2])→ [3, 7, 8, 13, 15], while the exclusive scan algorithm computes the “exclusive” variant, exclusive scan([3, 4, 1, 5, 2], 10)→ [10, 13, 17, 18, 23], which incorporates a user-specified starting value and excludes the fi...

17 | Parallel scan for stream architectures
- Merrill, Grimshaw
- 2009
Citation Context ... in linear algebra. Given the broad scope of their usage, special emphasis has been placed on the performance of primitives and very highly-optimized implementations are readily available for the GPU [33, 28, 29]. The efficiency of our solver, and hence the underlying parallel primitives, is demonstrated in Section 5. Our AMG solver is implemented almost exclusively with the parallel primitives provided by th...

11 | GPU random numbers via the tiny encryption algorithm
- Zafar, Olano, et al.
Citation Context ...t of adjacent keys that are equivalent, it reduces the corresponding values together and writes the key and the reduced value to separate output arrays. For example, reduce by key([0, 0, 1, 1, 1, 2], [10, 20, 30, 40, 50, 60])→ [0, 1, 2], [30, 120, 60]. Note that the key and value sequences are stored in separate arrays. This “structure of arrays” representation is generally more computationally efficient than the alterna...

8 | Challenges of scaling algebraic multigrid across modern multicore architectures
- Baker, Gamblin, et al.
Citation Context ... the plus functor, transform([3, 4, 1], [4, 5, 7], plus)→ [7, 9, 8], implements vector addition. 2.4. Gathering and Scattering. Related to transformation are the gather and scatter algorithms, gather([3, 0, 2], [11, 12, 13, 14])→ [14, 11, 13], scatter([3, 0, 2], [11, 12, 13], [∗, ∗, ∗, ∗])→ [12, ∗, 13, 11], which copy values based on an index map ([3, 0, 2] in the examples). Here, the placeholder ∗ represe...

8 | General-purpose sparse matrix building blocks using the NVIDIA CUDA technology platform
- Christen, Schenk, et al.
- 2007
Citation Context ...s and memory access patterns, is more challenging to implement than the aforementioned vector operations. Nevertheless efficient techniques exist for matrices with a wide variety of sparsity patterns [11, 9, 38, 16, 39, 4, 5]. Our implementations of sparse matrix-vector multiplication are described in [5, 7]. In Algorithm 3 sparse matrix-vector multiplication is used to compute the residu...

6 | A Fast Double Precision CFD Code using CUDA
- Cohen, Molemaker
- 2009
- 2009
Citation Context ... through existing graphics application programming interfaces (APIs) such as OpenGL and Direct3d. Subsequent works demonstrated GPU-accelerated geometric multigrid for image manipulation [25] and CFD [13] problems. Previous works have implemented the cycling stage of algebraic multigrid on GPUs [18, 21], however hierarchy construction remained on the CPU. A parallel aggregation scheme is described in ...

3 | A new perspective on strength measures in algebraic multigrid, Numerical Linear Algebra with Applications
- Olson, Schroder, et al.
Citation Context ...osing fine-grained parallelism in a well-known algebraic multigrid method, we remark that more robust strength-of-connection schemes should be employed to improve convergence for anisotropic problems [31]. In contrast with the results reported in Table 5.10, where two independent hierarchies generated on the CPU and GPU are used in the cycling phase of the solver, in Figure 5.11 the same multigrid hie...

2 | Parallel algebraic multigrid on general purpose GPUs
- Haase, Liebmann, et al.
Citation Context .... Subsequent works demonstrated GPU-accelerated geometric multigrid for image manipulation [25] and CFD [13] problems. Previous works have implemented the cycling stage of algebraic multigrid on GPUs [18, 21], however hierarchy construction remained on the CPU. A parallel aggregation scheme is described in [35] that is similar to ours based on maximal independent sets, while in [1] the effectiveness of pa...

2 | Scan primitives for GPU computing
- Sengupta, Harris, et al.
- 2007
Citation Context ...ess structure. In Section 5 we examine the cost of the solve phase in more detail. 2. Parallel Primitives. Our method for exposing fine-grained parallelism in AMG leverages (data) parallel primitives [8, 33]. We use the term primitives to refer to a collection of fundamental algorithms that emerge in numerous contexts such as ...

2 | How to optimize geometric multigrid methods on GPUs
- Stürmer, Köstler, et al.
- 2011
- 2011
Citation Context ... partitioning and expose parallelism to the finest granularity — i.e., one thread per matrix row or one thread per nonzero entry. Geometric multigrid methods were the first to be parallelized on GPUs [19, 9, 34]. These “GPGPU” approaches, which preceded the introduction of the CUDA and OpenCL programming interfaces, programmed the GPU through existing graphics application programming interfaces (APIs) such a...

2 | Fast Sparse Matrix-Vector Multiplication by Exploiting Variable Block Structure
- Vuduc, Moon
Citation Context ...s and memory access patterns, is more challenging to implement than the aforementioned vector operations. Nevertheless efficient techniques exist for matrices with a wide variety of sparsity patterns [11, 9, 38, 15, 39, 4, 5]. Our implementations of sparse matrix-vector multiplication are described in [5, 7]. In Algorithm 3 sparse matrix-vector multiplication is used to compute the residual, to restrict the fine-level res...

1 | Generic parallel algorithms for sparse matrix and graph computations, http://code.google.com/p/cusp-library
- CUSP
- 2009
- 2009
Citation Context ...mary contributions are parallel algorithms for aggregation and sparse matrix-matrix multiplication. The complete source code for the method presented here is available in the open-source Cusp library [6]. The methods described in this section are designed for the coordinate (COO) sparse matrix format. The COO format is comprised of three arrays I, J, and V, which store the row indices, column indices...

1 | General-purpose sparse matrix building blocks using the NVIDIA CUDA technology platform
- Christen, Schenk, Burkhart
- 2007
- 2007
Citation Context ...unctor, transform([3, 4, 1], [4, 5, 7], plus)→ [7, 9, 8], implements vector addition. 2.4. Gathering and Scattering. Related to transformation are the gather and scatter algorithms, gather([3, 0, 2], [11, 12, 13, 14])→ [14, 11, 13], scatter([3, 0, 2], [11, 12, 13], [∗, ∗, ∗, ∗])→ [12, ∗, 13, 11], which copy values based on an index map ([3, 0, 2] in the examples). Here, the placeholder ∗ represents elements of th...

1 | Vectorized sparse matrix multiply for compressed row storage format
- D’Azevedo, Fahey, et al.
- 2005
- 2005
Citation Context ...s and memory access patterns, is more challenging to implement than the aforementioned vector operations. Nevertheless efficient techniques exist for matrices with a wide variety of sparsity patterns [11, 9, 38, 16, 39, 4, 5]. Our implementations of sparse matrix-vector multiplication are described in [5, 7]. In Algorithm 3 sparse matrix-vector multiplication is used to compute the residu...