## Tiled QR factorization algorithms (2011)

Citations: 6 (3 self)

### Citations

163 | A class of parallel tiled linear algebra algorithms for multicore architectures
- Buttari, Langou, et al.
- 2009

157 | Roofline: an insightful visual performance model for multicore architectures
- Williams, Waterman, et al.
- 2009
Citation Context ...tained by modeling the limiting factor of the execution time as either the critical path, or the sequential time divided by the number of processors. This is similar in approach to the Roofline model [19]. Taking $\gamma_{seq}$ as the sequential performance, $T$ as the total number of flops, $cp$ as the length of the critical path, and $P$ as the number of processors, the predicted performance, $\gamma_{pred}$, is $\gamma_{pred} = \gamma_{seq}$...
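
The bound described in this excerpt can be sketched in a few lines of Python. The excerpt cuts off before the formula is complete, so the closed form used here, `gamma_seq * T / max(T / P, cp)`, is an assumption reconstructed from the prose (execution time limited either by the critical path or by the sequential time divided by the number of processors). The function name and argument names are hypothetical, and `T` and `cp` are assumed to be expressed in the same unit of work.

```python
def predicted_performance(gamma_seq, total_work, critical_path, processors):
    """Roofline-style estimate of achievable performance (assumed form).

    gamma_seq     -- sequential performance (e.g. GFLOP/s on one core)
    total_work    -- total work T, in the same unit as critical_path
    critical_path -- length cp of the critical path
    processors    -- number of processors P
    """
    if processors < 1:
        raise ValueError("processors must be at least 1")
    if total_work <= 0 or critical_path <= 0:
        raise ValueError("total_work and critical_path must be positive")
    # The run cannot finish faster than the critical path allows,
    # nor faster than a perfect split of the work over P processors.
    limiting_time = max(critical_path, total_work / processors)
    # Speedup over the sequential run is T / limiting_time.
    return gamma_seq * total_work / limiting_time
```

With few processors the `T / P` term dominates and the estimate scales linearly with `P`; once `T / P` drops below `cp`, adding processors no longer raises the estimate.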

79 | Parallel tiled QR factorization for multicore architectures
- Buttari, Langou, et al.
Citation Context ...is confined to an extent much less than the full column span, which enables concurrency with other reflections. Tiled QR factorization in the context of multicore architectures has been introduced in [5, 6, 15]. Initially the focus was on square matrices and the sequence of unitary transformations presented was analogous to Sameh-Kuck [16], which corresponds to reducing the panels with flat trees. The possi...

50 | On stable parallel linear system solvers
- Sameh, Kuck
- 1978
Citation Context ... the context of multicore architectures has been introduced in [5, 6, 15]. Initially the focus was on square matrices and the sequence of unitary transformations presented was analogous to Sameh-Kuck [16], which corresponds to reducing the panels with flat trees. The possibility of using any tree in order to either maximize parallelism or minimize communication is explained in [10]. The focus of this ...

44 | Programming matrix algorithms-by-blocks for thread-level parallelism
- Quintana-Ortí, Quintana-Ortí, et al.
- 2009
Citation Context ...is confined to an extent much less than the full column span, which enables concurrency with other reflections. Tiled QR factorization in the context of multicore architectures has been introduced in [5, 6, 15]. Initially the focus was on square matrices and the sequence of unitary transformations presented was analogous to Sameh-Kuck [16], which corresponds to reducing the panels with flat trees. The possi...

35 | Minimizing Communication in Sparse Matrix Solvers
- Mohiyuddin, Hoemmen, et al.
- 2009
Citation Context ...in [1]. The ScaLAPACK algorithm is used independently on each cluster on a large parallel distributed rectangular tile; then, a binary tree is used at the grid level among the clusters. Demmel et al. [9] use a binary tree on top of a flat tree for tall and skinny matrices. The binary tree is therefore used on rectangular tiles. The flat tree is used locally on the nodes to reduce sequential communica...

34 | Comparative study of one-sided factorizations with multiple software packages on multi-core hardware
- Agullo, Hadri, et al.
- 2009
Citation Context ...perations. Each tile is of size $n_b \times n_b$, where $n_b$ is a parameter tuned to squeeze the most out of arithmetic units and memory hierarchy. Typically, $n_b$ ranges from 80 to 200 on state-of-the-art machines [3]. Algorithm 1 outlines a naive tiled QR algorithm, where loop indices represent tiles: Algorithm 1: Naive QR algorithm for a tiled $p \times q$ matrix. for k = 1 to min(p, q) do for i = k + 1 to p do elim(i, ...
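
The loop nest of Algorithm 1 quoted in this excerpt can be sketched in Python as a generator of elimination steps. The excerpt is cut off inside the call to `elim`, so the remaining arguments are not known from this page. The sketch assumes, purely for illustration, that each step is recorded as a triple `(i, piv, k)` and that the pivot is the diagonal tile `k` (a flat-tree choice). The function name `naive_elimination_list` is hypothetical.

```python
def naive_elimination_list(p, q):
    """List the tile eliminations visited by the naive loop nest.

    p, q -- number of tile rows and tile columns (1-based indices, as in
            the quoted pseudocode).
    Returns a list of (i, piv, k) triples: tile row i is zeroed in tile
    column k using pivot row piv. Taking piv = k is an assumption; the
    excerpt does not show the full argument list of elim.
    """
    if p < 1 or q < 1:
        raise ValueError("p and q must be at least 1")
    steps = []
    for k in range(1, min(p, q) + 1):      # panel (tile column) index
        for i in range(k + 1, p + 1):      # tile rows below the diagonal
            steps.append((i, k, k))        # assumed: eliminate against row k
    return steps
```

For a 3 × 2 tiled matrix this yields three steps: rows 2 and 3 are eliminated in column 1, then row 3 in column 2.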

28 | Scheduling dense linear algebra operations on multicore processors
- Kurzak, Ltaief, et al.
- 2010
Citation Context ...SQRT. In order to alleviate these, we altered the dependency designation within each of the update kernels for the matrix of Householder reflectors, V, from INPUT to NODEP as is further explained in [13]. The dependencies between the tasks are still consistent since the T matrix within each update kernel continues to be designated as INPUT so that any subsequent task which overwrites this T matrix ca...

26 | Communication-avoiding parallel and sequential QR factorizations
- Demmel, Grigori, et al.
Citation Context ...ogous to Sameh-Kuck [16], which corresponds to reducing the panels with flat trees. The possibility of using any tree in order to either maximize parallelism or minimize communication is explained in [10]. The focus of this manuscript is in maximizing parallelism. Stemming from the observation that a binary tree is best for tall and skinny matrices and a flat tree is best for square matrices, Hadri ...

20 | QR factorization of tall and skinny matrices in a grid computing environment
- Agullo, Coti, et al.
- 2010
Citation Context ...uction trees to perform the QR factorization in parallel. Experimental results are given using a binary tree on tall and skinny matrices. The same algorithm is used on the grid (grid of clusters) in [1]. The ScaLAPACK algorithm is used independently on each cluster on a large parallel distributed rectangular tile; then, a binary tree is used at the grid level among the clusters. Demmel et al. [9] us...

20 | Achieving accurate and context-sensitive timing for code optimization
- Whaley, Castaldo
- 2008
Citation Context ... the kernel selection towards the performance of the algorithms, Figures 4 and 5 show both the in cache and out of cache performance using the No Flush and MultCallFlushLRU strategies as presented in [2, 18]. Since an algorithm using TT kernels will need to call [Footnote 3: When q = 1, Greedy and FlatTree exhibit close performance. They both perform a binary tree reduction, albeit with different row pairings.] TS...

14 | An alternative Givens ordering
- Modi, Clarke
- 1984
Citation Context ...trees which combine flat trees at the bottom level with a binary tree at the top level in order to exhibit more parallelism. Our theoretical and experimental work explains that we can adapt Fibonacci [14] and Greedy [7, 8] to tiles, resulting in yet better algorithms in terms of parallelism. Moreover our new algorithms do not have any tuning parameter such as the domain size in the case of [12]. The f...

13 | Parallel QR decomposition of a rectangular matrix
- Cosnard, Muller, et al.
- 1986
Citation Context ...ine flat trees at the bottom level with a binary tree at the top level in order to exhibit more parallelism. Our theoretical and experimental work explains that we can adapt Fibonacci [14] and Greedy [7, 8] to tiles, resulting in yet better algorithms in terms of parallelism. Moreover our new algorithms do not have any tuning parameter such as the domain size in the case of [12]. The focus of this manus...

13 | Tile QR Factorization with Parallel Panel Processing for Multicore Architectures
- Hadri, Ltaief, et al.
- 2010
Citation Context ...ocus of this manuscript is in maximizing parallelism. Stemming from the observation that a binary tree is best for tall and skinny matrices and a flat tree is best for square matrices, Hadri et al. [12] propose to use trees which combine flat trees at the bottom level with a binary tree at the top level in order to exhibit more parallelism. Our theoretical and experimental work explains that we can...

12 | Installation Guide for LAPACK
- Blackford, Dongarra
- 1999
Citation Context ...sing our unit task weight of $n_b^3/3$, with $m = p\,n_b$ and $n = q\,n_b$, we obtain $2mn^2 - \frac{2}{3}n^3$ flops, which is exactly the same number as for a standard Householder reflection algorithm as found in LAPACK (e.g., [4]). We note that this result is true if (a) we use TS kernels as well and if (b) we use any tiling (e.g., rectangular tiles). 2.3 Execution schemes In essence, the execution of a generic tiled algorit...
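
The flop count quoted in this excerpt can be checked numerically with a short Python sketch. The Householder count $2mn^2 - \frac{2}{3}n^3$ and the unit task weight $n_b^3/3$ come from the excerpt. The total of $6pq^2 - 2q^3$ weight units is not stated on this page; it is derived here by substituting $m = p\,n_b$ and $n = q\,n_b$ into the quoted count and dividing by $n_b^3/3$. All function names are hypothetical.

```python
from fractions import Fraction


def householder_qr_flops(m, n):
    """Flop count 2*m*n^2 - (2/3)*n^3 for an m-by-n Householder QR."""
    return 2 * m * n**2 - Fraction(2, 3) * n**3


def tiled_weight_units(p, q):
    """Total task weight for a p-by-q tiled matrix, in units of nb^3/3.

    Derived from the quoted count: (2*p*q^2 - (2/3)*q^3) * nb^3
    equals (6*p*q^2 - 2*q^3) * (nb^3 / 3).
    """
    return 6 * p * q**2 - 2 * q**3


def tiled_qr_flops(p, q, nb):
    """Convert the weight-unit total back to flops for tile size nb."""
    return tiled_weight_units(p, q) * Fraction(nb**3, 3)
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the two counts can be compared with `==` rather than with a floating-point tolerance.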

11 | Complexity of parallel QR factorization
- Cosnard, Robert
- 1986

7 | Enhancing parallelism of tile QR factorization for multicore architectures
- Hadri, Ltaief, et al.
- 2009
Citation Context ...nly contains TS kernels. We have mapped the PLASMA algorithm to a TT kernel algorithm using this conversion. Going from a TS kernel algorithm to a TT kernel algorithm is implicitly done by Hadri et al. [11] when going from their “Semi-Parallel” to their “Fully-Parallel” algorithms. 2.2 Elimination lists As stated above, any algorithm factorizing a tiled matrix of size $p \times q$ is characterized by its elimina...

5 | A fully empirical autotuned dense QR factorization for multicore architectures
- Agullo, Dongarra, et al.
- 2011
Citation Context ... the kernel selection towards the performance of the algorithms, Figures 4 and 5 show both the in cache and out of cache performance using the No Flush and MultCallFlushLRU strategies as presented in [2, 18]. Since an algorithm using TT kernels will need to call [Footnote 3: When q = 1, Greedy and FlatTree exhibit close performance. They both perform a binary tree reduction, albeit with different row pairings.] TS...