12 citations found.
Petitet, A. Algorithmic Redistribution Methods for Block Cyclic Decompositions, PhD thesis, Department of Computer Science, University of Tennessee, Knoxville, Tennessee 37996-3450

This paper is cited in the following contexts:
Deploying Parallel Numerical Library Routines To Cluster.. - Roche, Dongarra (2002)

....the user in a timely fashion. For dense linear algebra kernels being studied in parallel computing environments it is known that a 2D block cyclic mapping of the naturally structured data (e.g. matrix A and vector b in Ax = b) provides excellent load balance during parallel runs. See references [33, 34, 35] for instance. The mapping is a function of the problem size (m × n, or n for square matrices, where an n × n matrix A of n^2 · sizeof(double) bytes has problem size n), the block sizes (nb_row, nb_column, or nb), the number of process columns in the logical rectangular process grid (npcols), the number of ....

Petitet, A. Algorithmic Redistribution Methods for Block Cyclic Decompositions, PhD thesis, Department of Computer Science, University of Tennessee, Knoxville, Tennessee 37996-3450


Parallelizing the Divide and Conquer Algorithm for the.. - Tisseur, Dongarra (2000)   (2 citations)

....a two-dimensional block cyclic distribution for U, the eigenvector matrix of the rank-one update, and Q, the matrix of the back transformation. For linear algebra routines and matrix multiplications, two-dimensional block cyclic distribution has been shown to be efficient and scalable [7], [21], [28]. The ScaLAPACK software has adopted this data layout. The block cyclic distribution is a generalization of the block and the cyclic distributions. The processes of the parallel computer are first mapped onto a two-dimensional rectangular grid of size P × Q. Any general m × n dense ....

....cost of the whole algorithm, it does not matter if the work is not well distributed there. However, good load balancing of the work is assured when the grid P × Q of processes is such that lcm(P, Q) = 1. In this case, at the leaves of the tree, all the processes hold a subproblem [28]. The worst case happens when lcm(P, Q) = P or lcm(P, Q) = Q. [Figure 4.2: Active part of the matrix Q held by each process. Panels: 1D block distribution; 2D block ....]

A. Petitet. Algorithmic Redistribution Methods for Block Cyclic Decompositions. PhD thesis, University of Tennessee, Knoxville, TN, 1996.


A Parallel Divide And Conquer Algorithm For The Symmetric.. - Tisseur, Dongarra (1999)

....corner. These blocks are then uniformly distributed in each dimension of the process grid. See [4, Chap. 4] for more details. We chose this data layout for several reasons. For linear algebra routines two-dimensional block cyclic distribution has been shown to be efficient and scalable [7], [20], [26]. ScaLAPACK has adopted this distribution and our aim is to write a code in the style of this software. Moreover, with this data layout, we can block partition our algorithm in order to reduce the frequency with which data is transferred between processes ....

....Then, at the leaves of the tree, processes that hold a diagonal block solve their own subproblems of size n_b × n_b using the QR algorithm or the serial divide and conquer algorithm. For a grid such that lcm(P_r, P_c) = 1, all the processes hold a subproblem at the leaves of the tree [26] and then good load balancing of the work is ensured. When lcm(P_r, P_c) = P_r or lcm(P_r, P_c) = P_c, some processes hold several subproblems at the leaves of the tree and some of them hold none. However, as the computational cost of this first step is negligible compared with the ....

A. Petitet, Algorithmic Redistribution Methods for Block Cyclic Decompositions, Ph.D. thesis, University of Tennessee, Knoxville, TN, 1996.


A Comparison of Lookahead and Algorithmic Blocking Techniques.. - Strazdins (1998)   (1 citation)

.... distribution over a P × Q logical processor grid [3] where, for an N × N global matrix A, block (i, j) of A will be on processor (i mod P, j mod Q). We will now review two established techniques for parallel panel formation, known as storage blocking, where = r = s, and algorithmic blocking [7, 12, 9, 11], where m ; r = s 1. Storage blocking suffers from load imbalance on the panel formation stage, in that only one row or column of processors of the grid will be involved in this stage, which is an O( N) fraction of the overall computation. Furthermore, there is an O(N 2 (r=Q s=P ) load ....

A. Petitet. Algorithmic Redistribution Methods for Block Cyclic Decompositions. PhD thesis, University of Tennessee, Knoxville, December 1996. xv+193p.
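
The block-to-processor rule quoted in the Strazdins excerpt above, "block (i, j) of A will be on processor (i mod P, j mod Q)", can be illustrated with a minimal sketch. The function names and the element-level helper (which assumes r × s blocks and zero-based indices) are illustrative assumptions, not code from any of the cited papers or from ScaLAPACK.

```python
from collections import Counter


def block_owner(i, j, P, Q):
    """Grid coordinates of the process owning block (i, j) on a P x Q grid."""
    return (i % P, j % Q)


def element_owner(gi, gj, r, s, P, Q):
    """Owner of global element (gi, gj), assuming r x s blocks (hypothetical helper)."""
    return block_owner(gi // r, gj // s, P, Q)


def blocks_per_process(block_rows, block_cols, P, Q):
    """Count how many blocks each process owns, to inspect load balance."""
    return Counter(
        block_owner(i, j, P, Q)
        for i in range(block_rows)
        for j in range(block_cols)
    )
```

For example, with a 2 × 3 grid, `blocks_per_process(4, 6, 2, 3)` assigns the same number of blocks to every process, which is the load-balance property the excerpts attribute to this layout.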


A New Parallel Matrix Multiplication Algorithm on.. - Choi (1997)   (4 citations)

....data distribution as the first LCM block, that is, when an operation is executed on the first LCM block, the same operation can be done simultaneously on other LCM blocks. And the LCM concept is applied to design software libraries for dense linear algebra computations with algorithmic blocking [17, 19]. [Figure 2: A snapshot of SUMMA. The darkest blocks are broadcast first, and lightest blocks are broadcast later.] 3. Algorithms. 3.1. SUMMA. SUMMA is basically a sequence of rank k_b ....

A. Petitet. Algorithmic Redistribution Methods for Block Cyclic Decompositions. 1996. Ph.D. Thesis, University of Tennessee, Knoxville.


Parallelizing the Divide and Conquer Algorithm for the.. - Tisseur, Dongarra (1998)   (2 citations)

....a two-dimensional block cyclic distribution for U, the eigenvector matrix of the rank-one update, and Q, the matrix of the back transformation. For linear algebra routines and matrix multiplications, two-dimensional block cyclic distribution has been shown to be efficient and scalable [7], [21], [28]. The ScaLAPACK software has adopted this data layout. The block cyclic distribution is a generalization of the block and the cyclic distributions. The processes of the parallel computer are first mapped onto a two-dimensional rectangular grid of size P × Q. Any general m × n dense matrix ....

....cost of the whole algorithm, it does not matter if the work is not well distributed there. However, good load balancing of the work is assured when the grid P × Q of processes is such that lcm(P, Q) = 1. In this case, at the leaves of the tree, all the processes hold a subproblem [28]. The worst case happens when lcm(P, Q) = P or lcm(P, Q) = Q. [Figure 4.2: Active part of the matrix Q held by each process. Labels: Level 2, Level 1, Level 0. Panels: 1D block distribution; 2D ....]

A. Petitet. Algorithmic Redistribution Methods for Block Cyclic Decompositions. PhD thesis, University of Tennessee, Knoxville, TN, 1996.


Reducing Software Overheads in Parallel Linear Algebra Libraries - Strazdins (1997)

....of r, s, P, Q is a power of 2). The latter is rather high, considering that it will be repeated N^2/(2P) times per processor during an N × N LL^T factorization on a P × P processor grid. A more efficient way of performing the calculation of Equation 4 is via the concept of LCM tables [14], which effectively exploits the principle of locality. Equation 2.5.20 of [14] states a useful property of these tables: for D_0 = (r_0, r, p_0, P), D_1 = (s_1, s, q_0, Q), local block (l, m) will contain part of the diagonal if and only if: 1 − s ≤ LCMT_{l,m} ≤ r − 1, where Definition ....

....it will be repeated N^2/(2P) times per processor during an N × N LL^T factorization on a P × P processor grid. A more efficient way of performing the calculation of Equation 4 is via the concept of LCM tables [14], which effectively exploits the principle of locality. Equation 2.5.20 of [14] states a useful property of these tables: for D_0 = (r_0, r, p_0, P), D_1 = (s_1, s, q_0, Q), local block (l, m) will contain part of the diagonal if and only if: 1 − s ≤ LCMT_{l,m} ≤ r − 1, where Definition 2.5.2 of [14] has been adapted slightly for this situation to LCMT_{l,m} = mQs ....

[Article contains additional citation context not shown here]

A. Petitet. Algorithmic Redistribution Methods for Block Cyclic Decompositions. PhD thesis, University of Tennessee, Knoxville, December 1996. xv+193p.


Parallelizing the Divide and Conquer Algorithm for the.. - Tisseur, Dongarra (1998)   (2 citations)

....a two-dimensional block cyclic distribution for U, the eigenvector matrix of the rank-one update, and Q, the matrix of the back transformation. For linear algebra routines and matrix multiplications, two-dimensional block cyclic distribution has been shown to be efficient and scalable [7], [21], [28]. The ScaLAPACK software has adopted this data layout. The block cyclic distribution is a generalization of the block and the cyclic distributions. The processes of the parallel computer are first mapped onto a two-dimensional rectangular grid of size P × Q. Any general m × n dense ....

....cost of the whole algorithm, it does not matter if the work is not well distributed there. However, good load balancing of the work is assured when the grid P × Q of processes is such that lcm(P, Q) = 1. In this case, at the leaves of the tree, all the processes hold a subproblem [28]. The worst case happens when lcm(P, Q) = P or lcm(P, Q) = Q. For a given rank-one update Q(D + ρzz^T)Q^T, the processes that collaborate are those that hold a part of the global matrix Q. By contrast with previous implementations, with the two-dimensional block cyclic distribution all the ....

A. Petitet. Algorithmic Redistribution Methods for Block Cyclic Decompositions. PhD thesis, University of Tennessee, Knoxville, TN, 1996.


Optimal Load Balancing Techniques for Block-Cyclic.. - Strazdins (1998)   (1 citation)

.... over a P × Q logical processor grid (see Figure 1) [5] where, for an N × N global matrix A, block (i, j) of A will be on processor (i mod P, j mod Q). We will now review two established techniques for parallel panel formation, known as storage blocking, where = r = s, and algorithmic blocking [6, 13, 9, 10], where m ; r s 1. Storage blocking suffers from load imbalance on the panel formation stage, in that only one row or column of processors of the grid will be involved in this stage, which is an O( N) fraction of the overall computation. Furthermore, there is an O(N 2 (r=Q s=P ) ....

A. Petitet. Algorithmic Redistribution Methods for Block Cyclic Decompositions. PhD thesis, University of Tennessee, Knoxville, December 1996. xv+193p.


Scheduling Block-Cyclic Array Redistribution - Desprez, Dongarra, Petitet.. (1997)   (8 citations)  Self-citation (Petitet)

....see this, note that pr − qs = −r mod g (because g divides Pr); hence, no multiple of g can be added to pr − qs so that it lies in the interval [1 − r, s − 1]. Therefore, no message will be sent from p to q during the redistribution. [Footnote 2: For another proof, see Petitet [14].] In the following, our aim is to characterize the pairs of processors that need to communicate during the redistribution operation (in the case g r s). Consider the following function f : [0 : P − 1] × [0 : Q − 1] → ....

Antoine Petitet. Algorithmic redistribution methods for block cyclic decompositions. PhD thesis, University of Tennessee at Knoxville, December 1996.


Scheduling Block-Cyclic Array Redistribution - Desprez, Dongarra, Petitet.. (1997)   (8 citations)  Self-citation (Petitet)

....extended Euclid algorithm provides such numbers for relatively prime r and s). We have the following result. Proposition 1. Assume that gcd(r, s) = 1. For 0 ≤ k < g, class(k) = { (p, q)^T = λ (s, r)^T + k (u, v)^T mod (P, Q)^T ; 0 ≤ λ < PQ/g }. [Footnote 3: For another proof, see Petitet [13].] Proof. First, to see that PQ/g indeed is an integer, note that PQ = PQ(ru − sv) = PrQu − QsPv. Since g divides both Pr and Qs, it divides PQ. Two different classes are disjoint (by definition). It turns out that all classes have the same number of elements. To see this, note ....

Antoine Petitet. Algorithmic redistribution methods for block cyclic decompositions. PhD thesis, University of Tennessee at Knoxville, December 1996.
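
The excerpt above relies on integers u and v with ru − sv = 1 for relatively prime r and s, obtained from the extended Euclid algorithm. A minimal sketch of that step follows; the function names are illustrative assumptions, and this is a textbook extended Euclid routine rather than the authors' code.

```python
def extended_gcd(a, b):
    """Return (g, x, y) with g = gcd(a, b) and a*x + b*y == g."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    return old_r, old_x, old_y


def bezout_pair(r, s):
    """Return (u, v) with r*u - s*v == 1; requires gcd(r, s) == 1."""
    g, x, y = extended_gcd(r, s)
    if g != 1:
        raise ValueError("r and s must be relatively prime")
    # r*x + s*y == 1, so u = x and v = -y satisfy r*u - s*v == 1.
    return x, -y
```

Multiplying r·u − s·v = 1 by PQ gives PQ = PrQu − QsPv, which is the identity the proof uses to show that g divides PQ.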


Algorithmic Redistribution Methods for Block Cyclic.. - Petitet, Dongarra (1998)   (9 citations)  Self-citation (Petitet)

....of local coordinates ( m) minus the global number of rows up to the blocks of local coordinates (l; This constructive definition is more general than the one used in this document. In particular, Definition 2.2 can easily be adapted to a block cyclic distribution with a partial first block [52]. In other words, the first block of rows (respectively columns) is of size ir (respectively is) instead of r (respectively s). Such a generalization is convenient to allow for the specification of submatrix operands whose upper-left corner is not aligned on block boundaries [45]. The equation ....

....( P Q ( r s ) g; P Q) if gcd(r, s) divides k; min( P Q ( r s − gcd(r, s) g; P Q) otherwise. It is difficult to compute analytically the probability that all processes will own k diagonal entries. However, it is likely that this probability, if it exists, converges rapidly [52]. It is possible to rely on a computer to enumerate all 4-tuples in a finite and practical range such that the quantities r s − gcd(r, s) or r s are greater than or equal to g. The results are presented in Figure 4. It is important to notice that in practice, i.e. for a finite range of ....

[Article contains additional citation context not shown here]

A. Petitet. Algorithmic Redistribution Methods for Block Cyclic Decompositions. PhD thesis, University of Tennessee, Knoxville, 1996. (also LAPACK Working Note No.128).
