G. Karypis and V. Kumar, "A High Performance Sparse Cholesky Factorization Algorithm for Scalable Parallel Computers", pp. 140-147, in Proc. Fifth Symposium on the Frontiers of Massively Parallel Computation, McLean, VA, 1995. 115
....of the input matrix. An early processor mapping algorithm that attempts to reduce interprocess communication is subtree-to-subcube [34, 36], which works well for balanced tree topologies. Later research efforts have improved upon the load-balancing aspects for more general elimination trees [29, 32, 50, 55, 75, 76, 81]. Our static partitioning heuristic shares similarities with several of these methods. In less predictable environments than a dedicated homogeneous system, or even in less efficient communication architectures where contention induces large communication imbalances (e.g. software shared memory ....
G. Karypis and V. Kumar, "A High Performance Sparse Cholesky Factorization Algorithm for Scalable Parallel Computers", pp. 140-147, in Proc. Fifth Symposium on the Frontiers of Massively Parallel Computation, McLean, VA, 1995. 115
....revised simplex implementation, if efficient, would be much more generally useful and desirable than the kind of implementations described in this paper. However, such advances will at the very least have to await possible improvements in parallel general sparse matrix factoring. (In this regard, [24] holds some promise.) What we have shown here is that parallel dense simplex methods are neither trivial to implement nor completely without promise. Revised to Tableau Revised Method Time per Total Run Time Iteration Ratios Name 8K CM 2 16K CM 2 8K CM 2 16K CM 2 25fv47 127.74 97.52 1.22 ....
G. Karypis and V. Kumar, 1994. A High Performance Sparse Cholesky Factorization Algorithm For Scalable Parallel Computers, Technical Report 94--41, Department of Computer Science, University of Minnesota, Minneapolis, MN.
.... [Figure 2: Multifrontal algorithm for the computation of $L$] The parallelization of the multifrontal method is based on the algorithm presented in [11, 14]. It can best be described in terms of a simple example. Let us assume that a balanced elimination tree is given, and that we want to use 4 processors for the computation of $L$ (cf. Figure 3). The nodes in the 4 subtrees $T_0, T_1, T_2, T_3$ are completely mapped to the processors $P_{00}$, $P_{01}$, P ....
G. Karypis, V. Kumar, A High Performance Sparse Cholesky Factorization Algorithm for Scalable Parallel Computers, TR 94-41, Dept. of Comp. Science, Univ. of Minnesota, 1994.
.... Peyton, and Simon, 1987), (Benner, Montry, and Weigand, 1987), and (Lucas, Blank, and Tieman, 1987). More recent parallel implementations include those of (Gilbert and Schreiber, 1992), (Heath and Raghavan, 1994a), (Conroy, Kratzer, and Lucas, 1994), (Rothberg, 1994), (Gupta and Kumar, 1994), and (Karypis and Kumar, 1994). To summarize, at the current state of the art, the principal ingredients in an efficient parallel algorithm for sparse Cholesky factorization are
• an ordering that yields a short and well balanced elimination tree while also limiting fill,
• a multifrontal approach to exploit dense ....
Karypis, G., and Kumar, V., 1994. "A high performance sparse Cholesky factorization algorithm for scalable parallel computers, " Tech. Rept. 94-41, Dept. of Computer Science, University of Minnesota, Minneapolis, MN.
....by assigning subtrees of the front tree to subsets of the processes. We use a variant found in [25]. While the map works fairly well for perfectly balanced trees, it suffers from load imbalance in most situations.
• Domain decomposition map: This map, very similar to the subtree-subforest map [17], attempts to remove the imbalance of operations. The basic idea is to designate a set of subtrees, or domains, to be owned by a single process and then assign these domains to processes to balance the operations. The remaining fronts are assigned to processes using a balanced map. The owner map ....
G. Karypis and V. Kumar. A high performance sparse Cholesky factorization algorithm for scalable parallel computers. Technical Report 94-41, Dept. of Computer Science, University of Minnesota, 1994.
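The balancing step in the excerpt above (assigning domains to processes so that operation counts even out) can be illustrated with a minimal sketch. This is not the citing paper's method: it assumes a simple greedy rule (heaviest domain first, given to the currently least-loaded process), and the function name `assign_domains`, the domain labels, and the operation counts are all hypothetical.

```python
import heapq


def assign_domains(domain_ops, num_procs):
    """Greedy sketch: heaviest domain first, to the least-loaded process.

    domain_ops maps a (hypothetical) domain label to its operation count.
    Returns (owner, loads): the owning process per domain and the total
    operation count per process.
    """
    # Min-heap of (current load, process id); ties go to the lowest id.
    heap = [(0, p) for p in range(num_procs)]
    heapq.heapify(heap)

    owner = {}
    # Visit domains in decreasing order of work (label breaks ties).
    for name, ops in sorted(domain_ops.items(), key=lambda kv: (-kv[1], kv[0])):
        load, proc = heapq.heappop(heap)
        owner[name] = proc
        heapq.heappush(heap, (load + ops, proc))

    # Recompute the per-process totals from the final assignment.
    loads = [0] * num_procs
    for name, proc in owner.items():
        loads[proc] += domain_ops[name]
    return owner, loads


if __name__ == "__main__":
    # Made-up operation counts for five domains on two processes.
    owner, loads = assign_domains({"d0": 8, "d1": 7, "d2": 6, "d3": 5, "d4": 4}, 2)
    print(owner, loads)
```

A greedy rule like this does not guarantee an optimal balance; it is only meant to show the shape of the assignment problem the excerpt describes.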
....dissection [29, 30, 19] has been found to generate orderings that have both low fill-in and good parallelism. For the experiments presented in this paper we used spectral nested dissection. For a more extensive discussion on the effect of orderings on the performance of our algorithm refer to [21]. In the multifrontal method for Cholesky factorization, a frontal matrix $F_k$ and an update matrix $U_k$ are associated with each node $k$ of the elimination tree. The rows and columns of $F_k$ correspond to $t+1$ indices of $L$ in increasing order. In the beginning $F_k$ is initialized to an $(s+1) \times$ ....
....exchanges data with only one other processor during each one of these $\log p$ distributed extend-adds. The above is achieved by a careful embedding of the processor grids on the hypercube, and by carefully mapping rows and columns of each frontal matrix onto this grid. This mapping is described in [21], and is also given in Appendix B.
4 The New Algorithm
As mentioned in the introduction, the subtree-to-subcube mapping scheme used in [13] does not distribute the work equally among the processors. This load imbalance puts an upper bound on the achievable efficiency. For example, consider the ....
George Karypis and Vipin Kumar. A High Performance Sparse Cholesky Factorization Algorithm For Scalable Parallel Computers. Technical Report 94-41, Department of Computer Science, University of Minnesota, Minneapolis, MN, 1994.
....nested dissection [22, 23] has been found to generate orderings that have both low fill-in and good parallelism. For the experiments presented in this paper we used spectral nested dissection. For a more extensive discussion on the effect of orderings on the performance of our algorithm refer to [16]. In the multifrontal method for Cholesky factorization, a frontal matrix $F_k$ and an update matrix $U_k$ are associated with each node $k$ of the elimination tree. The rows and columns of $F_k$ correspond to $t+1$ indices of $L$ in increasing order. In the beginning $F_k$ is initialized to an $(s+1) \times (s+1)$ ....
....$k$ and forms the column $k$ of $L$. The remaining $t \times t$ matrix is called the update matrix $U_k$ and is passed on to the parent of $k$ in the elimination tree. Since matrices are symmetric, only the upper triangular part is stored. For further details on the multifrontal method, the reader should refer to [16], and to the excellent tutorial by Liu [18]. If some consecutively numbered nodes form a chain in the elimination tree, and the corresponding rows of $L$ have identical nonzero structure, then this chain is called a supernode. The supernodal elimination tree is similar to the elimination tree, but ....
George Karypis and Vipin Kumar. A High Performance Sparse Cholesky Factorization Algorithm For Scalable Parallel Computers. TR 94-41, Department of Computer Science, University of Minnesota, Minneapolis, MN, 1994. TR available in users/kumar/cholesky-forest.ps at anonymous FTP site ftp.cs.umn.edu.
....of up to 364 on 1024 processors and 230 on 512 processors over a highly efficient sequential implementation for fairly small problems. A recent implementation of a variation of this scheme with improved load balancing reports 4-6 GFLOPS on moderate-sized problems on a 256-processor Cray T3D [17]. To the best of our knowledge, this is the first parallel implementation of sparse Cholesky factorization that has delivered speedups of this magnitude and has been able to benefit from over a thousand processors. Although we focus on Cholesky factorization of symmetric positive definite matrices ....
....and have observed significant speedups in solving linear programming problems. Although we have observed through our experiments (Section 5) that the upper bound on efficiency due to load imbalance does not fall below 60-70% for hundreds of processors, even this bound can be improved further. In [17], Karypis and Kumar relax the subtree-to-subcube mapping to a subforest-to-subcube mapping, which significantly reduces load imbalances at the cost of a small increase in communication. Their variation of the algorithm achieves 4-6 GFLOPS on a 256-processor Cray T3D for moderate-size problems, ....
George Karypis and Vipin Kumar. A high performance sparse Cholesky factorization algorithm for scalable parallel computers. Technical report TR 94-41, Department of Computer Science, University of Minnesota, Minneapolis, MN, 1994.
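Several excerpts above refer to the subtree-to-subcube mapping and its load imbalance on unbalanced trees. The following is a minimal sketch of the basic idea only, under stated assumptions: the elimination tree is given as a hypothetical `children` dictionary, every node has at most two children, and the process group is simply halved at each two-child node. It does not reproduce the hypercube embedding or the subforest-to-subcube variant described in the cited report.

```python
def subtree_to_subcube(children, root, procs):
    """Sketch: map each tree node to the group of processes that share it.

    children maps a node to its list of child nodes (at most two, by
    assumption); procs is the ordered list of process ids for the root.
    """
    mapping = {}
    stack = [(root, tuple(procs))]
    while stack:
        node, group = stack.pop()
        mapping[node] = group
        kids = children.get(node, [])
        if len(kids) == 2 and len(group) > 1:
            # Two subtrees: each one gets half of the current group.
            half = len(group) // 2
            stack.append((kids[0], group[:half]))
            stack.append((kids[1], group[half:]))
        else:
            # Chain node, leaf, or a single process left: keep the group.
            for kid in kids:
                stack.append((kid, group))
    return mapping


if __name__ == "__main__":
    # A balanced seven-node tree with root 6, mapped onto four processes.
    tree = {6: [2, 5], 2: [0, 1], 5: [3, 4]}
    for node, group in sorted(subtree_to_subcube(tree, 6, [0, 1, 2, 3]).items()):
        print(node, group)
```

Because the split ignores how much work each subtree carries, two halves of the process group can receive very different operation counts when the tree is unbalanced, which is the imbalance the excerpts discuss.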