Li X. Sparse Gaussian elimination on high performance computers. PhD Thesis, Computer Science Department, University of California at Berkeley, 1996. 25
....[14] and factor only the diagonal blocks in the reduced form. Because many of the matrices in our test suite have numerous tiny diagonal blocks (most of them 1 by 1), we report the performance of factoring all the diagonal blocks with dimension at least 250. We factor the reordered matrix using SuperLU [6, 13] version 2.0, a state-of-the-art sparse partial pivoting LU code. SuperLU uses the Basic Linear Algebra Subroutines (BLAS); we used ATLAS, a high-performance implementation of the BLAS. We conducted the experiments on a 600MHz dual Pentium III computer with 2 GBytes of main memory running ....
Xiaoye S. Li. Sparse Gaussian Elimination on High Performance Computers. PhD thesis, Department of Computer Science, UC Berkeley, 1996.
....(and even developing) a variety of systems solver interfaces. The point of interest to the end user is that via NetSolve, he can seamlessly access all these solvers in a uniform way. We have integrated numerous solvers, including some from LAPACK [1], ARPACK [8], PETSc [2], Aztec [7], SuperLU [9], and MA28 [5]. Each of these packages has its own set of options and settings. We have used the PDF, as described in Section 2, to make these functions available from the NetSolve interface. After installing any of the NetSolve client interfaces, the user can then access our pool of NetSolve ....
X. Li. Sparse Gaussian Elimination on High Performance Computers. PhD thesis, University of California at Berkeley, 1996. Computer Science Dept.
....it is still desirable to use optimized BLAS. This poses additional problems since sparse matrix data is usually stored in a form which is not suitable for the BLAS. The two major approaches that cope with this are multifrontal and supernodal. We compare our recursive code only with SuperLU [14], which is a supernodal code, because until recently, multifrontal packages were suited only for a certain class of matrices. [Pseudocode listing not recoverable from the extracted text] ....
....supernodes allows only calls to Level 2 BLAS routines, which have a performance limited by the CPU memory bandwidth. To alleviate this problem, SuperLU reorganizes calls to BLAS routines to gain extra reuse of data already present in cache (this technique is referred to as the use of Level 2.5 BLAS [14]). [Figure legend: original nonzero value; 0: zero value introduced due to blocking; x: zero value introduced due to fill-in ....]
[Article contains additional citation context not shown here]
X. Li, Sparse Gaussian Elimination on High Performance Computers, PhD thesis, University of California at Berkeley, Computer Science Department, 1996.
....to respect wide separators, and similarly for COLAMD and WS COLAMD. In one set of experiments we first reduced the matrices to block triangular form (see [12]) and applied the ordering and factorization to the diagonal blocks in the reduced form. We always factor the reordered matrix using SuperLU [4, 11], a state-of-the-art code for sparse LU with partial pivoting. SuperLU uses the BLAS; we used the standard Fortran BLAS for the experiments. We plan to use a higher-performance implementation of the BLAS for the final version of the paper. We conducted the experiments on a 500MHz dual Pentium III ....
Xiaoye S. Li. Sparse Gaussian Elimination on High Performance Computers. PhD thesis, Department of Computer Science, UC Berkeley, 1996.
....(see [13]) and factor only the diagonal blocks in the reduced form. Because many of the matrices in our test suite have numerous tiny diagonal blocks (most of them 1 by 1), we report the performance of factoring all the diagonal blocks of size 250 or larger. We factor the reordered matrix using SuperLU [5, 12] version 2.0, a state-of-the-art code for sparse LU with partial pivoting. SuperLU uses the BLAS; we used ATLAS, a high-performance implementation of the BLAS. We conducted the experiments on a 600MHz dual Pentium III computer with 2 GBytes of main memory running Linux. The machine was ....
Xiaoye S. Li. Sparse Gaussian Elimination on High Performance Computers. PhD thesis, Department of Computer Science, UC Berkeley, 1996.
....the graph [7]. Meanwhile, great strides have been made in new algorithms and hand optimizations of unstructured problems for uniprocessors with caches, and shared and distributed memory multiprocessors. The optimization techniques include identification of dense subblocks within a sparse matrix [102, 60] to improve cache performance, use of non-blocking remote memory operations to enable overlap with fine-grained communication [24], graph partitioning algorithms to simultaneously optimize communication and load balance, multi-dimensional blocked layouts of sparse matrices [89], and ....
.... operations to enable overlap with fine-grained communication [24], graph partitioning algorithms to simultaneously optimize communication and load balance, multi-dimensional blocked layouts of sparse matrices [89], and algorithms to predict and schedule communication and computation costs [60]. Unfortunately, the implementation of optimized algorithms lags far behind the algorithmic advances. The reason is largely inadequate language support. In a research field where a uniprocessor implementation of sparse matrix vector multiply is a publishable result, and the design and ....
[Article contains additional citation context not shown here]
X. Li. Sparse Gaussian Elimination on High Performance Computers. PhD thesis, Computer Science Division, Department of Electrical Engineering and Computer Science, University of California, Berkeley, 1996.
....well balanced task distribution requires dynamic job scheduling. Processors are dynamically assigned to supernodes in the elimination tree in such a way that new tasks are given to processors when the previous tasks are finished. Dynamic job scheduling is implemented by a pool-of-tasks approach [2, 4, 8, 10]. The pool contains the list of tasks that can be performed by available processors. Therefore, the queue is initialized with all leaves of the elimination tree (line 5). Then, a group of p processes is created. Each process asks the queue for a new task until all supernodes have been factorized. ....
X. Li. Sparse Gaussian Elimination on High Performance Computers. PhD thesis, University of California at Berkeley, Department of Computer Science, 1996.
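The pool-of-tasks scheduling described in the context above can be illustrated with a minimal Python sketch. It is not the cited authors' implementation. The function name `factor_with_task_pool`, the `parent` array encoding of the elimination tree, and the `factor_supernode` callback are all hypothetical, and the rule that a supernode becomes ready once all of its children are finished is an assumption, since the excerpt does not say how new tasks enter the pool.

```python
import queue
import threading


def factor_with_task_pool(parent, num_workers, factor_supernode):
    """Process every supernode of an elimination tree with a shared task pool.

    parent[s] is the parent of supernode s, or -1 for a root (assumed encoding).
    factor_supernode(s) is a caller-supplied stand-in for the numerical work.
    Returns the order in which supernodes were completed.
    """
    n = len(parent)
    if n == 0:
        return []

    # Count unfinished children so we know when a parent becomes ready.
    pending_children = [0] * n
    for p in parent:
        if p >= 0:
            pending_children[p] += 1

    # The pool starts with all leaves of the elimination tree.
    pool = queue.Queue()
    for s in range(n):
        if pending_children[s] == 0:
            pool.put(s)

    lock = threading.Lock()
    order = []

    def worker():
        while True:
            s = pool.get()
            if s is None:  # sentinel: all supernodes are done
                return
            factor_supernode(s)
            with lock:
                order.append(s)
                finished = len(order) == n
                p = parent[s]
                ready = False
                if p >= 0:
                    pending_children[p] -= 1
                    ready = pending_children[p] == 0
            if ready:
                pool.put(p)  # assumed rule: parent is ready after its children
            if finished:
                for _ in range(num_workers):
                    pool.put(None)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order
```

Calling `factor_with_task_pool([2, 2, 4, 4, -1], 3, lambda s: None)` processes the two-level tree with three workers. The completion order varies from run to run, but every child finishes before its parent.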
....load imbalance on modern architectures with memory hierarchies. The previous work has addressed parallelization on shared memory platforms or with restricted pivoting [4, 13, 15, 19]. Most notably, the recent shared memory implementation of SuperLU has achieved up to 2.58 GFLOPS on 8 Cray C90 nodes [4, 5, 23]. For distributed memory machines, we proposed an approach that adopts a static symbolic factorization scheme to avoid data structure variation [10, 11]. Static symbolic factorization eliminates the runtime overhead of dynamic symbolic factorization with a price of over-estimated fill-ins and ....
....and space is allocated for all possible nonzero entries. Static symbolic factorization annihilates data structure variation, and hence it improves predictability of resource requirements and enables static optimization strategies. On the other hand, dynamic factorization, which is used in SuperLU [4, 23], provides more accurate control of data structures on the fly. But it is challenging to parallelize dynamic factorization with low runtime overhead on distributed memory machines. The static symbolic factorization for an n × n matrix is outlined as follows. At each step k (1 ≤ k ≤ n), each ....
[Article contains additional citation context not shown here]
X. S. Li, Sparse Gaussian Elimination on High Performance Computers, PhD thesis, Computer Science Division, EECS, UC Berkeley, 1996.
....can be very fast. For example, it costs less than one second for most of our tested matrices, at worst it costs 2 seconds on a single node of Cray T3E, and the memory requirement is relatively small. The dynamic factorization, which is used in the sequential and shared memory versions of SuperLU [7, 20], provides more accurate data structure prediction on the fly, but it is challenging to parallelize SuperLU with low runtime control overhead on distributed memory .... [Figure 3.1: A sample sparse matrix.] [Figure 3.2: The first 3 steps of the symbolic factorization ....]
X. S. Li. Sparse Gaussian Elimination on High Performance Computers. PhD thesis, Computer Science Division, EECS, UC Berkeley, 1996.
....were the authors who reported probably the best performance for the sparse Cholesky. A significantly more difficult problem appears when matrices are not symmetric. Here, the supernode tree is the tool to exploit task-level parallelism. A parallel version of SuperLU is presented by Li et al. [16], achieving, on 8-processor shared memory machines and for 21 unsymmetric sparse matrices, an average speedup of less than 4. Better results can be achieved on distributed memory machines, as recently shown by Fu, Jiao, and Yang (1998) [11]. However, they have parallelized the factorize stage ....
X. Li. Sparse Gaussian Elimination on High Performance Computers. PhD thesis, CS, UC Berkeley, 1996.
....be parallelized at the loop level. That means that we could face these generic sparse problems from the point of view of the data parallel compilers in which we are interested [4, 5]. As far as we know, the parallel versions for multifrontal or supernode codes only exploit parallelism at the task level [18, 26], which is more sensitive to load balance and scalability problems. 2.1 Loop level parallelism. Exploiting loop-level parallelism presents some advantages. A loop-level parallelized code has the same structure as the sequential code except for these two issues: Iteration space for parallel ....
....when task level one starts to be exhausted when approaching the root of the tree. A significantly more difficult problem appears when matrices are not symmetric. Here, the supernode tree is the tool to exploit task-level parallelism. A parallel version of SuperLU is presented by Li et al. [26], achieving, on 8-processor shared memory machines and for 21 unsymmetric sparse matrices, the following average speedups: 3.59 on the SGI Power Challenge, 4.01 on the DEC AlphaServer 8400, 3.85 on the Cray C90 and 4.29 on the Cray J90. Better results can be achieved on distributed memory ....
X. Li. Sparse Gaussian Elimination on High Performance Computers. PhD thesis, CS, UC Berkeley, 1996.
....elimination process, and cause severe cache misses and load imbalance on modern computers with memory hierarchies. The previous work has addressed parallelization using shared memory platforms or restricted pivoting [3, 12, 13, 16]. Most notably, the recent shared memory implementation of SuperLU [3, 4, 18] has achieved up to 2.58 GFLOPS on 8 Cray C90 nodes. For distributed memory machines, in [10] we proposed a novel approach called S that integrates three key strategies together in parallelizing this algorithm: 1) adopt a static symbolic factorization scheme [13] to eliminate the data structure ....
....can be very fast. For example, it costs less than one second for most of our tested matrices, at worst it costs 2 seconds on a single node of Cray T3E, and the memory requirement is relatively small. The dynamic factorization, which is used in the sequential and shared memory versions of SuperLU [3, 18], provides more accurate data structure prediction on the fly, but it is challenging to parallelize SuperLU with low runtime control overhead on distributed memory machines. In [8, 10] we show that static factorization does not produce too many fill-ins for most of the tested matrices, even for ....
[Article contains additional citation context not shown here]
X. Li. Sparse Gaussian Elimination on High Performance Computers. PhD thesis, Computer Science Division, EECS, UC Berkeley, 1996.
.... [Figure 2: (a) A DAG; (b) A schedule for the DAG on 2 processors; (c) Another schedule.] 3 Active memory management. 3.1 Basic ideas. Naturally if memory space is not sufficient to hold all data objects, space recycling for volatile data will be ....
[Article contains additional citation context not shown here]
X. Li. Sparse Gaussian Elimination on High Performance Computers. PhD thesis, CS, UC Berkeley, 1996.
....can be very fast. For example, it costs less than one second for most of our tested matrices, at worst it costs 2 seconds on a single node of Cray T3E, and the memory requirement is relatively small. The dynamic factorization, which is used in the sequential and shared memory versions of SuperLU [14], provides more accurate data structure prediction on the fly, but it is challenging to parallelize SuperLU with low runtime control overhead on distributed memory machines. In [9, 10] we show that static factorization does not produce too many fill-ins for most of the tested matrices, even for ....
....if and only if vertex i is an ancestor of vertex j in the elimination forest. 2D L/U supernode partitioning and amalgamation. After the nonzero fill-in pattern of a matrix is predicted, the matrix is further partitioned using a supernodal approach to improve the caching performance. In [14], a nonsymmetric supernode is defined as a group of consecutive columns in which the corresponding L factor has a dense lower triangular block on the diagonal and the same nonzero pattern below the diagonal. Based on this definition, in each column block the L part only contains dense subrows. We ....
X. Li. Sparse Gaussian Elimination on High Performance Computers. PhD thesis, Computer Science Division, EECS, UC Berkeley, 1996.
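The supernode definition quoted in the context above (consecutive columns of L with a dense lower triangular diagonal block and the same nonzero pattern below it) can be checked column by column. The following Python sketch is only an illustration of that definition, not code from SuperLU or the citing paper. The name `find_supernodes` and the input format (one set of row indices per column, diagonal included) are assumptions. It relies on the observation that columns j-1 and j can share a supernode exactly when the pattern of column j-1, with its diagonal entry removed, equals the pattern of column j.

```python
def find_supernodes(col_rows):
    """Group consecutive columns of L into maximal supernodes.

    col_rows[j] is the set of row indices of nonzeros in column j of L,
    diagonal included (an assumed input format for this sketch).
    Returns a list of (first_column, last_column) pairs, both inclusive.
    """
    n = len(col_rows)
    supernodes = []
    start = 0
    for j in range(1, n + 1):
        # Column j extends the current supernode only if dropping the
        # diagonal of column j-1 leaves exactly the pattern of column j.
        extends = j < n and set(col_rows[j - 1]) - {j - 1} == set(col_rows[j])
        if not extends:
            supernodes.append((start, j - 1))
            start = j
    return supernodes


# Columns 0-2 form a dense triangle plus a shared row 4; columns 3-4 likewise.
example = [{0, 1, 2, 4}, {1, 2, 4}, {2, 4}, {3, 4}, {4}]
print(find_supernodes(example))  # [(0, 2), (3, 4)]
```

Within each pair returned, every row below the diagonal block is either fully dense across the block or entirely zero, which is the "dense subrows" property mentioned in the excerpt.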
....Scientific applications continue to grow in the sophistication of the data structures and algorithms they use. Techniques such as adaptive unstructured meshes [4] in computational fluid dynamics (CFD), hierarchical n-body simulations [10, 18, 23, 27], and supernodal sparse Cholesky [2] and LU [21] factorization are difficult to write in an architecture-independent manner in traditional performance-oriented languages. One programming model capable of expressing such irregular algorithms is nested data parallelism [5]. While data parallelism allows the expression of parallelism through ....
X. S. Li. Sparse Gaussian Elimination on High Performance Computers. PhD thesis, Department of Computer Science, University of California at Berkeley, Berkeley, CA, Sept. 1996. Available as technical report CSD-96-919.
....pattern of a sparse matrix is given at the run time preprocessing stage. Sparse Gaussian Elimination (LU factorization) with partial pivoting. This problem has unpredictable dependence and storage structures due to dynamic pivoting. Its parallelization on shared memory platforms is addressed in Li [1996]. However, its efficient parallelization on distributed memory machines still remains an open problem in the scientific computing literature. We have used a static symbolic factorization approach to estimate the worst case dependence structure and storage need. In Fu and Yang [1996b] we show that ....
....tree are normally computation intensive and have sufficient parallelism. For sparse LU, since our approach uses a static symbolic factorization which overestimates computation, we only list the megaflops performance. In calculating megaflops, we use more accurate operation counts from SuperLU [Li 1996] and divide them by corresponding numerical factorization time. RAPID with Active Memory Management. Table III examines performance degradation after using active memory management. RCP is still used for task ordering, and we show later on how much improvement on space efficiency can be obtained ....
Li, X. 1996. Sparse Gaussian Elimination on High Performance Computers. Ph.D. thesis, CS, UC Berkeley.
....all the possible nonzeros that would be introduced by any pivoting sequence that could occur during the numerical factorization. The static approach avoids data structure expansions during the numerical factorization. The dynamic factorization, which is used in an efficient sequential code SuperLU [11], provides more accurate data structure prediction on the fly, but it is challenging to parallelize SuperLU on distributed memory machines. Currently the SuperLU group has been working on shared memory parallelizations [11]. L/U supernode partitioning. After the nonzero fill-in patterns of a ....
....factorization, which is used in an efficient sequential code SuperLU [11], provides more accurate data structure prediction on the fly, but it is challenging to parallelize SuperLU on distributed memory machines. Currently the SuperLU group has been working on shared memory parallelizations [11]. L/U supernode partitioning. After the nonzero fill-in patterns of a matrix are predicted, the matrix is further partitioned using a supernode approach to improve the cache performance. In [11], a nonsymmetric supernode is defined as a group of consecutive columns in which the corresponding L ....
[Article contains additional citation context not shown here]
X. Li, Sparse Gaussian Elimination on High Performance Computers, PhD thesis, CS, UC Berkeley, 1996.
....take advantage both of sparsity and the computer architecture, in particular memory hierarchies (caches) and parallelism. In this introduction we refer to all three libraries collectively as SuperLU. The three libraries within SuperLU are as follows. Detailed references are also given (see also [21]). Sequential SuperLU is designed for sequential processors with one or more layers of memory hierarchy (caches) [5]. Multithreaded SuperLU (SuperLU MT) is designed for shared memory multiprocessors (SMPs) and can effectively use up to 16 or 32 parallel processors on sufficiently large matrices ....
....architectures, in particular, the multi-level cache organization and parallelism. We have conducted extensive experiments on various platforms, with a large collection of test matrices. The Sequential SuperLU achieved up to 40% of the theoretical floating point rate on a number of processors, see [5, 21]. The megaflop rate usually increases with increasing ratio of floating point operations count over the number of nonzeros in the L and U factors. The parallel LU factorization in SuperLU MT demonstrated 5-10 fold speedups on a range of commercially popular SMPs, and up to 2.5 Gigaflops factorization ....
[Article contains additional citation context not shown here]
Xiaoye S. Li. Sparse Gaussian elimination on high performance computers. Technical Report UCB//CSD-96-919, Computer Science Division, U.C. Berkeley, September 1996. Ph.D dissertation.
....tree of A, as defined below. Our algorithms run in time almost linear in the number of nonzeros in A. Thus they may be used as a fast way to predict and allocate the storage necessary for the QR or LU factorization. In particular, our work was motivated by the LU factorization code SuperLU [3, 4, 17]. Both the sequential and parallel versions of SuperLU use the column elimination tree to cluster similarly structured columns of L for efficiency; the shared memory parallel version .... [Footnote: In this factorization matrix A can also be rectangular. Here, for simplicity, we consider only square matrices.]
....A) A fundamental supernode is a supernode that is maximal subject to the property that every vertex on that path, with the possible exception of the first one, has exactly one child in the tree. See Ng and Peyton [22] for more on fundamental supernodes. The following theorem, which is due to Li [17], characterizes the fundamental supernodes of H in terms of the column elimination tree. Theorem 2 (Supernodal structure of H) [17]. Let T = T(A^T A) be the column elimination tree of A, and assume that T is postordered. Let H be the Householder matrix. Vertex j is the first vertex in a ....
[Article contains additional citation context not shown here]
X. S. Li. Sparse Gaussian elimination on high performance computers. Technical Report UCB//CSD-96-919, Computer Science Division, U.C. Berkeley, September 1996. Ph.D. dissertation.
....to take advantage both of sparsity and the computer architecture, in particular memory hierarchies (caches) and parallelism. In this introduction we refer to all three libraries collectively as SuperLU. The three libraries within SuperLU are as follows. Detailed references are also given (see also [19]). Sequential SuperLU is designed for sequential processors with one or more layers of memory hierarchy (caches) [5]. Multithreaded SuperLU (SuperLU MT) is designed for shared memory multiprocessors (SMPs) and can effectively use up to 16 or 32 parallel processors on sufficiently large ....
....architectures, in particular, the multi-level cache organization and parallelism. We have conducted extensive experiments on various platforms, with a large collection of test matrices. The Sequential SuperLU achieved up to 40% of the theoretical floating point rate on a number of processors, see [5, 19]. The megaflop rate usually increases with increasing ratio of floating point operations count over the number of nonzeros in the L and U factors. The parallel LU factorization in SuperLU MT demonstrated 5-10 fold speedups on a range of commercially popular SMPs, and up to 2.5 Gigaflops ....
[Article contains additional citation context not shown here]
Xiaoye S. Li. Sparse Gaussian elimination on high performance computers. Technical Report UCB//CSD-96-919, Computer Science Division, U.C. Berkeley, September 1996. Ph.D dissertation.
....pivoting is used as an effective mechanism to control the element growth during Gaussian elimination on general matrices, thereby stabilizing the underlying algorithm. In our earlier work we developed efficient algorithms and software to perform Gaussian elimination with partial pivoting (GEPP) [3, 4, 5, 9]. Since the computational graph does not unfold until runtime due to partial pivoting, our shared memory parallel GEPP algorithm uses a centralized task queue for dynamic scheduling and load balancing. However, this is too expensive on distributed memory machines. Instead, for distributed memory ....
X. S. Li, Sparse Gaussian elimination on high performance computers, Tech. Rep. UCB//CSD-96-919, Computer Science Division, U.C. Berkeley, September 1996. Ph.D dissertation.
....for both sparse Gaussian elimination and triangular solve and we show that they are suitable for large scale distributed memory machines. Keywords: sparse unsymmetric linear systems, static pivoting, iterative refinement, MPI, 2-D matrix decomposition. 1 Introduction. In our earlier work [8, 9, 22], we developed new algorithms to solve unsymmetric sparse linear systems using Gaussian elimination with partial pivoting (GEPP). The new algorithms are highly efficient on workstations with deep memory hierarchies and shared memory parallel machines with a modest number of processors. The ....
Xiaoye S. Li. Sparse Gaussian elimination on high performance computers. Technical Report UCB//CSD-96-919, Computer Science Division, U.C. Berkeley, September 1996. Ph.D dissertation.
....the overall column order would be a two-level postorder, first within the subtrees (panels) and then among them. Again, it might be possible to use information about the Cholesky supernodes of A^T A to guide this grouping. We are also developing a parallel sparse LU algorithm based on SuperLU [11, 33]. In this context, we target large problems, especially those too big to be solved on a uniprocessor system. Therefore, we plan to parallelize the 2-D blocked supernode panel algorithm, which has very good asymptotic behavior for large problems. The 2-D block oriented layout has been shown to ....
X. S. Li, Sparse Gaussian elimination on high performance computers, Tech. Report UCB//CSD-96-919, Computer Science Division, U.C. Berkeley, September 1996. Ph.D dissertation.
No context found.
Li X. Sparse Gaussian elimination on high performance computers. PhD Thesis, Computer Science Department, University of California at Berkeley, 1996. 25
No context found.
X. Li. Sparse Gaussian Elimination on High Performance Computers. PhD thesis, University of California at Berkeley, Computer Science Department, 1996.