Results 21–30 of 527
On two-dimensional sparse matrix partitioning: Models, methods, and a recipe
 SIAM J. Sci. Comput., 2010
Cited by 35 (18 self)
We consider two-dimensional partitioning of general sparse matrices for the parallel sparse matrix-vector multiply operation. We present three hypergraph-partitioning-based methods, each having unique advantages. The first one treats the nonzeros of the matrix individually and hence produces fine-grain partitions. The other two produce coarser partitions, where one of them imposes a limit on the number of messages sent and received by a single processor, and the other trades that limit for a lower communication volume. We also present a thorough experimental evaluation of the proposed two-dimensional partitioning methods together with the hypergraph-based one-dimensional partitioning methods, using an extensive set of public domain matrices. Furthermore, for the users of these partitioning methods, we present a partitioning recipe that chooses one of the partitioning methods according to some matrix characteristics.
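As background for the communication-volume objective these partitioning models minimize, here is a minimal sketch (the function name and data layout are illustrative, not from the paper) of counting the total communication volume of a 1D row-wise partition for parallel sparse matrix-vector multiply:

```python
def comm_volume_1d(rows, part_of_row):
    """Total communication volume of y = A*x under a 1D row-wise
    partition, assuming x[j] is owned by the part that owns row j.

    rows: dict mapping row index -> iterable of column indices of nonzeros.
    part_of_row: dict mapping row index -> part (processor) id.
    x[j] must be sent once to every part other than its owner that has
    a nonzero in column j, so column j contributes
    (#distinct parts touching column j, excluding the owner of row j)."""
    needers = {}  # column j -> set of parts that need x[j]
    for i, cols in rows.items():
        for j in cols:
            needers.setdefault(j, set()).add(part_of_row[i])
    volume = 0
    for j, parts in needers.items():
        owner = part_of_row.get(j)
        volume += len(parts - {owner})
    return volume
```

A 2D (fine-grain) partition generalizes this by assigning individual nonzeros rather than whole rows, which the methods in this paper optimize via hypergraph partitioning.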
Direction-Optimizing Breadth-First Search
Cited by 35 (4 self)
Breadth-First Search is an important kernel used by many graph-processing applications. In many of these emerging applications of BFS, such as analyzing social networks, the input graphs are low-diameter and scale-free. We propose a hybrid approach that is advantageous for low-diameter graphs, which combines a conventional top-down algorithm with a novel bottom-up algorithm. The bottom-up algorithm can dramatically reduce the number of edges examined, which in turn accelerates the search as a whole. On a multi-socket server, our hybrid approach demonstrates speedups of 3.3–7.8 on a range of standard synthetic graphs and speedups of 2.4–4.6 on graphs from real social networks when compared to a strong baseline. We also typically double the performance of prior leading shared-memory (multicore and GPU) implementations.
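A serial sketch of the hybrid idea (the switching test and threshold value are simplified; the paper's heuristic also considers frontier size when switching back to top-down):

```python
def hybrid_bfs(adj, source, alpha=14):
    """Level-synchronous BFS that switches between top-down and
    bottom-up steps, in the spirit of direction-optimizing BFS.
    adj: dict mapping each vertex to a list of neighbours.
    alpha is a tuning threshold (the value here is illustrative).
    Returns a dict vertex -> BFS parent (the source maps to itself)."""
    parent = {source: source}
    frontier = [source]
    while frontier:
        # Simplified switching test: go bottom-up once the frontier's
        # edges outnumber the unexplored edges divided by alpha.
        frontier_edges = sum(len(adj[u]) for u in frontier)
        unexplored_edges = sum(len(adj[u]) for u in adj if u not in parent)
        if frontier_edges * alpha > unexplored_edges:
            # Bottom-up: each unvisited vertex scans its neighbours for
            # one already in the frontier, stopping at the first hit --
            # this early exit is what saves edge examinations.
            in_frontier = set(frontier)
            nxt = []
            for v in adj:
                if v not in parent:
                    for u in adj[v]:
                        if u in in_frontier:
                            parent[v] = u
                            nxt.append(v)
                            break
            frontier = nxt
        else:
            # Top-down: frontier vertices claim their unvisited neighbours.
            nxt = []
            for u in frontier:
                for v in adj[u]:
                    if v not in parent:
                        parent[v] = u
                        nxt.append(v)
            frontier = nxt
    return parent
```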
A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers)
 In SPAA ’10: Proceedings of the 22nd ACM Symposium on Parallelism in Algorithms and Architectures, 2010
Cited by 33 (2 self)
We have developed a multithreaded implementation of breadth-first search (BFS) of a sparse graph using the Cilk++ extensions to C++. Our PBFS program on a single processor runs as quickly as a standard C++ breadth-first search implementation. PBFS achieves high work-efficiency by using a novel implementation of a multiset data structure, called a “bag,” in place of the FIFO queue usually employed in serial breadth-first search algorithms. For a variety of benchmark input graphs whose diameters are significantly smaller than the number of vertices (a condition met by many real-world graphs), PBFS demonstrates good speedup with the number of processing cores. Since PBFS employs a non-constant-time “reducer” (a “hyperobject” feature of Cilk++), the work inherent in a PBFS execution depends nondeterministically on how the underlying work-stealing scheduler load-balances the computation. We provide a general method for analyzing nondeterministic programs that use reducers. PBFS is also nondeterministic in that it contains benign races which affect its performance but not its correctness. Fixing these races with mutual-exclusion locks slows down PBFS empirically, but it makes the algorithm amenable to analysis. In particular, we show that for a graph G = (V, E) with diameter D and bounded out-degree, this data-race-free version of the PBFS algorithm runs in time O((V + E)/P + D lg³(V/D)) on P processors, which means that it attains near-perfect linear speedup if P ≪ (V + E)/(D lg³(V/D)).
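The “bag” can be rendered as an array of “pennants” (trees of 2^k elements) merged like a binary counter, following the paper's description; this is a serial Python sketch (class names are mine, and the Cilk++ parallelism and reducer machinery are omitted):

```python
class Pennant:
    """A pennant holds 2^k elements: a root whose single child
    (stored in .left) is a complete binary tree of 2^k - 1 nodes."""
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def pennant_union(x, y):
    """Combine two pennants of equal size 2^k into one of size 2^(k+1)."""
    y.right = x.left
    x.left = y
    return x

class Bag:
    """Multiset with O(1) amortized insert and O(lg n) union: slot k
    holds either nothing or a pennant of exactly 2^k elements."""
    def __init__(self):
        self.slots = []

    def insert(self, value):
        # Like incrementing a binary counter: merge equal-size
        # pennants and carry upward.
        carry, k = Pennant(value), 0
        while k < len(self.slots) and self.slots[k] is not None:
            carry = pennant_union(self.slots[k], carry)
            self.slots[k] = None
            k += 1
        if k == len(self.slots):
            self.slots.append(carry)
        else:
            self.slots[k] = carry

    def union(self, other):
        # Like binary addition of the two slot arrays, with a carry.
        n = max(len(self.slots), len(other.slots))
        self.slots += [None] * (n - len(self.slots))
        carry = None
        for k in range(n):
            o = other.slots[k] if k < len(other.slots) else None
            items = [p for p in (self.slots[k], o, carry) if p is not None]
            if len(items) <= 1:
                self.slots[k] = items[0] if items else None
                carry = None
            else:
                self.slots[k] = items[0] if len(items) == 3 else None
                carry = pennant_union(items[-2], items[-1])
        if carry is not None:
            self.slots.append(carry)

    def elements(self):
        out = []
        def walk(p):
            if p is not None:
                out.append(p.value)
                walk(p.left)
                walk(p.right)
        for p in self.slots:
            walk(p)
        return out
```

In PBFS, each BFS level's frontier is a bag: workers insert newly discovered vertices into thread-local bags, which the reducer unions cheaply.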
Engineering a Scalable High Quality Graph Partitioner
 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2010
Cited by 33 (19 self)
We describe an approach to parallel graph partitioning that scales to hundreds of processors and produces high solution quality. For example, for many instances from Walshaw’s benchmark collection we improve the best known partitioning. We use the well-known framework of multilevel graph partitioning. All components are implemented by scalable parallel algorithms. Quality improvements compared to previous systems are due to better prioritization of edges to be contracted, better approximation algorithms for identifying matchings, better local search heuristics, and, perhaps most notably, a parallelization of the FM local search algorithm that works more locally than previous approaches.
A fine-grain hypergraph model for 2D decomposition of sparse matrices
 In: Proceedings of the 15th International Parallel and Distributed Processing Symposium, 2001, p. 118 (C. Aykanat)
Cited by 33 (8 self)
We propose a new hypergraph model for the decomposition of irregular computational domains. This work focuses on the decomposition of sparse matrices for parallel matrix-vector multiplication. However, the proposed model can also be used to decompose computational domains of other parallel reduction problems. We propose a “fine-grain” hypergraph model for two-dimensional decomposition of sparse matrices. In the proposed fine-grain hypergraph model, vertices represent nonzeros and hyperedges represent sparsity patterns of rows and columns of the matrix. By partitioning the fine-grain hypergraph into equally weighted vertex parts (processors) so that hyperedges are split among as few processors as possible, the model correctly minimizes communication volume while maintaining computational-load balance. Experimental results on a wide range of realistic sparse matrices confirm the validity of the proposed model, achieving up to 50 percent better decompositions than the existing models in terms of total communication volume.
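A small sketch of constructing the fine-grain hypergraph and evaluating the connectivity-minus-one cut metric it minimizes (function names and the dict-based net representation are illustrative):

```python
def fine_grain_hypergraph(nonzeros):
    """Build the fine-grain hypergraph of a sparse matrix: one vertex
    per nonzero (i, j), one net per row and one per column, each net
    connecting the nonzeros it contains.

    nonzeros: iterable of (i, j) pairs.
    Returns (vertices, nets), where nets maps ids like ('row', i) and
    ('col', j) to the list of incident vertices (its pins)."""
    vertices = list(nonzeros)
    nets = {}
    for v in vertices:
        i, j = v
        nets.setdefault(('row', i), []).append(v)
        nets.setdefault(('col', j), []).append(v)
    return vertices, nets

def cut_volume(nets, part_of):
    """Connectivity-1 metric: each net contributes (#parts its pins
    span - 1), which equals total communication volume for fine-grain
    parallel SpMV. part_of maps each vertex (nonzero) to its part."""
    return sum(len({part_of[v] for v in pins}) - 1
               for pins in nets.values())
```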
Multilevel preconditioners constructed from inverse-based ILUs
 2004
Cited by 32 (9 self)
This paper analyzes dropping strategies in a multilevel incomplete LU decomposition context and presents a few strategies for obtaining related ILUs with enhanced robustness. The analysis shows that the incomplete LU factorization resulting from dropping small entries in Gaussian elimination produces a good preconditioner when the inverses of these factors have norms that are not too large. As a consequence, a few strategies are developed whose goal is to achieve this feature. A number of “templates” for enabling implementations of these factorizations are presented. Numerical experiments show that the resulting ILUs offer a good compromise between robustness and efficiency.
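To make “dropping small entries in Gaussian elimination” concrete, here is a simplified dense sketch of threshold-based incomplete LU (no pivoting, no multilevel structure; this is the baseline style of dropping the paper analyzes, not its inverse-based strategy):

```python
def ilu_threshold(A, tau=1e-2):
    """Incomplete LU via threshold dropping during Gaussian elimination,
    in the spirit of ILUT. A: square list-of-lists matrix with nonzero
    pivots; tau is an absolute drop tolerance (illustrative only).
    Returns (L, U) with unit-diagonal L; with tau = 0 this is exact LU."""
    n = len(A)
    U = [row[:] for row in A]
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]
            if abs(m) > tau:
                L[i][k] = m
                for j in range(k, n):
                    U[i][j] -= m * U[k][j]
                    if j > k and abs(U[i][j]) < tau:
                        U[i][j] = 0.0   # drop small fill-in
            else:
                U[i][k] = 0.0           # drop the small subdiagonal entry
    return L, U
```

The paper's point is that such dropping is only safe when the inverses of the resulting factors stay modest in norm, which motivates its inverse-based dropping criteria.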
GMap: Visualizing Graphs and Clusters as Maps
 2009
Cited by 32 (21 self)
Information visualization is essential in making sense of large data sets. Often, high-dimensional data are visualized as a collection of points in 2-dimensional space through dimensionality reduction techniques. However, these traditional methods often do not capture well the underlying structural information, clustering, and neighborhoods. In this paper, we describe GMap, a practical tool for visualizing relational data with geographic-like maps. We illustrate the effectiveness of this approach with examples from several domains. All the maps referenced in this paper can be found at www.research.att.com/~yifanhu/GMap.
Computing the Action of the Matrix Exponential, with an Application to Exponential Integrators
 2010
Cited by 31 (9 self)
A new algorithm is developed for computing e^{tA}B, where A is an n × n matrix and B is n × n0 with n0 ≪ n. The algorithm works for any A, its computational cost is dominated by the formation of products of A with n × n0 matrices, and the only input parameter is a backward error tolerance. The algorithm can return a single matrix e^{tA}B or a sequence e^{t_k A}B on an equally spaced grid of points t_k. It uses the scaling part of the scaling and squaring method together with a truncated Taylor series approximation to the exponential. It determines the amount of scaling and the Taylor degree using the recent analysis of Al-Mohy and Higham [SIAM J. Matrix Anal. Appl. 31 (2009), pp. 970–989], which provides sharp truncation error bounds expressed in terms of the quantities ‖A^k‖_1^{1/k} for a few values of k, where the norms are estimated using a matrix norm estimator. Shifting and balancing are used as preprocessing steps to reduce the cost of the algorithm. Numerical experiments show that the algorithm performs in a numerically stable fashion across a wide range of problems, and analysis of rounding errors and of the conditioning of the problem provides theoretical support. Experimental comparisons with two Krylov-based MATLAB codes show the new algorithm to be sometimes much superior in terms of computational cost and accuracy. An important application of the algorithm is to exponential integrators for ordinary differential equations. It is shown that the sums of the form ∑_{k=0}^{p} ϕ_k(A)u_k that arise in exponential integrators, where the ϕ_k are related to the exponential function, can be expressed in terms of a single exponential of a matrix of dimension n + p built by augmenting A with additional rows and columns, and the algorithm of this paper can therefore be employed.
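The scaling-plus-truncated-Taylor core of such an algorithm can be sketched as follows (here the scaling s and Taylor degree m are fixed constants for illustration, whereas the actual algorithm chooses them from backward error bounds):

```python
def expm_action(A, B, t=1.0, s=8, m=20):
    """Compute e^{tA} B without ever forming e^{tA}: split t into s
    substeps and apply a degree-m truncated Taylor series at each one,
    so the only operations on A are matrix products with tall-thin B.
    A: n x n list of lists; B: n x n0 list of lists."""
    def matmul(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
                 for j in range(len(Y[0]))] for i in range(len(X))]
    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    def scale(X, c):
        return [[c * x for x in row] for row in X]
    for _ in range(s):
        F = B  # running Taylor sum, starts at the j = 0 term, B itself
        T = B  # current term (tA/s)^j B / j!
        for j in range(1, m + 1):
            T = scale(matmul(A, T), t / (s * j))
            F = add(F, T)
        B = F  # B <- e^{(t/s)A} B, repeated s times
    return B
```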
Engineering Multilevel Graph Partitioning Algorithms
Cited by 31 (16 self)
We present a multilevel graph partitioning algorithm using novel local improvement algorithms and global search strategies transferred from multigrid linear solvers. Local improvement algorithms are based on max-flow min-cut computations and more localized FM searches. By combining these techniques, we obtain an algorithm that is fast on the one hand and on the other hand is able to improve the best known partitioning results for many inputs. For example, in Walshaw’s well-known benchmark tables we achieve 317 improvements for the tables at 1%, 3% and 5% imbalance. Moreover, in 118 out of the 295 remaining cases we have been able to reproduce the best cut in this benchmark.
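For reference, a minimal serial rendering of the classic FM local search that such partitioners refine (unit vertex weights, no balance constraint, no gain-bucket data structure; the paper's localized, flow-augmented variants differ substantially):

```python
def fm_pass(adj, side):
    """One pass of Fiduccia-Mattheyses-style 2-way local search.
    adj: dict vertex -> list of neighbours; side: dict vertex -> 0 or 1.
    gain(v) = (external degree) - (internal degree): how much the cut
    shrinks if v switches sides. Moves every vertex exactly once,
    highest gain first, and returns the best partition seen en route,
    so the pass can escape local minima by passing through worse cuts."""
    side = dict(side)
    gain = lambda v: sum(1 if side[u] != side[v] else -1 for u in adj[v])
    cut = sum(1 for u in adj for v in adj[u] if u < v and side[u] != side[v])
    best, best_cut = dict(side), cut
    locked = set()
    while len(locked) < len(adj):
        v = max((u for u in adj if u not in locked), key=gain)
        cut -= gain(v)         # v's external edges leave the cut,
        side[v] = 1 - side[v]  # its internal edges enter it
        locked.add(v)
        if cut < best_cut:
            best, best_cut = dict(side), cut
    return best, best_cut
```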
Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks
 In SPAA, 2009
Cited by 26 (1 self)
This paper introduces a storage format for sparse matrices, called compressed sparse blocks (CSB), which allows both Ax and A^T x to be computed efficiently in parallel, where A is an n × n sparse matrix with nnz ≥ n nonzeros and x is a dense n-vector. Our algorithms use Θ(nnz) work (serial running time) and Θ(√n lg n) span (critical-path length), yielding a parallelism of Θ(nnz/(√n lg n)), which is amply high for virtually any large matrix. The storage requirement for CSB is essentially the same as that for the more standard compressed-sparse-rows (CSR) format, for which computing Ax in parallel is easy but A^T x is difficult. Benchmark results indicate that on one processor, the CSB algorithms for Ax and A^T x run just as fast as the CSR algorithm for Ax, but the CSB algorithms also scale up linearly with processors until limited by off-chip memory bandwidth.
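The essence of the format, that one blocked layout serves both Ax and A^T x, can be sketched serially (the block storage here is a plain dict of coordinate lists with block-local indices, a simplification of the paper's packed CSB arrays):

```python
def to_blocks(nonzeros, beta=4):
    """Store a sparse matrix as beta x beta blocks, each holding its
    nonzeros in coordinate form with block-local indices.
    nonzeros: iterable of (i, j, value)."""
    blocks = {}
    for i, j, v in nonzeros:
        blocks.setdefault((i // beta, j // beta), []) \
              .append((i % beta, j % beta, v))
    return blocks

def spmv(n, blocks, x, beta=4, transpose=False):
    """Compute y = A x, or y = A^T x when transpose=True: the same
    block structure serves both orientations, since transposing just
    swaps the roles of row and column inside every block."""
    y = [0.0] * n
    for (bi, bj), nnzs in blocks.items():
        for li, lj, v in nnzs:
            if transpose:
                y[bj * beta + lj] += v * x[bi * beta + li]
            else:
                y[bi * beta + li] += v * x[bj * beta + lj]
    return y
```

In the real format, each block's nonzeros are stored contiguously so both row- and column-oriented traversals of a block stay cache-friendly, which is what makes the transposed product as fast as the ordinary one.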