Results 31 – 40 of 526
Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks
 In SPAA
, 2009
"... This paper introduces a storage format for sparse matrices, called compressed sparse blocks (CSB), which allows both Ax and A T x to be computed efficiently in parallel, where A is an n × n sparse matrix with nnz ≥ n nonzeros and x is a dense nvector. Our algorithms use Θ(nnz) work (serial running ..."
Abstract

Cited by 27 (1 self)
This paper introduces a storage format for sparse matrices, called compressed sparse blocks (CSB), which allows both Ax and A^T x to be computed efficiently in parallel, where A is an n × n sparse matrix with nnz ≥ n nonzeros and x is a dense n-vector. Our algorithms use Θ(nnz) work (serial running time) and Θ(√n lg n) span (critical-path length), yielding a parallelism of Θ(nnz / (√n lg n)), which is amply high for virtually any large matrix. The storage requirement for CSB is essentially the same as that for the more standard compressed-sparse-rows (CSR) format, for which computing Ax in parallel is easy but A^T x is difficult. Benchmark results indicate that on one processor, the CSB algorithms for Ax and A^T x run just as fast as the CSR algorithm for Ax, but the CSB algorithms also scale up linearly with processors until limited by off-chip memory bandwidth.
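To make the key property concrete — one stored structure serving both Ax and A^T x — here is a minimal sketch of a CSB-like layout. It is illustrative only, not the paper's actual data structure: `csb_from_triples` and `csb_matvec` are hypothetical names, and a real CSB implementation packs block offsets far more compactly and traverses blocks in parallel.

```python
# Sketch of a compressed-sparse-blocks (CSB) style layout (illustrative,
# not the paper's actual structure): nonzeros are grouped into beta x beta
# blocks, and each entry keeps only its local (row, col) offsets, so both
# y = A x and y = A^T x can be formed by streaming the same blocks.
from collections import defaultdict

def csb_from_triples(n, triples, beta):
    """Group (i, j, v) triples into beta x beta blocks keyed by block index."""
    blocks = defaultdict(list)
    for i, j, v in triples:
        blocks[(i // beta, j // beta)].append((i % beta, j % beta, v))
    return blocks

def csb_matvec(n, blocks, beta, x, transpose=False):
    """Multiply by A (or A^T) using the same block structure either way."""
    y = [0.0] * n
    for (bi, bj), entries in blocks.items():
        for li, lj, v in entries:
            i, j = bi * beta + li, bj * beta + lj
            if transpose:
                y[j] += v * x[i]
            else:
                y[i] += v * x[j]
    return y

# Example: 4x4 matrix with nnz = 5, block size beta = 2.
triples = [(0, 0, 2.0), (0, 3, 1.0), (1, 1, 3.0), (2, 0, 4.0), (3, 2, 5.0)]
blocks = csb_from_triples(4, triples, beta=2)
x = [1.0, 2.0, 3.0, 4.0]
y = csb_matvec(4, blocks, 2, x)          # A x
yt = csb_matvec(4, blocks, 2, x, True)   # A^T x, no separate transposed copy
```

In CSR, by contrast, the transpose product needs either an explicit transposed copy or scattered column-wise updates, which is exactly the asymmetry the abstract highlights.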
Orderings for factorized sparse approximate inverse preconditioners
 SIAM J. SCI. COMPUT
, 2000
"... The influence of reorderings on the performance of factorized sparse approximate inverse preconditioners is considered. Some theoretical results on the effect of orderings on the fillin and decay behavior of the inverse factors of a sparse matrix are presented. It is shown experimentally that certa ..."
Abstract

Cited by 26 (9 self)
The influence of reorderings on the performance of factorized sparse approximate inverse preconditioners is considered. Some theoretical results on the effect of orderings on the fill-in and decay behavior of the inverse factors of a sparse matrix are presented. It is shown experimentally that certain reorderings, like minimum degree and nested dissection, can be very beneficial. The benefit consists of a reduction in the storage and time required for constructing the preconditioner, and of faster convergence of the preconditioned iteration in many cases of practical interest.
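The fill-in effect of an ordering can be seen with a toy "elimination game" (a standard illustration, not this paper's analysis): eliminating a vertex connects all of its remaining neighbours, and every edge created that way is fill. The function name below is illustrative.

```python
# Count fill edges created by symbolically eliminating vertices in a
# given order (toy elimination game on the adjacency graph of a matrix).
def fill_in(n, edges, order):
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    fill = 0
    for v in order:
        nbrs = list(adj[v])
        # Eliminating v turns its remaining neighbours into a clique;
        # each edge added to form that clique is a fill edge.
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                a, b = nbrs[i], nbrs[j]
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 1
        for u in nbrs:
            adj[u].discard(v)
        del adj[v]
    return fill

# Star graph: vertex 0 joined to vertices 1..4 (arrowhead matrix pattern).
star = [(0, k) for k in range(1, 5)]
bad = fill_in(5, star, [0, 1, 2, 3, 4])   # eliminate the hub first
good = fill_in(5, star, [1, 2, 3, 4, 0])  # minimum-degree-like order
```

Eliminating the hub first creates a clique on all four leaves (6 fill edges), while a minimum-degree-like order that takes the leaves first creates none — a miniature version of why orderings such as minimum degree help.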
M.: Generic Topology Mapping Strategies for Large-scale Parallel Architectures
 In: Proceedings of ICS’11
, 2011
"... Thesteadilyincreasingnumberofnodesinhighperformance computingsystemsandthetechnologyandpowerconstraints lead to sparse network topologies. Efficient mapping of application communication patterns to the network topology gains importance as systems grow to petascale and beyond. Such mapping is suppor ..."
Abstract

Cited by 25 (3 self)
The steadily increasing number of nodes in high-performance computing systems and the technology and power constraints lead to sparse network topologies. Efficient mapping of application communication patterns to the network topology gains importance as systems grow to petascale and beyond. Such mapping is supported in parallel programming frameworks such as MPI, but is often not well implemented. We show that the topology mapping problem is NP-complete and analyze and compare different practical topology mapping heuristics. We demonstrate an efficient and fast new heuristic which is based on graph similarity and show its utility with application communication patterns on real topologies. Our mapping strategies support heterogeneous networks and show significant reduction of congestion on torus, fat-tree, and the PERCS network topologies, for irregular communication patterns. We also demonstrate that the benefit of topology mapping grows with the network size and show how our algorithms can be used in a practical setting to optimize communication performance. Our efficient topology mapping strategies are shown to reduce network congestion by up to 80%, reduce average dilation by up to 50%, and improve benchmarked communication performance by
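A small sketch of what "topology mapping" means in this setting: place a weighted process-communication graph onto a network so that heavy communication travels few hops. The greedy heuristic below is a toy stand-in (not the paper's graph-similarity algorithm), and all names are illustrative; cost is total message volume times hop distance on a ring network.

```python
# Toy greedy topology mapping: place processes joined by the heaviest
# communication edges onto nearby nodes of a ring network.
def ring_dist(a, b, n):
    """Hop distance between nodes a and b on an n-node ring."""
    d = abs(a - b)
    return min(d, n - d)

def mapping_cost(mapping, comm, n):
    """Sum of message_volume * hop_distance (weighted dilation)."""
    return sum(w * ring_dist(mapping[p], mapping[q], n)
               for (p, q), w in comm.items())

def greedy_map(comm, n):
    edges = sorted(comm.items(), key=lambda kv: -kv[1])  # heaviest first
    mapping, free = {}, list(range(n))
    for (p, q), _w in edges:
        for proc in (p, q):
            if proc not in mapping:
                anchor = mapping.get(q if proc == p else p)
                if anchor is None:
                    mapping[proc] = free.pop(0)
                else:
                    # Put proc on the free node closest to its placed partner.
                    best = min(free, key=lambda node: ring_dist(node, anchor, n))
                    mapping[proc] = best
                    free.remove(best)
    return mapping

# Four processes, two heavy pairs, on a 4-node ring.
comm = {(0, 1): 10, (1, 2): 1, (2, 3): 10, (0, 3): 1}
placement = greedy_map(comm, 4)
cost = mapping_cost(placement, comm, 4)
```

Real heuristics like those in the paper must also handle heterogeneous networks, torus/fat-tree/PERCS topologies, and congestion rather than just dilation.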
An Out-of-Core Sparse Cholesky Solver
, 2009
"... Direct methods for solving large sparse linear systems of equations are popular because of their generality and robustness. Their main weakness is that the memory they require usually increases rapidly with problem size. We discuss the design and development of the first release of a new symmetric d ..."
Abstract

Cited by 23 (8 self)
Direct methods for solving large sparse linear systems of equations are popular because of their generality and robustness. Their main weakness is that the memory they require usually increases rapidly with problem size. We discuss the design and development of the first release of a new symmetric direct solver that aims to circumvent this limitation by allowing the system matrix, intermediate data, and the matrix factors to be stored externally. The code, which is written in Fortran and called HSL_MA77, implements a multifrontal algorithm. The first release is for positive-definite systems and performs a Cholesky factorization. Special attention is paid to the use of efficient dense linear algebra kernel codes that handle the full-matrix operations on the frontal matrix and to the input/output operations. The input/output operations are performed using a separate package that provides a virtual-memory system and allows the data to be spread over many files; for very large problems these may be held on more than one device. Numerical results are presented for a collection of 30 large real-world problems, all of which were solved successfully.
Metrics and models for reordering transformations
 In Proceedings of the 2nd ACM SIGPLAN Workshop on Memory System Performance (MSP '04)
, 2004
"... Irregular applications frequently exhibit poor performance on contemporary computer architectures, in large part because of their inefficient use of the memory hierarchy. Runtime data and iterationreordering transformations have been shown to improve the locality and therefore the performance of i ..."
Abstract

Cited by 22 (4 self)
Irregular applications frequently exhibit poor performance on contemporary computer architectures, in large part because of their inefficient use of the memory hierarchy. Run-time data- and iteration-reordering transformations have been shown to improve the locality and therefore the performance of irregular benchmarks. This paper describes models for determining which combination of run-time data- and iteration-reordering heuristics will result in the best performance for a given dataset. We propose that the data- and iteration-reordering transformations be viewed as approximating minimal linear arrangements on two separate hypergraphs: a spatial locality hypergraph and a temporal locality hypergraph. Our results measure the efficacy of locality metrics based on these hypergraphs in guiding the selection of data- and iteration-reordering heuristics. We also introduce new iteration- and data-reordering heuristics based on the hypergraph models that result in better performance than do previous heuristics.
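The hypergraph view can be illustrated with a tiny locality metric (not the paper's exact formulation): each data item is a net whose pins are the iterations touching it, and the span of a net under an iteration ordering approximates how long that item must stay live in cache. The function name is illustrative.

```python
# Total net span of data items under an iteration ordering: a simple
# proxy for temporal locality (smaller total span = better reuse).
def net_spans(accesses, order):
    """accesses: iteration -> set of data items; order: list of iterations."""
    pos = {it: k for k, it in enumerate(order)}
    spans = {}
    for it, items in accesses.items():
        for d in items:
            lo, hi = spans.get(d, (pos[it], pos[it]))
            spans[d] = (min(lo, pos[it]), max(hi, pos[it]))
    return sum(hi - lo for lo, hi in spans.values())

# Iterations 0 and 2 touch item "a"; iterations 1 and 3 touch item "b".
accesses = {0: {"a"}, 1: {"b"}, 2: {"a"}, 3: {"b"}}
before = net_spans(accesses, [0, 1, 2, 3])  # original iteration order
after = net_spans(accesses, [0, 2, 1, 3])   # reorder so "a"-iterations are adjacent
```

Reordering iterations so that uses of the same item are adjacent shrinks the total span, which is exactly the minimal-linear-arrangement objective the abstract describes the heuristics as approximating.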
Fault-tolerant iterative methods via selective reliability
, 2011
"... Current iterative methods for solving linear equations assume reliability of data (no “bit flips”) and arithmetic (correct up to rounding error). If faults occur, the solver usually either aborts, or computes the wrong answer without indication. System reliability guarantees consume energy or reduce ..."
Abstract

Cited by 21 (1 self)
Current iterative methods for solving linear equations assume reliability of data (no “bit flips”) and arithmetic (correct up to rounding error). If faults occur, the solver usually either aborts, or computes the wrong answer without indication. System reliability guarantees consume energy or reduce performance. As processor counts continue to grow, these costs will become unbearable. Instead, we show that if the system lets applications apply reliability selectively, we can develop iterations that compute the right answer despite faults. These “fault-tolerant” methods either converge eventually, at a rate that degrades gracefully with increased fault rate, or return a clear failure indication in the rare case that they cannot converge. If faults are infrequent, these algorithms spend most of their time in unreliable mode. This can save energy, improve performance, and avoid restarting from checkpoints. We illustrate convergence for a sample algorithm, Fault-Tolerant GMRES, for representative test problems and fault rates.
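A toy illustration of the selective-reliability idea (not the paper's FT-GMRES): run the expensive inner solve in unreliable mode, where it may return a corrupted update, but compute the residual and acceptance test in reliable mode, so a faulty step only delays convergence instead of poisoning it. All function names and the 2x2 problem are illustrative.

```python
# Iterative refinement with an unreliable inner solver and a reliable
# outer residual check (toy model of selective reliability).
import random

def solve_2x2(A, b):
    """Exact 2x2 solve by Cramer's rule (stands in for the inner solver)."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - b[1] * A[0][1]) / det,
            (b[1] * A[0][0] - b[0] * A[1][0]) / det]

def unreliable_inner(A, r, rng):
    d = solve_2x2(A, r)
    if rng.random() < 0.3:           # simulated bit flip in unreliable mode
        d = [d[0] * 100.0, d[1]]
    return d

def ft_refinement(A, b, rng, tol=1e-10, max_it=50):
    x = [0.0, 0.0]
    for _ in range(max_it):
        # Residual is computed in reliable mode.
        r = [b[i] - sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]
        if max(abs(v) for v in r) < tol:
            return x
        d = unreliable_inner(A, r, rng)
        # Reliable acceptance test: keep the update only if it reduces
        # the residual; a corrupted update is simply discarded.
        trial = [x[i] + d[i] for i in range(2)]
        rt = [b[i] - sum(A[i][j] * trial[j] for j in range(2)) for i in range(2)]
        if max(abs(v) for v in rt) < max(abs(v) for v in r):
            x = trial
    return x

x = ft_refinement([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], random.Random(0))
```

The convergence rate degrades with the fault rate (more rejected steps) but the reliably computed residual guarantees a faulty step is never silently accepted, mirroring the graceful-degradation claim in the abstract.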
Multilevel direct K-way hypergraph partitioning with multiple constraints and fixed vertices
, 2007
"... ..."
Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication
 In Proc. IPDPS
, 2011
"... Abstract—On multicore architectures, the ratio of peak memory bandwidth to peak floatingpoint performance (byte:flop ratio) is decreasing as core counts increase, further limiting the performance of bandwidth limited applications. Multiplying a sparse matrix (as well as its transpose in the unsymme ..."
Abstract

Cited by 21 (0 self)
On multicore architectures, the ratio of peak memory bandwidth to peak floating-point performance (byte:flop ratio) is decreasing as core counts increase, further limiting the performance of bandwidth-limited applications. Multiplying a sparse matrix (as well as its transpose in the unsymmetric case) with a dense vector is the core of sparse iterative methods. In this paper, we present a new multithreaded algorithm for the symmetric case which potentially cuts the bandwidth requirements in half while exposing lots of parallelism in practice. We also give a new data structure transformation, called bitmasked register blocks, which promises significant reductions on bandwidth requirements by reducing the number of indexing elements without introducing additional fill-in zeros. Our work shows how to incorporate this transformation into existing parallel algorithms (both symmetric and unsymmetric) without limiting their parallel scalability. Experimental results indicate that the combined benefits of bitmasked register blocks and the new symmetric algorithm can be as high as a factor of 3.5x in multicore performance over an already scalable parallel approach. We also provide a model that accurately predicts the performance of the new methods, showing that even larger performance gains are expected in future multicore systems as current trends (decreasing byte:flop ratio and larger sparse matrices) continue.
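A minimal serial sketch of why the symmetric case can halve bandwidth (illustrative only; the paper's contribution is doing this multithreaded without losing scalability): each stored lower-triangle entry a_ij is read once from memory but contributes to both y_i and y_j.

```python
# Symmetric sparse matrix-vector multiply storing only the lower
# triangle: half the nonzeros are read, each used twice.
def sym_spmv(n, lower, x):
    """lower: list of (i, j, v) with j <= i, representing symmetric A."""
    y = [0.0] * n
    for i, j, v in lower:
        y[i] += v * x[j]
        if i != j:
            y[j] += v * x[i]   # transpose contribution, no extra read of A
    return y

# A = [[2, 1, 0],
#      [1, 3, 4],
#      [0, 4, 5]] stored as its lower triangle only.
lower = [(0, 0, 2.0), (1, 0, 1.0), (1, 1, 3.0), (2, 1, 4.0), (2, 2, 5.0)]
y = sym_spmv(3, lower, [1.0, 2.0, 3.0])
```

The catch, and the reason the paper needs a new algorithm, is that the `y[j] +=` scatter creates write conflicts between threads; the serial sketch above hides exactly the problem the multithreaded version has to solve.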
A parallel approximation algorithm for the weighted maximum matching problem
 In Proc. Seventh Int. Conf. on Parallel Processing and Applied Mathematics (PPAM)
, 2007
"... Abstract. We consider the problem of computing a weighted edge matching in a large graph using a parallel algorithm. This problem has application in several areas of combinatorial scientific computing. Since an exact algorithm for the weighted matching problem is both fairly expensive to compute and ..."
Abstract

Cited by 20 (3 self)
We consider the problem of computing a weighted edge matching in a large graph using a parallel algorithm. This problem has applications in several areas of combinatorial scientific computing. Since an exact algorithm for the weighted matching problem is both fairly expensive to compute and hard to parallelise, we instead consider fast approximation algorithms. We analyse a distributed algorithm due to Hoepman [8] and show how this can be turned into a parallel algorithm. Through experiments using both complete as well as sparse graphs, we show that our new parallel algorithm scales well using up to 32 processors.
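The idea behind such 1/2-approximation algorithms can be shown with a sequential sketch of the "locally dominant edge" rule (illustrative; Hoepman's algorithm realizes it with message passing between vertices): an edge whose weight is largest among all edges touching either endpoint can always be matched, and repeating this yields at least half the optimal matching weight.

```python
# Greedy heaviest-edge-first matching: each accepted edge is locally
# dominant among the edges still available, giving a 1/2-approximation.
def dominant_matching(edges):
    """edges: list of (weight, u, v); returns the matched edges."""
    remaining = sorted(edges, reverse=True)  # heaviest edge first
    matched, matching = set(), []
    for w, u, v in remaining:
        if u not in matched and v not in matched:
            matching.append((w, u, v))
            matched.update((u, v))
    return matching

# Path a-b-c-d with weights 3, 4, 3: greedy keeps the locally dominant
# middle edge (weight 4); the optimal matching is {a-b, c-d} (weight 6).
edges = [(3, "a", "b"), (4, "b", "c"), (3, "c", "d")]
m = dominant_matching(edges)
weight = sum(w for w, _, _ in m)
```

The distributed variant removes the global sort: vertices repeatedly propose along their heaviest incident edge, and mutual proposals are matched, which is what makes the approach parallelisable.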
A Flexible Open-Source Toolbox for Scalable Complex Graph Analysis
, 2011
"... The Knowledge Discovery Toolbox (KDT) enables domain experts to perform complex analyses of huge datasets on supercomputers using a highlevel language without grappling with the difficulties of writing parallel code, calling parallel libraries, or becoming a graph expert. KDT provides a flexible Py ..."
Abstract

Cited by 19 (3 self)
The Knowledge Discovery Toolbox (KDT) enables domain experts to perform complex analyses of huge datasets on supercomputers using a high-level language without grappling with the difficulties of writing parallel code, calling parallel libraries, or becoming a graph expert. KDT provides a flexible Python interface to a small set of high-level graph operations; composing a few of these operations is often sufficient for a specific analysis. Scalability and performance are delivered by linking to a state-of-the-art backend compute engine that scales from laptops to large HPC clusters. KDT delivers very competitive performance from a general-purpose, reusable library for graphs on the order of 10 billion edges and greater. We demonstrate speedups of one and two orders of magnitude over PBGL and Pegasus, respectively, on some tasks. Examples from simple use cases and key graph-analytic benchmarks illustrate the productivity and performance realized by KDT users. Semantic graph abstractions provide both flexibility and high performance for real-world use cases. Graph-algorithm researchers benefit from the ability to develop algorithms quickly using KDT’s graph and underlying matrix abstractions for distributed memory. KDT is available as open-source code to foster experimentation.
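To give a flavor of the programming model, here is a tiny, hypothetical mini-API in the spirit of "a small set of high-level graph operations" — this is not KDT's actual interface, and the class and method names are invented for illustration. BFS is expressed as repeated frontier expansion, the operation a matrix-backed engine would implement as a sparse matrix-vector product over a boolean semiring.

```python
# Hypothetical KDT-flavored mini-API: a graph object exposing one
# high-level operation (BFS levels) built on frontier expansion.
class Graph:
    def __init__(self, n, edges):
        self.n = n
        self.adj = {u: [] for u in range(n)}
        for u, v in edges:
            self.adj[u].append(v)

    def bfs_levels(self, root):
        """Map each reachable vertex to its BFS depth from root."""
        level = {root: 0}
        frontier = {root}
        depth = 0
        while frontier:
            depth += 1
            # One "frontier expansion" step: in a matrix-backed engine this
            # is a boolean matvec of the adjacency with the frontier vector.
            nxt = {v for u in frontier for v in self.adj[u] if v not in level}
            for v in nxt:
                level[v] = depth
            frontier = nxt
        return level

g = Graph(4, [(0, 1), (0, 2), (1, 3), (2, 3)])
levels = g.bfs_levels(0)
```

The point of the KDT design is that a user writes only this kind of high-level composition, while the distributed-memory sparse matrix machinery underneath supplies the scalability.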