Results 1–10 of 16
Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication
In Proc. IPDPS, 2011
"... Abstract—On multicore architectures, the ratio of peak memory bandwidth to peak floatingpoint performance (byte:flop ratio) is decreasing as core counts increase, further limiting the performance of bandwidth limited applications. Multiplying a sparse matrix (as well as its transpose in the unsymme ..."
Cited by 21 (0 self)
Abstract—On multicore architectures, the ratio of peak memory bandwidth to peak floating-point performance (byte:flop ratio) is decreasing as core counts increase, further limiting the performance of bandwidth-limited applications. Multiplying a sparse matrix (as well as its transpose in the unsymmetric case) with a dense vector is the core of sparse iterative methods. In this paper, we present a new multithreaded algorithm for the symmetric case which potentially cuts the bandwidth requirements in half while exposing ample parallelism in practice. We also give a new data structure transformation, called bitmasked register blocks, which promises significant reductions in bandwidth requirements by reducing the number of indexing elements without introducing additional fill-in zeros. Our work shows how to incorporate this transformation into existing parallel algorithms (both symmetric and unsymmetric) without limiting their parallel scalability. Experimental results indicate that the combined benefits of bitmasked register blocks and the new symmetric algorithm can be as high as a factor of 3.5 in multicore performance over an already scalable parallel approach. We also provide a model that accurately predicts the performance of the new methods, showing that even larger performance gains are expected in future multicore systems as current trends (decreasing byte:flop ratio and larger sparse matrices) continue.
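The symmetric saving comes from storing only one triangle of the matrix and using each stored entry a(i,j) for both y_i += a(i,j)*x_j and y_j += a(i,j)*x_i, so roughly half the matrix data is streamed. A minimal sequential sketch in C over a CSR-stored lower triangle (the paper's multithreaded scheduling and bitmasked register blocks are not reproduced here; all names and the layout are illustrative):

#include <stddef.h>

/* Sequential symmetric SpMV sketch: y = A*x, where only the lower
 * triangle of A (including the diagonal) is stored in CSR format.
 * Each stored off-diagonal entry is used twice, halving the matrix
 * data streamed compared with a full CSR SpMV. */
void spmv_symmetric(size_t n, const size_t *rowptr, const size_t *colidx,
                    const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i) y[i] = 0.0;
    for (size_t i = 0; i < n; ++i) {
        for (size_t k = rowptr[i]; k < rowptr[i + 1]; ++k) {
            size_t j = colidx[k];      /* j <= i by assumption        */
            y[i] += val[k] * x[j];     /* lower-triangle contribution */
            if (j != i)
                y[j] += val[k] * x[i]; /* mirrored upper-triangle term */
        }
    }
}

The scattered updates to y[j] are exactly what makes a scalable multithreaded version nontrivial; exposing that parallelism without serializing those updates is the problem the paper's algorithm addresses.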
Runtime Data Flow Scheduling of Matrix Computations
2009
"... We investigate the scheduling of matrix computations expressed as directed acyclic graphs for sharedmemory parallelism. Because of the data granularity in this problem domain, even slight variations in load balance or data locality can greatly affect performance. Wellknown scheduling algorithms su ..."
Cited by 6 (3 self)
We investigate the scheduling of matrix computations expressed as directed acyclic graphs for shared-memory parallelism. Because of the data granularity in this problem domain, even slight variations in load balance or data locality can greatly affect performance. Well-known scheduling algorithms such as work stealing have proven time and space bounds, but these bounds do not provide a discernible indicator of relative performance between different scheduling algorithms and heuristics. We provide a flexible framework for scheduling matrix computations, which we use to empirically quantify different scheduling algorithms. By leveraging the cache coherence protocol to build software analogues of hardware techniques, we develop a scheduling algorithm that addresses both load balance and data locality simultaneously, and we show its performance benefits.
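The core mechanism shared by such runtime data-flow schedulers is dependency counting: a task becomes ready when all of its predecessors in the DAG have completed. A compact sequential sketch in C of that core (names are illustrative; the paper's locality-aware, coherence-inspired heuristics and its worker-thread pool are not reproduced):

#include <stdio.h>

#define MAX_TASKS 16
#define MAX_SUCC  4

/* A task in the DAG: unmet-dependency count plus successor list. */
typedef struct {
    int ndeps;              /* incoming edges not yet satisfied */
    int nsucc;
    int succ[MAX_SUCC];
} task_t;

/* Sequential sketch of data-flow execution order: run any task whose
 * dependency count is zero, then release its successors.  A real
 * runtime hands ready tasks to worker threads instead of one loop. */
static void run_dag(task_t *t, int n)
{
    int ready[MAX_TASKS], head = 0, tail = 0;
    for (int i = 0; i < n; ++i)
        if (t[i].ndeps == 0) ready[tail++] = i;
    while (head < tail) {
        int cur = ready[head++];
        printf("executing task %d\n", cur);  /* stand-in for real work */
        for (int s = 0; s < t[cur].nsucc; ++s)
            if (--t[t[cur].succ[s]].ndeps == 0)
                ready[tail++] = t[cur].succ[s];
    }
}

int main(void)
{
    /* Diamond DAG 0 -> {1,2} -> 3, as in a 2x2 blocked factorization. */
    task_t t[4] = {
        { 0, 2, {1, 2} }, { 1, 1, {3} }, { 1, 1, {3} }, { 2, 0, {0} },
    };
    run_dag(t, 4);
    return 0;
}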
On shared-memory parallelization of a sparse matrix scaling algorithm
"... Abstract—We discuss efficient shared memory parallelization of sparse matrix computations whose main traits resemble to those of the sparse matrixvector multiply operation. Such computations are difficult to parallelize because of the relatively small computational granularity characterized by smal ..."
Cited by 6 (0 self)
Abstract—We discuss efficient shared-memory parallelization of sparse matrix computations whose main traits resemble those of the sparse matrix-vector multiply operation. Such computations are difficult to parallelize because of their relatively small computational granularity, characterized by a small number of operations per data access. Our main application is a sparse matrix scaling algorithm which is more memory-bound than the sparse matrix-vector multiplication operation. We parallelize the application using standard OpenMP programming principles. Apart from the usual constructs for avoiding race conditions, we do not reorganize the algorithm. Rather, we identify the associated performance metrics and describe models to optimize them. Using these models, we implement parallel matrix scaling algorithms for two well-known sparse matrix storage formats. Experimental results show that simple parallelization attempts which leave data/work partitioning to the runtime scheduler can suffer from the overhead of avoiding race conditions, especially as the number of threads increases. The proposed algorithms perform better by optimizing the identified performance metrics and reducing this overhead. Keywords—Shared-memory parallelization, sparse matrices, hypergraphs, matrix scaling
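To make the race-condition issue concrete, here is a minimal OpenMP sketch of one sweep of a scaling iteration over a CSR matrix, assuming for illustration a Sinkhorn-Knopp-style iteration with row and column scaling vectors dr and dc (the paper's specific algorithm and its optimized partitioning schemes are not reproduced; all names are illustrative). Row sums are thread-private, but column sums hit arbitrary indices and need protection:

#include <math.h>

/* One sweep: accumulate per-row and per-column sums of the scaled
 * entry magnitudes of a CSR matrix.  Rows partition naturally over
 * threads; column updates are the shared, racy part, resolved here
 * naively with atomics.  colsum must be zeroed by the caller. */
void scaling_sweep(int n, const int *rowptr, const int *colidx,
                   const double *val, const double *dr, const double *dc,
                   double *rowsum, double *colsum)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        rowsum[i] = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; ++k) {
            double a = fabs(dr[i] * val[k] * dc[colidx[k]]);
            rowsum[i] += a;             /* private to this thread */
            #pragma omp atomic
            colsum[colidx[k]] += a;     /* shared: needs protection */
        }
    }
}

The per-nonzero atomic is the overhead the abstract refers to; the paper's models aim to avoid it through explicit data/work partitioning rather than leaving the split to the runtime scheduler.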
Hypergraph partitioning through vertex separators on graphs
2010
"... The modeling flexibility provided by hypergraphs has drawn a lot of interest from the combinatorial scientific community, leading to novel models and algorithms, their applications, and development of associated tools. Hypergraphs are now a standard tool in combinatorial scientific computing. The ..."
Cited by 2 (2 self)
The modeling flexibility provided by hypergraphs has drawn a lot of interest from the combinatorial scientific computing community, leading to novel models and algorithms, their applications, and the development of associated tools. Hypergraphs are now a standard tool in combinatorial scientific computing. The modeling flexibility of hypergraphs, however, comes at a cost: algorithms on hypergraphs are inherently more complicated than those on graphs, which sometimes translates into nontrivial increases in processing times. Neither the modeling flexibility of hypergraphs nor the runtime efficiency of graph algorithms can be overlooked; therefore, the new research thrust should be how to cleverly trade off between the two. This work addresses one method for this trade-off by solving the hypergraph partitioning problem through vertex separators on graphs. Specifically, we investigate how to solve the hypergraph partitioning problem by seeking a vertex separator on its net intersection graph (NIG), where each net of the hypergraph is represented by a vertex, and two vertices share an edge if their nets have a common vertex. We propose a vertex-weighting scheme to attain good node balance, since the NIG model cannot preserve node balancing information. Vertex-removal and vertex-splitting techniques are described to optimize the cut-net and connectivity metrics, respectively, under the recursive bipartitioning paradigm. We also developed implementations of our proposed hypergraph partitioning formulations by adopting and modifying a state-of-the-art graph partitioning by vertex separator tool, onmetis.
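The net intersection graph itself is straightforward to construct from the abstract's definition: for each hypergraph vertex, scan the list of nets containing it and connect every pair of those nets. A small self-contained sketch in C (data layout and names are illustrative, not the paper's):

#include <stdio.h>

#define MAX_NETS 8

/* Sketch: build the net intersection graph (NIG) of a hypergraph.
 * Each net becomes a NIG vertex; two NIG vertices are adjacent iff
 * their nets share at least one hypergraph vertex (pin). */
int main(void)
{
    /* pins[v] = nets containing hypergraph vertex v (-1 terminated) */
    int pins[3][MAX_NETS] = { {0, 1, -1}, {1, 2, -1}, {0, 2, -1} };
    int adj[MAX_NETS][MAX_NETS] = {{0}};

    for (int v = 0; v < 3; ++v)
        for (int a = 0; pins[v][a] != -1; ++a)
            for (int b = a + 1; pins[v][b] != -1; ++b)
                adj[pins[v][a]][pins[v][b]] =
                adj[pins[v][b]][pins[v][a]] = 1;

    for (int i = 0; i < 3; ++i)
        for (int j = i + 1; j < 3; ++j)
            if (adj[i][j]) printf("NIG edge: net %d -- net %d\n", i, j);
    return 0;
}

A vertex separator on this graph then splits the nets into two groups plus a separator set, which is what the paper maps back to a hypergraph partition.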
Hardware Acceleration Technologies in Computer Algebra: Challenges and Impact (Thesis format: Monograph)
"... The objective of high performance computing (HPC) is to ensure that the computational power of hardware resources is well utilized to solve a problem. Various techniques are usually employed to achieve this goal. Improvement of algorithm to reduce the number of arithmetic operations, modifications ..."
Cited by 2 (1 self)
The objective of high performance computing (HPC) is to ensure that the computational power of hardware resources is well utilized to solve a problem. Various techniques are usually employed to achieve this goal: improving algorithms to reduce the number of arithmetic operations, modifying data access patterns or rearranging data in order to reduce memory traffic, optimizing code at all levels, and designing parallel algorithms with smaller span or reduced overhead are some of the areas that HPC researchers work on. In this thesis, we investigate HPC techniques for implementing basic routines in computer algebra, targeting hardware acceleration technologies. We start with a sorting algorithm and its application to sparse matrix-vector multiplication, for which we focus on cache complexity issues. Since basic routines in computer algebra often provide a lot of fine-grained parallelism, we then turn our attention to manycore architectures, on which we consider dense polynomial and matrix operations ranging from plain to fast arithmetic. Most of these operations are combined within a bivariate system solver running entirely on a graphics processing unit (GPU).
Abusing a hypergraph partitioner for unweighted graph partitioning
CONTEMPORARY MATHEMATICS, 2013
"... ..."
A GEOMETRIC APPROACH TO MATRIX ORDERING
"... Abstract. We present a recursive way to partition hypergraphs which creates and exploits hypergraph geometry and is suitable for manycore parallel architectures. Such partitionings are then used to bring sparse matrices in a recursive Bordered Block Diagonal form (for processoroblivious parallel L ..."
Cited by 1 (0 self)
Abstract. We present a recursive way to partition hypergraphs which creates and exploits hypergraph geometry and is suitable for manycore parallel architectures. Such partitionings are then used to bring sparse matrices into a recursive Bordered Block Diagonal form (for processor-oblivious parallel LU decomposition) or a recursive Separated Block Diagonal form (for cache-oblivious sparse matrix–vector multiplication). We show that the quality of the obtained partitionings and orderings is competitive by comparing the obtained fill-in for LU decomposition with SuperLU (with better results for 8 of the 28 test matrices) and by comparing cut sizes for sparse matrix–vector multiplication with Mondriaan (with better results for 4 of the 12 test matrices). The main advantage of the new method is its speed: it is on average 21.6 times faster than Mondriaan.
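To illustrate the Separated Block Diagonal idea in its simplest one-level, 1-D form: after a two-way partition of the columns, rows touching only part-0 columns come first, mixed "separator" rows in the middle, and rows touching only part-1 columns last, so an SpMV sweep reuses one region of the input vector at a time. A self-contained sketch in C (the paper's construction is recursive and hypergraph-driven; this example and its names are illustrative):

#include <stdio.h>

/* One-level 1-D SBD row ordering from a two-way column partition. */
int main(void)
{
    enum { N = 6 };
    /* 6x6 sparsity pattern in CSR; colpart[j] in {0,1} per column */
    int rowptr[N + 1] = {0, 2, 4, 6, 8, 10, 12};
    int colidx[12]    = {0, 1, 3, 4, 0, 2, 2, 5, 1, 2, 4, 5};
    int colpart[N]    = {0, 0, 0, 1, 1, 1};

    int perm[N], k = 0;
    /* Emit rows in three passes: only-part-0, mixed, only-part-1. */
    for (int pass = 0; pass < 3; ++pass) {
        int target = (pass == 0) ? 0 : (pass == 1) ? 2 : 1;
        for (int i = 0; i < N; ++i) {
            int seen0 = 0, seen1 = 0;
            for (int p = rowptr[i]; p < rowptr[i + 1]; ++p)
                colpart[colidx[p]] ? (seen1 = 1) : (seen0 = 1);
            int cls = (seen0 && seen1) ? 2 : seen1 ? 1 : 0;
            if (cls == target) perm[k++] = i;
        }
    }
    printf("SBD row order:");
    for (int i = 0; i < N; ++i) printf(" %d", perm[i]);
    printf("\n");
    return 0;
}

Applying this recursively within each diagonal block yields the cache-oblivious ordering the abstract describes.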
PARTITIONING HYPERGRAPHS IN SCIENTIFIC COMPUTING APPLICATIONS THROUGH VERTEX SEPARATORS ON GRAPHS
"... Abstract. The modeling flexibility provided by hypergraphs has drawn a lot of interest from the combinatorial scientific community, leading to novel models and algorithms, their applications, and development of associated tools. Hypergraphs are now a standard tool in combinatorial scientific computi ..."
Abstract. The modeling flexibility provided by hypergraphs has drawn a lot of interest from the combinatorial scientific computing community, leading to novel models and algorithms, their applications, and the development of associated tools. Hypergraphs are now a standard tool in combinatorial scientific computing. The modeling flexibility of hypergraphs, however, comes at a cost: algorithms on hypergraphs are inherently more complicated than those on graphs, which sometimes translates into nontrivial increases in processing times. Neither the modeling flexibility of hypergraphs nor the runtime efficiency of graph algorithms can be overlooked; therefore, the new research thrust should be how to cleverly trade off between the two. This work addresses one method for this trade-off by solving the hypergraph partitioning problem through vertex separators on graphs. Specifically, we investigate how to solve the hypergraph partitioning problem by seeking a vertex separator on its net intersection graph (NIG), where each net of the hypergraph is represented by a vertex, and two vertices share an edge if their nets have a common vertex. We propose a vertex-weighting scheme to attain good node balance, since the NIG model cannot preserve node balancing information. Vertex-removal and vertex-splitting techniques are described to optimize the cut-net and connectivity metrics, respectively, under the recursive bipartitioning paradigm. We also developed implementations of our proposed hypergraph partitioning formulations by adopting and modifying a state-of-the-art graph partitioning by vertex separator tool, onmetis. Experiments conducted on a large collection of sparse matrices demonstrate the effectiveness of the proposed techniques. Key words: hypergraph partitioning; combinatorial scientific computing; graph partitioning by vertex separator; sparse matrices.
MulticoreBSP for C: a high-performance library for shared-memory parallel programming
2013
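MulticoreBSP for C implements the classic BSPlib primitives (bsp_begin, bsp_end, bsp_init, bsp_pid, bsp_nprocs, bsp_sync). A minimal SPMD sketch under that assumption (the header name mcbsp.h is assumed; consult the library's documentation for the exact include):

#include <stdio.h>
#include "mcbsp.h"   /* MulticoreBSP for C header (assumed name) */

/* Minimal BSPlib-style SPMD program: every thread prints its ID;
 * bsp_sync is the barrier that ends a superstep. */
static void spmd(void)
{
    bsp_begin(bsp_nprocs());           /* spawn one thread per core */
    printf("hello from thread %u of %u\n",
           (unsigned)bsp_pid(), (unsigned)bsp_nprocs());
    bsp_sync();                        /* end of superstep */
    bsp_end();
}

int main(int argc, char **argv)
{
    bsp_init(&spmd, argc, argv);       /* register the SPMD entry point */
    spmd();
    return 0;
}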