Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply
In Proceedings of Supercomputing, 2002
Cited by 57 (10 self)
We consider performance tuning, by code and data structure reorganization, of sparse matrix-vector multiply (SpMV), one of the most important computational kernels in scientific applications. This paper addresses the fundamental questions of what limits exist on such performance tuning, and how closely tuned code approaches these limits.
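For orientation, the untuned kernel that such work starts from can be sketched as follows. This is a minimal illustration assuming the common compressed sparse row (CSR) layout; the function name `spmv_csr` and the small example matrix are hypothetical and are not taken from the paper.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A x for a matrix A stored in CSR form (assumed layout)."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect access to x through col_idx is the irregular part.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Hypothetical 3x3 example: [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
y = spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

Each stored nonzero costs one multiply-add but several memory reads (value, column index, and an entry of x), which is why data structure reorganization matters for this kernel.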
Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY
In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS
Cited by 56 (6 self)
Sparse matrix-vector multiplication is an important computational kernel that tends to perform poorly on modern processors, largely because of its high ratio of memory operations to arithmetic operations. Optimizing this algorithm is difficult, both because of the complexity of memory systems and because the performance is highly dependent on the nonzero structure of the matrix. The Sparsity system is designed to address these problems by allowing users to automatically build sparse matrix kernels that are tuned to their matrices and machines. The most difficult aspect of optimizing these algorithms is selecting among a large set of possible transformations and choosing parameters, such as block size. In this paper we discuss the optimization of two operations: a sparse matrix times a dense vector and a sparse matrix times a set of dense vectors. Our experience indicates that for matrices arising in scientific simulations, register-level optimizations are critical, and we focus here on the optimizations and parameter selection techniques used in Sparsity for register-level optimizations. We demonstrate speedups of up to 2× for the single vector case and 5× for the multiple vector case.
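The register-level idea can be sketched with a fixed 2×2 block size. This is an illustrative sketch only, assuming a block compressed sparse row layout; the name `spmv_bcsr_2x2`, the array names, and the example matrix are hypothetical and do not reproduce the Sparsity code generator.

```python
def spmv_bcsr_2x2(brow_ptr, bcol_idx, blocks, x):
    """Compute y = A x for A stored as 2x2 blocks in block-CSR form.

    blocks[k] holds one dense block as (a00, a01, a10, a11), including
    any explicit zeros needed to fill the block.
    """
    n_brows = len(brow_ptr) - 1
    y = [0.0] * (2 * n_brows)
    for bi in range(n_brows):
        # Two accumulators stay in local variables (registers in compiled code).
        y0 = 0.0
        y1 = 0.0
        for k in range(brow_ptr[bi], brow_ptr[bi + 1]):
            j = 2 * bcol_idx[k]
            # One column index is read per block, and x0, x1 are each reused twice.
            x0, x1 = x[j], x[j + 1]
            a00, a01, a10, a11 = blocks[k]
            y0 += a00 * x0 + a01 * x1
            y1 += a10 * x0 + a11 * x1
        y[2 * bi] = y0
        y[2 * bi + 1] = y1
    return y


# Hypothetical 4x4 example: [[1,2,0,0],[3,4,0,0],[0,0,5,0],[0,0,0,6]]
brow_ptr = [0, 1, 2]
bcol_idx = [0, 1]
blocks = [(1.0, 2.0, 3.0, 4.0), (5.0, 0.0, 0.0, 6.0)]
y = spmv_bcsr_2x2(brow_ptr, bcol_idx, blocks, [1.0, 1.0, 1.0, 1.0])
```

The second block stores two explicit zeros. That fill is the cost side of the block-size choice the abstract calls the most difficult part of the optimization.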
When cache blocking sparse matrix vector multiply works and why
In Proceedings of the PARA’04 Workshop on the State-of-the-art in Scientific Computing, 2004
Cited by 28 (5 self)
We present new performance models and more compact data structures for cache blocking when applied to sparse matrix-vector multiply (SpM×V). We extend our prior models by relaxing the assumption that the vectors fit in cache and find that the new models are accurate enough to predict optimum block sizes. In addition, we determine criteria that predict when cache blocking improves performance. We conclude with architectural suggestions that would make memory systems execute SpM×V faster.
Combining Performance Aspects of Irregular Gauss-Seidel via Sparse Tiling
In 15th Workshop on Languages and Compilers for Parallel Computing (LCPC), 2002
Cited by 25 (12 self)
Finite Element problems are often solved using multigrid techniques. The most time-consuming part of multigrid is the iterative smoother, such as Gauss-Seidel. To improve performance, iterative smoothers can exploit parallelism, intra-iteration data reuse, and inter-iteration data reuse. Current methods for parallelizing Gauss-Seidel on irregular grids, such as multi-coloring and owner-computes based techniques, exploit parallelism and possibly intra-iteration data reuse but not inter-iteration data reuse. Sparse tiling techniques were developed to improve intra-iteration and inter-iteration data locality in iterative smoothers. This paper describes how sparse tiling can additionally provide parallelism. Our results show the effectiveness of Gauss-Seidel parallelized with sparse tiling techniques on shared-memory machines, specifically compared to owner-computes based Gauss-Seidel methods. The latter employ only parallelism and intra-iteration locality. Our results support the premise that better performance occurs when all three performance aspects (parallelism, intra-iteration, and inter-iteration data locality) are combined.
Sparse Tiling for Stationary Iterative Methods
International Journal of High Performance Computing Applications, 2004
Cited by 25 (8 self)
In modern computers, a program’s data locality can affect performance significantly. This paper details full sparse tiling, a runtime reordering transformation that improves the data locality for stationary iterative methods such as Gauss–Seidel operating on sparse matrices. In scientific applications such as finite element analysis, these iterative methods dominate the execution time. Full sparse tiling chooses a permutation of the rows and columns of the sparse matrix, and then an order of execution that achieves better data locality. We prove that full sparse-tiled Gauss–Seidel generates a solution that is bitwise identical to traditional Gauss–Seidel on the permuted matrix. We also present measurements of the performance improvements and the overheads of full sparse tiling and of cache blocking for irregular grids, a related technique developed by Douglas et al.
Optimal sparse matrix dense vector multiplication in the I/O-Model
2010
Cited by 25 (5 self)
We study the problem of sparse-matrix dense-vector multiplication (SpMV) in external memory. The task of SpMV is to compute y := Ax, where A is a sparse N × N matrix and x is a vector. We express sparsity by a parameter k, and for each choice of k consider the class of matrices where the number of nonzero entries is kN, i.e., where the average number of nonzero entries per column is k. We investigate the external worst-case complexity, i.e., the best possible upper bound on the number of I/Os, as a function of k, N and the parameters M (memory size) and B (track size) of the I/O-model. We determine this complexity up to a constant factor for all meaningful choices of these parameters, as long as k ≤ N^(1−ε), where ε depends on the problem variant. Our model of computation for the lower bound is a combination of the I/O-models of Aggarwal and Vitter, and of Hong and Kung. We study variants of the problem, differing in the memory layout of A. If A is stored in column-major layout, we prove that SpMV has I/O complexity Θ(min{(kN/B) · max{ ...
Performance models for evaluation and automatic tuning of symmetric sparse matrix-vector multiply
In Proceedings of the International Conference on Parallel Processing, 2004
Cited by 23 (4 self)
We present optimizations for sparse matrix-vector multiply (SpMV) and its generalization to multiple vectors, SpMM, when the matrix is symmetric: (1) symmetric storage, (2) register blocking, and (3) vector blocking. Combined with register blocking, symmetry saves more than 50% in matrix storage. We also show performance speedups of 2.1× for SpMV and 2.6× for SpMM, when compared to the best nonsymmetric register-blocked implementation. We present an approach for the selection of tuning parameters, based on empirical modeling and search, that consists of three steps: (1) off-line benchmark, (2) run-time search, and (3) heuristic performance model. This approach generally selects parameters to achieve performance within 85% of that achieved with exhaustive search. We evaluate our implementations with respect to upper bounds on performance. Our model bounds performance by considering only the cost of memory operations and using lower bounds on the number of cache misses. Our optimized codes are within 68% of the upper bounds.
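Symmetric storage, the first of the three optimizations, can be sketched as follows. This is a minimal illustration assuming that only the upper triangle (including the diagonal) is kept in CSR form; the name `spmv_sym_upper_csr` and the example matrix are hypothetical, and register and vector blocking are omitted.

```python
def spmv_sym_upper_csr(row_ptr, col_idx, values, x):
    """Compute y = A x for symmetric A, given only its upper triangle in CSR."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            a = values[k]
            # Contribution of the stored entry A[i][j].
            y[i] += a * x[j]
            if j != i:
                # The mirrored entry A[j][i] is never stored; reuse the same value.
                y[j] += a * x[i]
    return y


# Hypothetical symmetric example: [[2, 1, 0], [1, 3, 4], [0, 4, 5]]
row_ptr = [0, 2, 4, 5]
col_idx = [0, 1, 1, 2, 2]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
y = spmv_sym_upper_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

Each off-diagonal value is loaded once and used for two updates, so roughly half of the matrix data is read from memory compared with full storage.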
Better Tiling and Array Contraction for Compiling Scientific Programs
In Proceedings of the IEEE/ACM SC2002 Conference, 2002
Cited by 23 (1 self)
Scientific programs often include multiple loops over the same data; interleaving parts of different loops may greatly improve performance. We exploit this in a compiler for Titanium, a dialect of Java. Our compiler combines reordering optimizations such as loop fusion and tiling with storage optimizations such as array contraction (eliminating or reducing the size of temporary arrays).
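The combination of loop fusion and array contraction can be shown with a small hand-written example. This sketch is in Python rather than Titanium, applies the transformation by hand rather than with a compiler, and uses hypothetical function names.

```python
def unfused(a):
    """Two separate loops over the same data, with a full temporary array."""
    tmp = [2.0 * v for v in a]        # loop 1 writes a temporary of len(a)
    return [t + 1.0 for t in tmp]     # loop 2 reads the temporary back


def fused_contracted(a):
    """The same computation after loop fusion and array contraction."""
    out = [0.0] * len(a)
    for i, v in enumerate(a):
        t = 2.0 * v                   # temporary array contracted to a scalar
        out[i] = t + 1.0              # consumer runs in the same iteration
    return out


data = [0.5, 1.0, 1.5]
result = fused_contracted(data)
```

Once the two loops are fused, each temporary value is consumed in the iteration that produces it, so the temporary array can shrink to a single scalar.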