S. Toledo. Improving the memory-system performance of sparse-matrix vector multiplication. IBM Journal of Research and Development, 41(6):711--726, 1997.
....this sparsity structure. The most tractable solution involves presenting to the user many different specialized storage formats, and allowing the user to choose the format that best exploits his or her sparsity structure in order to maximize cache reuse. This approach is widely used today, as in [31, 11, 37, 45, 47, 28]. The advantage of this solution is that the input format is fixed and assumed to be appropriate to the data structure, just as with dense BLAS. Choosing one of the more optimizable data structures (such as one of the block compressed storage schemes) should allow us to directly leverage the ....
S. Toledo. Improving memory-system performance of sparse-matrix vector multiplication. In Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing, SIAM, 1997.
....bound assumes that accesses to the source vector cache lines will always miss, due either to capacity or conflict misses. This suggests that the non-FEM matrices, possibly due to their particular sparsity patterns, exhibit more conflicts or reduced spatial locality. Some form of matrix reordering [30, 19, 24, 15], or the use of multiple rc block sizes are likely to be the most effective way to address this performance issue. On the Power3, the performance of all implementations falls between 60-70% of the estimated upper bound, a smaller fraction than on the other machines. One factor differentiating ....
....27 First, matrices without natural block structures remain difficult. Techniques such as reordering the rows and columns to create more blocks, using multiple block sizes, or cache blocking (storing large rectangular submatrices as separate sparse matrices) show promise and we are pursuing them [16, 30, 19, 24]. Second, our register blocking techniques have not been very effective on the Power3 architecture. We need to determine whether we can do better. Third, we need to exploit more structure in the sparse matrices that will let us improve data reuse. For example, if the matrix is symmetric, then ....
S. Toledo. Improving memory-system performance of sparse matrix-vector multiplication. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, March 1997.
....an efficient version of the algorithm, reminiscent of the block nested loops join algorithm, that substantially lowers the main memory requirements. Similar strategies for fast in-memory matrix vector multiplies are commonly used in the scientific computing community for improving caching behavior [9, 14]. As we are dealing with massive web repositories in which the data structures are measured in gigabytes and are stored on disk, we will take a more I/O-centric approach in presenting the algorithm and its cost analysis. Empirical timing results for both the naive and block based approaches are ....
S. Toledo. Improving the memory-system performance of sparse-matrix vector multiplication. In IBM Journal of Research and Development, volume 41. IBM, 1997.
....This type of analysis is orthogonal to ours, and it is likely that the combination would prove useful. The Bernoulli compiler also takes a program written for dense matrices and compiles it for sparse ones, although it does not specialize the code to a particular matrix structure. Toledo [Tol97] demonstrated some of the performance benefits of register blocking, including a scheme that mixed multiple block sizes in a single matrix, and PETSc (Portable, Extensible Toolkit for Scientific Computation) [BGMS00] uses an application-specified notion of register blocking for Finite Element ....
Sivan Toledo. Improving memory-system performance of sparse matrix-vector multiplication. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, March 1997. 1 This research is supported in part by U.S. Army Research Office, by the Department of Energy and by Kookmin University, Korea.
....from the cache location. Since contiguous data tends to be accessed in programs, a cache will insert not only data currently needed, but also some items around that too to get more cache hits by guessing it will need what is nearby; this is spatial locality [83]. If data is not consistently accessed, is widely separated from other data, or is requested for the first time, we have the risk of a cache miss. For a direct mapped cache, data is retrieved from memory, a location is isolated in the cache and before it is placed there any existing item is ejected ....
....at a cache location, but they are slower than the direct mapped cache due to scanning operations that must be performed to try to get a cached item or signal a cache miss. Cache misses can delay a CPU for hundreds of cycles [38] but they are easily caused, such as through sparse matrix computations [83]. Research into software manipulation of caches is underway which will be of benefit to optimization software [38]. RAM and the hard disk drive have the slowest access times, but sizes typically in hundreds of megabytes, and many tens of gigabytes respectively. RAM works at the nanosecond level, ....
S. Toledo. Improving the memory-system performance of sparse-matrix vector multiplication. IBM Journal Of Research And Development, 41(6):711-725, November 1997.
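The direct mapped behavior described in the context above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `count_misses`, the cache geometry (4 lines of 8 addresses each) and the two access patterns are hypothetical and are not taken from the cited works.

```python
def count_misses(addresses, num_lines=4, line_size=8):
    """Count misses in a toy direct mapped cache (hypothetical geometry)."""
    tags = [None] * num_lines        # memory block currently held by each line
    misses = 0
    for addr in addresses:
        block = addr // line_size    # memory block containing this address
        slot = block % num_lines     # the single line this block may occupy
        if tags[slot] != block:      # first-time or conflict miss
            tags[slot] = block       # eject the existing item and refill
            misses += 1
    return misses


# Contiguous accesses benefit from spatial locality: one miss per line.
sequential = list(range(32))
# Widely separated addresses that all map to the same line keep ejecting
# each other, so every access misses.
scattered = [0, 32, 64, 96] * 8

print(count_misses(sequential))
print(count_misses(scattered))
```

With this geometry the contiguous pattern misses once per cache line touched, while the scattered pattern misses on every access, which is the kind of irregular access a sparse matrix computation can generate.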
....the graph [7]. Meanwhile, great strides have been made in new algorithms and hand optimizations of unstructured problems for uniprocessors with caches, and shared and distributed memory multiprocessors. The optimization techniques include identification of dense subblocks within a sparse matrix [102, 60] to improve cache performance, use of non-blocking remote memory operations to enable overlap with fine grained communication [24], graph partitioning algorithms to simultaneously optimize communication and load balance, multi-dimensional blocked layouts of sparse matrices [89] and ....
....and the opposite transformation is useful for data that is stored with a pointer but frequently is accessed with the data at its source. Sorting structures in memory can reduce cache misses, although the overhead of applying this transformation often outweighs the benefit. Related work by others [102, 60, 53] identifies several data structure reorganizations for sparse matrices that improve cache performance. Even the most basic of these requires information about the lack of aliases into the middle of a data structure. Dropping down a level in memory hierarchy to communication between processors, ....
S. Toledo. Improving memory-system performance of sparse matrix-vector multiplication. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, Mar. 1997.
....each category can also be specified. Two simple heuristics can be used to order the vertices: random order, and a natural order. Meshes may be initially ordered in a block regular order (i.e. an assemblage of logically regular blocks) or ordered in a cache optimizing order like Cuthill McKee [29]. Both of these ordering types are what we call natural orders, and we assume that the initial order of our mesh is of this type (if not then we can make it so). [Figure 7: MIS and coarse mesh; panels: Coarse Grid, Fine Grid] The MISs produced from natural orderings tend to be rather dense, random ....
S. Toledo. Improving memory-system performance of sparse matrix-vector multiplication. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, March 1997.
.... [code listing fragment: last iteration, no prefetch] 2.2 Register Blocking. We reduce the number of memory accesses by register blocking, i.e. splitting the matrix A into a sum of matrices A = A_1 + ... + A_m, consisting of small dense blocks of a fixed size [3]. When multiplying with such a matrix consisting of small dense blocks the code has to load fewer indices j because only one is needed per block. When multiplying with a dense block, elements of x and y can be loaded once and reused several times. In our approach we store at least two matrices: ....
....to reduce its bandwidth. Because of the smaller bandwidth it is likely that during matrix vector multiplication vector elements that are accessed in a particular matrix row will be accessed again in the following row. Thus matrix reordering can reduce cache misses that accesses to x and y generate [3] and more importantly also lowers the number of messages to be sent in the parallel implementation (see section 3). [Table: Name: cav1 | cav2; Size of matrix: 17215 x 17215 | 54295 x 54295; # nonzeros: 929159 | 3172021; # nonzeros per row: 26.49 | 28.71; Storage: 4.41 MBytes | 18.46 MBytes; Properties: symmetric ....]
[Article contains additional citation context not shown here]
S. Toledo, Improving memory-system performance of sparse matrix-vector multiplication, in: Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing, SIAM, 1997.
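The bandwidth-reducing reordering mentioned in the context above can be sketched with a simplified Reverse Cuthill McKee pass. This is an illustrative sketch, not the method of any cited paper: the helper names `bandwidth` and `reverse_cuthill_mckee`, the dictionary-of-sets adjacency layout and the six-vertex example graph are all assumptions.

```python
from collections import deque


def bandwidth(adj, order):
    """Largest |position(u) - position(v)| over all edges, for a vertex order."""
    pos = {v: i for i, v in enumerate(order)}
    return max((abs(pos[u] - pos[v]) for u in adj for v in adj[u]), default=0)


def reverse_cuthill_mckee(adj):
    """Simplified RCM: breadth-first search from low-degree vertices,
    visiting neighbours in order of increasing degree, then reversing."""
    visited = set()
    order = []
    for start in sorted(adj, key=lambda v: len(adj[v])):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in sorted(adj[v], key=lambda u: len(adj[u])):
                if w not in visited:
                    visited.add(w)
                    queue.append(w)
    return order[::-1]


# Hypothetical example: a chain 0-3-1-5-2-4 whose labels are scrambled,
# stored as a symmetric adjacency structure (the nonzero pattern of A).
edges = [(0, 3), (3, 1), (1, 5), (5, 2), (2, 4)]
adj = {v: set() for v in range(6)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

original = list(range(6))
reordered = reverse_cuthill_mckee(adj)
print(bandwidth(adj, original), bandwidth(adj, reordered))
```

In this small example the reordered chain has every nonzero next to the diagonal, so consecutive rows touch neighbouring entries of x and y.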
....to order the vertices: random order, or a natural order. Meshes may be initially ordered in either a block regular order (i.e. an assemblage of logically regular blocks) but this depends on the mesh generation method used. Initial vertex orders can also be ordered in a cache optimizing order [82] like Cuthill McKee. Both of these ordering types are what we call natural orders, and we assume that the initial order of our mesh is of this type (if not then we can make it so). The MISs produced from natural orderings tend to be rather dense, random ordering on the other hand tend to be more ....
....of the memory hierarchy on today's computers. Our problems (multiple degrees of freedom per vertex) are naturally partitioned for registers, though perhaps not optimally. Local matrix vector products require matrix ordering to optimize cache performance (utilizing a graph partitioner is one method [82]). In the case of CLUMPs, we partition our domain to the SMPs first, as communication is much faster (at least in theory) within an SMP, we then partition to each processor within the SMP. We use a flat MPI programming model (one thread per processor) the message passing layer can be optimized ....
Sivan Toledo. Improving memory-system performance of sparse matrix-vector multiplication. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, March 1997.
....each category can also be specified. Two simple heuristics can be used to order the vertices: a natural order and a random order. Meshes may be initially ordered in a block regular order (ie, an assemblage of logically regular blocks) or ordered in a cache optimizing order like Cuthill McKee [24]; both of these ordering types are what we call natural orders. The MISs produced from natural orderings tend to be rather dense, random ordering on the other hand tend to be more sparse. That is, the MISs with natural orderings tend to be larger than those produced with random orders. For a ....
S. Toledo. Improving memory-system performance of sparse matrix-vector multiplication. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, March 1997.
....techniques. Exceptions to this are all concerned with the multiplication of a sparse matrix with a vector; these techniques cannot, however, be generalized easily to handle the multiplication of two sparse matrices. The sparse matrix vector multiplication technique proposed by Toledo [Tol 96] uses cache access reordering, inspired by the work of Das et al. in [DMS 94], to speed up matrix vector multiplication. Another work along similar lines is reported in [TeJa 92]. Here, the cache access pattern for a direct mapped cache during the multiplication of a sparse matrix and a vector ....
S. Toledo. "Improving Memory--System Performance of Sparse Matrix--Vector Multiplication". Proc. 8th SIAM Conf. on Parallel Processing for Scientific Computing, March 1997.
....by the U.S. Department of Energy through the University of California under subcontract number B341494. 1 We propose alternative data structures, as well as reordering algorithms to increase the effectiveness of those data structures, to reduce the number of memory indirections in SpMxV. Toledo [7] proposed identifying 1 × 2 blocks of a matrix and writing the matrix as the sum of two matrices, the first of which contains all the 1 × 2 blocks and the second contains the remaining entries. Thus, the number of memory indirections is reduced to only one for each 1 × 2 block. In ....
....of the bandwidth for various cache parameters. Burgess and Giles [2] experimented with various ordering algorithms and found that reordering the matrix improves performance compared with random ordering, but they did not detect a notable sensitivity to the particular ordering method used. Toledo [7] studied reordering, along with other techniques, and reported that Reverse Cuthill McKee ordering yielded slightly better performance, but the differences were not significant. Another problem with SpMxV is that the ratio of load operations is higher than with dense matrix operations. One extra ....
[Article contains additional citation context not shown here]
S. Toledo, "Improving Memory-System Performance of Sparse Matrix-Vector Multiplication", Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, March 1997.
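The idea attributed to Toledo in the context above, writing a matrix as the sum of a matrix of 1 × 2 blocks and a matrix of remaining entries, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `split_1x2` and `spmv_split`, the list-of-dicts row format and the sample matrix are hypothetical.

```python
def split_1x2(rows):
    """Split each sparse row ({column: value}) into 1 x 2 blocks of
    horizontally adjacent nonzeros and a remainder of single entries."""
    blocks, rest = [], []
    for row in rows:
        row_blocks, row_rest = [], []
        cols = sorted(row)
        k = 0
        while k < len(cols):
            j = cols[k]
            if k + 1 < len(cols) and cols[k + 1] == j + 1:
                row_blocks.append((j, row[j], row[j + 1]))  # one index, two values
                k += 2
            else:
                row_rest.append((j, row[j]))
                k += 1
        blocks.append(row_blocks)
        rest.append(row_rest)
    return blocks, rest


def spmv_split(blocks, rest, x):
    """Compute y = A x where A is the sum of the block part and the remainder."""
    y = []
    for row_blocks, row_rest in zip(blocks, rest):
        s = 0.0
        for j, a0, a1 in row_blocks:   # a single column index serves two entries
            s += a0 * x[j] + a1 * x[j + 1]
        for j, a in row_rest:
            s += a * x[j]
        y.append(s)
    return y


# Hypothetical 3 x 4 matrix with 7 nonzeros.
rows = [{0: 1.0, 1: 2.0, 3: 3.0}, {2: 4.0}, {1: 5.0, 2: 6.0, 3: 7.0}]
x = [1.0, 2.0, 3.0, 4.0]
blocks, rest = split_1x2(rows)
y = spmv_split(blocks, rest, x)
print(y)
```

For this sample matrix the split stores 5 column indices (2 blocks plus 3 single entries) instead of 7, which is where the reduction in memory indirections comes from.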
....each category can also be specified. Two simple heuristics can be used to order the vertices: random order, and a natural order. Meshes may be initially ordered in a block regular order (i.e. an assemblage of logically regular blocks) or ordered in a cache optimizing order like Cuthill McKee [24]. Both of these ordering types are what we call natural orders, and we assume that the initial order of our mesh is of this type (if not then we can make it so) The MISs produced from natural orderings tend to be rather dense, random ordering on the other hand tend to be more sparse. That is, ....
Sivan Toledo. Improving memory-system performance of sparse matrix-vector multiplication. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, March 1997.
No context found.
Sivan Toledo. Improving memory-system performance of sparse matrix-vector multiplication. IBM Journal of Research and Development, 41(6):711--725, 1997.
No context found.
S. Toledo. Improving the memory-system performance of sparse-matrix vector multiplication. IBM Journal of Research and Development, 41(6):711--726, 1997.
No context found.
Sivan Toledo. Improving memory-system performance of sparse matrix-vector multiplication. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, March 1997.
No context found.
Sivan Toledo. Improving memory-system performance of sparse matrix-vector multiplication. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, March 1997.
No context found.
S. Toledo. Improving the memory-system performance of sparse-matrix vector multiplication. IBM Journal of Research and Development, 41(6):711--726, 1997.
No context found.
Sivan Toledo. Improving memory-system performance of sparse matrix-vector multiplication. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, March 1997.
No context found.
Sivan Toledo. Improving memory-system performance of sparse matrix-vector multiplication. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, March 1997.
No context found.
S. Toledo. Improving the memory-system performance of sparse-matrix vector multiplication. IBM Journal of Research and Development, 41(6):711--725, 1997.
No context found.
S. Toledo, Improving the memory-system performance of sparse matrix-vector multiplication, IBM Journal of Research and Development 41 (6).
No context found.
S. Toledo. Improving the memory-system performance of sparse-matrix vector multiplication. IBM J. Res. and Dev., 41:711--725, 1997.
No context found.
S. Toledo. Improving the memory-system performance of sparse-matrix vector multiplication. IBM J. Res. and Dev., 41:711--725, 1997.
No context found.
S. Toledo. Improving memory-system performance of sparse-matrix vector multiplication. In Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing. SIAM, 1997.