Results 1  10
of
40
Optimizing the performance of sparse matrixvector multiplication
, 2000
"... Copyright 2000 by EunJin Im ..."
(Show Context)
Tuning Strassen's Matrix Multiplication for Memory Efficiency
 IN PROCEEDINGS OF SC98 (CDROM
, 1998
"... Strassen's algorithm for matrix multiplication gains its lower arithmetic complexity at the expense of reduced locality of reference, which makes it challenging to implement the algorithm efficiently on a modern machine with a hierarchical memory system. We report on an implementation of thi ..."
Abstract

Cited by 43 (5 self)
 Add to MetaCart
Strassen's algorithm for matrix multiplication gains its lower arithmetic complexity at the expense of reduced locality of reference, which makes it challenging to implement the algorithm efficiently on a modern machine with a hierarchical memory system. We report on an implementation of this algorithm that uses several unconventional techniques to make the algorithm memoryfriendly. First, the algorithm internally uses a nonstandard array layout known as Morton order that is based on a quadtree decomposition of the matrix. Second, we dynamically select the recursion truncation point to minimize padding without affecting the performance of the algorithm, which we can do by virtue of the cache behavior of the Morton ordering. Each technique is critical for performance, and their combination as done in our code multiplies their effectiveness. Performance comparisons of our implementation with that of competing implementations show that our implementation often outperforms th...
Recursive Array Layouts and Fast Matrix Multiplication
, 1999
"... The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional columnmajor or rowmajor array layouts to incur high variability in memory system performance as matrix size var ..."
Abstract

Cited by 37 (0 self)
 Add to MetaCart
(Show Context)
The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional columnmajor or rowmajor array layouts to incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts to improve performance and reduce variability. Previous work on recursive matrix multiplication is extended to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication, and the more complex algorithms of Strassen and Winograd. While recursive layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.22.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms. For a purely sequential implementation, it is possible to reorder computation to conserve memory space and improve performance between ...
Statistical models for empirical searchbased performance tuning
 INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS
, 2004
"... Achieving peak performance from the computational kernels that dominate application performance often requires extensive machinedependent tuning by hand. Automatic tuning systems have emerged in response, and they typically operate by (1) generating a large number of possible, reasonable implementa ..."
Abstract

Cited by 35 (2 self)
 Add to MetaCart
Achieving peak performance from the computational kernels that dominate application performance often requires extensive machinedependent tuning by hand. Automatic tuning systems have emerged in response, and they typically operate by (1) generating a large number of possible, reasonable implementations of a kernel, and (2) selecting the fastest implementation by a combination of heuristic modeling, heuristic pruning, and empirical search (i.e., actually running the code). This paper presents quantitative data that motivates the development of such a searchbased system, using dense matrix multiply as a case study. The statistical distributions of performance within spaces of reasonable implementations, when observed on a variety of hardware platforms, lead us to pose and address two general problems which arise during the search process. First, we develop a heuristic for stopping an exhaustive compiletime search early if a nearoptimal implementation is found. Second, we show how to construct
FFTbased dense polynomial arithmetic on multicores
 Proc. of HPCS’09
, 2009
"... Abstract. We report efficient implementation techniques for FFTbased dense multivariate polynomial arithmetic over finite fields, targeting multicores. We have extended a preliminary study dedicated to polynomial multiplication and obtained a complete set of efficient parallel routines in Cilk++ f ..."
Abstract

Cited by 13 (9 self)
 Add to MetaCart
(Show Context)
Abstract. We report efficient implementation techniques for FFTbased dense multivariate polynomial arithmetic over finite fields, targeting multicores. We have extended a preliminary study dedicated to polynomial multiplication and obtained a complete set of efficient parallel routines in Cilk++ for polynomial arithmetic such as normal form computation. Since bivariate multiplication applied to balanced data is a good kernel for these routines, we provide an indepth study on the performance and the cutoff criteria of our different implementations for this operation. We also show that, not only optimized parallel multiplication can improve the performance of higherlevel algorithms such as normal form computation but also this composition is necessary for parallel normal form computation to reach peak performance on a variety of problems that we have tested.
Adaptive Winograd’s Matrix Multiplications
, 2008
"... Modern architectures have complex memory hierarchies and increasing parallelism (e.g., multicores). These features make achieving and maintaining good performance across rapidly changing architectures increasingly difficult. Performance has become a complex tradeoff, not just a simple matter of cou ..."
Abstract

Cited by 8 (3 self)
 Add to MetaCart
Modern architectures have complex memory hierarchies and increasing parallelism (e.g., multicores). These features make achieving and maintaining good performance across rapidly changing architectures increasingly difficult. Performance has become a complex tradeoff, not just a simple matter of counting cost of simple CPU operations. We present a novel, hybrid, and adaptive recursive StrassenWinograd’s matrix multiplication (MM) that uses automatically tuned linear algebra software (ATLAS) or GotoBLAS. Our algorithm applies to any size and shape matrices stored in either row or column major layout (in doubleprecision in this work) and thus is efficiently applicable to both C and FORTRAN implementations. In addition, our algorithm divides the computation into equivalent incomplexity subMMs and does not require any extra computation to combine the intermediary subMM results. We achieve up to 22 % executiontime reduction versus GotoBLAS/ATLAS alone for a single core system and up to 19 % for a 2 dualcore processor system. Most importantly, even for small matrices such as 1500×1500, our approach attains already 10 % executiontime reduction and, for MM of matrices larger than 3000×3000, it delivers performance that would correspond, for a classic O(n3) algorithm, to fasterthanprocessor peak performance (i.e., our algorithm delivers the equivalent of 5 GFLOPS performance on a system with 4.4 GFLOPS peak performance and where GotoBLAS achieves only 4 GFLOPS). This is a result of the savings in operations (and thus FLOPS). Therefore, our algorithm is faster than any classic MM algorithms could ever be for matrices of this size. Furthermore, we present experimental evidence based on established methodologies found in the literature that our algorithm is, for a family of matrices, as accurate as the classic algorithms.
Efficient Multiplication of Dense Matrices over GF(2)
, 2008
"... We describe an efficient implementation of a hierarchy of algorithms for multiplication of dense matrices over the field with two elements (F2). In particular we present our implementation – in the M4RI library – of StrassenWinograd matrix multiplication and the “Method of the Four Russians” multip ..."
Abstract

Cited by 5 (0 self)
 Add to MetaCart
We describe an efficient implementation of a hierarchy of algorithms for multiplication of dense matrices over the field with two elements (F2). In particular we present our implementation – in the M4RI library – of StrassenWinograd matrix multiplication and the “Method of the Four Russians” multiplication (M4RM) and compare it against other available implementations. Good performance is demonstrated on on AMD’s Opteron and particulary good performance on Intel’s Core 2 Duo. The opensource M4RI library is available standalone as well as part of the Sage mathematics software. In machine terms, addition in F2 is logicalXOR, and multiplication is logicalAND, thus a machine word of 64bits allows one to operate on 64 elements of F2 in parallel: at most one CPU cycle for 64 parallel additions or multiplications. As such, elementwise operations over F2 are relatively cheap. In fact, in this paper, we conclude that the actual bottlenecks are memory reads and writes and issues of data locality. We present our empirical findings in relation to minimizing these and give an analysis thereof.