Abstract: During the last decade, a number of projects have pursued the high-performance implementation
of matrix multiplication. Typically, these projects organize the computation around an
"inner kernel," C = AB + C, that keeps one of the operands in the L1 cache, while streaming
parts of the other operands through that cache. Variants include approaches that extend this
principle to multiple levels of cache or that apply the same principle to the L2 cache while
essentially ignoring the L1...
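
The loop organization described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names and the block sizes `mc` and `kc` are hypothetical, and Python gives no control over cache placement, so the code only shows the order in which a blocked "inner kernel" touches the operands (one block of A reused while B and C stream past it).

```python
def reference_matmul(A, B, C):
    """Plain triple loop: return a new matrix equal to A*B + C."""
    m, k, n = len(A), len(B), len(B[0])
    return [[C[i][j] + sum(A[i][p] * B[p][j] for p in range(k))
             for j in range(n)] for i in range(m)]


def blocked_matmul(A, B, C, mc=4, kc=4):
    """Return A*B + C, computed block by block.

    A is m x k, B is k x n, C is m x n (lists of row lists).
    mc and kc are hypothetical block sizes; a tuned code would pick
    them so that an mc x kc block of A fits in the targeted cache.
    """
    m, k, n = len(A), len(B), len(B[0])
    out = [row[:] for row in C]           # leave the caller's C untouched
    for p0 in range(0, k, kc):            # slice of the shared dimension
        p1 = min(p0 + kc, k)
        for i0 in range(0, m, mc):        # block of A that stays "resident"
            i1 = min(i0 + mc, m)
            # Inner kernel: the block A[i0:i1][p0:p1] is reused for every
            # column j, while columns of B and C stream past it.
            for j in range(n):
                for i in range(i0, i1):
                    acc = out[i][j]
                    for p in range(p0, p1):
                        acc += A[i][p] * B[p][j]
                    out[i][j] = acc
    return out


if __name__ == "__main__":
    A = [[i + 2 * j for j in range(5)] for i in range(7)]
    B = [[i * j - 1 for j in range(3)] for i in range(5)]
    C = [[1] * 3 for _ in range(7)]
    # Block sizes that do not divide the matrix dimensions exercise the edges.
    print(blocked_matmul(A, B, C, mc=3, kc=2) == reference_matmul(A, B, C))
```

The blocked version produces the same result as the plain triple loop; only the traversal order differs, which is what a cache-aware implementation in a compiled language would exploit.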
Cited by:
Performance Modeling and Analysis of Cache Blocking.. - Nishtala, Vuduc.. (2004)
Is Search Really Necessary to Generate High-Performance .. - Yotov, Li, Ren.. (2005)
The Opie Compiler: from Row-major Source to Morton-ordered.. - Gabriel, Wise (2004)
Active bibliography (related documents):
0.7: A Systematic Approach to the Design and Analysis of Linear.. - Gunnels
0.6: Recursive Blocked Algorithms for Solving Triangular.. - Jonsson, Kågström (2001)
Similar documents based on text:
0.3: Adapting Radix Sort to the Memory Hierarchy - Rahman, Raman (2000)
0.2: Software Prefetching and Caching for Translation Lookaside.. - Bala, Kaashoek, Weihl (1994)
0.2: Data Sequence Locality: a Generalization of Temporal Locality - Loechner, Meister, Clauss
Related documents from co-citation:
3: Exact analysis of the cache behavior of nested loops - Chatterjee, Parker et al. - 2001
3: Automatically Tuned Linear Algebra Software - Whaley, Dongarra - 1997
2: Modeling and improving locality for irregular problems: sparse matrix-vector pro.. - Heras, Perez et al. - 1999
BibTeX entry:
K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR-2002-55, The University of Texas at Austin, Department of Computer Sciences, 2002. FLAME Working Note #9. http://citeseer.ist.psu.edu/goto02reducing.html
@misc{ goto02reducing,
author = "K. Goto and R. van de Geijn",
title = "On reducing TLB misses in matrix multiplication",
text = "K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication.
Technical Report TR-2002-55, The University of Texas at Austin, Department
of Computer Sciences, 2002. FLAME Working Note #9.",
year = "2002",
url = "citeseer.ist.psu.edu/goto02reducing.html" }
Citations (may not include all citations):
532: LAPACK Users' Guide - Anderson, Bai et al. - 1992
387: A set of level 3 basic linear algebra subprograms - Dongarra, Croz et al. - 1990
248: Solving Linear Systems on Vector and Shared Memory Computers - Dongarra, Du et al. - 1991
216: Performance of various computers using standard linear equat.. - Dongarra - 2002
157: Automatically tuned linear algebra software - Whaley, Dongarra - 1998
147: LINPACK Users' Guide - Dongarra, Bunch et al. - 1979
123: Optimizing matrix multiply using PHiPAC: a Portable - Bilmes, Asanović et al. - 1997
122: ScaLAPACK: A scalable linear algebra library for distributed.. - Choi, Dongarra et al. - 1992
72: LAPACK: A portable linear algebra library for highperformanc.. - Anderson, Bai et al. - 1990
60: Recursion leads to automatic variable blocking for dense lin.. - Gustavson - 1997
41: The impact of hierarchical memory systems on linear algebra .. - Gallivan, Jalby et al. - 1987
38: Locality of reference in LU decomposition with partial pivot.. - Toledo - 1997
20: Using PLAPACK: Parallel Linear Algebra Package - Geijn - 1997
20: Prospectus for the development of a linear algebra library f.. - Demmel, Dongarra et al. - 1987
16: Exploiting functional parallelism of POWER2 to design high-p.. - Agarwal, Gustavson et al. - 1994
15: Applying recursion to serial and parallel QR factorization l.. - Elmroth, Gustavson - 2000
13: Guide and Reference - Engineering, Library - 1988
10: FLAME: Formal linear algebra methods environment - Gunnels, Gustavson et al. - 2001
8: GEMM-based level 3 BLAS: High performance model implementati.. - Kågström, Ling et al. - 1998
8: Superscalar GEMM-based level 3 BLAS – the on-going evolution .. - Gustavson, Henriksson et al. - 1998
8: Minimal storage high-performance Cholesky factorization via .. - Gustavson, Jonsson - 2000
6: A family of high-performance matrix multiplication algorithm.. - Gunnels, Henry et al. - 2001
5: A framework for high-performance matrix multiplication based.. - Valsalam, Skjellum - 2002
5: BLAS based on block data structures - Henry - 1992
4: Recursive blocked algorithms for solving triangular matrix e.. - Jonsson, Kågström - 2001
1: GEMM-based level 3 BLAS: High-performance model - Kågström, Ling et al. - 1995
1: New generalized matrix data structures lead to a variety of .. - Gustavson - 2001
1: Flexible high-performance matrix multiply via self-modifying.. - Henry - 2001
www.netlib.org/benchmark/hpl/
Documents on the same site (http://www.cs.utexas.edu/ftp/pub/techreports/):
Parametric Quantitative Temporal Reasoning - Emerson, Trefler (1999)
Two Problems of TCP AIMD Congestion Control - Yang, Kim, Zhang, Lam (2000)
Verifying Adder Circuits Using Powerlists - Adams (1994)