Results 11-20 of 523
An Updated Set of Basic Linear Algebra Subprograms (BLAS)
 ACM Transactions on Mathematical Software
, 2001
Cited by 120 (7 self)
This paper summarizes the BLAS Technical Forum Standard, a specification of a set of kernel routines for linear algebra, historically called the Basic Linear Algebra Subprograms and commonly known as the BLAS. The complete standard can be found in [1], and on the BLAS Technical Forum webpage (http://www.netlib.org/blas/blast-forum/).
Algorithm 887: CHOLMOD, Supernodal Sparse Cholesky Factorization and Update/Downdate
 ACM Transactions on Mathematical Software
, 2008
Cited by 111 (8 self)
CHOLMOD is a set of routines for factorizing sparse symmetric positive definite matrices of the form A or AA^T, updating/downdating a sparse Cholesky factorization, solving linear systems, updating/downdating the solution to the triangular system Lx = b, and many other sparse matrix functions for both symmetric and unsymmetric matrices. Its supernodal Cholesky factorization relies on LAPACK and the Level-3 BLAS, and obtains a substantial fraction of the peak performance of the BLAS. Both real and complex matrices are supported. CHOLMOD is written in ANSI/ISO C, with both C and MATLAB™ interfaces. It appears in MATLAB 7.2 as x=A\b when A is sparse symmetric positive definite, as well as in several other sparse matrix functions.
GEMM-Based Level 3 BLAS: High-Performance Model Implementations and Performance Evaluation Benchmark
 ACM Transactions on Mathematical Software
, 1998
Cited by 108 (10 self)
The level 3 Basic Linear Algebra Subprograms (BLAS) are designed to perform various matrix multiply and triangular system solving computations. Due to the complex hardware organization of advanced computer architectures, the development of optimal level 3 BLAS code is costly and time consuming. However, it is possible to develop a portable and high-performance level 3 BLAS library mainly relying on a highly optimized GEMM, the routine for the general matrix multiply and add operation. With suitable partitioning, all the other level 3 BLAS can be defined in terms of GEMM and a small amount of level 1 and level 2 computations. Our contribution is twofold. First, the model implementations in Fortran 77 of the GEMM-based level 3 BLAS are structured to effectively reduce data traffic in a memory hierarchy. Second, the GEMM-based level 3 BLAS performance evaluation benchmark is a tool for evaluating and comparing different implementations of the level 3 BLAS with the GEMM-based model implementations.
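The partitioning idea in this abstract can be illustrated with a small sketch. It is not the paper's Fortran 77 model implementation: the function names `gemm`, `block`, and `trmm_lower`, the block size `nb`, and the use of plain Python lists are all assumptions made for illustration. The sketch computes L * B for a lower triangular L by sending every off-diagonal panel through a GEMM-style routine and handling only the small diagonal blocks with a triangular loop.

```python
def gemm(A, B, C):
    """C += A * B on lists of lists (stand-in for an optimized GEMM)."""
    for i in range(len(A)):
        for k in range(len(B)):
            a = A[i][k]
            for j in range(len(B[0])):
                C[i][j] += a * B[k][j]


def block(M, r0, r1, c0, c1):
    """Copy the sub-matrix M[r0:r1, c0:c1]."""
    return [row[c0:c1] for row in M[r0:r1]]


def trmm_lower(L, B, nb=2):
    """Return L * B for lower triangular L, partitioned into row blocks of size nb."""
    n, m = len(L), len(B[0])
    out = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, nb):
        i1 = min(i0 + nb, n)
        Ci = [[0.0] * m for _ in range(i1 - i0)]
        # Off-diagonal part: a rectangular panel of L, handled entirely by GEMM.
        if i0 > 0:
            gemm(block(L, i0, i1, 0, i0), block(B, 0, i0, 0, m), Ci)
        # Diagonal block: the small triangular remainder.
        for i in range(i0, i1):
            for k in range(i0, i + 1):
                for j in range(m):
                    Ci[i - i0][j] += L[i][k] * B[k][j]
        out[i0:i1] = Ci
    return out
```

As the matrix grows relative to `nb`, an increasing share of the arithmetic falls inside `gemm`, which is why a single well-tuned GEMM can carry most of the performance in such a scheme.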
Optimizing for Parallelism and Data Locality
 In Proceedings of the 1992 ACM International Conference on Supercomputing
, 1992
Cited by 97 (13 self)
Previous research has used program transformation to introduce parallelism and to exploit data locality. Unfortunately, these two objectives have usually been considered independently. This work explores the tradeoffs between effectively utilizing parallelism and memory hierarchy on shared-memory multiprocessors. We present a simple, but surprisingly accurate, memory model to determine cache line reuse from both multiple accesses to the same memory location and from consecutive memory access. The model is used in memory optimizing and loop parallelization algorithms that effectively exploit data locality and parallelism in concert. We demonstrate the efficacy of this approach with very encouraging experimental results.
1 Introduction
Transformations to exploit parallelism and to improve data locality are two of the most valuable compiler techniques in use today. Independently, each of these optimizations has been shown to result in dramatic improvements. This paper seeks to combine t...
Auto-Blocking Matrix-Multiplication or Tracking BLAS3 Performance from Source Code
 In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
, 1997
Cited by 86 (6 self)
An elementary, machine-independent, recursive algorithm for matrix multiplication C+=A*B provides implicit blocking at every level of the memory hierarchy and tests out faster than classically optimal code, tracking hand-coded BLAS3 routines. "Proof of concept" is demonstrated by racing the in-place algorithm against manufacturer's hand-tuned BLAS3 routines; it can win. The recursive code bifurcates naturally at the top level into independent block-oriented processes, each of which writes to a disjoint and contiguous region of memory. Experience has shown that the indexing vastly improves the patterns of memory access at all levels of the memory hierarchy, independently of the sizes of caches or pages and without ad hoc programming. It also exposed a weakness in SGI's C compilers that merrily unroll loops for the superscalar R8000 processor, but do not analogously unfold the base cases of the most elementary recursions. Such deficiencies might deter programmers from using this rich class of recursive algorithms.
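A minimal sketch of a recursive C += A*B follows, to make the "implicit blocking" idea concrete. It is an illustration under stated assumptions, not the paper's code: it halves the largest dimension at each step on ordinary row-major lists, whereas the abstract's reference to indexing suggests a specialized storage layout that this sketch does not attempt. The name `rec_matmul` and the cutoff parameter `base` (assumed to be at least 1) are hypothetical.

```python
def rec_matmul(A, B, C, base=2):
    """Accumulate C += A * B in place by recursively halving the largest dimension."""

    def rec(ai, aj, bi, bj, ci, cj, m, k, n):
        # Base case: the sub-problem is small enough for a plain triple loop.
        if max(m, k, n) <= base:
            for i in range(m):
                for p in range(k):
                    a = A[ai + i][aj + p]
                    for j in range(n):
                        C[ci + i][cj + j] += a * B[bi + p][bj + j]
            return
        if m >= k and m >= n:
            # Split the rows of A and C; the halves write disjoint rows of C.
            h = m // 2
            rec(ai, aj, bi, bj, ci, cj, h, k, n)
            rec(ai + h, aj, bi, bj, ci + h, cj, m - h, k, n)
        elif n >= k:
            # Split the columns of B and C; the halves write disjoint columns of C.
            h = n // 2
            rec(ai, aj, bi, bj, ci, cj, m, k, h)
            rec(ai, aj, bi, bj + h, ci, cj + h, m, k, n - h)
        else:
            # Split the inner dimension; both halves accumulate into the same C block.
            h = k // 2
            rec(ai, aj, bi, bj, ci, cj, m, h, n)
            rec(ai, aj + h, bi + h, bj, ci, cj, m, k - h, n)

    rec(0, 0, 0, 0, 0, 0, len(A), len(B), len(B[0]))
```

Because each level of the recursion works on operands half the size of its parent, some level fits whatever cache size the machine has, without a block-size parameter tuned per machine. The first two split cases produce sub-problems that write to disjoint parts of C, which is the property that allows them to run as independent processes.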
SLICOT - A Subroutine Library in Systems and Control Theory
 Applied and Computational Control, Signals, and Circuits
, 1997
Cited by 82 (55 self)
This article describes the subroutine library SLICOT that provides Fortran 77 implementations of numerical algorithms for computations in systems and control theory. Around a nucleus of basic numerical linear algebra subroutines, this library builds methods for the design and analysis of linear control systems. A brief history of the library is given together with a description of the current version of the library and the ongoing activities to complete and improve the library in several aspects.
1 Introduction
Systems and control theory are disciplines widely used to describe, control, and optimize industrial and economical processes. There is now a huge amount of theoretical results available which has led to a variety of methods and algorithms used throughout industry and academia. Although based on theoretical results, these methods often fail when applied to real-life problems, which often tend to be ill-posed or of high dimensions. This failing is frequently due to the lack of...
Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software
 SIAM Review, Vol. 46, No. 1, pp. 3-45
, 2004
Cited by 81 (6 self)
Matrix computations are both fundamental and ubiquitous in computational science and its vast application areas. Along with the development of more advanced computer systems with complex memory hierarchies, there is a continuing demand for new algorithms and library software that efficiently utilize and adapt to new architecture features. This article reviews and details some of the recent advances made by applying the paradigm of recursion to dense matrix computations on today’s memory-tiered computer systems. Recursion allows for efficient utilization of a memory hierarchy and generalizes existing fixed blocking by introducing automatic variable blocking that has the potential of matching every level of a deep memory hierarchy. Novel recursive blocked algorithms offer new ways to compute factorizations such as Cholesky and QR and to solve matrix equations. In fact, the whole gamut of existing dense linear algebra factorization is beginning to be reexamined in view of the recursive paradigm. Use of recursion has led to using new hybrid data structures and optimized superscalar kernels. The results we survey include new algorithms and library software implementations for level 3 kernels, matrix factorizations, and the solution of general systems of linear equations and several common matrix equations. The software implementations we survey are robust and show impressive performance on today’s high performance computing systems.
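To show what a recursive formulation of a factorization can look like, here is a minimal sketch of a recursive Cholesky factorization. It is a textbook-style illustration and an assumption on our part, not one of the algorithms or hybrid data structures surveyed in the article; the name `rchol` and the in-place list-of-lists storage are hypothetical. The matrix is split as [[A11, A21^T], [A21, A22]], then L11 is computed recursively, L21 = A21 * L11^{-T} is obtained by a triangular solve, A22 is updated to A22 - L21 * L21^T, and the update is factored recursively.

```python
import math


def rchol(A, lo=0, hi=None):
    """Overwrite the lower triangle of A[lo:hi, lo:hi] with its Cholesky factor L.

    A is a symmetric positive definite matrix stored as a list of lists.
    The strict upper triangle is neither read nor modified.
    """
    if hi is None:
        hi = len(A)
    n = hi - lo
    if n == 1:
        A[lo][lo] = math.sqrt(A[lo][lo])
        return
    mid = lo + n // 2
    # Factor the leading block: A11 = L11 * L11^T.
    rchol(A, lo, mid)
    # Triangular solve for the off-diagonal block: L21 = A21 * L11^{-T}.
    for i in range(mid, hi):
        for j in range(lo, mid):
            s = A[i][j] - sum(A[i][p] * A[j][p] for p in range(lo, j))
            A[i][j] = s / A[j][j]
    # Symmetric update of the trailing block: A22 -= L21 * L21^T (lower part only).
    for i in range(mid, hi):
        for j in range(mid, i + 1):
            A[i][j] -= sum(A[i][p] * A[j][p] for p in range(lo, mid))
    # Factor the updated trailing block.
    rchol(A, mid, hi)
```

In a library setting the triangular solve and the symmetric update would be level 3 kernel calls, so the recursion supplies the variable blocking while the kernels supply the speed.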
NetSolve: A Network-Enabled Server for Solving Computational Science Problems
, 1997
SVDPACKC (Version 1.0) User's Guide
, 1993
Cited by 74 (4 self)
SVDPACKC comprises four numerical (iterative) methods for computing the singular value decomposition (SVD) of large sparse matrices using ANSI C. This software package implements Lanczos and subspace iteration-based methods for determining several of the largest singular triplets (singular values and corresponding left- and right-singular vectors) for large sparse matrices. The package has been ported to a variety of machines ranging from supercomputers to workstations: CRAY Y-MP, IBM RS/6000-550, DEC 5000-100, HP 9000-750, SPARCstation 2, and Macintosh II/fx. This document (i) explains each algorithm in some detail, (ii) explains the input parameters for each program, (iii) explains how to compile/execute each program, and (iv) illustrates the performance of each method when we compute lower rank approximations to sparse term-document matrices from information retrieval applications. A user-friendly software interface to the package for UNIX-based systems and the Macintosh II/fx is als...