Results 1–10 of 185
Design Tradeoffs for BLAS Operations on Reconfigurable Hardware
In ICPP ’05: Proceedings of the 2005 International Conference on Parallel Processing, 2005
"... Numerical linear algebra operations are key primitives in scientific computing. Performance optimizations of such operations have been extensively investigated and some basic operations have been implemented as software libraries. With the rapid advances in technology, hardware acceleration of linear algebra applications using FPGAs (Field Programmable Gate Arrays) has become feasible. In this paper, we propose FPGA-based designs for several BLAS operations, including vector product, matrix-vector multiply, and matrix multiply. By identifying the design parameters for each BLAS operation, we ..."
Cited by 13 (1 self)
Accurate Matrix Multiplication by using Level 3 BLAS Operation
"... Abstract—This paper is concerned with an accurate computation of matrix multiplication. Recently, an accurate summation algorithm was developed by Rump, Ogita and Oishi. One of the key techniques of their method is a new type of errorfree splitting. To use this strategy, we investigate a method of ..."
Abstract
 Add to MetaCart
of obtaining an accurate result of matrix multiplication by mainly using Level 3 BLAS operation. Finally, we present numerical examples showing the effectiveness of the proposal algorithm. 1.
Automatically tuned linear algebra software
Conference on High Performance Networking and Computing, 1998
"... This paper describes an approach for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units. The production of such software for machines ranging from desktop workstations to embedded processors can be a tedious and ... much of the technology and approach developed here can be applied to the other Level 3 BLAS and the general strategy can have an impact on basic linear algebra operations in general and may be extended to other important kernel operations. ..."
Cited by 478 (26 self)
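The empirical-tuning idea behind this entry (ATLAS) can be sketched in a few lines: parameterize a kernel, time each variant on the target machine, and keep the fastest. The block-size search below is a toy illustration of that loop, not the ATLAS code generator itself.

```python
import random
import time

def blocked_matmul(A, B, n, bs):
    """Blocked n x n matrix multiply with block size bs (pure-Python sketch)."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        aik = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += aik * B[k][j]
    return C

def autotune(n=64, candidates=(8, 16, 32, 64)):
    """ATLAS-style empirical search: time each candidate block size on this
    machine and return the fastest one."""
    A = [[random.random() for _ in range(n)] for _ in range(n)]
    B = [[random.random() for _ in range(n)] for _ in range(n)]
    best_bs, best_t = None, float("inf")
    for bs in candidates:
        t0 = time.perf_counter()
        blocked_matmul(A, B, n, bs)
        dt = time.perf_counter() - t0
        if dt < best_t:
            best_bs, best_t = bs, dt
    return best_bs
```

The real system searches a much larger space (blocking, unrolling, instruction scheduling) and caches the winning parameters at install time.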
Are There Iterative BLAS?
1994
"... A technique for optimizing software is proposed that involves the use of a standardized set of computational kernels that are common to many iterative methods for solving large sparse linear systems of equations. These kernels, referred to as "Iterative Basic Linear Algebra Subprograms" or "Iterative BLAS", are defined and techniques for their optimization on vector computers are presented. Several sparse matrix storage formats for different classes of matrix problems are proposed that allow the vectorization of fundamental operations in various iterative methods using ..."
Cited by 3 (0 self)
Brook for GPUs: Stream Computing on Graphics Hardware
ACM Transactions on Graphics, 2004
"... In this paper, we present Brook for GPUs, a system for general-purpose computation on programmable graphics hardware. Brook extends C to include simple data-parallel constructs, enabling the use of the GPU as a streaming coprocessor. We present a compiler and runtime system that abstracts and virtualizes many aspects of graphics hardware. In addition, we present an analysis of the effectiveness of the GPU as a compute engine compared to the CPU, to determine when the GPU can outperform the CPU for a particular algorithm. We evaluate our system with five applications, the SAXPY and SGEMV BLAS ..."
Cited by 224 (9 self)
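For reference, the two BLAS routines named in the evaluation compute the following (reference definitions in plain Python; Brook itself expresses them as stream kernels in its C extension):

```python
def saxpy(a, x, y):
    """BLAS Level 1 SAXPY: y <- a*x + y, elementwise over two vectors."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def sgemv(alpha, A, x, beta, y):
    """BLAS Level 2 GEMV: y <- alpha*(A @ x) + beta*y."""
    return [alpha * sum(aij * xj for aij, xj in zip(row, x)) + beta * yi
            for row, yi in zip(A, y)]
```

Both are embarrassingly data-parallel over output elements, which is why they map naturally onto a streaming coprocessor.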
The Combinatorial BLAS: Design, Implementation, and Applications
2010
"... This paper presents a scalable high-performance software library to be used for graph analysis and data mining. Large combinatorial graphs appear in many applications of high-performance computing, including computational biology, informatics, analytics, web search, dynamical systems, and sparse matrix methods. Graph computations are difficult to parallelize using traditional approaches due to their irregular nature and low operational intensity. Many graph computations, however, contain sufficient coarse-grained parallelism for thousands of processors, which can be uncovered by using the right ..."
Cited by 58 (10 self)
Implementation of the BLAS Level 3 and
Tech. J, 1993
"... This paper describes an implementation of Level 3 of the Basic Linear Algebra Subprogram (BLAS3) library and the LINPACK Benchmark on the Fujitsu AP1000. The performance of these applications is regarded as important for distributed memory architectures such as the AP1000. We discuss the techniques ..."
CADNA in the BLAS: Example of
2012
"... Several approximations occur during a numerical simulation: physical phenomena are modelled using mathematical equations, continuous functions are replaced by discretized ones and real numbers are replaced by finite-precision representations (floating-point numbers). The use of the IEEE 754 arithmetic generates round-off errors at each elementary arithmetic operation. By accumulation, these errors can affect the accuracy of computed results, possibly leading to partial or total inaccuracy. The effect of these rounding errors can be analyzed and studied by some methods like forward ..."
BLIS: A Framework for Rapidly Instantiating BLAS Functionality
"... The BLASlike Library Instantiation Software (BLIS) framework is a new infrastructure for rapidly instantiating Basic Linear Algebra Subprograms (BLAS) functionality. Its fundamental innovation is that virtually all computation within level2 (matrixvector) and level3 (matrixmatrix) BLAS operatio ..."
Abstract
 Add to MetaCart
is generalized and implemented in ISO C99 so that it can be reused and/or reparameterized for different operations (and different architectures) with little to no modification. Inserting highperformance kernels into the framework facilitates the immediate optimization of any BLASlike operations which are cast
A class of parallel tiled linear algebra algorithms for multicore architectures
"... Abstract. As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these new processors. Fine grain parallelism becomes a ..."
Abstract

Cited by 169 (58 self)
 Add to MetaCart
comparisons are presented with the LAPACK algorithms where parallelism can only be exploited at the level of the BLAS operations and vendor implementations. 1
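The tile-level parallelism this entry contrasts with BLAS-call-level parallelism can be sketched as follows: each output tile of a matrix product is an independent task, so a task pool schedules tiles rather than whole BLAS calls. This is a toy illustration only (it assumes n is divisible by the tile size bs, and the paper's algorithms cover factorizations, not just multiplication):

```python
from concurrent.futures import ThreadPoolExecutor

def matmul_tile(A, B, n, bs, ii, jj):
    """Compute the bs x bs output tile of C = A @ B whose corner is (ii, jj)."""
    tile = [[0.0] * bs for _ in range(bs)]
    for k in range(n):
        for i in range(bs):
            aik = A[ii + i][k]
            for j in range(bs):
                tile[i][j] += aik * B[k][jj + j]
    return ii, jj, tile

def tiled_matmul(A, B, n, bs):
    """Task-parallel tiled multiply: every output tile is an independent task,
    exposing finer-grain parallelism than one big BLAS call."""
    C = [[0.0] * n for _ in range(n)]
    with ThreadPoolExecutor() as pool:
        tasks = [pool.submit(matmul_tile, A, B, n, bs, ii, jj)
                 for ii in range(0, n, bs) for jj in range(0, n, bs)]
        for t in tasks:
            ii, jj, tile = t.result()
            for i in range(bs):
                for j in range(bs):
                    C[ii + i][jj + j] = tile[i][j]
    return C
```

For operations with dependencies between tiles (e.g. tiled Cholesky or QR), the tasks form a DAG and a runtime scheduler replaces the flat pool used here.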