Results 1–10 of 101
CSDP, a C library for semidefinite programming.
, 1997
Cited by 209 (1 self)
Abstract:
This paper is organized as follows. First, we discuss the formulation of the semidefinite programming problem used by CSDP. We then describe the predictor-corrector algorithm used by CSDP to solve the SDP. We discuss the storage requirements of the algorithm as well as its computational complexity. Finally, we present results from the solution of a number of test problems. [Section 2, The SDP Problem] We consider semidefinite programming problems of the form max tr(CX) …
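The extract breaks off at the primal objective. In the usual standard primal form that this objective belongs to (a sketch of the standard formulation; CSDP's exact notation may differ), the problem reads

```latex
\begin{aligned}
\max\quad & \operatorname{tr}(CX) \\
\text{s.t.}\quad & \operatorname{tr}(A_i X) = b_i, \qquad i = 1,\dots,m, \\
& X \succeq 0,
\end{aligned}
```

where $X \succeq 0$ means $X$ is symmetric positive semidefinite, $C$ and the $A_i$ are symmetric data matrices, and $b \in \mathbb{R}^m$.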
Self adapting linear algebra algorithms and software
, 2004
Cited by 94 (23 self)
Abstract:
One of the main obstacles to the efficient solution of scientific problems is the problem of tuning software, both to the available architecture and to the user problem at hand. We describe approaches for obtaining tuned high-performance kernels, and for automatically choosing suitable algorithms. Specifically, we describe the generation of dense and sparse BLAS kernels, and the selection of linear solver algorithms. However, the ideas presented here extend beyond these areas, which can be considered proof of concept.
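The generate-and-select idea the abstract describes can be sketched in C: provide several variants of the same dense kernel (here two hand-written loop orders; a real autotuner machine-generates many), time each on the target machine, and keep the fastest. The names `gemm_ijk`, `gemm_ikj`, and `pick_fastest` are illustrative, not any library's API.

```c
#include <string.h>
#include <time.h>

#define N 64  /* problem size for the timing experiment */

/* Variant 1: classic i-j-k loop order (dot-product formulation). */
static void gemm_ijk(const double *a, const double *b, double *c) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++) s += a[i*N + k] * b[k*N + j];
            c[i*N + j] = s;
        }
}

/* Variant 2: i-k-j loop order (streams through rows of B,
   often friendlier to the cache and the hardware prefetcher). */
static void gemm_ikj(const double *a, const double *b, double *c) {
    memset(c, 0, N * N * sizeof(double));
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double aik = a[i*N + k];
            for (int j = 0; j < N; j++) c[i*N + j] += aik * b[k*N + j];
        }
}

typedef void (*gemm_fn)(const double *, const double *, double *);

/* Empirical selection: run every candidate on the actual machine
   and return the index of the fastest one. */
static int pick_fastest(gemm_fn cand[], int ncand,
                        const double *a, const double *b, double *c) {
    int best = 0;
    double best_t = 1e300;
    for (int v = 0; v < ncand; v++) {
        clock_t t0 = clock();
        for (int rep = 0; rep < 10; rep++) cand[v](a, b, c);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (t < best_t) { best_t = t; best = v; }
    }
    return best;
}
```

In a real system such as the ones surveyed here, the candidate set is machine-generated (unroll factors, blocking sizes, instruction schedules) and the winner is installed as the library kernel.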
Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems
, 2008
Cited by 68 (18 self)
Abstract:
If multicore is a disruptive technology, try to imagine hybrid multicore systems enhanced with accelerators! This is happening today as accelerators, in particular Graphics Processing Units (GPUs), are steadily making their way into the high performance computing (HPC) world. We highlight the trends leading to the idea of hybrid manycore/GPU systems, and we present a set of techniques that can be used to efficiently program them. The presentation is in the context of Dense Linear Algebra (DLA), a major building block for many scientific computing applications. We motivate the need for new algorithms that would split the computation in a way that would fully exploit the power that each of the hybrid components offers. As the area of hybrid multicore/GPU computing is still in its infancy, we also argue for its importance in view of what future architectures may look like. We therefore envision the need for a DLA library similar to LAPACK but for hybrid manycore/GPU systems. We illustrate the main ideas with an LU factorization algorithm where particular techniques are used to reduce the amount of pivoting, resulting in an algorithm achieving up to 388 GFlop/s for single and up to 99.4 GFlop/s for double precision factorization on a hybrid Intel Xeon …
ScaLAPACK: A Linear Algebra Library for Message-Passing Computers
 In SIAM Conference on Parallel Processing
, 1997
Cited by 36 (3 self)
Abstract:
This article outlines the content and performance of some of the ScaLAPACK software. ScaLAPACK is a collection of mathematical software for linear algebra computations on distributed-memory computers. The importance of developing standards for computational and message-passing interfaces is discussed. We present the different components and building blocks of ScaLAPACK and provide initial performance results for selected PBLAS routines and a subset of ScaLAPACK driver routines.
Statistical models for empirical search-based performance tuning
 International Journal of High Performance Computing Applications
, 2004
Cited by 35 (2 self)
Abstract:
Achieving peak performance from the computational kernels that dominate application performance often requires extensive machine-dependent tuning by hand. Automatic tuning systems have emerged in response, and they typically operate by (1) generating a large number of possible, reasonable implementations of a kernel, and (2) selecting the fastest implementation by a combination of heuristic modeling, heuristic pruning, and empirical search (i.e., actually running the code). This paper presents quantitative data that motivates the development of such a search-based system, using dense matrix multiply as a case study. The statistical distributions of performance within spaces of reasonable implementations, when observed on a variety of hardware platforms, lead us to pose and address two general problems which arise during the search process. First, we develop a heuristic for stopping an exhaustive compile-time search early if a near-optimal implementation is found. Second, we show how to construct …
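The paper's actual stopping rule is statistical, built on the empirical distribution of performance over the implementation space. As a simplified, hypothetical stand-in that conveys the shape of such a rule, the sketch below stops the search once a fixed window of consecutive candidates fails to improve the running best by more than a relative tolerance `eps`; `search_early_stop` and its parameters are assumptions for illustration, not the paper's heuristic.

```c
typedef struct {
    double best;    /* best performance seen before stopping          */
    int evaluated;  /* number of candidate implementations actually run */
} search_result;

/* Scan candidate performances in generation order; stop early once
   `window` consecutive candidates fail to beat the running best by
   more than a relative margin `eps`. perf[t] is the measured
   performance (higher is better) of the t-th implementation. */
search_result search_early_stop(const double *perf, int n,
                                int window, double eps) {
    search_result r = { perf[0], 1 };
    int stall = 0;  /* consecutive non-improving candidates */
    for (int t = 1; t < n; t++) {
        r.evaluated++;
        if (perf[t] > r.best * (1.0 + eps)) {
            r.best = perf[t];
            stall = 0;
        } else if (++stall >= window) {
            break;  /* deemed near-optimal by this heuristic; stop searching */
        }
    }
    return r;
}
```

The trade-off is visible even in this toy: a late outlier after the stall window is never evaluated, which is exactly the risk a statistical model of the performance distribution tries to quantify.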
Anatomy of high-performance matrix multiplication
 ACM Transactions on Mathematical Software
, 2008
Cited by 31 (2 self)
Abstract:
We present the basic principles that underlie the high-performance implementation of the matrix-matrix multiplication that is part of the widely used GotoBLAS library. Design decisions are justified by successively refining a model of architectures with multilevel memories. A simple but effective algorithm for executing this operation results. Implementations on a broad selection of architectures are shown to achieve near-peak performance.
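The layering idea behind such implementations — partition the operands into blocks sized for a level of the memory hierarchy, then run a small kernel on resident blocks — can be illustrated with a minimal cache-blocked multiply. This is a toy sketch, not GotoBLAS's packed panel kernels; `gemm_blocked` and the fixed block size `nb` are assumptions.

```c
#define MIN(x, y) ((x) < (y) ? (x) : (y))

/* C += A * B for row-major n x n matrices, with nb x nb blocking so
   the working set of the three innermost loops stays resident in cache. */
void gemm_blocked(int n, int nb,
                  const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += nb)
        for (int kk = 0; kk < n; kk += nb)
            for (int jj = 0; jj < n; jj += nb)
                /* micro-kernel on one triple of blocks */
                for (int i = ii; i < MIN(ii + nb, n); i++)
                    for (int k = kk; k < MIN(kk + nb, n); k++) {
                        double aik = A[i*n + k];
                        for (int j = jj; j < MIN(jj + nb, n); j++)
                            C[i*n + j] += aik * B[k*n + j];
                    }
}
```

A production implementation additionally packs each block into a contiguous buffer and tunes a separate blocking factor for each cache level, which is where most of the remaining performance comes from.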
A recursive formulation of Cholesky factorization of a matrix in packed storage
, 2001
Cited by 24 (4 self)
Abstract:
A new compact way to store a symmetric or triangular matrix, called RPF for Recursive Packed Format, is fully described. Novel ways to transform RPF to and from standard packed format are included. A new algorithm, called RPC for Recursive Packed Cholesky, that operates on the RPF format is presented. Algorithm RPC is level 3 BLAS based and requires algorithms TRSM and SYRK that work on RPF. We thus introduce and fully describe novel recursive algorithms RP TRSM and RP SYRK that the RPC algorithm requires. It turns out that both RP TRSM and RP SYRK only call GEMM. Hence RPC mostly calls GEMM during execution. The advantage of this storage scheme compared to traditional packed storage is demonstrated. First, both storage schemes use the minimal amount of storage for the symmetric or triangular matrix. Second, RPC gives a level 3 implementation of Cholesky factorization that only requires standard full format GEMM, whereas standard packed implementations are only level 2. Hence …
Evaluation and tuning of the level 3 CUBLAS for graphics processors
 In 9th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing – PDSEC’08
, 2008
Cited by 22 (12 self)
Abstract:
The increase in performance of the last generations of graphics processors (GPUs) has made this class of platform a coprocessing tool with remarkable success in certain types of operations. In this paper we evaluate the performance of the Level 3 operations in CUBLAS, the implementation of BLAS for NVIDIA® GPUs with unified architecture. From this study, we gain insights on the quality of the kernels in the library and we propose several alternative implementations that are competitive with those in CUBLAS. Experimental results on a GeForce 8800 Ultra compare the performance of CUBLAS and the new variants.