Results 1–10 of 18
Algorithm 887: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate
ACM Transactions on Mathematical Software, 2008
Abstract

Cited by 109 (8 self)
CHOLMOD is a set of routines for factorizing sparse symmetric positive definite matrices of the form A or AAᵀ, updating/downdating a sparse Cholesky factorization, solving linear systems, updating/downdating the solution to the triangular system Lx = b, and many other sparse matrix functions for both symmetric and unsymmetric matrices. Its supernodal Cholesky factorization relies on LAPACK and the Level-3 BLAS, and obtains a substantial fraction of the peak performance of the BLAS. Both real and complex matrices are supported. CHOLMOD is written in ANSI/ISO C, with both C and MATLAB interfaces. It appears in MATLAB 7.2 as x=A\b when A is sparse symmetric positive definite, as well as in several other sparse matrix functions.
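The factor-once, solve-repeatedly pattern behind x=A\b can be illustrated with a dense NumPy toy (a sketch only: CHOLMOD itself works on sparse matrices in C, and the function name here is made up):

```python
import numpy as np

def cholesky_solve(A, b):
    """Solve A x = b for symmetric positive definite A via A = L L.T."""
    L = np.linalg.cholesky(A)       # factor once; reusable for many b
    y = np.linalg.solve(L, b)       # L y = b
    return np.linalg.solve(L.T, y)  # L.T x = y

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = cholesky_solve(A, b)  # plays the role of x = A\b
```

The factor L can also be cheaply updated/downdated when A changes by a low-rank term, which is the "update/downdate" functionality the title refers to.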
Large-scale deep unsupervised learning using graphics processors
International Conf. on Machine Learning, 2009
Abstract

Cited by 50 (8 self)
The promise of unsupervised learning methods lies in their potential to use vast amounts of unlabeled data to learn complex, highly nonlinear models with millions of free parameters. We consider two well-known unsupervised learning models, deep belief networks (DBNs) and sparse coding, that have recently been applied to a flurry of machine learning applications (Hinton & Salakhutdinov, 2006; Raina et al., 2007). Unfortunately, current learning algorithms for both models are too slow for large-scale applications, forcing researchers to focus on smaller-scale models, or to use fewer training examples. In this paper, we suggest massively parallel methods to help resolve these problems. We argue that modern graphics processors far surpass the computational capabilities of multicore CPUs, and have the potential to revolutionize the applicability of deep unsupervised learning methods. We develop general principles for massively parallelizing unsupervised learning tasks using graphics processors. We show that these principles can be applied to successfully scale up learning algorithms for both DBNs and sparse coding. Our implementation of DBN learning is up to 70 times faster than a dual-core CPU implementation for large models. For example, we are able to reduce the time required to learn a four-layer DBN with 100 million free parameters from several weeks to around a single day. For sparse coding, we develop a simple, inherently parallel algorithm that leads to a 5- to 15-fold speedup over previous methods.
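The speedups come from casting the inner loops of learning as large dense matrix products, which GPUs execute well. A minimal NumPy sketch of one contrastive-divergence (CD-1) update for an RBM layer of a DBN, with biases omitted and all names chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, v0, lr=0.1):
    """One CD-1 weight update on a minibatch.

    v0: (batch, visible) data; W: (visible, hidden) weights.
    Every step is a dense matrix product over the whole minibatch,
    which is exactly the workload that maps well onto a GPU.
    """
    h0 = sigmoid(v0 @ W)                                  # hidden probabilities
    h_sample = (rng.random(h0.shape) < h0).astype(float)  # stochastic hidden states
    v1 = sigmoid(h_sample @ W.T)                          # reconstruction
    h1 = sigmoid(v1 @ W)
    return W + lr * (v0.T @ h0 - v1.T @ h1) / len(v0)

W = rng.standard_normal((6, 4)) * 0.01
batch = rng.random((8, 6))
W = cd1_step(W, batch)
```

On a GPU, the same code shape runs with the matrix products executed by the device, which is where the reported 70x over a dual-core CPU comes from for large layers.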
Anatomy of high-performance matrix multiplication
ACM Transactions on Mathematical Software, 2008
Abstract

Cited by 31 (2 self)
We present the basic principles that underlie the high-performance implementation of the matrix-matrix multiplication that is part of the widely used GotoBLAS library. Design decisions are justified by successively refining a model of architectures with multilevel memories. A simple but effective algorithm for executing this operation results. Implementations on a broad selection of architectures are shown to achieve near-peak performance.
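The layered blocking the paper describes can be caricatured in a few lines: partition the operands into tiles sized for a level of the memory hierarchy and reuse each tile across an inner loop. A NumPy sketch of the loop structure only (the real GotoBLAS kernels pack tiles into contiguous buffers and are hand-tuned assembly):

```python
import numpy as np

def blocked_matmul(A, B, bs=64):
    """Tiled matrix multiply: each bs x bs tile of A and B is reused
    across an inner loop, so it stays resident in fast memory.
    Illustrates the loop structure, not the performance."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n))
    for i in range(0, m, bs):
        for p in range(0, k, bs):       # tile of A reused for all j below
            for j in range(0, n, bs):
                C[i:i+bs, j:j+bs] += A[i:i+bs, p:p+bs] @ B[p:p+bs, j:j+bs]
    return C
```

The choice of bs per loop level, and which operand is packed for which cache, is exactly the design space the paper's memory model is used to navigate.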
Multifrontal multithreaded rank-revealing sparse QR factorization
Abstract

Cited by 16 (2 self)
SuiteSparseQR is a sparse QR factorization package based on the multifrontal method. Within each frontal matrix, LAPACK and the multithreaded BLAS enable the method to obtain high performance on multicore architectures. Parallelism across different frontal matrices is handled with Intel’s Threading Building Blocks library. The symbolic analysis and ordering phase pre-eliminates singletons by permuting the input matrix into the form [R11 R12; 0 A22], where R11 is upper triangular with diagonal entries above a given tolerance. Next, the fill-reducing ordering, column elimination tree, and frontal matrix structures are found without requiring the formation of the pattern of AᵀA. Rank detection is performed within each frontal matrix using Heath’s method, which does not require column pivoting. The resulting sparse QR factorization obtains a substantial fraction of the theoretical peak performance of a multicore computer.
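Heath's rank detection can be sketched densely: factor without column pivoting and count the diagonal entries of R that exceed the tolerance (SuiteSparseQR applies this within each sparse frontal matrix; the dense analogue here is only illustrative, and the estimate is not guaranteed exact):

```python
import numpy as np

def estimated_rank(A, tol=1e-8):
    """Heath-style rank detection: QR without column pivoting, then
    count diagonal entries of R with magnitude above `tol`."""
    R = np.linalg.qr(A, mode="r")
    return int(np.sum(np.abs(np.diag(R)) > tol))

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [0.0, 1.0, 1.0]])  # rank 2: column 3 = column 1 + column 2
```

Because no column pivoting is needed, the fill-reducing column ordering chosen in the symbolic phase is left undisturbed, which is the point the abstract makes.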
Exploiting Vector Instructions with Generalized Stream Fusion
ICFP '13, 2013
Abstract

Cited by 8 (0 self)
Stream fusion is a powerful technique for automatically transforming high-level sequence-processing functions into efficient implementations. It has been used to great effect in Haskell libraries for manipulating byte arrays, Unicode text, and unboxed vectors. However, some operations, like vector append, still do not perform well within the standard stream fusion framework. Others, like SIMD computation using the SSE and AVX instructions available on modern x86 chips, do not seem to fit in the framework at all. In this paper we introduce generalized stream fusion, which solves these issues. The key insight is to bundle together multiple stream representations, each tuned for a particular class of stream consumer. We also describe a stream representation suited for efficient computation with SSE instructions. Our ideas are implemented in modified versions of the GHC compiler and vector library. Benchmarks show that high-level Haskell code written using our compiler and libraries can produce code that is faster than both compiler- and hand-vectorized C.
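The core trick, representing a stream as a state plus a step function so that composed pipeline stages rewrite the step instead of materializing intermediate results, can be mimicked in Python (a toy of the baseline stream-fusion idea, not of the paper's generalized bundled representation):

```python
# A stream is (state, step). step(state) returns either
# ("yield", value, new_state) or ("done",). Composing smap onto a
# stream wraps its step function, so the whole pipeline below runs
# as a single loop with no intermediate lists.

def from_range(n):
    def step(i):
        return ("yield", i, i + 1) if i < n else ("done",)
    return (0, step)

def smap(f, stream):
    state, step = stream
    def step2(s):
        r = step(s)
        return ("yield", f(r[1]), r[2]) if r[0] == "yield" else ("done",)
    return (state, step2)

def sfold(op, acc, stream):
    state, step = stream
    while True:
        r = step(state)
        if r[0] == "done":
            return acc
        acc, state = op(acc, r[1]), r[2]

# sum of squares of 0..9, computed in one fused loop
total = sfold(lambda a, x: a + x, 0, smap(lambda x: x * x, from_range(10)))
```

Generalized stream fusion goes further by bundling several such representations side by side, e.g. one that exposes whole SIMD-width chunks, and letting each consumer pick whichever representation suits it best.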
New Data Structures for Matrices and Specialized Inner Kernels: Low Overhead For High Performance
2007
Abstract

Cited by 6 (2 self)
Dense linear algebra codes are often expressed and coded in terms of BLAS calls. This approach, however, achieves suboptimal performance due to the overheads associated with such calls. Taking as an example the dense Cholesky factorization of a symmetric positive definite matrix, we show that the potential of noncanonical data structures for dense linear algebra can be better exploited with the use of specialized inner kernels. The use of noncanonical data structures together with specialized inner kernels has low overhead and can produce excellent performance.
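A toy of the idea in NumPy: store the matrix in a noncanonical block layout and run the factorization as calls to small per-block kernels (block size and names here are illustrative; the paper's point is that specialized kernels acting on such contiguous blocks avoid the overheads of generic BLAS calls):

```python
import numpy as np

def tile(A, bs):
    """Store A as a dict of contiguous bs x bs blocks -- a toy
    'noncanonical' block data structure (assumes bs divides n)."""
    n = A.shape[0]
    nb = n // bs
    return {(i, j): A[i*bs:(i+1)*bs, j*bs:(j+1)*bs].copy()
            for i in range(nb) for j in range(nb)}

def blocked_cholesky(T, nb):
    """Right-looking Cholesky over the block structure. Each step is a
    small dense kernel: block Cholesky, triangular solve, or rank-bs
    update. Only the lower-triangular blocks are referenced."""
    for k in range(nb):
        T[k, k] = np.linalg.cholesky(T[k, k])
        for i in range(k + 1, nb):
            # solve X @ T[k,k].T = T[i,k]  =>  T[k,k] @ X.T = T[i,k].T
            T[i, k] = np.linalg.solve(T[k, k], T[i, k].T).T
        for i in range(k + 1, nb):
            for j in range(k + 1, i + 1):
                T[i, j] = T[i, j] - T[i, k] @ T[j, k].T
    return T
```

Because each block is stored contiguously, the inner kernels touch streaming memory rather than strided rows of a column-major array, which is the low-overhead property the title claims.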
A fast and robust mixed-precision solver for the solution of sparse symmetric linear systems
2010
Abstract

Cited by 3 (0 self)
On many current and emerging computing architectures, single-precision calculations are at least twice as fast as double-precision calculations. In addition, the use of single precision may reduce pressure on memory bandwidth. The penalty for using single precision for the solution of linear systems is a potential loss of accuracy in the computed solutions. For sparse linear systems, the use of mixed precision in which double-precision iterative methods are preconditioned by a single-precision factorization can enable the recovery of high-precision solutions more quickly and use less memory than a sparse direct solver run using double-precision arithmetic. In this article, we consider the use of single precision within direct solvers for sparse symmetric linear systems, exploiting both the reduction in memory requirements and the performance gains. We develop a practical algorithm to apply a mixed-precision approach and suggest parameters and techniques to minimize the number of solves required by the iterative recovery process. These experiments provide the basis for our new code HSL_MA79, a fast, robust, mixed-precision sparse symmetric solver that is included in the mathematical software library HSL. Numerical results for a wide range of problems from practical applications are presented.
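The strategy can be sketched densely in NumPy: factorize once in float32, then recover a double-precision solution by iterative refinement (HSL_MA79 does this with a sparse symmetric factorization and more careful stopping and fallback logic; this shows only the shape of the algorithm):

```python
import numpy as np

def mixed_precision_solve(A, b, iters=5):
    """Cheap single-precision factorization + double-precision
    iterative refinement. A is symmetric positive definite."""
    L32 = np.linalg.cholesky(A.astype(np.float32))  # low-precision factor

    def solve32(r):
        # forward/back solves done entirely in single precision
        y = np.linalg.solve(L32, r.astype(np.float32))
        return np.linalg.solve(L32.T, y).astype(np.float64)

    x = solve32(b)
    for _ in range(iters):
        r = b - A @ x        # residual computed in double precision
        x = x + solve32(r)   # correction from the single-precision factor
    return x

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = mixed_precision_solve(A, b)
```

The factorization, the dominant cost in both time and memory, runs entirely in single precision; only cheap residual computations and corrections use doubles.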
SCISSORS: A Linear-Algebraical Technique to Rapidly Approximate Chemical Similarities
J. Chem. Inf. Model., 2010
Abstract

Cited by 2 (2 self)
Algorithms for several emerging large-scale problems in cheminformatics have as their rate-limiting step the evaluation of relatively slow chemical similarity measures, such as structural similarity or three-dimensional (3D) shape comparison. In this article we present SCISSORS, a linear-algebraical technique (related to multidimensional scaling and kernel principal components analysis) to rapidly estimate chemical similarities for several popular measures. We demonstrate that SCISSORS faithfully reflects its source similarity measures for both Tanimoto calculation and rank ordering. After an efficient precalculation step on a database, SCISSORS affords several orders of magnitude of speedup in database screening. SCISSORS furthermore provides an asymptotic speedup for large similarity matrix construction problems, reducing the number of conventional slow similarity evaluations required from quadratic to linear scaling.
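The linear-algebraical core can be sketched with NumPy: eigendecompose the similarity (Gram) matrix of a small basis set to get vectors, then estimate similarities as inner products (a toy using an exact inner-product kernel; function names are made up):

```python
import numpy as np

def fit_basis(S, dim):
    """Embed a basis set from its pairwise similarity matrix S, keeping
    the top `dim` eigenpairs (negative eigenvalues clipped to zero)."""
    w, V = np.linalg.eigh(S)
    idx = np.argsort(w)[::-1][:dim]
    return V[:, idx], np.clip(w[idx], 0, None)

def embed(s_to_basis, V, w):
    """Map an item, given its similarities to the basis set, to a vector
    whose inner products approximate the original measure."""
    return (V.T @ s_to_basis) / np.sqrt(w)

rng = np.random.default_rng(2)
P = rng.standard_normal((5, 3))
S = P @ P.T                      # a PSD stand-in for a slow similarity matrix
V, w = fit_basis(S, 3)
vecs = np.array([embed(S[:, i], V, w) for i in range(5)])
# inner products of the embedded vectors reproduce S
```

Only similarities against the small basis set are ever evaluated with the slow measure; all remaining pairs come from fast inner products, which is where the quadratic-to-linear reduction in slow evaluations comes from.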
Algorithm 9xx, SuiteSparseQR: multifrontal multithreaded rank-revealing sparse QR factorization
Abstract

Cited by 1 (0 self)
SuiteSparseQR is a sparse QR factorization package based on the multifrontal method. Within each frontal matrix, LAPACK and the multithreaded BLAS enable the method to obtain high performance on multicore architectures. Parallelism across different frontal matrices is handled with Intel’s Threading Building Blocks library. The symbolic analysis and ordering phase pre-eliminates singletons by permuting the input matrix A into the form [R11 R12; 0 A22], where R11 is upper triangular with diagonal entries above a given tolerance. Next, the fill-reducing ordering, column elimination tree, and frontal matrix structures are found without requiring the formation of the pattern of AᵀA. Approximate rank detection is performed within each frontal matrix using Heath’s method. While Heath’s method is not always exact, it has the advantage of not requiring column pivoting and thus does not interfere with the fill-reducing ordering. For sufficiently large problems, the resulting sparse QR factorization obtains a substantial fraction of the theoretical peak performance of a multicore computer.
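Singleton pre-elimination can be illustrated on a dense toy: repeatedly pull out a column whose sole remaining nonzero exceeds the tolerance; the chosen rows and columns, in order, form the upper-triangular R11 block (the production code does this on sparse data structures, but the logic is the same):

```python
import numpy as np

def find_singletons(A, tol):
    """Return (row, col) singleton pivots in elimination order.
    A column is a singleton if it has exactly one nonzero among the
    remaining rows and that entry's magnitude exceeds `tol`."""
    rows = list(range(A.shape[0]))
    cols = list(range(A.shape[1]))
    order = []
    changed = True
    while changed:
        changed = False
        for c in cols:
            nz = [r for r in rows if A[r, c] != 0]
            if len(nz) == 1 and abs(A[nz[0], c]) > tol:
                order.append((nz[0], c))
                rows.remove(nz[0])
                cols.remove(c)
                changed = True
                break
    return order  # rows/cols left over form the A22 block

A = np.array([[2.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [0.0, 0.0, 4.0]])
order = find_singletons(A, tol=0.5)
```

Permuting the pivot rows and columns to the front yields the [R11 R12; 0 A22] form, leaving only A22 for the multifrontal factorization proper.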