Results 11–20 of 602
Cache-oblivious algorithms
, 1999
"... requirements for the degree of Master of Science. This thesis presents "cacheoblivious " algorithms that use asymptotically optimal amounts of work, and move data asymptotically optimally among multiple levels of cache. An algorithm is cache oblivious if no program variables dependent on ..."
Abstract

Cited by 90 (1 self)
requirements for the degree of Master of Science. This thesis presents "cache-oblivious" algorithms that use asymptotically optimal amounts of work and move data asymptotically optimally among multiple levels of cache. An algorithm is cache oblivious if no program variables dependent on hardware configuration parameters, such as cache size and cache-line length, need to be tuned to minimize the number of cache misses. We show that the ordinary algorithms for matrix transposition, matrix multiplication, sorting, and Jacobi-style multipass filtering are not cache optimal. We present algorithms for rectangular matrix transposition, FFT, sorting, and multipass filters, which are asymptotically optimal on computers with multiple levels of caches. For a cache with size Z and cache-line length L, where Z = Ω(L²), the number of cache misses for an m × n matrix transpose is Θ(1 + mn/L). The number of cache misses for either an n-point FFT or the sorting of n numbers is O(1 + (n/L)(1 + log_Z n)). The cache complexity of computing n time steps of a Jacobi-style multipass filter on an array of size n is Θ(1 + n/L + n²/(ZL)). We also give a Θ(mnp)-work algorithm to multiply an m × n matrix by an n × p matrix
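The recursive structure behind these bounds is simple enough to sketch. Below is a minimal Python illustration of a cache-oblivious transpose (an assumed toy version, not the thesis's code): it recursively halves the larger dimension, so every level of a multi-level cache eventually sees subproblems that fit, yet the code mentions neither Z nor L. The cutoff of 16 is an arbitrary constant to limit recursion overhead, not a tuned parameter.

```python
def transpose(a, b, r0, r1, c0, c1):
    # write the transpose of a[r0:r1][c0:c1] into b, recursively halving
    # the larger dimension; no cache parameter appears anywhere
    rows, cols = r1 - r0, c1 - c0
    if rows <= 16 and cols <= 16:        # small base case (constant, not tuned)
        for i in range(r0, r1):
            for j in range(c0, c1):
                b[j][i] = a[i][j]
    elif rows >= cols:
        mid = r0 + rows // 2
        transpose(a, b, r0, mid, c0, c1)
        transpose(a, b, mid, r1, c0, c1)
    else:
        mid = c0 + cols // 2
        transpose(a, b, r0, r1, c0, mid)
        transpose(a, b, r0, r1, mid, c1)

m, n = 300, 200
a = [[i * n + j for j in range(n)] for i in range(m)]
b = [[0] * m for _ in range(n)]
transpose(a, b, 0, m, 0, n)
assert all(b[j][i] == a[i][j] for i in range(m) for j in range(n))
```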
SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms
 International Journal of High Performance Computing Applications
, 2004
"... SPIRAL is a generator for libraries of fast software implementations of linear signal processing transforms. These libraries are adapted to the computing platform and can be reoptimized as the hardware is upgraded or replaced. This paper describes the main components of SPIRAL: the mathematical fra ..."
Abstract

Cited by 82 (20 self)
SPIRAL is a generator for libraries of fast software implementations of linear signal processing transforms. These libraries are adapted to the computing platform and can be re-optimized as the hardware is upgraded or replaced. This paper describes the main components of SPIRAL: the mathematical framework that concisely describes signal transforms and their fast algorithms; the formula generator that captures at the algorithmic level the degrees of freedom in expressing a particular signal processing transform; the formula translator that encapsulates the compilation degrees of freedom when translating a specific algorithm into an actual code implementation; and, finally, an intelligent search engine that finds within the large space of alternative formulas and implementations
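SPIRAL's formula-level degree of freedom can be illustrated with the Cooley-Tukey rule, which turns any factorization n = k·m into a valid fast DFT algorithm; the generator searches over such choices. The toy Python recursion below is illustrative only; `radix_policy` is a made-up stand-in for SPIRAL's formula generator and search engine, which are far richer.

```python
import cmath

def dft(x, radix_policy):
    # recursive Cooley-Tukey DFT; radix_policy(n) must return a divisor
    # of n and embodies the algorithmic degree of freedom being searched
    n = len(x)
    if n == 1:
        return x[:]
    k = radix_policy(n)
    m = n // k
    cols = [dft(x[r::k], radix_policy) for r in range(k)]  # k sub-DFTs of size m
    out = [0j] * n
    for p in range(k):                    # combine with twiddle factors
        for q in range(m):
            t = q + m * p
            out[t] = sum(cols[r][q] * cmath.exp(-2j * cmath.pi * r * t / n)
                         for r in range(k))
    return out

x = [complex(i) for i in range(8)]
y_radix2 = dft(x, lambda n: 2)                       # always split n = 2 * (n/2)
y_radix4 = dft(x, lambda n: 4 if n % 4 == 0 else 2)  # prefer radix 4
assert all(abs(a - b) < 1e-9 for a, b in zip(y_radix2, y_radix4))
```

Both policies compute the same transform; on a real machine they would differ in speed, which is exactly the space the search engine explores.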
Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software
 SIAM Review, Vol. 46, No. 1, pp. 3–45
, 2004
"... Matrix computations are both fundamental and ubiquitous in computational science and its vast application areas. Along with the development of more advanced computer systems with complex memory hierarchies, there is a continuing demand for new algorithms and library software that efficiently utilize ..."
Abstract

Cited by 81 (6 self)
Matrix computations are both fundamental and ubiquitous in computational science and its vast application areas. Along with the development of more advanced computer systems with complex memory hierarchies, there is a continuing demand for new algorithms and library software that efficiently utilize and adapt to new architecture features. This article reviews and details some of the recent advances made by applying the paradigm of recursion to dense matrix computations on today's memory-tiered computer systems. Recursion allows for efficient utilization of a memory hierarchy and generalizes existing fixed blocking by introducing automatic variable blocking that has the potential of matching every level of a deep memory hierarchy. Novel recursive blocked algorithms offer new ways to compute factorizations such as Cholesky and QR and to solve matrix equations. In fact, the whole gamut of existing dense linear algebra factorizations is beginning to be reexamined in view of the recursive paradigm. Use of recursion has led to new hybrid data structures and optimized superscalar kernels. The results we survey include new algorithms and library software implementations for level 3 kernels, matrix factorizations, and the solution of general systems of linear equations and several common matrix equations. The software implementations we survey are robust and show impressive performance on today's high-performance computing systems.
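A recursive blocked Cholesky factorization makes the variable-blocking idea concrete. The sketch below is an assumed minimal version in Python/NumPy, not the authors' library code: splitting the matrix in halves converts most of the work into large level 3 (matrix-matrix) operations, and the recursion automatically produces block sizes that match every cache level. The cutoff of 32 is illustrative.

```python
import numpy as np

def recursive_cholesky(a):
    """Return lower-triangular L with L @ L.T == a (a symmetric pos. def.)."""
    n = a.shape[0]
    if n <= 32:                                   # small base case: any kernel works
        return np.linalg.cholesky(a)
    h = n // 2
    l11 = recursive_cholesky(a[:h, :h])
    # L21 solves L21 @ L11.T = A21: a triangular solve with many right-hand sides
    l21 = np.linalg.solve(l11, a[h:, :h].T).T
    # Schur complement update: one big level 3 (GEMM-like) operation
    l22 = recursive_cholesky(a[h:, h:] - l21 @ l21.T)
    l = np.zeros_like(a)
    l[:h, :h], l[h:, :h], l[h:, h:] = l11, l21, l22
    return l

rng = np.random.default_rng(0)
m = rng.standard_normal((100, 100))
spd = m @ m.T + 100 * np.eye(100)                 # well-conditioned SPD test matrix
l = recursive_cholesky(spd)
assert np.allclose(l @ l.T, spd)
```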
PetaBricks: a language and compiler for algorithmic choice
, 2009
"... It is often impossible to obtain a onesizefitsall solution for high performance algorithms when considering different choices for data distributions, parallelism, transformations, and blocking. The best solution to these choices is often tightly coupled to different architectures, problem sizes, ..."
Abstract

Cited by 79 (10 self)
It is often impossible to obtain a one-size-fits-all solution for high-performance algorithms when considering different choices for data distributions, parallelism, transformations, and blocking. The best solution to these choices is often tightly coupled to different architectures, problem sizes, data, and available system resources. In some cases, completely different algorithms may provide the best performance. Current compiler and programming language techniques are able to change some of these parameters, but today there is no simple way for the programmer to express or the compiler to choose different algorithms to handle different parts of the data. Existing solutions normally can handle only coarse-grained, library-level selections or hand-coded cutoffs between base cases and recursive cases. We present PetaBricks, a new implicitly parallel language
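The kind of choice PetaBricks exposes can be sketched with a sort that is either insertion sort or recursive merge sort, with the crossover point left to an autotuner. The Python below is a hedged illustration, not PetaBricks code; the timing loop is a stand-in for the real autotuner, and `tunable_sort` and its cutoff candidates are invented for this example.

```python
import random, time

def insertion_sort(xs):
    # in-place insertion sort: the base-case algorithmic choice
    for i in range(1, len(xs)):
        v, j = xs[i], i - 1
        while j >= 0 and xs[j] > v:
            xs[j + 1] = xs[j]
            j -= 1
        xs[j + 1] = v
    return xs

def tunable_sort(xs, cutoff):
    # algorithmic choice point: below the cutoff, switch algorithms entirely
    if len(xs) <= cutoff:
        return insertion_sort(xs)
    mid = len(xs) // 2
    a, b = tunable_sort(xs[:mid], cutoff), tunable_sort(xs[mid:], cutoff)
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):      # merge the sorted halves
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def time_cutoff(data, c):
    t0 = time.perf_counter()
    tunable_sort(list(data), c)
    return time.perf_counter() - t0

# stand-in autotuner: measure each candidate cutoff on this machine
data = [random.random() for _ in range(20000)]
best = min((8, 16, 32, 64, 128), key=lambda c: time_cutoff(data, c))
print("tuned cutoff on this machine:", best)
assert tunable_sort(list(data), best) == sorted(data)
```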
Programming by Sketching for Bit-Streaming Programs
, 2005
"... This paper introduces the concept of programming with sketches, an approach for the rapid development of highperformance applications. This approach allows a programmer to write clean and portable reference code, and then obtain a highquality implementation by simply sketching the outlines of the ..."
Abstract

Cited by 76 (9 self)
This paper introduces the concept of programming with sketches, an approach for the rapid development of high-performance applications. This approach allows a programmer to write clean and portable reference code, and then obtain a high-quality implementation by simply sketching the outlines of the desired implementation. Subsequently, a compiler automatically fills in the missing details while also ensuring that a completed sketch is faithful to the input reference code. In this paper, we develop StreamBit as a sketching methodology for the important class of bit-streaming programs (e.g., coding and cryptography). A sketch is a partial specification of the implementation, and as such, it affords several benefits to the programmer in terms of productivity and code robustness. First, a sketch is easier to write compared to a complete implementation. Second, sketching allows the programmer to focus on exploiting algorithmic properties rather
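The division of labor between programmer and synthesizer can be shown in miniature. In the hedged Python toy below (not StreamBit's actual language), `reference` is the clean specification, `sketch` is a bit-twiddling implementation whose shift amounts and mask are left as holes, and a brute-force search fills the holes so the sketch agrees with the reference on every input.

```python
from itertools import product

def reference(x):
    # clean, portable spec: swap each adjacent pair of bits in a byte
    bits = [(x >> i) & 1 for i in range(8)]
    return sum(bits[i ^ 1] << i for i in range(8))

def sketch(x, s1, s2, mask):
    # sketched implementation: the shifts s1, s2 and the mask are holes
    return ((x & mask) << s1) | ((x & ~mask & 0xFF) >> s2)

def synthesize():
    # brute-force "compiler": fill the holes so the sketch matches the
    # reference on every byte (real sketching compilers solve this smartly)
    for s1, s2, mask in product(range(8), range(8), range(256)):
        if all(sketch(x, s1, s2, mask) == reference(x) for x in range(256)):
            return s1, s2, mask

s1, s2, mask = synthesize()
print(f"holes filled: s1={s1}, s2={s2}, mask=0x{mask:02x}")  # 1, 1, 0x55
```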
Optimizing the performance of sparse matrix-vector multiplication
, 2000
"... Copyright 2000 by EunJin Im ..."
Is Search Really Necessary to Generate High-Performance BLAS?
, 2005
"... Abstract — A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and loop unrolling factors. Traditional compilers use simple analytical models to compute these values. In contrast, library generators like ATLAS use global search over the space of p ..."
Abstract

Cited by 67 (12 self)
A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and loop unrolling factors. Traditional compilers use simple analytical models to compute these values. In contrast, library generators like ATLAS use global search over the space of parameter values, generating programs with many different combinations of parameter values and running them on the actual hardware to determine which values give the best performance. It is widely believed that traditional model-driven optimization cannot compete with search-based empirical optimization because tractable analytical models cannot capture all the complexities of modern high-performance architectures, but few quantitative comparisons have been done to date. To make such a comparison, we replaced the global search engine in ATLAS with a model-driven optimization engine, and measured the relative performance of the code produced by the two systems on a variety of architectures. Since both systems use the same code generator, any differences in the performance of the code produced by the two systems can come only from differences in optimization parameter values. Our experiments show that model-driven optimization can be surprisingly effective, and can generate code with performance comparable to that of code generated by ATLAS using global search. Index Terms: program optimization, empirical optimization, model-driven optimization, compilers, library generators, BLAS, high-performance computing
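The contrast between the two approaches fits in a few lines for a single parameter. The Python sketch below (illustrative, not the paper's infrastructure) tunes the tile size of a blocked matrix multiply two ways: empirically, by timing candidates on the machine at hand, and analytically, by picking the largest tile whose working set fits an assumed cache. The 32 KB cache size and candidate list are made-up constants.

```python
import time

def blocked_matmul(a, b, n, t):
    # blocked (tiled) matrix multiply: c = a @ b with tile size t
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, t):
        for kk in range(0, n, t):
            for jj in range(0, n, t):
                for i in range(ii, min(ii + t, n)):
                    for k in range(kk, min(kk + t, n)):
                        aik, row_c, row_b = a[i][k], c[i], b[k]
                        for j in range(jj, min(jj + t, n)):
                            row_c[j] += aik * row_b[j]
    return c

n = 128
a = [[(i + j) % 7 / 7.0 for j in range(n)] for i in range(n)]
b = [[i * j % 5 / 5.0 for j in range(n)] for i in range(n)]
tiles = (8, 16, 32, 64)

def empirical_best():
    # ATLAS-style empirical search: run each candidate, keep the fastest
    def cost(t):
        t0 = time.perf_counter()
        blocked_matmul(a, b, n, t)
        return time.perf_counter() - t0
    return min(tiles, key=cost)

def model_best(cache_bytes=32 * 1024):
    # model-driven choice: largest tile whose three t-by-t blocks of
    # 8-byte floats fit the assumed cache
    fits = [t for t in tiles if 3 * t * t * 8 <= cache_bytes]
    return max(fits) if fits else min(tiles)

print("empirical search picks:", empirical_best())
print("analytical model picks:", model_best())
```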
Managing performance vs. accuracy tradeoffs with loop perforation
 In SIGSOFT FSE
, 2011
"... Many modern computations (such as video and audio encoders, Monte Carlo simulations, and machine learning algorithms) are designed to trade off accuracy in return for increased performance. To date, such computations typically use adhoc, domainspecific techniques developed specifically for the com ..."
Abstract

Cited by 66 (18 self)
Many modern computations (such as video and audio encoders, Monte Carlo simulations, and machine learning algorithms) are designed to trade off accuracy in return for increased performance. To date, such computations typically use ad hoc, domain-specific techniques developed specifically for the computation at hand. Loop perforation provides a general technique to trade accuracy for performance by transforming loops to execute a subset of their iterations. A criticality testing phase filters out critical loops (whose perforation produces unacceptable behavior) to identify tunable loops (whose perforation produces more efficient and still acceptably accurate computations). A perforation space exploration algorithm perforates combinations of tunable loops to find Pareto-optimal perforation policies. Our results indicate that, for a range of applications, this approach typically delivers performance increases of over a factor of two (and up to a factor of seven) while changing the result that the application produces by less than 10%.
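The transformation itself is tiny. The hedged Python example below perforates the sampling loop of a toy Monte Carlo estimate of pi, executing only every k-th iteration for a roughly k-fold work reduction, and applies an accuracy criterion in the paper's style (relative error under 10%) to decide whether the loop is tunable; the computation and thresholds are illustrative, not from the paper's benchmarks.

```python
import random

def monte_carlo_pi(n, perforation=1):
    # estimate pi by sampling points in the unit square; with
    # perforation=k, only every k-th loop iteration is executed
    random.seed(42)
    hits = trials = 0
    for i in range(n):
        if i % perforation:
            continue                       # perforated iteration: skipped
        x, y = random.random(), random.random()
        trials += 1
        hits += (x * x + y * y) <= 1.0
    return 4.0 * hits / trials

exact = monte_carlo_pi(1_000_000)          # full computation
for k in (2, 4, 8):
    approx = monte_carlo_pi(1_000_000, perforation=k)
    err = abs(approx - exact) / exact
    print(f"perforation {k}: ~{k}x less work, relative error {err:.2%}")
    assert err < 0.10                      # within the 10% bound => tunable loop
```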
Automatically Tuned Collective Communications
 In Proceedings of SC99: High Performance Networking and Computing
, 2000
"... The performance of the MPI's collective communications is critical in most MPIbased applications. A general algorithm for a given collective communication operation may not give good performance on all systems due to the di#erences in architectures, network parameters and the storage capacity ..."
Abstract

Cited by 63 (8 self)
The performance of MPI's collective communications is critical in most MPI-based applications. A general algorithm for a given collective communication operation may not give good performance on all systems due to differences in architectures, network parameters, and the storage capacity of the underlying MPI implementation. In this paper, we discuss an approach in which the collective communications are tuned for a given system by conducting a series of experiments on the system. We also discuss a dynamic topology method that uses the tuned static topology shape, but reorders the logical addresses to compensate for changing run-time variations. A series of experiments were conducted comparing our tuned collective communication operations to various native vendor MPI implementations. The use of the tuned collective communications resulted in about 30% to 650% improvement in performance over the native MPI implementations.
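Why no single algorithm wins everywhere is easy to see in the standard latency-bandwidth cost model, where sending m bytes costs α + βm. The Python sketch below evaluates three classic broadcast algorithms per (process count, message size) point and picks the cheapest; the analytic model stands in for the paper's timed experiments, and the α, β constants and candidate set are assumptions, not measurements.

```python
import math

# assumed machine constants: 20 microsecond latency, 1 GB/s bandwidth
ALPHA, BETA = 20e-6, 1e-9

def linear(p, m):
    # root sends to the other p-1 ranks one by one
    return (p - 1) * (ALPHA + BETA * m)

def binomial(p, m):
    # binomial tree: ceil(log2 p) rounds, whole message each round
    return math.ceil(math.log2(p)) * (ALPHA + BETA * m)

def scatter_allgather(p, m):
    # van de Geijn style: scatter then allgather; more messages,
    # but only about 2m(p-1)/p bytes of data moved per rank
    return (math.ceil(math.log2(p)) + p - 1) * ALPHA + 2 * BETA * m * (p - 1) / p

CANDIDATES = {"linear": linear, "binomial": binomial,
              "scatter+allgather": scatter_allgather}

def tuned_broadcast(p, m):
    # the "tuning" step: pick the cheapest algorithm for this (p, m) point
    return min(CANDIDATES, key=lambda name: CANDIDATES[name](p, m))

for p, m in [(4, 64), (64, 64), (64, 4 * 2**20)]:
    print(f"p={p:3d}, m={m:8d} bytes -> {tuned_broadcast(p, m)}")
```

Small messages favor the low-latency binomial tree while large messages favor the bandwidth-optimal scatter-allgather scheme, which is exactly the crossover behavior per-system tuning discovers empirically.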