Results 1 - 10
of
18
The Uniform Memory Hierarchy Model of Computation
- Algorithmica
, 1992
"... The Uniform Memory Hierarchy (UMH) model introduced in this paper captures performance-relevant aspects of the hierarchical nature of computer memory. It is used to quantify architectural requirements of several algorithms and to ratify the faster speeds achieved by tuned implementations that use im ..."
Abstract
-
Cited by 108 (9 self)
- Add to MetaCart
The Uniform Memory Hierarchy (UMH) model introduced in this paper captures performance-relevant aspects of the hierarchical nature of computer memory. It is used to quantify architectural requirements of several algorithms and to ratify the faster speeds achieved by tuned implementations that use improved data-movement strategies. A sequential computer's memory is modelled as a sequence hM 0 ; M 1 ; :::i of increasingly large memory modules. Computation takes place in M 0 . Thus, M 0 might model a computer's central processor, while M 1 might be cache memory, M 2 main memory, and so on. For each module M U , a bus B U connects it with the next larger module M U+1 . All buses may be active simultaneously. Data is transferred along a bus in fixed-sized blocks. The size of these blocks, the time required to transfer a block, and the number of blocks that fit in a module are larger for modules farther from the processor. The UMH model is parameterized by the rate at which the blocksizes i...
Automatic Program Parallelization
, 1993
"... This paper presents an overview of automatic program parallelization techniques. It covers dependence analysis techniques, followed by a discussion of program transformations, including straight-line code parallelization, do loop transformations, and parallelization of recursive routines. The last s ..."
Abstract
-
Cited by 97 (8 self)
- Add to MetaCart
This paper presents an overview of automatic program parallelization techniques. It covers dependence analysis techniques, followed by a discussion of program transformations, including straight-line code parallelization, do loop transformations, and parallelization of recursive routines. The last section of the paper surveys several experimental studies on the effectiveness of parallelizing compilers.
A comparison of empirical and model-driven optimization
- In ACM Symp. on Programming Language Design and Implementation (PLDI’03
, 2003
"... Empirical program optimizers estimate the values of key optimization parameters by generating different program versions and running them on the actual hardware to determine which values give the best performance. In contrast, conventional compilers use models of programs and machines to choose thes ..."
Abstract
-
Cited by 63 (9 self)
- Add to MetaCart
Empirical program optimizers estimate the values of key optimization parameters by generating different program versions and running them on the actual hardware to determine which values give the best performance. In contrast, conventional compilers use models of programs and machines to choose these parameters. It is widely believed that empirical optimization is more effective than model-driven optimization, but few quantitative comparisons have been done to date. To make such a comparison, we replaced the empirical optimization engine in ATLAS (a system for generating dense numerical linear algebra libraries) with a model-based optimization engine that used detailed models to estimate values for optimization parameters, and then measured the relative performance of the two systems on three different hardware platforms. Our experiments show that although model-based optimization can be surprisingly effective, useful models may have to consider not only hardware parameters but also the ability of back-end compilers to exploit hardware resources. 1.
Is Search Really Necessary to Generate High-Performance BLAS?
, 2005
"... Abstract — A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and loop unrolling factors. Traditional compilers use simple analytical models to compute these values. In contrast, library generators like ATLAS use global search over the space of p ..."
Abstract
-
Cited by 32 (7 self)
- Add to MetaCart
Abstract — A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and loop unrolling factors. Traditional compilers use simple analytical models to compute these values. In contrast, library generators like ATLAS use global search over the space of parameter values by generating programs with many different combinations of parameter values, and running them on the actual hardware to determine which values give the best performance. It is widely believed that traditional model-driven optimization cannot compete with search-based empirical optimization because tractable analytical models cannot capture all the complexities of modern high-performance architectures, but few quantitative comparisons have been done to date. To make such a comparison, we replaced the global search engine in ATLAS with a model-driven optimization engine, and measured the relative performance of the code produced by the two systems on a variety of architectures. Since both systems use the same code generator, any differences in the performance of the code produced by the two systems can come only from differences in optimization parameter values. Our experiments show that model-driven optimization can be surprisingly effective, and can generate code with performance comparable to that of code generated by ATLAS using global search. Index Terms — program optimization, empirical optimization, model-driven optimization, compilers, library generators, BLAS, high-performance computing
MOB Forms: A Class of Multilevel Block Algorithms for Dense Linear Algebra Operations
- INTERNATIONAL CONFERENCE ON SUPERCOMPUTING
, 1994
"... Multilevel block algorithms exploit the data locality in linear algebra operations when executed in machines with several levels in the memory hierarchy. It is shown that the family we call Multilevel Orthogonal Block (MOB) algorithms is optimal and easy to design and that using the multilevel appro ..."
Abstract
-
Cited by 25 (8 self)
- Add to MetaCart
Multilevel block algorithms exploit the data locality in linear algebra operations when executed in machines with several levels in the memory hierarchy. It is shown that the family we call Multilevel Orthogonal Block (MOB) algorithms is optimal and easy to design and that using the multilevel approach produces significant performance improvements. The effect of interference in the cache, of the TLB misses, and of page faults are also considered. The multilevel block algorithms are evaluated analytically for an ideal memory system with M cache levels without interferences. Moreover, experimental results of the MOB forms in some present high performance workstations are presented.
Programming with tiles
- In PPoPP ’08: Proceedings of the Thirteenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
, 2008
"... The importance of tiles or blocks in scientific computing cannot be overstated. Many algorithms, both iterative and recursive, can be expressed naturally if tiles are represented explicitly. From the point of view of performance, tiling, either as a code or a data layout transformation, is one of th ..."
Abstract
-
Cited by 14 (3 self)
- Add to MetaCart
The importance of tiles or blocks in scientific computing cannot be overstated. Many algorithms, both iterative and recursive, can be expressed naturally if tiles are represented explicitly. From the point of view of performance, tiling, either as a code or a data layout transformation, is one of the most effective ways to exploit locality, which is a must to achieve good performance in current computers because of the significant difference in speed between processor and memory. Furthermore, tiles are also useful to express data distribution in parallel computations. However, despite the importance of tiles, most languages do not support them directly. This gives place to bloated programs populated with numerous subscript expressions which make the code difficult to read and coding mistakes more likely. This paper discusses Hierarchically Tiled Arrays (HTAs), a data type which facilitates the easy manipulation of tiles in objectoriented languages with emphasis on two new features, dynamic partitioning and overlapped tiling. These features facilitate the expression of locality and communication while maintaining the same performance of algorithms written using conventional languages.
FFT program generation for shared memory: SMP and multicore
- In Proc. Supercomputing
, 2006
"... The chip maker’s response to the approaching end of CPU frequency scaling are multicore systems, which offer the same programming paradigm as traditional shared memory platforms but different performance characteristics. This situation considerably increases the burden on library developers and stre ..."
Abstract
-
Cited by 12 (9 self)
- Add to MetaCart
The chip maker’s response to the approaching end of CPU frequency scaling are multicore systems, which offer the same programming paradigm as traditional shared memory platforms but different performance characteristics. This situation considerably increases the burden on library developers and strengthens the case for automatic performance tuning frameworks such as Spiral, a program generator and optimizer for linear transforms such as the discrete Fourier transform (DFT). We present a shared memory extension of Spiral. The extension within Spiral consists of a rewriting system that manipulates the structure of transform algorithms to achieve load balancing and avoids false sharing, and of a backend to generate multithreaded code. Application to the DFT produces a novel class of algorithms suitable for multicore systems as validated by experimental results: we demonstrate parallelization speed-up for sizes that fit into L1 cache and compare favorably to other DFT libraries across all small and midsize DFTs and considered platforms. 1
Matrix Algebra and Applicative Programming
- Functional Programming Languages and Computer Architecture (Proceedings
, 1987
"... General Term: Algorithms. The broad problem of matrix algebra is taken up from the perspective of functional program-ming. Akey question is how arrays should be represented in order to admit good implementations of well-known e cient algorithms, and whether functional architecture sheds any new ligh ..."
Abstract
-
Cited by 12 (1 self)
- Add to MetaCart
General Term: Algorithms. The broad problem of matrix algebra is taken up from the perspective of functional program-ming. Akey question is how arrays should be represented in order to admit good implementations of well-known e cient algorithms, and whether functional architecture sheds any new light on these or other solutions. It relates directly to disarming the \aggregate update " problem. The major thesis is that 2 d-ary trees should be used to represent d-dimensional arrays � ex-amples are matrix operations (d = 2), and a particularly interesting vector (d = 1) algorithm. Sparse and dense matrices are represented homogeneously, but at some overhead that appears tolerable � encouraging results are reviewed and extended. A Pivot Step algorithm is described which o ers optimal stability at no extra cost for searching. The new results include proposed sparseness measures for matrices, improved performance of stable matrix inversion through re-peated pivoting while deep within a matrix-tree (extendible to solving linear systems), and a clean matrix derivation of the vector algorithm for the fast Fourier transform. Running code is o ered in the appendices.
FFT Program Generation for Shared Memory
- SMP and Multicore”, SC2006
, 2006
"... The chip maker’s response to the approaching end of CPU frequency scaling are multicore systems, which offer the same programming paradigm as traditional shared memory platforms but have different performance characteristics. This situation considerably increases the burden on library developers and ..."
Abstract
-
Cited by 5 (4 self)
- Add to MetaCart
The chip maker’s response to the approaching end of CPU frequency scaling are multicore systems, which offer the same programming paradigm as traditional shared memory platforms but have different performance characteristics. This situation considerably increases the burden on library developers and strengthens the case for automatic performance tuning frameworks like Spiral, a program generator and optimizer for linear transforms such as the discrete Fourier transform (DFT). We present a shared memory extension of Spiral. The extension within Spiral consists of a rewriting system that manipulates the structure of transform algorithms to achieve load balancing and avoids false sharing, and of a backend to generate multithreaded code. Application to the DFT produces a novel class of algorithms suitable for multicore systems as validated by experimental results: we demonstrate a parallelization speed-up already for sizes that fit into L1 cache and compare favorably to other DFT libraries across all small and midsize DFTs and considered platforms. CR Categories: F.2.1 [Analysis of algorithms and problem complexity]: Numerical algorithms and problems—Fast
EXPLOITING LOCALITY AND PARALLELISM WITH HIERARCHICALLY TILED ARRAYS
"... The importance of tiles or blocks in mathematics and thus computer science cannot be overstated. From a high level point of view, they are the natural way to express many algorithms, both in iterative and recursive forms. Tiles or sub-tiles are used as basic units in the algorithm description. From ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
The importance of tiles or blocks in mathematics and thus computer science cannot be overstated. From a high level point of view, they are the natural way to express many algorithms, both in iterative and recursive forms. Tiles or sub-tiles are used as basic units in the algorithm description. From a low level point of view, tiling, either as the unit maintained by the algorithm, or as a class of data layouts, is one of the most effective ways to exploit locality, which is a must to achieve good performance in current computers given the growing gap between memory and processor speed. Finally, tiles and operations on them are also basic to express data distribution and parallelism. Despite the importance of this concept, which makes inevitable its widespread usage, most languages do not support it directly. Programmers have to understand and manage the low-level details along with the introduction of tiling. This gives place to bloated potentially error-prone programs in which opportunities for performance are lost. On the other hand, the disparity between the algorithm and the actual implementation enlarges. This thesis illustrates the power of Hierarchically Tiled Arrays (HTAs), a data type which

