Results 1 - 10 of 573
A Data Locality Optimizing Algorithm
1991
"... This paper proposes an algorithm that improves the locality of a loop nest by transforming the code via interchange, reversal, skewing and tiling. The loop transformation algorithm is based on two concepts: a mathematical formulation of reuse and locality, and a loop transformation theory that unifi ..."
Cited by 804 (16 self)
This paper proposes an algorithm that improves the locality of a loop nest by transforming the code via interchange, reversal, skewing and tiling. The loop transformation algorithm is based on two concepts: a mathematical formulation of reuse and locality, and a loop transformation theory that unifies the various transforms as unimodular matrix transformations. The algorithm has been implemented in the SUIF (Stanford University Intermediate Format) compiler, and is successful in optimizing codes such as matrix multiplication, successive over-relaxation (SOR), LU decomposition without pivoting, and Givens QR factorization. Performance evaluation indicates that locality optimization is especially crucial for scaling up the performance of parallel code.
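The transformations the paper unifies (interchange, tiling) are easy to illustrate on matrix multiplication. A minimal C sketch, assuming square row-major arrays and an illustrative tile size not taken from the paper:

```c
#include <stddef.h>

#define N 512   /* problem size (illustrative; assumes N is a multiple of T) */
#define T 64    /* tile size (illustrative; the paper derives it from its locality model) */

/* Blocked matrix multiply, C += A * B. Tiling the loops keeps a
 * T x T working set of each array in cache; the j-innermost order
 * walks rows contiguously for spatial locality in row-major arrays. */
void matmul_tiled(double C[N][N], const double A[N][N], const double B[N][N])
{
    for (size_t ii = 0; ii < N; ii += T)
        for (size_t kk = 0; kk < N; kk += T)
            for (size_t jj = 0; jj < N; jj += T)
                for (size_t i = ii; i < ii + T; i++)
                    for (size_t k = kk; k < kk + T; k++)
                        for (size_t j = jj; j < jj + T; j++)
                            C[i][j] += A[i][k] * B[k][j];
}
```

Tiling turns the reuse that the paper's analysis identifies into actual locality by keeping a small working set of each array resident across the inner loops.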
Design and Evaluation of a Compiler Algorithm for Prefetching
In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 1992
"... Software-controlled data prefetching is a promising technique for improving the performance of the memory subsystem to match today's high-performance processors. While prefetching is useful in hiding the latency, issuing prefetches incurs an instruction overhead and can increase the load on the ..."
Cited by 501 (20 self)
Software-controlled data prefetching is a promising technique for improving the performance of the memory subsystem to match today's high-performance processors. While prefetching is useful in hiding the latency, issuing prefetches incurs an instruction overhead and can increase the load on the memory subsystem. As a result, care must be taken to ensure that such overheads do not exceed the benefits.
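The latency-versus-overhead trade-off described above can be sketched with the GCC/Clang __builtin_prefetch intrinsic. The paper's compiler inserts prefetches automatically, so the hand-placed call and fixed distance below are illustrative assumptions:

```c
#include <stddef.h>

#define DIST 16  /* prefetch distance in elements (illustrative) */

/* Sum an array while prefetching DIST elements ahead. DIST must be
 * large enough to cover memory latency, yet small enough that the
 * prefetched lines are not evicted before they are used. */
double sum_prefetched(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], 0 /* read */, 3 /* high locality */);
        s += a[i];
    }
    return s;
}
```

The paper's algorithm goes further: locality analysis restricts prefetching to references predicted to miss, and loop splitting removes the bounds check from the steady-state loop, keeping the instruction overhead below the latency saved.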
Improving Data Locality with Loop Transformations
ACM Transactions on Programming Languages and Systems, 1996
"... ..."
(Show Context)
Tile Size Selection Using Cache Organization and Data Layout
1995
"... When dense matrix computations are too large to fit in cache, previous research proposes tiling to reduce or eliminate capacity misses. This paper presents a new algorithm for choosing problem-size dependent tile sizes based on the cache size and cache line size for a direct-mapped cache. The algori ..."
Cited by 231 (2 self)
When dense matrix computations are too large to fit in cache, previous research proposes tiling to reduce or eliminate capacity misses. This paper presents a new algorithm for choosing problem-size dependent tile sizes based on the cache size and cache line size for a direct-mapped cache. The algorithm eliminates both capacity and self-interference misses and reduces cross-interference misses. We measured simulated miss rates and execution times for our algorithm and two others on a variety of problem sizes and cache organizations. At higher set associativity, our algorithm does not always achieve the best performance. However, on direct-mapped caches, our algorithm improves simulated miss rates and measured execution times when compared with previous work.
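The flavor of problem-size dependent tile selection can be sketched with a Euclidean-style recurrence over the cache size and the array row length. This is a simplified sketch, not the paper's exact algorithm, which also folds in the cache line size and selects the candidate that maximizes tile area within the cache:

```c
#include <stdio.h>

/* Enumerate candidate tile dimensions via the Euclidean recurrence on
 * the cache size and the array row length (both in elements). The
 * remainders bound how many consecutive rows can be touched before two
 * rows collide in a direct-mapped cache, making them natural candidate
 * tile heights that avoid self-interference misses. */
void candidate_tile_heights(unsigned cache_elems, unsigned row_len)
{
    unsigned a = cache_elems, b = row_len % cache_elems;
    while (b > 0) {
        printf("candidate tile height: %u\n", b);
        unsigned r = a % b;
        a = b;
        b = r;
    }
}
```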
Memory bandwidth limitations of future microprocessors
In Proceedings of the 23rd Annual International Symposium on Computer Architecture, 1996
"... This paper makes the case that pin bandwidth will be a critical consideration for future microprocessors. We show that many of the techniques used to tolerate growing memory latencies do so at the expense of increased bandwidth requirements. Using a decomposition of execution time, we show that for ..."
Cited by 226 (12 self)
This paper makes the case that pin bandwidth will be a critical consideration for future microprocessors. We show that many of the techniques used to tolerate growing memory latencies do so at the expense of increased bandwidth requirements. Using a decomposition of execution time, we show that for modern processors that employ aggressive memory latency tolerance techniques, wasted cycles due to insufficient bandwidth generally exceed those due to raw memory latencies. Given the importance of maximizing memory bandwidth, we calculate effective pin bandwidth, then estimate optimal effective pin bandwidth. We measure these quantities by determining the amount by which both caches and minimal-traffic caches filter accesses to the lower levels of the memory hierarchy. We see that there is a gap that can exceed two orders of magnitude between the total memory traffic generated by caches and the minimal-traffic caches—implying that the potential exists to increase effective pin bandwidth substantially. We decompose this traffic gap into four factors, and show they contribute quite differently to traffic reduction for different benchmarks. We conclude that, in the short term, pin bandwidth limitations will make more complex on-chip caches cost-effective. For example, flexible caches may allow individual applications to choose from a range of caching policies. In the long term, we predict that off-chip accesses will be so expensive that all system memory will reside on one or more processor chips.
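Effective pin bandwidth is, at bottom, off-chip traffic divided by execution time, and the traffic gap is a ratio against a minimal-traffic cache. A small arithmetic sketch with hypothetical counter values (the names and numbers are illustrative, not measurements from the paper):

```c
#include <stdio.h>

/* Effective pin bandwidth: off-chip traffic actually moved per unit of
 * execution time. Comparing it to a minimal-traffic cache's traffic
 * bounds the headroom available to smarter caching. All values below
 * are hypothetical. */
int main(void)
{
    double offchip_accesses = 2.0e9;  /* cache misses reaching the pins */
    double line_bytes       = 64.0;   /* bytes moved per access */
    double exec_seconds     = 10.0;   /* program execution time */
    double minimal_bytes    = 4.0e9;  /* traffic of a minimal-traffic cache */

    double traffic = offchip_accesses * line_bytes;
    printf("effective pin bandwidth: %.2f GB/s\n", traffic / exec_seconds / 1e9);
    printf("traffic gap vs. minimal-traffic cache: %.0fx\n", traffic / minimal_bytes);
    return 0;
}
```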
Compiler Optimizations for Improving Data Locality
In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, 1994
"... In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this paper, we present compiler optimizations to improve data locality base ..."
Cited by 220 (17 self)
In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this paper, we present compiler optimizations to improve data locality based on a simple yet accurate cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop organizations. The cost model drives the application of compound transformations consisting of loop permutation, loop fusion, loop distribution, and loop reversal. We demonstrate that these program transformations are useful for optimizing many programs. To validate our optimization strategy, we implemented our algorithms and ran experiments on a large collection of scientific programs and kernels. Experiments with kernels illustrate that our model and algorithm can select and achieve the best performance. For over thirty complete applications, we executed the original and transformed versions.
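Among the compound transformations listed, loop fusion is the simplest to show. A minimal C sketch with illustrative arrays:

```c
#include <stddef.h>

/* Before: two loops stream over a[], loading each a[i] twice. */
void unfused(const double *a, double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++) b[i] = 2.0 * a[i];
    for (size_t i = 0; i < n; i++) c[i] = a[i] + b[i];
}

/* After fusion: a[i] and b[i] are reused while still in registers or
 * cache (temporal locality), cutting memory traffic on a[] and b[]. */
void fused(const double *a, double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        b[i] = 2.0 * a[i];
        c[i] = a[i] + b[i];
    }
}
```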
Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA
In PPoPP '08, 2008
"... GPUs have recently attracted the attention of many application developers as commodity data-parallel coprocessors. The newest generations of GPU architecture provide easier programmability and increased generality while maintaining the tremendous memory bandwidth and computational power of tradition ..."
Cited by 215 (11 self)
GPUs have recently attracted the attention of many application developers as commodity data-parallel coprocessors. The newest generations of GPU architecture provide easier programmability and increased generality while maintaining the tremendous memory bandwidth and computational power of traditional GPUs. This opportunity should redirect efforts in GPGPU research from ad hoc porting of applications to establishing principles and strategies that allow efficient mapping of computation to graphics hardware. In this work we discuss the GeForce 8800 GTX processor’s organization, features, and generalized optimization strategies. Key to performance on this platform is using massive multithreading to utilize the large number of cores and hide global memory latency. To achieve this, developers face the challenge of striking the right balance between each thread’s resource usage and the number of simultaneously active threads. The resources to manage include the number of registers and the amount of on-chip memory used per thread, the number of threads per multiprocessor, and global memory bandwidth. We also obtain increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and by applying classical optimizations to reduce the number of executed operations. We apply these strategies across a variety of applications and domains and achieve speedups of 10.5X to 457X in kernel codes and 1.16X to 431X in total application performance.
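The access-reordering point, combining adjacent threads' requests into one wide memory transaction, is largely a data-layout question. A plain-C sketch of the two layouts involved; on the GPU, each iteration of t below would be a separate thread of a warp:

```c
#include <stddef.h>

#define NTHREADS 256

struct particle { float x, y, z; };     /* array-of-structs (AoS) */

struct particles_soa {                  /* struct-of-arrays (SoA) */
    float x[NTHREADS], y[NTHREADS], z[NTHREADS];
};

/* AoS: "thread" t loads p[t].x at a 12-byte stride, so neighboring
 * threads touch non-adjacent addresses and their loads cannot be
 * combined into one wide transaction. */
float sum_x_aos(const struct particle *p)
{
    float s = 0.0f;
    for (size_t t = 0; t < NTHREADS; t++) s += p[t].x;
    return s;
}

/* SoA: thread t loads soa->x[t]; consecutive threads touch consecutive
 * floats, exactly the pattern GPU memory systems coalesce. */
float sum_x_soa(const struct particles_soa *soa)
{
    float s = 0.0f;
    for (size_t t = 0; t < NTHREADS; t++) s += soa->x[t];
    return s;
}
```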
Cache-Conscious Structure Layout
1999
"... Hardware trends have produced an increasing disparity between processor speeds and memory access times. While a variety of techniques for tolerating or reducing memory latency have been proposed, these are rarely successful for pointer-manipulating programs. This paper explores a complementary appro ..."
Cited by 207 (9 self)
Hardware trends have produced an increasing disparity between processor speeds and memory access times. While a variety of techniques for tolerating or reducing memory latency have been proposed, these are rarely successful for pointer-manipulating programs. This paper explores a complementary approach that attacks the source of the problem (poor reference locality) rather than its manifestation (memory latency). It demonstrates that careful data organization and layout provide an essential mechanism to improve the cache locality of pointer-manipulating programs and, consequently, their performance. It explores two placement techniques, clustering and coloring, that improve cache performance by increasing a pointer structure’s spatial and temporal locality and by reducing cache conflicts. To reduce the cost of applying these techniques, this paper discusses two strategies, cache-conscious reorganization and cache-conscious allocation, and describes two semi-automatic tools, ccmorph and ccmalloc, that use these strategies to produce cache-conscious pointer structure layouts. ccmorph is a transparent tree reorganizer that utilizes topology information to cluster and color the structure. ccmalloc is a cache-conscious heap allocator that attempts to co-locate contemporaneously accessed data elements in the same physical cache block. Our evaluations, with microbenchmarks, several small benchmarks, and a couple of large real-world applications, demonstrate that the cache-conscious structure layouts produced by ccmorph and ccmalloc offer large performance benefits, in most cases significantly outperforming state-of-the-art prefetching.
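The allocation strategy is straightforward to sketch: place a newly allocated node in the same cache block as a node it will be traversed with. A minimal C sketch of a ccmalloc-style interface; this illustrates the strategy, not the tool's actual implementation:

```c
#include <stdint.h>
#include <stdlib.h>

#define BLOCK 64  /* cache block size in bytes (illustrative) */

static char  *cur_block;          /* current bump-allocation block */
static size_t cur_used = BLOCK;   /* forces a fresh block on first call */

/* cc_alloc(size, near): try to place the new object in the cache block
 * that already holds *near, so objects accessed together share a block.
 * Sketch only: assumes size <= BLOCK; without a usable hint it simply
 * starts a new block, and old blocks are never freed. */
void *cc_alloc(size_t size, const void *near)
{
    int hint_in_block = near != NULL &&
        (uintptr_t)near / BLOCK == (uintptr_t)cur_block / BLOCK;

    if (!hint_in_block || cur_used + size > BLOCK) {
        cur_block = aligned_alloc(BLOCK, BLOCK);  /* block-aligned */
        if (cur_block == NULL)
            return NULL;
        cur_used = 0;
    }
    void *p = cur_block + cur_used;
    cur_used += size;
    return p;
}
```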
Data and Computation Transformations for Multiprocessors
In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1995
"... Effective memory hierarchy utilization is critical to the performance of modern multiprocessor architectures. We havedeveloped the first compiler system that fully automatically parallelizes sequential programs and changes the original array layouts to improve memory system performance. Our optimiza ..."
Cited by 177 (15 self)
Effective memory hierarchy utilization is critical to the performance of modern multiprocessor architectures. We have developed the first compiler system that fully automatically parallelizes sequential programs and changes the original array layouts to improve memory system performance. Our optimization algorithm consists of two steps. The first step chooses the parallelization and computation assignment such that synchronization and data sharing are minimized. The second step then restructures the layout of the data in the shared address space with an algorithm that is based on a new data transformation framework. We ran our compiler on a set of application programs and measured their performance on the Stanford DASH multiprocessor. Our results show that the compiler can effectively optimize parallelism in conjunction with memory subsystem performance.
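One data transformation of the kind the second step performs: when a processor owns a strided slice of a row-major array, remapping its partition into a contiguous block keeps each processor's data in its own cache lines. A C sketch with an illustrative partitioning (the names and parameters are assumptions, not the paper's framework):

```c
#include <stddef.h>

#define N 1024        /* array dimension (illustrative) */
#define P 4           /* number of processors (illustrative) */
#define W (N / P)     /* columns owned by each processor */

/* If processor p owns columns [p*W, (p+1)*W) of a row-major N x N
 * array, its elements are strided and share cache lines with other
 * processors' data (false sharing). Remapping each partition into a
 * contiguous block fixes that; accesses then go through an index
 * function instead of plain a[i][j]. */
static inline size_t blocked_index(size_t i, size_t j)
{
    size_t owner = j / W;               /* processor owning column j  */
    size_t local = j % W;               /* column within its block    */
    return owner * N * W                /* skip earlier partitions    */
         + i * W + local;               /* row-major inside the block */
}
```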
Cache Profiling and the SPEC Benchmarks: A Case Study
IEEE Computer, 1994
"... As VLSI technology improvements continue to widen the gap between processor and main memory cycle times, cache performance becomes increasingly important to overall system performance. Cache memories help alleviate the cycle time disparity, but only for programs that exhibit sufficient spatial an ..."
Cited by 160 (7 self)
As VLSI technology improvements continue to widen the gap between processor and main memory cycle times, cache performance becomes increasingly important to overall system performance. Cache memories help alleviate the cycle time disparity, but only for programs that exhibit sufficient spatial and temporal locality. Programs with unruly access patterns spend much of their time transferring data to and from the cache. To fully exploit the performance potential of fast processors, programmers must explicitly consider cache behavior, restructuring their codes to increase locality. As these fast processors proliferate, techniques for improving cache performance must move beyond the supercomputer and multiprocessor communities and into the mainstream of computing. In this paper, we examine some of the techniques that programmers can use to improve cache performance. We show how to use CPROF, a cache profiler, to identify cache performance bottlenecks and gain insight into their origin.
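A typical bottleneck a cache profiler like CPROF exposes is conflict misses between arrays that map to the same cache sets; padding is the classic remedy. A minimal C sketch with illustrative sizes:

```c
#define N 4096   /* power-of-two length: the worst case for conflicts */
#define PAD 8    /* padding elements (illustrative) */

/* With power-of-two sizes, a[i] and b[i] can map to the same set of a
 * direct-mapped cache and evict each other on every iteration. The pad
 * field shifts b's cache-set mapping; the struct guarantees the layout. */
static struct {
    double a[N];
    double pad[PAD];
    double b[N];
} g;

double dot(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += g.a[i] * g.b[i];  /* no more ping-pong conflict misses */
    return s;
}
```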