Results 1 - 10 of 15
Spatio-Temporal Memory Streaming
"... Recent research advocates memory streaming techniques to alleviate the performance bottleneck caused by the high latencies of off-chip memory accesses. Temporal memory streaming replays previously observed miss sequences to eliminate long chains of dependent misses. Spatial memory streaming predicts ..."
Abstract
-
Cited by 18 (3 self)
- Add to MetaCart
(Show Context)
Recent research advocates memory streaming techniques to alleviate the performance bottleneck caused by the high latencies of off-chip memory accesses. Temporal memory streaming replays previously observed miss sequences to eliminate long chains of dependent misses. Spatial memory streaming predicts repetitive data layout patterns within fixed-size memory regions. Because each technique targets a different subset of misses, their effectiveness varies across workloads and each leaves a significant fraction of misses unpredicted. In this paper, we propose Spatio-Temporal Memory Streaming (STeMS) to exploit the synergy between spatial and temporal streaming. We observe that the order of spatial accesses repeats both within and across regions. STeMS records and replays the temporal sequence of region accesses and uses spatial relationships within each region to dynamically reconstruct a predicted total miss order. Using trace-driven and cycle-accurate simulation across a suite of commercial workloads, we demonstrate that with implementation complexity similar to that of temporal streaming, STeMS achieves equal or higher coverage than spatial or temporal memory streaming alone, and improves performance by 31%, 3%, and 18% over stride, spatial, and temporal prediction, respectively.
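To make the mechanism concrete, here is a minimal, hypothetical Python sketch of the STeMS idea: a temporal log of region-trigger misses plus a per-region spatial footprint, combined to reconstruct a predicted miss stream. The region size, lookahead depth, and all names are invented for illustration; the real design also interleaves streams using recorded inter-miss timing, which this sketch omits.

```python
from collections import defaultdict

REGION = 2048   # assumed region size: 32 blocks of 64 B

class SpatioTemporalStreamer:
    """Toy STeMS-style predictor: a temporal log of region triggers plus a
    per-region spatial footprint, replayed together on a recurring miss."""

    def __init__(self):
        self.temporal_log = []            # region bases in first-touch order
        self.index = defaultdict(list)    # region base -> positions in the log
        self.spatial = {}                 # region base -> set of offsets touched

    def record_miss(self, addr):
        base, off = addr - addr % REGION, addr % REGION
        if base not in self.spatial:      # first touch of a region: a temporal trigger
            self.index[base].append(len(self.temporal_log))
            self.temporal_log.append(base)
            self.spatial[base] = set()
        self.spatial[base].add(off)

    def predict(self, addr, depth=2):
        """On a miss to a known trigger, replay the next `depth` regions from
        the temporal log and expand each with its spatial footprint."""
        base = addr - addr % REGION
        predicted = []
        for pos in self.index.get(base, []):
            for b in self.temporal_log[pos:pos + depth]:
                predicted += [b + o for o in sorted(self.spatial[b])]
        return predicted

s = SpatioTemporalStreamer()
for a in (0, 64, 128, 4096, 4160):        # one pass over two regions
    s.record_miss(a)
print(s.predict(0))                       # [0, 64, 128, 4096, 4160]
```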
Software Data Spreading: Leveraging Distributed Caches to Improve Single Thread Performance
"... Single thread performance remains an important consideration even for multicore, multiprocessor systems. As a result, techniques for improving single thread performance using multiple cores have received considerable attention. This work describes a technique, software data spreading, that leverages ..."
Abstract
-
Cited by 11 (5 self)
- Add to MetaCart
Single thread performance remains an important consideration even for multicore, multiprocessor systems. As a result, techniques for improving single thread performance using multiple cores have received considerable attention. This work describes a technique, software data spreading, that leverages the cache capacity of extra cores and extra sockets rather than their computational resources. Software data spreading is a software-only technique that uses compiler-directed thread migration to aggregate cache capacity across cores and chips and improve performance. This paper describes an automated scheme that applies data spreading to various types of loops. Experiments with a set of SPEC2000, SPEC2006, NAS, and microbenchmark workloads show that data spreading can provide a speedup of over 2x, averaging 17% for the SPEC and NAS applications on two systems. In addition, despite using more cores for the same computation, data spreading actually saves power since it reduces accesses to DRAM.
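The core trick, migrating a thread across cores so each chunk of a loop's data stays resident in a different core's cache, can be sketched in a few lines. The following is a schematic Python illustration, not the authors' compiler transformation: it assumes Linux (os.sched_setaffinity), and in CPython the interpreter overhead would swamp any real cache benefit, so it only shows the shape of the transformation.

```python
import os   # os.sched_setaffinity is Linux-specific (assumption)

def spread_loop(data, cores, passes=3):
    """Schematic software data spreading: one thread walks the array in
    chunks, migrating itself so chunk i is always touched on core i.
    Across passes, each chunk stays warm in "its" core's cache."""
    chunk = (len(data) + len(cores) - 1) // len(cores)
    total = 0
    for _ in range(passes):                    # a loop that re-traverses its data
        for i, core in enumerate(cores):
            os.sched_setaffinity(0, {core})    # migrate this thread to core i
            for x in data[i * chunk:(i + 1) * chunk]:
                total += x                     # stand-in for the real loop body
    return total

print(spread_loop(list(range(1 << 14)), cores=[0, 1]))
```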
Performance driven data cache prefetching in a dynamic software optimization system
- In Proceedings of the 2007 ACM International Conference on Supercomputing (ICS’07)
, 2007
"... Software or hardware data cache prefetching is an efficient way to hide cache miss latency. However effectiveness of the issued prefetches have to be monitored in order to maximize their positive impact while minimizing their negative impact on performance. In previous proposed dynamic frameworks, t ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
(Show Context)
Software or hardware data cache prefetching is an efficient way to hide cache miss latency. However, the effectiveness of the issued prefetches has to be monitored in order to maximize their positive impact while minimizing their negative impact on performance. In previously proposed dynamic frameworks, the monitoring scheme is achieved either with processor performance counters or with specific hardware. In this work, we propose a prefetching strategy that does not use any specific hardware component or processor performance counter. Our dynamic framework is designed to be portable to any modern processor architecture that provides at least a prefetch instruction. The opportunity and effectiveness of prefetching a load are guided simply by the time spent to effectively obtain the data. Every load of a program is monitored periodically; a load may or may not be paired with a dynamically inserted prefetch instruction in a given period, so it is prefetched during exactly those disjoint periods of the program run in which prefetching it proves efficient. Our framework has been implemented for Itanium-2 machines. It involves several dynamic instrumentations of the binary code, whose overhead is limited to only 4% on average. On a large set of benchmarks, our system is able to speed up some programs by 2% to 143%.
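The decision rule described here, pairing a load with a prefetch during exactly those periods in which it is observed to be slow, can be paraphrased as follows. This Python sketch is an assumption-laden paraphrase of the latency-only monitoring idea, not the paper's Itanium-2 binary instrumentation; the threshold value and all names are invented for illustration.

```python
import time

LATENCY_THRESHOLD_NS = 80   # assumed cutoff separating likely hits from misses

class LoadMonitor:
    """Sketch of latency-guided prefetch insertion: once per monitoring
    period, time a monitored load and toggle its prefetch pairing based on
    the measured access time alone (no counters, no extra hardware)."""

    def __init__(self):
        self.prefetched = set()   # load sites currently paired with a prefetch

    def sample(self, site, access):
        t0 = time.perf_counter_ns()
        value = access()                      # execute the monitored load
        elapsed = time.perf_counter_ns() - t0
        if elapsed > LATENCY_THRESHOLD_NS:
            self.prefetched.add(site)         # slow this period: insert a prefetch
        else:
            self.prefetched.discard(site)     # fast again: remove the prefetch
        return value

m = LoadMonitor()
table = list(range(1 << 20))
m.sample("loop1:load", lambda: table[123456])
print(m.prefetched)   # contains "loop1:load" only if the access was slow
```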
Online phase-adaptive data layout selection
- In Proceedings of the European Conference on Object-Oriented Programming
, 2008
"... Abstract. Good data layouts improve cache and TLB performance of object-oriented software, but unfortunately, selecting an optimal data layout a priori is NP-hard. This paper introduces layout auditing, a technique that selects the best among a set of layouts online (while the program is running). L ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
(Show Context)
Good data layouts improve cache and TLB performance of object-oriented software, but unfortunately, selecting an optimal data layout a priori is NP-hard. This paper introduces layout auditing, a technique that selects the best among a set of layouts online (while the program is running). Layout auditing randomly applies different layouts over time and observes their performance. As it becomes confident about which layout performs best, it selects that layout with higher probability. But if a phase shift causes a different layout to perform better, layout auditing learns the new best layout. We implemented our technique in a product Java virtual machine, using copying generational garbage collection to produce different layouts, and tested it on 20 long-running benchmarks and 4 hardware platforms. Given any combination of benchmark and platform, layout auditing consistently performs close to the best layout for that combination, without requiring offline training.
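One simple way to realize the select-observe-adapt loop the abstract describes is a decaying multi-armed bandit over layouts. The sketch below is a hypothetical Python analogue, not the paper's actual statistical machinery; the exploration rate and decay factor are invented parameters.

```python
import random

class LayoutAuditor:
    """Layout auditing as a decaying bandit: keep trying layouts at random,
    favor whichever has been running fastest recently, and let old timings
    decay so a phase shift can promote a different layout."""

    def __init__(self, layouts, explore=0.1, decay=0.9):
        self.layouts = list(layouts)
        self.explore = explore                        # chance of a random trial
        self.decay = decay                            # forgetting rate for old phases
        self.score = {l: 0.0 for l in self.layouts}   # running estimate of -time

    def choose(self):
        if random.random() < self.explore:
            return random.choice(self.layouts)        # keep auditing every layout
        return max(self.layouts, key=self.score.__getitem__)

    def report(self, layout, elapsed):
        # Exponentially weighted average; faster runs raise the score.
        self.score[layout] = self.decay * self.score[layout] \
                             + (1 - self.decay) * -elapsed
```

In this model, each garbage collection would call choose() to pick the copy order and later call report() with the timing of the following interval; confidence grows as the scores separate, and decay lets a new phase win the choice back.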
MATS: Multicore Adaptive Trace Selection
"... Dynamically optimizing programs is worthwhile only if the overhead created by the dynamic optimizer is less than the benefit gained from the optimization. Program trace selection is one of the most important, yet time consuming, components of many dynamic optimizers. The dynamic application of monit ..."
Abstract
-
Cited by 5 (2 self)
- Add to MetaCart
(Show Context)
Dynamically optimizing programs is worthwhile only if the overhead created by the dynamic optimizer is less than the benefit gained from the optimization. Program trace selection is one of the most important, yet time consuming, components of many dynamic optimizers. The dynamic application of monitoring and profiling can often result in an execution slowdown rather than a speedup. Achieving significant performance gain from dynamic optimization has proven to be quite challenging. However, current technological advances, namely multicore architectures, enable us to design new approaches to meet this challenge. Selecting traces in current dynamic optimizers is typically achieved through the use of instrumentation to collect control flow information from a running application. Using instrumentation for runtime analysis requires the trace selection algorithms to be lightweight, and this limits how sophisticated these algorithms can be. This is problematic because the quality of the traces determines the potential benefit that can be gained from optimizing them. In many cases, even when using a lightweight approach, the overhead incurred is more than the benefit of the optimizations. In this paper we exploit the multicore architecture to design an aggressive trace selection approach that produces better traces and does not perturb the running application.
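The division of labor argued for here, with the application core running untouched while a spare core consumes its control flow stream and selects traces, can be modeled with a queue and a monitor thread. This is a hypothetical Python sketch (the hot threshold, trace length cap, and event format are all invented), not the MATS implementation.

```python
import queue, threading
from collections import Counter, defaultdict

events = queue.Queue()   # block-address stream emitted by the running application
HOT = 50                 # assumed execution count before a block heads a trace

def monitor(traces, stop):
    """Runs on a spare core: consumes the event stream and selects traces,
    so the application carries no profiling or selection overhead itself."""
    counts, succ, prev = Counter(), defaultdict(Counter), None
    while not stop.is_set() or not events.empty():
        try:
            block = events.get(timeout=0.1)
        except queue.Empty:
            continue
        counts[block] += 1
        if prev is not None:
            succ[prev][block] += 1         # record observed control flow edges
        prev = block
        if counts[block] == HOT:           # hot head: grow along likely successors
            trace, cur = [block], block
            while succ[cur] and len(trace) < 8:
                cur = succ[cur].most_common(1)[0][0]
                if cur in trace:
                    break
                trace.append(cur)
            traces[block] = trace

traces, stop = {}, threading.Event()
worker = threading.Thread(target=monitor, args=(traces, stop))
worker.start()
for _ in range(100):
    for b in ("A", "B", "C"):              # the application's hot loop
        events.put(b)
stop.set()
worker.join()
print(traces)                              # e.g. {'A': ['A', 'B', 'C'], ...}
```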
When Prefetching Works, When It Doesn’t, and Why
"... In emerging and future high-end processor systems, tolerating increasing cache miss latency and properly managing memory bandwidth will be critical to achieving high performance. Prefetching, in both hardware and software, is among our most important available techniques for doing so; yet, we claim ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
In emerging and future high-end processor systems, tolerating increasing cache miss latency and properly managing memory bandwidth will be critical to achieving high performance. Prefetching, in both hardware and software, is among our most important available techniques for doing so; yet, we claim that prefetching is perhaps also the least well-understood. Thus, the goal of this study is to develop a novel, foundational understanding of both the benefits and limitations of hardware and software prefetching. Our study includes: source code-level analysis, to help in understanding the practical strengths and weaknesses of compiler- and software-based prefetching; a study of the synergistic and antagonistic effects between software and hardware prefetching; and an evaluation of hardware prefetching training policies in the presence of software prefetching requests. We use both simulation and measurement on real systems. We find, for instance, that although there are many opportunities for compilers to prefetch much more aggressively than they currently do, there is also a tangible risk of interference with training existing hardware prefetching mechanisms. Taken together, our observations suggest new research directions for cooperative hardware/software prefetching.
Runtime Parallelization of Legacy Code on a Transactional Memory System
- In Proceedings of the International Conference on High Performance and Embedded Architectures and Compilers
, 2011
"... Thispaperproposesanewruntimeparallelization technique, based on a dynamic optimization framework, to automatically parallelize single-threaded legacy programs. It heavily leverages the optimistic concurrency of transactional memory. This work addresses a number of challenges posed by this type of pa ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
(Show Context)
This paper proposes a new runtime parallelization technique, based on a dynamic optimization framework, to automatically parallelize single-threaded legacy programs. It heavily leverages the optimistic concurrency of transactional memory. This work addresses a number of challenges posed by this type of parallelization and quantifies the trade-offs of some of the design decisions, such as how to select good loops for parallelization, how to partition the iteration space among parallel threads, how to handle loop-carried dependencies, and how to transition from serial to parallel execution and back. The simulated implementation of runtime parallelization shows a potential speedup of 1.36 for the NAS benchmarks and a speedup of 1.34 for the SPEC 2000 CPU floating point benchmarks when using two cores for parallel execution.
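The essential pattern, partitioning a loop's iteration space among threads and running each iteration as a transaction so loop-carried dependences surface as retries rather than races, can be sketched as below. The paper targets transactional memory in a simulator; this toy version-validating software STM and the blocked partitioning are stand-ins, and CPython's GIL means the sketch demonstrates correctness, not speedup.

```python
import threading

class ToySTM:
    """Minimal optimistic STM: buffer writes, validate the read set at
    commit time, and retry on conflict. A software stand-in for the
    transactional memory the parallelized loops rely on."""

    def __init__(self, data):
        self.data = dict(data)
        self.version = {k: 0 for k in self.data}
        self.commit_lock = threading.Lock()

    def run(self, txn):
        while True:                                   # retry until commit succeeds
            reads, writes = {}, {}
            def read(k):
                reads.setdefault(k, self.version[k])  # remember version first seen
                return writes.get(k, self.data[k])
            def write(k, v):
                writes[k] = v
            txn(read, write)
            with self.commit_lock:
                if all(self.version[k] == v for k, v in reads.items()):
                    for k, v in writes.items():
                        self.data[k] = v
                        self.version[k] += 1
                    return                            # committed
            # validation failed: a concurrent commit touched our read set

def parallel_for(stm, iterations, body, workers=2):
    """Blocked partitioning of the iteration space; each iteration runs
    as one transaction, so conflicts cause retries, not data races."""
    chunk = len(iterations) // workers + 1
    def work(part):
        for i in part:
            stm.run(lambda read, write, i=i: body(i, read, write))
    threads = [threading.Thread(target=work,
                                args=(iterations[n * chunk:(n + 1) * chunk],))
               for n in range(workers)]
    for t in threads: t.start()
    for t in threads: t.join()

stm = ToySTM({"sum": 0})
parallel_for(stm, range(1000), lambda i, read, write: write("sum", read("sum") + i))
print(stm.data["sum"])   # 499500, despite every iteration conflicting on "sum"
```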
A Reactive Unobtrusive Prefetcher for Multicore and Manycore Architectures
"... Processor performance continues to out pace memory performance by a large margin. The growing popularity of multicore and manycore architectures further exacerbates this problem. The challenge of keeping the processor(s) fed with data becomes more difficult. One approach for mitigating this gap is t ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
Processor performance continues to outpace memory performance by a large margin, and the growing popularity of multicore and manycore architectures further exacerbates this problem: keeping the processor(s) fed with data becomes ever more difficult. One approach for mitigating this gap is to employ software-based speculative prefetching. Software dynamic prefetchers are able to identify more complex patterns than hardware prefetchers, while retaining the ability to respond to dynamic program behavior. However, modern techniques incur prohibitively high application overheads to detect and to exploit these data access patterns, and do little to accommodate multicore and manycore architectures. In this work, we present an unobtrusive software prefetcher that takes advantage of underutilized cores to improve the performance of neighboring cores. We leverage multicore and manycore design to decouple the tasks of profiling, pattern detection, and prefetching from the application. Our approach takes advantage of cache coherence snooping mechanisms at the ISA level so that cache miss patterns can be observed by a neighboring processor core. With this capability, it is possible to create a reactive solution that complements a hardware prefetcher, while isolating the tasks of pattern recognition and prefetching from altering the code or perturbing the performance of the running application. This allows our prefetching engine to be seamlessly deployed by the OS to any free core to assist neighboring cores, and terminated if those cores are needed. We call our approach unobtrusive reactive prefetching. In this paper, we outline our system, discuss our hardware extensions, and present our unobtrusive speculative hot stream extraction and prefetching algorithms for detecting and mitigating recurring cache miss patterns. Using an aggressive hardware prefetcher as the baseline, our unobtrusive core-hopping prefetcher reduces the number of cache misses by an average of 26%, and in the best case our technique reduces the miss rate by 84%.
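A hypothetical Python model of the decoupled helper follows: the queue stands in for the snooped coherence traffic, a dominant-stride detector stands in for the paper's hot stream extraction, and the window and confidence threshold are invented values. It only illustrates how pattern detection and prefetching can run entirely beside, not inside, the application.

```python
import queue, threading, time
from collections import Counter

misses = queue.Queue()   # neighbor core's miss addresses, as seen via snooping

def helper(prefetch, stop, window=64):
    """Runs on an idle core: watches a neighbor's miss stream, extracts the
    dominant recurring delta (a stand-in for full hot-stream extraction),
    and prefetches ahead of it without ever touching the application."""
    history, deltas = [], Counter()
    while not stop.is_set():
        try:
            addr = misses.get(timeout=0.1)
        except queue.Empty:
            continue
        if history:
            deltas[addr - history[-1]] += 1
        history = (history + [addr])[-window:]
        if deltas:
            stride, hits = deltas.most_common(1)[0]
            if stride and hits >= 4:              # assumed confidence threshold
                for k in range(1, 5):             # run 4 blocks ahead of the stream
                    prefetch(addr + k * stride)

issued, stop = [], threading.Event()
t = threading.Thread(target=helper, args=(issued.append, stop))
t.start()
for a in range(0, 64 * 16, 64):                   # neighbor missing on a 64 B stride
    misses.put(a)
time.sleep(0.5)
stop.set()
t.join()
print(issued[:4])                                 # [320, 384, 448, 512]
```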