A Statistical Multiprocessor Cache Model, 2005
Cited by 16 (2 self)
The introduction of general purpose microprocessors running multiple threads will put a focus on methods and tools helping a programmer to write efficient parallel applications. Such a tool should be fast enough to meet a software developer's need for short turn-around time, but also be accurate and flexible enough to provide trend-correct and intuitive feedback. This paper describes an efficient and flexible approach for modeling the memory system of a multiprocessor, such as those of chip multiprocessors (CMPs). Sparse data is sampled during a multithreaded execution. The data collected consist of the reuse distance and invalidation distribution for a small subset of the memory accesses. Based on the sampled data from a single run, a new mathematical formula is used to estimate the miss rate for a multiprocessor memory hierarchy built from caches of arbitrary size, cache-line size and degree of sharing. The formula further divides the misses into six categories to aid the software developer. The method is evaluated using a large number of commercial and technical multithreaded applications. The result produced by our algorithm fed with sparse sampling data is shown to be consistent with results gathered during traditional architecture simulation.
Discovery of locality-improving refactoring by reuse path analysis
In Proceedings of HPCC. Springer. Lecture Notes in Computer Science, 2006
Cited by 8 (1 self)
Due to the huge speed gaps in the memory hierarchy of modern computer architectures, it is important that programs maintain good data locality. Improving temporal locality implies reducing the distance of data reuses that are far apart. The best existing tools indicate locality bottlenecks by highlighting both the source locations generating the use and the subsequent cache-missing reuse. Even with this knowledge of the bottleneck locations in the source code, it often remains hard to find an effective code refactoring that improves temporal locality, due to the unclear interaction of function calls and loop iterations occurring between use and reuse. The contributions in this paper are two-fold. First, the locality analysis is enhanced to not only pinpoint the cache bottlenecks, but to also suggest code refactorings that may resolve them. The refactorings are found by analyzing the dynamic hierarchy of function calls and loops on the code path between reuses, called reuse paths. Second, reservoir sampling of the reuse paths results in a significant reduction of the execution time and memory requirements during profiling, enabling the analysis of realistic programs. An interactive GUI, called SLO (Suggestions for Locality Optimizations), has been used to explore the most appropriate refactorings in a number of SPEC2000 programs. After refactoring, the execution time of the selected programs was halved, on average.
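The reservoir sampling mentioned in this abstract can be illustrated with a minimal sketch. It assumes the classic "Algorithm R" variant, which keeps a uniform random sample of fixed size k from a stream of unknown length; the abstract does not say which variant the authors use, and the function name `reservoir_sample` and its parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items drawn uniformly at random from an iterable.

    Illustrative sketch of Algorithm R, not the SLO implementation.
    Memory use is O(k) regardless of the stream length.
    """
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

Because only k items are held at any time, a profiler built this way would bound its memory by the sample size rather than by the number of reuse paths observed, which matches the reduction in memory requirements that the abstract reports.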
Accelerating Multicore Reuse Distance Analysis with Sampling and Parallelization
Cited by 8 (0 self)
Reuse distance analysis is a well-established tool for predicting cache performance, driving compiler optimizations, and assisting visualization and manual optimization of programs. Existing reuse distance analysis methods either do not account for the effects of multithreading, or suffer severe performance penalties. This paper presents a sampled, parallelized method of measuring reuse distance profiles for multithreaded programs, modeling private and shared cache configurations. The sampling technique allows it to spend much of its execution in a fast low-overhead mode, and allows the use of a new measurement method since sampled analysis does not need to consider the full state of the reuse stack. This measurement method uses O(1) data structures that may be made thread-private, allowing parallelization to reduce overhead in analysis mode. The performance of the resulting system is analyzed for a diverse set of parallel benchmarks and shown to generate accurate output compared to non-sampled full analysis as well as good results for the common application of locating low-locality code in the benchmarks, all with a performance overhead comparable to the best single-threaded analysis techniques.
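As a rough illustration of the quantity being measured, the sketch below computes reuse distances for a single-threaded address trace. It assumes the common stack-distance definition (the number of distinct addresses touched between two consecutive accesses to the same address) and is a naive full analysis, not the sampled, parallel method the paper presents. The names `reuse_distances` and `lru_miss_count` are hypothetical.

```python
from collections import OrderedDict


def reuse_distances(trace):
    """Return one reuse distance per access; None marks a first (cold) access."""
    stack = OrderedDict()  # keys ordered from least to most recently used
    distances = []
    for addr in trace:
        if addr in stack:
            keys = list(stack)
            # Count the distinct addresses used since the last access to addr.
            distances.append(len(keys) - 1 - keys.index(addr))
            stack.move_to_end(addr)
        else:
            distances.append(None)
            stack[addr] = True
    return distances


def lru_miss_count(distances, capacity):
    """Misses in a fully associative LRU cache holding `capacity` addresses."""
    return sum(1 for d in distances if d is None or d >= capacity)
```

For the trace `a b c a b b`, the distances are `None, None, None, 2, 2, 0`: a cache of three entries sees only the three cold misses, while a cache of two entries misses five times. This per-access list scan costs O(n) per reference, which is the kind of overhead that motivates sampling.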
Path-based reuse distance analysis
In: Compiler Construction. LNCS, 2006
Cited by 1 (0 self)
Profiling can effectively analyze program behavior and provide critical information for feedback-directed or dynamic optimizations. Based on memory profiling, reuse distance analysis has shown much promise in predicting data locality for a program using inputs other than the profiled ones. Both whole-program and instruction-based locality can be accurately predicted by reuse distance analysis. Reuse distance analysis abstracts a cluster of memory references for a particular instruction having similar reuse distance values into a locality pattern. Prior work has shown that a significant number of memory instructions have multiple locality patterns, a property not desirable for many instruction-based memory optimizations. This paper investigates the relationship between locality patterns and execution paths by analyzing reuse distance distribution along each dynamic path to an instruction. Here a path is defined as the program execution trace from the previous access of a memory location to the current access. By differentiating locality patterns with the context of execution paths, the proposed analysis can expose optimization opportunities tailored only to a specific subset of paths leading to an instruction. In this paper, we present an effective method for path-based reuse distance profiling and analysis. We have observed that a significant percentage of the multiple locality patterns for an instruction can be uniquely related to a particular execution path in the program. In addition, we have also investigated the influence of inputs on reuse distance distribution for each path/instruction pair. The experimental results show that the path-based reuse distance is highly predictable, as a function of the data size, for a set of SPEC CPU2000 programs.
Research Feature: Refactoring for Data Locality
Suggestions for locality optimizations (SLO), a cache profiling tool, analyzes runtime reuse paths to find the root causes of poor data locality, and suggests the most promising code optimizations. Refactoring using the hints of the SLO analyzer doubles the average execution speed of several SPEC2000 benchmark programs. Refactoring a program means transforming its internal structure to improve its qualities, such as program organization, execution speed, or readability, without changing its functionality. Although refactoring is most often seen as a way to improve a program’s internal architecture, here we use the term to mean improving the execution speed. The main bottleneck often is not computation time, but rather memory access delay: Processors can execute hundreds of instructions
Towards Increased Power Efficiency in Low End Embedded Processors: Can Cache Help?
Embedded processors are often characterized by limited resources and are optimized for specific applications. A rising number of battery-powered applications has driven a trend towards increased energy efficiency, sometimes even at the expense of performance. In particular, low-power, low-specification embedded processors lack on-chip cache memories, mainly to avoid the higher energy overhead a cache structure would impose. This paper proposes energy and throughput models that can be used to analyze the energy and time overhead a particular application incurs when a data cache is introduced into a previously non-cached system. Alternatively, the models can be used in reconfigurable systems for cache overhead analysis.
Towards Architecture Independent Metrics for Multicore Performance Analysis
The prevalence of multicore architectures has made the performance analysis of multithreaded applications an intriguing area of inquiry. An understanding of locality effects and communication behavior can provide programmers with valuable information about performance bottlenecks and opportunities for optimization. Unfortunately, most performance analyses are architecture dependent, and hence insights gleaned from an application’s behavior on one platform may not apply when the application is run on another. In this position paper, we argue that what is needed are architecture independent metrics that characterize the behavior of an application in a system-agnostic manner. Such metrics will allow a program’s performance to be analyzed across a range of architectures without incurring the overhead of repeated profiling and analysis. We propose two specific analyses: multicore-aware reuse distance, which captures the locality properties of an application, and communication analysis, which exposes the structure of communication in an application. We also discuss a number of applications of these analyses, in the domains of optimization, code restructuring and performance modeling.

