Results 1 - 10 of 53
A comparison of empirical and model-driven optimization
- In ACM Symp. on Programming Language Design and Implementation (PLDI'03), 2003
"... Empirical program optimizers estimate the values of key optimization parameters by generating different program versions and running them on the actual hardware to determine which values give the best performance. In contrast, conventional compilers use models of programs and machines to choose thes ..."
Abstract
-
Cited by 99 (12 self)
Empirical program optimizers estimate the values of key optimization parameters by generating different program versions and running them on the actual hardware to determine which values give the best performance. In contrast, conventional compilers use models of programs and machines to choose these parameters. It is widely believed that empirical optimization is more effective than model-driven optimization, but few quantitative comparisons have been done to date. To make such a comparison, we replaced the empirical optimization engine in ATLAS (a system for generating dense numerical linear algebra libraries) with a model-based optimization engine that used detailed models to estimate values for optimization parameters, and then measured the relative performance of the two systems on three different hardware platforms. Our experiments show that although model-based optimization can be surprisingly effective, useful models may have to consider not only hardware parameters but also the ability of back-end compilers to exploit hardware resources.
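Where the abstract contrasts the two approaches, the sketch below shows them for a single parameter, the blocking (tile) size of a matrix multiply: an empirical search that times candidate tile sizes on the machine versus a simple analytical choice from an assumed L1 capacity. The candidate list, the three-tiles-in-L1 capacity model, and all function names are illustrative assumptions, not the ATLAS engine or the paper's detailed models.

```python
import time
import random

def matmul_tiled(A, B, n, nb):
    """Blocked (tiled) n x n matrix multiply with tile size nb."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for kk in range(0, n, nb):
            for jj in range(0, n, nb):
                for i in range(ii, min(ii + nb, n)):
                    for k in range(kk, min(kk + nb, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + nb, n)):
                            C[i][j] += a * B[k][j]
    return C

def empirical_nb(A, B, n, candidates):
    """Empirical search: time every candidate tile size, keep the fastest."""
    best, best_t = None, float("inf")
    for nb in candidates:
        t0 = time.perf_counter()
        matmul_tiled(A, B, n, nb)
        dt = time.perf_counter() - t0
        if dt < best_t:
            best, best_t = nb, dt
    return best

def model_nb(l1_bytes=32 * 1024, elem_bytes=8):
    """Model-driven choice: pick nb so roughly three tiles of doubles fit
    in L1 (a simplified capacity model, not the paper's detailed models)."""
    return int((l1_bytes / (3 * elem_bytes)) ** 0.5)

if __name__ == "__main__":
    n = 96
    A = [[random.random() for _ in range(n)] for _ in range(n)]
    B = [[random.random() for _ in range(n)] for _ in range(n)]
    print("empirical nb:", empirical_nb(A, B, n, [8, 16, 32, 64]))
    print("model nb    :", model_nb())
```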
A Statistical Multiprocessor Cache Model
- 2005
"... The introduction of general purpose microprocessors running multiple threads will put a focus on methods and tools helping a programmer to write efficient parallel applications. Such a tool should be fast enough to meet a software developer's need for short turn-around time, but also be accurat ..."
Abstract
-
Cited by 29 (3 self)
The introduction of general-purpose microprocessors running multiple threads will put a focus on methods and tools helping a programmer to write efficient parallel applications. Such a tool should be fast enough to meet a software developer's need for short turn-around time, but also be accurate and flexible enough to provide trend-correct and intuitive feedback. This paper describes an efficient and flexible approach for modeling the memory system of a multiprocessor, such as those of chip multiprocessors (CMPs). Sparse data is sampled during a multithreaded execution. The data collected consist of the reuse distance and invalidation distribution for a small subset of the memory accesses. Based on the sampled data from a single run, a new mathematical formula is used to estimate the miss rate for a multiprocessor memory hierarchy built from caches of arbitrary size, cache-line size, and degree of sharing. The formula further divides the misses into six categories to aid the software developer. The method is evaluated using a large number of commercial and technical multithreaded applications. The results produced by our algorithm, fed with sparse sampling data, are shown to be consistent with results gathered during traditional architecture simulation.
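As a rough illustration of estimating miss rate from sampled data, the sketch below treats each sample as a (reuse distance, invalidated) pair and counts an access as a miss when its distance exceeds the cache size in lines or when another core invalidated the line. This is a deliberately simplified stand-in: it models only a fully associative LRU cache and ignores the paper's six miss categories, cache-line size, and sharing degree.

```python
def estimate_miss_ratio(samples, cache_lines):
    """Estimate the miss ratio of a fully associative LRU cache from a
    sparse sample of memory accesses.

    samples: list of (reuse_distance, invalidated) pairs, where
      reuse_distance is the number of distinct cache lines touched since
      the previous access to the same line (None for a first access),
      and invalidated says whether another core invalidated the line
      in between (a coherence miss regardless of distance).
    cache_lines: cache capacity in lines.
    """
    misses = 0
    for dist, invalidated in samples:
        if dist is None or invalidated or dist >= cache_lines:
            misses += 1
    return misses / len(samples)

# A toy sample: (reuse distance, invalidated-by-another-core?)
sample = [(3, False), (None, False), (700, False), (12, True), (40, False)]
for lines in (64, 256, 1024):
    print(lines, "lines ->", estimate_miss_ratio(sample, lines))
```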
Program Locality Analysis Using Reuse Distance
- 2009
"... On modern computer systems, the memory performance of an application depends on its locality. For a single execution, locality-correlated measures like average miss rate or working-set size have long been analyzed using reuse distance—the number of distinct locations accessed between consecutive acc ..."
Abstract
-
Cited by 27 (12 self)
On modern computer systems, the memory performance of an application depends on its locality. For a single execution, locality-correlated measures like average miss rate or working-set size have long been analyzed using reuse distance—the number of distinct locations accessed between consecutive accesses to a given location. This article addresses the analysis problem at the program level, where the size of data and the locality of execution may change significantly depending on the input. The article presents two techniques that predict how the locality of a program changes with its input. The first is approximate reuse-distance measurement, which is asymptotically faster than exact methods while providing a guaranteed precision. The second is statistical prediction of locality in all executions of a program based on the analysis of a few executions. The prediction process has three steps: dividing data accesses into groups, finding the access patterns in each group, and building parameterized models. The resulting prediction may be used on-line with the help of distance-based sampling. When evaluated on fifteen benchmark applications, the new techniques predicted program locality with good accuracy, even for test executions that are orders of magnitude larger than the training executions. The two techniques are among the first to enable quantitative analysis of whole-program locality and …
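The definition quoted above, the number of distinct locations accessed between consecutive accesses to the same location, can be computed exactly with a naive pass over an address trace, as in the sketch below. This is only the exact baseline; the paper's contribution is an approximate measurement that is asymptotically faster, plus a statistical cross-input predictor, neither of which is reproduced here.

```python
def reuse_distances(trace):
    """Exact reuse distances for an address trace.

    The reuse distance of an access is the number of distinct addresses
    touched since the previous access to the same address; first
    accesses get distance infinity.  This is the naive method; the
    paper's approximate analysis is asymptotically faster.
    """
    last_pos = {}               # address -> index of its previous access
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_pos:
            window = trace[last_pos[addr] + 1 : i]
            distances.append(len(set(window)))
        else:
            distances.append(float("inf"))
        last_pos[addr] = i
    return distances

print(reuse_distances(list("abcaba")))
# trace a b c a b a -> [inf, inf, inf, 2, 2, 1]
```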
Miss rate prediction across program inputs and cache configurations
- IEEE Transactions on Computers, 2007
"... Improving cache performance requires understanding cache behavior. However, measuring cache performance for one or two data input sets provides little insight into how cache behavior varies across all data input sets and all cache configurations. This paper uses locality analysis to generate a para ..."
Abstract
-
Cited by 23 (14 self)
Improving cache performance requires understanding cache behavior. However, measuring cache performance for one or two data input sets provides little insight into how cache behavior varies across all data input sets and all cache configurations. This paper uses locality analysis to generate a parameterized model of program cache behavior. Given a cache size and associativity, this model predicts the miss rate for arbitrary data input set sizes. This model also identifies critical data input sizes where cache behavior exhibits marked changes. Experiments show this technique is within 2 percent of the hit rate for set-associative caches on a set of floating-point and integer programs using array- and pointer-based data structures. Building on the new model, this paper presents an interactive visualization tool that uses a three-dimensional plot to show how the miss rate changes across program data sizes and cache sizes, and demonstrates its use in evaluating compiler transformations. Other uses of this visualization tool include assisting machine and benchmark-set design. The tool can be accessed on the Web at …
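A minimal sketch of the cross-input idea follows, under the crude assumption that every reuse distance grows linearly with the data-input size; the training histogram is rescaled and re-thresholded against the cache size. The paper instead groups accesses and fits a pattern per group, so the linear-scaling assumption and the helper names below are illustrative only.

```python
def predict_miss_ratio(train_distances, train_size, new_size, cache_lines):
    """Predict the LRU miss ratio at a new input size.

    Crude stand-in for the paper's model: assume every reuse distance
    scales linearly with the data-input size (real patterns are fitted
    per access group and may be constant, linear, sub-linear, ...).
    """
    scale = new_size / train_size
    misses = sum(1 for d in train_distances
                 if d == float("inf") or d * scale >= cache_lines)
    return misses / len(train_distances)

# Distances measured on a training input of size 1_000,
# used to predict behavior on an input 100x larger.
train = [4, 16, 64, 256, 1024, float("inf")]
for cache in (512, 4096, 65536):
    print(cache, "lines ->", predict_miss_ratio(train, 1_000, 100_000, cache))
```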
Data access partitioning for fine-grain parallelism on multicore architectures
- In Proceedings of the 40th Annual IEEE/ACM Symposium on Microarchitecture, 2007
"... The recent design shift towards multicore processors has spawned a significant amount of research in the area of program parallelization. The future abundance of cores on a single chip requires programmer and compiler intervention to increase the amount of parallel work possible. Much of the recent ..."
Abstract
-
Cited by 22 (0 self)
The recent design shift towards multicore processors has spawned a significant amount of research in the area of program parallelization. The future abundance of cores on a single chip requires programmer and compiler intervention to increase the amount of parallel work possible. Much of the recent work has fallen into the areas of coarse-grain parallelization: new programming models and different ways to exploit threads and data-level parallelism. This work focuses on a complementary direction, improving performance through automated fine-grain parallelization. The main difficulty in achieving a performance benefit from fine-grain parallelism is the distribution of data memory accesses across the data caches of each core. Poor choices in the placement of data accesses can lead to increased memory stalls and low resource utilization. We propose a profile-guided method for partitioning memory accesses across distributed data caches. First, a profile determines affinity relationships between memory accesses and working set characteristics of individual memory operations in the program. Next, a program-level partitioning of the memory operations is performed to divide the memory accesses across the data caches. As a result, the data accesses are proactively dispersed to reduce memory stalls and improve computation parallelization. A final detailed partitioning of the computation instructions is performed with knowledge of the cache location of their associated data. Overall, our data partitioning reduces stall cycles by up to 51% versus data-incognizant partitioning, and achieves an average speedup of 30% over a single-core processor.
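To make the partitioning step concrete, here is a hedged sketch that splits profiled memory operations across two data caches with a greedy heuristic over pairwise affinity (co-access) counts and per-operation working-set estimates. The heuristic, the data layout, and the capacity budget are assumptions for illustration; they do not reproduce the paper's partitioning algorithm.

```python
def partition_memory_ops(ops, affinity, footprint, cache_capacity):
    """Greedily split memory operations across two data caches.

    ops:       list of memory-operation ids from the profile
    affinity:  dict (op_a, op_b) -> co-access count
    footprint: dict op -> estimated working-set size of that op
    cache_capacity: per-cache capacity budget (same units as footprint)

    Heuristic: place each op (heaviest footprint first) on the cache
    where it has the most affinity with already-placed ops, unless that
    cache's budget is exhausted.  Illustrative only.
    """
    caches = [set(), set()]
    used = [0, 0]

    def score(op, placed):
        return sum(affinity.get((op, o), 0) + affinity.get((o, op), 0)
                   for o in placed)

    for op in sorted(ops, key=lambda o: -footprint[o]):
        ranked = sorted((0, 1), key=lambda c: -score(op, caches[c]))
        target = next((c for c in ranked
                       if used[c] + footprint[op] <= cache_capacity),
                      ranked[0])
        caches[target].add(op)
        used[target] += footprint[op]
    return caches

ops = ["ld1", "ld2", "st1", "ld3"]
aff = {("ld1", "ld2"): 90, ("ld1", "st1"): 5, ("ld2", "ld3"): 3}
fp = {"ld1": 16, "ld2": 16, "st1": 48, "ld3": 8}
print(partition_memory_ops(ops, aff, fp, cache_capacity=64))
```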
High level cache simulation for heterogeneous multiprocessors
- In Proceedings of the 41st Annual Conference on Design Automation, 2004
"... As multiprocessor systems-on-chip become a reality, perfor-mance modeling becomes a challenge. To quickly evaluate many architectures, some type of high-level simulation is re-quired, including high-level cache simulation. We propose to perform this cache simulation by defining a metric to repre-sen ..."
Abstract
-
Cited by 14 (0 self)
As multiprocessor systems-on-chip become a reality, performance modeling becomes a challenge. To quickly evaluate many architectures, some type of high-level simulation is required, including high-level cache simulation. We propose to perform this cache simulation by defining a metric to represent memory behavior independently of cache structure and back-annotating this into the original application. While the annotation phase is complex, requiring time comparable to normal address-trace-based simulation, it need only be performed once per application set and thus enables simulation to be sped up by a factor of 20 to 50 over trace-based simulation. This is important for embedded systems, as software is often evaluated against many input sets and many architectures. Our results show the technique is accurate to within 20% of miss rate for uniprocessors and was able to reduce the die area of a multiprocessor chip by a projected 14% over a naive design by accurately sizing caches for each processor.
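The profile-once, evaluate-many idea can be sketched with a reuse-distance histogram standing in for the paper's cache-structure-independent metric: one expensive pass summarizes the trace, after which any number of cache sizes can be evaluated cheaply. The choice of histogram and the fully associative LRU assumption are mine, not the paper's.

```python
from collections import Counter

def profile_once(trace):
    """Expensive pass: summarize memory behavior independently of any
    particular cache, here as a histogram of LRU stack distances
    (a stand-in for the paper's own structure-independent metric)."""
    stack, hist = [], Counter()        # stack[0] is the most recent address
    for addr in trace:
        if addr in stack:
            hist[stack.index(addr)] += 1   # distinct addresses touched since
            stack.remove(addr)
        else:
            hist[float("inf")] += 1        # cold access
        stack.insert(0, addr)
    return hist, len(trace)

def miss_ratio(hist, n_accesses, cache_lines):
    """Cheap pass, repeatable for every candidate cache size."""
    return sum(c for d, c in hist.items() if d >= cache_lines) / n_accesses

trace = ["a", "b", "a", "c", "b", "a", "d", "a"]
hist, n = profile_once(trace)
for size in (1, 2, 3, 4):
    print(size, "lines ->", miss_ratio(hist, n, size))
```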
Code and data transformations for improving shared cache performance on SMT processors
- In Proceedings of the 5th International Symposium on High Performance Computing, 2002
"... Abstract. Simultaneous multithreaded processors use shared on-chip caches, which yield better cost-performance ratios. Sharing a cache between simultaneously executing threads causes excessive conflict misses. This paper proposes software solutions for dynamically partitioning the shared cache of an ..."
Abstract
-
Cited by 13 (1 self)
Simultaneous multithreaded processors use shared on-chip caches, which yield better cost-performance ratios. Sharing a cache between simultaneously executing threads causes excessive conflict misses. This paper proposes software solutions for dynamically partitioning the shared cache of an SMT processor, via the use of three methods originating in the optimizing compilers literature: dynamic tiling, copying, and block data layouts. The paper presents an algorithm that combines these transformations and two runtime mechanisms to detect cache sharing between threads and react to it at runtime. The first mechanism uses minimal kernel extensions and the second mechanism uses information collected from the processor hardware counters. Our experimental results show that for regular, perfect loop nests, these transformations are very effective in coping with shared caches. When the caches are shared between threads from the same address space, performance is improved by 16–29% on average. Similar improvements are observed when the caches are shared between threads from different address spaces. To our knowledge, this is the first work to present an all-software approach for managing shared caches on SMT processors. It is also one of the first performance and program optimization studies conducted on a commercial SMT-based multiprocessor using Intel's Hyper-Threading technology.
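A hedged sketch of the dynamic-tiling idea: choose the tile size from the cache share a thread can expect, and shrink it when a runtime signal says the cache is being shared. The capacity model and the detect_sharing stub below are placeholders; the paper's mechanisms use kernel extensions or hardware performance counters.

```python
def tile_size(l2_bytes, elem_bytes, sharers):
    """Tile so roughly three tiles of one thread fit in its cache share
    (an assumed capacity model, not the paper's algorithm)."""
    share = l2_bytes // max(1, sharers)
    return max(8, int((share / (3 * elem_bytes)) ** 0.5))

def tiled_sum(matrix, n, nb):
    """A trivially tiled traversal; the tiling shape is what matters."""
    total = 0.0
    for ii in range(0, n, nb):
        for jj in range(0, n, nb):
            for i in range(ii, min(ii + nb, n)):
                for j in range(jj, min(jj + nb, n)):
                    total += matrix[i][j]
    return total

def detect_sharing():
    """Stand-in for the runtime detection mechanisms (kernel extension
    or hardware counters); here we just pretend two threads share."""
    return 2

n = 256
m = [[1.0] * n for _ in range(n)]
nb = tile_size(512 * 1024, 8, detect_sharing())
print("tile size:", nb, "sum:", tiled_sum(m, n, nb))
```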
Predicting locality phases for dynamic memory optimization
- J. Parallel Distrib. Comput. 67 (2007) 783–796
"... ..."
(Show Context)
HOTL: A Higher Order Theory of Locality
"... The locality metrics are many, for example, miss ratio to test performance, data footprint to manage cache sharing, and reuse distance to analyze and optimize a program. It is unclear how different metrics are related, whether one subsumes another, and what combination may represent locality complet ..."
Abstract
-
Cited by 8 (4 self)
There are many locality metrics: for example, miss ratio to test performance, data footprint to manage cache sharing, and reuse distance to analyze and optimize a program. It is unclear how different metrics are related, whether one subsumes another, and what combination may represent locality completely. This paper first derives a set of formulas to convert between five locality metrics and gives the condition for correctness. The transformation is analogous to differentiation and integration. As a result, these metrics can be assigned an order and organized into a hierarchy. Using the new theory, the paper then develops two techniques: one measures the locality in real time without special hardware support, and the other predicts multicore cache interference without parallel testing. The paper evaluates them using sequential and parallel programs as well as a parallel mix of sequential programs.
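The differentiation analogy can be illustrated by converting a footprint function fp(w), the average amount of distinct data touched in a window of w accesses, into a miss ratio by taking its discrete derivative at the window whose footprint fills the cache. The form used below, mr(c) ≈ fp(w+1) − fp(w) where fp(w) = c, is my reading of the higher-order conversion and holds only under the paper's correctness condition; the toy curve is made up.

```python
def miss_ratio_from_footprint(fp, cache_size):
    """Differentiate a footprint function to get a miss ratio.

    fp: list where fp[w] is the average footprint (distinct data) of a
        window of w accesses, with fp[0] == 0 and fp concave/monotone.
    The miss ratio of a fully associative cache of size c is taken as
    the discrete derivative fp[w+1] - fp[w] at the smallest window w
    whose footprint fills the cache (an assumed form of the conversion,
    valid only under the paper's correctness condition).
    """
    for w in range(len(fp) - 1):
        if fp[w] >= cache_size:
            return fp[w + 1] - fp[w]
    return 0.0          # cache larger than the program's footprint

# A toy concave footprint curve: grows quickly, then saturates.
fp = [0, 1.0, 1.9, 2.7, 3.4, 4.0, 4.5, 4.9, 5.2, 5.4, 5.5]
for c in (2, 4, 5):
    print("cache", c, "-> miss ratio", round(miss_ratio_from_footprint(fp, c), 2))
```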
On the theory and potential of LRU-MRU collaborative cache management
- In ISMM, 2011
"... The goal of cache management is to maximize data reuse. Collaborative caching provides an interface for software to communicate access information to hardware. In theory, it can obtain optimal cache performance. In this paper, we study a collaborative caching system that allows a program to choose d ..."
Abstract
-
Cited by 7 (4 self)
The goal of cache management is to maximize data reuse. Collaborative caching provides an interface for software to communicate access information to hardware. In theory, it can obtain optimal cache performance. In this paper, we study a collaborative caching system that allows a program to choose different caching methods for its data. As an interface, it may be used in arbitrary ways: sometimes optimally, but most often suboptimally and even counterproductively. We develop a theoretical foundation for collaborative caches to show the inclusion principle and the existence of a distance metric we call LRU-MRU stack distance. The new stack distance is important for program analysis and transformation to target a hierarchical collaborative cache system rather than a single cache configuration. We use 10 benchmark programs to show that optimal caching may reduce the average miss ratio by 24%, and a simple feedback-driven compilation technique can utilize the collaborative cache to realize 50% of the optimal improvement.
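A toy sketch of collaborative caching follows: each access carries a software hint, 'lru' for normal insertion or 'mru' to demote the block to the next-victim position, so streaming data stops polluting the cache. This simulator illustrates the interface only; it is not the paper's hardware design and it does not compute the LRU-MRU stack distance defined there.

```python
from collections import OrderedDict

class CollaborativeCache:
    """Toy fully associative cache with per-access software hints:
    'lru' -> normal LRU insertion/promotion (most-recently-used end),
    'mru' -> demote to the least-recently-used end, i.e. the next victim."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()          # front = victim end

    def access(self, addr, hint="lru"):
        hit = addr in self.lines
        if hit:
            del self.lines[addr]
        elif len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the victim-end line
        self.lines[addr] = True             # insert at the protected end
        if hint == "mru":
            self.lines.move_to_end(addr, last=False)  # make it the next victim
        return hit

def run(trace, hints_on):
    cache = CollaborativeCache(capacity=3)
    return sum(cache.access(a, h if hints_on else "lru") for a, h in trace)

# 'a' and 'b' are reused; s1..s4 stream through once and are tagged 'mru'.
trace = [("a", "lru"), ("b", "lru"), ("s1", "mru"), ("a", "lru"),
         ("s2", "mru"), ("b", "lru"), ("s3", "mru"), ("a", "lru"),
         ("s4", "mru"), ("b", "lru")]
print("hits with hints    :", run(trace, hints_on=True))   # 4
print("hits with plain LRU:", run(trace, hints_on=False))  # 1
```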