Statcache: A probabilistic approach to efficient and accurate data locality analysis
- In Proceedings of the International Symposium on Performance Analysis of Systems and Software
, 2004
Cited by 59 (7 self)
The widening memory gap reduces performance of applications with poor data locality. Therefore, there is a need for methods to analyze data locality and help application optimization. In this paper we present StatCache, a novel sampling-based method for performing data-locality analysis on realistic workloads. StatCache is based on a probabilistic model of the cache, rather than a functional cache simulator. It uses statistics from a single run to accurately estimate miss ratios of fully-associative caches of arbitrary sizes and generate working-set graphs. We evaluate StatCache using the SPEC CPU2000 benchmarks and show that StatCache gives accurate results with a sampling rate as low as �. We also provide a proof-of-concept implementation, and discuss potentially very fast implementation alternatives.
SIGMA: A Simulator Infrastructure to Guide Memory Analysis
- In Supercomputing
, 2002
Cited by 48 (4 self)
In this paper we present SIGMA (Simulation Infrastructure to Guide Memory Analysis), a new data collection framework and family of cache analysis tools. The SIGMA environment provides detailed cache information by gathering memory reference data using software-based instrumentation.
An Empirical Performance Evaluation of Scalable Scientific Applications
- in Proceedings of the 2002 ACM/IEEE Conference on Supercomputing
, 2002
Cited by 29 (2 self)
We investigate the scalability, architectural requirements, and performance characteristics of eight scalable scientific applications. Our analysis is driven by empirical measurements using statistical and tracing instrumentation for both communication and computation. Based on these measurements, we refine our analysis into precise explanations of the factors that influence performance and scalability for each application; we distill these factors into common traits and overall recommendations for both users and designers of scalable platforms. Our experiments demonstrate that some traits, such as improvements in the scaling and performance of MPI's collective operations, will benefit most applications. We also find specific characteristics of some applications that limit performance. For example, one application's intensive use of a 64-bit, floating-point divide instruction, which has high latency and is not pipelined on the POWER3, limits the performance of the application's primary computation.
METRIC: Tracking Down Inefficiencies in the Memory Hierarchy via Binary Rewriting
, 2003
Cited by 29 (15 self)
In this paper, we present METRIC, an environment for determining memory inefficiencies by examining data traces. METRIC is designed to alter the performance behavior of applications that are mostly constrained by their latency to resolve memory references. We make four primary contributions in this paper. First, we present methods to extract partial data traces from running applications by observing their memory behavior via dynamic binary rewriting. Second, we present a methodology to represent partial data traces in constant space for regular references through a novel technique for online compression of reference streams. Third, we employ offline cache simulation to derive indications about memory performance bottlenecks from partial data traces. By exploiting summarized memory metrics, by-reference metrics as well as cache evictor information, we can pin-point the sources of performance problems. Fourth, we demonstrate the ability to derive opportunities for optimizations and assess their benefits in several experiments resulting in up to 40% lower miss ratios.
Memory Profiling using Hardware Counters
- In Supercomputing Conference (SC
, 2003
Cited by 22 (1 self)
Although memory performance is often a limiting factor in application performance, most tools only show performance data relating to the instructions in the program, not to its data. In this paper, we describe a technique for directly measuring the memory profile of an application. We describe the tools and their user model, and then discuss a particular code, the MCF benchmark from SPEC CPU 2000. We show performance data for the data structures and elements, and discuss the use of the data to improve program performance. Finally, we discuss extensions to the work to provide feedback to the compiler for prefetching and to generate additional reports from the data.
Accuracy of performance monitoring hardware
- In Proc. LACSI Symposium, Santa Fe
, 2002
Cited by 17 (5 self)
Performance monitoring hardware is available on most modern microprocessors in the form of hardware counters and other registers that record data about processor events. This hardware may be used in counting mode, in which aggregate event counts are accumulated, and/or in sampling mode, in which time-based or event-based sampling is used to collect profiling data. This paper discusses uses of these two modes and considers the accuracy issues raised by each. Implications for the PAPI cross-platform hardware counter interface and the application programmer are also discussed.
Iterative Compilation and Performance Prediction for Numerical Applications
, 2004
Cited by 17 (10 self)
As the current rate of improvement in processor performance far exceeds the rate of memory performance, memory latency is the dominant overhead in many performance-critical applications. In many cases, automatic compiler-based approaches to improving memory performance are limited and programmers frequently resort to manual optimisation techniques. However, this process is tedious and time-consuming. Furthermore, a diverse range of rapidly evolving hardware makes the optimisation process even more complex. It is often hard to predict the potential benefits of different optimisations, and there are no simple criteria for when to stop optimising, i.e. when optimal memory performance has been achieved or sufficiently approached. This thesis presents a platform-independent optimisation approach for numerical applications based on iterative feedback-directed program restructuring, using a new, reasonably fast and accurate performance prediction technique for guiding optimisations. New strategies for searching the optimisation space, by means of
Sip: Performance tuning through source code interdependence
- In Euro-Par’02
, 2002
Cited by 15 (3 self)
The gap between CPU peak performance and achieved application performance widens as CPU complexity, as well as the gap between CPU cycle time and DRAM access time, increases. While advanced compilers can perform many optimizations to better utilize the cache system, the application programmer is still required to do some of the optimizations needed for efficient execution. Therefore, profiling should be performed on optimized binary code and performance problems reported to the programmer in an intuitive way. Existing performance tools do not have adequate functionality to address these needs. Here we introduce source interdependence profiling, SIP, as a paradigm to collect and present performance data to the programmer. SIP identifies the performance problems that remain after the compiler optimization and gives intuitive hints at the source-code level as to how they can be avoided. Instead of just collecting information about the events directly caused by each source-code statement, SIP also presents data about events from some interdependent statements of source code. A first SIP prototype tool has been implemented. It supports both C and Fortran programs. We describe how the tool was used to improve the performance of the SPEC CPU2000 183.equake application by 59 percent.
SMP System Interconnect Instrumentation for Performance Analysis
, 2002
Cited by 14 (0 self)
The system interconnect is often the performance bottleneck in SMP computers. Although modern SMPs include event counters on processors and interconnects, these provide limited information about the interaction of processors vying for shared resources. Additionally, transaction sources and addresses are not readily available, making analysis of access patterns and data locality difficult. Enhanced system interconnect instrumentation is required to extract this information. This paper describes instrumentation implemented for monitoring the system interconnect on Sun Fire servers. The instrumentation supports sophisticated programmable filtering of event counters, allowing us to construct histograms of system interconnect activity, and a FIFO to capture trace sequences. Our implementation results in a very small hardware footprint, making it appropriate for inclusion in commodity hardware.
A comparison of counting and sampling modes of using performance monitoring hardware
- In International Conference on Computational Science (ICCS 2002)
, 2002
Cited by 13 (2 self)
Performance monitoring hardware is available on most modern microprocessors in the form of hardware counters and other registers that record data about processor events. This hardware may be used in counting mode, in which aggregate event counts are accumulated, and/or in sampling mode, in which time-based or event-based sampling is used to collect profiling data. This paper discusses uses of these two modes and considers the issues of efficiency and accuracy raised by each. Implications for the PAPI cross-platform hardware counter interface are also discussed.