Performance and environment monitoring for continuous program optimization
Cited by 5 (0 self)
Our research is aimed at characterizing, understanding, and exploiting the interactions between hardware and software to improve system performance. We have developed a paradigm for continuous program optimization (CPO) that assists in and automates the challenging task of performance tuning, and we have implemented an initial prototype of this paradigm. At the core of our implementation is a performance- and environment-monitoring (PEM) component that vertically integrates performance events from various layers in the execution stack. CPO agents use the data provided by PEM to detect, diagnose, and alleviate performance problems on existing systems. In addition, CPO can be used to improve future architecture designs by analyzing PEM data collected on a whole-system simulator while varying architectural characteristics. In this paper, we present the CPO paradigm, describe an initial implementation that includes PEM as a component, and discuss two CPO clients.
Soft-OLP: Improving Hardware Cache Performance Through Software-Controlled Object-Level Partitioning
Cited by 4 (2 self)
Performance degradation of memory-intensive programs caused by the LRU policy’s inability to handle weak-locality data accesses in the last-level cache is increasingly serious for two reasons. First, the last-level cache remains in the CPU’s critical path, where only simple management mechanisms, such as LRU, can be used, precluding some sophisticated hardware mechanisms to address the problem. Second, the commonly used shared cache structure of multi-core processors has made this critical path even more performance-sensitive due to intensive inter-thread contention for shared cache resources. Researchers have recently made efforts to address the problem with the LRU policy by partitioning the cache using hardware or OS facilities guided by run-time locality information. Such approaches often rely on special hardware support or lack enough accuracy. In contrast, for a large
Prediction-based Power-Performance Adaptation of Multithreaded Scientific Codes
Cited by 3 (1 self)
Computing has recently reached an inflection point with the introduction of multi-core processors. On-chip thread-level parallelism is doubling approximately every other year. Concurrency lends itself naturally to allowing a program to trade performance for power savings by regulating the number of active cores; however, in several domains users are unwilling to sacrifice performance to save power. We present a prediction model for identifying energy-efficient operating points of concurrency in well-tuned multithreaded scientific applications, and a runtime system that uses live program analysis to optimize applications dynamically. We describe a dynamic, phase-aware performance prediction model that combines multivariate regression techniques with runtime analysis of data collected from hardware event counters to locate optimal operating points of concurrency. Using our model, we develop a prediction-driven, phase-aware runtime optimization scheme that throttles concurrency so that power consumption can be reduced and performance can be set at the knee of the scalability curve of each program phase. The use of prediction reduces the overhead of searching the optimization space while achieving near-optimal performance and power savings. A thorough evaluation of our approach shows a reduction in power consumption of 10.8% simultaneous with an improvement in performance of 17.9%, resulting in energy savings of 26.7%. Index Terms: modeling and prediction, application-aware adaptation, energy-aware systems
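As a rough check on how the three reported percentages relate, the sketch below multiplies the remaining power fraction by the remaining time fraction. It assumes that energy is average power times execution time and that the 17.9% performance improvement is a 17.9% reduction in execution time; both are assumptions of this example, not statements from the abstract, and `energy_savings` is a hypothetical helper name.

```python
def energy_savings(power_reduction: float, time_reduction: float) -> float:
    """Fraction of energy saved, assuming energy = average power * execution time.

    Both arguments are fractions in [0, 1]. Treating the performance gain as a
    reduction in execution time is an assumption of this sketch.
    """
    remaining_power = 1.0 - power_reduction
    remaining_time = 1.0 - time_reduction
    return 1.0 - remaining_power * remaining_time


if __name__ == "__main__":
    # Figures quoted in the abstract: 10.8% less power, 17.9% better performance.
    print(f"{energy_savings(0.108, 0.179):.1%}")
```

Under these assumptions the product comes out at roughly 26.8%, close to the 26.7% quoted above; the small gap is plausibly due to how the paper averages across benchmarks.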
Optimizing Communication Overlap for High-Speed Networks
Cited by 2 (1 self)
Modern networking hardware supports true non-blocking communication, and effective exploitation of this feature can lead to significant application performance improvements. We believe that algorithm design and optimization techniques that hide latency by taking advantage of communication overlap will facilitate obtaining good parallel efficiency and performance on highly concurrent contemporary systems. Finding an optimal, performance-portable implementation when using non-blocking communication primitives is non-trivial and intimidating to many application developers. In this paper we present a methodology for discovering optimal message sizes and schedules for a variety of application scenarios. This is achieved by combining an analytic model, which takes into account the variability of performance parameters with system scale and load, with heuristics designed to avoid network congestion. We perform experiments to understand network behavior in the presence of overlap and purge the optimization space for any system based on either resource or implementation constraints. Our approach is able to choose optimal or nearly optimal implementation parameters for a variety of highly non-trivial scenarios and networks with different performance characteristics. Implementations based on parameters chosen by the models are able to hide over 90% of communication overhead in all cases.
A component model of spatial locality
- In Proceedings of the International Symposium on Memory Management, 2009
Cited by 2 (2 self)
Good spatial locality alleviates both the latency and bandwidth problems of memory by boosting the effect of prefetching and improving the utilization of cache. However, conventional definitions of spatial locality are inadequate for a programmer to precisely quantify the quality of a program, to identify causes of poor locality, and to estimate the potential by which spatial locality can be improved. This paper describes a new, component-based model for spatial locality. It is based on measuring the change of reuse distances as a function of the data-block size. It divides spatial locality into components at program and behavior levels. While the base model is costly because it requires tracking the locality of every memory access, the overhead can be reduced by using small inputs and by extending a sampling-based tool. The paper presents the results of the analysis for a large set of benchmarks, the cost of the analysis, and the experience of a user study, in which the analysis helped to locate a data-layout problem and improve performance by 7% with a 6-line change in an application with over 2,000 lines.
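The idea of measuring reuse distance as a function of data-block size can be illustrated with a minimal sketch. It is not the paper's tool or its component model: the quadratic-time LRU-stack scan, the function name `reuse_distances`, and the toy address trace are all assumptions made for illustration. Here the reuse distance of an access is taken to be the number of distinct blocks touched since the previous access to the same block.

```python
def reuse_distances(addresses, block_size):
    """Reuse distance of each access at the given block granularity.

    First accesses to a block yield None. This simple LRU-stack scan is
    O(n * m) and is meant only to illustrate the definition.
    """
    stack = []      # blocks ordered from least to most recently used
    distances = []
    for addr in addresses:
        block = addr // block_size
        if block in stack:
            pos = stack.index(block)
            # Blocks above `pos` were touched since the last access to `block`.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(block)
    return distances


if __name__ == "__main__":
    trace = [0, 8, 16, 24, 0, 8, 16, 24]  # hypothetical byte addresses
    for size in (8, 16, 32):
        print(size, reuse_distances(trace, size))
```

On this toy trace the distances shrink each time the block size doubles, because neighbouring addresses fall into the same block; that shrinkage is the kind of change the abstract describes measuring.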
A component-based definition of spatial locality
- 2006
The data layout of a program is critical to performance because it determines the spatial locality of the data access. Most quantitative notions of spatial locality are based on the overall miss rate and leave three questions not fully answered: how much can the locality of a given data layout be improved, can a data layout be improved if the miss rate cannot be lowered, and can the overall spatial locality be decomposed into smaller components? This paper describes a new definition of spatial locality that addresses these questions. The model is based on off-line profiling of a sequential execution. It has been used to analyze the spatial locality of 14 SPEC2000 benchmarks.
Vertical Profiling: Evaluating Computer Architectures using Commercial Applications
This paper demonstrates how a performance analysis technique, vertical profiling, can be used to determine the cause of a performance anomaly: a gradual increase in instructions per cycle over time. Understanding the cause required trace information from multiple layers of the execution stack (application, Java virtual machine, and hardware), expert knowledge in each layer, and repeated application of the process. Evaluating today’s complex software and hardware systems requires sophisticated performance analysis techniques. Nevertheless, as future software and hardware systems become more complex, these performance analysis techniques must be automated.
A Component-based Definition of Spatial Locality
- 2008
The data layout of a program is critical to performance because it determines the spatial locality of the data access. Most quantitative notions of spatial locality are based on the overall miss rate and leave three questions not fully answered: how much can the locality of a given data layout be improved, can a data layout be improved if the miss rate cannot be lowered, and can the overall spatial locality be decomposed into finer components? This paper describes a new definition of spatial locality that addresses these questions. The model is based on online profiling and off-line analysis. It has been used to analyze 7 SPEC2000 benchmarks and 1 SPEC2006 benchmark. Among their 18 components, it finds 5 components that have a significant problem of poor spatial locality.
Auto-tuning Parallel Programs at Compiler- and Application-Levels
- 2009
Auto-tuning has recently received its fair share of attention from the High Performance Computing community. Most auto-tuning approaches are specialized to work either on specific domains (dense/sparse linear algebra, stencil computations, etc.) or only at certain stages of program execution (compile-time, launch-time, or run-time). Real scientific applications, however, demand a cohesive environment that can efficiently provide auto-tuning solutions at all stages of application development and deployment. Towards that end, in this paper, we describe a unified end-to-end approach to auto-tuning scientific applications. A unique feature of our search-based auto-tuning system is a powerful parallel search algorithm, which leverages parallelism to effectively navigate the search space defined by compiler-level and application-level tunable parameters. Our system is general-purpose, and the results presented in this paper demonstrate its applicability in tuning compiler-generated and application-specific input parameter spaces.
Cache Conscious Task Regrouping on Multicore Processors
Because of the interference in the shared cache on multicore processors, the performance of a program can be severely affected by its co-running programs. If job scheduling does not consider how a group of tasks utilizes the cache, the performance may degrade significantly, and the degradation usually varies sizably and unpredictably from run to run. In this paper, we use trace-based program locality analysis and make it efficient enough for dynamic use. We show a complete on-line system for periodically measuring the parallel execution, predicting and ranking cache interference for all co-run choices, and reorganizing programs based on the prediction. We test our system on floating-point and mixed integer and floating-point workloads composed of SPEC 2006 benchmarks and compare with the default Linux job scheduler to show the benefit of the new system in improving performance and reducing performance variation. Keywords: multicore; task grouping; online program locality analysis; lifetime sampling

