An experimental survey of energy management across the stack. In OOPSLA, 2014.
"... Modern demand for energy-efficient computation has spurred research at all levels of the stack, from devices to microarchi-tecture, operating systems, compilers, and languages. Unfor-tunately, this breadth has resulted in a disjointed space, with technologies at different levels of the system stack ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
(Show Context)
Modern demand for energy-efficient computation has spurred research at all levels of the stack, from devices to microarchitecture, operating systems, compilers, and languages. Unfortunately, this breadth has resulted in a disjointed space, with technologies at different levels of the system stack rarely compared, let alone coordinated. This work begins to remedy the problem, conducting an experimental survey of the present state of energy management across the stack. Focusing on settings that are exposed to software, we measure the total energy, average power, and execution time of 41 benchmark applications in 220 configurations, across a total of 200,000 program executions. Some of the more important findings of the survey include that effective parallelization and compiler optimizations have the potential to save far more energy than Linux’s frequency tuning algorithms; that certain non-complementary energy strategies can undercut each other’s savings by half when combined; and that while the power impacts of most strategies remain constant across applications, the runtime impacts vary, resulting in inconsistent energy impacts.
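As a rough idea of what one such measurement looks like on a Linux machine that exposes Intel RAPL counters through the powercap sysfs interface, the sketch below wraps a hypothetical run_benchmark() workload with energy and wall-clock readings. It is an illustration only, not the survey's actual harness, and the sysfs path and workload are assumptions.

```c
/* Minimal sketch: measure energy, average power, and runtime of a workload
 * via the Linux powercap (Intel RAPL) package-0 counter. Reading energy_uj
 * usually requires root; run_benchmark() is a hypothetical stand-in for one
 * benchmark configuration. */
#include <stdio.h>
#include <time.h>

#define RAPL_PATH "/sys/class/powercap/intel-rapl:0/energy_uj"

static unsigned long long read_energy_uj(void) {
    unsigned long long uj = 0;
    FILE *f = fopen(RAPL_PATH, "r");
    if (f) {
        if (fscanf(f, "%llu", &uj) != 1) uj = 0;
        fclose(f);
    }
    return uj;
}

static void run_benchmark(void) {          /* hypothetical workload */
    volatile double x = 0.0;
    for (long i = 0; i < 200000000L; i++) x += i * 1e-9;
}

int main(void) {
    struct timespec t0, t1;
    unsigned long long e0 = read_energy_uj();
    clock_gettime(CLOCK_MONOTONIC, &t0);

    run_benchmark();

    clock_gettime(CLOCK_MONOTONIC, &t1);
    unsigned long long e1 = read_energy_uj();

    double secs   = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double joules = (e1 - e0) * 1e-6;      /* ignores counter wrap-around */
    printf("time %.2f s, energy %.2f J, avg power %.2f W\n",
           secs, joules, joules / secs);
    return 0;
}
```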
Main memory and cache performance of Intel Sandy Bridge and AMD Bulldozer. In Workshop on Memory Systems Performance and Correctness, 2014.
"... Abstract Application performance on multicore processors is seldom constrained by the speed of floating point or integer units. Much more often, limitations are caused by the memory subsystem, particularly shared resources such as last level caches or memory controllers. Measuring, predicting and m ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
Application performance on multicore processors is seldom constrained by the speed of floating point or integer units. Much more often, limitations are caused by the memory subsystem, particularly shared resources such as last level caches or memory controllers. Measuring, predicting and modeling memory performance becomes a steeper challenge with each new processor generation due to the growing complexity and core count. We tackle the important aspect of measuring and understanding undocumented memory performance numbers in order to create valuable insight into microprocessor details. For this, we build upon a set of sophisticated benchmarks that support latency and bandwidth measurements to arbitrary locations in the memory subsystem. These benchmarks are extended to support AVX instructions for bandwidth measurements and to integrate the coherence states (O)wned and (F)orward. We then use these benchmarks to perform an in-depth analysis of current ccNUMA multiprocessor systems with Intel (Sandy Bridge-EP) and AMD (Bulldozer) processors. Using our benchmarks we present fundamental memory performance data and illustrate performance-relevant architectural properties of both designs.
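Latency measurements "to arbitrary locations in the memory subsystem" are typically built on pointer chasing, where each load depends on the previous one so the hardware cannot overlap accesses. The following is a generic sketch of that idea, not the authors' benchmark suite; varying the buffer size moves the working set from L1 through L3 into DRAM.

```c
/* Minimal pointer-chasing latency sketch: each load depends on the previous
 * one, so access time cannot be hidden by out-of-order execution. The chain
 * is a random cyclic permutation so the prefetcher cannot follow it. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_BYTES (64u * 1024u * 1024u)   /* 64 MiB: larger than typical LLC */
#define ITERS     (20u * 1000u * 1000u)

int main(void) {
    size_t n = BUF_BYTES / sizeof(void *);
    void **buf = malloc(n * sizeof(void *));
    size_t *idx = malloc(n * sizeof(size_t));

    /* Build a random cyclic permutation of all buffer slots. */
    for (size_t i = 0; i < n; i++) idx[i] = i;
    srand(42);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        buf[idx[i]] = &buf[idx[(i + 1) % n]];

    struct timespec t0, t1;
    void **p = &buf[idx[0]];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < ITERS; i++)
        p = (void **)*p;                   /* dependent load chain */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg load latency: %.1f ns (p=%p)\n", ns / ITERS, (void *)p);
    free(idx); free(buf);
    return 0;
}
```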
An analysis of energy-optimized lattice-Boltzmann CFD simulations from the chip to the highly parallel level
"... Algorithms with low computational intensity show interesting per-formance and power consumption behavior on multicore proces-sors. We choose the lattice-Boltzmann method (LBM) as a pro-totype for this scenario in order to show if and how single-chip performance and power characteristics can be gener ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
Algorithms with low computational intensity show interesting performance and power consumption behavior on multicore processors. We choose the lattice-Boltzmann method (LBM) as a prototype for this scenario in order to show if and how single-chip performance and power characteristics can be generalized to the highly parallel case. LBM is an algorithm for CFD simulations that has gained popularity due to its ease of implementation and suitability for complex geometries. In this paper we perform a thorough analysis of a sparse-lattice LBM implementation on the Intel Sandy Bridge processor. Starting from a single-core performance model we can describe the intra-chip saturation characteristics of the code and its optimal operating point in terms of energy to solution as a function of the propagation method, the clock frequency, and the SIMD vectorization. We then show how these findings may be extrapolated to the massively parallel level on a petascale-class machine, and quantify the energy-saving potential of various optimizations. We find that high single-core performance and a correct choice of the number of cores used on the chip are the essential factors for lowest energy to solution with minimal loss of performance. In the highly parallel case, these guidelines are found to be even more important for fixing the optimal performance-energy operating point, especially when taking the system’s baseline power consumption and the MPI communication characteristics into account. Simplistic measures often applied by users and computing centers, such as setting a low clock speed for memory-bound applications, have limited impact.
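The argument about energy to solution as a function of core count and clock frequency can be made concrete with a toy model: performance saturates at a bandwidth roof while chip power keeps growing with active cores and frequency, so energy per unit of work has a distinct optimum. The coefficients below are invented for illustration and are not taken from the paper.

```c
/* Toy energy-to-solution model for a bandwidth-limited code. Performance
 * saturates at a memory-bandwidth roof; power grows with cores and
 * frequency; energy per unit of work is power divided by performance.
 * All coefficients are assumed, not measured. */
#include <stdio.h>

static double dmin(double a, double b) { return a < b ? a : b; }

int main(void) {
    const double perf_core_per_ghz = 80.0;   /* work/s per core per GHz (assumed) */
    const double perf_roof         = 900.0;  /* saturation roof (assumed)         */
    const double p_base            = 40.0;   /* W: baseline/uncore power (assumed)*/
    const double p_core_per_ghz    = 4.0;    /* W per core per GHz (assumed)      */

    for (double f = 1.2; f <= 2.81; f += 0.4) {
        for (int n = 1; n <= 8; n++) {
            double perf  = dmin(n * perf_core_per_ghz * f, perf_roof);
            double power = p_base + n * p_core_per_ghz * f;
            double e2s   = power / perf;     /* J per unit of work: lower is better */
            printf("f=%.1f GHz n=%d  perf=%6.1f  power=%5.1f W  E/work=%.3f\n",
                   f, n, perf, power, e2s);
        }
    }
    return 0;
}
```

Scanning the output shows the pattern the abstract describes: once performance saturates, adding cores only adds power, and very low clock speeds trade a lot of runtime for little power savings.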
The energy case for graph processing on hybrid CPU and GPU systems. In Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms, 2013.
"... This paper investigates the power, energy, and performance characteristics of large-scale graph processing on hybrid (i.e., CPU and GPU) single-node systems. Graph processing can be accelerated on hybrid systems by properly mapping the graph-layout to processing units, such that the algorithmic task ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
This paper investigates the power, energy, and performance characteristics of large-scale graph processing on hybrid (i.e., CPU and GPU) single-node systems. Graph processing can be accelerated on hybrid systems by properly mapping the graph layout to processing units, such that the algorithmic tasks exercise each of the units where they perform best. However, the GPUs have much higher Thermal Design Power (TDP), thus their impact on the overall energy consumption is unclear. Our evaluation using large real-world graphs and synthetic graphs as large as 1 billion vertices and 16 billion edges shows that a hybrid system is efficient in terms of both time-to-solution and energy.
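The trade-off the abstract raises, higher GPU power draw against shorter time-to-solution, comes down to comparing energy E = P_avg x t for each platform. A back-of-the-envelope sketch with assumed numbers (not measurements from the paper):

```c
/* Back-of-the-envelope energy comparison: a hybrid platform draws more
 * power but may finish sooner. All numbers below are assumed for
 * illustration, not results from the paper. */
#include <stdio.h>

int main(void) {
    double p_cpu = 200.0, t_cpu = 300.0;   /* CPU-only: avg W, seconds */
    double p_hyb = 450.0, t_hyb = 110.0;   /* CPU+GPU:  avg W, seconds */

    double e_cpu = p_cpu * t_cpu;          /* energy in joules */
    double e_hyb = p_hyb * t_hyb;

    printf("CPU-only: %.0f J   hybrid: %.0f J   ratio %.2f\n",
           e_cpu, e_hyb, e_hyb / e_cpu);   /* hybrid wins on energy if ratio < 1 */
    return 0;
}
```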
Software Power Analysis and Optimization for Power-Aware Multicore Systems, 2014.
"... ..."
(Show Context)
CPT: An Energy-Efficiency Model for Multi-core Computer Systems
"... Abstract—Resolving the excessive energy consumption of modern computer systems has become a substantial challenge. Therefore, various techniques have been proposed to reduce power dissipation and improve energy efficiency of computer systems. These techniques affect the energy efficiency across diff ..."
Abstract
- Add to MetaCart
(Show Context)
Resolving the excessive energy consumption of modern computer systems has become a substantial challenge. Therefore, various techniques have been proposed to reduce power dissipation and improve energy efficiency of computer systems. These techniques affect the energy efficiency across different layers in a system. In order to better understand and analyze those techniques, it is necessary to obtain a general metric that represents the energy efficiency of a computer system, for a specific configuration, given a certain amount of workload. In this paper, we take the initial step and define a general energy-efficiency model, the CPT model, for multi-core computer systems. CPT is a unified model that helps to decide the best configuration of a system in terms of energy efficiency to execute a given workload. In addition, we expect the model can be utilized to analyze possible knobs that are used to improve energy efficiency. Three case studies are employed to illustrate the usage of the proposed CPT model.
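The abstract does not spell out the model's terms. If, as the letters suggest, C is the amount of computation, P the average power, and T the execution time, then a metric of the form C / (P x T) is simply work per joule. The sketch below uses that assumed reading to compare two configurations; it is not the paper's actual definition.

```c
/* Hedged sketch: the abstract does not define the CPT model's terms.
 * Assuming C = computation completed, P = average power, T = execution
 * time, the metric C / (P * T) is work per joule; higher is better. */
#include <stdio.h>

/* hypothetical helper: efficiency of one configuration */
static double efficiency(double work, double avg_power_w, double time_s) {
    return work / (avg_power_w * time_s);      /* work per joule */
}

int main(void) {
    /* Two made-up configurations running the same workload. */
    double work  = 1.0e12;                         /* e.g. operations completed */
    double eff_a = efficiency(work, 95.0, 42.0);   /* config A: faster, hotter  */
    double eff_b = efficiency(work, 60.0, 61.0);   /* config B: slower, cooler  */

    printf("config A: %.3e ops/J\nconfig B: %.3e ops/J\n", eff_a, eff_b);
    printf("pick %s\n", eff_a > eff_b ? "A" : "B");
    return 0;
}
```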
How Processor Speedups Can Slow Down I/O Performance. Hung-Ching Chang et al.
"... Abstract—Power states in power-scalable systems are managed to maximize performance and reduce energy waste. Power-scalable processor capabilities (e.g., Intel Turbo Boost) embrace a “faster is better ” approach to power management. While these technologies can vastly improve performance and energy ..."
Abstract
- Add to MetaCart
Power states in power-scalable systems are managed to maximize performance and reduce energy waste. Power-scalable processor capabilities (e.g., Intel Turbo Boost) embrace a “faster is better” approach to power management. While these technologies can vastly improve performance and energy efficiency, there is a growing body of evidence that “faster is not always better”. For example, in some I/O intensive benchmarks, we observe up to 47% performance loss when running codes at faster (higher power) frequencies versus slower (lower power) frequencies. To the best of our knowledge, this is the first work to systematically and accurately pinpoint the root cause of these types of slowdowns. The lack of such studies is likely due to three challenges we overcome in this work: 1) high runtime system variance; 2) bottleneck isolation across user- and system-space boundaries; and 3) non-determinism in parallel codes. Our analytical model-driven approach identifies Atomic Batch Transactions (ABTs) in the Linux kernel as the cause of slowdowns at higher processor speeds. We propose and evaluate the use of power-aware ABTs that can increase performance more than 3-fold over the default Linux kernel while maintaining comparable reliability. Our work motivates the need for more studies that potentially reconsider the “faster is better” design paradigm. Keywords: performance, energy efficiency, atomic batch transactions.
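One way to reproduce this kind of comparison is to cap the CPU frequency through the cpufreq sysfs interface and time an fsync-heavy workload at each cap. The sketch below (root required) pins itself to CPU 0 and uses placeholder frequency values; it stands in for the methodology only and is not the authors' instrumented kernel study.

```c
/* Sketch of the comparison behind "faster is not always better": pin to
 * CPU 0, cap its frequency via cpufreq sysfs (root required), and time an
 * fsync-heavy loop at each cap. The frequency values are placeholders;
 * use entries from scaling_available_frequencies on a real machine. */
#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sched.h>
#include <time.h>

#define FREQ_FILE "/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq"

static void cap_freq_khz(long khz) {
    FILE *f = fopen(FREQ_FILE, "w");
    if (f) { fprintf(f, "%ld\n", khz); fclose(f); }
}

static double timed_io_workload(void) {            /* fsync-heavy toy workload */
    char buf[4096] = {0};
    int fd = open("io_test.tmp", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 2000; i++) {
        (void)write(fd, buf, sizeof(buf));
        fsync(fd);                                  /* force the I/O path */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);
    unlink("io_test.tmp");
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void) {
    cpu_set_t mask;
    CPU_ZERO(&mask); CPU_SET(0, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);      /* keep the workload on cpu0 */

    long caps_khz[] = { 1200000, 2400000, 3600000 };  /* placeholder values */
    for (int i = 0; i < 3; i++) {
        cap_freq_khz(caps_khz[i]);
        printf("cap %ld kHz: %.2f s\n", caps_khz[i], timed_io_workload());
    }
    return 0;
}
```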