Results 11 - 20 of 1,844
Transient Fault Detection via Simultaneous Multithreading
- In Proceedings of the 27th Annual International Symposium on Computer Architecture
, 2000
Cited by 267 (7 self)
Smaller feature sizes, reduced voltage levels, higher transistor counts, and reduced noise margins make future generations of microprocessors increasingly prone to transient hardware faults. Most commercial fault-tolerant computers use fully replicated hardware components to detect microprocessor faults. The components are lockstepped (cycle-by-cycle synchronized) to ensure that, in each cycle, they perform the same operation on the same inputs, producing the same outputs in the absence of faults. Unfortunately, for a given hardware budget, full replication reduces performance by statically partitioning resources among redundant operations. We demonstrate
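The lockstep comparison that this abstract contrasts against can be sketched in software. The toy below shows only the detect-by-comparing idea (two redundant executions of the same operation, outputs compared at the commit point); it is not the paper's SMT-based design, and the fault-injection helper is an invented illustration:

```python
def lockstep_execute(op, value):
    """Toy lockstep / dual-modular-redundancy check (a sketch, not the
    paper's SMT-based design): run the same operation on two 'replicas'
    with the same input and compare outputs to detect a transient fault."""
    out_a = op(value)   # replica A
    out_b = op(value)   # replica B (cycle-synchronized in real hardware)
    if out_a != out_b:
        raise RuntimeError("transient fault: replica outputs diverge")
    return out_a

def make_faulty(op, fault_on_call):
    """Wrap op so its result has one bit flipped on the given call number,
    emulating a transient (soft) fault in one replica."""
    count = {"n": 0}
    def wrapped(x):
        count["n"] += 1
        r = op(x)
        return r ^ 1 if count["n"] == fault_on_call else r
    return wrapped
```

With a fault injected into the second replica's execution, the comparison at the commit point catches the divergence; fault-free runs pass through unchanged.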
SMARTS: Accelerating Microarchitecture Simulation via Rigorous Statistical Sampling
- In Proceedings of the 30th Annual International Symposium on Computer Architecture
, 2003
Cited by 258 (25 self)
Current software-based microarchitecture simulators are many orders of magnitude slower than the hardware they simulate. Hence, most microarchitecture design studies draw their conclusions from drastically truncated benchmark simulations that are often inaccurate and misleading. This paper presents the Sampling Microarchitecture Simulation (SMARTS) framework as an approach to enable fast and accurate performance measurements of full-length benchmarks. SMARTS accelerates simulation by selectively measuring in detail only an appropriate benchmark subset. SMARTS prescribes a statistically sound procedure for configuring a systematic sampling simulation run to achieve a desired quantifiable confidence in estimates. Analysis of 41 of the 45 possible SPEC2K benchmark/input combinations shows that CPI and energy per instruction (EPI) can be estimated to within ±3% with 99.7% confidence by measuring fewer than 50 million instructions per benchmark. In practice, inaccuracy in microarchitectural state initialization introduces an additional uncertainty, which we empirically bound to ~2% for the tested benchmarks. Our implementation of SMARTS achieves an actual average error of only 0.64% on CPI and 0.59% on EPI for the tested benchmarks, running with average speedups of 35 and 60 over detailed simulation of 8-way and 16-way out-of-order processors, respectively.
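The sampling procedure the abstract describes can be sketched in miniature. The trace, constants, and function names below are invented for illustration (real SMARTS also prescribes functional warming and detailed-measurement unit sizes that this sketch omits); the sample-size rule n >= (z·V/e)² is the standard one the framework applies:

```python
import math
import random
import statistics

def required_sample_size(coeff_var, rel_error=0.03, z=3.0):
    """SMARTS-style sample size: n >= (z * V / e)^2 measurement units give
    relative error e at the confidence implied by z (z = 3 ~ 99.7%)."""
    return math.ceil((z * coeff_var / rel_error) ** 2)

def systematic_sample(stream, n_samples):
    """Systematic sampling: measure every k-th unit across the full run."""
    k = max(1, len(stream) // n_samples)
    return stream[::k]

# Synthetic per-unit CPI trace (invented numbers): slow drift plus noise.
random.seed(0)
trace = [1.0 + 0.3 * (i / 100_000) + random.gauss(0, 0.1)
         for i in range(100_000)]

V = statistics.pstdev(trace) / statistics.fmean(trace)  # coefficient of variation
n = required_sample_size(V)         # units that need detailed measurement
estimate = statistics.fmean(systematic_sample(trace, n))
true_cpi = statistics.fmean(trace)
```

On this synthetic trace a few hundred sampled units out of 100,000 suffice to land within the 3% target, which is the effect the abstract's "fewer than 50 million instructions per benchmark" figure reflects at full scale.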
Phase Tracking and Prediction
, 2003
Cited by 233 (19 self)
In a single second a modern processor can execute billions of instructions. Obtaining a bird's eye view of the behavior of a program at these speeds can be a difficult task when all that is available is cycle by cycle examination. In many programs, behavior is anything but steady state, and understanding the patterns of behavior, at run-time, can unlock a multitude of optimization opportunities.
Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution
, 2001
Cited by 227 (10 self)
Serialization of threads due to critical sections is a fundamental bottleneck to achieving high performance in multithreaded programs. Dynamically, such serialization may be unnecessary because these critical sections could have safely executed concurrently without locks. Current processors cannot fully exploit such parallelism because they do not have mechanisms to dynamically detect such false inter-thread dependences. We propose Speculative Lock Elision (SLE), a novel micro-architectural technique to remove dynamically unnecessary lock-induced serialization and enable highly concurrent multithreaded execution. The key insight is that locks do not always have to be acquired for a correct execution. Synchronization instructions are predicted as being unnecessary and elided. This allows multiple threads to concurrently execute critical sections protected by the same lock. Misspeculation due to inter-thread data conflicts is detected using existing cache mechanisms and rollback is used for recovery. Successful speculative elision is validated and committed without acquiring the lock. SLE can be implemented entirely in microarchitecture without instruction set support and without system-level modifications, is transparent to programmers, and requires only trivial additional hardware support. SLE can provide programmers a fast path to writing correct high-performance multithreaded programs.
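The speculate-validate-commit flow can be illustrated with a toy software analogue. SLE proper is a hardware mechanism with no software visibility; the class below (all names invented) merely mimics the control flow, using a version counter where the hardware uses cache-coherence conflict detection:

```python
import threading

class ElidableLock:
    """Toy software analogue of Speculative Lock Elision (illustrative
    only; real SLE lives entirely in the microarchitecture). A critical
    section first runs speculatively against a snapshot; if no other
    update committed meanwhile, the result commits as if the lock had
    never been needed, otherwise we re-execute under the real lock."""

    def __init__(self, state):
        self.state = dict(state)
        self.version = 0             # bumped on every committed update
        self.lock = threading.Lock()
        self.elided = 0              # critical sections committed lock-free

    def critical_section(self, update):
        snapshot, seen = dict(self.state), self.version
        result = update(snapshot)             # speculative, lock not held
        with self.lock:                       # atomic commit point
            if self.version == seen:          # no conflict: elide the lock
                self.state, self.version = result, seen + 1
                self.elided += 1
                return
        with self.lock:                       # conflict: conventional path
            self.state = update(dict(self.state))
            self.version += 1

counter = ElidableLock({"x": 0})
for _ in range(5):
    counter.critical_section(lambda s: {**s, "x": s["x"] + 1})
# counter.state["x"] == 5, and all five commits were elided in this
# single-threaded run
```

The key correspondence is that the common, conflict-free case never needs exclusive ownership of the lock, which is exactly what lets multiple hardware threads execute critical sections guarded by the same lock concurrently.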
Gated-Vdd: A Circuit Technique to Reduce Leakage in Deep-Submicron Cache Memories
, 2000
Cited by 227 (11 self)
Deep-submicron CMOS designs have resulted in large leakage energy dissipation in microprocessors. While SRAM cells in on-chip cache memories always contribute to this leakage, there is a large variability in active cell usage both within and across applications. This paper explores an integrated architectural and circuit-level approach to reducing leakage energy dissipation in instruction caches. We propose gated-Vdd, a circuit-level technique to gate the supply voltage and reduce leakage in unused SRAM cells. Our results indicate that gated-Vdd together with a novel resizable cache architecture reduces energy-delay by 62% with minimal impact on performance.
The Generic Modeling Environment
- Workshop on Intelligent Signal Processing
, 2001
Cited by 209 (9 self)
The Generic Modeling Environment (GME) is a configurable toolset that supports the easy creation of domain-specific modeling and program synthesis environments. The primarily graphical, domain-specific models can represent the application and its environment, including hardware resources and their relationships. The models are then used to automatically synthesize the application and/or generate inputs to COTS analysis tools. In addition to traditional signal processing problems, we have applied this approach to tool integration and structurally adaptive systems, among other domains. This paper describes the GME toolset and compares it to other similar approaches. A case study is also presented that illustrates the core concepts through an example.
Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data
, 2003
Cited by 199 (4 self)
With power dissipation becoming an increasingly vexing problem across many classes of computer systems, measuring power dissipation of real, running systems has become crucial for hardware and software system research and design. Live power measurements are imperative for studies requiring execution times too long for simulation, such as thermal analysis. Furthermore, as processors become more complex and include a host of aggressive dynamic power management techniques, per-component estimates of power dissipation have become both more challenging and more important. In this paper we describe our technique for a coordinated measurement approach that combines real total power measurement with performance-counter-based, per-unit power estimation. The resulting tool offers live total power measurements for Intel Pentium 4 processors, and also provides power breakdowns for 22 of the major CPU subunits over minutes of SPEC2000 and desktop workload execution. As an example application, we use the generated component power breakdowns to identify program power phase behavior. Overall, this paper demonstrates a processor power measurement and estimation methodology and also gives experiences and empirical application results that can provide a basis for future power-aware research.
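The counter-driven per-unit estimation idea can be sketched as follows. The unit names, max-power values, and the simple idle-floor-plus-activity model are invented placeholders, not the paper's fitted Pentium 4 model; the point is only how counter-derived access rates scale per-unit power:

```python
def estimate_unit_power(access_rates, unit_max_power, idle_fraction=0.1):
    """Performance-counter-style power breakdown (hypothetical model):
    each unit dissipates an idle floor plus a dynamic share scaled by its
    observed activity (accesses per cycle, read from hardware counters)."""
    breakdown = {}
    for unit, rate in access_rates.items():
        pmax = unit_max_power[unit]
        breakdown[unit] = idle_fraction * pmax + (1 - idle_fraction) * pmax * rate
    return breakdown

# Invented counter readings (accesses/cycle) and per-unit maxima (watts).
rates = {"L1-cache": 0.60, "ALU": 0.45, "FPU": 0.05}
pmax  = {"L1-cache": 6.0,  "ALU": 4.0,  "FPU": 5.0}

breakdown = estimate_unit_power(rates, pmax)
total_estimate = sum(breakdown.values())
# breakdown["L1-cache"] -> 3.84 W; total -> 6.585 W under these inputs
```

In the paper's methodology such an estimate is continuously reconciled against a real measured total, which is what keeps the per-unit breakdowns honest over long runs.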
The Design and Use of SimplePower: A Cycle-Accurate Energy Estimation Tool
, 2000
Cited by 198 (11 self)
In this paper, we present the design and use of a comprehensive framework, SimplePower, for evaluating the effect of high-level algorithmic, architectural, and compilation tradeoffs on energy. An execution-driven, cycle-accurate RT-level energy estimation tool that uses transition-sensitive energy models forms the cornerstone of this framework. SimplePower also provides the energy consumed in the memory system and on-chip buses using analytical energy models.
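A transition-sensitive energy model in miniature: real SimplePower indexes per-module capacitance tables by the actual input transition, whereas this sketch, with an invented per-bit energy constant, only counts toggled bits, which captures the core idea that per-cycle energy depends on how the inputs changed, not just that the unit was used:

```python
def hamming(a, b):
    """Number of bit positions that toggled between two input vectors."""
    return bin(a ^ b).count("1")

class TransitionEnergyModel:
    """Toy transition-sensitive energy model (illustrative constants):
    energy charged each cycle depends on how many input bits switched
    relative to the previous cycle."""
    def __init__(self, energy_per_bit_toggle_pj):
        self.e_bit = energy_per_bit_toggle_pj
        self.prev = 0
        self.total_pj = 0.0

    def clock(self, inputs):
        self.total_pj += hamming(self.prev, inputs) * self.e_bit
        self.prev = inputs

model = TransitionEnergyModel(0.5)   # 0.5 pJ per toggled bit (invented)
for vector in (0b0000, 0b1111, 0b1110, 0b1110):
    model.clock(vector)
# toggles per cycle: 0, 4, 1, 0 -> total 2.5 pJ
```

The same repeated input (the last two cycles) costs nothing under this model, which is exactly the behavior a purely activity-count model would miss.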
McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures
- In Proceedings of the 42nd Annual Symposium on Microarchitecture
, 2009
Cited by 192 (4 self)
This paper introduces McPAT, an integrated power, area, and timing modeling framework that supports comprehensive design space exploration for multicore and manycore processor configurations ranging from 90nm to 22nm and beyond. At the microarchitectural level, McPAT includes models for the fundamental components of a chip multiprocessor, including in-order and out-of-order processor cores, networks-on-chip, shared caches, integrated memory controllers, and multiple-domain clocking. At the circuit and technology levels, McPAT supports critical-path timing modeling, area modeling, and dynamic, short-circuit, and leakage power modeling for each of the device types forecast in the ITRS roadmap, including bulk CMOS, SOI, and double-gate transistors.
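At their core, the dynamic and leakage power models such a framework rests on reduce to the standard CMOS relations. A minimal sketch with invented operating-point numbers (these are textbook formulas, not McPAT's calibrated models):

```python
def dynamic_power_w(activity, switched_cap_f, vdd_v, freq_hz):
    """Standard CMOS dynamic power: P_dyn = a * C * Vdd^2 * f,
    where a is the switching activity factor and C the switched
    capacitance in farads."""
    return activity * switched_cap_f * vdd_v ** 2 * freq_hz

def leakage_power_w(i_leak_a, vdd_v):
    """Static (leakage) power: P_leak = I_leak * Vdd."""
    return i_leak_a * vdd_v

# Invented operating point: 0.2 activity, 1 nF switched, 1.0 V, 2 GHz.
p_dyn = dynamic_power_w(activity=0.2, switched_cap_f=1.0e-9,
                        vdd_v=1.0, freq_hz=2.0e9)
p_leak = leakage_power_w(i_leak_a=0.05, vdd_v=1.0)
# p_dyn -> 0.4 W; p_leak -> 0.05 W
```

The quadratic dependence on Vdd in the first relation is why supply-voltage scaling dominates power-oriented design space exploration across the technology nodes the abstract lists.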
Managing Multi-Configurable Hardware via Dynamic Working Set Analysis
- In 29th Annual International Symposium on Computer Architecture
, 2002
Cited by 192 (3 self)
Microprocessors are designed to provide good average performance over a variety of workloads. This can lead to inefficiencies both in power and performance for individual programs and during individual phases within the same program. Microarchitectures with multi-configuration units (e.g. caches, predictors, instruction windows) are able to adapt dynamically to program behavior and enable/disable resources as needed. A key element of existing configuration algorithms is adjusting to program phase changes. This is typically done by "tuning" when a phase change is detected -- i.e. sequencing through a series of trial configurations and selecting the best. We study algorithms that dynamically collect and analyze program working set information. To make this practical, we propose working set signatures -- highly compressed working set representations (e.g. 32-128 bytes total). We describe algorithms that use working set signatures to 1) detect working set changes and trigger re-tuning; 2) identify recurring working sets and re-install saved optimal reconfigurations, thus avoiding the time-consuming tuning process; 3) estimate working set sizes to configure caches directly to the proper size, also avoiding the tuning process. We use reconfigurable instruction caches to demonstrate the performance of the proposed algorithms. When applied to reconfigurable instruction caches, an algorithm that identifies recurring phases achieves power savings and performance similar to the best algorithm reported to date, but with orders-of-magnitude savings in retunings.
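A working set signature of the kind the abstract proposes can be sketched as follows. The hash choice, signature width, and address ranges are illustrative assumptions rather than the paper's exact design, but the shape matches: each touched address sets one bit in a small fixed-size vector, and a relative distance between vectors detects phase changes:

```python
import hashlib

def signature(working_set, bits=1024):
    """Hash each touched address (or branch PC) into a fixed-size bit
    vector; 1024 bits = 128 bytes, at the top of the paper's 32-128
    byte range."""
    sig = 0
    for pc in working_set:
        h = int.from_bytes(hashlib.blake2b(pc.to_bytes(8, "little"),
                                           digest_size=4).digest(), "little")
        sig |= 1 << (h % bits)
    return sig

def rel_distance(a, b):
    """Relative signature distance |A xor B| / |A or B|: near 0 means the
    same working set, near 1 a completely different one. Comparing this
    against a threshold detects a working set change."""
    union = bin(a | b).count("1")
    return bin(a ^ b).count("1") / union if union else 0.0

phase1  = signature(range(0x1000, 0x1200, 8))   # one code region
phase1b = signature(range(0x1000, 0x1200, 8))   # same region revisited
phase2  = signature(range(0x9000, 0x9400, 8))   # a different region
# rel_distance(phase1, phase1b) == 0.0; distance to phase2 is near 1.0
```

Storing past signatures alongside the configuration that was tuned for them gives the paper's second mechanism for free: when a recurring signature is recognized, the saved configuration is re-installed instead of re-tuning.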