Reducing Power Requirements of Instruction Scheduling Through Dynamic Allocation of Multiple Datapath Resources
- in Proc. of MICRO–34, 2001
Cited by 81 (12 self)
The “one–size–fits–all” philosophy used for permanently allocating datapath resources in today’s superscalar CPUs to maximize performance across a wide range of applications results in the overcommitment of resources in general. To reduce power dissipation in the datapath, the resource allocations can be dynamically adjusted based on the demands of applications. We propose a mechanism to dynamically, simultaneously and independently adjust the sizes of the issue queue (IQ), the reorder buffer (ROB) and the load/store queue (LSQ) based on the periodic sampling of their occupancies to achieve significant power savings with minimal impact on performance. Resource upsizing is done more aggressively (compared to downsizing) using the relative rate of blocked dispatches to limit the performance penalty. Our results are validated by the execution of the SPEC 95 benchmark suite on a substantially modified version of the SimpleScalar simulator, where the IQ, the ROB, the LSQ and the register files are implemented as separate structures, as is the case with most practical implementations. For the SPEC 95 benchmarks, the use of our technique in a 4–way superscalar processor results in power savings in excess of 70% within individual components and an average power savings of 53% for the IQ, LSQ and ROB combined for the entire benchmark suite, with an average performance penalty of only 5%.
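The resizing idea in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's actual design: the class name `ResizableQueue`, the partition granularity, the one-partition-at-a-time policy, and the blocked-dispatch threshold are all hypothetical. A caller could make upsizing more aggressive than downsizing simply by running the upsizing check at a shorter interval than the occupancy-sampling period.

```python
from dataclasses import dataclass


@dataclass
class ResizableQueue:
    """Hypothetical model of an IQ, ROB, or LSQ whose active size changes in partitions."""
    max_size: int      # total entries physically present
    partition: int     # resizing granularity in entries (assumed)
    active_size: int   # entries currently powered on

    def downsize_if_underused(self, avg_occupancy: float) -> None:
        # Turn off one partition when the sampled average occupancy leaves
        # at least a full partition unused; never drop below one partition.
        unused = self.active_size - avg_occupancy
        if unused >= self.partition and self.active_size > self.partition:
            self.active_size -= self.partition

    def upsize_if_blocking(self, blocked_dispatches: int, threshold: int) -> None:
        # Turn on one partition when dispatch was blocked on this queue
        # more often than the (assumed) threshold allows.
        if blocked_dispatches > threshold:
            self.active_size = min(self.max_size, self.active_size + self.partition)


iq = ResizableQueue(max_size=32, partition=8, active_size=32)
iq.downsize_if_underused(avg_occupancy=10.0)   # 22 unused entries: shrink to 24
iq.downsize_if_underused(avg_occupancy=10.0)   # 14 unused entries: shrink to 16
iq.downsize_if_underused(avg_occupancy=10.0)   # 6 unused entries: stay at 16
iq.upsize_if_blocking(blocked_dispatches=5, threshold=3)  # grow back to 24
```

Each structure would get its own `ResizableQueue` instance, which mirrors the abstract's point that the IQ, ROB and LSQ are resized independently.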
Front-End Policies for Improved Issue Efficiency in SMT Processors
- Proceedings of the 9th Intl. Conference on High Performance Computer Architecture, 2003
Cited by 35 (1 self)
The performance and power optimization of dynamic superscalar microprocessors requires striking a careful balance between exploiting parallelism and hardware simplification. Hardware structures which are needlessly complex may exacerbate critical timing paths and dissipate extra power. One such structure requiring careful design is the issue queue. In a Simultaneous Multi-Threading (SMT) processor, it is particularly challenging to achieve issue queue simplification due to the increased utilization of the queue afforded by multi-threading.
Energy efficient co-adaptive instruction fetch and issue
- In ISCA ’03: Proceedings of the 30th Annual International Symposium on Computer Architecture, 2003
Cited by 25 (1 self)
Front-end instruction delivery accounts for a significant fraction of the energy consumed in a dynamic superscalar processor. The issue queue in these processors serves two crucial roles: it bridges the front and back ends of the processor and serves as the window of instructions for the out-of-order engine. A mismatch between the front-end producer rate and the back-end consumer rate, and between the instruction window supplied by the front end and the instruction window required to exploit the level of application parallelism, results in additional front-end energy and increases the issue queue utilization. While the former increases overall processor energy consumption, the latter aggravates the issue queue hot spot problem. We propose a complementary combination of fetch gating
Cross-Layer Adaptive Video Coding to Reduce Energy on General-Purpose Processors
- In Proc. of IEEE Intl. Conf. on Image Processing, 2003
Cited by 16 (2 self)
Traditionally, video encoders have been designed assuming that the more redundancy is removed, the better the encoder. However, on current laptops, reducing the compression efficiency of the video encoder by reducing the number of instructions used to perform compression can actually reduce the total energy used to encode and transmit a sequence. The correct balance between computation and compression efficiency may change dynamically, motivating adaptive encoders. At the same time, recent general-purpose processors also employ energy-driven adaptations. For best gains, the adaptations in the hardware and application layers must be coordinated. From a system design viewpoint, this coordination must happen through minimal, well-defined interfaces.
Dynamic Allocation of Datapath Resources for Low Power
- in Proc. of Workshop on Complexity–Effective Design, held in conjunction with ISCA–28, 2001
Cited by 10 (2 self)
We show by profiling the execution of SPEC95 benchmarks that the usage of datapath resources in a modern superscalar processor is highly dynamic and correlated. The one–size–fits–all philosophy used for permanently allocating datapath resources in a modern superscalar CPU is thus complexity–ineffective due to the overcommitment of resources in general. We propose a strategy to dynamically and simultaneously adjust the sizes of two such correlated resources – the dispatch buffer (also known as an issue queue) and the reorder buffer – to reduce power dissipation in the datapath without significant impact on the performance. We also show how the resizing technique can be augmented with dynamic adaptation of the dispatch rate. Representative results show a reduction in power dissipation of 69% for the dispatch buffer and of 52% for the reorder buffer, with an average IPC loss below 8.5%.
A Case for Dynamic Pipeline Scaling
- Proceedings of the 5th International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES'02), 2002
Cited by 8 (2 self)
Energy consumption can be reduced by scaling down frequency when peak performance is not needed. A lower frequency permits slower circuits, and hence a lower supply voltage. Energy reduction comes from voltage reduction, a technique called Dynamic Voltage Scaling (DVS). This paper makes the case that the useful frequency range of DVS is limited because there is a lower bound on voltage. Lowering frequency permits voltage reduction until the lowest voltage is reached. Beyond that point, lowering frequency further does not save energy because voltage is constant. However, there is still opportunity for energy reduction outside the influence of DVS. If frequency is lowered enough, pairs of pipeline stages can be merged to form a shallower pipeline. The shallow pipeline has better instructions-per-cycle (IPC) than the deep pipeline. Since energy also depends on IPC, energy is reduced for a given frequency. Accordingly, we propose Dynamic Pipeline Scaling (DPS). A DPS-enabled deep pipeline can merge adjacent pairs of stages by making the intermediate latches transparent and disabling corresponding feedback paths. Thus, a DPS-enabled pipeline has a deep mode for higher frequencies within the influence of DVS, and a shallow mode for lower frequencies. Shallow mode extends the frequency range for which energy reduction is possible. For frequencies outside the influence of DVS, a DPS-enabled deep pipeline consumes from 23% to 40% less energy than a rigid deep pipeline.
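The argument in this abstract follows from the usual first-order model of dynamic energy, sketched below. This is a textbook approximation and not the paper's own model: it ignores leakage, assumes the switched capacitance per cycle is the same in deep and shallow mode, and the function name and all numeric values are hypothetical.

```python
def dynamic_energy(instructions, ipc, voltage, frequency, switched_capacitance=1.0):
    """First-order dynamic energy for a fixed instruction count (leakage ignored)."""
    power = switched_capacitance * voltage ** 2 * frequency   # P = C * V^2 * f
    run_time = instructions / (ipc * frequency)               # t = N / (IPC * f)
    return power * run_time                                   # E = C * V^2 * N / IPC


# At the voltage floor, halving the frequency leaves the energy unchanged ...
deep_fast = dynamic_energy(1e9, ipc=1.0, voltage=0.8, frequency=400e6)
deep_slow = dynamic_energy(1e9, ipc=1.0, voltage=0.8, frequency=200e6)

# ... but a shallower pipeline with a higher IPC (assumed 1.3 here) lowers it.
shallow_slow = dynamic_energy(1e9, ipc=1.3, voltage=0.8, frequency=200e6)
```

Frequency cancels out of the product, so once voltage can no longer drop, only the IPC term is left to improve, which is the opening that pipeline merging exploits.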
Dynamically trading frequency for complexity in a GALS microprocessor
- In Proceedings of the 37th International Symposium on Microarchitecture, 2004
Cited by 6 (1 self)
Microprocessors are traditionally designed to provide “best overall” performance across a wide range of applications and operating environments. Several groups have proposed hardware techniques that save energy by “downsizing” hardware resources that are underutilized by the current application phase. Others have proposed a different energy-saving approach: dividing the processor into domains and dynamically changing the clock frequency and voltage within each domain during phases when the full domain frequency is not required. What has not been studied to date is how to exploit the adaptive nature of these approaches to improve performance rather than to save energy. In this paper, we describe an adaptive globally asynchronous, locally synchronous (GALS) microprocessor with
Instruction Packing: Reducing Power and Delay of the Dynamic Scheduling Logic
- in Proc. ISLPED, 2005
Cited by 6 (3 self)
The instruction scheduling logic used in modern superscalar microprocessors often relies on associative searching of the issue queue entries to dynamically wake up instructions for execution. Traditional designs use one issue queue entry for each instruction, regardless of the actual number of operands actively used in the wakeup process. In this paper we propose Instruction Packing, a novel microarchitectural technique that reduces both the delay and the power consumption of the issue queue by sharing the associative part of an issue queue entry between two instructions, each with at most one non-ready register source operand at the time of dispatch. Our results show that Instruction Packing provides a 39% reduction in the power of the whole issue queue and a 21.6% reduction in the wakeup delay, with as little as 0.4% IPC degradation on average across the simulated SPEC benchmarks.
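The sharing rule described in this abstract can be sketched as a simple entry count. This is an illustrative assumption only: the function name is hypothetical, and the sketch assumes that any two instructions with at most one non-ready source operand may share an entry, whereas a real design may impose further pairing constraints.

```python
import math


def issue_queue_entries(non_ready_operands):
    """Count issue queue entries needed when eligible instructions are packed in pairs.

    non_ready_operands: number of non-ready register source operands
    (0, 1, or 2) of each instruction at dispatch time.
    """
    # Instructions with at most one non-ready source can share an entry.
    packable = sum(1 for n in non_ready_operands if n <= 1)
    # Instructions waiting on two sources need a whole entry each.
    unpackable = len(non_ready_operands) - packable
    return unpackable + math.ceil(packable / 2)


entries = issue_queue_entries([0, 1, 1, 2, 0])   # four packable, one unpackable
```

With one entry per instruction the same five instructions would occupy five entries; under the pairing assumption above they occupy three.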
Fetch gating control through speculative instruction window weighting
- In 2nd HiPEAC Conference, 2007
Cited by 6 (1 self)
In a dynamic reordering superscalar processor, the front-end fetches instructions and places them in the issue queue. Instructions are then issued by the back-end execution core. Until recently, the front-end was designed to maximize performance without considering energy consumption. The front-end fetches instructions as fast as it can until it is stalled by a filled issue queue or some other blocking structure. This approach wastes energy: (i) speculative execution causes many wrong-path instructions to be fetched and executed, and (ii) the back-end execution rate is usually less than its peak rate, but front-end structures are dimensioned to sustain peak performance. Dynamically reducing the front-end instruction rate and the active size of front-end structures (e.g. the issue queue) is a required performance-energy trade-off. Techniques proposed in the literature attack only one of these effects. In previous work, we have proposed Speculative Instruction Window Weighting (SIWW) [21], a fetch gating technique that addresses both fetch gating and dynamic sizing of the instruction issue queue. SIWW computes a global weight on the set of inflight instructions. This weight depends on the number and types of inflight instructions (non-branches, high confidence or low confidence branches, ...). The front-end instruction rate can be continuously adapted based on this weight. This paper extends the analysis of SIWW performed in previous work. It shows that SIWW performs better than previously proposed fetch gating techniques and that SIWW allows the size of the active instruction queue to be adapted dynamically.
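The weighting idea in this abstract can be sketched as follows. The abstract gives no weight values or gating rule, so the numbers in `WEIGHTS`, the threshold, and the simple on/off gating decision are all hypothetical choices made for illustration; the abstract itself describes continuous adaptation of the fetch rate rather than a binary gate.

```python
# Illustrative per-type weights only; the abstract does not state the values used.
WEIGHTS = {
    "non_branch": 1,
    "high_conf_branch": 2,
    "low_conf_branch": 8,
}


def window_weight(inflight):
    """Sum the weights of all inflight instructions, given as a list of type names."""
    return sum(WEIGHTS[kind] for kind in inflight)


def gate_fetch(inflight, threshold):
    """Return True when fetch should be gated because the window weight is too high."""
    return window_weight(inflight) > threshold


window = ["non_branch"] * 4 + ["low_conf_branch"]
weight = window_weight(window)             # 4 * 1 + 8 = 12
gated = gate_fetch(window, threshold=10)   # True: weight exceeds the threshold
```

Giving low-confidence branches a larger weight means a window that is likely to contain wrong-path work reaches the threshold sooner, which is the intuition behind weighting by instruction type.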
Power-Efficient Wakeup Tag Broadcast
- in Proceedings of the International Conference on Computer Design, 2005
Cited by 5 (0 self)
The dynamic instruction scheduling logic is one of the most critical components of modern superscalar microprocessors, both from the delay and power dissipation standpoints. The delay and energy required to drive the wakeup tags across the associatively-addressed issue queue account for a significant percentage of the scheduler’s overhead and also limit design scalability. We propose Tag Memoization and Tagline Folding, two schemes that reduce the power of wakeup tag broadcasts by reducing the number of tag bits that are driven in each broadcast. Our results show that the combination of these mechanisms provides a 22.3% average reduction in wakeup tag broadcast power with no impact on the IPC.

