Slack: Maximizing Performance Under Technological Constraints
, 2002
Cited by 46 (3 self)
Many emerging processor microarchitectures seek to manage technological constraints (e.g., wire delay, power, and circuit complexity) by resorting to nonuniform designs that provide resources at multiple quality levels (e.g., fast/slow bypass paths, multi-speed functional units, and grid architectures). In such designs, the constraint problem becomes a control problem, and the challenge becomes designing a control policy that mitigates the performance penalty of the non-uniformity. Given the increasing importance of non-uniform control policies, we believe it is appropriate to examine them in their own right. To this end, we develop slack for use in creating control policies that match program execution behavior to machine design. Intuitively, the slack of a dynamic instruction i is the number of cycles i can be delayed with no effect on execution time. This property makes slack a natural candidate for hiding non-uniform latencies. We make three contributions in our exploration of slack. First, we formally define slack, distinguish three variants (local, global and apportioned), and perform a limit study to show that slack is prevalent in our SPEC2000 workload. Second, we show how to predict slack in hardware. Third, we illustrate how to create a control policy based on slack for steering instructions among fast (high power) and slow (lower power) pipelines.
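The intuition behind slack can be illustrated on a small dependence graph. The sketch below is a simplified model, not the paper's formal definition: it assumes a pure dataflow graph with fixed latencies and no resource constraints, and the function name `global_slack` and the example instructions are hypothetical. It computes, for each instruction, how many cycles it can be delayed without lengthening total execution time.

```python
from collections import defaultdict


def global_slack(latency, edges):
    """Return (slack per node, total time) for a dataflow graph.

    latency: dict mapping node -> latency in cycles, listed in
             topological (program) order (an assumption of this sketch).
    edges:   list of (producer, consumer) dependences.
    """
    nodes = list(latency)
    preds, succs = defaultdict(list), defaultdict(list)
    for producer, consumer in edges:
        succs[producer].append(consumer)
        preds[consumer].append(producer)

    # Forward pass: earliest cycle at which each node can start.
    earliest = {}
    for n in nodes:
        earliest[n] = max((earliest[p] + latency[p] for p in preds[n]), default=0)
    total = max(earliest[n] + latency[n] for n in nodes)

    # Backward pass: latest start that still finishes by `total`.
    latest = {}
    for n in reversed(nodes):
        latest_finish = min((latest[s] for s in succs[n]), default=total)
        latest[n] = latest_finish - latency[n]

    return {n: latest[n] - earliest[n] for n in nodes}, total


# Hypothetical four-instruction example.
latency = {"load": 3, "add": 1, "mul": 1, "store": 1}
edges = [("load", "add"), ("add", "store"), ("mul", "store")]
slack, total = global_slack(latency, edges)
print(total)  # 5
print(slack)  # {'load': 0, 'add': 0, 'mul': 3, 'store': 0}
```

Here `mul` has three cycles of slack, so under these assumptions it could run on a slower, lower-power unit without affecting execution time, while the instructions with zero slack could not.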
Power-Aware Control Speculation through Selective Throttling
- In Proceedings of the Ninth International Symposium on High-Performance Computer Architecture
, 2003
Using Interaction Costs for Microarchitectural Bottleneck Analysis
- Abstract appears in 36th International Symposium on Microarchitecture (MICRO ’03)
, 2003
Cited by 21 (2 self)
Attacking bottlenecks in modern processors is difficult because many microarchitectural events overlap with each other. This parallelism makes it difficult to both (a) assign a cost to an event (e.g., to one of two overlapping cache misses) and (b) assign blame for each cycle (e.g., for a cycle where many overlapping resources are active). This paper introduces a new model for understanding event costs to facilitate processor design and optimization. First, we observe that everything in a machine (instructions, hardware structures, events) can interact in only one of two ways (in parallel or serially). We quantify these interactions by defining interaction cost, which can be zero (independent, no interaction), positive (parallel), or negative (serial). Second, we illustrate the value of using interaction costs in processor design and optimization. Finally, we propose performance-monitoring hardware for measuring interaction costs that is suitable for modern processors.
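The sign convention for interaction cost can be checked with a toy model. The abstract does not spell out the formula, so the sketch below assumes the usual two-event form: the cost of a set of events is the execution time saved by idealizing them, and icost(a, b) = cost({a, b}) - cost({a}) - cost({b}). The execution model (time equals the longest of several paths of events) and all names are hypothetical simplifications.

```python
def exec_time(paths, idealized=frozenset()):
    """Execution time = longest path; idealized events take zero cycles."""
    return max(
        sum(0 if event in idealized else cycles for event, cycles in path)
        for path in paths
    )


def cost(paths, events):
    """Cycles saved when every event in `events` is idealized."""
    return exec_time(paths) - exec_time(paths, frozenset(events))


def interaction_cost(paths, a, b):
    """Assumed definition: icost(a, b) = cost({a, b}) - cost({a}) - cost({b})."""
    return cost(paths, {a, b}) - cost(paths, {a}) - cost(paths, {b})


# Two fully overlapping misses: removing either one alone saves nothing.
parallel = [[("a", 100)], [("b", 100)]]
# Two misses in series, overlapped by a separate 150-cycle path.
serial = [[("a", 100), ("b", 100)], [("other", 150)]]
# Two misses in series with nothing else limiting execution.
independent = [[("a", 100), ("b", 100)]]

print(interaction_cost(parallel, "a", "b"))     # 100 (positive: parallel)
print(interaction_cost(serial, "a", "b"))       # -50 (negative: serial)
print(interaction_cost(independent, "a", "b"))  # 0   (independent)
```

In the parallel case the full 100-cycle saving appears only when both events are removed together, which is why neither event's individual cost captures it.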
Quantifying Instruction Criticality
, 2002
Cited by 17 (2 self)
Information about instruction criticality can be used to control the application of micro-architectural resources efficiently. To this end, several groups have proposed methods to predict critical instructions. This paper presents a framework that allows us to directly measure the criticality of individual dynamic instructions. This allows us to (1) measure the accuracy of proposed critical path predictors, (2) quantify the amount of slack present in non-critical instructions, and (3) provide a new metric, called tautness, which ranks critical instructions by their dominance on the critical path. This research investigates methods for improving critical path predictor accuracy and studies the distribution of slack and tautness in programs. It shows that instruction criticality changes dynamically, and that criticality history patterns can be used to significantly improve predictor accuracy.
Interaction cost and shotgun profiling
- ACM Transactions on Architecture and Code Optimization
, 2004
Cited by 16 (2 self)
We observe that the challenges software optimizers and microarchitects face every day boil down to a single problem: bottleneck analysis. A bottleneck is any event or resource that contributes to execution time, such as a critical cache miss or window stall. Tasks such as tuning processors for energy efficiency and finding the right loads to prefetch all require measuring the performance costs of bottlenecks. In the past, simple event counts were enough to find the important bottlenecks. Today, the parallelism of modern processors makes such analysis much more difficult, rendering traditional performance counters less useful. If two microarchitectural events (such as a fetch stall and a cache miss) occur in the same cycle, which event should we blame for the cycle? What cost should we assign to each event? In this paper, we introduce a new model for understanding event costs to facilitate processor design and optimization. First, we observe that all instructions, hardware structures, and events in a machine can interact in only one of two ways (in parallel or serially). We quantify these interactions by defining interaction cost, which can be zero (independent, no interaction), positive (parallel), or negative (serial). Second, we illustrate the value of using interaction costs in processor design and optimization.
Critical Path Analysis of the TRIPS Architecture
- In IEEE International Symposium on Performance Analysis of Systems and Software
, 2006
Cited by 10 (9 self)
Fast, accurate, and effective performance analysis is essential for the design of modern processor architectures and improving application performance. Recent trends toward highly concurrent processors make this goal increasingly difficult. Conventional techniques, based on simulators and performance monitors, are ill-equipped to analyze how a plethora of concurrent events interact and how they affect performance. Prior research has shown the utility of critical path analysis in solving this problem [5, 18]. This analysis abstracts the execution of a program with a dependence graph. With simple manipulations on the graph, designers can gain insights into the bottlenecks of a design. This paper extends critical path analysis to understand the performance of a next-generation, high-ILP architecture. The TRIPS architecture introduces new features not present in conventional superscalar architectures. We show how dependence constraints introduced by these features, specifically the execution model and operand communication links, can be modeled with a dependence graph. We describe a new algorithm that tracks critical path information at a fine-grained level and yet can deliver an order of magnitude (30x) improvement in performance over previously proposed techniques [5, 18]. Finally, we provide a breakdown of the critical path for a select set of benchmarks and show an example where we use this information to improve the performance of a heavily hand-optimized program by as much as 11%.
A Dynamically Reconfigurable Mixed In-Order/Out-of-Order Issue Queue for Power-Aware Microprocessors
- In ISVLSI
, 2003
Cited by 5 (1 self)
In this work we focus on power-aware solutions for the issue queue in an out-of-order superscalar processor. We propose two different schemes. Our first approach partitions the issue queue into FIFOs such that only the instructions at the head of each FIFO may request to issue. We then dynamically monitor the FIFO usage and disable FIFOs that are not being efficiently used. In our second approach we also use a FIFO scheme, but dynamically vary the number and size of each FIFO simultaneously while at the same time keeping the total number of issue queue entries constant. We analyze both approaches and compare them in terms of the performance and power reduction. We find that although the first scheme of completely disabling issue queue entries is more straightforward to implement, it may not be the best option, particularly for floating point applications. Our best experimental result shows an average power saving of 27.3% in the issue queue with a performance degradation of only 2.7%.
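The head-only issue rule of the first scheme can be sketched in a few lines. This is an illustrative model only: the function `issue_cycle`, the fixed scan order over FIFOs, and the instruction names are assumptions, and the paper's actual selection logic and FIFO-disabling policy are not described in the abstract.

```python
from collections import deque


def issue_cycle(fifos, ready, issue_width):
    """Issue up to `issue_width` instructions in one cycle.

    Only the instruction at the head of each FIFO may request to issue;
    ready instructions buried behind a stalled head must wait.
    """
    issued = []
    for queue in fifos:
        if len(issued) == issue_width:
            break
        if queue and queue[0] in ready:
            issued.append(queue.popleft())
    return issued


# Hypothetical state: i3 and i4 are ready but sit behind non-ready heads.
fifos = [deque(["i0", "i3"]), deque(["i1"]), deque(["i2", "i4"])]
ready = {"i1", "i3", "i4"}
print(issue_cycle(fifos, ready, issue_width=2))  # ['i1']
```

Only `i1` issues even though three instructions are ready, which shows the performance cost that this kind of design trades for simpler, lower-power wakeup and select logic.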
Microarchitectural techniques to reduce interconnect power in clustered processors
- In Proc. of the Workshop on Complexity Effective Design
, 2004
Cited by 5 (0 self)
The paper presents a preliminary evaluation of novel techniques that address a growing problem – power dissipation in on-chip interconnects. Recent studies have shown that around 50% of the dynamic power consumption in modern processors is within on-chip interconnects. The contribution of interconnect power to total chip power is expected to be higher in future communication-bound billion-transistor architectures. In this paper, we propose the design of a heterogeneous interconnect, where some wires are optimized for low latency and others are optimized for low power. We show that a large fraction of on-chip communications are latency insensitive. Effecting these non-critical transfers on low-power long-latency interconnects can result in significant power savings without unduly affecting performance. Two primary techniques are evaluated in this paper: (i) a dynamic critical path predictor that identifies results that are not urgently consumed, and (ii) an address prediction mechanism that requires addresses to be transferred off the critical path for verification purposes. Our results demonstrate that 49% of all interconnect transfers can be effected on power-efficient wires, while incurring a performance penalty of only 2.5%.
Combining Software and Hardware Monitoring for Improved Power and Performance Tuning
- In 7th Annual Workshop on Interaction Between Compilers and Computer Architecture (INTERACT-7)
, 2003
Cited by 5 (1 self)
By anticipating when resources will be idle, it is possible to reconfigure the hardware to reduce power consumption without significantly reducing performance. This requires predicting what the resource requirements will be for an application. In the past, researchers have taken one of two approaches: design hardware monitors that can measure recent performance, or profile the application to determine the most likely behavior for each block of code. This paper explores a third option, which is to combine hardware monitoring with software profiling to achieve lower power utilization than either method alone. We demonstrate the potential for this approach in two ways. First, we compare hardware monitoring and software profiling of IPC for code blocks and show that they capture different information. By combining them, we can control issue width and ALU usage more effectively to save more power. Second, we show that anticipating stalls due to critical load misses in the L2 cache can enable fetch halting. However, hardware monitoring and software profiling must be used together to effectively predict misses and criticality of loads.
Exploiting Compiler-Generated Schedules for Energy Savings in High-Performance Processors
- In Proceedings of the 2003 International Symposium on Low Power Electronics and Design
, 2003
Cited by 4 (1 self)
This paper develops a technique that uniquely combines the advantages of static scheduling and dynamic scheduling to reduce the energy consumed in modern superscalar processors with out-of-order issue logic. In this HybridScheduling paradigm, regions of the application containing large amounts of parallelism visible at compile-time completely bypass the dynamic scheduling logic and execute in a low power static mode. Simulation studies using the Wattch framework on several media and scientific benchmarks demonstrate large improvements in overall energy consumption of 43% in kernels and 25% in full applications with only a 2.8% performance degradation on average.

