Results 1 - 10 of 219
The predictability of data values
- In Proceedings of the 30th International Symposium on Microarchitecture, 1997
"... ..."
(Show Context)
Highly Accurate Data Value Prediction using Hybrid Predictors
, 1997
"... Data dependences (data flow constraints) present a major hurdle to the amount of instruction-level parallelism that can be exploited from a program. Recent work has suggested that the limits imposed by data dependences can be overcome to some extent with the use of data value prediction. That is, wh ..."
Cited by 211 (3 self)
Abstract:
Data dependences (data flow constraints) present a major hurdle to the amount of instruction-level parallelism that can be exploited from a program. Recent work has suggested that the limits imposed by data dependences can be overcome to some extent with the use of data value prediction. That is, when an instruction is fetched, its result can be predicted so that subsequent instructions that depend on the result can use this predicted value. When the correct result becomes available, all instructions that are data dependent on that prediction can be validated. This paper investigates a variety of techniques to carry out highly accurate data value predictions. The first technique investigates the potential of monitoring the strides by which the results produced by different instances of an instruction change. The second technique investigates the potential of pattern-based two-level prediction schemes. Simulation results of these two schemes show improvements over the existing method of predicting the last outcome. In particular, some benchmarks show improvement with the stride-based predictor and others show improvement with the pattern-based predictor. To do uniformly well across benchmarks, we combine these two predictors to form a hybrid predictor. Simulation analysis of the hybrid predictor shows its overall prediction accuracy to be better than that of the component predictors across all benchmarks.
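A small table-based model makes the stride-plus-pattern combination concrete. The C sketch below is a loose adaptation rather than the paper's actual design: the pattern component here indexes on a history of stride hits instead of a true value-history pattern, and the table sizes, indexing, and chooser thresholds are all invented for illustration.

```c
/* Minimal sketch of a hybrid value predictor; all sizes and the
 * simplified pattern component are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>

#define ENTRIES 1024

typedef struct {
    uint64_t last;          /* last result seen for this instruction   */
    int64_t  stride;        /* last observed delta between results     */
    uint8_t  hist;          /* recent stride-hit pattern (2-level-ish) */
    uint64_t pattern[256];  /* value remembered per history pattern    */
    int8_t   chooser;       /* >0: trust stride, <=0: trust pattern    */
} entry_t;

static entry_t tab[ENTRIES];

static entry_t *lookup(uint64_t pc) { return &tab[(pc >> 2) % ENTRIES]; }

uint64_t predict(uint64_t pc)
{
    entry_t *e = lookup(pc);
    uint64_t stride_pred  = e->last + (uint64_t)e->stride;
    uint64_t pattern_pred = e->pattern[e->hist];
    return (e->chooser > 0) ? stride_pred : pattern_pred;
}

void update(uint64_t pc, uint64_t actual)
{
    entry_t *e = lookup(pc);
    int stride_ok  = (e->last + (uint64_t)e->stride) == actual;
    int pattern_ok = e->pattern[e->hist] == actual;

    /* Saturating chooser drifts toward whichever component was right. */
    if (stride_ok && !pattern_ok && e->chooser < 3)  e->chooser++;
    if (pattern_ok && !stride_ok && e->chooser > -4) e->chooser--;

    e->pattern[e->hist] = actual;                      /* train pattern */
    e->hist   = (uint8_t)((e->hist << 1) | stride_ok); /* shift history */
    e->stride = (int64_t)(actual - e->last);
    e->last   = actual;
}

int main(void)
{
    /* A strided sequence: the stride side should win the chooser. */
    for (uint64_t i = 0; i < 8; i++) {
        uint64_t v = 100 + 8 * i;
        printf("pred=%llu actual=%llu\n",
               (unsigned long long)predict(0x400100),
               (unsigned long long)v);
        update(0x400100, v);
    }
    return 0;
}
```

On a strided sequence the chooser drifts toward the stride component; on repeating non-strided values it drifts toward the pattern table, which is the hybrid effect the abstract describes.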
Tolerating Memory Latency through Software-Controlled Pre-Execution in Simultaneous Multithreading Processors
- In Proceedings of the 28th Annual International Symposium on Computer Architecture, 2001
"... Hardly predictable data addresses in many irregular applications have rendered prefetching ineffective. In many cases, the only accurate way to predict these addresses is to directly execute the code that generates them. As multithreaded architectures become increasingly popular, one attractive appr ..."
Cited by 174 (0 self)
Abstract:
Hardly predictable data addresses in many irregular applications have rendered prefetching ineffective. In many cases, the only accurate way to predict these addresses is to directly execute the code that generates them. As multithreaded architectures become increasingly popular, one attractive approach is to use idle threads on these machines to perform pre-execution (essentially a combined act of speculative address generation and prefetching) to accelerate the main thread. In this paper, we propose such a pre-execution technique for simultaneous multithreading (SMT) processors. By using software to control pre-execution, we are able to handle some of the most important access patterns that are typically difficult to prefetch. Compared with existing work on pre-execution, our technique is significantly simpler to implement (e.g., no integration of pre-execution results, no need to shorten programs for pre-execution, and no special hardware to copy register values upon thread spawns). Consequently, only minimal extensions to SMT machines are required to support our technique. Despite its simplicity, our technique offers an average speedup of 24% in a set of irregular applications, which is a 19% speedup over state-of-the-art software-controlled prefetching.
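The idea of directly executing the code that generates the addresses can be illustrated in plain software, assuming a POSIX-threads stand-in for an SMT hardware context. Everything here (the node layout, the helper function, the list length) is hypothetical; the sketch only shows the pattern of running a stripped-down copy of the main loop ahead of it.

```c
/* Illustrative sketch of software-controlled pre-execution: a helper
 * thread runs a stripped-down copy of the main loop purely to generate
 * addresses and touch them, warming the cache for the main thread.
 * This is example code, not the paper's SMT mechanism itself. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct node { struct node *next; long payload; long pad[14]; } node_t;

static volatile long sink;   /* keeps the pre-execution loads alive */

static void *pre_execute(void *head)
{
    /* Same pointer chase as the main loop, minus the real work:
     * only the address-generating code is kept. */
    for (node_t *n = head; n; n = n->next)
        sink += n->payload;          /* touch the line to prefetch it */
    return NULL;
}

int main(void)
{
    enum { N = 1 << 16 };
    node_t *nodes = calloc(N, sizeof *nodes);
    for (int i = 0; i < N - 1; i++) nodes[i].next = &nodes[i + 1];
    for (int i = 0; i < N; i++) nodes[i].payload = i;

    pthread_t helper;
    pthread_create(&helper, NULL, pre_execute, &nodes[0]);

    long total = 0;                  /* the "main thread" computation */
    for (node_t *n = &nodes[0]; n; n = n->next)
        total += 2 * n->payload;

    pthread_join(helper, NULL);
    printf("total=%ld\n", total);
    free(nodes);
    return 0;
}
```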
Managing Wire Delay in Large Chip-Multiprocessor Caches
- In IEEE/ACM International Symposium on Microarchitecture, 2004
"... In response to increasing (relative) wire delay, architects have proposed various technologies to manage the impact of slow wires on large uniprocessor L2 caches. Block migration (e.g., D-NUCA and NuRapid) reduces average hit latency by migrating frequently used blocks towards the lower-latency bank ..."
Cited by 157 (4 self)
Abstract:
In response to increasing (relative) wire delay, architects have proposed various technologies to manage the impact of slow wires on large uniprocessor L2 caches. Block migration (e.g., D-NUCA and NuRapid) reduces average hit latency by migrating frequently used blocks towards the lower-latency banks. Transmission Line Caches (TLC) use on-chip transmission lines to provide low latency to all banks. Traditional stride-based hardware prefetching strives to tolerate, rather than reduce, latency. Chip multiprocessors (CMPs) present additional challenges. First, CMPs often share the on-chip L2 cache, requiring multiple ports to provide sufficient bandwidth. Second, multiple threads mean multiple working sets, which compete for limited on-chip storage. Third, sharing code and data interferes with block migration, since one processor's low-latency bank is another processor's high-latency bank. In this paper, we develop L2 cache designs for CMPs that incorporate these three latency management techniques. We use detailed full-system simulation to analyze the performance trade-offs for both commercial and scientific workloads. First, we demonstrate that block migration is less effective for CMPs because 40-60% of L2 cache hits in commercial workloads are satisfied in the central banks, which are equally far from all processors. Second, we observe that although transmission lines provide low latency, contention for their restricted bandwidth limits their performance. Third, we show that stride-based prefetching between L1 and L2 caches alone improves performance by at least as much as the other two techniques. Finally, we present a hybrid design combining all three techniques that improves performance by an additional 2% to 19% over prefetching alone.
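The stride-based prefetching the paper credits with most of the benefit is conventionally built around a reference prediction table. The sketch below is a minimal, assumption-laden version: a small table keyed by load PC, a per-entry stride, and a saturating confidence counter gating prefetch issue; the sizes and threshold are illustrative only.

```c
/* Minimal reference-prediction-table stride prefetcher sketch;
 * table size and confidence threshold are illustrative guesses. */
#include <stdint.h>
#include <stdio.h>

#define RPT_ENTRIES 256

typedef struct {
    uint64_t last_addr;
    int64_t  stride;
    uint8_t  confidence;   /* saturating counter; prefetch when >= 2 */
} rpt_entry_t;

static rpt_entry_t rpt[RPT_ENTRIES];

/* Called on every load; returns the address to prefetch, or 0. */
uint64_t rpt_access(uint64_t pc, uint64_t addr)
{
    rpt_entry_t *e = &rpt[(pc >> 2) % RPT_ENTRIES];
    int64_t delta = (int64_t)(addr - e->last_addr);

    if (delta == e->stride && delta != 0) {
        if (e->confidence < 3) e->confidence++;   /* stride confirmed */
    } else {
        e->confidence = 0;                        /* retrain on new stride */
        e->stride = delta;
    }
    e->last_addr = addr;
    return (e->confidence >= 2) ? addr + (uint64_t)e->stride : 0;
}

int main(void)
{
    /* A load walking an array at 64-byte strides starts triggering
     * prefetches once confidence builds. */
    for (int i = 0; i < 6; i++) {
        uint64_t pf = rpt_access(0x400840, 0x10000 + 64u * (unsigned)i);
        if (pf) printf("prefetch 0x%llx\n", (unsigned long long)pf);
    }
    return 0;
}
```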
Selective value prediction
- In 26th Annual International Symposium on Computer Architecture, 1999
"... Value Prediction is a relatively new technique to increase instruction-level parallelism by breaking true data dependence chains. A value prediction architecture produces values, which may be later consumed by instructions that execute speculatively using the predicted value. This paper examines sel ..."
Cited by 138 (13 self)
Abstract:
Value prediction is a relatively new technique to increase instruction-level parallelism by breaking true data dependence chains. A value prediction architecture produces values, which may be later consumed by instructions that execute speculatively using the predicted value. This paper examines selective techniques for using value prediction in the presence of predictor capacity constraints and reasonable misprediction penalties. We examine prediction and confidence mechanisms in light of these constraints, and we minimize capacity conflicts through instruction filtering. The latter technique filters which instructions put values into the value prediction table. We examine filtering techniques based on instruction type, as well as giving priority to instructions belonging to the longest data dependence path in the processor's active instruction window. We apply filtering both to the producers of predicted values and to the consumers. In addition, we examine the benefit of using different confidence levels for instructions using predicted values on the longest dependence path.
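A hedged sketch of the selective mechanisms the abstract lists: a filter decides which instructions may enter the value prediction table, and a confidence threshold gates whether a consumer actually uses the prediction. The instruction-type test, criticality flag, table size, and thresholds below are all invented for illustration, not the paper's configuration.

```c
/* Selective value prediction sketch: filtered table insertion plus
 * confidence-gated use. All parameters are illustrative. */
#include <stdint.h>
#include <stdio.h>

typedef enum { OP_LOAD, OP_ALU, OP_STORE, OP_BRANCH } op_t;

typedef struct { uint64_t value; uint8_t conf; } vpt_entry_t;

#define VPT_ENTRIES 512
static vpt_entry_t vpt[VPT_ENTRIES];

/* Filter: only instruction types worth predicting enter the table. */
static int admit(op_t op, int on_critical_path)
{
    return (op == OP_LOAD || op == OP_ALU) && on_critical_path;
}

int vpt_predict(uint64_t pc, op_t op, int critical, uint64_t *pred)
{
    if (!admit(op, critical)) return 0;
    vpt_entry_t *e = &vpt[(pc >> 2) % VPT_ENTRIES];
    /* The paper varies the required confidence by dependence-path
     * position; a single threshold stands in for that here. */
    if (e->conf < 2) return 0;
    *pred = e->value;
    return 1;
}

void vpt_update(uint64_t pc, op_t op, int critical, uint64_t actual)
{
    if (!admit(op, critical)) return;
    vpt_entry_t *e = &vpt[(pc >> 2) % VPT_ENTRIES];
    if (e->value == actual) { if (e->conf < 3) e->conf++; }
    else { e->value = actual; e->conf = 0; }
}

int main(void)
{
    uint64_t pred;
    for (int i = 0; i < 4; i++) {
        vpt_update(0x401000, OP_LOAD, 1, 7);  /* stable value trains conf */
        if (vpt_predict(0x401000, OP_LOAD, 1, &pred))
            printf("confident prediction: %llu\n", (unsigned long long)pred);
    }
    return 0;
}
```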
Handling Long-latency Loads in a Simultaneous Multithreading Processor
, 2001
"... Simultaneous multithreading architectures have been defined previously with fully shared execution resources. When one thread in such an architecture experiences a very longlatency operation, such as a load miss, the thread will eventually stall, potentially holding resources which other threads cou ..."
Cited by 125 (10 self)
Abstract:
Simultaneous multithreading architectures have been defined previously with fully shared execution resources. When one thread in such an architecture experiences a very long-latency operation, such as a load miss, the thread will eventually stall, potentially holding resources which other threads could be using to make forward progress. This paper shows that in many cases it is better to free the resources associated with a stalled thread rather than keep that thread ready to immediately begin execution upon return of the loaded data. Several possible architectures are examined, and some simple solutions are shown to be very effective, achieving speedups close to 6.0 in some cases, and averaging 15% speedup with four threads and over 100% speedup with two threads running. Response times are cut in half for several workloads in open-system experiments.
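The core policy (free a stalled thread's shared resources rather than let it sit on them) can be written out as a toy per-cycle rule. The thread structure, the 20-cycle trigger, and the shared pool below are all invented; the paper examines several such mechanisms, and this only illustrates the simplest flush-on-long-miss flavor.

```c
/* Toy model of freeing a stalled SMT thread's shared resources;
 * all structures and thresholds are invented for illustration. */
#include <stdio.h>

#define THREADS 4
#define MISS_FLUSH_THRESHOLD 20   /* cycles before we give up waiting */

typedef struct {
    int stalled_cycles;   /* cycles the oldest load has been outstanding */
    int holds_entries;    /* instruction-queue/ROB entries held          */
} thread_t;

static int free_pool;     /* shared entries available to other threads  */

/* Each cycle: a thread stalled too long on a load is flushed past the
 * load, releasing its shared resources until the data returns. */
void tick(thread_t *t)
{
    if (t->stalled_cycles++ > MISS_FLUSH_THRESHOLD && t->holds_entries) {
        free_pool += t->holds_entries;   /* release to other threads    */
        t->holds_entries = 0;            /* re-fetch after data returns */
        printf("flushed a stalled thread; pool now %d\n", free_pool);
    }
}

int main(void)
{
    thread_t threads[THREADS] = { { 0, 32 }, { 0, 0 }, { 0, 0 }, { 0, 0 } };
    for (int cycle = 0; cycle < 25; cycle++)
        for (int i = 0; i < THREADS; i++)
            tick(&threads[i]);
    return 0;
}
```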
Data Prefetch Mechanisms
, 2000
"... The expanding gap between microprocessor and DRAM performance has necessitated the use of increasingly aggressive techniques designed to reduce or hide the latency of main memory access. Although large cache hierarchies have proven to be effective in reducing this latency for the most frequently use ..."
Cited by 110 (4 self)
Abstract:
The expanding gap between microprocessor and DRAM performance has necessitated the use of increasingly aggressive techniques designed to reduce or hide the latency of main memory access. Although large cache hierarchies have proven to be effective in reducing this latency for the most frequently used data, it is still not uncommon for many programs to spend more than half their run times stalled on memory requests. Data prefetching has been proposed as a technique for hiding the access latency of data referencing patterns that defeat caching strategies. Rather than waiting for a cache miss to initiate a memory fetch, data prefetching anticipates such misses and issues a fetch to the memory system in advance of the actual memory reference. To be effective, prefetching must be implemented in such a way that prefetches are timely, useful, and introduce little overhead. Secondary effects such as cache pollution and increased memory bandwidth requirements must also be taken into consideration. Despite these obstacles, prefetching has the potential to significantly improve overall program execution time by overlapping computation with memory accesses.
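The survey's central mechanism (issue the fetch before the reference so latency overlaps computation) has a direct software analogue using the GCC/Clang __builtin_prefetch intrinsic. The prefetch distance of 16 elements below is a placeholder, not a tuned value; real distances depend on loop work and memory latency.

```c
/* Software prefetching sketch: fetch a[i + d] while summing a[i],
 * so the memory access overlaps the computation. */
#include <stdio.h>
#include <stdlib.h>

#define PREFETCH_DISTANCE 16   /* illustrative, not tuned */

long sum_with_prefetch(const long *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0 /* read */, 1);
        total += a[i];          /* the fetch for a[i] was issued earlier */
    }
    return total;
}

int main(void)
{
    enum { N = 1 << 20 };
    long *a = malloc(N * sizeof *a);
    for (size_t i = 0; i < N; i++) a[i] = (long)i;
    printf("sum=%ld\n", sum_with_prefetch(a, N));
    free(a);
    return 0;
}
```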
Predictor-directed stream buffers
- In 33rd International Symposium on Microarchitecture, 2000
"... An effective method for reducing the effect of load latency in modern processors is data prefetching. One form of data prefetching, stream buffers, has been shown to be particularly effective due to its ’ ability to detect data streams and run ahead of them, prefetching as it goes. Unfortunately, in ..."
Cited by 70 (12 self)
Abstract:
An effective method for reducing the effect of load latency in modern processors is data prefetching. One form of data prefetching, stream buffers, has been shown to be particularly effective due to its ability to detect data streams and run ahead of them, prefetching as it goes. Unfortunately, in the past, the applicability of streaming was limited to stride-intensive code. In this paper we propose Predictor-Directed Stream Buffers (PSB), a scheme in which the stream buffer follows an address prediction stream instead of a fixed stride. In addition, we examine using confidence techniques to guide the allocation and prioritization of stream buffers and their prefetch requests. Our results show that, for pointer-based applications, PSB provides a 30% speedup on average over no prefetching, and an average 10% speedup over previously proposed stride-based stream buffers for pointer-intensive applications.
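A predictor-directed stream buffer differs from a classic one only in where the next address comes from. In the sketch below, a stand-in Markov-style table (last successor seen per address) feeds the buffer; the paper's actual predictors and confidence-based allocation heuristics are richer, and all sizes here are illustrative.

```c
/* Predictor-directed stream buffer sketch: the buffer is refilled by
 * querying an address predictor rather than adding a fixed stride. */
#include <stdint.h>
#include <stdio.h>

#define DEPTH 4
#define PRED_ENTRIES 1024

static uint64_t markov_next[PRED_ENTRIES];   /* last successor seen */

static uint64_t predict_next(uint64_t addr)
{
    return markov_next[(addr >> 6) % PRED_ENTRIES];
}

static void train(uint64_t addr, uint64_t next)
{
    markov_next[(addr >> 6) % PRED_ENTRIES] = next;
}

typedef struct { uint64_t entries[DEPTH]; } stream_buf_t;

/* (Re)fill the buffer by repeatedly querying the predictor. */
void refill(stream_buf_t *sb, uint64_t miss_addr)
{
    uint64_t a = miss_addr;
    for (int i = 0; i < DEPTH; i++) {
        a = predict_next(a);
        sb->entries[i] = a;          /* would be issued as prefetches */
    }
}

int main(void)
{
    /* Teach the predictor a pointer-chasing pattern A->B->C->D->E. */
    uint64_t chain[] = { 0x1000, 0x7340, 0x2480, 0x9cc0, 0x5100 };
    for (int i = 0; i < 4; i++) train(chain[i], chain[i + 1]);

    stream_buf_t sb = { {0} };
    refill(&sb, chain[0]);
    for (int i = 0; i < DEPTH; i++)
        printf("prefetch 0x%llx\n", (unsigned long long)sb.entries[i]);
    return 0;
}
```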
Implementations of Context Based Value Predictors
, 1997
"... Execution paradigms that eliminate data dependences based on value prediction have been shown to have significant performance potential. High accuracy value prediction is essential for the success of such paradigms. Recently it was shown that context based prediction can predict values with high a ..."
Cited by 59 (3 self)
Abstract:
Execution paradigms that eliminate data dependences based on value prediction have been shown to have significant performance potential. High-accuracy value prediction is essential for the success of such paradigms. Recently, it was shown that context-based prediction can predict values with high accuracy.
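Context-based prediction is usually illustrated with the finite context method: hash the last few values an instruction produced and remember what followed that context. The sketch below keeps a single global history for brevity (a real predictor keeps one per static instruction), and the hash, order, and table size are arbitrary choices.

```c
/* Finite-context-method value predictor sketch: the last ORDER values
 * form a context; the table stores the value that followed it. */
#include <stdint.h>
#include <stdio.h>

#define ORDER 3           /* context = last 3 values */
#define TABLE (1 << 12)   /* value prediction table  */

static uint64_t history[ORDER];   /* per-instruction in a real design */
static uint64_t table[TABLE];

static size_t hash_context(void)
{
    uint64_t h = 0;
    for (int i = 0; i < ORDER; i++)
        h = h * 0x9e3779b97f4a7c15ull + history[i];  /* simple mix */
    return (size_t)(h % TABLE);
}

uint64_t fcm_predict(void) { return table[hash_context()]; }

void fcm_update(uint64_t actual)
{
    table[hash_context()] = actual;        /* value that follows context */
    for (int i = 0; i < ORDER - 1; i++)    /* shift context window       */
        history[i] = history[i + 1];
    history[ORDER - 1] = actual;
}

int main(void)
{
    /* A repeating, non-strided sequence is where context-based
     * prediction beats last-value and stride schemes. */
    uint64_t seq[] = { 3, 1, 4, 1, 5, 3, 1, 4, 1, 5, 3, 1, 4, 1, 5 };
    for (int i = 0; i < 15; i++) {
        printf("pred=%llu actual=%llu\n",
               (unsigned long long)fcm_predict(),
               (unsigned long long)seq[i]);
        fcm_update(seq[i]);
    }
    return 0;
}
```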
Predictive Techniques for Aggressive Load Speculation
- In 31st International Symposium on Microarchitecture, 1998
"... Load latency remains a significant bottleneck in dynamically scheduled pipelined processors. Load speculation techniques have been proposed to reduce this latency. Dependence Prediction can be used to allow loads to be issued before all prior store addresses are known, and to predict exactly which s ..."
Cited by 53 (9 self)
Abstract:
Load latency remains a significant bottleneck in dynamically scheduled pipelined processors. Load speculation techniques have been proposed to reduce this latency. Dependence Prediction can be used to allow loads to be issued before all prior store addresses are known, and to predict exactly which store a load should wait upon. Address Prediction can be used to allow a load to bypass the calculation of its effective address and speculatively issue. Value Prediction can be used to bypass the load-forward latency and avoid cache misses. Memory Renaming has been proposed to communicate stored values directly to aliased loads. In this paper we examine in detail the interaction and performance tradeoffs of these four load speculation techniques in the presence of two mis-speculation recovery architectures, re-execution and squash. We examine the performance of combining these techniques to create a load speculation chooser which provides performance improvement over using any one technique...
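The chooser the paper builds can be caricatured as per-load saturating counters, one per technique, trained on whether each technique would have speculated correctly. The four slots mirror the abstract's taxonomy (dependence, address, and value prediction, plus memory renaming); the counter widths, penalty asymmetry, and table below are invented.

```c
/* Load speculation chooser sketch: per static load, pick the technique
 * whose confidence counter is highest. All parameters are illustrative. */
#include <stdint.h>
#include <stdio.h>

enum { DEP_PRED, ADDR_PRED, VALUE_PRED, MEM_RENAME, NUM_TECH };

typedef struct { int8_t conf[NUM_TECH]; } chooser_t;

#define ENTRIES 512
static chooser_t chooser[ENTRIES];

int choose(uint64_t load_pc)
{
    chooser_t *c = &chooser[(load_pc >> 2) % ENTRIES];
    int best = DEP_PRED;
    for (int t = 1; t < NUM_TECH; t++)
        if (c->conf[t] > c->conf[best]) best = t;
    return best;
}

/* After the load resolves, reward techniques that would have been right
 * and penalize ones that would have mis-speculated; the squash or
 * re-execution cost motivates the asymmetric penalty. */
void train(uint64_t load_pc, const int was_correct[NUM_TECH])
{
    chooser_t *c = &chooser[(load_pc >> 2) % ENTRIES];
    for (int t = 0; t < NUM_TECH; t++) {
        int v = c->conf[t] + (was_correct[t] ? 1 : -2);
        if (v >  7) v =  7;
        if (v < -8) v = -8;
        c->conf[t] = (int8_t)v;
    }
}

int main(void)
{
    int outcome[NUM_TECH] = { 1, 0, 1, 1 };   /* toy resolution result */
    for (int i = 0; i < 4; i++) train(0x400a10, outcome);
    printf("chosen technique: %d\n", choose(0x400a10));
    return 0;
}
```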