Results 1 - 10 of 110
Speculative Precomputation: Long-range Prefetching of Delinquent Loads
2001
"... This paper explores Speculative Precomputation, a technique that uses idle thread contexts in a multithreaded architecture to improve performance of single-threaded applications. It attacks program stalls from data cache misses by pre-computing future memory accesses in available thread contexts, an ..."
Abstract
-
Cited by 180 (23 self)
- Add to MetaCart
This paper explores Speculative Precomputation, a technique that uses idle thread contexts in a multithreaded architecture to improve performance of single-threaded applications. It attacks program stalls from data cache misses by pre-computing future memory accesses in available thread contexts, and prefetching these data. This technique is evaluated by simulating the performance of a research processor based on the Itanium™ ISA supporting Simultaneous Multithreading. Two primary forms of Speculative Precomputation are evaluated. If only the non-speculative thread spawns speculative threads, performance gains of up to 30% are achieved when assuming ideal hardware. However, this speedup drops considerably with more realistic hardware assumptions. Permitting speculative threads to directly spawn additional speculative threads reduces the overhead associated with spawning threads and enables significantly more aggressive speculation, overcoming this limitation. Even with realistic costs for spawning threads, speedups as high as 169% are achieved, with an average speedup of 76%.
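The mechanism lends itself to a software analogue. Below is a minimal C sketch, assuming POSIX threads stand in for the spare SMT contexts the paper uses: a helper thread runs the address-generating slice for the delinquent load data[idx[i]] a fixed distance ahead of the main thread and issues prefetches. The constant AHEAD and the spin-based throttle are illustrative assumptions, not the paper's hardware spawning mechanism.

```c
/* Sketch only: the paper spawns p-slices in spare SMT hardware contexts,
 * which plain C cannot express, so a POSIX thread stands in for one. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define N     (1 << 22)
#define AHEAD 512                 /* how far the slice runs ahead */

static int *idx;                  /* hard-to-predict indirection table */
static double *data;
static atomic_long main_i;        /* progress of the non-speculative thread */

/* the p-slice: only the address computation of the delinquent load */
static void *precompute_slice(void *arg) {
    (void)arg;
    for (long i = 0; i < N; i++) {
        while (i > atomic_load(&main_i) + AHEAD)
            ;                                    /* throttle: stay AHEAD */
        __builtin_prefetch(&data[idx[i]], 0, 1); /* long-range prefetch */
    }
    return NULL;
}

int main(void) {
    idx  = malloc(N * sizeof *idx);
    data = malloc(N * sizeof *data);
    for (long i = 0; i < N; i++) { idx[i] = rand() % N; data[i] = (double)i; }

    pthread_t helper;                 /* the "idle thread context" */
    pthread_create(&helper, NULL, precompute_slice, NULL);

    double sum = 0.0;
    for (long i = 0; i < N; i++) {
        sum += data[idx[i]];          /* the delinquent load */
        atomic_store(&main_i, i);
    }
    pthread_join(helper, NULL);
    printf("%f\n", sum);
    return 0;
}
```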
Runahead execution: An alternative to very large instruction windows for out-of-order processors
In HPCA-9, 2003
"... Today’s high performance processors tolerate long latency operations by means of out-of-order execution. However, as latencies increase, the size of the instruction window must increase even faster if we are to continue to tolerate these latencies. We have already reached the point where the size of ..."
Abstract
-
Cited by 175 (22 self)
- Add to MetaCart
(Show Context)
Today’s high performance processors tolerate long-latency operations by means of out-of-order execution. However, as latencies increase, the size of the instruction window must increase even faster if we are to continue to tolerate these latencies. We have already reached the point where the size of an instruction window that can handle these latencies is prohibitively large, in terms of both design complexity and power consumption. And the problem is getting worse. This paper proposes runahead execution as an effective way to increase memory latency tolerance in an out-of-order processor, without requiring an unreasonably large instruction window. Runahead execution unblocks the instruction window when it is blocked by long-latency operations, allowing the processor to execute far ahead in the program path. This results in data being prefetched into caches long before it is needed. On a machine model based on the Intel® Pentium® 4 processor, having a 128-entry instruction window, adding runahead execution improves the IPC (Instructions Per Cycle) by 22% across a wide range of memory-intensive applications. Also, for the same machine model, runahead execution combined with a 128-entry window performs within 1% of a machine with no runahead execution and a 384-entry instruction window.
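As a rough illustration of the mechanism (not the paper's Pentium 4-based simulator), the C sketch below models the essentials on a toy trace: on a cache miss at the head of the window, state is checkpointed and execution continues speculatively, poisoning results that depend on the miss (INV bits) while independent misses are discovered and prefetched; when the miss returns, the checkpoint is restored. Every structure here is invented for the sketch.

```c
/* Toy trace-driven model of runahead execution; real runahead operates
 * on the physical register file and reuses the processor's front end. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define NREGS 8

typedef struct { int dst, src; long addr; bool is_load; } Insn;

static bool inv[NREGS];   /* INV (poison) bits, used only during runahead */

static bool cache_hit(long a) { (void)a; return false; } /* toy: all miss */
static void note_prefetch(long a) { printf("prefetch 0x%lx\n", a); }

static void run(const Insn *t, int n) {
    for (int i = 0; i < n; i++) {
        if (!t[i].is_load || cache_hit(t[i].addr)) continue;

        /* miss at the window head: checkpoint state, enter runahead */
        bool saved[NREGS];
        memcpy(saved, inv, sizeof inv);
        inv[t[i].dst] = true;                  /* result is bogus */

        for (int j = i + 1; j < n; j++) {      /* speculative pseudo-retire */
            if (inv[t[j].src]) {
                inv[t[j].dst] = true;          /* propagate INV */
            } else if (t[j].is_load && !cache_hit(t[j].addr)) {
                note_prefetch(t[j].addr);      /* the payoff: early miss */
                inv[t[j].dst] = true;
            } else {
                inv[t[j].dst] = false;         /* executes normally */
            }
        }
        memcpy(inv, saved, sizeof inv);  /* miss returns: restore, resume */
    }
}

int main(void) {
    Insn t[] = {
        {1, 0, 0x100, true},   /* load r1  (misses, triggers runahead)  */
        {2, 1, 0,     false},  /* r2 = f(r1): poisoned during runahead  */
        {3, 0, 0x200, true},   /* independent load: prefetched early    */
    };
    run(t, 3);
    return 0;
}
```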
Tolerating Memory Latency through Software-Controlled Pre-Execution in Simultaneous Multithreading Processors
In Proceedings of the 28th Annual International Symposium on Computer Architecture, 2001
"... Hardly predictable data addresses in many irregular applications have rendered prefetching ineffective. In many cases, the only accurate way to predict these addresses is to directly execute the code that generates them. As multithreaded architectures become increasingly popular, one attractive appr ..."
Abstract
-
Cited by 174 (0 self)
- Add to MetaCart
(Show Context)
Hard-to-predict data addresses in many irregular applications have rendered prefetching ineffective. In many cases, the only accurate way to predict these addresses is to directly execute the code that generates them. As multithreaded architectures become increasingly popular, one attractive approach is to use idle threads on these machines to perform pre-execution (essentially a combined act of speculative address generation and prefetching) to accelerate the main thread. In this paper, we propose such a pre-execution technique for simultaneous multithreading (SMT) processors. By using software to control pre-execution, we are able to handle some of the most important access patterns that are typically difficult to prefetch. Compared with existing work on pre-execution, our technique is significantly simpler to implement (e.g., no integration of pre-execution results, no need to shorten programs for pre-execution, and no special hardware to copy register values upon thread spawns). Consequently, only minimal extensions to SMT machines are required to support our technique. Despite its simplicity, our technique offers an average speedup of 24% in a set of irregular applications, which is a 19% speedup over state-of-the-art software-controlled prefetching.
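A pointer chase makes the division of labor concrete. In the hedged C sketch below, the pre-execution slice is just the main loop with everything except address generation stripped away; on an SMT machine it would be spawned into an idle context slightly ahead of the main thread, and, as the abstract notes, its results are simply discarded, so no integration hardware is needed.

```c
/* Sketch of software-controlled pre-execution for a pointer chase. */
#include <stddef.h>

typedef struct Node { struct Node *next; long payload[15]; } Node;

/* main computation: latency-bound on the n->next cache misses */
long sum_list(const Node *n) {
    long s = 0;
    for (; n != NULL; n = n->next)
        s += n->payload[0];
    return s;
}

/* pre-execution slice: the same loop with the real work stripped out.
 * On an SMT machine this would run in an idle context ahead of
 * sum_list; here it is shown only as ordinary code. */
void preexec_slice(const Node *n) {
    for (; n != NULL; n = n->next)
        __builtin_prefetch(n->next, 0, 1);   /* touch, don't compute */
}
```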
Execution-based Prediction Using Speculative Slices.
In Proceedings of the 28th Annual International Symposium on Computer Architecture, 2001
"... Abstract instructions can move smoothly through the pipeline because the slice has tolerated the latency of the memory hierarchy (for loads) or the pipeline (for branches). This technique results in speedups up to 43 percent over an aggressive baseline machine. To benefit from branch predictions ge ..."
Abstract
-
Cited by 173 (6 self)
- Add to MetaCart
(Show Context)
... instructions can move smoothly through the pipeline because the slice has tolerated the latency of the memory hierarchy (for loads) or the pipeline (for branches). This technique results in speedups of up to 43 percent over an aggressive baseline machine. To benefit from branch predictions generated by speculative slices, the predictions must be bound to specific dynamic branch instances. We present a technique that invalidates predictions when it can be determined (by monitoring the program's execution path) that they will not be used. This enables the remaining predictions to be correctly correlated.
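One plausible way to picture the binding problem is a per-branch FIFO of slice-generated predictions, sketched below in C. Everything named here (PredQ, region_exit) is invented for illustration rather than taken from the paper: each dynamic instance of the branch consumes the oldest queued prediction, and a path monitor flushes the queue once execution provably leaves the region, which is the flavor of invalidation the abstract describes.

```c
/* Hedged sketch of binding slice predictions to dynamic branch instances. */
#include <stdbool.h>

#define QCAP 16

typedef struct {
    unsigned long branch_pc;  /* static branch this queue is bound to */
    bool pred[QCAP];          /* slice-generated taken/not-taken bits */
    int head, tail;           /* FIFO over future dynamic instances */
} PredQ;

/* speculative slice side: append a prediction for the next instance */
void slice_produce(PredQ *q, bool taken) {
    if ((q->tail + 1) % QCAP != q->head) {  /* else drop: queue full */
        q->pred[q->tail] = taken;
        q->tail = (q->tail + 1) % QCAP;
    }
}

/* main-thread fetch: consume one bound prediction per dynamic instance */
bool fetch_predict(PredQ *q, unsigned long pc, bool default_pred) {
    if (pc != q->branch_pc || q->head == q->tail)
        return default_pred;         /* fall back to the normal predictor */
    bool p = q->pred[q->head];
    q->head = (q->head + 1) % QCAP;
    return p;
}

/* path monitor: execution left the region, queued predictions are stale */
void region_exit(PredQ *q) { q->head = q->tail = 0; }
```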
Understanding the Backward Slices of Performance Degrading Instructions
In Proceedings of the 27th Annual International Symposium on Computer Architecture, 2000
"... For many applications, branch mispredictions and cache misses limit a processor's performance to a level well below its peak instruction throughput. A small fraction of static instructions, whose behavior cannot be anticipated using current branch predictors and caches, contribute a large fract ..."
Abstract
-
Cited by 85 (3 self)
- Add to MetaCart
(Show Context)
For many applications, branch mispredictions and cache misses limit a processor's performance to a level well below its peak instruction throughput. A small fraction of static instructions, whose behavior cannot be anticipated using current branch predictors and caches, contribute a large fraction of such performance degrading events. This paper analyzes the dynamic instruction stream leading up to these performance degrading instructions to identify the operations necessary to execute them early. The backward slice (the subset of the program that relates to the instruction) of these performance degrading instructions, if small compared to the whole dynamic instruction stream, can be pre-executed to hide the instruction's latency. To overcome conservative dependence assumptions that result in large slices, speculation can be used, resulting in speculative slices. This paper provides an initial characterization of the backward slices of L2 data cache misses and branch mispredictions, and shows the effectiveness of techniques, including memory dependence prediction and control independence, for reducing the size of these slices. Through the use of these techniques, many slices can be reduced to less than one tenth of the full dynamic instruction stream when considering the 512 instructions before the performance degrading instruction.
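The register-dependence core of slice construction is simple to state in code. The C sketch below walks a dynamic trace backward from the performance degrading instruction, keeping a live-register set; the Insn encoding is an assumption of the sketch, and the memory and control dependences the paper also analyzes are deliberately omitted.

```c
/* Backward slice over register dependences only (a sketch). */
#include <stdbool.h>
#include <stdio.h>

#define NREGS 32

typedef struct { int dst, src1, src2; } Insn;   /* -1 = field unused */

/* mark the backward slice of trace[last] in in_slice[]; return its size */
int backward_slice(const Insn *trace, int last, bool *in_slice) {
    bool live[NREGS] = { false };
    int size = 1;
    in_slice[last] = true;
    if (trace[last].src1 >= 0) live[trace[last].src1] = true;
    if (trace[last].src2 >= 0) live[trace[last].src2] = true;

    for (int i = last - 1; i >= 0; i--) {
        in_slice[i] = trace[i].dst >= 0 && live[trace[i].dst];
        if (in_slice[i]) {
            size++;
            live[trace[i].dst] = false;   /* definition kills liveness */
            if (trace[i].src1 >= 0) live[trace[i].src1] = true;
            if (trace[i].src2 >= 0) live[trace[i].src2] = true;
        }
    }
    return size;
}

int main(void) {
    /* r3=r1+r2; r4=r3+r3; r5=r0+r0; delinquent load's address uses r4 */
    Insn t[] = { {3, 1, 2}, {4, 3, 3}, {5, 0, 0}, {-1, 4, -1} };
    bool s[4];
    printf("slice size = %d\n", backward_slice(t, 3, s));  /* 3 of 4 */
    return 0;
}
```

The r5 instruction falls out of the slice because nothing on the path to the delinquent load depends on it, which is exactly the reduction the paper measures at scale.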
An Analysis of Operating System Behavior on a Simultaneous Multithreaded Architecture
In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, 2000
"... This paper presents the first analysis of operating system execution on a simultaneous multithreaded (SMT) processor. While SMT has been studied extensively over the past 6 years, previous research has focused entirely on user-mode execution. However, many of the applications most amenable to multit ..."
Abstract
-
Cited by 57 (3 self)
- Add to MetaCart
(Show Context)
This paper presents the first analysis of operating system execution on a simultaneous multithreaded (SMT) processor. While SMT has been studied extensively over the past 6 years, previous research has focused entirely on user-mode execution. However, many of the applications most amenable to multithreading technologies spend a significant fraction of their time in kernel code. A full understanding of the behavior of such workloads therefore requires execution and measurement of the operating system, as well as the application itself. To carry out this study, we (1) modified the Digital Unix 4.0d operating system to run on an SMT CPU, and (2) integrated our SMT Alpha instruction set simulator into the SimOS simulator to provide an execution environment. For an OS-intensive workload, we ran the multithreaded Apache Web server on an 8-context SMT. We compared Apache's user- and kernel-mode behavior to a standard multiprogrammed SPECInt workload, and compared the SMT processor to an out-...
Design and Evaluation of Compiler Algorithms for Pre-Execution
"... Pre-execution is a promising latency tolerance technique that uses one or more helper threads running in spare hardware contexts ahead of the main computation to trigger long-latency memory operations early, hence absorbing their latency on behalf of the main computation. This paper investigates a ..."
Abstract
-
Cited by 53 (10 self)
- Add to MetaCart
(Show Context)
Pre-execution is a promising latency tolerance technique that uses one or more helper threads running in spare hardware contexts ahead of the main computation to trigger long-latency memory operations early, hence absorbing their latency on behalf of the main computation. This paper investigates a source-to-source C compiler for extracting pre-execution thread code automatically, thus relieving the programmer or hardware from this onerous task. At the heart of our compiler are three algorithms. First, program slicing removes non-critical code for computing cache-missing memory references, reducing pre-execution overhead. Second, prefetch conversion replaces blocking memory references with non-blocking prefetch instructions to minimize pre-execution thread stalls. Finally, threading scheme selection chooses the best scheme for initiating pre-execution threads, speculatively parallelizing loops to generate thread-level parallelism when necessary for latency tolerance. We prototyped our algorithms using the Stanford University Intermediate Format (SUIF) framework and a publicly available program slicer, called Unravel [13], and we evaluated our compiler on a detailed architectural simulator of an SMT processor. Our results show compiler-based pre-execution improves the performance of 9 out of 13 applications, reducing execution time by 22.7%. Across all 13 applications, our technique delivers an average speedup of 17.0%. These performance gains are achieved fully automatically on conventional SMT hardware, with only minimal modifications to support pre-execution threads.
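Of the three algorithms, prefetch conversion is the easiest to show on a toy example. The C fragment below is only a hand-written illustration of the transformation's effect (the actual system is a SUIF-based source-to-source pass, not these exact functions): a load in the slice whose value is never used becomes a non-blocking prefetch, while the load of a[i], whose value feeds the address computation, must remain a real load.

```c
/* slice as produced by program slicing: still stalls on x[a[i]] */
void slice_blocking(const int *a, const double *x, int n) {
    for (int i = 0; i < n; i++) {
        volatile double v = x[a[i]];  /* warms the cache, but blocks */
        (void)v;
    }
}

/* after prefetch conversion: same cache effect, no stall. Only
 * value-unused references are converted; a[i] still loads normally
 * because its value is needed to form the address. */
void slice_nonblocking(const int *a, const double *x, int n) {
    for (int i = 0; i < n; i++)
        __builtin_prefetch(&x[a[i]], 0, 1);   /* non-blocking */
}
```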
Dynamically allocating processor resources between nearby and distant ILP
"... Modern superscalar processors use wide instruction issue widths and out-of-order execution in order to increase instruction-level parallelism (ILP). Since instructions must be committed in order so as to guarantee precise exceptions, increasing ILP implies increasing the sizes of structures such a ..."
Abstract
-
Cited by 51 (2 self)
- Add to MetaCart
(Show Context)
Modern superscalar processors use wide instruction issue widths and out-of-order execution in order to increase instruction-level parallelism (ILP). Since instructions must be committed in order so as to guarantee precise exceptions, increasing ILP implies increasing the sizes of structures such as the register file, issue queue, and reorder buffer. Simultaneously, cycle time constraints limit the size of these structures, resulting in conflicting design requirements. In this paper, we present a novel microarchitecture designed to overcome the limitations of a register file size dictated by cycle time constraints. Available registers are dynamically allocated between the primary program thread and a future thread. The future thread issues and executes instructions when the primary thread is limited by resource availability. The future thread is not constrained by in-order commit requirements. It is therefore able to examine a much larger instruction window and jump far ahead to execute ready instructions. Results are communicated back to the primary thread by warming up the register file, instruction cache, data cache, and instruction reuse buffer, and by resolving branch mispredicts early. The proposed microarchitecture is able to get an overall speedup of 1.17 over the base processor for our benchmark set, with speedups of up to 1.64.
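The allocation policy between the two threads can be pictured in a few lines of C. The sketch below is an invented approximation, not the paper's hardware (which tracks actual free physical registers, not a bare counter): both threads draw from one free list, but the future thread may only take a register while comfortable headroom remains, so it can never starve the primary thread. FUTURE_RESERVE is an illustrative parameter.

```c
/* Invented approximation of dynamic register partitioning. */
#include <stdbool.h>

#define PHYS_REGS      128
#define FUTURE_RESERVE 16    /* illustrative headroom kept for the primary */

static int free_regs = PHYS_REGS;

/* returns true if a register was granted */
bool alloc_reg(bool future_thread) {
    if (free_regs == 0)
        return false;                      /* even the primary must wait */
    if (future_thread && free_regs <= FUTURE_RESERVE)
        return false;     /* future thread yields: primary has priority */
    free_regs--;
    return true;
}

/* the future thread, free of in-order commit, releases registers eagerly */
void release_reg(void) { free_regs++; }
```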
Toward kilo-instruction processors
ACM Transactions on Architecture and Code Optimization, 2004
"... The continuously increasing gap between processor and memory speeds is a serious limitation to the performance achievable by future microprocessors. Currently, processors tolerate long-latency memory operations largely by maintaining a high number of in-flight instructions. In the future, this may r ..."
Abstract
-
Cited by 48 (4 self)
- Add to MetaCart
The continuously increasing gap between processor and memory speeds is a serious limitation to the performance achievable by future microprocessors. Currently, processors tolerate long-latency memory operations largely by maintaining a high number of in-flight instructions. In the future, this may require supporting many hundreds, or even thousands, of in-flight instructions. Unfortunately, the traditional approach of scaling up critical processor structures to provide such support is impractical at these levels, due to area, power, and cycle time constraints. In this paper we show that, in order to overcome this resource-scalability problem, the way in which critical processor resources are managed must be changed. Instead of simply upsizing the processor structures, we propose a smarter use of the available resources, supported by a selective checkpointing mechanism. This mechanism allows instructions to commit out of order, and makes a reorder buffer unnecessary. We present a set of techniques such as multilevel instruction queues, late allocation and early release of registers, and early release of load/store queue entries. All together, these techniques constitute what we call a kilo-instruction processor, an architecture that can support thousands of in-flight instructions, and thus may achieve high performance even in the presence of large memory access latencies.
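A tiny model helps make the checkpoint-based commit concrete. The C sketch below is an invented approximation of the abstract's mechanism, with plain counters standing in for the paper's structures: checkpoints are taken at selected points (e.g., low-confidence branches), instructions between checkpoints complete in any order, and the oldest group's resources are released in bulk once it drains, with no per-instruction reorder buffer anywhere.

```c
/* Invented approximation of selective checkpointing with bulk commit. */
#include <stdbool.h>

#define MAX_CKPTS 8

static int pending[MAX_CKPTS];      /* uncompleted instructions per group */
static int oldest = 0, groups = 1;  /* group 0 is open at start */

/* taken at selected points, e.g. low-confidence branches */
bool take_checkpoint(void) {
    if (groups == MAX_CKPTS)
        return false;                      /* out of checkpoints: stall */
    pending[(oldest + groups) % MAX_CKPTS] = 0;
    groups++;
    return true;
}

/* a dispatched instruction joins the youngest (open) group */
int dispatch_insn(void) {
    int g = (oldest + groups - 1) % MAX_CKPTS;
    pending[g]++;
    return g;                              /* caller remembers its group */
}

/* out-of-order completion; commit is per group, not per instruction */
void complete_insn(int group) {
    pending[group]--;
    /* bulk commit: release the oldest group's registers and LSQ entries
       once it drains, as long as it is not the still-open youngest group */
    while (groups > 1 && pending[oldest] == 0) {
        oldest = (oldest + 1) % MAX_CKPTS;
        groups--;
    }
}
```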