Tolerating Memory Latency through Software-Controlled Pre-Execution in Simultaneous Multithreading Processors (2001)
Venue: In Proceedings of the 28th Annual International Symposium on Computer Architecture
Citations: 174 (0 self)
Citations
1420 | The SPLASH-2 programs: characterization and methodological considerations. ISCA,
- Woo, Ohara, et al.
- 1995
Citation Context ...an be employed to generate data addresses and prefetch well ahead of the main execution, we examine a set of applications drawn from four common benchmark suites (SPEC2000 [16], SPEC95 [10], SPLASH-2 [34], and Olden [25]). A large number of cache misses in these applications are due to relatively irregular access patterns involving pointers, hash tables, indirect array references, or a mix of them, wh...
1158 | Advanced Compiler Design and Implementation
- Muchnick
- 1997
Citation Context ...incur cache misses. Thus, the very first thing is to decide where to launch pre-execution in the program, based on the programmer's knowledge, cache miss profiling [23], or compiler locality analysis [24]. Once this decision has been made, new instructions for spawning pre-execution threads are inserted at the right places in the program. Each thread-spawning instruction requests an idle hardware c...
589 | Multiscalar Processors
- Sohi, Breach, et al.
- 1995
Citation Context ...res the computational results of pre-execution. More aggressive schemes that actually use the results of speculative threads have also been proposed. Examples of them are the multiscalar architecture [29], thread-level data-speculation (TLDS) [30], threaded multiple path execution (TME) [33], dynamic multithreading (DMT) [2], slipstream processors [31], and speculative data-driven multithreading (DDMT)...
382 | Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor
- Tullsen, Eggers, et al.
- 1996
Citation Context ...licity, our technique offers an average speedup of 24% in a set of irregular applications, which is a 19% speedup over state-of-the-art software-controlled prefetching. 1. Introduction Multithreading [1, 32] and prefetching [8, 20, 22] are two major techniques for tolerating ever-increasing memory latency. Multithreading tolerates latency by executing instructions from another concurrent thread when the ...
333 | SPEC CPU 2000: Measuring CPU Performance in the New Millennium.
- Henning
- 2000
Citation Context ...e-controlled pre-execution can be employed to generate data addresses and prefetch well ahead of the main execution, we examine a set of applications drawn from four common benchmark suites (SPEC2000 [16], SPEC95 [10], SPLASH-2 [34], and Olden [25]). A large number of cache misses in these applications are due to relatively irregular access patterns involving pointers, hash tables, indirect array refe...
302 | Tolerating Latency Through Software-Controlled Data Prefetching.
- Mowry
- 1993
Citation Context ...fers an average speedup of 24% in a set of irregular applications, which is a 19% speedup over state-of-the-art software-controlled prefetching. 1. Introduction Multithreading [1, 32] and prefetching [8, 20, 22] are two major techniques for tolerating ever-increasing memory latency. Multithreading tolerates latency by executing instructions from another concurrent thread when the running thread encounters a ...
290 | Programming with POSIX Threads
- Butenhof
- 1997
Citation Context ...e done by the compiler. To facilitate the insertion process, we can have an application programming interface (API) for manipulating pre-execution threads. In fact, a similar interface called PThreads [6] has long been used to exploit parallelism using threads. By providing a new API (perhaps we can call it MThreads) or extending the existing PThreads to support pre-execution threads, programmers can ...
283 | APRIL: A Processor Architecture for Multiprocessing
- Agarwal, Lim, et al.
- 1990
Citation Context ...licity, our technique offers an average speedup of 24% in a set of irregular applications, which is a 19% speedup over state-of-the-art software-controlled prefetching. 1. Introduction Multithreading [1, 32] and prefetching [8, 20, 22] are two major techniques for tolerating ever-increasing memory latency. Multithreading tolerates latency by executing instructions from another concurrent thread when the ...
256 | The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization
- Steffan, Mowry
- 1998
Citation Context ...Such register-state duplication is also needed in many other related techniques like threaded multi-path execution [33], speculative data-driven multithreading [28], and thread-level data speculation [30], etc. For these techniques, the entire register state has to be copied as fast as possible. Consequently, special hardware mechanisms have been proposed to accomplish this task. In our case, although...
220 | Effective Hardware-Based Data Prefetching for High-Performance Processors.
- Chen, Baer
- 1995
Citation Context ...fers an average speedup of 24% in a set of irregular applications, which is a 19% speedup over state-of-the-art software-controlled prefetching. 1. Introduction Multithreading [1, 32] and prefetching [8, 20, 22] are two major techniques for tolerating ever-increasing memory latency. Multithreading tolerates latency by executing instructions from another concurrent thread when the running thread encounters a ...
203 | Compiler-based prefetching for recursive data structures
- Luk, Mowry
- 1996
Citation Context ...fers an average speedup of 24% in a set of irregular applications, which is a 19% speedup over state-of-the-art software-controlled prefetching. 1. Introduction Multithreading [1, 32] and prefetching [8, 20, 22] are two major techniques for tolerating ever-increasing memory latency. Multithreading tolerates latency by executing instructions from another concurrent thread when the running thread encounters a ...
190 | A Dynamic Multithreading Processor.
- Akkary, Driscoll
- 1998
Citation Context ...ds have also been proposed. Examples of them are the multiscalar architecture [29], thread-level data-speculation (TLDS) [30], threaded multiple path execution (TME) [33], dynamic multithreading (DMT) [2], slipstream processors [31], and speculative data-driven multithreading (DDMT) [28]. Of course, being able to use the results of speculative threads is appealing. Nevertheless, by caring only data ad...
187 | Slipstream processors: improving both performance and fault tolerance.
- Sundaramoorthy, Purser, et al.
- 2000
Citation Context ...eculatively. There are a number of ways to apply pre-execution, including prefetching data and/or instructions [12, 26], precomputing branch outcomes [15], and pre-computing general execution results [28, 31]. For our purpose, pre-execution is mainly used as a vehicle for speculatively generating data addresses and prefetching---the ultimate computational results are simply ignored. Moreover, unlike sever...
180 | Speculative precomputation: Long-range prefetching of delinquent loads.
- Collins, Wang, et al.
- 2001
Citation Context ... mainly used as a vehicle for speculatively generating data addresses and prefetching---the ultimate computational results are simply ignored. Moreover, unlike several recent pre-execution techniques [4, 5, 9, 26, 28, 31, 36] which pre-execute a shortened version of the program, our technique simply works on the original program and hence requires no mechanism to trim the program. In essence, our technique tolerates laten...
179 | Dependence based prefetching for linked data structures
- Roth, Moshovos, et al.
- 1998
Citation Context ...tching, the main advantage of multithreading is that unlike prefetching, it does not need to predict data addresses in advance which can be a serious challenge in codes with irregular access patterns [20, 26]. Prefetching, however, has a significant advantage that it can improve single-thread performance, unlike multithreading which requires multiple concurrent threads. In this paper, we propose a techniq...
173 | Execution-based Prediction Using Speculative Slices.
- Zilles, Sohi
- 2001
Citation Context ... mainly used as a vehicle for speculatively generating data addresses and prefetching---the ultimate computational results are simply ignored. Moreover, unlike several recent pre-execution techniques [4, 5, 9, 26, 28, 31, 36] which pre-execute a shortened version of the program, our technique simply works on the original program and hence requires no mechanism to trim the program. In essence, our technique tolerates laten...
169 | Speculative Data-Driven Multi-Threading
- Roth, Sohi
- 2001
Citation Context ... thread-level data-speculation (TLDS) [30], threaded multiple path execution (TME) [33], dynamic multithreading (DMT) [2], slipstream processors [31], and speculative data-driven multithreading (DDMT) [28]. Of course, being able to use the results of speculative threads is appealing. Nevertheless, by caring only data addresses but not the final results of pre-execution, our scheme is substantially simp...
166 | Supporting dynamic data structures on distributed memory machines
- Rogers, Carlisle, et al.
- 1995
Citation Context ...o generate data addresses and prefetch well ahead of the main execution, we examine a set of applications drawn from four common benchmark suites (SPEC2000 [16], SPEC95 [10], SPLASH-2 [34], and Olden [25]). A large number of cache misses in these applications are due to relatively irregular access patterns involving pointers, hash tables, indirect array references, or a mix of them, which are typicall...
153 | Improving data cache performance by pre-executing instructions under a cache miss.
- Dundas, Mudge
- 1997
Citation Context ...to as the approach that tolerates long-latency operations by initiating them early and speculatively. There are a number of ways to apply pre-execution, including prefetching data and/or instructions [12, 26], precomputing branch outcomes [15], and pre-computing general execution results [28, 31]. For our purpose, pre-execution is mainly used as a vehicle for speculatively generating data addresses and pr...
127 | The Alpha 21264 Microprocessor Architecture
- Kessler
- 1998
Citation Context ... Pipeline Parameters Number of Hardware Contexts 4 Fetch/Decode/Issue/Commit Width 8 Instruction Queue 128 entries Functional Units 8 integer, 6 floating-point; latencies are based on the Alpha 21264 [17] Branch Predictor A McFarling-style choosing branch predictor like the one in the Alpha 21264 [17] Thread Prioritization Policy A modified ICOUNT scheme [32] which favors the main thread Memory Parame...
111 | Simultaneous subordinate microthreading (ssmt).
- Chappell, Stark, et al.
- 1999
Citation Context ...n the values returned from cache misses. The notion of using helper threads to accelerate the main execution was independently introduced in the form of simultaneous subordinate microthreading (SSMT) [7] and assisted execution [11]. In both schemes, helper threads do not directly run the program: Instead, they are designated to perform some specific algorithms such as stride prefetching and self-hist...
109 | Threaded multiple path execution.
- Wallace, Calder, et al.
- 1998
Citation Context ...ed, we can then cancel all wrong-path pre-execution and only keep the one that is on the right path to allow it running ahead of the main execution. A similar idea of executing multiple paths at once [18, 33] has been exploited before to reduce the impact of branch misprediction. Let us illustrate this scheme using the Spec95 benchmark compress. In this application, the function compress() reads a series ...
102 | Effective jump-pointer prefetching for linked data structures
- Roth, Sohi
- 1999
Citation Context ... depicts the situation where the address of the next node we want to prefetch is not known until we finish with the current load. To tackle this problem, prefetching techniques based on jump pointers [20, 27] have been proposed to record the address of the node that we would like to prefetch at the current node according to past traversals. These techniques would tolerate the latency of accessing a single...
85 | Understanding the Backwards Slices of Performance Degrading Instructions
- Zilles, Sohi
- 2000
Citation Context ...earchers have investigated ways to pre-execute only a subset of instructions (known as a slice) that lead to performance degradation such as cache misses and branch mispredictions. Zilles and Sohi [35] found that speculation techniques like memory dependency prediction and control independence can be used to significantly reduce the slice size. Recently, a collection of schemes [4, 5, 9, 15, 26, 28...
84 | Data Prefetching by Dependence Graph Precomputation.
- Annavaram, Patel, et al.
- 2001
Citation Context ... mainly used as a vehicle for speculatively generating data addresses and prefetching---the ultimate computational results are simply ignored. Moreover, unlike several recent pre-execution techniques [4, 5, 9, 26, 28, 31, 36] which pre-execute a shortened version of the program, our technique simply works on the original program and hence requires no mechanism to trim the program. In essence, our technique tolerates laten...
72 | Assisted Execution.
- Song, Dubois
- 1998
Citation Context ...cache misses. The notion of using helper threads to accelerate the main execution was independently introduced in the form of simultaneous subordinate microthreading (SSMT) [7] and assisted execution [11]. In both schemes, helper threads do not directly run the program: Instead, they are designated to perform some specific algorithms such as stride prefetching and self-history branch prediction. In co...
72 | Selective eager execution on the polypath architecture
- Klauser, Paithankar, et al.
- 1998
Citation Context ...ed, we can then cancel all wrong-path pre-execution and only keep the one that is on the right path to allow it running ahead of the main execution. A similar idea of executing multiple paths at once [18, 33] has been exploited before to reduce the impact of branch misprediction. Let us illustrate this scheme using the Spec95 benchmark compress. In this application, the function compress() reads a series ...
54 | Simultaneous Multithreading: Multiplying Alpha's Performance. Presentation at the Microprocessor Forum '99
- Emer
- 1999
Citation Context ...hosen as the platform for this study. An SMT machine allows multiple independent threads to execute simultaneously (i.e. in the same cycle) in different functional units. For example, the Alpha 21464 [13] will be an SMT machine with four threads that can issue up to eight instructions per cycle from one or more threads. Although pre-execution could be applied to other multithreaded architectures as we...
52 | Dataflow Analysis of Branch Mispredictions and Its Application to Early Resolution of Branch Outcomes.
- Farcy, Temam, et al.
- 1998
Citation Context ...latency operations by initiating them early and speculatively. There are a number of ways to apply pre-execution, including prefetching data and/or instructions [12, 26], precomputing branch outcomes [15], and pre-computing general execution results [28, 31]. For our purpose, pre-execution is mainly used as a vehicle for speculatively generating data addresses and prefetching---the ultimate computatio...
51 | Dynamically Allocating Processor Resources Between Nearby and Distant ILP
- Balasubramonian, Dwarkadas, et al.
- 2001
Citation Context ... mainly used as a vehicle for speculatively generating data addresses and prefetching---the ultimate computational results are simply ignored. Moreover, unlike several recent pre-execution techniques [4, 5, 9, 26, 28, 31, 36] which pre-execute a shortened version of the program, our technique simply works on the original program and hence requires no mechanism to trim the program. In essence, our technique tolerates laten...
38 | Predicting Data Cache Misses in Non-Numeric Applications Through Correlation Profiling
- Mowry, Luk
- 1997
29 | Automatic Compiler-Inserted Prefetching for Pointer-Based Applications
- Luk, Mowry
- 1999
Citation Context ...ended algorithm [22] for prefetching these references was used. For the remaining five applications, we experimented with the greedy prefetching and jump-pointer prefetching proposed by Luk and Mowry [20, 21] as well as the extensions of jump-pointer prefetching (full jumping, chain jumping, and root jumping) proposed by Roth and Sohi [27]. In all cases, a wide range of prefetching distances (whenever app...
7 | The SPEC95 benchmark suite. http://www.specbench.org
- Standard Performance Evaluation Corporation
Citation Context ...pre-execution can be employed to generate data addresses and prefetch well ahead of the main execution, we examine a set of applications drawn from four common benchmark suites (SPEC2000 [16], SPEC95 [10], SPLASH-2 [34], and Olden [25]). A large number of cache misses in these applications are due to relatively irregular access patterns involving pointers, hash tables, indirect array references, or a ...
4 | Relaxing constraints: thoughts on the evolution of computer architecture
- Emer
- 2000
Citation Context ...ting software-controlled pre-execution in future SMT machines. 8. Acknowledgments I thank Joel Emer's encouragement of pursuing a software-based approach to pre-execution (see his HPCA-7 keynote speech [14] for the philosophy behind). Yuan Chou, Robert Cohn, Joel Emer, and Steve Wallace gave insightful comments on early drafts of the paper. In addition, I thank the Asim team for supporting the simulator...
3 | Multi-chain prefetching: Exploiting memory parallelism in pointer-chasing codes
- Kohout, Choi, et al.
- 2000
Citation Context ...single chain if the traversal order is fairly static over time and the overhead of maintaining those jump pointers does not overwhelm the benefit. Recently, a technique called multi-chain prefetching [19] has been proposed to attack the pointer-chasing problem from a different angle. Instead of prefetching pointer chains one at a time, this technique prefetches multiple pointer chains that will be vis...