Task Selection for a Multiscalar Processor
In Proceedings of the 31st Annual International Symposium on Microarchitecture, 1998
Cited by 58 (6 self)
The Multiscalar architecture advocates a distributed processor organization and task-level speculation to exploit high degrees of instruction level parallelism (ILP) in sequential programs without impeding improvements in clock speeds. The main goal of this paper is to understand the key implications of the architectural features of distributed processor organization and task-level speculation for compiler task selection from the point of view of performance. We identify the fundamental performance issues to be: control flow speculation, data communication, data dependence speculation, load imbalance, and task overhead. We show that these issues are intimately related to a few key characteristics of tasks: task size, inter-task control flow, and inter-task data dependence. We describe compiler heuristics to select tasks with favorable characteristics. We report experimental results to show that the heuristics are successful in boosting overall performance by establishing larger ILP win...
Compiler Optimization of Scalar Value Communication Between Speculative Threads
In Proceedings of the 10th ASPLOS, 2002
Cited by 56 (17 self)
While there have been many recent proposals for hardware that supports Thread-Level Speculation (TLS), there has been relatively little work on compiler optimizations to fully exploit this potential for parallelizing programs optimistically. In this paper, we focus on one important limitation of program performance under TLS, which is stalls due to forwarding scalar values between threads that would otherwise cause frequent data dependences. We present and evaluate dataflow algorithms for three increasingly aggressive instruction scheduling techniques that reduce the critical forwarding path introduced by the synchronization associated with this data forwarding. In addition, we contrast our compiler techniques with related hardware-only approaches. With our most aggressive compiler and hardware techniques, we improve performance under TLS by 6.2% to 28.5% for 6 of 14 applications, and by at least 2.7% for half of the other applications.
The STAMPede approach to thread-level speculation
ACM Transactions on Computer Systems, 2005
Cited by 38 (6 self)
Multithreaded processor architectures are becoming increasingly commonplace: many current and upcoming designs support chip multiprocessing, simultaneous multithreading, or both. While it is relatively straightforward to use these architectures to improve the throughput of a multithreaded or multiprogrammed workload, the real challenge is how to easily create parallel software to allow single programs to effectively exploit all of this raw performance potential. One promising technique for overcoming this problem is Thread-Level Speculation (TLS), which enables the compiler to optimistically create parallel threads despite uncertainty as to whether those threads are actually independent. In this article, we propose and evaluate a design for supporting TLS that seamlessly scales both within a chip and beyond because it is a straightforward extension of writeback invalidation-based cache coherence (which itself scales both up and down). Our experimental results demonstrate that our scheme performs well on single-chip multiprocessors where the first-level caches are either private or shared. For our private-cache design, the program performance of two of 13 general-purpose applications studied improves by 86% and 56%, four others improve by more than 8%, and the average improvement across all applications is 16%, confirming that TLS is a promising way...
Mitosis compiler: An infrastructure for speculative threading based on pre-computation slices
In PLDI ’05: Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, 2005
Cited by 37 (4 self)
Speculative parallelization can provide significant sources of additional thread-level parallelism, especially for irregular applications that are hard to parallelize by conventional approaches. In this paper, we present the Mitosis compiler, which partitions applications into speculative threads, with special emphasis on applications for which conventional parallelizing approaches fail. The management of inter-thread data dependences is crucial for the performance of the system. The Mitosis framework uses a pure software approach to predict/compute the thread’s input values. This software approach is based on the use of pre-computation slices (p-slices), which are built by the Mitosis compiler and added at the beginning of the speculative thread. P-slices must compute thread input values accurately but they do not need to guarantee correctness, since the underlying architecture can detect and recover from misspeculations. This allows the compiler to use aggressive/unsafe optimizations to significantly reduce their overhead. The most important optimizations included in the Mitosis compiler and presented in this paper are branch pruning, memory and register dependence speculation, and early thread squashing. Performance evaluation of Mitosis compiler/architecture shows an average speedup of 2.2.
A study of control independence in superscalar processors
1998
Cited by 36 (4 self)
Control independence has been put forward as a significant new source of instruction-level parallelism for future generation processors. However, its performance potential under practical hardware constraints is not known, and even less is understood about the factors that contribute to or limit the performance of control independence. Important aspects of control independence are identified and singled out for study, and a series of idealized machine models are used to isolate and evaluate these aspects. It is shown that much of the performance potential of control independence is lost due to data dependences and wasted resources consumed by incorrect control dependent instructions. Even so, control independence can close the performance gap between real and perfect branch prediction by as much as half. Next, important implementation issues are discussed and some design alternatives are given. This is followed by a more detailed set of simulations, where the key implementation features are realistically modeled. These simulations show typical performance improvements of 10-30%.
Memory Dependence Prediction
1998
Cited by 36 (5 self)
As the existing techniques that empower the modern high-performance processors are being refined and as the underlying technology trade-offs change, new bottlenecks are exposed and new challenges are raised. This thesis introduces a new tool, Memory Dependence Prediction, that can be useful in combating these bottlenecks and meeting the new challenges. Memory dependence prediction is a technique to guess whether a load or a store will experience a dependence. Memory dependence prediction exploits regularity in the memory dependence stream of ordinary programs, a phenomenon which is also identified in this thesis. To demonstrate the utility of memory dependence prediction, this thesis also presents the following three novel microarchitectural techniques: 1. Dynamic Speculation/Synchronization of Memory Dependences: this thesis demonstrates that, to exploit parallelism over larger regions of code, waiting to determine the dependences a load has is not the best-performing option. Higher performance is possible if memory dependence speculation is used, especially if memory dependence prediction is used to guide this speculation.
A Quantitative Assessment of Thread-Level Speculation Techniques
In Proceedings of the 14th International Conference on Parallel and Distributed Processing Symposium (IPDPS ’00), 1999
Cited by 25 (0 self)
Speculative thread-level parallelism has recently been proposed as an alternative source of parallelism that can boost performance for applications where independent threads are hard to find. Several schemes to exploit this type of parallelism have been proposed, and significant gains have usually been reported. However, there is a lack of understanding of the sources of these benefits as well as the impact of some design choices. This work analyzes the benefits of different thread speculation techniques and the impact of some critical issues such as the value predictor, the branch predictor, the thread initialization overhead, and the connectivity among thread units.
Control Independence in Trace Processors
In Proc. 32nd International Symposium on Microarchitecture, 1999
Cited by 23 (0 self)
Branch mispredictions are a major obstacle to exploiting instruction-level parallelism, at least in part because all instructions after a mispredicted branch are squashed. However, instructions that are control independent of the branch must be fetched regardless of the branch outcome, and do not necessarily have to be squashed and re-executed. Control independence exists when the two paths following a branch re-converge. A trace processor microarchitecture is developed to exploit control independence and thereby reduce branch misprediction penalties. There are three major contributions. 1) Trace-level re-convergence is not guaranteed despite re-convergence at the instruction-level. Novel trace selection techniques are developed to expose control independence at the trace-level. 2) Control independence’s potential complexity stems from insertion and removal of instructions from the middle of the instruction window. Trace processors manage control flow hierarchically (traces are the fundamental unit of control flow) and this results in an efficient implementation. 3) Control independent instructions must be inspected for incorrect data dependences caused by mispredicted control flow. Existing data speculation support is easily leveraged to selectively re-execute incorrect-data dependent, control independent instructions. Control independence improves trace processor performance from 2% to 25%, and 13% on average, for the SPEC95 integer benchmarks.
Architecture of the Atlas Chip-Multiprocessor: Dynamically Parallelizing Irregular Applications
IEEE Transactions on Computers, 1999
Cited by 22 (2 self)
Single-chip multiprocessors are an important research direction for future microprocessors. The stigma of this approach is that many important applications cannot be automatically parallelized. This paper presents a single-chip multiprocessor that engages aggressive speculation techniques to enable dynamic parallelization of irregular, sequential binaries. Thread speculation (multiscalar execution) and data value prediction are combined to enable the processor to execute dependent threads in parallel. The architecture performs a novel form of dynamic thread partitioning and includes an aggressive correlated value predictor. Several new microarchitectural structures manage inter-thread dependencies. On an eight processor system, simulated execution of SPECint95 binaries delivers a speedup of 3.4 over uniprocessor performance. This improvement is due entirely to the exploitation of dynamically extracted thread level parallelism.
Loop Selection for Thread-Level Speculation
In Proceedings of the 18th International Workshop on Languages and Compilers for Parallel Computing, 2005
Cited by 12 (6 self)
Thread-level speculation (TLS) allows potentially dependent threads to speculatively execute in parallel, thus making it easier for the compiler to extract parallel threads. However, the high cost associated with unbalanced load, failed speculation, and inter-thread value communication makes it difficult to obtain the desired performance unless the speculative threads are carefully chosen. In this paper, we focus on extracting parallel threads from loops in general-purpose applications because loops, with their regular structures and significant coverage on execution time, are ideal candidates for extracting parallel threads. General-purpose applications, however, usually contain a large number of nested loops with unpredictable parallel performance and dynamic behavior, thus making it difficult to decide which set of loops should be parallelized to improve overall program performance. Our proposed loop selection algorithm addresses all these difficulties. We have found that (i) with the aid of profiling information, compiler analyses can achieve a reasonably accurate estimation of the performance of parallel execution, and that (ii) different invocations of a loop may behave differently, and exploiting this dynamic behavior can further improve performance. With a judicious choice of loops, we can improve the overall program performance of SPEC2000 integer benchmarks by as much as 20%.

