Results 1 - 10 of 187
Speculative Precomputation: Long-range Prefetching of Delinquent Loads, 2001
"... This paper explores Speculative Precomputation, a technique that uses idle thread contexts in a multithreaded architecture to improve performance of single-threaded applications. It attacks program stalls from data cache misses by pre-computing future memory accesses in available thread contexts, an ..."
Abstract - Cited by 180 (23 self)
This paper explores Speculative Precomputation, a technique that uses idle thread contexts in a multithreaded architecture to improve performance of single-threaded applications. It attacks program stalls from data cache misses by pre-computing future memory accesses in available thread contexts, and prefetching these data. This technique is evaluated by simulating the performance of a research processor based on the Itanium™ ISA supporting Simultaneous Multithreading. Two primary forms of Speculative Precomputation are evaluated. If only the non-speculative thread spawns speculative threads, performance gains of up to 30% are achieved when assuming ideal hardware. However, this speedup drops considerably with more realistic hardware assumptions. Permitting speculative threads to directly spawn additional speculative threads reduces the overhead associated with spawning threads and enables significantly more aggressive speculation, overcoming this limitation. Even with realistic costs for spawning threads, speedups as high as 169% are achieved, with an average speedup of 76%.
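The mechanism this abstract describes is microarchitectural, but its essence can be sketched in ordinary software: a helper thread executes only the address-generating slice of a pointer-chasing loop ahead of the main thread, warming a cache so the main thread's loads hit. A toy model follows; the cycle costs, the software "cache", and all names are invented for illustration and are not taken from the paper.

```python
import threading

# Toy timing model: a cache miss costs MISS_CYCLES, a hit costs HIT_CYCLES.
MISS_CYCLES, HIT_CYCLES = 100, 1

def make_linked_list(n):
    # node i holds payload i*i and points at node i+1 (None terminates)
    return {i: (i * i, i + 1 if i + 1 < n else None) for i in range(n)}

def run_baseline(memory, head):
    # main thread alone: every pointer-chasing load misses
    cycles, total, node = 0, 0, head
    while node is not None:
        cycles += MISS_CYCLES
        payload, node = memory[node]
        total += payload
    return total, cycles

def run_with_precompute(memory, head):
    cache, lock = set(), threading.Lock()

    def p_slice(node):
        # speculative thread: runs only the address-generating slice,
        # touching each node so it lands in the cache before the main
        # thread needs it
        while node is not None:
            with lock:
                cache.add(node)
            node = memory[node][1]

    helper = threading.Thread(target=p_slice, args=(head,))
    helper.start()
    helper.join()  # real hardware overlaps the threads; joined for determinism

    cycles, total, node = 0, 0, head
    while node is not None:
        cycles += HIT_CYCLES if node in cache else MISS_CYCLES
        payload, node = memory[node]
        total += payload
    return total, cycles
```

Joining the helper before the main loop makes every access a hit, which is the idealized best case; on real SMT hardware the threads overlap, and the gain depends on how far ahead the p-slice runs.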
Tolerating Memory Latency through Software-Controlled Pre-Execution in Simultaneous Multithreading Processors
- In Proceedings of the 28th Annual International Symposium on Computer Architecture, 2001
"... Hardly predictable data addresses in many irregular applications have rendered prefetching ineffective. In many cases, the only accurate way to predict these addresses is to directly execute the code that generates them. As multithreaded architectures become increasingly popular, one attractive appr ..."
Abstract - Cited by 174 (0 self)
Hard-to-predict data addresses in many irregular applications have rendered prefetching ineffective. In many cases, the only accurate way to predict these addresses is to directly execute the code that generates them. As multithreaded architectures become increasingly popular, one attractive approach is to use idle threads on these machines to perform pre-execution, essentially a combined act of speculative address generation and prefetching, to accelerate the main thread. In this paper, we propose such a pre-execution technique for simultaneous multithreading (SMT) processors. By using software to control pre-execution, we are able to handle some of the most important access patterns that are typically difficult to prefetch. Compared with existing work on pre-execution, our technique is significantly simpler to implement (e.g., no integration of pre-execution results, no need to shorten programs for pre-execution, and no special hardware to copy register values upon thread spawns). Consequently, only minimal extensions to SMT machines are required to support our technique. Despite its simplicity, our technique offers an average speedup of 24% in a set of irregular applications, which is a 19% speedup over state-of-the-art software-controlled prefetching.
Execution-based Prediction Using Speculative Slices.
- In Proceedings of the 28th Annual International Symposium on Computer Architecture, 2001
"... Abstract instructions can move smoothly through the pipeline because the slice has tolerated the latency of the memory hierarchy (for loads) or the pipeline (for branches). This technique results in speedups up to 43 percent over an aggressive baseline machine. To benefit from branch predictions ge ..."
Abstract - Cited by 173 (6 self)
... instructions can move smoothly through the pipeline because the slice has tolerated the latency of the memory hierarchy (for loads) or the pipeline (for branches). This technique results in speedups of up to 43 percent over an aggressive baseline machine. To benefit from branch predictions generated by speculative slices, the predictions must be bound to specific dynamic branch instances. We present a technique that invalidates predictions when it can be determined (by monitoring the program's execution path) that they will not be used. This enables the remaining predictions to be correctly correlated.
Detailed Design and Evaluation of Redundant Multithreading Alternatives, 2002
"... ... with reductions in voltage levels, makes each generation of microprocessors increasingly vulnerable to transient faults. In a multithreaded environment, we can detect these faults by running two copies of the same program as separate threads, feeding them identical inputs, and comparing their ou ..."
Abstract - Cited by 167 (7 self)
... with reductions in voltage levels, makes each generation of microprocessors increasingly vulnerable to transient faults. In a multithreaded environment, we can detect these faults by running two copies of the same program as separate threads, feeding them identical inputs, and comparing their outputs, a technique we call Redundant Multithreading (RMT). This paper studies ...
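The detection scheme this abstract names can be mimicked in a few lines of ordinary threading code, under the loose assumption that a transient fault shows up as one copy computing a different output. The flaky-function fault injection below is purely illustrative; RMT itself compares outputs in hardware, not via exceptions.

```python
import threading

def redundant_run(fn, args):
    """Sketch of Redundant Multithreading (RMT): run two copies of the
    same computation as separate threads on identical inputs and compare
    their outputs; a mismatch signals a possible transient fault."""
    results = [None, None]

    def copy_thread(idx):
        results[idx] = fn(*args)

    threads = [threading.Thread(target=copy_thread, args=(i,)) for i in (0, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    if results[0] != results[1]:
        raise RuntimeError("output mismatch: possible transient fault")
    return results[0]
```

When both copies agree, the common result is returned; in the papers above, the comparison point (before or after commit) is exactly what distinguishes detection-only schemes from recovery-capable ones.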
Transient-fault recovery for chip multiprocessors
- In Proceedings of the 30th Annual International Symposium on Computer Architecture, 2003
"... To address the increasing susceptibility of commodity chip multiprocessors (CMPs) to transient faults, we propose Chiplevel Redundantly Threaded multiprocessor with Recovery (CRTR). CRTR extends the previously-proposed CRT for transient-fault detection in CMPs, and the previously-proposed SRTR for t ..."
Abstract - Cited by 145 (3 self)
To address the increasing susceptibility of commodity chip multiprocessors (CMPs) to transient faults, we propose Chip-level Redundantly Threaded multiprocessor with Recovery (CRTR). CRTR extends the previously-proposed CRT for transient-fault detection in CMPs, and the previously-proposed SRTR for transient-fault recovery in SMT. All these schemes achieve fault tolerance by executing and comparing two copies, called leading and trailing threads, of a given application. Previous recovery schemes for SMT do not perform well on CMPs. In a CMP, the leading and trailing threads execute on different processors to achieve load balancing and reduce the probability of a fault corrupting both threads, whereas in an SMT, both threads execute on the same processor. The inter-processor communication required to compare the threads introduces ...
SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery
- In Proceedings of the 29th Annual International Symposium on Computer Architecture, 2002
"... We develop an availability solution, called SafetyNet, that uses a unified, lightweight checkpoint~recovery mechanism to support multiple long-latency fault detection schemes. At an abstract level, SafetyNet logically maintains multi-ple, globally consistent checkpoints of the state of a shared memo ..."
Abstract - Cited by 137 (10 self)
We develop an availability solution, called SafetyNet, that uses a unified, lightweight checkpoint/recovery mechanism to support multiple long-latency fault detection schemes. At an abstract level, SafetyNet logically maintains multiple, globally consistent checkpoints of the state of a shared memory multiprocessor (i.e., processors, memory, and coherence permissions), and it recovers to a pre-fault checkpoint of the system and re-executes if a fault is detected. SafetyNet efficiently coordinates checkpoints across the system in logical time and uses "logically atomic" coherence transactions to free checkpoints of transient coherence state. SafetyNet minimizes performance overhead by pipelining checkpoint validation with subsequent parallel execution. We illustrate SafetyNet avoiding system crashes due to either dropped coherence messages or the loss of an interconnection network switch (and its buffered messages). Using full-system simulation of a 16-way multiprocessor running commercial workloads, we find that SafetyNet (a) adds statistically insignificant runtime overhead in the common case of fault-free execution, and (b) avoids a crash when tolerated faults occur.
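Stripped of coherence and logical time, the checkpoint/recovery idea reduces to: keep consistent snapshots of system state, and roll back to the newest one when a fault is detected. A minimal sketch; the class and method names are invented, and SafetyNet itself operates on processor, memory, and coherence state in hardware rather than on a Python dict.

```python
import copy

class CheckpointedSystem:
    """Toy global checkpoint/recovery loosely modeled on SafetyNet:
    periodically checkpoint state, roll back and re-execute on a fault."""

    def __init__(self, state):
        self.state = state
        self._checkpoints = [copy.deepcopy(state)]

    def checkpoint(self):
        # validation of older checkpoints would be pipelined with execution
        # in the real design; here we just keep a bounded history
        self._checkpoints.append(copy.deepcopy(self.state))
        if len(self._checkpoints) > 4:
            self._checkpoints.pop(0)

    def recover(self):
        # restore the newest (assumed validated) checkpoint
        self.state = copy.deepcopy(self._checkpoints[-1])
        return self.state
```

The bounded history mirrors the paper's observation that multiple outstanding checkpoints are needed when fault detection has long latency: the fault may be detected several intervals after it occurred.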
N-variant systems: A secretless framework for security through diversity
- In Proceedings of the 15th USENIX Security Symposium, 2006
"... We present an architectural framework for systematically using automated diversity to provide high assurance detection and disruption for large classes of attacks. The framework executes a set of automatically diversified variants on the same inputs, and monitors their behavior to detect divergences ..."
Abstract - Cited by 121 (3 self)
We present an architectural framework for systematically using automated diversity to provide high assurance detection and disruption for large classes of attacks. The framework executes a set of automatically diversified variants on the same inputs, and monitors their behavior to detect divergences. The benefit of this approach is that it requires an attacker to simultaneously compromise all system variants with the same input. By constructing variants with disjoint exploitation sets, we can make it impossible to carry out large classes of important attacks. In contrast to previous approaches that use automated diversity for security, our approach does not rely on keeping any secrets. In this paper, we introduce the N-variant systems framework, present a model for analyzing security properties of N-variant systems, define variations that can be used to detect attacks that involve referencing absolute memory addresses and executing injected code, and describe a prototype implementation along with performance results from it.
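The monitoring contract is simple to state in code: run every variant on the same input and reject on divergence. A toy sketch, in which a planted special case stands in for an exploit that succeeds under only one variant's diversification; the payload string and both variants are invented for illustration.

```python
def n_variant_monitor(variants, request):
    """N-variant execution sketch: feed the same input to all variants
    and compare outputs; any divergence means some variant was affected
    differently, so the request is rejected."""
    outputs = [variant(request) for variant in variants]
    if any(out != outputs[0] for out in outputs[1:]):
        raise ValueError("variant divergence detected; request rejected")
    return outputs[0]

def variant_a(text):
    return text.upper()

def variant_b(text):
    # diversified variant: same specification, different internals; the
    # planted special case simulates an exploit that only works under
    # this variant's (hypothetical) memory layout
    if text == "\x90\x90":
        return "pwned"
    return "".join(ch.upper() for ch in text)
```

On benign inputs all variants agree and the common output is returned; an input that exploits one variant perturbs its behavior, the monitor sees the divergence, and the attack is disrupted without any secret being kept.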
ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors
- In ISCA-02, 2002
"... This paper presents ReVive, a novel general-purpose rollback recovery mechanism for shared-memory multiprocessors. ReVive carefully balances the conflicting requirements of availability, performance, and hardware cost. ReVive performs checkpointing, logging, and distributed parity protection, all me ..."
Abstract - Cited by 120 (13 self)
This paper presents ReVive, a novel general-purpose rollback recovery mechanism for shared-memory multiprocessors. ReVive carefully balances the conflicting requirements of availability, performance, and hardware cost. ReVive performs checkpointing, logging, and distributed parity protection, all memory-based. It enables recovery from a wide class of errors, including the permanent loss of an entire node. To maintain high performance, ReVive includes specialized hardware that performs frequent operations in the background, such as log and parity updates. To keep the cost low, more complex checkpointing and recovery functions are performed in software, while the hardware modifications are limited to the directory controllers of the machine. Our simulation results on a 16-processor system indicate that the average error-free execution time overhead of using ReVive is only 6.3%, while the achieved availability is better than 99.999% even when the errors occur as often as once per day.
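The memory-based logging that ReVive performs in the directory controller can be sketched as an undo log in software: before the first write to a location within a checkpoint interval, the old value is saved; rollback replays the log in reverse. All names below are invented, and the distributed parity protection the paper also describes is omitted.

```python
class UndoLogMemory:
    """Sketch of ReVive-style undo logging for rollback recovery.
    None in the log marks 'address was absent at checkpoint time'."""

    def __init__(self, contents=None):
        self.mem = dict(contents or {})
        self._log = []        # (addr, old_value_or_None), newest last
        self._logged = set()  # addresses already logged this interval

    def checkpoint(self):
        # a new checkpoint discards the undo information of the old one
        self._log.clear()
        self._logged.clear()

    def write(self, addr, value):
        # log the pre-write value only on the first write per interval
        if addr not in self._logged:
            self._log.append((addr, self.mem.get(addr)))
            self._logged.add(addr)
        self.mem[addr] = value

    def rollback(self):
        # replay the log backwards to restore the checkpointed state
        for addr, old in reversed(self._log):
            if old is None:
                self.mem.pop(addr, None)
            else:
                self.mem[addr] = old
        self.checkpoint()
```

Logging only the first write per interval is what keeps the common-case overhead low: repeated writes to a hot location cost one log entry, matching the paper's emphasis on cheap background operation.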
Transient-fault recovery using simultaneous multithreading
- In Proceedings of ISCA, 2002
"... Abstract We propose Simultaneously and Redundantly Threaded processors with Recovery (SRTR) that enhances a previously proposed scheme for transient error detection, called Simultaneously and Redundantly Threaded (SRT) processors, to include transient fault recovery. SRT replicates an application i ..."
Abstract - Cited by 120 (3 self)
We propose Simultaneously and Redundantly Threaded processors with Recovery (SRTR), which enhances a previously proposed scheme for transient error detection, called Simultaneously and Redundantly Threaded (SRT) processors, to include transient fault recovery. SRT replicates an application into two communicating threads, one executing ahead of the other. The leading thread communicates the values it produces to the trailing thread, which repeats the computation and compares the values produced by the two threads. SRT's leading instructions may commit before checking for errors, relying on the trailing thread to detect errors. SRTR, on the other hand, must not allow any leading instruction to commit before checking, since a faulty instruction cannot be undone once the instruction commits. To avoid leading instructions stalling at commit while waiting for their trailing counterparts, SRTR exploits the time between completion and commit of a leading instruction. SRTR compares the leading and trailing values as soon as the trailing instruction completes, typically before the leading instruction reaches the commit point. To avoid increasing the bandwidth demand on the register file for checking register values, SRTR uses the register value queue (RVQ) to hold register values for checking. To reduce the bandwidth pressure on the RVQ itself, SRTR employs dependence-based checking elision (DBCE). By reasoning that faults propagate through dependent instructions, DBCE exploits register (true) dependence chains so that only the last instruction in a chain uses the RVQ to check leading and trailing values. The performance of SRTR is within 1% and 7% of the SRT performance for SPEC95 integer and floating-point programs, respectively. While SRTR without DBCE incurs about 18% performance loss when the number of RVQ ports is reduced from four (which is equivalent to an unlimited number) to two ports, with DBCE, a two-ported RVQ performs within 2% of a four-ported RVQ.
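DBCE is the most algorithmic piece of this abstract and can be sketched directly: walk the instruction window, track which instruction last produced each register, and check only instructions whose results no later instruction consumes (the chain tails). A simplified model assuming each instruction is just a (destination, sources) pair; the real mechanism operates on an in-flight instruction window in hardware.

```python
def dbce_check_set(instructions):
    """Dependence-based checking elision (DBCE) sketch: since a fault
    propagates along a register dependence chain, only an instruction
    whose result no later instruction consumes needs its value compared
    against the redundant thread. Each instruction is (dest, srcs)."""
    latest_producer = {}                  # register -> producing index
    consumed = [False] * len(instructions)
    for i, (dest, srcs) in enumerate(instructions):
        for src in srcs:
            if src in latest_producer:
                consumed[latest_producer[src]] = True
        latest_producer[dest] = i
    # chain tails: the values that must go through the RVQ for checking
    return [i for i, used in enumerate(consumed) if not used]
```

For a chain r1 → r2 → r3 plus an independent r4, only the producers of r3 and r4 need checking; a fault in the r1 or r2 producers would corrupt r3 and be caught there, which is exactly how DBCE cuts RVQ bandwidth.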
Dynamic Speculative Precomputation, 2001
"... A large number of memory accesses in memory-bound applications are irregular, such as pointer dereferences, and can be effectively targeted by thread-based prefetching techniques like Speculative Precomputation. These techniques execute instructions, for example on an available SMT thread context, t ..."
Abstract - Cited by 105 (10 self)
A large number of memory accesses in memory-bound applications are irregular, such as pointer dereferences, and can be effectively targeted by thread-based prefetching techniques like Speculative Precomputation. These techniques execute instructions, for example on an available SMT thread context, that have been extracted directly from the program they are trying to accelerate. Proposed techniques typically require manual user intervention to extract and optimize instruction sequences. This paper proposes Dynamic Speculative Precomputation, which performs all necessary instruction analysis, extraction, and optimization through the use of back-end instruction analysis hardware, located off the processor's critical path. For a set of memory-limited benchmarks an average speedup of 14% is achieved when constructing simple p-slices, and this gain grows to 33% when making use of aggressive optimizations.