Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors (2003)

by H. Akkary
Venue: MICRO-36

Results 1 - 10 of 175

Virtualizing Transactional Memory

by Ravi Rajwar, et al., 2005
Abstract - Cited by 337 (3 self)
Writing concurrent programs is difficult because of the complexity of ensuring proper synchronization. Conventional lock-based synchronization suffers from well-known limitations, so researchers have considered nonblocking transactions as an alternative. Recent hardware proposals have demonstrated how transactions can achieve high performance while not suffering the limitations of lock-based mechanisms. However, current hardware proposals require programmers to be aware of platform-specific resource limitations such as buffer sizes and scheduling quanta, as well as events such as page faults and process migrations. If the transactional model is to gain wide acceptance, hardware support for transactions must be virtualized to hide these limitations in much the same way that virtual memory shields the programmer from platform-specific limitations of physical memory. This paper proposes Virtual Transactional Memory (VTM), a user-transparent system that shields the programmer from various platform-specific resource limitations. VTM maintains the performance advantage of hardware transactions, incurs low overhead in time, and has modest costs in hardware support. While many system-level challenges remain, VTM takes a step toward making transactional models more widely acceptable.
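The virtual-memory analogy invites a concrete picture. Below is a minimal, hypothetical sketch (names such as VirtualizedTxn and HW_CAPACITY are invented; VTM's real overflow tables and hardware walkers are far more involved) of a transaction whose speculative writes spill from a small fixed hardware buffer into a software-managed overflow area, so the capacity limit never surfaces to the programmer.

```python
# Hypothetical sketch of the virtualization idea, not VTM's actual design:
# speculative writes overflow transparently from a bounded "hardware" buffer
# into an unbounded software-managed area.
class VirtualizedTxn:
    HW_CAPACITY = 4  # stands in for a platform-specific buffer size

    def __init__(self, memory):
        self.memory = memory     # committed state (dict of addr -> value)
        self.hw_buffer = {}      # fast, bounded speculative write buffer
        self.overflow = {}       # software-managed spill area

    def write(self, addr, value):
        if addr in self.overflow:
            self.overflow[addr] = value   # already spilled: update in place
        elif addr in self.hw_buffer or len(self.hw_buffer) < self.HW_CAPACITY:
            self.hw_buffer[addr] = value
        else:
            self.overflow[addr] = value   # transparently exceeds hw limits

    def read(self, addr):
        # Speculative state shadows committed memory.
        if addr in self.hw_buffer:
            return self.hw_buffer[addr]
        if addr in self.overflow:
            return self.overflow[addr]
        return self.memory.get(addr)

    def commit(self):
        self.memory.update(self.hw_buffer)
        self.memory.update(self.overflow)

mem = {}
txn = VirtualizedTxn(mem)
for a in range(6):               # six writes overflow a four-entry buffer
    txn.write(a, a * 100)
txn.commit()                     # all six writes land; none was lost
```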

Citation Context

...size, and the number of hash functions can be found in the literature (see, for example, [6]). Experimental work [21] suggests that a family of linear congruences works well for hash functions. Others [1, 23] have discussed hardware filter implementations. XF representation and design: A concrete representation of a counting Bloom filter must address several questions: how many hash functions and counters...
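To make the snippet's design questions concrete, here is a small counting Bloom filter sketch using a linear-congruence hash family of the kind it mentions; the class name, parameter choices, and saturating 4-bit counters are illustrative assumptions, not details from the cited papers.

```python
import random

class CountingBloomFilter:
    """Counting Bloom filter sketch: per-slot counters allow deletion."""
    def __init__(self, num_counters=1024, num_hashes=4, counter_bits=4, seed=0):
        self.m = num_counters
        self.max_count = (1 << counter_bits) - 1   # saturating counters
        self.counters = [0] * num_counters
        rng = random.Random(seed)
        # Family of linear congruences h_i(x) = (a_i * x + b_i) mod m.
        self.params = [(rng.randrange(1, self.m), rng.randrange(self.m))
                       for _ in range(num_hashes)]

    def _indexes(self, x):
        return [(a * x + b) % self.m for a, b in self.params]

    def insert(self, x):
        for i in self._indexes(x):
            if self.counters[i] < self.max_count:
                self.counters[i] += 1

    def remove(self, x):
        for i in self._indexes(x):
            if self.counters[i] > 0:
                self.counters[i] -= 1

    def may_contain(self, x):
        # False positives possible; no false negatives (barring overflow).
        return all(self.counters[i] > 0 for i in self._indexes(x))

f = CountingBloomFilter()
f.insert(42)
assert f.may_contain(42)
f.remove(42)                     # counting is what makes deletion possible
```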

Continual flow pipelines

by Srikanth T. Srinivasan, Ravi Rajwar, Haitham Akkary, Amit Gandhi, Mike Upton - In International Conference on Architectural Support for Programming Languages and Operating Systems, 2004
Abstract - Cited by 107 (0 self)
Increased integration in the form of multiple processor cores on a single die, relatively constant die sizes, shrinking power envelopes, and emerging applications create a new challenge for processor architects: how can a processor provide high single-thread performance, allow multiple such cores to be placed on the same die for high throughput, and dynamically adapt to future applications? Conventional approaches to high single-thread performance rely on large and complex cores to sustain a large instruction window for memory tolerance, making them unsuitable for multi-core chips. We present Continual Flow Pipelines (CFP), a new non-blocking processor pipeline architecture that achieves the performance of a large instruction window without requiring cycle-critical structures such as the scheduler and register file to be large. We show that to achieve the benefits of a large instruction window, inefficiencies in the management of both the scheduler and register file must be addressed, and we propose a unified solution. The non-blocking property of CFP keeps the key processor structures affecting cycle time and power (scheduler, register file) and die size (second-level cache) small. The memory-latency-tolerant CFP core allows multiple cores on a single die while outperforming current processor cores on single-thread applications.
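A toy sketch of the non-blocking idea may help; it uses register poisoning as a stand-in for CFP's slice-out machinery, and the Insn/dispatch names are invented. Instructions dependent on a cache-missing load drain into a slice buffer instead of occupying scheduler entries, while independent instructions keep flowing.

```python
from collections import deque

class Insn:
    def __init__(self, name, srcs, dst):
        self.name, self.srcs, self.dst = name, srcs, dst

def dispatch(insns, missing_regs):
    """Split the stream into independent insns (execute now) and the
    miss-dependent slice (deferred until the load data returns)."""
    poisoned = set(missing_regs)       # regs whose values are unavailable
    ready, slice_buffer = [], deque()
    for insn in insns:
        if any(s in poisoned for s in insn.srcs):
            poisoned.add(insn.dst)     # propagate the 'not available' tag
            slice_buffer.append(insn)  # drains out of the scheduler
        else:
            ready.append(insn)         # executes without blocking
    return ready, slice_buffer

# Example: a load to r1 misses; its dependents defer, others flow past it.
prog = [Insn("add r2,r1,r3", ["r1", "r3"], "r2"),
        Insn("mul r4,r5,r6", ["r5", "r6"], "r4"),
        Insn("sub r7,r2,r4", ["r2", "r4"], "r7")]
ready, deferred = dispatch(prog, {"r1"})
assert [i.name for i in ready] == ["mul r4,r5,r6"]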

Citation Context

...o the pipeline. The CFP concept is applicable to a broad range of processor architectures (see Section 4.3). In this paper we use Checkpoint Processing and Recovery (CPR) as the baseline architecture [2] since it has been shown to outperform conventional ROB-based architectures. CPR is a reorder-buffer-free architecture requiring a small number of rename-map table checkpoints selectively created at lo...

Fingerprinting: Bounding Soft-Error Detection Latency and Bandwidth

by Jared C. Smolens, Brian T. Gold, Jangwoo Kim, Babak Falsafi, James C. Hoe, Andreas G. Nowatzyk - In Proc. of the Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2004
Abstract - Cited by 80 (8 self)
Recent studies have suggested that the soft-error rate in microprocessor logic will become a reliability concern by 2010. This paper proposes an efficient error detection technique, called fingerprinting, that detects differences in execution across a dual modular redundant (DMR) processor pair. Fingerprinting summarizes a processor's execution history in a hash-based signature; differences between two mirrored processors are exposed by comparing their fingerprints. Fingerprinting tightly bounds detection latency and greatly reduces the interprocessor communication bandwidth required for checking. This paper presents a study that evaluates fingerprinting against a range of current approaches to error detection. The result of this study shows that fingerprinting is the only error detection mechanism that simultaneously allows high error coverage, low error detection bandwidth, and high I/O performance.
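The mechanism is easy to sketch. In the hedged example below, each core of a DMR pair folds its retired register updates into a running CRC32, and only the compact fingerprints cross between the cores; the CRC choice and update encoding are assumptions, not the paper's hash.

```python
import zlib

def fingerprint(updates):
    """Fold a sequence of (register, value) retirement updates into a CRC32."""
    fp = 0
    for reg, val in updates:
        fp = zlib.crc32(f"{reg}={val};".encode(), fp)
    return fp

def check_interval(core_a_updates, core_b_updates):
    """Compare compact fingerprints instead of full state: detection latency
    is bounded by the interval, and only one word crosses between cores."""
    return fingerprint(core_a_updates) == fingerprint(core_b_updates)

# A single flipped value anywhere in the interval changes the fingerprint.
good = [("r1", 5), ("r2", 9)]
bad = [("r1", 5), ("r2", 8)]     # soft error corrupted r2
assert check_interval(good, good)
assert not check_interval(good, bad)
```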

Citation Context

...ward error recovery schemes is the checkpoint mechanism. Microarchitectural techniques work for short checkpoint intervals (thousands of instructions). The Checkpoint Processing and Recovery proposal [1] scales the out-of-order execution window by building a large, hierarchical store buffer and aggressively reclaiming physical registers. The SC++lite [7] mechanism speculatively allows values into the...

A case for MLP-aware cache replacement

by Moinuddin K. Qureshi, Daniel N. Lynch, Onur Mutlu, Yale N. Patt - In ISCA, 2006
Abstract - Cited by 78 (14 self)
Performance loss due to long-latency memory accesses can be reduced by servicing multiple memory accesses concurrently. The notion of generating and servicing long-latency cache misses in parallel is called Memory Level Parallelism (MLP). MLP is not uniform across cache misses – some misses occur in isolation while some occur in parallel with other misses. Isolated misses are more costly on performance than parallel misses. However, traditional cache replacement is not aware of the MLP-dependent cost differential between different misses. Cache replacement, if made MLP-aware, can improve performance by reducing the number of performance-critical isolated misses. This paper makes two key contributions. First, it proposes a framework for MLP-aware cache replacement by using a run-time technique to compute the MLP-based cost for each cache miss. It then describes a simple cache replacement mechanism that takes both MLP-based cost and recency into account. Second, it proposes a novel, low-hardware-overhead mechanism called Sampling Based Adaptive Replacement (SBAR), to dynamically choose between an MLP-aware and a traditional replacement policy, depending on which one is more effective at reducing the number of memory-related stalls. Evaluations with the SPEC CPU2000 benchmarks show that MLP-aware cache replacement can improve performance by as much as 23%.
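A rough sketch of the idea, under simplifying assumptions (the amortized cost formula and the linear recency-plus-cost victim choice below are stand-ins for the paper's mechanism): isolated misses are assigned a high cost, parallel misses a low one, and replacement weighs that cost against recency.

```python
def mlp_cost(outstanding_misses, miss_latency=400):
    # A miss serviced alongside N-1 others amortizes its stall over N misses.
    return miss_latency / max(outstanding_misses, 1)

class MLPAwareSet:
    """One cache set whose victim choice combines recency and MLP cost."""
    def __init__(self, ways=4, weight=0.25):
        self.ways, self.weight = ways, weight
        self.clock = 0
        self.blocks = {}                      # tag -> (last_use, mlp_cost)

    def touch(self, tag, cost=0.0):
        self.clock += 1
        if tag not in self.blocks and len(self.blocks) >= self.ways:
            # Victim minimizes recency plus weighted MLP cost, so an
            # isolated-miss block outlives an equally old parallel-miss block.
            victim = min(self.blocks,
                         key=lambda t: self.blocks[t][0]
                                       + self.weight * self.blocks[t][1])
            del self.blocks[victim]
        self.blocks[tag] = (self.clock, cost)

s = MLPAwareSet(ways=2)
s.touch("A", mlp_cost(1))   # isolated miss: cost 400
s.touch("B", mlp_cost(4))   # one of four parallel misses: cost 100
s.touch("C", mlp_cost(4))   # evicts B, not the older-but-costlier A
assert "A" in s.blocks and "B" not in s.blocks
```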

Citation Context

...can reduce the cycles per instruction incurred due to L2 misses. An out-of-order engine's ability to increase MLP is limited by the instruction window size. Several proposals [15][1][4][25] have looked at the problem of scaling the instruction window for out-of-order processors. Chou et al. [3] analyzed the effectiveness of different microarchitectural techniques such as out-of-o...

Out-of-order commit processors

by Adrian Cristal, Josep Llosa, Mateo Valero - In Proceedings of the 10th International Symposium on High Performance Computer Architecture, 2004
"... Modern out-of-order processors tolerate long latency memory operations by supporting a large number of inflight instructions. This is particularly useful in numerical applications where branch speculation is normally not a problem and where the cache hierarchy is not capable of delivering the data s ..."
Abstract - Cited by 59 (12 self) - Add to MetaCart
Modern out-of-order processors tolerate long-latency memory operations by supporting a large number of in-flight instructions. This is particularly useful in numerical applications, where branch speculation is normally not a problem and where the cache hierarchy is not capable of delivering the data soon enough. In order to support more in-flight instructions, several resources have to be up-sized, such as the Reorder Buffer (ROB), the general-purpose instruction queues, the load/store queue, and the number of physical registers in the processor. However, scaling up the number of entries in these resources is impractical because of area, cycle time, and power consumption constraints. In this paper we propose to increase the capacity of future processors by augmenting the number of in-flight instructions. Instead of simply up-sizing resources, we push for new and novel microarchitectural structures that achieve the same performance benefits but with a much lower need for resources. Our main contribution is a new checkpointing mechanism that is capable of keeping thousands of in-flight instructions at a practically constant cost. We also propose a queuing mechanism that takes advantage of the differences in waiting time of the instructions in the flow. Using these two mechanisms, our processor suffers a performance degradation of only 10% on SPEC2000fp relative to a conventional processor requiring more than an order of magnitude more entries in the ROB and instruction queues, and achieves about a 200% improvement over a current processor with a similar number of entries.
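The constant-cost checkpointing idea can be illustrated with a minimal rename-map sketch; the CheckpointedRenamer class and its policy of checkpointing at selected instructions are assumptions for illustration, not the paper's exact design.

```python
class CheckpointedRenamer:
    """A checkpoint is a snapshot of the rename map taken at selected points;
    recovery restores the nearest prior snapshot instead of walking a ROB."""
    def __init__(self):
        self.rename_map = {}        # arch reg -> phys reg
        self.checkpoints = []       # stack of (insn_id, saved map)

    def take_checkpoint(self, insn_id):
        # Constant-cost snapshot: a handful of these replaces per-insn ROB state.
        self.checkpoints.append((insn_id, dict(self.rename_map)))

    def rename(self, arch_reg, phys_reg):
        self.rename_map[arch_reg] = phys_reg

    def recover(self, bad_insn_id):
        # Restore the most recent checkpoint at or before the faulting insn.
        while self.checkpoints and self.checkpoints[-1][0] > bad_insn_id:
            self.checkpoints.pop()
        insn_id, saved = self.checkpoints[-1]
        self.rename_map = dict(saved)
        return insn_id              # re-fetch restarts from here

r = CheckpointedRenamer()
r.take_checkpoint(insn_id=0)          # e.g., at a low-confidence branch
r.rename("r1", "p7")
r.take_checkpoint(insn_id=10)
r.rename("r1", "p9")
restart = r.recover(bad_insn_id=12)   # mispredict detected at insn 12
assert restart == 10 and r.rename_map["r1"] == "p7"
```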

Citation Context

... and loads already pre-issued. This piece of work follows the conceptual path of [6]. At the time we were writing the final version of this paper for the conference proceedings, we received reference [3]. The latter paper presents mechanisms similar to those presented in [3], [20] and [10], and moreover a new and intelligent way of dealing with load-store queues. 7. Conclusions In order to tolerate i...

Reunion: Complexity-Effective Multicore Redundancy

by Jared C. Smolens, Brian T. Gold, Babak Falsafi, James C. Hoe - In International Symposium on Microarchitecture, 2006
Abstract - Cited by 55 (4 self)
Abstract not found

Citation Context

...m frequent serializing instructions. With increased comparison intervals, the number of these events remains constant, but the stall penalty increases. At forty cycles, the average performance penalty from checking is 17%. In contrast, the scientific workloads suffer from increased reorder buffer occupancy because they can saturate this resource, which decreases MLP. At a comparison latency of forty cycles, the average performance penalty is 11%. While space constraints limit more detailed analysis, larger speculation windows (e.g., thousands of instructions, as in checkpointing architectures [1]) completely eliminate the resource occupancy bottleneck, but cannot relieve stalls from serializing instructions. 5.3. Reunion Performance We first evaluate the performance penalty of relaxed input replication under Reunion, then explore Reunion’s sensitivity to comparison latencies. Unlike the strict input replication model, vocal and mute execution in Reunion is only loosely coupled across the cores. For non-serializing instructions, these differences can be absorbed by buffering in the check stage. However, serializing instructions expose the loose coupling because neither core can make fu...

Dual-core execution: building a highly scalable single-thread instruction window

by Huiyang Zhou, 2005
Abstract - Cited by 54 (3 self)
Current integration trends embrace the prosperity of single-chip multi-core processors. Although multi-core processors deliver significantly improved system throughput, single-thread performance is not addressed. In this paper, we propose a new execution paradigm that utilizes multi-cores on a single chip collaboratively to achieve high performance for single-thread memory-intensive workloads while maintaining the flexibility to support multithreaded applications. The proposed execution paradigm, dual-core execution, consists of two superscalar cores (a front and back processor) coupled with a queue. The front processor fetches and preprocesses instruction streams and retires processed instructions into the queue for the back processor to consume. The front processor executes instructions as usual except for cache-missing loads, which produce an invalid value instead of blocking the pipeline. As a result, the front processor runs far ahead to warm up the data caches and fix branch mispredictions for the back processor. In-flight instructions are distributed in the front processor, the queue, and the back processor, forming a very large instruction window for single-thread out-of-order execution. The proposed architecture incurs only minor hardware changes and does not require any large centralized structures such as large register files, issue queues, load/store queues, or reorder buffers. Experimental results show remarkable latency hiding capabilities of the proposed architecture, even outperforming more complex single-thread processors with much larger instruction windows than the front or back processor.
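A toy model of the front processor's behavior, with an invented INV sentinel and a plain deque standing in for the hardware queue: a cache-missing load yields INV rather than stalling, INV propagates through dependents, and every preprocessed result still streams into the queue for the back processor.

```python
from collections import deque

INV = object()                      # 'invalid value' token

def front_processor(insns, cache, result_queue):
    regs = {}
    for op, dst, srcs in insns:
        vals = [regs.get(s, 0) for s in srcs]
        if any(v is INV for v in vals):
            regs[dst] = INV         # dependents of a missing load stay invalid
        elif op == "load":
            # Hit: real value. Miss: INV, so the front core keeps running
            # ahead, warming caches and resolving branches for the back core.
            regs[dst] = cache.get(vals[0], INV)
        elif op == "add":
            regs[dst] = vals[0] + vals[1]
        result_queue.append((op, dst, regs[dst]))

q = deque()
front_processor([("load", "r1", ["r0"]), ("add", "r2", ["r1", "r1"])],
                cache={}, result_queue=q)   # empty cache: both results are INV
assert all(v is INV for _, _, v in q)
```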

Citation Context

...rocessors even with carefully designed memory hierarchy and prefetching mechanisms. Out-of-order execution can successfully hide long latencies if there are enough independent instructions to process [1], [11], [18], [20]. With the projected memory access latency being as high as hundreds of processor clock cycles, an instruction window needs to be very large to keep track of a high number of in-flig...

Toward kilo-instruction processors

by Adrián Cristal, Oliverio J. Santana, Mateo Valero - ACM Transactions on Architecture and Code Optimization, 2004
Abstract - Cited by 48 (4 self)
The continuously increasing gap between processor and memory speeds is a serious limitation to the performance achievable by future microprocessors. Currently, processors tolerate long-latency memory operations largely by maintaining a high number of in-flight instructions. In the future, this may require supporting many hundreds, or even thousands, of in-flight instructions. Unfortunately, the traditional approach of scaling up critical processor structures to provide such support is impractical at these levels, due to area, power, and cycle time constraints. In this paper we show that, in order to overcome this resource-scalability problem, the way in which critical processor resources are managed must be changed. Instead of simply upsizing the processor structures, we propose a smarter use of the available resources, supported by a selective checkpointing mechanism. This mechanism allows instructions to commit out of order, and makes a reorder buffer unnecessary. We present a set of techniques such as multilevel instruction queues, late allocation and early release of registers, and early release of load/store queue entries. All together, these techniques constitute what we call a kilo-instruction processor, an architecture that can support thousands of in-flight instructions, and thus may achieve high performance even in the presence of large memory access latencies.

Efficient Resource Sharing in Concurrent Error Detecting Superscalar Microarchitectures

by Jared C. Smolens, Jangwoo Kim, James C. Hoe, Babak Falsafi, 2004
Abstract - Cited by 47 (3 self)
Previous proposals for soft-error tolerance have called for redundantly executing a program as two concurrent threads on a superscalar microarchitecture. In a balanced superscalar design, the extra workload from redundant execution induces a severe performance penalty due to increased contention for resources throughout the datapath. This paper identifies and analyzes four key factors that affect the performance of redundant execution, namely 1) issue bandwidth and functional unit contention, 2) issue queue and reorder buffer capacity contention, 3) decode and retirement bandwidth contention, and 4) coupling between redundant threads' dynamic resource requirements. Based on this analysis, we propose the SHREC microarchitecture for asymmetric and staggered redundant execution. This microarchitecture addresses the four factors in an integrated design without requiring prohibitive additional hardware resources. In comparison to conventional single-threaded execution on a state-of-the-art superscalar microarchitecture with comparable cost, SHREC reduces the average performance penalty to within 4% on integer and 15% on floating-point SPEC2K benchmarks by sharing resources more efficiently between the redundant threads.

Checkpointed early load retirement

by Nevin Kırman, Meyrem Kırman, Mainak Chaudhuri, José F. Martínez - In Proceedings of the 11th International Symposium on High Performance Computer Architecture, 2005
Abstract - Cited by 47 (2 self)
Long-latency loads are critical in today's processors due to the ever-increasing speed gap with memory. Not only do these loads block the execution of dependent instructions, they also prevent other instructions from moving through the in-order reorder buffer (ROB) and retiring. As a result, the processor quickly fills up with uncommitted instructions, and computation ultimately stalls. To attack this problem, we propose checkpointed early load retirement, a mechanism that combines register checkpointing and back-end (i.e., at-retirement) load-value prediction. When a long-latency load hits the ROB head unresolved, the processor enters Clear mode by (1) taking a Checkpoint of the architectural registers, (2) supplying a Load-value prediction to consumers, and (3) EARly-retiring the long-latency load. This unclogs the ROB, thereby "clearing the way" for subsequent instructions to retire, and also allows instructions dependent on the long-latency load to execute sooner. When the actual value returns from memory, it is compared against the prediction. A misprediction causes the processor to roll back to the checkpoint, discarding all subsequent computation. The benefits of executing in Clear mode come from providing early forward progress on correct predictions, and from warming up caches and other structures on wrong predictions. Our evaluation shows that a Clear implementation with support for four checkpoints yields an average speedup of 1.12 for eleven integer and eight floating-point applications (1.27 and 1.19 for five integer and five floating-point memory-bound applications, respectively), relative to a contemporary out-of-order processor with an aggressive hardware prefetcher.
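The Clear sequence maps naturally onto a small sketch; the last-value predictor and the FIFO of outstanding checkpoints below are simple stand-ins rather than the paper's exact design.

```python
class ClearMode:
    """Checkpoint + predict + early-retire, validated when memory responds."""
    def __init__(self):
        self.last_value = {}   # load PC -> last value seen (the predictor)
        self.pending = []      # FIFO of (checkpoint, pc, predicted)

    def on_blocked_load(self, pc, arch_state):
        pred = self.last_value.get(pc, 0)
        self.pending.append((dict(arch_state), pc, pred))  # (1) Checkpoint
        return pred            # (2) supply prediction, (3) EARly-retire load

    def on_memory_return(self, actual, arch_state):
        checkpoint, pc, pred = self.pending.pop(0)         # oldest load first
        self.last_value[pc] = actual
        if pred != actual:     # misprediction: roll back, discard later work
            arch_state.clear()
            arch_state.update(checkpoint)

state = {"r1": 0}
clear = ClearMode()
state["r2"] = clear.on_blocked_load(pc=0x40, arch_state=state)  # ROB unclogged
state["r3"] = 99                       # speculative progress past the load
clear.on_memory_return(actual=7, arch_state=state)   # pred 0 != 7: roll back
assert "r3" not in state               # post-checkpoint work was discarded
```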

Citation Context

...erry, which recycles physical registers and load/store queue entries aggressively, and uses a combination of ROB and periodic checkpointing to support precise exceptions and interrupts. Akkary et al. [1] and Cristal et al. [9, 8] present ROB-less or quasi ROB-less micro-architectures, based on a multicheckpointing mechanism. None of these works incorporates any kind of load-value prediction mechanism...
