Results 1 - 10 of 475
The SimpleScalar tool set, version 2.0
- Computer Architecture News, 1997
"... This report describes release 2.0 of the SimpleScalar tool set, a suite of free, publicly available simulation tools that offer both detailed and high-performance simulation of modern microprocessors. The new release offers more tools and capabilities, precompiled binaries, cleaner interfaces, bette ..."
Abstract - Cited by 1844 (43 self)
This report describes release 2.0 of the SimpleScalar tool set, a suite of free, publicly available simulation tools that offer both detailed and high-performance simulation of modern microprocessors. The new release offers more tools and capabilities, precompiled binaries, cleaner interfaces, better documentation, easier installation, improved portability, and higher performance. This report contains a complete description of the tool set, including retrieval and installation instructions, a description of how to use the tools, a description of the target SimpleScalar architecture, and many details about the internals of the tools and how to customize them. With this guide, the tool set can be brought up and generating results in under an hour (on supported platforms).
Complexity-effective superscalar processors
- In Proceedings of the 24th Annual International Symposium on Computer Architecture, 1997
"... The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are ana-lyzed. Each is modeled and Spice simulated for ..."
Abstract - Cited by 467 (5 self)
The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and SPICE-simulated for feature sizes of 0.8µm, 0.35µm, and 0.18µm. Performance results and trends are expressed in terms of issue width and window size. Our analysis indicates that window wakeup and selection logic as well as operand bypass logic are likely to be the most critical in the future. A microarchitecture that simplifies wakeup and selection logic is proposed and discussed. This implementation puts chains of dependent instructions into queues, and issues instructions from multiple queues in parallel. Simulation shows little slowdown as compared with a completely flexible issue window when performance is measured in clock cycles. Furthermore, because only instructions at queue heads need to be awakened and selected, issue logic is simplified and the clock cycle is faster; consequently, overall performance is improved. By grouping dependent instructions together, the proposed microarchitecture will help minimize performance degradation due to slow bypasses in future wide-issue machines.
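To make the queue-based issue idea above concrete, here is a minimal Python sketch of dependence-based FIFO steering; the class name, interfaces, and the exact steering heuristic are illustrative assumptions, not the paper's implementation: an instruction joins the FIFO whose tail is one of its producers, otherwise it takes an empty FIFO, and only FIFO heads compete for issue.

```python
from collections import deque

class FifoSteering:
    """Illustrative sketch: steer chains of dependent instructions into FIFOs
    so that only the FIFO heads need wakeup/select logic."""

    def __init__(self, num_fifos=4):
        self.fifos = [deque() for _ in range(num_fifos)]

    def dispatch(self, insn, producers):
        """Place insn behind one of its producers if that producer is a FIFO tail;
        otherwise start an empty FIFO. Returns False if dispatch must stall."""
        for fifo in self.fifos:
            if fifo and fifo[-1] in producers:
                fifo.append(insn)
                return True
        for fifo in self.fifos:
            if not fifo:
                fifo.append(insn)
                return True
        return False

    def issue(self, ready):
        """Only instructions at FIFO heads compete for issue this cycle."""
        issued = []
        for fifo in self.fifos:
            if fifo and fifo[0] in ready:
                issued.append(fifo.popleft())
        return issued

# Tiny example: i2 depends on i1, i3 is independent.
s = FifoSteering()
s.dispatch("i1", producers=set())
s.dispatch("i2", producers={"i1"})   # lands behind i1 in the same FIFO
s.dispatch("i3", producers=set())    # starts its own FIFO
print(s.issue(ready={"i1", "i3"}))   # ['i1', 'i3']; i2 waits until i1 completes
```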
The predictability of data values
- In Proceedings of the 30th International Symposium on Microarchitecture, 1997
"... ..."
(Show Context)
The M5 simulator: Modeling networked systems
- IEEE Micro, 2006
"... TCP/IP networking is an increasingly important aspect of computer systems, but a lack of simulation tools limits architects ’ ability to explore new designs for network I/O. We have developed the M5 simulator specif-ically to enable research in this area. In addition to typical architecture simulato ..."
Abstract - Cited by 249 (22 self)
TCP/IP networking is an increasingly important aspect of computer systems, but a lack of simulation tools limits architects' ability to explore new designs for network I/O. We have developed the M5 simulator specifically to enable research in this area. In addition to typical architecture simulator attributes, M5 provides features necessary for simulating networked hosts, including full-system capability, a detailed I/O subsystem, and the ability to simulate multiple networked systems deterministically. Our experience in simulating network workloads revealed some unexpected interactions between TCP and the common simulation acceleration techniques of sampling and warm-up. We have successfully validated M5's simulated performance results against real machines, indicating that our models and methodology adequately capture the salient characteristics of these systems. M5's usefulness as a general-purpose architecture simulator and its liberal open-source license have led to its adoption by several other academic and commercial groups. Keywords: computer architecture, simulation, simulation software, interconnected systems
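As a side note on the "simulate multiple networked systems deterministically" point, the toy Python sketch below (not M5 code; all names and numbers are invented) shows the underlying idea: when every host and every packet delivery is an event on one global, timestamp-ordered queue, repeated runs replay in exactly the same order.

```python
import heapq

class ToyNetworkSim:
    """Toy discrete-event simulator: two hosts joined by a fixed-latency link,
    all driven from a single global event queue, so execution is deterministic."""

    def __init__(self, link_latency_ns=1000):
        self.now = 0
        self.link_latency = link_latency_ns
        self.events = []   # (time, sequence number, action)
        self.seq = 0       # tie-breaker keeps same-time events in insertion order

    def schedule(self, when, action):
        heapq.heappush(self.events, (when, self.seq, action))
        self.seq += 1

    def send(self, src, dst, payload):
        # Delivering a packet is just a future event on the same queue.
        self.schedule(self.now + self.link_latency,
                      lambda: print(f"{self.now} ns: {dst} received {payload!r} from {src}"))

    def run(self):
        while self.events:
            self.now, _, action = heapq.heappop(self.events)
            action()

sim = ToyNetworkSim()
sim.schedule(0, lambda: sim.send("hostA", "hostB", "ping"))
sim.schedule(500, lambda: sim.send("hostB", "hostA", "pong"))
sim.run()   # prints the same two deliveries, in the same order, on every run
```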
AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors
- 1999
"... This paper speculates that technology trends pose new challenges for fault tolerance in microprocessors. Specifically, severely reduced design tolerances implied by gigaherz clock rates may result in frequent and arbitrary transient faults. We suggest that existing fault-tolerant techniques -- syste ..."
Abstract - Cited by 246 (8 self)
This paper speculates that technology trends pose new challenges for fault tolerance in microprocessors. Specifically, severely reduced design tolerances implied by gigahertz clock rates may result in frequent and arbitrary transient faults. We suggest that existing fault-tolerant techniques -- system-level, gate-level, or component-specific approaches -- are either too costly for general-purpose computing, overly intrusive to the design, or insufficient for covering arbitrary logic faults. An approach in which the microarchitecture itself provides fault tolerance is required. We propose a new time redundancy fault-tolerant approach in which a program is duplicated and the two redundant programs simultaneously run on the processor. The technique exploits several significant microarchitectural trends to provide broad coverage of transient faults and restricted coverage of permanent faults. These trends are simultaneous multithreading, control flow and data flow prediction, and hierarchi...
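The core time-redundancy idea (one program run twice, with the trailing copy checking the leading copy through a delay buffer) can be sketched functionally; the Python below is an analogy with invented names, not the AR-SMT microarchitecture:

```python
from collections import deque

def ar_smt_style_check(program, lag=4, fault_at=None):
    """Functional analogy of time redundancy: the active (A) stream runs `lag`
    instructions ahead, its results wait in a delay buffer, and the trailing
    redundant (R) stream recomputes and compares each one."""
    delay_buffer = deque()

    def check(entry):
        index, committed = entry
        return None if program[index]() == committed else index

    for i, insn in enumerate(program):
        result = insn()
        if i == fault_at:
            result ^= 1                     # model a transient bit flip in the A-stream
        delay_buffer.append((i, result))
        if len(delay_buffer) > lag:         # R-stream trails by `lag` instructions
            bad = check(delay_buffer.popleft())
            if bad is not None:
                return f"transient fault detected at instruction {bad}"
    while delay_buffer:                     # R-stream drains the remaining entries
        bad = check(delay_buffer.popleft())
        if bad is not None:
            return f"transient fault detected at instruction {bad}"
    return "no fault detected"

program = [lambda a=a: (a * 7 + 3) & 0xFF for a in range(16)]
print(ar_smt_style_check(program))               # no fault detected
print(ar_smt_style_check(program, fault_at=9))   # transient fault detected at instruction 9
```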
Memory bandwidth limitations of future microprocessors
- In Proceedings of the 23rd Annual International Symposium on Computer Architecture, 1996
"... This paper makes the case that pin bandwidth will be a critical consideration for future microprocessors. We show that many of the techniques used to tolerate growing memory latencies do so at the expense of increased bandwidth requirements. Using a decomposition of execution time, we show that for ..."
Abstract - Cited by 226 (12 self)
This paper makes the case that pin bandwidth will be a critical consideration for future microprocessors. We show that many of the techniques used to tolerate growing memory latencies do so at the expense of increased bandwidth requirements. Using a decomposition of execution time, we show that for modern processors that employ aggressive memory latency tolerance techniques, wasted cycles due to insufficient bandwidth generally exceed those due to raw memory latencies. Given the importance of maximizing memory bandwidth, we calculate effective pin bandwidth, then estimate optimal effective pin bandwidth. We measure these quantities by determining the amount by which both caches and minimal-traffic caches filter accesses to the lower levels of the memory hierarchy. We see that there is a gap that can exceed two orders of magnitude between the total memory traffic generated by caches and the minimal-traffic caches—implying that the potential exists to increase effective pin bandwidth substantially. We decompose this traffic gap into four factors, and show they contribute quite differently to traffic reduction for different benchmarks. We conclude that, in the short term, pin bandwidth limitations will make more complex on-chip caches cost-effective. For example, flexible caches may allow individual applications to choose from a range of caching policies. In the long term, we predict that off-chip accesses will be so expensive that all system memory will reside on one or more processor chips.
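The abstract does not give the formulas, but the effect of traffic filtering on effective pin bandwidth can be illustrated with an assumed definition and made-up numbers (everything below is an assumption for illustration, not the paper's data): scale the raw pin bandwidth by how much the cache hierarchy filters the program's demand traffic before it reaches the pins.

```python
def effective_pin_bandwidth(raw_pin_bw_gbs, demand_traffic_gb, offchip_traffic_gb):
    """Assumed illustrative definition: raw pin bandwidth scaled by the cache's
    traffic-filtering ratio (demand traffic / traffic actually sent off chip)."""
    return raw_pin_bw_gbs * (demand_traffic_gb / offchip_traffic_gb)

raw_bw = 1.6                  # GB/s across the pins (made-up)
demand = 400.0                # GB of data the program references (made-up)
real_cache_traffic = 40.0     # GB a realistic cache sends off chip (made-up)
minimal_cache_traffic = 0.5   # GB a minimal-traffic cache would send (made-up)

print(effective_pin_bandwidth(raw_bw, demand, real_cache_traffic))     # 16.0 GB/s
print(effective_pin_bandwidth(raw_bw, demand, minimal_cache_traffic))  # 1280.0 GB/s

# The ~80x gap between the two traffic figures is the kind of "orders of magnitude"
# headroom the paper argues better caching policies could reclaim.
```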
Dynamic instruction reuse
- In Proceedings of the 24th Annual International Symposium on Computer Architecture (ISCA’97), 1997
"... ..."
Slipstream processors: improving both performance and fault tolerance
- In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems
"... Processors execute the full dynamic instruction stream to arrive at the final output of a program, yet there exist shorter instruction streams that produce the same overall effect. We propose creating a shorter but otherwise equivalent version of the original program by removing ineffectual computat ..."
Abstract - Cited by 187 (6 self)
Processors execute the full dynamic instruction stream to arrive at the final output of a program, yet there exist shorter instruction streams that produce the same overall effect. We propose creating a shorter but otherwise equivalent version of the original program by removing ineffectual computation and computation related to highly predictable control flow. The shortened program is run concurrently with the full program on a chip multiprocessor or simultaneous multithreaded processor, with two key advantages: 1) Improved single-program performance. The shorter program speculatively runs ahead of the full program and supplies the full program with control and data flow outcomes. The full program executes efficiently due to the communicated outcomes, at the same time validating the speculative, shorter program. The two programs combined run faster than the original program alone. Detailed simulations of an example implementation show an average improvement of 7% for the SPEC95 integer benchmarks. 2) Fault tolerance. The shorter program is a subset of the full program, and this partial redundancy is transparently leveraged for detecting and recovering from transient hardware faults.
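A toy functional analogy of the A-stream/R-stream arrangement (Python, with invented names; not the hardware design): the shortened advance stream skips instructions predicted ineffectual and pushes its outcomes into a delay buffer, and the full redundant stream uses them as predictions while validating each one.

```python
from collections import deque

def run_slipstream_pair(program, is_ineffectual):
    """Toy slipstream analogy: the advance (A) stream executes a shortened program
    and forwards outcomes; the redundant (R) stream executes everything and checks."""
    delay_buffer = deque()

    # Advance stream: skip instructions predicted ineffectual, forward the rest.
    for i, insn in enumerate(program):
        if not is_ineffectual(i):
            delay_buffer.append((i, insn()))

    # Redundant stream: full program; buffered outcomes act as (checked) predictions.
    mismatches = 0
    for i, insn in enumerate(program):
        result = insn()
        if delay_buffer and delay_buffer[0][0] == i:
            _, predicted = delay_buffer.popleft()
            if predicted != result:
                mismatches += 1   # a real slipstream pair would recover here
    return mismatches

program = [lambda a=a: (a * a) % 7 for a in range(20)]
print(run_slipstream_pair(program, is_ineffectual=lambda i: i % 4 == 0))  # 0 mismatches
```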
Dependence Based Prefetching for Linked Data Structures
- 1998
"... We introduce a dynamic scheme that captures the access patterns of linked data structures and can be used to predict future accesses with high accuracy. Our technique exploits the dependence relationships that exist between loads that produce addresses and loads that consume these addresses. By iden ..."
Abstract - Cited by 179 (13 self)
We introduce a dynamic scheme that captures the access patterns of linked data structures and can be used to predict future accesses with high accuracy. Our technique exploits the dependence relationships that exist between loads that produce addresses and loads that consume these addresses. By identifying producer-consumer pairs, we construct a compact internal representation for the associated structure and its traversal. To achieve a prefetching effect, a small prefetch engine speculatively traverses this representation ahead of the executing program. Dependence-based prefetching achieves speedups of up to 25% on a suite of pointer-intensive programs.
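A minimal sketch of the producer-consumer idea (Python; the table layout and interface are invented for illustration): record which field offset a consumer load applies to the value produced by a given load PC, then let a small prefetch engine walk that many links ahead of the program.

```python
class DependencePrefetcher:
    """Illustrative sketch: correlate producer loads with the offset their consumer
    loads dereference, then speculatively traverse a few links ahead."""

    def __init__(self, depth=3):
        self.correlation = {}   # producer load PC -> pointer-field offset
        self.depth = depth
        self.prefetched = []

    def train(self, producer_pc, offset):
        """Learn that the consumer load adds `offset` to the producer's value."""
        self.correlation[producer_pc] = offset

    def on_load_commit(self, pc, value, memory):
        """When a correlated producer load commits, walk `depth` links ahead."""
        offset = self.correlation.get(pc)
        addr = value
        for _ in range(self.depth):
            if offset is None or addr is None:
                break
            addr = memory.get(addr + offset)   # follow node->next (the learned field)
            if addr is not None:
                self.prefetched.append(addr)   # issue a prefetch for the next node

# Flat "memory" holding a 4-node linked list; each node stores its next pointer at offset 0.
memory = {100: 200, 200: 300, 300: 400, 400: None}
pf = DependencePrefetcher()
pf.train(producer_pc=0x40, offset=0)             # the load at PC 0x40 produces node->next
pf.on_load_commit(pc=0x40, value=100, memory=memory)
print(pf.prefetched)                             # [200, 300, 400]
```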
Trace processors
- In Proceedings of the 30th International Symposium on Microarchitecture, 1997
"... ..."
(Show Context)