Evaluating Future Microprocessors: The SimpleScalar Tool Set (1996)

by T. M. Austin, D. Burger, S. Bennett

Results 1 - 10 of 475

The SimpleScalar tool set, version 2.0

by Doug Burger, Todd M. Austin - Computer Architecture News, 1997
"... This report describes release 2.0 of the SimpleScalar tool set, a suite of free, publicly available simulation tools that offer both detailed and high-performance simulation of modern microprocessors. The new release offers more tools and capabilities, precompiled binaries, cleaner interfaces, bette ..."
Abstract - Cited by 1844 (43 self) - Add to MetaCart
This report describes release 2.0 of the SimpleScalar tool set, a suite of free, publicly available simulation tools that offer both detailed and high-performance simulation of modern microprocessors. The new release offers more tools and capabilities, precompiled binaries, cleaner interfaces, better documentation, easier installation, improved portability, and higher performance. This report contains a complete description of the tool set, including retrieval and installation instructions, a description of how to use the tools, a description of the target SimpleScalar architecture, and many details about the internals of the tools and how to customize them. With this guide, the tool set can be brought up and generating results in under an hour (on supported platforms).
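
As a point of reference for readers new to the tool set, the fastest SimpleScalar simulator (sim-fast) is essentially a functional interpreter: fetch an instruction, apply its effect to architected state, repeat, with no timing model. The C sketch below illustrates that style of simulation loop on a made-up three-operand ISA; the opcodes, instruction format, and toy program are illustrative assumptions, not SimpleScalar's own definitions.

    /* Illustrative only: a functional fetch-decode-execute loop on a made-up
     * ISA, in the spirit of SimpleScalar's fastest simulator.  Nothing below
     * is taken from the tool's sources. */
    #include <stdint.h>
    #include <stdio.h>

    enum { OP_ADDI, OP_ADD, OP_BNEZ, OP_HALT };

    typedef struct { uint8_t op, rd, rs, rt; int16_t imm; } insn_t;

    int main(void) {
        /* toy program: r1 = 5; loop: r2 += r1; r1 -= 1; if (r1 != 0) goto loop */
        insn_t prog[] = {
            { OP_ADDI, 1, 0, 0,  5 },
            { OP_ADD,  2, 2, 1,  0 },
            { OP_ADDI, 1, 1, 0, -1 },
            { OP_BNEZ, 0, 1, 0,  1 },   /* branch target given as an absolute index */
            { OP_HALT, 0, 0, 0,  0 },
        };
        int64_t  r[8] = { 0 };          /* architected register file */
        uint64_t pc = 0, icount = 0;

        for (;;) {
            insn_t i = prog[pc++];
            icount++;
            switch (i.op) {
            case OP_ADDI: r[i.rd] = r[i.rs] + i.imm;   break;
            case OP_ADD:  r[i.rd] = r[i.rs] + r[i.rt]; break;
            case OP_BNEZ: if (r[i.rs] != 0) pc = (uint64_t)i.imm; break;
            case OP_HALT:
                printf("r2 = %lld after %llu instructions\n",
                       (long long)r[2], (unsigned long long)icount);
                return 0;
            }
            r[0] = 0;                   /* register 0 stays zero, MIPS-style */
        }
    }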

Citation Context

...reas the most detailed processor simulator simulates about 150,000 per second. The current release (version 2.0) of the tools is a major improvement over the previous release. Compared to version 1.0 [2], this release includes better documentation, enhanced performance, compatibility with more platforms, precompiled SPEC95 SimpleScalar binaries, cleaner interfaces, two new processor simulators, optio...

Complexity-effective superscalar processors

by Subbarao Palacharla, J. E. Smith, et al. - IN PROCEEDINGS OF THE 24TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, 1997
"... The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are ana-lyzed. Each is modeled and Spice simulated for ..."
Abstract - Cited by 467 (5 self) - Add to MetaCart
The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for feature sizes of 0.8 µm, 0.35 µm, and 0.18 µm. Performance results and trends are expressed in terms of issue width and window size. Our analysis indicates that window wakeup and selection logic as well as operand bypass logic are likely to be the most critical in the future. A microarchitecture that simplifies wakeup and selection logic is proposed and discussed. This implementation puts chains of dependent instructions into queues, and issues instructions from multiple queues in parallel. Simulation shows little slowdown as compared with a completely flexible issue window when performance is measured in clock cycles. Furthermore, because only instructions at queue heads need to be awakened and selected, issue logic is simplified and the clock cycle is faster – consequently overall performance is improved. By grouping dependent instructions together, the proposed microarchitecture will help minimize performance degradation due to slow bypasses in future wide-issue machines.
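
The core of the proposal is a steering heuristic that places each renamed instruction into one of a small number of FIFOs, so that dependent chains sit in program order within a FIFO and only the FIFO heads participate in wakeup and select. A minimal C sketch of that steering idea follows; the table layout, FIFO count, and fallback policy are simplified assumptions (the paper's heuristic additionally requires the producer to still be at the tail of its FIFO).

    /* Illustrative only: dependence-based steering of instructions into issue
     * FIFOs.  An instruction follows the FIFO that holds the most recent
     * producer of one of its sources; otherwise it starts in an empty FIFO. */
    #include <stdio.h>

    #define NFIFO 8
    #define NREG  32

    static int fifo_len[NFIFO];   /* occupancy of each issue FIFO               */
    static int reg_fifo[NREG];    /* FIFO holding the newest writer of each reg */

    /* Steer an instruction with dest rd and sources rs/rt (-1 = unused); returns FIFO id. */
    int steer(int rd, int rs, int rt) {
        int f = -1;
        if (rs >= 0 && reg_fifo[rs] >= 0) f = reg_fifo[rs];
        else if (rt >= 0 && reg_fifo[rt] >= 0) f = reg_fifo[rt];
        if (f < 0) {                                  /* no producer: take an empty FIFO */
            for (int i = 0; i < NFIFO; i++)
                if (fifo_len[i] == 0) { f = i; break; }
            if (f < 0) f = 0;                         /* all occupied: a real design stalls */
        }
        fifo_len[f]++;
        if (rd >= 0) reg_fifo[rd] = f;
        return f;
    }

    int main(void) {
        for (int i = 0; i < NREG; i++) reg_fifo[i] = -1;
        /* A dependent chain r1 -> r2 -> r3 stays in one FIFO; an independent op gets another. */
        int a = steer(1, -1, -1);    /* r1 = ...    */
        int b = steer(2,  1, -1);    /* r2 = f(r1)  */
        int c = steer(3,  2, -1);    /* r3 = f(r2)  */
        int d = steer(4, -1, -1);    /* independent */
        printf("%d %d %d %d\n", a, b, c, d);    /* expect: 0 0 0 1 */
        return 0;
    }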

Citation Context

...w of the conventional processor has 64 entries. Both microarchitectures can decode, rename, and execute a maximum of 8 instructions per cycle. The timing simulator, a modified version of SimpleScalar [4], is detailed in Table 3. Fetch width: any 8 instructions; I-cache: perfect instruction cache; Branch predictor: McFarling’s gshare [13], 4K 2-bit counters, 12-bit history; unconditional control instructions...
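
The branch predictor quoted in this configuration, McFarling's gshare, XORs low-order branch-address bits with a global history register to index a table of 2-bit saturating counters. A self-contained C sketch with the 4K-counter, 12-bit-history sizing mentioned above is given below; the function names and the toy training loop are illustrative, not code from the paper or the simulator.

    /* Illustrative only: a gshare predictor with 4K 2-bit counters and a
     * 12-bit global history register. */
    #include <stdint.h>
    #include <stdio.h>

    #define HIST_BITS  12
    #define TABLE_SIZE (1u << HIST_BITS)     /* 4K two-bit counters */

    static uint8_t  counters[TABLE_SIZE];    /* saturating counters, 0..3 */
    static uint16_t ghr;                     /* 12-bit global history     */

    static uint32_t index_of(uint32_t pc) {
        /* XOR branch address bits with global history (the "share" in gshare). */
        return ((pc >> 2) ^ ghr) & (TABLE_SIZE - 1);
    }

    int predict(uint32_t pc) {               /* returns 1 = predict taken */
        return counters[index_of(pc)] >= 2;
    }

    void update(uint32_t pc, int taken) {
        uint32_t i = index_of(pc);
        if (taken  && counters[i] < 3) counters[i]++;
        if (!taken && counters[i] > 0) counters[i]--;
        ghr = (uint16_t)(((ghr << 1) | (taken & 1)) & (TABLE_SIZE - 1));
    }

    int main(void) {
        /* A branch that alternates taken/not-taken trains quickly, because the
         * history distinguishes the two cases and each gets its own counter. */
        uint32_t pc = 0x400100;
        int correct = 0;
        for (int n = 0; n < 1000; n++) {
            int outcome = n & 1;
            correct += (predict(pc) == outcome);
            update(pc, outcome);
        }
        printf("correct predictions: %d / 1000\n", correct);
        return 0;
    }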

The predictability of data values

by Yiannakis Sazeides, James E. Smith - IN PROCEEDINGS OF THE 30TH INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE, 1997
"... ..."
Abstract - Cited by 288 (11 self) - Add to MetaCart
Abstract not found

Citation Context

...o form a context for the fcm predictor we use full concatenation of history values so there is no aliasing when matching contexts. Trace driven simulation was conducted using the Simplescalar toolset [26] for the integer SPEC95 benchmarks shown in Table 2. The benchmarks were compiled using the simplescalar compiler with -O3 optimization. Integer benchmarks were selected because they tend to have les...
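
For context, a finite-context-method (fcm) value predictor is two-level: the recent values produced by an instruction form a context, and a second table maps that context to the value that followed it last time. The C sketch below illustrates the idea; the table sizes and the hash are assumptions, and note that the excerpt above says the paper avoids aliasing by concatenating full history values rather than hashing them.

    /* Illustrative only: a 2nd-order finite-context-method value predictor. */
    #include <stdint.h>
    #include <stdio.h>

    #define ORDER 2                /* context = last 2 values                  */
    #define VHT   256              /* per-static-instruction history entries   */
    #define VPT   4096             /* context -> predicted value entries       */

    static uint64_t hist[VHT][ORDER];   /* value history table (1st level)     */
    static uint64_t pred[VPT];          /* value prediction table (2nd level)  */

    static uint32_t ctx_hash(const uint64_t *h) {
        uint64_t x = 0;
        for (int i = 0; i < ORDER; i++) x = x * 1000003u + h[i];
        return (uint32_t)(x % VPT);
    }

    uint64_t fcm_predict(uint32_t pc) {
        return pred[ctx_hash(hist[pc % VHT])];
    }

    void fcm_update(uint32_t pc, uint64_t value) {
        uint64_t *h = hist[pc % VHT];
        pred[ctx_hash(h)] = value;           /* learn: this context led to `value` */
        for (int i = ORDER - 1; i > 0; i--) h[i] = h[i - 1];
        h[0] = value;                        /* shift the new value into the history */
    }

    int main(void) {
        /* A load that repeatedly produces 10, 20, 30 becomes predictable once
         * each 2-value context has been seen once. */
        static const uint64_t seq[] = { 10, 20, 30 };
        uint32_t pc = 0x1234;
        int hits = 0;
        for (int n = 0; n < 300; n++) {
            uint64_t actual = seq[n % 3];
            hits += (fcm_predict(pc) == actual);
            fcm_update(pc, actual);
        }
        printf("correct value predictions: %d / 300\n", hits);
        return 0;
    }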

The M5 simulator: Modeling networked systems

by Nathan L. Binkert, Ronald G. Dreslinski, Lisa R. Hsu, Kevin T. Lim, Ali G. Saidi, Steven K. Reinhardt - IEEE Micro, 2006
"... TCP/IP networking is an increasingly important aspect of computer systems, but a lack of simulation tools limits architects ’ ability to explore new designs for network I/O. We have developed the M5 simulator specif-ically to enable research in this area. In addition to typical architecture simulato ..."
Abstract - Cited by 249 (22 self) - Add to MetaCart
TCP/IP networking is an increasingly important aspect of computer systems, but a lack of simulation tools limits architects’ ability to explore new designs for network I/O. We have developed the M5 simulator specifically to enable research in this area. In addition to typical architecture simulator attributes, M5 provides features necessary for simulating networked hosts, including full-system capability, a detailed I/O subsystem, and the ability to simulate multiple networked systems deterministically. Our experience in simulating network workloads revealed some unexpected interactions between TCP and the common simulation acceleration techniques of sampling and warm-up. We have successfully validated M5’s simulated performance results against real machines, indicating that our models and methodology adequately capture the salient characteristics of these systems. M5’s usefulness as a general-purpose architecture simulator and its liberal open-source license have led to its adoption by several other academic and commercial groups. Keywords: computer architecture, simulation, simulation software, interconnected systems

Citation Context

...rk interface devices; and the ability to model multiple networked systems (e.g., a server and one or more clients) in a deterministic fashion. Traditional CPU-centric simulators, such as SimpleScalar [BAB96], lack nearly all of these features. Among the simulators available when we began our research, only SimOS [RHWG95] provided the majority of the features necessary, missing only the ability to deter...

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors

by Eric Rotenberg, 1999
"... This paper speculates that technology trends pose new challenges for fault tolerance in microprocessors. Specifically, severely reduced design tolerances implied by gigaherz clock rates may result in frequent and arbitrary transient faults. We suggest that existing fault-tolerant techniques -- syste ..."
Abstract - Cited by 246 (8 self) - Add to MetaCart
This paper speculates that technology trends pose new challenges for fault tolerance in microprocessors. Specifically, severely reduced design tolerances implied by gigahertz clock rates may result in frequent and arbitrary transient faults. We suggest that existing fault-tolerant techniques -- system-level, gate-level, or component-specific approaches -- are either too costly for general purpose computing, overly intrusive to the design, or insufficient for covering arbitrary logic faults. An approach in which the microarchitecture itself provides fault tolerance is required. We propose a new time redundancy fault-tolerant approach in which a program is duplicated and the two redundant programs simultaneously run on the processor. The technique exploits several significant microarchitectural trends to provide broad coverage of transient faults and restricted coverage of permanent faults. These trends are simultaneous multithreading, control flow and data flow prediction, and hierarchi...
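
The checking implied by the abstract can be pictured as a delay buffer between the two redundant streams: the active (A) stream commits results into the buffer, and the redundant (R) stream, running slightly behind, compares its own results against the buffer head. The C sketch below shows that comparison step only; the buffer sizing, stall handling, and recovery action are assumptions.

    /* Illustrative only: the A-stream/R-stream comparison step of a
     * time-redundancy scheme in the spirit of AR-SMT. */
    #include <stdint.h>
    #include <stdio.h>

    #define DB_SIZE 64

    typedef struct { uint32_t pc; uint64_t result; } commit_t;

    static commit_t delay_buf[DB_SIZE];
    static int head, tail;

    /* A-stream retirement: push the committed result into the delay buffer.
     * (A real design stalls the A-stream before the buffer can overflow.) */
    void a_stream_commit(uint32_t pc, uint64_t result) {
        delay_buf[tail] = (commit_t){ pc, result };
        tail = (tail + 1) % DB_SIZE;
    }

    /* R-stream retirement: returns 1 on a match, 0 on a detected fault. */
    int r_stream_check(uint32_t pc, uint64_t result) {
        commit_t a = delay_buf[head];
        head = (head + 1) % DB_SIZE;
        return a.pc == pc && a.result == result;
    }

    int main(void) {
        a_stream_commit(0x400000, 42);
        a_stream_commit(0x400004, 43);
        printf("match: %d\n", r_stream_check(0x400000, 42));  /* 1              */
        printf("match: %d\n", r_stream_check(0x400004, 99));  /* 0: fault found */
        return 0;
    }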

Citation Context

...vironment A detailed, fully execution-driven simulator of a trace processor [17] was modified to support AR-SMT time redundancy. The simulator was developed using the simplescalar simulation platform [24]. This platform uses a MIPS-like instruction set and a gcc-based compiler to create binaries. The simulator only measures performance of the microarchitecture. Fault coverage is not evaluated. It is b...

Memory bandwidth limitations of future microprocessors

by Doug Burger, James R. Goodman, Alain Kägi - IN PROCEEDINGS OF THE 23RD ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, 1996
"... This paper makes the case that pin bandwidth will be a critical consideration for future microprocessors. We show that many of the techniques used to tolerate growing memory latencies do so at the expense of increased bandwidth requirements. Using a decomposition of execution time, we show that for ..."
Abstract - Cited by 226 (12 self) - Add to MetaCart
This paper makes the case that pin bandwidth will be a critical consideration for future microprocessors. We show that many of the techniques used to tolerate growing memory latencies do so at the expense of increased bandwidth requirements. Using a decomposition of execution time, we show that for modern processors that employ aggressive memory latency tolerance techniques, wasted cycles due to insufficient bandwidth generally exceed those due to raw memory latencies. Given the importance of maximizing memory bandwidth, we calculate effective pin bandwidth, then estimate optimal effective pin bandwidth. We measure these quantities by determining the amount by which both caches and minimal-traffic caches filter accesses to the lower levels of the memory hierarchy. We see that there is a gap that can exceed two orders of magnitude between the total memory traffic generated by caches and the minimal-traffic caches—implying that the potential exists to increase effective pin bandwidth substantially. We decompose this traffic gap into four factors, and show they contribute quite differently to traffic reduction for different benchmarks. We conclude that, in the short term, pin bandwidth limitations will make more complex on-chip caches cost-effective. For example, flexible caches may allow individual applications to choose from a range of caching policies. In the long term, we predict that off-chip accesses will be so expensive that all system memory will reside on one or more processor chips.
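
As a reading aid, one plausible way to write down the execution-time decomposition and the traffic gap the abstract refers to is the following; the paper's own terms and symbols may differ:

    T_{\mathrm{exec}} \;=\; T_{\mathrm{proc}} + T_{\mathrm{lat}} + T_{\mathrm{bw}},
    \qquad
    \mathrm{gap} \;=\; \frac{M_{\mathrm{cache}}}{M_{\mathrm{min}}}

where T_proc is compute time assuming a perfect memory system, T_lat is stall time attributable to raw memory latency, T_bw is stall time attributable to insufficient pin bandwidth, M_cache is the off-chip traffic generated by the real cache hierarchy, and M_min is the traffic of a minimal-traffic cache. The abstract's claims then read: T_bw generally exceeds T_lat for aggressive latency-tolerant processors, and the gap can exceed 10^2 for some benchmarks, which bounds how much effective pin bandwidth better caching could still recover.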

Citation Context

...generate the traces for each benchmark. It also lists both the number of memory references that we simulated (in millions) and the data set sizes for each benchmark. We used the SimpleScalar tool set [4] to measure the execution time of simulated processors that use a MIPS-like instruction set. SimpleScalar uses execution-driven simulation to measure execution time accurately. It includes simulation ...

Dynamic instruction reuse

by A Sodani, G Sohi - Proceedings of the 24th Annual International Symposium on Computer Architecture (ISCA’97), 1997
"... ..."
Abstract - Cited by 203 (8 self) - Add to MetaCart
Abstract not found

Slipstream processors: improving both performance and fault tolerance

by Karthik Sundaramoorthy, Zach Purser, Eric Rotenberg - In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems
"... Processors execute the full dynamic instruction stream to arrive at the final output of a program, yet there exist shorter instruction streams that produce the same overall effect. We propose creating a shorter but otherwise equivalent version of the original program by removing ineffectual computat ..."
Abstract - Cited by 187 (6 self) - Add to MetaCart
Processors execute the full dynamic instruction stream to arrive at the final output of a program, yet there exist shorter instruction streams that produce the same overall effect. We propose creating a shorter but otherwise equivalent version of the original program by removing ineffectual computation and computation related to highly-predictable control flow. The shortened program is run concurrently with the full program on a chip multiprocessor or simultaneous multithreaded processor, with two key advantages: 1) Improved single-program performance. The shorter program speculatively runs ahead of the full program and supplies the full program with control and data flow outcomes. The full program executes efficiently due to the communicated outcomes, at the same time validating the speculative, shorter program. The two programs combined run faster than the original program alone. Detailed simulations of an example implementation show an average improvement of 7% for the SPEC95 integer benchmarks. 2) Fault tolerance. The shorter program is a subset of the full program and this partial-redundancy is transparently leveraged for detecting and recovering from transient hardware faults.
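
One of the removal criteria the abstract names is ineffectual computation, for example writes that leave the destination value unchanged. The C sketch below shows a confidence-counter detector for that single criterion; the threshold, table size, and interface are assumptions, and the full mechanism also removes computation tied to highly predictable control flow, as the abstract states.

    /* Illustrative only: flagging value-unchanging ("ineffectual") writes so
     * they can be dropped from the shortened A-stream. */
    #include <stdint.h>
    #include <stdio.h>

    #define IR_ENTRIES       1024
    #define REMOVE_THRESHOLD 4

    static uint8_t confidence[IR_ENTRIES];   /* per-static-instruction counters */

    /* Called when an instruction retires in the full (R) stream. */
    void ir_detect(uint32_t pc, uint64_t old_value, uint64_t new_value) {
        uint32_t i = (pc >> 2) % IR_ENTRIES;
        if (old_value == new_value) {
            if (confidence[i] < REMOVE_THRESHOLD) confidence[i]++;  /* ineffectual this time */
        } else {
            confidence[i] = 0;                                      /* it mattered: reset    */
        }
    }

    /* Should this instruction be skipped when building the A-stream? */
    int ir_remove(uint32_t pc) {
        return confidence[(pc >> 2) % IR_ENTRIES] >= REMOVE_THRESHOLD;
    }

    int main(void) {
        uint32_t pc = 0x400200;
        for (int n = 0; n < 6; n++) ir_detect(pc, 7, 7);      /* keeps writing 7 over 7 */
        printf("remove from A-stream: %d\n", ir_remove(pc));  /* 1 */
        ir_detect(pc, 7, 8);                                  /* finally changes the value */
        printf("remove from A-stream: %d\n", ir_remove(pc));  /* 0 */
        return 0;
    }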

Citation Context

... functional simulator run independently and in parallel with the detailed timing simulator [33]. The functional simulator checks retired R-stream control flow and data flow outcomes. The Simplescalar [3] compiler and ISA are used. Binaries are compiled with -O3 level optimization. The Simplescalar compiler is gcc-based and the ISA is MIPS-based; as a result, programs inherit any inefficiencies of the...

Dependence Based Prefetching for Linked Data Structures

by Amir Roth, Andreas Moshovos, Gurindar S. Sohi, 1998
"... We introduce a dynamic scheme that captures the access patterns of linked data structures and can be used to predict future accesses with high accuracy. Our technique exploits the dependence relationships that exist between loads that produce addresses and loads that consume these addresses. By iden ..."
Abstract - Cited by 179 (13 self) - Add to MetaCart
We introduce a dynamic scheme that captures the access patterns of linked data structures and can be used to predict future accesses with high accuracy. Our technique exploits the dependence relationships that exist between loads that produce addresses and loads that consume these addresses. By identifying producer-consumer pairs, we construct a compact internal representation for the associated structure and its traversal. To achieve a prefetching effect, a small prefetch engine speculatively traverses this representation ahead of the executing program. Dependence-based prefetching achieves speedups of up to 25% on a suite of pointer-intensive programs. 1 Introduction Linked data structures (LDS) such as lists and trees are used in many important applications. The importance of LDS is growing with the increasing popularity of C++, Java, and other systems that use linked object graphs and function tables. Flexible, dynamic construction allows linked structures to grow large and diffic...
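
The mechanism described above can be approximated with two small tables: a window of recent load values (potential pointer producers) and a correlation table of producer-to-consumer address templates; when a recorded producer executes again, the engine prefetches the produced value plus the learned offset. The C sketch below shows that flow for a toy list traversal; the table sizes, the offset filter, and the keep-first-template policy are simplifying assumptions (a real design would track more than one consumer per producer).

    /* Illustrative only: dependence-based prefetching for linked data structures. */
    #include <stdint.h>
    #include <stdio.h>

    #define PPW 16                     /* potential-producer window entries */
    #define CT  64                     /* correlation-table entries         */

    typedef struct { uint32_t pc; uint64_t value; } producer_t;
    typedef struct { uint32_t producer_pc; int64_t offset; int valid; } corr_t;

    static producer_t ppw[PPW];
    static int ppw_next;
    static corr_t ct[CT];

    static void prefetch(uint64_t addr) { printf("prefetch 0x%llx\n", (unsigned long long)addr); }

    /* Called for every committed load with its PC, effective address, and loaded value. */
    void observe_load(uint32_t pc, uint64_t addr, uint64_t value) {
        /* 1. Did an earlier load produce (addr - small offset)?  Learn the template. */
        for (int i = 0; i < PPW; i++) {
            int64_t off = (int64_t)(addr - ppw[i].value);
            if (ppw[i].pc != 0 && off >= 0 && off < 256 && !ct[ppw[i].pc % CT].valid) {
                ct[ppw[i].pc % CT] = (corr_t){ ppw[i].pc, off, 1 };
                break;
            }
        }
        /* 2. Remember this load as a potential producer of future addresses. */
        ppw[ppw_next] = (producer_t){ pc, value };
        ppw_next = (ppw_next + 1) % PPW;
        /* 3. If this load has a learned consumer template, launch the prefetch. */
        corr_t c = ct[pc % CT];
        if (c.valid && c.producer_pc == pc) prefetch(value + (uint64_t)c.offset);
    }

    int main(void) {
        /* Toy list walk: load p->next (PC 0x10), then load p->next->data (PC 0x14). */
        observe_load(0x10, 0x1000, 0x2000);   /* p->next = 0x2000                   */
        observe_load(0x14, 0x2008, 5);        /* learns template: consumer at +8    */
        observe_load(0x10, 0x2000, 0x3000);   /* producer again -> prefetch 0x3008  */
        return 0;
    }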

Citation Context

...t optimize or discount these in any way. Finally, the suggested input sets for some benchmarks were changed to produce longer execution samples. For our simulations, we use the SimpleScalar simulator [2]. We model a 4-way superscalar, out-of-order processor with a conventional five stage pipeline that allows a maximum of 32 in-flight instructions. The branch unit uses a hybrid scheme with an 8K-entry ...

Trace processors

by Eric Rotenberg, Quinn Jacobson, Yiannakis Sazeides, Jim Smith - IN PROCEEDINGS OF THE 30TH INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE, 1997
"... ..."
Abstract - Cited by 176 (15 self) - Add to MetaCart
Abstract not found

Citation Context

...d simulation is used to evaluate the performance of trace processors. For comparison, superscalar processors are also simulated. The simulator was developed using the simplescalar simulation platform [23]. This platform uses a MIPS-like instruction set (no delayed branches) and comes with a gcc-based compiler to create binaries. Table 2. Fixed parameters and benchmarks. frontend latency 2 cycles (fetc...
