The Nonuniform Distribution of Instruction-Level and Machine Parallelism and Its Effect on Performance (1989)

by Norman P Jouppi
Venue:IEEE Transactions on Computers

Results 1 - 10 of 23

Instruction-Level Parallel Processing: History, Overview and Perspective

by B. Ramakrishna Rau, Joseph A. Fisher, 1992
Cited by 166 (0 self)
Instruction-level parallelism (ILP) is a family of processor and compiler design techniques that speed up execution by causing individual machine operations to execute in parallel. Although ILP has appeared in the highest performance uniprocessors for the past 30 years, the 1980s saw it become a much more significant force in computer design. Several systems were built, and sold commercially, which pushed ILP far beyond where it had been before, both in terms of the amount of ILP offered and in the central role ILP played in the design of the system. By the end of the decade, advanced microprocessor design at all major CPU manufacturers had incorporated ILP, and new techniques for ILP have become a popular topic at academic conferences. This article provides an overview and historical perspective of the field of ILP and its development over the past three decades.

HLS: Combining Statistical and Symbolic Simulation to Guide Microprocessor Designs

by Mark Oskin, Frederic T. Chong, Matthew Farrens - In Proceedings of the International Symposium on Computer Architecture, 2000
Cited by 73 (0 self)
As microprocessors continue to evolve, many optimizations reach a point of diminishing returns. We introduce HLS, a hybrid processor simulator which uses statistical models and symbolic execution to evaluate design alternatives. This simulation methodology allows for quick and accurate contour maps to be generated of the performance space spanned by design parameters. We validate the accuracy of HLS through correlation with existing cycle-by-cycle simulation techniques and current generation hardware. We demonstrate the power of HLS by exploring design spaces defined by two parameters: code properties and value prediction. These examples motivate how HLS can be used to set design goals and individual component performance targets.

Out-of-Order Vector Architectures

by Roger Espasa, Mateo Valero, James E. Smith, 1997
Cited by 46 (21 self)
Register renaming and out-of-order instruction issue are now commonly used in superscalar processors. These techniques can also be used to significant advantage in vector processors, as this paper shows. Performance is improved and available memory bandwidth is used more effectively. Using a trace-driven simulation we compare a conventional vector implementation, based on the Convex C3400, with an out-of-order, register renaming, vector implementation. When the number of physical registers is above 12, out-of-order execution coupled with register renaming provides a speedup of 1.24-1.72 for realistic memory latencies. Out-of-order techniques also tolerate main memory latencies of 100 cycles with a performance degradation less than 6%. The mechanisms used for register renaming and out-of-order issue can be used to support precise interrupts, generally a difficult problem in vector machines. When precise interrupts are implemented, there is typically less than a 10% degradation in performance. A new technique based on register renaming is targeted at dynamically eliminating spill code; this technique is shown to provide an extra speedup ranging between 1.10 and 1.20 while reducing total memory traffic by an average of 15-20%.

A Framework for Statistical Modeling of Superscalar Processor Performance

by Derek B. Noonburg, 1997
Cited by 43 (0 self)
This dissertation presents a statistical approach to modeling superscalar processor performance. Instead of directly modeling an execution trace, as with standard simulation-based performance models, a statistical model works with the probabilities of instruction types, instruction sequences, and processor states. The program trace and machine are analyzed separately, and the performance is computed from these two inputs. The statistical flow graph is introduced as a compact representation for program traces. The characterization of a specific processor and the statistical flow graph for a specific benchmark are combined to form a Markov chain. In order to reduce the state space size, this Markov chain is partitioned into several smaller submodels. Simulation-based techniques require extremely long run times, especially as traces reach lengths in the billions of instructions. The statistical approach presented here dramatically reduces the time required to explore a microarchitectural design space. Separating the program and machine models allows the time-consuming part of the modeling process,

Code Optimizers and Register Organizations for Vector Architectures

by Corinna Grace Lee, 1992
Cited by 19 (0 self)
A major challenge facing computer architects today is designing cost-effective hardware that executes multiple operations simultaneously. The goal of such designs is to improve performance by taking advantage of fine-grain parallelism. In this dissertation, I study vector architectures, the oldest of several processor designs that support fine-grain parallelism. Because implementing a cost-effective processor that performs well requires studying not only the design of processors but also the design of algorithms for compilers, this dissertation encompasses aspects of both hardware and software design. In the first half of this dissertation, I demonstrate that a vector architecture is a cost-effective processor that supports fine-grain parallelism. I show that implementing a vector architecture is no more costly than implementing a superscalar architecture, which is currently popular among designers of VLSI microprocessors. I then show that programs that are rich in parallelism tend als...

Reducing The Impact Of Register Pressure On Software Pipelined Loops

by Josep Llosa, Margarita Espuny, 1996
Cited by 12 (8 self)
This work deals with the problems caused by the high register requirements of software pipelined loops. The main contributions of this work are:
* Register requirements of software pipelined loops are evaluated.
* Several heuristics to perform register-constrained software pipelining are proposed.
* The effects of register requirements on performance under register constraints are evaluated.
* HRMS is proposed to perform software pipelining with resource constraints and reduced register requirements.
* Two new register file organizations are proposed to allow for a large number of registers with low area cost and fast access time.

An Instruction Throughput Model of Superscalar Processors

by Tarek M. Taha - In Proceedings of the 14th IEEE International Workshop on Rapid System Prototyping (RSP), 2003
Cited by 10 (0 self)
With advances in semiconductor technology, processors are becoming larger and more complex. Future processor designers will face an enormous design space, and must evaluate more architecture design points to reach a final optimum design. This exploration is currently performed using cycle-accurate simulators that are accurate but slow, limiting a comprehensive search of design options. The vast design space and time-to-market economic pressures motivate the need for faster architectural evaluation methods. The model presented in this paper facilitates a rapid exploration of the architecture design space for superscalar processors. It supplements current design tools by narrowing a large design space quickly, after which existing cycle-accurate simulators can arrive at a precise optimum design. This allows a designer to select the final architecture design much faster than with traditional tools. The model calculates the instruction throughput of superscalar processors using a set of key architecture and application properties. It was validated with the SimpleScalar out-of-order simulator. Its results were within 5.5% of the cycle-accurate simulator's, while it executed 40,000 times faster.

Quantitative Analysis of Vector Code

by Roger Espasa, Mateo Valero, David Padua - In Euromicro Workshop on Parallel and Distributed Processing. IEEE Computer, 1995
Cited by 10 (7 self)
In this paper we present the results of a detailed simulation study of the execution of vector programs on a single processor of a Convex C3480 machine, using a subset of the Perfect Club benchmarks. We are interested in evaluating several cost/performance tradeoffs that the machine designers made in order to assess which features of the architecture severely limit the performance attainable. We present the detailed usage of the vector functional units and a study of the kinds of resource conflicts that stall the machine. The results obtained show that the resources of the vector architecture are not efficiently used, mainly due to the single bus memory architecture. Other severe limitations of the machine turn out to be the lack of chaining between vector loads and vector computations, and the lack of a second general purpose functional unit. We also present some data about the port pressure on the vector register file and we see that stalls due to port conflicts are relatively high. We...

Instruction level characterization of the Perfect Club programs on a vector computer

by Roger Espasa, Mateo Valero - In XV International Conference of the Chilean Computation Society, 1995
Cited by 7 (5 self)
In this paper we study the instruction level characteristics of the Perfect Club programs when compiled and executed on a vector processor. Using a trace-driven approach we measure the degree of vectorization of the programs, the vector length used in operations, the operation type distribution, the basic block size and the balance between memory and compute operations. We also study the spill code introduced in the program by the compiler and the pressure on the dispatch unit of the vector architecture.
1 Introduction
Vector architectures have historically been evaluated using high-level benchmarks. By high-level benchmarks we mean measuring the performance of a vector machine using benchmarks like the PERFECT CLUB [1] or even LINPACK [2], which summarize the behavior of the machine in the total MFLOPS achieved. While these procedures certainly give some interesting information to the final user of the machine, they cannot capture the performance advantages/disadvantages that e...

Evaluation of a commercial microprocessor

by Robert Yung, 1998
Cited by 7 (0 self)
Abstract not found