Superscalar processing is the latest in a long series of innovations aimed at producing ever-faster microprocessors. By exploiting instruction-level parallelism, superscalar processors are capable of executing more than one instruction in a clock cycle. This paper discusses the microarchitecture of superscalar processors. We begin with the general problem solved by superscalar processors: converting an ostensibly sequential program into a more parallel one. The principles underlying this process, and the constraints that must be met, are discussed. The paper then describes the specific implementation techniques used in the important phases of superscalar processing. The major phases include: i) instruction fetching and conditional branch processing, ii) the determination of data dependences involving register values, iii) the initiation, or issuing, of instructions for parallel execution, iv) the communication of data values through memory via loads and stores, and v) committing the process state in correct order so that precise interrupts can be supported. Examples of recent superscalar microprocessors, the MIPS R10000, the DEC 21164, and the AMD K5, are used to illustrate a variety of superscalar methods.