| D. Papworth. "Tuning the Pentium Pro microarchitecture." IEEE Micro, pages 8-15, April, 1996. |
....predictor does consume power, but the gain in the reduced number of instructions exceeds that extra cost. Reducing the Number of Micro ops Per Instruction Set Architecture (ISA) break macro instructions into a sequence of one or more simple operations, called micro operations, or micro ops [4]. Handling and executing each micro op consumes power. Eliminating micro ops from the micro op stream or combining several micro ops together reduces the overall power. The Intel Pentium M processor micro ops fusion and dedicated stack engine do exactly that. Here, as well, the gain in reduced ....
D.B. Papworth, Tuning the Pentium Pro microarchitecture, IEEE Micro, Vol. 16-2, April 1996, p. 8.
....in the pipeline and, ideally, can deliver times the performance of a nonpipelined one. Pipelining is a very effective technique. There is a clear trend of increasing the number of pipe stages and reducing the amount of work per stage. Some microprocessors (e.g. Pentium Pro microprocessor [6]) have more than ten pipeline stages. Employing many pipe stages is sometimes termed deep pipelining or super pipelining. Unfortunately, the number of pipeline stages cannot increase indefinitely. There is a certain clocking overhead associated with each pipe stage (setup and hold time, clock ....
D. B. Papworth, "Tuning the Pentium Pro microarchitecture," IEEE Micro, vol. 16, pp. 8--15, Apr. 1996.
.... running complex commercial workloads, including all of the AIX operating system as well as code generated dynamically by just in time compilation frameworks for Java [6]. Performing this translation is analogous to the kind of translation already performed in hardware by many modern microprocessors [2, 9], which transform individual instructions into a series of less complex operations. Traditionally reserved for CISC instruction sets, this kind of instruction cracking also occurs in recent implementations of the somewhat complicated PowerPC architecture [3, 4]. Implementing this cracking in ....
David B. Papworth, "Tuning the Pentium Pro Microarchitecture", IEEE Micro, 16(2):8-15, 1996.
....after all previous instructions have been retired, and all of the instruction's constituent ops have completed. The Pentium Pro retires up to three ops per clock cycle, yielding a theoretical minimum cycles per op (CPI) of 0.33. More information on the Pentium Pro can be found in [6][11][13][21][34]. Measurements were performed using the Pentium Pro hardware counters [13]. We present aggregate (user + operating system) activity, factoring out the idle loop. On the uniprocessor, this technique is possible because NT implements the idle loop using the HALT instruction. The event counters ....
D. Papworth. "Tuning the Pentium Pro microarchitecture." IEEE Micro, pages 8-15, April, 1996.
....The Pentium Pro retires up to three ops per clock cycle, yielding a theoretical minimum cycles per op (CPI) of 0.33. Table 2 5 summarizes the characteristics of the Pentium Pro caches. More detailed descriptions of the Pentium Pro's architectural features can be found in [15][25][39][50][76]. We will also present additional details in subsequent sections, when discussing our measurement results. 2.5.2. Potential Sources of Pentium Pro Stalls In practice, the 0.33 theoretical minimum CPI is seldom achieved, due to stalls from cache misses, oversubscription of certain resources, and ....
D. Papworth. "Tuning the Pentium Pro microarchitecture," IEEE Micro, pages 8-15, April 1996.
....effectiveness of an MPS design. Keywords MPS, Multiple Instruction Issue, Miss Path Scheduling, Instruction Level Parallelism, Schedule Cache I. Introduction Current multiple issue processors such as the Hewlett Packard PA 8000 and the Intel Pentium Pro employ out of order instruction issue [1], [2]. One advantage of this approach is that compiler scheduling is not required to attain high levels of instruction level parallelism (ILP). Additionally, dynamic scheduling hardware can deal effectively with unanticipated run time events (e.g. cache misses). There are also disadvantages. ....
David B. Papworth, "Tuning the Pentium Pro microarchitecture," IEEE Micro, vol. 16, no. 2, pp. 8--15, Apr. 1996.
....latencies are shown in Figure 5, with each functional unit pipelined to the depth indicated. A pipelined L1 cache memory interface with a three cycle latency and a one Op bandwidth was assumed (the three cycle latency was chosen as it is similar to the L2 latency in contemporary microprocessors [5], [6]). A perfect L1 data cache and L2 cache was assumed to prevent data cache effects from coloring the performance measurements. Integer and floating point programs from the SPEC92 and SPEC95 suites were used as benchmarks for the evaluations and are listed in Table 1 3 . Two million Ops were ....
....unit and decoded instruction cache for use in decoding x86 instructions [19]. Their design associates a NextPC field with each cache block. The Intel Pentium Pro processor employs a multi stage i fetch that fetches 16 bytes per cycle from the i cache and uses three stages to align the instructions [5]. NextPC is PC + 16 in the absence of a branch instruction [20]. The AMD K5 stores decode information related to instruction length in the L1 instruction cache which is later used for NextPC computation in the i fetch stage [10]. Like the Pentium Pro, the K5 uses multiple stages to fetch and align ....
David B. Papworth, "Tuning the Pentium Pro microarchitecture," IEEE Micro, vol. 16, no. 2, pp. 8--15, Apr. 1996.
....with every cache access. High performance CISC processors implement different solutions for fetching variable length instructions. The Intel Pentium Pro processor employs a multi stage i fetch that fetches 16 bytes per cycle from the i cache and then uses three stages to align the instructions [11]. NextPC is PC + 16 in the absence of a branch instruction. The AMD K5 stores decode information related to instruction length in the L1 instruction cache which is later used for NextPC computation in the i fetch stage [12]. Like the Pentium Pro, the K5 uses multiple stages to fetch and align an x86 ....
D. B. Papworth, "Tuning the Pentium Pro microarchitecture," IEEE Micro, vol. 16, pp. 8--15, Apr. 1996.
....compared to the low overhead programs. With overhead-based replacement, the performance of high overhead programs improves substantially, while the low overhead programs perform only slightly worse than in the case of the LRU replacement. 1 Introduction Unlike contemporary superscalar processors [1][2][3] which employ dynamic scheduling, VLIW processors depend on a schedule of code generated by the compiler. (Published in: Proc. 29th Annual Int'l Symp. on Microarchitecture, Paris, 1996.) The compiler has full knowledge of the machine model, described in terms of the hardware resources ....
D. B. Papworth, "Tuning the Pentium Pro microarchitecture," IEEE Micro, vol. 16, pp. 8--15, Apr. 1996.
....parameters used in the analysis, based on trends [6, 7] in custom CMOS microprocessor implementations, packaging technologies, and board level components. The microprocessor and ASICs used for memory and I/O control directly drive external buses similar to several current microprocessors [16, 22] and the ASIC designed for the AlphaServer 8000 [2]. 3 Methodology Our evaluation methodology combines the use of the System Tradeoff Analysis Toolset (STATS) [1] for performance analysis with area and package pin count calculations. We use the die photograph of the Alpha 21164 [3] to determine ....
Papworth, D.B. Tuning the Pentium Pro microarchitecture. IEEE Micro, 16(2):8--15, April 1996.
....to perform dynamic instruction scheduling. Independent instructions can be discovered at run time and scheduled to functional units out of order. To avoid anti- and output dependencies, registers can be renamed by using a reorder buffer [5] or a mapping table [10]. The PowerPC [7], Pentium Pro [6], and MIPS R10000 [10] are examples of superscalar microprocessors that perform out of order issue and register renaming. The register file is a design obstacle for superscalar microprocessors. If N instructions can be issued in a cycle, then a superscalar microprocessor's register file needs 2N ....
D. B. Papworth. Tuning the Pentium Pro microarchitecture. IEEE Micro, pages 8--15, April 1996.
....of the profile based optimization is only available during subsequent runs of the program and the initial profile collecting run may suffer from worsened performance. Hardware solutions for a limited form of runtime code optimization are now commonplace in modern superscalar microprocessors [21][25][19]. The optimization unit is a fixed size instruction window, with the optimization logic operating on the critical execution path. The Trace Cache is another hardware alternative that can be extended to do superscalar like optimization off the critical path [27][15]. Dynamo offers the potential ....
Papworth, D. 1996. Tuning the Pentium Pro microarchitecture. IEEE Micro, (Apr.). 8-15.
....and decoded instruction cache for use in decoding x86 instructions [23]. Their design associates a NextPC field with each cache block. The Intel Pentium Pro processor employs a multi stage i fetch that fetches 16 bytes per cycle from the i cache and then uses three stages to align the instructions [19]. NextPC is PC + 16 in the absence of a branch instruction. The AMD K5 stores decode information related to instruction length in the L1 instruction cache which is later used for NextPC computation in the i fetch stage [7]. Like the Pentium Pro, the K5 uses multiple stages to fetch and align an x86 ....
....latencies are shown in Figure 5, with each functional unit pipelined to the depth indicated. A pipelined L1 cache memory interface with a three cycle latency and a one Op bandwidth was assumed (the three cycle latency was chosen as it is similar to the L2 latency in contemporary microprocessors [19], [6]). A perfect L1 data cache and L2 cache was assumed to prevent data cache effects from coloring the performance measurements. Integer and floating point programs from the SPEC92 and SPEC95 suites were used as benchmarks for the evaluations and are listed in Table 1 3 . Two million Ops were ....
D. B. Papworth. Tuning the Pentium Pro microarchitecture. IEEE Micro, 16(2):8--15, Apr. 1996.
....In the case of a dynamic optimizer, on the other hand, all profile data that is generated is consumed in the very same run, and no data is written out for use offline or during a later run. Hardware implementations of dynamic optimizers are now commonplace in modern microprocessors [Kumar 1996; Papworth 1996; Keller 1996]. The optimization unit is a fixed size instruction window, with the optimization logic operating on the critical execution path. The Trace Cache is another hardware alternative that can be extended to do superscalar like optimization off the critical path [Peleg and Weiser 1994; ....
Papworth, D. 1996. Tuning the Pentium Pro microarchitecture. IEEE Micro, (Apr.). 8-15.
....is fast and effective for small register files. However, more recent microprocessors have backed away from this search mechanism, and have incorporated map tables that record a pointer to the most recent rename of each architected register in a single direct mapped table. The Intel Pentium Pro [7], IBM RIOS [2], HP PA8000 [9], and MIPS R10000 [5] all use map tables. Map table designs have also been investigated in the research community [8][10]. There are two common ways to implement renaming with a map table. The Pentium Pro and HP parts store the rename registers separate from the ....
....cp commit Technical Report: CMuART 2000 01 5 2.2 Renaming with a Rename Register File in the ROB Register renaming schemes that use separate register files may implement the rename registers in the reorder buffer or a special rename register file. For this discussion the Pentium Pro [7] design is examined. Figure 2 illustrates the interaction between the rename map table, the ROB RRF, and the architected register file (ARF) along with the worst case delay paths for the four renaming functions. In this implementation the RRF is combined with the reorder buffer (ROB). Each entry ....
D. Papworth. Tuning the Pentium Pro Micro Architecture. IEEE Micro, August 1996.
....are outlined and shown to be feasible. Performance results from trace driven simulations are presented that highlight the effectiveness of the approach. 1 Introduction Current multiple issue processors such as the Hewlett Packard PA 8000 and the Intel Pentium Pro employ out of order issue [1], [2]. One advantage of such an approach is that compiler scheduling is not required to achieve acceptable performance. Additionally, dynamic scheduling hardware can deal effectively with unanticipated run time events (e.g. cache misses). Since the hardware is used on every instruction cache ....
....result is written to an architected register at writeback. Miss path scheduling can be used for CISC architectures as well as RISC architectures. As is done in contemporary multiple issue CISC processors [27], a CISC instruction can be decomposed into multiple RISC like instructions (micro ops) [1]. These instructions can then be individually scheduled. Care must be taken so that the newly created micro ops are ordered such that the dependencies between them are honored. Exception handling is achieved by the reorder buffer and is done identically for both RISC and CISC architectures. ....
D. B. Papworth, "Tuning the Pentium Pro microarchitecture," IEEE Micro, vol. 16, pp. 8--15, Apr. 1996.
....over a sequential machine. 1.2 Other ILP Work The ILP literature is vast; we mention only a small part of it here due to space limitations. There are a variety of approaches to ILP enhancement and exploitation. Hardware methods include the dynamic execution model employed in the Pentium Pro [7]. Software methods include various VLIW type approaches [2, 4]. There is also the multiscalar approach [10]. Recently, Data Effect Reduction Techniques (DERTs) such as data speculation [6, 9] have begun to be investigated. DEE is orthogonal to all of these methods, and can be used with any of ....
....and including the other critical path logic, a nominal cycle time for a DEE CD MF computer was computed. For comparison purposes, the students also laid out and simulated a string of 9 10 simple gates in the same technology, indicative of the critical path delay (with latch) of the Pentium Pro [3, 7]. This gave us a base cycle time. The DEE CD MF model's cycle time is slightly longer than the base machine's cycle time, giving potentially about a 30% loss in performance in the new model (only considering cycle time) or a net speedup of a factor of 26.0(10/13) = 20.0. This is still about an ....
D. B. Papworth. Tuning the Pentium Pro Microarchitecture. MICRO, 16(2):8--15, April 1996.
....the paper. 2 Levo Basics The general operation and microarchitecture of Levo are now described. Since Levo is somewhat unusual in its construction, we will provide correlations to more commonly known microarchitectural structures and techniques, especially with reference to the Intel Pentium Pro[4]. In the discussion, the corresponding Pentium Pro element will be indicated in boxed italics: italics] NOTE that we are NOT equating the correlated structures, just providing familiar points of reference that the reader can build on. We will assume a MIPS R4400 Instruction Set Architecture ....
....will use thirty two 32 bit MIPS processors to mimic the function of the 32 PEs. Each PE will likely have one SBlock associated with it. 6 Comparison to Other Work Virtually all commercial and most research superscalar machines use only simple SP execution with minimal register data dependencies [4]. VLIW and other software based machines are an alternate approach; see [12] for a discussion and comparison of these and other methods. The multiscalar architecture [5] realizes concurrently executing tasks in multiple localities of SP executions. The realization of concurrent localities is a ....
D. B. Papworth. Tuning the Pentium Pro Microarchitecture. MICRO, 16(2):8--15, April 1996.
No context found.
D. Papworth. "Tuning the Pentium Pro microarchitecture." IEEE Micro, pages 8-15, April, 1996.
No context found.
D. Papworth. Tuning the Pentium Pro microarchitecture. IEEE Micro, 16:8--15, April 1996.
No context found.
D. Papworth. Tuning the Pentium Pro Microarchitecture. IEEE Micro, 16(2):8--15, 1996.
No context found.
D. Papworth, "Tuning the Pentium Pro Microarchitecture," IEEE Micro, Vol. 16(2), April 1996, pp. 8-15.
No context found.
D. Papworth, "Tuning the Pentium Pro Microarchitecture," IEEE Micro, Vol. 16(2), April 1996, pp. 8--15.
No context found.
D. Papworth. Tuning the Pentium Pro Micro Architecture. IEEE Micro, August 1996.