25 citations found.
S. Weiss, J. Smith, POWER and PowerPC, Morgan Kaufmann, 1994.


This paper is cited in the following contexts:
Quantitative Analysis of Protection Options - Banerji, Panteleenko, Wyant, Cohn (1996)   (5 citations)

....this question [Small 96] but a full analysis is still lacking. This makes it difficult to project the performance trade-offs as technology evolves. This paper attempts to fill this gap. 4. Experimental Methodology 4.1 Setup The measurements reported here were made on an IBM RS 6000 Model 390 [Weiss 94]. The machine's characteristics are summarized in Table 1. Except as noted, tests were conducted with a quiescent system. 1. Typically, kernel extensions are unsafe and have the same ability to crash a machine as device drivers. 4.2 Tools The results reported in this paper are based on ....

S. Weiss, J. Smith, POWER and PowerPC, Morgan Kaufmann, 1994.


Coverage Driven Processor Test Generation: Proof of Concept - Ur, Yadin

....tours to test vectors) is replaced with translation of tours on the model into verification tasks and then generating architectural test programs from these tasks. This paper shows the first implementation of the methodology developed in [18] to a superscalar state of the art PowerPC implementation [17][16]. The experiment, which is described in detail, includes modeling of parts of that processor in SMV [10], generating abstract tests from the model using CFSM [7], converting the abstract tests into restrictions on architectural tests, converting the restrictions into directive for a test ....

S. Weiss, J. E. Smith "POWER and PowerPC", Morgan Kaufmann, 1994


On Null Spaces and their Application to Model.. - Vandierendonck, De.. (2002)

.... too for a level 1 data cache, especially when the superscalar paradigm is extended to higher issue widths [SF91, NVDB00]. Current processors already make provisions for the simultaneous execution of several LOAD STORE operations per cycle by wave pipelining or double pumping the cache [KMW98, WS94] or by interleaving the data cache [CS99, GL96]. This paper develops a model to describe XOR based randomisation functions and investigate several aspects of these functions. Randomisation functions are presented by their null space, i.e. the set of all binary vectors that are mapped to set 0. ....

S. Weiss and J.E. Smith. POWER and PowerPC. Morgan Kaufmann Publishers, Inc., 1994.


Source Level Static Branch Prediction - Wong (1999)

....as to improve the accuracy of branch prediction during runtime. This remains a subject for further research [11]. In any case, we are now seeing architectures that recognize the importance of branch prediction and the possibility of software branch prediction. The SPARC version 9 [19] and the PowerPC [20] instruction sets are examples of modern superscalar architectures that have introduced branch instructions with prediction bits. As described in Section 4, the implementation of the testbed for our ideas was a modification (of no more than an estimated 5 ) of the GNU C compiler, a production ....

S. Weiss, and J. E. Smith. (1994) Power and PowerPC. Morgan Kaufmann Publishers, Inc.


MPS: Miss-path Scheduling for Multiple-issue Processors - Banerjia, Sathaye.. (1998)   (1 citation)

....and stored with the individual instructions in the cache). The central issue in speculating instructions is choosing which instructions to speculate, a decision that relies on predicting which path a branch will take. For architectures that support a prediction bit in the instruction encoding [18], [19] the scheduler can use this bit to make a prediction. The bit can be set based on profile information termed profiled prediction or a simple heuristic such as backward taken, forward not taken (BTFNT). In the absence of an ISA level prediction bit, a simple heuristic such as ....

S. Weiss and J. E. Smith, POWER and PowerPC, Morgan Kaufmann, San Francisco, CA, 1994.


A Persistent Rescheduled-Page Cache for Low Overhead.. - Conte, Sathaye, Banerjia (1996)   (3 citations)

....to the low overhead programs. With overhead-based replacement, the performance of high overhead programs improves substantially, while the low overhead programs perform only slightly worse than in the case of the LRU replacement. 1 Introduction Unlike contemporary superscalar processors [1], [2], [3] which employ dynamic scheduling, VLIW processors depend on a schedule of code generated by the compiler. (Published in: Proc. 29th Annual Int'l Symp. on Microarchitecture, Paris, 1996.) The compiler has full knowledge of the machine model, described in terms of the hardware resources available, ....

....(the base model) was calculated. Speedup is: number of cycles of execution estimated in the experiment) number of cycles of execution estimated for the base model) All three parts assumed a page size of 4K bytes, as is used in many contemporary operating systems [19], [20] and processors [21], [3]. Results from all three parts are shown in Figure 3. It can .... [Figure: Speedup (Harmonic Mean); series: speedup of native code for Gen n, and no-overhead and with-overhead speedup of Gen n code translated to Gen m] ....

S. Weiss and J. E. Smith, POWER and PowerPC. San Francisco, CA: Morgan Kaufmann, 1994.


Optimization of VLIW Compatibility Systems Employing.. - Thomas Conte Sumedh

....the base model) was calculated. It is defined as: speedup = number of cycles of execution estimated in the experiment) number of cycles of execution estimated for the base model) All three parts assumed a page size of 4K bytes, as in contemporary operating systems [24], [25] and processors [ [26]. B. Results The experiments described above were run for each benchmark, and are presented in Figures 12, 13, and 14. Figure 12 shows the performance of the code rescheduled to run on TINKER A. This is compared with the performance of the native compiled code for TINKER A. Both no overhead ....

S. Weiss and J. E. Smith, POWER and PowerPC. San Francisco, CA: Morgan Kaufmann, 1994.


Speculation Techniques for Improving Load Related.. - Yoaz, Erez, Ronen.. (1999)   (17 citations)

.... four basic categories: Single ported cache enhancements such as in [Wils95]. Cache duplication, in which several single ported copies of the cache are kept and accessed simultaneously [Digi97]. Single ported caches run at twice the core speed, thus emulating a truly multi ported cache [Weis94]. Multi banked cache designs such as [Simo95], [Hunt95]. In these multi banked designs a scheme was provided for minimizing the bank conflict problem. In this scheme load store instructions undergo dual scheduling after the load's address is calculated it enters a second level scheduler for ....

S. Weiss and J. Smith -- "POWER and PowerPC" -- Morgan Kaufmann Publishers, 1994.


System-Level Power Consumption Modeling and Tradeoff Analysis.. - Conte, al. (1999)   (6 citations)

....parallelism and execute code correctly. Empirical results suggest as much as a five times speed improvement when instruction level parallelism is exploited [2]. Current designs seek parallelism by examining and issuing four to six instructions per cycle, with higher rates expected [4], [5], [6], [7]. Successful use of these high issue rates requires careful tuning of the microarchitecture. There is a wealth of technological alternatives for this task. These include branch handling strategies [8], functional unit duplication [2] and instruction fetch, issue, completion and retirement ....

S. Weiss and J. E. Smith, POWER and PowerPC, Morgan Kaufmann, San Francisco, CA, 1994.


Hardware Techniques To Improve The Performance Of The.. - Burger (1998)   (10 citations)

....In joint work, Reinhardt came up with an efficient organization to handle tag cache misses quickly. In the organization that he proposed, the tag store is organized as a hash table (similar to an inverted page table in conventional microprocessors, such as in the POWER and PowerPC architectures [65, 129]). As in the PowerPC architectures, the size of the hash table was set to be twice as large as the power of two greater than or equal to the number of physical mapped regions (physical pages in PowerPC and cache blocks in ICE). In the PowerPC architecture, each hash table entry maps to one page ....

Shlomo Weiss and James E. Smith. POWER and PowerPC. Morgan Kaufmann Publishers, Inc., San Francisco, CA, 1994.


A Cache Line Fill Circuit for a Micropipelined.. - Mehra Garside Rmehra (1995)   (1 citation)

.... the required word first wrapping the line fetch addresses appropriately (desired word first wrapping fetch [4]) thus reducing t miss to a minimum (t lat t wd ). Both the early restart and desired word first optimisations are employed in high performance architectures (e.g. RS 6000 560 [15]). When employing early restart a problem can occur when the processor attempts its subsequent memory access, in that the cache may still be fetching the remainder of the previous missed line. The next cycle may be any of the following: 1. A write operation. t miss t lat nw t wd = 2. ....

....whilst allowing the line fetch to proceed in parallel. 4: Synchronous Implementations As mentioned in 2.1 Stall On Miss is the most commonly used line fetch strategy since it is simple to implement. To improve performance some synchronous designs use an Early Restart based strategy (RS6000 560 [15], ARM [1]); in particular the ARM series of cached microprocessors use a variant which will be described below. In the ARM a cache miss stalls the processor and causes a line fetch to be started from the address of the lowest word in the line. At the end of each memory read cycle the processor's ....

Weiss, S., Smith, J.E., "POWER and PowerPC", Morgan Kaufmann, 1994.


Miss Path Speculative Scheduling For High Issue Rates - Banerjia, Sathaye, Menezes, ..

....in speculating instructions is choosing which instructions to speculate, a decision that relies on predicting which path a branch will take. Branch prediction has been a heavily researched topic [18], [19], [20], [21]. For architectures that support a prediction bit in the instruction encoding [22], [23] the scheduler can use this bit to make a prediction. The bit setting may be based on profile information or a simple heuristic such as backward taken, forward not taken (BTFNT). In the absence of an ISA level prediction bit, a simple heuristic such as FT can be directly implemented in the ....

S. Weiss and J. E. Smith, POWER and PowerPC. San Francisco, CA: Morgan Kaufmann, 1994.


The AMULET2e Cache System - Garside Temple Mehra (1996)   (3 citations)

....starts by fetching the required word, forwards this to the processor and then continues autonomously to complete the line fill. The processor is then free to continue in parallel. This is known as early restart [12] and is a technique employed in high performance architectures (e.g. RS 6000 560 [15]). The parallelism introduced by this mechanism can cause problems when the processor requests its next memory cycle. To illustrate this the cycle following the miss can be classified as follows: 1) A write operation. 2) A read operation which may be satisfied from the existing cache ....

Weiss, S., Smith, J.E., "POWER and PowerPC", Morgan Kaufmann, 1994.


Design and Evaluation of Network Interfaces for System Area.. - Mukherjee (1998)

....a system programmable back off interval to reduce the likelihood of stealing the block back before the processor completes its writes to the CDR. This technique, called virtual polling, is useful for processors that cannot efficiently push data out of their caches. For processors (e.g. PowerPC [132]) that do support user level cache flush instructions, a CDR can be directly flushed out of the cache. Fourth, processors can efficiently communicate control information, e.g. interrupt masks, to an NI through CDRs. Changing control information, such as masking NI interrupts, can be expensive in ....

Shlomo Weiss and James E. Smith. Power and PowerPC. Morgan Kaufmann Publishers, Inc., 1994.


Accurate and Practical Profile-Driven Compilation Using the.. - Thomas Conte (1996)   (18 citations)

....These results indicate that reducing the number of contentions should be a priority for obtaining better profile information. Traditionally, branch target buffers (BTB) that have similar properties as the profile buffer have overcome the problem by employing larger buffers (512 or 1024 entry) [1], [22], but this implies added hardware cost. An alternative solution is to reduce the number of branches that access the buffer. This alternative is explored in the next two sections. [Figure: axes "execution weight (actual)" and "arc error"; panel "address mapping (8)"] ....

S. Weiss and J. E. Smith. POWER and PowerPC. Morgan Kaufmann, San Francisco, CA, 1994.


Disjoint Eager Execution: An Optimal Form of Speculative.. - Uht, Sindagi (1995)   (37 citations)

....(including VLIW [Very Long Instruction Word] [1] and mixtures of the two [4]). ILP as used herein refers to the parallelism that exists between two or more machine instructions in a program. Up to six instructions may be executed concurrently in current or announced machines, e.g. the IBM POWER2 [14], but there are severe limitations on the mix of simultaneous instructions allowable and the typical average performance gain due to ILP is only at most a factor of 2 or 3 better than an ideal sequential machine. 1.1 ILP enhancement techniques Enhancing the independence of instructions so that ....

S. Weiss and J. E. Smith. POWER and PowerPC. Morgan Kaufmann Publishers, Inc., San Francisco, California, 1994.


Coherent Network Interfaces for Fine-Grain Communication - Mukherjee, Falsafi, al. (1996)   (25 citations)

....programmable backoff interval to reduce the likelihood of stealing the block back before the processor completes its writes to the CDR. This technique, called virtual polling, is necessary because few processors can efficiently push data out of their caches. For processors (e.g. PowerPC [47]) that do support user level cache flush instructions, the CDR can be directly flushed out of the cache. CDRs allow a processor to efficiently transfer a full cache block (e.g. 32-128 bytes) of information to or from a CNI. For smaller amounts of data, e.g. a 4 byte word, CDRs are less ....

Shlomo Weiss and James E. Smith. Power and PowerPC. Morgan Kaufmann Publishers, Inc., 1994.


Optimization of Instruction Fetch Mechanisms for High.. - Conte, Menezes, Mills.. (1995)   (85 citations)

....unit to speculate beyond branches [1], [12]. This decoupling reduces the impact of more complicated (and higher latency) instruction fetch hardware. In addition to this, the six instruction per cycle IBM POWER2 architecture employs an instruction cache with eight, independently addressable banks [13]. This fetch unit can align many instruction sequences, but is limited by the POWER2's static branch prediction mechanism, which is known to have lower performance than dynamic schemes. The recently announced AMD superscalar 29K addresses this limitation by embedding prediction and branch target ....

....contention seldom occurs. Two register files are maintained: the Messy register file and the Future register file. The former is used for out of order execution. If used without augmentation, the microarchitecture would be limited to imprecise interrupts. This is remedied using a reorder buffer [13]. The chief performance metric is instructions retired per cycle (IPC) which is the number of instructions leaving the reorder buffer (i.e. retiring) per simulated execution cycle. All three microarchitectures have direct mapped instruction caches. The cache block size is calculated so that a ....

S. Weiss and J. E. Smith, POWER and PowerPC. San Francisco, CA: Morgan Kaufmann, 1994.


Software-Managed Address Translation - Jacob, Mudge (1997)   (19 citations)

....same set in an associative, virtually indexed, virtually tagged cache. flushing the entire cache or sweeping through the entire cache and modifying the affected lines. 3.2 Segmented translation The IBM 801 introduced a segmented design that persisted through the POWER and PowerPC architectures [6, 29, 39, 53]; it is illustrated in Fig 2. Applications generate 32 bit effective addresses that are mapped onto a larger virtual address space at the granularity of segments, 256MB virtual regions. Sixteen segments comprise an application's address space. The top four bits of the effective address select ....

S. Weiss and J. E. Smith. POWER and PowerPC. Morgan Kaufmann Publishers, San Francisco CA, 1994.


System-Level Power Consumption Modeling and Tradeoff.. - Conte, Menezes, Sathaye (1997)   (6 citations)

....parallelism and execute code correctly. Empirical results suggest as much as a five times speed improvement when instruction level parallelism is exploited [2]. Current designs seek parallelism by examining and issuing four to six instructions per cycle, with higher rates expected [3], [4], [5], [6]. Successful use of these high issue rates requires careful tuning of the microarchitecture. There is a wealth of technological alternatives for this task. These include branch handling strategies [7], functional unit duplication [2] and instruction fetch, issue, completion and retirement ....

S. Weiss and J. E. Smith, POWER and PowerPC, Morgan Kaufmann, San Francisco, CA, 1994.


Two Computer Systems Paradoxes: Serialize-to-Parallelize, and.. - Orni, Vishkin (1995)   (5 citations)

.... include software pipelining [18, 14, 16], loop unrolling and trace scheduling [9]. Runtime techniques include speculative execution, conditional execution, and branch prediction [9]. One kind of machine which uses the parallelism found by compiler or runtime techniques is multiscalar machines [27]. Since relatively little attention has been given to using these techniques for programs which were not initially designed for a strictly serial processor, we mention one outstanding issue with respect to doing so. To run such programs efficiently, the hardware may need to have the ability to ....

S. Weiss and J. E. Smith. Power and PowerPC. Morgan Kaufmann Publishers Inc., San Francisco, CA, 1994.


Wrong-Path Instruction Prefetching - Pierce, Mudge (1994)   (30 citations)

.... to alleviate the decode problems associated with variable instruction size and complex encodings [1]. Superscalar processors like the Alpha or Power2 architectures prefetch multiple lines from the cache so that multiple instructions can be issued per cycle even during branch conditions [7][13]. The PowerPC also prefetches multiple instructions from the cache into prefetch buffers. It does this primarily because the instruction fetch must share a single port to the unified cache with data memory requests and thus it cannot fetch an instruction from ....

S. Weiss and J. Smith, Power and the PowerPC, San Mateo, CA: Morgan Kaufmann, 1994.


Selective Dual Path Execution - Heil, Smith (1996)   (24 citations)  Self-citation (Smith)

....the Resume Cache, eliminating one cycle from the misprediction penalty. The IBM POWER1 processor statically predicts all conditional branches not taken. However, the processor fetches some instructions from the taken path as a hedge, which results in a single cycle penalty when the branch is taken [17]. 1.2 Paper Summary The rest of the paper is divided into the following sections. Section 2 contains a description of the branch confidence mechanism and describes the forking policies studied. Section 3 describes the trace driven model and gives simulation results. Section 4 describes ....

....can occur on the forked second path. With this method, two basic blocks per cycle can be fetched during dual path execution, one from each path. An alternative, which we did not study, is to use a single fetch unit that can fetch multiple basic blocks in a single cycle. Such units are proposed in [1, 10, 17]. FIGURE 16. Alternating fetching of paths in SDPE reduces most of performance improvement obtained through SDPE. (Legend: cycles lost due to branch misprediction, max. outstanding branches, instruction queue full, BTB misses, Icache misses.) ....

S. Weiss, J. Smith. POWER and PowerPC. Morgan Kaufmann Publishers, Inc., pages 100 - 104, 188 - 192, 1994.


Restricted Dual Path Execution - Heil, Farrens, Smith, Tyson   Self-citation (Smith)

....from only one path could be decoded and executed [CoFP79]. Similarly, the IBM POWER1 processor statically predicts all conditional branches not taken. However, the processor fetches some instructions from the taken path as a hedge, which results in a single cycle penalty when the branch is taken [WeSm94]. Also, the MIPS R10000 requires a one cycle delay to decode and predict branches. This cycle is used to fetch instructions sequentially following the branch. If the branch is predicted taken, the extra fetched instructions are stored in a Resume Cache [Gwen94] in a partially decoded state. If ....

S. Weiss and J. Smith, POWER and PowerPC, Morgan Kaufmann Publishers, Inc., (1994).


A Methodology for Processor Implementation Verification - Lewin, Lorenz, Ur (1996)   (3 citations)

No context found.

S. Weiss, J. E. Smith "POWER and PowerPC", Morgan Kaufmann, 1994


CiteSeer.IST - Copyright Penn State and NEC