M. D. Smith, M. Johnson, et al. Limits on multiple instruction issue. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 290--302. ACM Press, 1989.
....of up to 50,000 operations were measured and an average speedup of around 90 was observed. Smith used trace-driven simulations to measure the effective limits of multiple instruction issue in superscalar architectures and observed an instruction issue rate of about two instructions per cycle [1]. Wall also tested ILP limits under various assumptions, using a wide variety of hardware and software techniques including branch prediction, register renaming, and alias analysis, and concluded that average parallelism rarely exceeds 7 [2]. Audio and video applications use different enough ....
Michael D. Smith, Mike Johnson, and Mark A. Horowitz, "Limits on Multiple Instruction Issue", Third International Symposium on Architectural Support for Programming Languages and Operating Systems, pp. 290-302, April 1989.
....with studies that attempt to measure the inherent instruction level parallelism (ILP) limits in various programs. They profile or simulate code, following the dependence paths, and measure the amount of parallelism given various architectural constraints. Among those have been Smith, et al. [20], Butler, et al. [4], Wall [26], Theobald, et al. [22], and Lam and Wilson [11]. The difference in this paper is that we are not searching just for the length of the critical path (another way of thinking about the ILP they measured) but the composition of the critical path. We want to know ....
M.D. Smith, M. Johnson, and M.A. Horowitz. Limits on multiple instruction issue. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS III), pages 290--302, 1989.
....or superpipelined organization in order to increase execution speed by exploiting instruction parallelism, i.e. by attempting to execute more than one instruction in parallel. Unfortunately, the amount of available instruction parallelism in typical programs is low; on the order of two [1] to eight [2] instructions. In addition, both techniques tend to increase the penalty for branches and, hence, require expensive hardware for branch prediction. Ultimately, processor performance is limited by the presence of data and control hazards in programs as well as the need to fetch data ....
Michael D. Smith, Mike Johnson, and Mark A. Horowitz. Limits on multiple instruction issue. In ASPLOS 3, 1989.
....comment: Cache misses impose a larger penalty for multi issue and other parallel machines. This is because the number of instructions lost is magnified by the width of the instruction window. Smith, Johnson and Horowitz study the available parallelism for a superscalar MIPS architecture in [161]. In this study, trace driven simulations were used to find the parallelism for variations of the MIPS architecture, including superscalar versions. The benchmarks used were non scientific code, i.e. avoiding the Livermore Loops. They start with code optimized for the R2000 in this study. Pixie is ....
M. D. Smith, M. Johnson, M. A. Horowitz, Limits on Multiple Instruction Issue, Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, 1989, vol. 17, pp. 290-302.
....assumptions about the compilation and processor technology which may never be attained or (better yet) may be bettered in future. Early studies [161, 128] concluded that there is very limited parallelism in general purpose applications. Since then several ILP limit studies have been conducted [108, 83, 157, 155, 172, 173, 90]. Most of these studies concluded that, in general, the available ILP is limited, claiming that often the number of maximum instructions that can be issued on each cycle is typically less than 10. All these studies explain the general perception that conventional approaches to ILP may not improve ....
M. Smith, M. Johnson, and M. Horowitz. Limits on multiple instruction issue, 1989.
....in order to achieve the maximum possible performance. In particular, in order to make effective use of a large number of functional units it is necessary to perform optimizations across basic block boundaries, as the amount of parallelism available within basic blocks tends to be quite limited ([Smith89], [Wall91]). In this paper we survey some common compiler and architectural techniques for increasing program ILP and making more effective use of the available hardware resources. We begin by discussing Trace Scheduling ([Fisher81], [Fisher83]) in section 2. In trace scheduling, compilation ....
Michael D. Smith, Mike Johnson, Mark A. Horowitz, Limits on Multiple Instruction Issue , Proc. ASPLOS 89, pp. 290-302
....we do not have any instruction to dispatch since we don't know what to execute next. Previous superscalar implementations stall the execution of one or more functional units. However, the intrinsic parallelism of most non-numerical applications has been shown to be insufficient to make this effective [6]. The solution is very similar to the single-issue case. We use branch prediction to assume either the taken or non-taken path will be reached, and execute the instruction located there. By doing this, the processor avoids twiddling its thumbs. Now instead of sitting idle waiting for branch ....
M.D. Smith, M. Johnson, and M.A. Horowitz. "Limits on Multiple Instruction Issue". Proc. Third Int. Conf. on Architectural Support for Programming Languages and Operating Systems, April 1989, pp. 290-302.
....and a factor of 2-3 for the numeric benchmarks. They also investigated the increase in parallelism when unrolling loops for the numeric benchmarks. They found a degree of parallelism of up to 6 with unrolling degrees of 10. Another study for superscalar processors was carried out by M.D. Smith et al. [SJH89]. This study was based on traces executed on the MIPS R2000 processor. The experiment was done for non-numeric programs and with several configurations of functional units. In fact it was not a study of the limits of available parallelism, but a study of attainable parallelism with a limited ....
M.D. Smith, M. Johnson, and M.A. Horowitz. Limits on multiple instruction issue. In Proc., Third Internat. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 290--302, April 1989.
....stream or continue executing the sequential (fall through) stream. This performance diminishing effect of branch instructions is further amplified in computers with multiple pipelined functional units. This effect has been known for some time [3] and has recently received further attention [4] with the advent of superscalar and scalable compound instruction set machines (SCISM) [5]. These machines, which attempt to execute multiple instructions in parallel from a single instruction stream, are particularly susceptible to the adverse effects of branches, because not only will branch ....
M. D. Smith, M. Johnson, and M. Horowitz, "Limits on multiple instruction issue," Proceedings of ASPLOS III, ACM, pp. 290--302, 1989.
.... for given programs and compared the relative performance improvements of superscalar and superpipelined processors against the traditional scalar processor [2]. However, Smith concluded that the programs without [illegible] mathematical operations do not have enough parallelism to execute two instructions per cycle [3]. In the out-of-order issue, instructions may also complete out of order because they are not issued in sequential order and have different latency in their operations. To take full advantage of the out-of-order issue, the processor should employ the out-of-order completion. However, the out-of-order completion leads to a ....
M.D. Smith, M. Johnson, and M.A. Horowitz, "Limits on multiple-instruction issue," Proc. Third Int. Conf., Architectural Support Programming Language and Oper. Sys., pp.290--302, April 1989.
....on the cache organization. The latency tolerance of references that hit in the cache is not accounted for, even though it may be quite large. We evaluate the latency tolerance of individual load instructions without being tied down to a specific memory system. Graph based analyses have been used [2, 9, 15, 21, 25, 28] to study the amount of parallelism available in programs. They work on a static dynamic execution trace under idealistic assumptions such as unconstrained resources, single cycle operational latencies for functional units, perfect branch prediction and alias analysis. They use the dependence ....
M. D. Smith, M. Johnson, and M. A. Horowitz. Limits on Multiple Instruction Issue. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS III), pages 290--302, May 1989.
.... the hardware needed to detect data dependencies between instructions, the complexities of fetching noncontiguous instructions from memory, and gathering multiple data items from memory in single cycles, arrived at much more pessimistic results that suggested 2-3 as a realistic limit for IPC [JW89, SJH89]. Nevertheless, the limit studies showed that the limitations were one of physical implementation, not logical limitations, and thus provided a realistic goal for implementers. In fact, before some of these studies there was already a significant body of research that proposed to eliminate the ....
.... group at the University of Illinois showed that 16 or more processors could be kept busy on workloads characterized by FORTRAN DO loops [KBC 74]. Furthermore, not all researchers who considered the attenuation that results from implementation complexities were as pessimistic as [JW89] and [SJH89]. In [BYP 91], as the title suggests, the authors argue a strong case for implementation complexities being less limiting. Indeed, in the past few years manufacturers have now started to produce machines with the ability to issue six instructions per cycle, with more on the horizon [Gwe97] ....
M. D. Smith, M. Johnson, and M. A. Horowitz. Limits on multiple instruction issue. In Proc. ASPLOS-3, number 5, pages 290-302, May 1989.
....to execution resource limits such as inter-instruction dependencies, load and store latencies, floating-point latencies, and so on. Instruction fetch bandwidth and branch latency are identified by Smith, Johnson, and Horowitz as principal barriers to increased performance in superscalar processors [71]. 2.3 Normalizing Performance Measurements. The instruction count, or path length IC, is the total number of instructions in an execution trace for a given program. Given the total number of cycles to complete program P for two different machines, if clock rates are identical, it is easy to decide ....
.... demands are high, and cache misses can be expensive (a stall cycle can represent several lost issue opportunities). Smith, Johnson, and Horowitz identify instruction fetch performance as more critical than instruction parallelism in limiting superscalar performance on non-scientific programs [71]. The traffic reduction and increased cache performance provided by 16-bit instructions could help. Like VLIWs, the increase in antidependencies arising from the small register name space could limit parallelism, but register renaming can help mitigate this problem. Machines with deep pipelines ....
Michael D. Smith, Mike Johnson, and Mark A. Horowitz. Limits on multiple instruction issue. ACM SIGPLAN Notices, Proceedings ASPLOS-III, 24:290--301, May 1989.
....for this purpose. Both of these architectures have the capability of issuing multiple operations each cycle. The problem is that the available operation level parallelism within each basic block of a program has been shown to be close to 2, hardly enough to justify these new architectures [1], [2]. To increase the available operation level parallelism, many new scheduling techniques have been developed: trace scheduling [3], superblock scheduling [4], and percolation scheduling [5], to name a few. All of these global compaction techniques are used in the compiler to schedule operations ....
M. D. Smith, M. Johnson, and M. A. Horowitz, "Limits on multiple instruction issue," in Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 290--302, April 1989.
....of specialized opcodes available in the architecture, and 12 to take advantage of new optimization opportunities after spill code has been added by the register allocator. 2.3 Superblocks. For most non-numeric programs, the ILP available within individual basic blocks is extremely limited [26], [27], [28]. An ILP compiler must be able to optimize and schedule instructions across basic block boundaries to find sufficient parallelism. An effective structure for ILP compilation is the superblock [23], [29]. The formation and optimization of superblocks increases the ILP available to the ....
M. D. Smith, M. Johnson, and M. A. Horowitz, "Limits on multiple instruction issue," in Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 290--302, April 1989.
....depends on the ability consistently to present it with a sufficient number of independent instructions. Studies have shown that by using conventional code optimization and scheduling methods, wide ILP processors have difficulty sustaining a speedup of more than two for nonnumeric programs [1], [2], [3]. These low speedups are a result of numerous challenges encountered in the process of extracting ILP. Although ILP can be extracted solely by the compiler, or solely by the hardware, the compiler and hardware have individual strengths and the task of extracting ILP should be divided ....
M. D. Smith, M. Johnson, and M. A. Horowitz, "Limits on multiple instruction issue," in Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, April 1989, pp. 290-302.
....very little work on analyzing and optimizing pointer intensive programs. Meanwhile, hardware architects are in a quandary about how much superscalar parallelism to make available on VLSI processor chips. There have been numerous studies on the limits of instruction level parallelism [Wall 1991; Smith et al. 1989], but no widespread agreement. The IBM Power2 architecture employs 6 functional units (two integer, two floating point, and two branch) and can dispatch up to 6 instructions per cycle. Whether a machine with twice as many again functional units would be able to deliver significant speedups over a ....
Smith, M. D., Johnson, M., and Horowitz, M. A. 1989. Limits on multiple instruction issue. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, (Boston, Mass., Apr.). SIGPLAN Not., 24, Special Issue, 290--302.
....method to improve instruction execution rate is to increase the number of instructions executed per cycle. This is done by fetching, decoding, and executing multiple instructions per cycle. This is often referred to as multiple instruction issue [44], [12], [27], [31], [32], [19], [35], [36], [42], [24], [41]. The timing diagram of such a pipeline is shown in Figure 6. In this example, two 4 Although the compare and branch instructions are assumed in the example, the methods in this paper apply to condition code branches as well. 5 Although unconditional branch instructions can redirect the ....
....return if one occurs to E 0 or F 0 . Section 4.2 presents an alternative approach to reducing code expansion. 3.6 Extension to Out-of-order Execution. Inline Target Insertion can be extended to handle instruction sequencing for out-of-order execution machines [46], [47], [45], [18], [19], [41]. The major instruction sequencing problem for out-of-order execution machines is the indeterminate timing of deriving branching conditions and target addresses. It is not feasible in general to design an efficient sequencing pipeline where branches always have their conditions and target ....
M. D. Smith, M. Johnson, and M. A. Horowitz, "Limits on Multiple Instruction Issue", Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, pp.290-302, April, 1989.
....scheduling, superblock, superpipelining, superscalar. 1 Technical Report CRHC 91 29, University of Illinois 2 1 Introduction For non-numeric programs, there is insufficient instruction level parallelism available within a basic block to exploit superscalar and superpipelined processors [20], [21]. To schedule instructions beyond the basic block boundary, instructions have to be moved across conditional branches. There are two problems that need to be addressed in order for a scheduler to move instructions above branches. First, to schedule the code efficiently, the scheduler must ....
....the program or incorrectly overwrites a value when the branch is mispredicted. Various hardware techniques can be used to prevent such hazards. Buffers can be used to store the values of the moved instructions until the branch commits [12], [21], [22]. If the branch is taken, the values in the buffers are squashed. In this model, exception handling can be delayed until the branch commits. Alternatively, non-trapping instructions can be used to guarantee that a moved instruction does not cause an exception [8]. In this paper we focus on ....
M. D. Smith, M. Johnson, and M. A. Horowitz, "Limits on Multiple Instruction Issue", Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, April 1989.
.... For example, Wall and Jouppi have examined this issue for a variety of realistic and idealized machine configurations [23], [60]. Other work by Smith, Johnson, and Horowitz has also explored the limits on instruction issue afforded by a suite of integer and scalar floating point applications [55]. These works have focused on inherent parallelism in the application. Lam and Wilson have explored the impact of control flow and showed how relaxing control dependence constraints potentially improves performance [31]. Woo et al. have characterized the behavior of the SPLASH-2 suite in terms of ....
M. D. Smith, M. Johnson, and M. Horowitz. Limits on Multiple Instruction Issue. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 290--302, Apr. 1989.
....processors become deeper and the number of concurrently executing instructions increases. Traditionally, the scheduling scope has been limited to one basic block. For high performance processors, there is typically insufficient ILP within a basic block to fully utilize the processor resources [1] [2]. Thus, compilers for high performance processors must look beyond basic block boundaries to schedule instructions. There are two problems associated with scheduling in the presence of conditional branches. First, to achieve a good schedule the compiler must take into account the resource and ....
M. D. Smith, M. Johnson, and M. A. Horowitz, "Limits on multiple instruction issue," in Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 290--302, April 1989.
....on the ability of compilers to provide sufficient instruction level parallelism (ILP) in program code. However, recent studies show that conventional code optimization and scheduling methods cannot provide enough ILP to obtain a sustained speedup of more than two for nonnumeric programs [1], [2], [3]. The high frequency of conditional branch instructions in nonnumeric programs is mostly responsible for these poor results. Branch instructions impede the ability of the compiler to extract ILP in several ways. Branches impose restrictions on the ability of the compiler to move code. Moving ....
M. D. Smith, M. Johnson, and M. A. Horowitz, "Limits on multiple instruction issue," in Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, April 1989, pp. 290--302.
....COMPILER CONTROLLED SPECULATION Roger Alexander Bringmann, Ph.D. Department of Computer Science, University of Illinois at Urbana-Champaign, 1995. Wen-mei W. Hwu, Advisor. The available instruction level parallelism (ILP) is extremely limited within basic blocks of non-numeric programs [1], [2], [3]. An effective VLIW or superscalar processor must optimize and schedule instructions across basic block boundaries to achieve higher performance. An effective structure for ILP compilation is the superblock [4]. The formation and optimization of superblocks increase ILP available to the ....
....of a scheduler is straightforward if list scheduling is applied within basic blocks. Unfortunately, there is insufficient instruction level parallelism available within basic blocks of non numeric benchmarks to fully utilize the functional units of wide issue superscalar and VLIW architectures [1, 2, 7]. Therefore global scheduling techniques such as trace scheduling [8] and superblock scheduling [4] have been proposed to permit greater scheduling and optimization freedom beyond basic block boundaries. Using these techniques, the program is divided into a set of traces or superblocks that ....
M. D. Smith, M. Johnson, and M. A. Horowitz, "Limits on multiple instruction issue," in Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 290--302, April 1989.
....for an ILP compiler is to extract ILP in nonnumerical code, such as user application programs or system programs. It has been said that nonnumerical code leads to small speedup (as little as two) and that ILP in such code is difficult to exploit due to its irregularity [Jouppi and Wall 1989; Smith et al. 1989]. The following is a discussion of typical problems in exploiting irregular ILP and a brief overview of corresponding solutions proposed in this article. Nonnumerical programs include a large number of conditional branches and control join points, resulting in basic blocks with few instructions. ....
Smith, M., Johnson, M., and Horowitz, M. 1989. Limits on multiple instruction issue. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems. ACM Press, New York, 290--302.
....is magnified both by the length and the number of pipelines. Hence, locating instructions to fill the branch delay slots becomes increasingly more important so that sufficient program parallelism [19] is available to match the increased machine parallelism [19] offered by the processor. Others [18, 30, 26] have argued that speculative code motion is necessary for extracting instruction level parallelism from branch intensive, non scientific programs. While compilation techniques like loop unrolling and software pipelining work well at extracting parallelism from loop intensive programs, they ....
M.D. Smith, M. Johnson, M.A. Horowitz. Limits on multiple instruction issue. ASPLOS III, pp. 290-302. Boston, MA, April, 1989.
....units Table 1 Performance of one group As can be seen, there are no significant benefits in using more than 2 arithmetic units within a group. This topic is currently receiving much attention, and, with adequate interpretation, the same conclusion has been found in different situations [7, 11, 14], and appears to be a quite general property of programs. The number of functional units has therefore been fixed as two. There are two couples of address units, one for reads and one for writes; each is composed of one single reference and one multiple reference unit. A branch unit completes the ....
Smith, M. D., Johnson, M. and Horowitz, M. A. "Limits on Multiple Instruction Issue", 3rd Int. Conference on Architectural Support for Programming Languages and Operating Systems, 290-302, April 1989.
....of instructions, enabling it to expose a greater amount of ILP. While correct branch prediction can increase ILP, incorrect prediction often results in large performance penalties. Recent studies have shown that imperfect branch prediction can reduce performance by a factor of two to more than ten [9], [10], [11]. These performance penalties are attributed to several conditions. First, a large number of instructions, termed speculative instructions, are often executed from the predicted direction of each branch. When the branch is mispredicted, all speculative instructions must be discarded ....
M. D. Smith, M. Johnson, and M. A. Horowitz, "Limits on multiple instruction issue," in Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 290--302, April 1989.
....demand that a large amount of instruction level parallelism (ILP) be exposed to capitalize on their offered performance. Furthermore, due largely to the frequency of control operations in code, fetching enough instructions to feed such high performance cores has become a significant challenge [5]. One approach to meeting these demands for ILP and efficient instruction fetch has been dubbed explicitly parallel instruction computing (EPIC) and is the basis for the IA-64 architecture released recently by Intel and Hewlett-Packard [6]. EPIC architectures permit the compiler to express ....
M. D. Smith, M. Johnson, and M. A. Horowitz, "Limits on multiple instruction issue," in Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, April 1989, pp. 290--302.
....this approach has been routinely applied by the microprocessor community, there are very few studies of this kind in the vector computer arena. Microprocessor design is more and more driven by performance studies that try to identify all kinds of parallelism available in the instruction stream [3, 4, 5, 6], and that evaluate possible bottlenecks in the architecture. The detailed knowledge of the relative frequency of execution of instructions [7, 8] leads to the inclusion of new compound operations (like the fused multiply-add instruction) in the instruction set so that the common case is ....
Michael D. Smith, Mike Johnson, and Mark A. Horowitz. Limits on multiple instruction issue. ASPLOS, pages 290--302, 1989.
....amount of exploitable instruction level parallelism (ILP) is, however, severely limited in non-numerical applications due to dependence constraints. Because of dependence constraints, it is difficult to dramatically improve performance with just extra function units. In fact, limit studies [1], [2], [3], [4], studies of the limitation of exploitable ILP in applications, show that basic block scheduling, a code scheduling only within a basic block, exploits a very limited amount of ILP. The limit studies also show that speculative execution dramatically increases the amount of exploitable ILP. In ....
M. D. Smith, M. Johnson, and M. A. Horowitz, "Limits on Multiple Instruction Issue," In Proc. Third Int. Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 290-302, April 1989.
.... For example, Wall and Jouppi have examined this issue for a variety of realistic and idealized machine configurations [43, 106]. Other work by Smith, Johnson, and Horowitz has also explored the limits on instruction issue afforded by a suite of integer and scalar floating point applications [92]. These works have focused on inherent parallelism in the application. Lam and Wilson have explored the impact of control flow and showed how relaxing control dependence constraints potentially improves performance [55]. Woo et al. have characterized the behavior of the SPLASH-2 suite in terms of ....
M. D. Smith, M. Johnson, and M. Horowitz. Limits on Multiple Instruction Issue. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 290--302, Apr. 1989.
....to add instructions to a RISC processor? We think so. First, there is no real shortage of chip area on most modern RISC processors [7] and second, many modern superscalar processors have more ALUs than are needed for the amount of instruction level parallelism found in most applications [18]. The real problem is the cost of the design work required to implement these instructions in very high clock rate technologies. The question then is one of economics: is the market for such a chip large enough for the vendor to make a profit? The UltraSPARC and HP PA 7100LC may provide existence ....
M. D. Smith, M. Johnson, and M. A. Horowitz. Limits on multiple instruction issue. In Proceedings Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS III), pages 290--302, Boston MA, April 1989.
....that a unique value of a variable has a unique name. The cost of this renaming is a potentially large increase in the memory requirements of the program [15]. 2.2. Available Parallelism. Several studies have attempted to determine how much parallelism is actually available in application programs [2, 3, 12, 16, 19, 22, 25, 28, 30, 33, 35, 36]. These studies have examined a wide variety of numeric and non-numeric application programs and have measured speedups ranging from slightly more than one to as much as several thousand when ignoring all resource dependences [20]. In all cases, these studies indicate that maximum speedups of only ....
Michael D. Smith, Mike Johnson, and Mark A. Horowitz, "Limits on Multiple Instruction Issue," International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 290-302, April 1989.
....increase, it is conceivable that some future single chip CMOS VAX implementation might achieve CPI numbers that are close to the VAX 9000's. Just as VAX CPI can be improved by the gate intensive approach of the model 9000 design, so RISC CPI can be improved by superscalar or superpipelined designs [18, 29]. The IBM RISC System/6000 [17], for example, has a peak issue rate of four instructions per cycle. So while VAX may catch up to current single-instruction-issue RISC performance, RISC designs will push on with earlier adoption of advanced implementation techniques, achieving still higher ....
Smith, M.D., Johnson, M., and Horowitz, M.A. Limits on Multiple Instruction Issue. Proc. Third Int. Conf. on Architectural Support for Prog. Lang. and Op. Syst., ACM/IEEE, Boston, MA, April 1989, pp. 290-302.
....Action (APPARC) and by the CEPBA (European Center for Parallelism of Barcelona) are very few studies of this kind in the vector computer arena. Microprocessor design is more and more driven by performance studies that try to identify all kinds of parallelism available in the instruction stream [3, 4, 5], and that evaluate possible bottlenecks in the architecture. The detailed knowledge of the relative frequency of execution of instructions [6, 7] leads to the inclusion of new compound operations (like the fused multiply add instruction) in the instruction set so that the common case is ....
Michael D. Smith, Mike Johnson, and Mark A. Horowitz. Limits on multiple instruction issue. ASPLOS, pages 290--302, 1989.
....parallelism present in the code. Several mechanisms have been developed that increase the number of instructions that a processor can issue in parallel, thereby increasing performance. This paper focuses on two such mechanisms: register renaming and dynamic speculation. A number of recent studies [5, 13, 10] have shown that they can make a significant impact on performance. For instance, Johnson states in [5] that removing renaming and speculation from his superscalar implementation would reduce the speedup obtained by 36 and 30 respectively. Register renaming improves performance in code not ....
M.D. Smith, W.M. Johnson, and M.A. Horowitz. Limits on multiple instruction issue. In Third International Symposium on Architectural Support for Programming Languages and Operating Systems, pages 290--302, April 1989.
....3. The value predictors considered in this work are presented in section 4. The results of this study are detailed in section 5. Finally, section 6 summarizes the main conclusions of this work. 2 Related work There have been a plethora of works dealing with the limits of the ILP [1][2][6][9][13][17][18]. Each work studies the ILP that could be exploited under some constraints such as fetch width, instruction window size, branch prediction, register renaming, memory aliasing, etc. The main conclusion that can be extracted from all these works is that one of the main features that limit ....
M.D. Smith, M. Johnson and M.A. Horowitz "Limits on Multiple Instruction Issue" In Proc. of the ACM Conf. on Architectural Support for Programming Languages and Operating Systems, 1989
....the Center for Reliable and High Performance Computing, University of Illinois, Urbana-Champaign, Illinois, 61801. 1 Introduction For non-numeric programs, there is insufficient instruction-level parallelism available within a basic block to exploit superscalar and superpipelined processors [1][2][3]. To schedule instructions beyond the basic block boundary, instructions have to be moved across conditional branches. There are two problems that need to be addressed in order for a scheduler to move instructions above branches. First, to schedule the code efficiently, the scheduler must ....
....should neither cause an exception which terminates the program nor incorrectly overwrite a value when the branch is mispredicted. Various hardware techniques can be used to prevent such hazards. Buffers can be used to store the values of the moved instructions until the branch commits [16][2][17]. If the branch is taken, the values in the buffers are squashed. In this model, exception handling can be delayed until the branch commits. Alternatively, non-trapping instructions can be used to guarantee that a moved instruction does not cause an exception [18]. In this paper we focus on ....
M. D. Smith, M. Johnson, and M. A. Horowitz, "Limits on multiple instruction issue," in Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 290--302, April 1989.
....Another effect of using hardware to detect instruction level parallelism is that the hardware can only analyze a small window of dynamic instructions during each cycle, thus limiting the possible candidates for parallel issue. Finally, instruction fetch efficiency, defined in Smith et al. [21] as the average number of useful instructions fetched per cycle, is reduced when executing from scalar object code. As a result of the large number of branches during the execution of nonnumerical code, dynamic schedulers suffer a significant performance penalty due to branch point misalignment in ....
....reorder buffer space in MATCH and enough general-purpose registers in TORCH. Finally, we assumed a small number of functional units since the fetch efficiency and not the number of functional units is the limiting factor when exploiting instruction-level parallelism in non-numerical applications [21]. As a result, we maximize the functional unit cost-performance tradeoff by making the load-store pipe, our most expensive functional unit to duplicate, the most frequently used resource. With one of each functional unit, the integer ALU is the most frequently used functional unit. By adding an ....
M.D. Smith, M. Johnson, M.A. Horowitz, "Limits on Multiple Instruction Issue." Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems (April 1989), pp. 290-302.
....to exploit ILP. In non-numerical applications, the amount of ILP is limited. Limit studies, studies which try to bound the amount of exploitable ILP in applications, show that superscalar processors must look beyond branch boundaries to exploit the available ILP in non-numerical applications [22][29]. These studies show that good performance requires both a good instruction schedule and speculative execution, the execution of instructions before it is known for certain whether those instructions will be executed. What is not known is how to best schedule instructions for a superscalar ....
M.D. Smith, M. Johnson, and M.A. Horowitz. Limits on Multiple Instruction Issue. In Proc. Third Int. Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 290--302, April 1989.
No context found.
M. D. Smith, M. Johnson, and M. Horowitz, "Limits on multiple instruction issue," in Int. Symp. Computer Architecture, Boston, MA, Apr. 1989, pp. 290--302.
No context found.
M. D. Smith, M. Johnson, et al. Limits on multiple instruction issue. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) , pages 290--302. ACM Press, 1989.
No context found.
M. D. Smith, M. Johnson, and M. A. Horowitz. Limits on multiple instruction issue. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating System (ASPLOS), volume 24, pages 290--302, New York, NY, 1989. ACM Press.
No context found.
M. D. Smith, M. Johnson, and M. A. Horowitz. "Limits on multiple instruction issue." In Proc. of the 3rd Intl. Conf. on Architectural Support for Programming Languages and Operating System, pages 290--302, 1989.
No context found.
M. Smith, M. Johnson, and M. Horowitz. Limits on multiple instruction issue, 1989.
No context found.
M. D. Smith, M. Johnson, and M. A. Horowitz. Limits on multiple instruction issue. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 290--302. ACM Press, 1989.
No context found.
M. D. Smith, M. Johnson, and M. A. Horowitz. "Limits on multiple instruction issue." In Proc. of the 3rd Intl. Conf. on Architectural Support for Programming Languages and Operating System, pages 290--302, 1989.
No context found.
Smith, M. D., Johnson, M. and Horowitz, M. A. Limits on multiple instruction issue. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, Boston, Massachusetts, ACM, 290-302, 1989.
No context found.
M. D. Smith, M. Johnson, and M. A. Horowitz. Limits on multiple instruction issue. In ASPLOS III, pages 290--302, Boston, Massachusetts, 1989.
No context found.
M. D. Smith, M. Johnson and M. A. Horowitz. Limits on Multiple Instruction Issue. Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems. April, 1989, pp. 290-302.