Wen-Mei Hwu and Pohua P. Chang. Achieving high instruction cache performance with an optimizing compiler. Proceedings of the 16th Annual Intl. Symposium on Computer Architecture, pages 242-251, June 1989.


This paper is cited in the following contexts:
A Stream Processor Front-end - Ramírez, Larriba-Pey, Valero (2000)

....feed even the more aggressive superscalar processor. Plus, the use of code replicating techniques [4, 9] can lead to much longer streams, and completely solve the problem of stream length. 1.2 Instruction cache misses Code reordering techniques also reduce the number of instruction cache misses [1, 2, 5, 7] by almost an order of magnitude. The same code reordering we need to enlarge our streams can also minimize the number of instruction cache misses as shown in Figure 2. [Figure 2: instruction cache misses versus instruction cache size (8 KB, 16 KB, 32 KB), original and optimized code] ....

Wen-Mei Hwu and Pohua P. Chang. Achieving high instruction cache performance with an optimizing compiler. Proceedings of the 16th Annual Intl. Symposium on Computer Architecture, pages 242-251, June 1989.


The Predictability of Libraries - Calder, Grunwald, Srivastava (1995)

....behavior, and allows comparison to earlier branch prediction studies. 1 Introduction Profile guided code optimizations have been shown to be effective by several researchers. Among these optimizations are basic block and procedure layout optimizations to improve cache and branch behavior [3, 10, 12], register allocation, and trace scheduling [5, 6, 8, 11]. The technique that all these optimizations have in common is that they use profiles from a previous run of a given program to predict the behavior of a future run of the same program. However, many researchers believe that collecting ....

....of common behavior in programs, and indicates that a range of optimizations may be applied. Certainly, dynamic branch prediction methods result in smaller mispredict rates, but statically predictable execution can be used for a number of optimizations. Pettis and Hansen [12] and Hwu and Chang [10] both examined profile optimizations to improve instruction locality. They found both basic block reordering algorithms and procedure layout algorithms effective at reducing the instruction cache miss rate. In a similar study [3] we showed that profile based basic block reordering (Branch ....

Wen-mei W. Hwu and Pohua P. Chang. Achieving high instruction cache performance with an optimizing compiler. In 16th Annual International Symposium on Computer Architecture, pages 242--251. ACM, 1989.


Software Trace Cache for Commercial Applications - Ramirez, Larriba-Pey..

....and to predict multiple branches per cycle [23, 26]. Instruction cache misses have been addressed with software and hardware techniques. Software solutions include code reordering based on procedure placement [8, 7] or basic block mapping, either procedure oriented [18] or using a global scope [11, 24]. Hardware solutions include set associative caches, hardware prefetching, victim caches and other classic techniques. Finally, the number of instructions provided by the fetch unit each cycle can also be improved with software or hardware techniques. Software solutions include trace scheduling ....

....of non consecutive basic blocks. 1.2 Related work There has been much work on code mapping algorithms to optimize the instruction cache miss rate. These works were targeted at less aggressive processors, which do not need to fetch instructions from multiple basic blocks per cycle. Hwu and Chang [11] use function inline expansion, and group into traces those basic blocks which tend to execute in sequence as observed on a profile of the code. Then, they map these traces in the cache so that the functions which are executed close to each other are placed in the same page. Pettis and Hansen [18] ....

Wen-Mei Hwu and Pohua P. Chang. Achieving high instruction cache performance with an optimizing compiler. Proceedings of the 16th Annual Intl. Symposium on Computer Architecture, pages 242-251, June 1989.


SPAID: Software Prefetching in Pointer- and.. - Lipasti, Schmidt.. (1995)   (38 citations)

....Efforts to improve instruction cache behavior of programs have their roots in methods to improve paging behavior of main memory [HG71, Fer74, Har88]. A popular area of research has been repositioning of code sections by the compiler, both at the basic block level and the procedure level [HC89, GC90, PH90, CMH91, Wu92]. Some such methods operate on the executable after link time, allowing intermingling of basic blocks from different procedures [Hei94b, Hei94a], while others take into account the branch prediction architecture of the hardware [CG94]. McFarling [McF91] has investigated ....

Wen-Mei W. Hwu and Pohua P. Chang. Achieving high instruction cache performance with an optimizing compiler. In 16th Annual International Symposium on Computer Architecture, pages 242--251, Jerusalem, May--June 1989.


Code Reordering of Decision Support Systems for.. - Ramírez.. (1998)

....and to predict multiple branches per cycle [22, 28]. Instruction cache misses have been addressed with software and hardware techniques. Software solutions include code reordering based on procedure placement [8, 7] or basic block mapping, either procedure oriented [18] or using a global scope [9, 24]. Hardware solutions include set associative caches, hardware prefetching, victim caches and other techniques. Finally, the number of instructions provided by the fetch unit each cycle can also be improved with software or hardware techniques. Software solutions include trace scheduling [5] and ....

....distributed evenly among the first SetAssoc passes. The rest of the sequences, and the basic blocks not placed in sequences are mapped as in the direct mapped case. 7 Related Work The software approach to the i cache miss rate problem has been the use of code reordering algorithms. Hwu and Chang [9] use function inline expansion, and group into traces those basic blocks which tend to execute in sequence as observed on a profile of the code. Then, they map these traces in the cache so that the functions which are executed close to each other are placed in the same page. Our approach protects ....

Wen-mei Hwu and Pohua P. Chang, Achieving High Instruction Cache Performance with an Optimizing Compiler, Proceedings of the 16th Annual Intl. Symposium on Computer Architecture, pp 242-251, June 1989.


Software Trace Cache - Ramírez, Larriba-Pey.. (1999)   (3 citations)

....blocks in consecutive memory positions and, therefore, increase the number of useful instructions fetched per access. Unfortunately, past work on code reordering techniques has largely focused on simply reducing the instruction cache miss rate [9, 11, 14, 6, 5, 8]. This approach made sense in the context of the simple, less aggressive processors for which past work was done. However, in the modern, wide issue superscalars, ensuring that sequentially executed instructions are mapped in consecutive memory positions can be more crucial than keeping the number ....

....that sequentially executed instructions are mapped in consecutive memory positions can be more crucial than keeping the number of misses low. Furthermore, those past works that focused on increasing the spatial locality of codes, often ignored trying to reduce the instruction cache miss rate [3, 9]. Finally, to our knowledge, code reordering techniques have not been applied to challenging codes like databases and very large integer SPEC applications. In this paper, we address all these issues. We present a fully automated, compile time code reordering technique that focuses on maximizing ....

[Article contains additional citation context not shown here]

Wen-mei Hwu and Pohua P. Chang. Achieving high instruction cache performance with an optimizing compiler. Proceedings of the 16th Annual Intl. Symposium on Computer Architecture, pages 242--251, June 1989.


Red blue traces: Trace cache redundancy - Ramírez, Larriba-Pey.. (1999)

....techniques define a logical ordering of basic blocks and reorder instructions in those logical groups crossing the basic block boundary to optimize instruction scheduling for VLIW processors. There are also other code reordering algorithms which target an increase in the instruction cache hit rate [7, 11, 16]. In order to exploit more spatial locality, these techniques also increase the code sequentiality, but they target a single basic block fetch per cycle. The full potential of these techniques has not been exploited. The software trace cache [12, 13, 14] is the first code reordering targeting the ....

Wen-Mei Hwu and Pohua P. Chang. Achieving high instruction cache performance with an optimizing compiler. Proceedings of the 16th Annual Intl. Symposium on Computer Architecture, pages 242--251, June 1989.


The Predictability of Branches in Libraries - Calder, Grunwald, Srivastava (1995)   (14 citations)

....It replaces Technical Note TN 50, an earlier version of the same material. 1 Introduction Profile guided code optimizations have been shown to be effective by several researchers. Among these optimizations are basic block and procedure layout optimizations to improve cache and branch behavior [3, 10, 12], register allocation, and trace scheduling [5, 6, 11, 7]. The technique that all these optimizations have in common is that they use profiles from a previous run of a given program to predict the behavior of a future run of the same program. However, many researchers believe that collecting ....

....of common behavior in programs, and indicates that a range of optimizations may be applied. Certainly, dynamic branch prediction methods result in smaller mispredict rates, but statically predictable execution can be used for a number of optimizations. Pettis and Hansen [12] and Hwu and Chang [10] both examined profile optimizations to improve instruction locality. They found both basic block reordering algorithms and procedure layout algorithms effective at reducing the instruction cache miss rate. In a similar study [3] we showed that profile based basic block reordering (Branch ....

Wen-mei W. Hwu and Pohua P. Chang. Achieving high instruction cache performance with an optimizing compiler. In 16th Annual International Symposium on Computer Architecture, pages 242--251. ACM, 1989.


Quantifying Behavioral Differences Between C and C++ Programs - Calder, Grunwald, Zorn (1995)   (47 citations)

....function. If function calls can increase cache misses, the behavior of C functions can lead to fewer conflicts. Table 4.2 also shows that C++ can possibly benefit more than C programs from basic block reordering and procedure layout algorithms to eliminate cache conflicts. These algorithms [14, 40, 41] have been shown to efficiently reduce the number of instruction cache misses and improve branch prediction. 4.9.2 Data Cache The data (D) cache miss rate is a measure of the locality of reference of a program's access to data in the stack, static data segment, and heap. Table 18 shows the data ....

....the memory allocator for the application [36] will be more effective for C++ programs. The negligible difference in data cache performance shown in §4.9.2 implies that specific C++ optimizations for data cache locality may not be necessary. By comparison, optimizations for instruction caches [40, 41, 50] and possibly virtual memory systems [51, 52, 53, 54] will be more important for C++ programs than for C programs. Our data also indicates that link time optimizations, such as those proposed by Wall [55] and others will become more important. Object oriented languages, such as C++, allow ....

Wen-mei W. Hwu and Pohua P. Chang. Achieving high instruction cache performance with an optimizing compiler. In 16th Annual International Symposium on Computer Architecture, SIGARCH Newsletter, pages 242--251. ACM, 1989.


Next Cache Line and Set Prediction - Calder (1995)   (26 citations)

....miss rate is lowered, there is an increased probability that a cache line will still be resident when a NLS predictor is used. The BTB architecture will not benefit from the lower cache miss rate, and there is no change in the BEP for varying cache configurations. Whole program restructuring [8, 4, 14] is one technique that can be used to reduce the instruction cache miss rate at no additional architectural cost. Why does the NLS architecture have significantly better BEP performance than the BTB for some programs, such as gcc, cfront and groff, but only slightly better or comparable ....

Wen-mei W. Hwu and Pohua P. Chang. Achieving high instruction cache performance with an optimizing compiler. In 16th Annual International Symposium on Computer Architecture, pages 242--251. ACM, 1989.


Evidence-based Static Branch Prediction using Machine.. - Calder, Grunwald.. (1996)   (22 citations)

....process of correctly predicting whether branches will be taken or not before they are actually executed. Branch prediction is important, both for computer architectures and compilers. Compilers rely on branch prediction and execution estimation to implement optimizations such as trace scheduling [14, 13, 15] and other profile based optimizations [9, 10]. Wide issue computer architectures rely on predictable control flow, and failure to correctly predict a branch results in delays for fetching and decoding the instructions along the incorrect path of execution. The penalty for a mispredicted branch ....

Wen-mei W. Hwu and Pohua P. Chang. Achieving high instruction cache performance with an optimizing compiler. In 16th Annual International Symposium on Computer Architecture, pages 242--251. ACM, 1989.


Quantifying Behavioral Differences Between C and C++ Programs - Calder (1994)   (47 citations)

....the memory allocator for the application [18] will be more effective for C++ programs. The negligible difference in data cache performance shown in §4.7.2 implies that specific C++ optimizations for data cache locality are not necessary. By comparison, optimizations for instruction caches [35, 37] and possibly virtual memory systems [1, 3, 16, 20, 21] will be more important for C++ programs than for C programs. One of the most notable observations from the programs we instrumented is that C++ programs have deeper call stacks, with more variation in the call depth stack depth, than C ....

Wen-mei W. Hwu and Pohua P. Chang. Achieving high instruction cache performance with an optimizing compiler. In 16th Annual International Symposium on Computer Architecture, SIGARCH Newsletter, pages 242--251. ACM, 1989.


The Precomputed Branch Architecture - Calder, Grunwald (1999)

....calls will span branch spaces. Therefore, all intra procedural branches will be compiled as pre computed branches. There are myriad ways to partition programs, and a number of alternatives have been examined in the effort to reduce page faults [1, 2, 15, 13] and instruction cache conflicts [16, 25, 30]. The goals of our study are different than these other studies; we are more interested in reducing the number of indirect jumps than reducing cache conflicts and paging. None the less, the best performing algorithm we examined (MaxCut) for code partitioning is very similar to the greedy layout ....

....branch spaces. A similar technique is used in the Depth First Profile method; a depth first search orders the nodes, always visiting the out going edge with the highest call frequency. This Depth First Profile algorithm is very similar to the procedure layout algorithm proposed by Hwu and Chang [25]. The MaxCut partitioning uses a greedy max cut algorithm to partition the graph using the call frequency to guide the partitioning. This algorithm is very similar to the greedy approach for procedure mapping proposed by Pettis and Hansen [30]. The MaxCut partitioning algorithm processes the edges ....
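
The depth first ordering described in the excerpt above can be illustrated with a minimal sketch. It is not the code of Calder and Grunwald or of Hwu and Chang; the function name `depth_first_profile_order`, the `(caller, callee) -> frequency` input format, and the example call graph are all assumptions made for illustration. It only shows the idea of a depth first search that always follows the outgoing call edge with the highest profiled frequency first.

```python
from collections import defaultdict


def depth_first_profile_order(call_counts, root):
    """Return procedures in depth-first order, heaviest call edge first.

    call_counts maps (caller, callee) -> profiled call frequency.
    This input format is a hypothetical choice for the sketch.
    """
    successors = defaultdict(list)
    for (caller, callee), freq in call_counts.items():
        successors[caller].append((freq, callee))

    order = []
    seen = set()

    def visit(proc):
        seen.add(proc)
        order.append(proc)
        # Highest frequency first; ties broken by name so the result is deterministic.
        for _freq, callee in sorted(successors[proc], key=lambda e: (-e[0], e[1])):
            if callee not in seen:
                visit(callee)

    visit(root)
    return order


# Hypothetical profile: edge weights are call counts from a previous run.
calls = {
    ("main", "parse"): 10,
    ("main", "eval"): 90,
    ("eval", "apply"): 80,
    ("eval", "lookup"): 40,
    ("parse", "lookup"): 5,
}
layout = depth_first_profile_order(calls, "main")
print(layout)
```

In this sketch the resulting list would be used as the order in which procedures are laid out, so that frequently called callees sit right after their callers. Procedures unreachable from the chosen root are not placed.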

[Article contains additional citation context not shown here]

Wen-mei W. Hwu and Pohua P. Chang. Achieving high instruction cache performance with an optimizing compiler. In 16th Annual International Symposium on Computer Architecture, pages 242--251. ACM, 1989.


Evidence-based Static Branch Prediction using Machine.. - Calder, Grunwald.. (1997)   (22 citations)

....process of correctly predicting whether branches will be taken or not before they are actually executed. Branch prediction is important, both for computer architectures and compilers. Compilers rely on branch prediction and execution estimation to implement optimizations such as trace scheduling [15, 14, 17] and other profile based optimizations [10, 11]. Wide issue computer architectures rely on predictable control flow, and failure to correctly predict a branch results in delays for fetching and decoding the instructions along the incorrect path of execution. The penalty for a mispredicted branch ....

Wen-mei W. Hwu and Pohua P. Chang. Achieving high instruction cache performance with an optimizing compiler. In 16th Annual International Symposium on Computer Architecture, pages 242--251. ACM, 1989.


Reducing Branch Costs via Branch Alignment - Calder, Grunwald (1994)   (41 citations)

....virtual memory pages [1, 8, 11, 13, 10]. Other researchers extended this work to lower levels of the memory hierarchy, optimizing the performance of instruction caches. McFarling [15] described an algorithm to reduce instruction cache conflicts for a particular class of programs. Hwu and Chang [18] describe a more general and more effective technique using compile time analysis in the IMPACT I compiler system. Using profile based transformations, the IMPACT I compiler inlines subroutines and performs trace analysis. For each subroutine, instructions are packed using the most frequently ....

....buffer (BTB) architectures. Yet, they also only examined if then else constructs. Yeh et al. [26] commented that with trace scheduling, taken branches could only be reduced from 62% of the executed conditional branches to 50% of executed conditional branches. The earlier study by Hwu and Chang [18] showed a 58% fall through rate after branch alignment. The papers by McFarling and Hennessy, Bray and Flynn, and Pettis and Hansen did not report the change in the percentage of taken branches. The branch alignment reordering algorithm proposed by Hwu et al. is more general than McFarling's and ....

[Article contains additional citation context not shown here]

Wen-mei W. Hwu and Pohua P. Chang. Achieving high instruction cache performance with an optimizing compiler. In 16th Annual International Symposium on Computer Architecture, pages 242--251. ACM, 1989.


Corpus-based Static Branch Prediction - Calder, Grunwald, Lindsay.. (1995)   (13 citations)

....process of correctly predicting whether branches will be taken or not before they are actually executed. Branch prediction is important, both for computer architectures and compilers. Compilers rely on branch prediction and execution estimation to implement optimizations such as trace scheduling [12, 13] and other profile directed optimizations [8, 9]. Wide issue computer architectures rely on predictable control flow, and failure to correctly predict a branch results in delays for fetching and decoding the instructions along the incorrect path of execution. The penalty for a mispredicted branch ....

Wen-mei W. Hwu and Pohua P. Chang. Achieving high instruction cache performance with an optimizing compiler. In 16th Annual International Symposium on Computer Architecture, pages 242--251. ACM, 1989.
