16 citations found.
Jack L. Lo and Susan J. Eggers, "Improving Balanced Scheduling with Compiler Optimizations that Increase Instruction-Level Parallelism", SIGPLAN 1995

This paper is cited in the following contexts:
The Interaction of Architecture and Compilation.. - Adve, Berger, Eigenmann (1997)   (1 citation)

.... schedule multiple load misses within one instruction window so that they can be overlapped [PRA97]. Most compiler scheduling algorithms have assumed either that all loads will hit or that all loads will miss in the cache, although more recent techniques perform latency-sensitive load scheduling [LE95]. Nevertheless, significant challenges, such as compile-time memory disambiguation involving pointer analysis, must be addressed to exploit these techniques well. Explicit cache control: The impact of load misses can be reduced by improving the locality of accesses using iteration space ....

Jack L. Lo and Susan J. Eggers. Improving balanced scheduling with compiler optimizations that increase instruction-level parallelism. In Proceedings of the Conference on Programming Language Design and Implementation, 1995.


General-Purpose Architecture Instruction Scheduling Techniques - De Sutter (1998)

....than for the architecture. Results discussed in [48] show a speedup of 3 18 compared to traditional LS. When BLS is combined with other ILP optimizations such as loop unrolling, trace scheduling (see section 3.1.1) and cache locality analysis, speedups are achieved in the range of 15 to 40 [52]. 2.3.2 Stochastic Algorithms: Though LS is simple and generates near-optimal code, it has some disadvantages, specifically concerning the priority function: the heuristics generally are machine dependent; the priority function is not easily understood by humans, thus making it difficult to ....

Lo, J., and Eggers, S. Improving balanced scheduling with compiler optimizations that increase instruction-level parallelism. In Proceedings of the PLDI '95 conference on Programming language design and implementation (June 1995), pp. 151-162.


Exploiting Instruction-Level Parallelism for Memory System.. - Pai (2000)

....than a single instruction window (possibly because of unroll-and-jam or inner loop unrolling). In such cases, the instruction scheduler should pack independent miss references in the loop body close to each other. The technique of balanced scheduling can provide some of these benefits [KE93, LE95] but may also miss some opportunities since it does not explicitly consider window size. Nevertheless, this heuristic worked well for the code sequences we examined. More appropriate local scheduling algorithms remain the subject of future research. 4.3 Measuring the Impact of Optimizations ....

....the loop nest level, but also discusses the possible interaction between clustering and basic block scheduling. We have not yet dealt with clustered codes that are limited by basic block size and not amenable to previously understood local scheduling techniques such as balanced scheduling [KE93, LE95]. In such situations, all of the independent misses exposed by the transformation will not actually issue to the memory system together, limiting the system's latency tolerance ability. To improve latency tolerance, the instruction scheduler can reschedule independent misses to insure that ....

Jack L. Lo and Susan J. Eggers. Improving Balanced Scheduling with Compiler Optimizations that Increase Instruction-Level Parallelism. In Proceedings of the ACM SIGPLAN '95 Conference on Programming Language Design and Implementation, pages 151--162, July 1995.


Data-Specific Optimizations - Jinturkar (1996)   (1 citation)

....The review of previous literature indicates that there is no standard term for the amount a loop is unrolled. For instance, if the loop body of a rolled loop is replicated n times to get an unrolled loop, then this amount has been described as unroll depth [Dong79] of (n-1), unroll factor [Lo95, Baco94] of (n-1), unroll amount [Freu94] of (n-1), or unroll factor [Scho89] of n. Some researchers do not use any term at all [Wall93, Mahl92]. [Figure 5: A counting and a non-counting loop.] ....

....The remaining registers are available for use in applying other code improvements. 3.6.5 Instruction scheduling: Instruction scheduling attempts to reorder instructions so that the pipeline performance is improved [Henn90]. Instruction scheduling can improve the performance of an unrolled loop [Lo95]. However, the ability of an instruction scheduler to reorder instructions in an unrolled loop is limited by artificial dependencies created by naive reuse of registers by loop unrolling and other data dependencies. The artificial dependencies can be eliminated by applying register renaming ....

Lo, J. L., and Eggers, S. J., "Improving Balanced Scheduling with Compiler Optimizations that Increase Instruction-Level Parallelism", Proceedings of SIGPLAN '95 Conference on Programming Language Design and Implementation, La Jolla, CA, June 1995, pp. 151-162.


Effects of Loop Unrolling and Loop Fusion on Register.. - Dale Shires July

....of code execution speed. One way they increase performance is by taking advantage of parallelism found in algorithms. To this end, many of these systems offer multiprocessor parallelism. Furthermore, many also offer software pipelining to take full advantage of low-level, or code-level, parallelism [1]. This is parallelism actually present in the way machine instructions are dispatched. Also of paramount importance is that these machines take full advantage of their complicated memory systems. Most of the standard optimization techniques will really only provide maximum performance if the ....

J. L. Lo and S. J. Eggers. Improving balanced scheduling with compiler optimizations that increase instruction-level parallelism. In ACM SIGPLAN 1995, pages 151--162, 1995.


Code Transformations to Improve Memory Parallelism - Pai, Adve (1999)   (6 citations)

....larger than a single instruction window (possibly because of unroll-and-jam or inner loop unrolling). In such cases, the instruction scheduler should pack independent miss references in the loop body close to each other. The technique of balanced scheduling can provide some of these benefits [12, 13], but may also miss some opportunities since it does not explicitly consider window size. Nevertheless, this heuristic worked well for the code sequences we examined. More appropriate local scheduling algorithms remain the subject of future research. 4. Experimental Methodology 4.1 Evaluation ....

J. L. Lo and S. J. Eggers, "Improving Balanced Scheduling with Compiler Optimizations that Increase Instruction-Level Parallelism," in Proceedings of the ACM SIGPLAN '95 Conference on Programming Language Design and Implementation, pp. 151--162, July 1995.


Code Transformations to Improve Memory Parallelism - Pai, Adve (1999)   (6 citations)

....larger than a single instruction window (possibly because of unroll-and-jam or inner loop unrolling). In such cases, the instruction scheduler should pack independent miss references in the loop body close to each other. The technique of balanced scheduling can provide some of these benefits [9, 10], but may also miss some opportunities since it does not explicitly consider window size. Nevertheless, this heuristic worked well for the code sequences we examined. More appropriate local scheduling algorithms remain the subject of future research. 6 Processor parameters Clock rate 500 MHz ....

J. L. Lo and S. J. Eggers. Improving Balanced Scheduling with Compiler Optimizations that Increase Instruction-Level Parallelism. In Proceedings of the ACM SIGPLAN '95 Conference on Programming Language Design and Implementation, pages 151--162, July 1995.


Code Transformations to Improve Memory Parallelism - Pai, Adve (1999)   (6 citations)

....larger than a single instruction window (possibly because of unroll-and-jam or inner loop unrolling). In such cases, the instruction scheduler should pack independent miss references in the loop body close to each other. The technique of balanced scheduling can provide some of these benefits [6, 7], but may also miss some opportunities since it does not explicitly consider window size. Nevertheless, this heuristic worked well for the code sequences we examined. More appropriate local scheduling algorithms remain the subject of future research. 4. Experimental Methodology 4.1. Evaluation ....

J. L. Lo and S. J. Eggers. Improving Balanced Scheduling with Compiler Optimizations that Increase Instruction-Level Parallelism. In Proc. of the Conf. on Programming Language Design and Implementation, 1995.


Load Scheduling with Profile Information - Lindenmaier, McKinley, Temam (2000)

....profiles to sharpen constant propagation [2]. Our work is unique in that it uses information at the instruction level, and integrates it into a scheduler. Previous work on using instruction-level parallelism (ILP) to hide latencies for non-blocking caches has two major differences from this work [4, 6, 8, 10, 12]. First, previous work uses static locality analysis, which works very well for regular array accesses. Second, these schedulers only differentiate between a hit and a miss. Since we use performance counters, we can improve the schedules of pointer-based codes that compilers have difficulty ....

....[Fig. 2. Simulated number of loads and comparison of heuristics to simulation. Axis: Percentage Hit in First Level Cache; series: strict heuristic, generous heuristic.] 4.1 Balanced Scheduling: We use the Multiflow compiler [7, 11] with the Balanced Scheduling algorithm [8, 10], and additional optimizations, e.g. unrolling, to generate ILP and traces of instructions that combine basic blocks. Below we first briefly describe Balanced Scheduling and then we present our modifications. Balanced Scheduling first creates an acyclic scheduling data dependency graph (DAG) ....

J. L. Lo and S. J. Eggers. Improving balanced scheduling with compiler optimizations that increase instruction-level parallelism. In Proceedings of the SIGPLAN '95 Conference on Programming Language Design and Implementation, pages 151--162, San Diego, CA, June 1995.


Static Branch Prediction Using High-Level Language Control.. - Sokolova, Kaeli

....Prediction: Static program-based branch prediction exploits information that is obtained by static analysis of a program. The ability to predict the correct direction of the control flow stream at compilation time allows the compiler to perform various optimizations such as trace scheduling [8, 9], code reordering [10, 11, 12], inter-procedural register allocation [13], optimization and scheduling of superblocks [14] and hyperblocks [15], and improved branching [16]. Many static branch prediction techniques have been proposed [17]. Since the predicted branch outcome is determined prior to ....

J. L. Lo and S. J. Eggers. Improving Balanced Scheduling with Compiler Optimizations that Increase Instruction-Level Parallelism. In Proc. of ACM Programming Language Design and Implementation, pages 151--162, June 1995.


Code Transformations to Improve Memory Parallelism - Pai, Adve (1999)   (6 citations)

....larger than a single instruction window (possibly because of unroll-and-jam or inner loop unrolling). In such cases, the instruction scheduler should pack independent miss references in the loop body close to each other. The technique of balanced scheduling can provide some of these benefits [6, 7], but may also miss some opportunities since it does not explicitly consider window size. Nevertheless, this heuristic worked well for the code sequences we examined. More appropriate local scheduling algorithms remain the subject of future research. 4. Experimental Methodology 4.1. Evaluation ....

J. L. Lo and S. J. Eggers. Improving Balanced Scheduling with Compiler Optimizations that Increase Instruction-Level Parallelism. In Proc. of the Conf. on Programming Language Design and Implementation, 1995.


Modulo Scheduling with Cache Reuse Information - Ding, Carr, Sweany (1997)   (6 citations)

....exacerbates the already serious register proliferation problem of modulo scheduling by (perhaps unnecessarily) increasing the number of overlapped loop iterations in an attempt to hide latency. To address the problem of local instruction scheduling with uncertain latencies, Eggers and co-workers [9, 12] have suggested balanced scheduling for architectures with non-blocking caches. Balanced scheduling sets memory latencies based, not upon some architecturally predefined value, but rather based upon the number of instructions available to hide the latency of a particular load. While balanced ....

Lo, J. L., and Eggers, S. J. Improving balanced scheduling with compiler optimizations that increase instruction-level parallelism. In Conference Record of SIGPLAN Programming Language and Design Implementation (June 1995).


An Aggressive Approach to Loop Unrolling - Davidson, Jinturkar (1995)   (2 citations)

....Our review of previous literature indicates that there is no standard term for the amount a loop is unrolled. For instance, if the loop body of a rolled loop is replicated n times to get an unrolled loop, then this amount has been described as unroll depth [Dong79] of (n-1), unroll factor [Lo95, Baco94] of (n-1), unroll amount [Freu94] of (n-1), or unroll factor [Scho92] of n. Some researchers do not use any term at all [Wall92, Mahl92]. In this paper, we use the term unroll factor in the manner used by Schofield. Thus, if the loop body of a rolled loop is replicated n times to get an ....

....The remaining registers are available for use in applying other code improvements. 7.5 Instruction scheduling: Instruction scheduling attempts to reorder instructions so that the pipeline performance is improved [Henn90]. Instruction scheduling can improve the performance of an unrolled loop [Lo95]. However, if register renaming is applied after register allocation, the ability of an instruction scheduler to reorder instructions in an unrolled loop is limited by artificial dependencies created by naive reuse of registers by loop unrolling. These artificial dependencies can be eliminated by ....

Lo, J. L., and Eggers, S. J., "Improving Balanced Scheduling with Compiler Optimizations that Increase Instruction-Level Parallelism", Proceedings of SIGPLAN '95 Conference on Programming Language Design and Implementation, La Jolla, CA, June 1995, pp. 151-162.


Towards Identifying and Monitoring Optimization Impacts - Way, Pollock (1997)   (2 citations)

....the target machine's resources. The most well-known examples of this work focus on the interactions between software pipelining and register allocation [16, 17, 21, 23, 27, 32, 38, 44], instruction scheduling and register allocation [3, 5, 6, 19, 20, 30, 33, 34], instruction scheduling and cache usage [28], and scalar replacement and register allocation [8]. All have in common the goal of creating a good match between the program characteristics, such as instruction placement and register usage, and architectural features such as the availability of registers, memory access overhead, cache usage ....

Jack L. Lo and Susan J. Eggers. Improving balanced scheduling with compiler optimizations that increase instruction-level parallelism. In ACM SIGPLAN Conference on Programming Language Design and Implementation, June 1995.


Supporting Ada 95 Passive Partitions in a Distributed Environment - Mueller (1997)   (1 citation)

....by the darker shading) this local DSM view is transparent to the user. To the user, the DSM portion of the addressing space seems as a globally consistent, shared data area. For simplicity, it is assumed for now that read and write accesses are not distinguished and that sequential consistency [14] is preserved, i.e. memory accesses are globally ordered and a page is owned by one and only one node at a time. These assumptions will be lifted later. The operational model can be described as follows. Initially, each DSM page is owned by a single designated node. A page table is initialized ....

J. L. Lo and S. J. Eggers. Improving balanced scheduling with compiler optimizations that increase instruction-level parallelism. In ACM SIGPLAN Conference on Programming Language Design and Implementation, June 1995.


Analysis of Profiling Information for Cache Sensitive Scheduling - Lindenmaier (1999)

No context found.

Jack L. Lo and Susan J. Eggers, "Improving Balanced Scheduling with Compiler Optimizations that Increase Instruction-Level Parallelism", SIGPLAN 1995
