14 citations found.
S. Carr, C. Ding, and P. Sweany. Improving software pipelining with unroll-and-jam. Proc. of 29th Hawaii International Conference on System Sciences, 1996.


This paper is cited in the following contexts:
Exploiting Program Hotspots and Code Sequentiality .. - Hu, Nadgir.. (2003)

....[Table 1: Leakage control schemes evaluated.] tiality of the code [4, 5, 10]. The sequential nature of code can be exploited to predict the next cache line that will be accessed and mask the penalty for transitioning a cache line from drowsy to active mode just in time for access. Specifically, we propose a scheme that preactivates the next cache line, JITA. The leakage ....

S. Carr, C. Ding, and P. Sweany. Improving software pipelining with unroll-and-jam. In Proc. the 29th Annual Hawaii International Conference on System Sciences, pages 183--192, Maui, HI, January 1996.


Compiler-Directed Instruction Cache Leakage Optimization - Zhang, Hu (2002)   (2 citations)

....reasons why a compiler optimization may not perform as expected. For example, selection of tile size is very critical for the effectiveness of loop tiling [10]. A wrong tile size can lead to increased execution time. Similarly, unrolling factor is a very critical parameter in loop unrolling [3]. In this work, we have used these optimizations without trying to tune their parameters. However, the picture changes. As compared to the original code, the fused code incurs much larger energy consumption (except for Strategy II). This is because at each iteration of the fused loops we need to ....

S. Carr, C. Ding, and P. Sweany. Improving software pipelining with unroll-and-jam. In Proc. the 29th Annual Hawaii International Conference on System Sciences, Maui HI, January 1996, pp. 183--192.


Comparing and Combining Read Miss Clustering and Software.. - Pai, Adve (2001)   (1 citation)

....unroll and jam for scalar replacement or locality [5]. However, that work did not seek to improve prefetching, but instead assumed that prefetching was effective given enough hardware resources. Carr et al. have used unroll and jam to improve software pipelining, without considering cache misses [6]. That study would reduce floating point stalls in the software pipelining prologue and steady state by improving floating-point parallelism, but would not lengthen the steady state. Our previous work suggested negative interactions between clustering and prefetching [23]. However, that work ....

S. Carr, C. Ding, and P. Sweany. Improving Software Pipelining with Unroll-and-Jam. In Proceedings of 29th Hawaii International Conference on System Sciences, Jan. 1996.


Exploiting Instruction-Level Parallelism for Memory System.. - Pai (2000)

.... also be used to increase floating point unit parallelism in certain loops with recurrences carried on the inner loop but not on an outer loop [CCK88]. Such use of unroll and jam can also improve the interaction of unroll and jam with software pipelining in the presence of inner loop recurrences [CDS96]. More recent work has extended the heuristics used in unroll and jam by incorporating the effects of cache misses and prefetching into the balance calculation used to determine the degree of unrolling [Car96]. However, previous work has not sought to exploit unroll and jam either to encourage ....

Steve Carr, Chen Ding, and Philip Sweany. Improving Software Pipelining with Unroll-and-Jam. In Proceedings of 29th Hawaii International Conference on System Sciences, January 1996.


Smart Register Files for High-Performance Microprocessors - Postiff, Mudge (1999)

....can significantly increase the number of registers required. Even the traditional optimizations such as copy propagation, common subexpression elimination, induction variable elimination, and code hoisting increase register pressure by adding temporary scalar values and extending scalar lifetimes [16, 36]. Furthermore, whole classes of variables are ignored in most register allocation studies (e.g. those referenced above). Global variables are not usually allocated to registers, even though they may be able to reside in a register for their entire lifetime. There are significant numbers of global ....

Chen Ding. Improving Software Pipelining with Unroll-and-Jam and Memory Reuse Analysis. Master's Thesis, Michigan Technological University, Department of Computer Science, 1996.


Combining Optimization for Cache and Instruction-Level Parallelism - Carr (1996)   (23 citations)

....is bounded by register pressure and the new model can provide no benefit. Finally, we mention the interaction of unroll and jam and software pipelining. While unroll and jam can dramatically improve the schedules produced by software pipelining, it can also greatly exacerbate register pressure [6, 10]. In this experiment, we artificially reduce the number of registers allowed to limit unroll and jam's potential negative effect on software pipelining. An important future research topic is to predict register pressure before software pipelining based upon the loop body and unroll amounts. 6. ....

C. Ding. Improving software pipelining with unroll-and-jam and memory-reuse analysis. Master's thesis, Michigan Technological University, June 1996.


Optimizing Loop Performance for Clustered VLIW Architectures - Qian, Carr, Sweany (2002)   (2 citations)  Self-citation (Carr Sweany)

No context found.

S. Carr, C. Ding, and P. Sweany. Improving software pipelining with unroll-and-jam. In Proceedings of the 29th Annual Hawaii International Conference on System Sciences, Maui, HI, January 1996.


Compiler Blockability of Dense Matrix Factorizations - Carr, Lehoucq (1997)   (13 citations)  Self-citation (Carr)

....to generate highly parallel code [27, 33]. However, software pipelining requires a lot of registers to be successful. In our code, we performed unroll and jam to improve cache performance. However, unroll and jam can significantly increase register pressure and cause software pipelining to fail [7]. On our version of LU decomposition, the HP compiler diagnostics reveal that software pipelining failed on the main computational loop due to high register pressure. Given that the hand optimized version is highly software pipelined, the result would be a highly parallel hand optimized loop and a ....

S. Carr, C. Ding, and P. Sweany. Improving software pipelining with unroll-and-jam. In Proceedings of the 29th Annual Hawaii International Conference on System Sciences, Maui, HI, January 1996.


Improving Software Pipelining with Unroll-and-Jam and Memory Reuse .. - Ding (1996)   (8 citations)  Self-citation (Ding)

....[Figure 6.1: Experimental Method] 6.2 Improvement Obtained by Using Unroll and Jam. This section presents the result of combining unroll and jam with software pipelining. A similar experiment has been done previously [8]. The results of the previous experiment were not complete for the following reasons. First, the array analysis result was added into the compiler by hand, so it is not completely accurate. Second, the implementation of modulo variable expansion is not as sophisticated as the current ....

....experiment. Moreover, in the previous experiment, we did not perform register assignment and code generation. These shortcomings have been addressed in the latest experiment. Due to the above reasons, the results obtained in this experiment are more complete than previous results described in [8]. 6.2.1 Machine Model. For our target architecture we chose a machine with four integer units and two floating-point units. Only one of the integer units can be used for memory operations. Each integer operation has a latency of two cycles while each floating point operation has a latency of four ....

S. Carr, C. Ding, and P. Sweany. Improving Software Pipelining With Unroll-and-Jam. In 28th Hawaii International Conference on System Sciences, 1996.


Combining Optimization for Cache and Instruction-Level Parallelism - Carr (1996)   (23 citations)  Self-citation (Carr)

....dramatically. Not only do compilers need to be concerned with finding ILP to utilize machine resources effectively, but they also need to be concerned with ensuring that the resulting code has a high degree of cache locality. Previous work has concentrated either on improving ILP in nested loops [3, 6, 7, 14, 16, 17] or on improving cache performance [9, 15, 18]. This paper presents a performance metric that can be used to guide the optimization of nested loops considering the combined effects of ILP, data reuse and latency hiding techniques. We have implemented the technique in a source to source ....

....resources provided by a machine [3, 6]. Second, memory speeds have not kept pace with microprocessor power. Although hardware designers have introduced multiple levels of cache to reduce latency, poor cache performance leaves many scientific applications bound by main memory access time. Previous work has separately addressed both the ....

[Article contains additional citation context not shown here]

S. Carr, C. Ding, and P. Sweany. Improving software pipelining with unroll-and-jam. In Proceedings of the 29th Annual Hawaii International Conference on System Sciences, Maui, HI, January 1996.


Compiler Blockability of Dense Matrix Factorizations - Carr, Lehoucq (1996)   (13 citations)  Self-citation (Carr)

....to generate highly parallel code [24, 31]. However, software pipelining requires a lot of registers to be successful. In our code, we performed unroll and jam to improve cache performance. However, unroll and jam can significantly increase register pressure and cause software pipelining to fail [6]. On our version of LU decomposition, the HP compiler diagnostics reveal that software pipelining failed on the main computational loop due to high register pressure. Given that the hand optimized version is highly software pipelined, the result would be a highly parallel hand optimized loop and a ....

S. Carr, C. Ding, and P. Sweany. Improving software pipelining with unroll-and-jam. In Proceedings of the 29th Annual Hawaii International Conference on System Sciences, Maui, HI, January 1996.


Compiler Blockability of Dense Matrix Factorizations - Carr, Lehoucq (1996)   (13 citations)  Self-citation (Carr)

....to generate highly parallel code [25, 32]. However, software pipelining requires a lot of registers to be successful. In our code, we performed unroll and jam to improve cache performance. However, unroll and jam can significantly increase register pressure and cause software pipelining to fail [6]. On our version of LU decomposition, the HP compiler diagnostics reveal that software pipelining failed on the main computational loop because of high register pressure. Given that the hand optimized version is highly software pipelined, the result would be a highly parallel hand optimized loop ....

S. Carr, C. Ding, and P. Sweany. Improving software pipelining with unroll-and-jam. In Proceedings of the 29th Annual Hawaii International Conference on System Sciences, Maui, HI, January 1996.


Analytic Models and Empirical Search: A Hybrid.. - Epshteyn.. (2005)

No context found.

S. Carr, C. Ding, and P. Sweany. Improving software pipelining with unroll-and-jam. Proc. of 29th Hawaii International Conference on System Sciences, 1996.


Low-cost Register-pressure Prediction for Scalar Replacement.. - Ma, Carr, Ge

No context found.

C. Ding. Improving software pipelining with unroll-and-jam and memory-reuse analysis. Master's thesis, Michigan Technological University, 1996.


CiteSeer.IST - Copyright Penn State and NEC