W. Abu-Sufah, D. J. Kuck, and D. H. Lawrie. On the Performance Enhancement of Paging Systems Through Program Analysis and Transformations. IEEE Trans. on Computers, C-30(5):341--356, May 1981.
....per iteration, C, is likely to decrease with more aggressive processor architectures. Both hardware trends increase the prefetch distance. Additionally, software that more aggressively uses locality transformations such as tiling sees shorter inner loops with each inner loop initiated more times [1, 25, 32]. These hardware and software trends increase the impact of prologue late prefetches, short steady states, and hard-to-prefetch references, all of which can be addressed by read miss clustering. On the other hand, we expect the impact of prefetching instruction overhead to be less important as ....
W. Abu-Sufah, D. J. Kuck, and D. H. Lawrie. On the Performance Enhancement of Paging Systems Through Program Analysis and Transformations. IEEE Trans. on Computers, C-30(5):341--356, May 1981.
....per iteration, C, is likely to decrease with more aggressive processor architectures. Both hardware trends increase the prefetch distance. Additionally, software that more aggressively uses locality transformations such as tiling sees shorter inner loops with each inner loop initiated more times [AKL81, Por89, WL91]. These hardware and software trends increase the impact of prologue late prefetches, short steady states, and hard-to-prefetch references, all of which can be addressed by read miss clustering. 5.2.2 Addressing a Limitation of Clustered Prefetching. Unroll-and-jam produces a ....
Walid Abu-Sufah, David J. Kuck, and Duncan H. Lawrie. On the Performance Enhancement of Paging Systems Through Program Analysis and Transformations. IEEE Transactions on Computers, C-30(5):341--356, May 1981.
....is based on iteration space transformations. McKellar and Coffman [42] performed one of the first studies on program transformations for locality. They showed that by using submatrix operations it is possible to obtain impressive speedups over the original matrix codes. Later, Abu-Sufah et al. [1] focused on automating page-locality-improving techniques within a compilation framework, and discussed a transformation technique called vertical distribution, which is very similar to tiling. In his dissertation, Porterfield [46] uses loop transformation techniques such as skewing and tiling. ....
W. Abu-Sufah, D. Kuck, and D. Lawrie. On the performance enhancement of paging systems through program analysis and transformations. IEEE Trans. on Computers, C-30(5):341--356, 1981.
....flavor. Thus the general idea is similar, although the analysis used is different. Loop fission has been used to optimize FORTRAN programs. Loop fission breaks a single loop into several smaller loops to improve the locality of data reference. This can improve paging performance dramatically [ABU81]. Our transformations serve a similar function, breaking a large loop into several small ones to enable database-style optimization. 2.2. RELATED PARALLELIZATION WORK Our parallelization work uses both transformations and pointer-based join techniques. We discuss the related work for each in ....
Walid Abu-Sufah, David J. Kuck, and Duncan H. Lawrie. On the Performance Enhancement of Paging Systems Through Program Analysis and Transformations. IEEE Trans. on Computers C-30,5 (May 1981), 341-355.
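The loop fission mentioned in the preceding citation context can be illustrated with a small sketch. This is only an illustration under stated assumptions: the function names and the four lists a, b, c, d are hypothetical and do not come from the cited paper or from the citing document.

```python
def fused_loop(a, b, c, d):
    """One loop whose body touches four arrays in every iteration."""
    n = len(a)
    for i in range(n):
        a[i] = b[i] + 1   # statement S1 uses only a and b
        c[i] = d[i] * 2   # statement S2 uses only c and d
    return a, c


def fissioned_loops(a, b, c, d):
    """The same computation after loop fission (loop distribution).

    S1 and S2 have no dependence on each other, so each gets its own loop.
    Each loop now sweeps two arrays instead of four, which shrinks the set
    of pages that must be resident at any one time.
    """
    n = len(a)
    for i in range(n):
        a[i] = b[i] + 1
    for i in range(n):
        c[i] = d[i] * 2
    return a, c
```

Fission is only legal when splitting the loop does not reverse a dependence between the separated statements; the two statements here are independent, so both versions produce the same result.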
.... policies [2, 11] and (2) techniques which consider reshaping the data reference patterns in order to exploit the given hardware facilities and system software [23, 32]. The latter group then paved the way for automatic program restructuring techniques like loop distribution and page indexing [1]. In general, these techniques apply to already written programs and consist of rearranging the code and data to make the program's access pattern more local. Another system-software-based technique is built upon the file systems and run-time libraries. Several approaches considered extending the ....
....in the sense that once the I/O time is reduced by our optimization, the remaining I/O time can be hidden by prefetching. There have been a few papers on out-of-core compilation. Some of them consider optimizing the performance of virtual memory (VM). The most notable work is from Abu-Sufah et al. [1], which deals with optimizations to enhance the locality properties of programs in a VM environment. Among the program transformations used are loop fusion, loop distribution, and tiling (page indexing). It should be emphasized that in principle, our file layout determination scheme can be applied ....
W. Abu-Sufah, D. Kuck, and D. Lawrie. On the Performance Enhancement of Paging Systems Through Program Analysis and Transformations. IEEE Transactions on Computers, C-30(5):341--356, May 1981.
....our optimization, the remaining I/O time can be hidden by prefetching. There have been a few papers on out-of-core compilation. The approaches can be divided into two groups: The first group considers optimizing the performance of virtual memory (VM). The most notable work is from Abu-Sufah et al. [1], which deals with optimizations to enhance the locality properties of programs in a VM environment. Among the program transformations used are loop fusion, loop distribution, and tiling (page indexing). In principle, our disk layout determination scheme can be applied for optimizing the ....
W. Abu-Sufah. On the Performance Enhancement of Paging Systems Through Program Analysis and Transformations, IEEE Transactions on Computers, C-30(5):341--355, 1981.
....[Figure 3: Effect of reuse-driven execution. Benchmark: NAS SP, 28x28x28. Axis label: reuse distance (log scale, base 2). Series: program order, reuse-based fusion, reuse-driven execution.] [Figure 4: Examples of loop fusion. (a) Fusion by statement embedding, loop alignment, and loop splitting. (b) Example of loops that cannot be fused.] can be considered by adding loop ....
W. Abu-Sufah, D. Kuck, and D. Lawrie. On the performance enhancement of paging systems through program analysis and transformations. IEEE Transactions on Computers, C-30(5):341--356, May 1981.
....matches the capabilities of the cache hardware. Such improvements have primarily been limited to scientific code with predictable control flow and regular memory access patterns, due to the ease with which rudimentary loop transformations can dramatically improve temporal and spatial locality [40,41]. Explicit prefetching in advance of memory references with poor or no locality has also been examined extensively in this context, both with [21,22] and without additional hardware support [23,24]. Dynamic hardware techniques for controlling cache memory allocation that significantly reduce ....
W. Abu-Sufah, D. J. Kuck, and D. H. Lawrie. "On the performance enhancement of paging systems through program analysis and transformations." IEEE Transactions on Computers, C-30(5):341--356, May 1981.
....in loops [MPS94]. Improvements to the data cache performance of programs have primarily been limited to scientific code that operates on loops. Many of these investigations focus on analysis of data utilization to guide program transformations, particularly on loops, to improve data locality [ASKL81, GJG88, CK89, FST91, LRW91, WL91, KM93, CMT94]. With the advent of prefetch instructions, other researchers have sought ways for compilers to intelligently insert such instructions to improve data cache performance. Most such techniques do not require additional hardware support [Por89, CKP91, FP91, ....
Walid Abu-Sufah, David J. Kuck, and Duncan H. Lawrie. On the performance enhancement of paging systems through program analysis and transformations. IEEE Transactions on Computers, C-30(5):341--356, May 1981.
....flavor. Thus the general idea is similar, although the analysis used is different. Loop fission has been used to optimize FORTRAN programs. Loop fission breaks a single loop into several smaller loops to improve the locality of data reference. This can improve paging performance dramatically [ABU81]. Our transformations serve a similar function, breaking a large loop into several small ones to enable database-style optimization. 3 Introduction to Self-commutativity Before examining the different transformation strategies that we have developed, we first examine when a simple group-by loop ....
Walid Abu-Sufah, David J. Kuck, and Duncan H. Lawrie. On the Performance Enhancement of Paging Systems Through Program Analysis and Transformations. IEEE Trans. on Computers C-30,5 (May 1981), 341-355.
....memory management has focused on increasing temporal and spatial reuse in regular applications. Cache and register blocking techniques group computations on data tiles to enhance temporal reuse [5, 22]. Various loop reordering schemes seek to arrange stride-one data access to maximize spatial reuse [1, 12, 17]. Data transformations can often be used to effect spatial reuse when computation transformation is insufficient or illegal [8]. None of these strategies, however, works well with dynamic and irregular computations because the unpredictable nature of data reuse prevents effective static analysis. ....
....processor to remap memory. However, compiler analysis similar to ours is necessary to effectively control such hardware features. The goal of improving data reuse has been pursued for regular applications by loop and data transformations such as cache blocking [5, 22], memory-order loop permutation [1, 12, 17], and data reshaping [8, 3, 15]. However, static loop and data transformations developed for regular applications cannot optimize dynamic computations where the data access pattern remains unknown until run time and changes during the computation. Various static data placement schemes have been ....
W. Abu-Sufah, D. Kuck, and D. Lawrie. On the performance enhancement of paging systems through program analysis and transformations. IEEE Transactions on Computers, C-30(5):341--356, May 1981.
....earlier, using the loop bounds and index variables, the compiler can determine which array requires more I/O accesses and accordingly allocate the available memory. 5 Related Work Abu-Sufah first investigated strategies for improving the performance of Fortran programs in a virtual memory environment [ASKL81]. Compiler transformations such as tiling, strip mining, loop interchange, and loop skewing are proposed by Wolfe [Wol89b]. Transformations like unroll-and-jam and scalar replacement are proposed by Carr [Car93]. Callahan studies the problem of register allocation [CCK90]. The notion reference window ....
W. Abu-Sufah, David Kuck, and D. Lawrie. On the performance enhancement of paging systems through program analysis and program transformation. IEEE Transactions on Computers, C-30(5):341--356, May 1981.
....To achieve the best performance in out-of-core computations, both data (layout) and control (loop) transformations are necessary; and parallelism and locality should be handled in a unified way. We note that our approach is based on explicit file I/O and is different from those presented in [1, 17, 19, 27]. 3 Algorithm for Optimizing Locality in Files In this section we present an algorithm based on explicit I/O to reduce the time spent in I/O. Our algorithm automatically transforms a given loop nest to exploit spatial locality in files, assigns appropriate file layouts for out-of-core arrays, ....
....message-passing systems. It should be noted that our algorithms are general in the sense that they can be incorporated into any out-of-core compilation framework for parallel and sequential machines. Previous work considers optimizing the performance of virtual memory (VM). Abu-Sufah et al. [1] dealt with optimizations to enhance the locality properties of programs in a VM environment. In principle, our file layout determination scheme can be applied for optimizing the performance of the VM as well (by changing tile sizes to take the page size into account). But, we believe that the ....
W. Abu-Sufah. On the performance enhancement of paging systems through program analysis and transformations. IEEE Transactions on Computers, C-30(5), pages 341--355, May 1981.
....will be more effective for C programs. The negligible difference in data cache performance shown in §4.9.2 implies that specific C++ optimizations for data cache locality may not be necessary. By comparison, optimizations for instruction caches [40, 41, 50] and possibly virtual memory systems [51, 52, 53, 54] will be more important for C++ programs than for C programs. Our data also indicates that link-time optimizations, such as those proposed by Wall [55] and others, will become more important. Object-oriented languages, such as C++, allow programmers to extend the class hierarchy without affecting ....
W. A. Abu-Sufah, D. J. Kuck, and D. H. Lawrie. On the performance enhancement of paging systems through program analysis and transformation. IEEE Transactions on Computers, C-30(5):341--356, May 1981.
....compiling perfectly nested loops for distributed-memory message-passing machines is addressed. In [2] a solution to the problem of determining loop and data partitions automatically for programs with multiple loops and arrays is presented. Our work also bears similarity to that of Abu-Sufah et al. [1], which deals with optimizations to enhance the locality properties of programs in a virtual memory environment. Among the program transformations used by them are loop fusion, loop distribution, and tiling (page indexing). We, instead, do not assume the existence of virtual memory; but sensing the ....
W.Abu-Sufah. On the Performance Enhancement of Paging Systems Through Program Analysis and Transformations, IEEE Transactions on Computers, C30 (5), pages 341-355, May 1981.
....some basic concepts and assumptions. After that, the solutions to the thrashing problem are given, and finally the experimental results conducted on an SGI multiprocessor system are shown. 2 Related Work Extensive research has been reported in the literature regarding efficient memory hierarchy [2, 6, 7, 10, 11, 14, 15, 28, 33, 34]. Abu-Sufah, Kuck and Lawrie used loop blocking to improve the paging performance by improving locality of references [2]. Wolfe proposed iteration space tiling as a way to improve data reuse in cache or local memories [34]. Gallivan, Jalby and Gannon defined a reference window for a dependence as ....
....multiprocessor system are shown. 2 Related Work Extensive research has been reported in the literature regarding efficient memory hierarchy [2, 6, 7, 10, 11, 14, 15, 28, 33, 34]. Abu-Sufah, Kuck and Lawrie used loop blocking to improve the paging performance by improving locality of references [2]. Wolfe proposed iteration space tiling as a way to improve data reuse in cache or local memories [34]. Gallivan, Jalby and Gannon defined a reference window for a dependence as the variables referenced by both the source and the sink of the dependence [14, 15]. After executing the source of the ....
W. Abu-Sufah, D. Kuck, and D. Lawrie. On the performance enhancement of paging systems through program analysis and transformations. IEEE Transactions on Computers, C-30(5), May 1981.
....the copy is amortized by exploiting the temporal locality. However if there was no page thrashing and no false sharing on array A in the original loop, there is no gain in using this transformation. When applying this kind of optimization, the size of temporaries must be limited. These techniques [63, 4, 41, 3, 23, 64, 55, 16, 44] are well known but usually targeted for hardware cache or local memory. Most of these techniques should be revisited to take into account the characteristics of shared virtual memory and in particular the false sharing phenomena. Barrier Removal When programming with a shared memory model ....
Abu-Sufah W., Kuck D., and Lawrie D. On the performance enhancement of paging systems through program analysis and transformations. IEEE Transactions on Computers, May 1981.
....[18] will be more effective for C programs. The negligible difference in data cache performance shown in §4.7.2 implies that specific C++ optimizations for data cache locality are not necessary. By comparison, optimizations for instruction caches [35, 37] and possibly virtual memory systems [1, 3, 16, 20, 21] will be more important for C++ programs than for C programs. One of the most notable observations from the programs we instrumented is that C++ programs have deeper call stacks, with more variation in the call depth stack depth, than C programs. Procedure activation and calling conventions are at ....
W. A. Abu-Sufah, D. J. Kuck, and D. H. Lawrie. On the performance enhancement of paging systems through program analysis and transformation. IEEE Transactions on Computers, C-30(5):341--356, May 1981.
.... of size M × N (assuming normal row-major array storage), the stride of the references to the arrays in the inner loop will be N (the distance in memory words between successive accesses to the same array); if N is very large (larger than the page size), this could cause virtual memory thrashing [AKL81]. However, most compilers proceed without considering this effect. For a vector computer, the choice is no longer so obvious. The inner loop here cannot be vectorized, due to the dependence self-cycle S3 δ S3. Also, vector computers are much more sensitive to the stride of memory ....
W. A. Abu-Sufah, D. J. Kuck and D. H. Lawrie, On the Performance Enhancement of Paging Systems Through Program Analysis and Transformations, IEEE Trans. on Computers C-30, 5 (May 1981), 341-356.
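The stride effect described in the preceding context (an inner loop that walks down a column of a row-major M × N array touches memory N words apart) can be made concrete with a small sketch. It is a simplified model, not taken from the cited paper: the helper names, the `inner_index` parameter, and the page size used in the example are all assumptions chosen for illustration.

```python
def access_offsets(m, n, inner_index):
    """Return the flat row-major offsets touched by a doubly nested loop.

    inner_index="i": the inner loop varies the row index, so successive
    accesses are n words apart (stride n).
    inner_index="j": the inner loop varies the column index (stride 1),
    i.e. the nest after loop interchange.
    """
    if inner_index == "i":
        return [i * n + j for j in range(n) for i in range(m)]
    if inner_index == "j":
        return [i * n + j for i in range(m) for j in range(n)]
    raise ValueError("inner_index must be 'i' or 'j'")


def page_switches(offsets, page_size):
    """Count consecutive accesses that land on different pages."""
    pages = [off // page_size for off in offsets]
    return sum(1 for prev, cur in zip(pages, pages[1:]) if prev != cur)
```

For a 4 × 8 array with a hypothetical page size of 8 words, `page_switches(access_offsets(4, 8, "i"), 8)` moves to a different page on every access, while the interchanged order `"j"` changes page only when a row is exhausted. With limited resident memory, the first pattern is the one that risks thrashing.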
....are not split across branch spaces; only procedure calls will span branch spaces. Therefore, all intraprocedural branches will be compiled as precomputed branches. There are myriad ways to partition programs, and a number of alternatives have been examined in the effort to reduce page faults [1, 2, 15, 13], and instruction cache conflicts [16, 25, 30]. The goals of our study are different than these other studies; we are more interested in reducing the number of indirect jumps than reducing cache conflicts and paging. Nonetheless, the best performing algorithm we examined (MaxCut) for code ....
W. A. Abu-Sufah, D. J. Kuck, and D. H. Lawrie. On the performance enhancement of paging systems through program analysis and transformation. IEEE Transactions on Computers, C-30(5):341--356, May 1981.
....and QR decompositions and methods for partial differential equations. This approach to automatic blocking, through loop strip mining and interchange, was first advocated by Wolfe [18]. It is derived from earlier work of Abu-Sufah, Kuck, and Lawrie on optimization in a paged virtual memory system [1]. Wolfe introduced the term tiling. A tile is the collection of work to be done, i.e. the set of values of the point loop indices, for a fixed set of values of the block or outer loop indices. We like this terminology since it allows us to distinguish what we are doing, which is to decompose ....
W. A. Abu-Sufah, D. J. Kuck, and D. H. Lawrie. On the performance enhancement of paging systems through program analysis and transformations. IEEE Transactions on Computers, C-30:341--356, 1981.
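The definition of a tile in the preceding context (the set of point-loop index values for fixed values of the block loop indices) can be sketched as the iteration order produced by strip mining both loops of a two-deep nest and interchanging them. This is an illustrative sketch only; the function name and the parameters `n`, `m`, and `tile` are hypothetical and not drawn from the cited works.

```python
def tiled_iteration_order(n, m, tile):
    """Return the (i, j) visiting order of an n x m nest after tiling.

    The two outer "block" loops (ii, jj) select a tile; the two inner
    "point" loops (i, j) enumerate the work inside that tile.
    """
    order = []
    for ii in range(0, n, tile):                      # block loop over rows
        for jj in range(0, m, tile):                  # block loop over columns
            for i in range(ii, min(ii + tile, n)):    # point loop over rows
                for j in range(jj, min(jj + tile, m)):  # point loop over columns
                    order.append((i, j))
    return order
```

Every iteration of the original nest is still executed exactly once; only the order changes, so that all work on one tile finishes before the next tile is started. The `min` bounds handle the case where `tile` does not divide `n` or `m` evenly.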
....are executed by the owner processor. 3. There is a fixed distribution of array elements. (Data reorganization costs are architecture-specific.) 2.1 Related work The research on problems related to memory optimizations goes back to studies of the organization of data for paged memory systems [1]. Balasundaram and others [3] are working on interactive parallelization tools for multicomputers that provide the user with feedback on the interplay between data decomposition and task partitioning on the performance of programs. Gallivan et al. [7] discuss problems associated with automatically ....
W. Abu-Sufah, D. Kuck and D. Lawrie, "On the Performance Enhancement of Paging Systems through Program Analysis and Transformations," IEEE Trans. Computers, Vol. C-30, No. 5, pages 341--356, May 1981.
....flow dependences. 1 Introduction and Related Work Since, today, almost every processor has some kind of memory hierarchy organized into layers with different costs, compiler optimizations to reduce memory access costs are important. Tiling, one such optimization, was first used by Abu-Sufah et al. [Abu81] in order to optimize loop nests in a paging memory system. The later applications were generally on cache memories and registers [Wol89, WL91]. In [RS92] a number of loop iterations were aggregated into tiles that execute atomically without any synchronization in a distributed-memory ....
W.Abu-Sufah. On the Performance Enhancement of Paging Systems Through Program Analysis and Transformations, IEEE Transactions on Computers, C-30(5), pages 341-355, May 1981.
No context found.
Abu-Sufah W., Kuck D., and Lawrie D. On the performance enhancement of paging systems through program analysis and transformations. IEEE Transactions on Computers, May 1981.