Vijay S. Pai and Sarita Adve. Code transformations to improve memory parallelism. In Proceedings of the 32nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-32, pages 147--155, November 1999.


This paper is cited in the following contexts:
Effective Compile-Time Analysis for Data Prefetching in Java - Cahoon (2002)   (Correct)

....reduces the number of conflict misses compared to the longer prefetch distance. 5.2.5 Case Study: Matrix Multiplication In this section, we examine the effects of loop transformations and additional analyses on performance by applying loop unrolling, software pipelining, and read miss clustering [82] on matrix multiplication. Figure 5.7 presents results for four versions of matrix multiplication with different code and data transformations on an out of order and an in order processor. We provide results for each version with and without prefetching. We normalize all times to original, the ....

....apply Mowry et al.'s [79] prefetch algorithm to matrix multiplication in locality unroll pipe. We present results for loop unrolling only in unroll. In cluster, we apply read miss clustering, which is a loop transformation that improves performance by increasing parallelism in the memory system [82]. Transforming locality unroll pipe requires several steps. We unroll the innermost loop four times to generate a single prefetch instruction for an entire cache line. We perform software pipelining on the innermost loop to begin prefetching the array data prior to the loop. We generate a ....

[Article contains additional citation context not shown here]

Vijay S. Pai and Sarita Adve. Code transformations to improve memory parallelism. In Proceedings of the 32nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-32, pages 147--155, November 1999.
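
The unrolling and software-pipelining steps described in the Cahoon excerpt above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the cited algorithm: the cache line is assumed to hold four array elements (LINE), the prefetch distance `ahead` is a hypothetical parameter, and `prefetch` is a stand-in that merely records which line would be requested.

```python
LINE = 4  # assumed number of array elements per cache line


def row_sum_unrolled(a, ahead=2):
    """Sum a list with a 4x-unrolled loop and one simulated prefetch per line.

    `ahead` (hypothetical) is the prefetch distance in cache lines.
    Returns (total, prefetched), where `prefetched` lists the line-start
    indices for which a prefetch was recorded.
    """
    n = len(a)
    prefetched = []

    def prefetch(i):
        # Stand-in for a prefetch instruction: it only records the request.
        if i < n:
            prefetched.append(i)

    # Software-pipelining prologue: request the first lines before the loop.
    for line in range(ahead):
        prefetch(line * LINE)

    total = 0
    i = 0
    # Unrolled body: four elements (one assumed cache line) per iteration,
    # so a single prefetch covers the whole line `ahead` lines in front.
    while i + LINE <= n:
        prefetch(i + ahead * LINE)
        total += a[i] + a[i + 1] + a[i + 2] + a[i + 3]
        i += LINE

    # Remainder loop for lengths that are not a multiple of four.
    while i < n:
        total += a[i]
        i += 1
    return total, prefetched
```

In this sketch each assumed cache line is requested exactly once, either in the prologue or by the unrolled body, which is the point of unrolling by the line size.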


Improving Memory Hierarchy Performance for Irregular .. - Mellor-Crummey.. (2001)   (6 citations)  (Correct)

....techniques are more broadly applicable. Our colleagues have recently also applied space filling curve based reorderings to improve the parallel efficiency of shared memory and software distributed shared memory computations by improving data locality, which reduces communication and false sharing [36, 37]. Our experiences show that good data and computation orders can be achieved for irregular problems using dynamic reorderings, and that the gain in locality from using good data and computation orders can be dramatic. 7. Acknowledgements Discussions with Vikram Adve and Rob Fowler helped shape ....

V. Pai and S. Adve, "Code Transformations to Improve Memory Parallelism," Proceedings MICRO-32, (Nov 1999).


Multi-Chain Prefetching: Effective Exploitation of.. - Kohout, Choi, Kim, Yeung (2001)   (6 citations)  (Correct)

....create intra chain memory parallelism along the backbone, and then DBP hardware sequentially prefetches each rib. Cooperative Chain Jumping exploits inter chain memory parallelism; however, once again it does so only for backbone and rib traversals, and it requires jump pointers. Unroll and jam [12] is a software technique that exploits memory parallelism. For applications with multiple independent pointer chains, like array of lists traversals, unroll and jam initiates independent instances of the inner loop from separate outer loop iterations in order to expose multiple read misses within ....

V. S. Pai and S. Adve. Code Transformations to Improve Memory Parallelism. In Proceedings of the International Symposium on Microarchitecture, November 1999.
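
As a rough illustration of the unroll-and-jam idea in the excerpt above, the sketch below traverses an array of linked lists two chains at a time. The names (Node, make_list, sum_lists_original, sum_lists_unroll_and_jam) and the unroll factor of two are assumptions made for this example. Python itself gains no memory parallelism from this; the sketch only shows the loop structure that would place two independent pointer-chasing loads in the same inner-loop iteration.

```python
class Node:
    """Singly linked list node (hypothetical helper type)."""

    def __init__(self, value, nxt=None):
        self.value = value
        self.next = nxt


def make_list(values):
    """Build a linked list from a Python list and return its head."""
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head


def sum_lists_original(heads):
    """Baseline: walk each chain to its end before starting the next one."""
    total = 0
    for head in heads:
        node = head
        while node is not None:
            total += node.value
            node = node.next
    return total


def sum_lists_unroll_and_jam(heads):
    """Outer loop unrolled by two, with the two inner loops fused (jammed)."""
    total = 0
    i = 0
    while i + 1 < len(heads):
        p, q = heads[i], heads[i + 1]
        # Jammed inner loop: the loads of p and q are independent of each
        # other, so both could miss in the cache at the same time.
        while p is not None and q is not None:
            total += p.value + q.value
            p, q = p.next, q.next
        # Postlude: finish whichever chain is longer.
        while p is not None:
            total += p.value
            p = p.next
        while q is not None:
            total += q.value
            q = q.next
        i += 2
    # Leftover chain when the number of lists is odd.
    if i < len(heads):
        node = heads[i]
        while node is not None:
            total += node.value
            node = node.next
    return total
```

Both functions compute the same sum; only the order in which nodes are visited differs.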


Compiler Generated Multithreading to Alleviate Memory Latency - Beyls, D'Hollander (2000)   (Correct)

....processor satisfies all above requirements. Some superscalar processors satisfy the first 2 conditions, but the third one is not met. At first sight, it seems that out of order processors can continue executing independent instructions during a cache miss. However, they stall on main memory access [Pai et al. 1999, Mowry et al. 1998]. On current systems, a main memory access typically takes between 50 and 100 cycles. A current microprocessor is typically able to execute 4 instructions per cycle. To bridge the main memory access with useful computations, at least 50 × 4 = 200 independent instructions must ....

Vijay S. Pai and Sarita Adve. Code transformations to improve memory parallelism. In Proceedings of the 32nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 147-155, November 1999.
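
The arithmetic in the Beyls and D'Hollander excerpt above (memory latency in cycles multiplied by instructions issued per cycle) can be written out as a small helper. The function name is hypothetical; the figures of 50 to 100 cycles and 4 instructions per cycle come from the excerpt.

```python
def independent_instructions_needed(latency_cycles, issue_width):
    """Instructions required to keep the processor busy during one miss.

    A processor that issues `issue_width` instructions per cycle needs
    latency_cycles * issue_width independent instructions to cover a
    memory access lasting `latency_cycles` cycles.
    """
    return latency_cycles * issue_width


# Lower and upper ends of the latency range quoted in the excerpt.
low = independent_instructions_needed(50, 4)
high = independent_instructions_needed(100, 4)
```

With the excerpt's numbers this gives 200 instructions at the low end of the latency range and 400 at the high end.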


Dynamically Allocating Processor Resources between .. - Balasubramonian.. (2001)   (14 citations)  (Correct)

....be combined with the future thread to yield greater speedups. For example, adding the future thread to a base case that has a stride prefetcher results in significant speedups [3]. A software approach to tackling the problem of a single cache miss holding up the ROB is described by Pai and Adve [20]. They present a compiler algorithm that restructures code so that cache misses are clustered, thereby increasing the memory parallelism while the ROB is stalled. 5 Conclusions We have designed and evaluated a microarchitecture that dynamically allocates a portion of the processor's physical ....

V. Pai and S. Adve. Code Transformations to Improve Memory Parallelism. In Proceedings of MICRO-32, pages 147--155, 1999.


MIST: An Algorithm for Memory Miss Traffic Management. - Grun, Dutt, Nicolau (2000)   (4 citations)  (Correct)

....However, often cache misses cannot be avoided due to large data sizes, or simply the presence of data in the main memory (compulsory misses). To efficiently use the available memory bandwidth and minimize the CPU stalls, it is crucial to aggressively schedule the cache misses. Pai and Adve [19] present a technique to move cache misses closer together, allowing an out of order superscalar processor to better overlap these misses (assuming the memory system tolerates a large number of outstanding misses). Our technique is orthogonal, since we overlap cache misses with cache hits to a ....

V. Pai and S. Adve. Code transformations to improve memory parallelism. In MICRO, 1999.


Comparing and Combining Read Miss Clustering and Software.. - Pai, Adve (2001)   (1 citation)  Self-citation (Pai Adve)   (Correct)

....Grant No. CCR 9410457, CCR 9502500, CDA 9502791, and CDA 9617383, and the Texas Advanced Technology Program under Grant No. 003604 025. Sarita V. Adve is also supported by an Alfred P. Sloan Research Fellowship. within the processor's out of order instruction window (called read miss clustering) [22]. An alternate, widely used latency tolerance technique is software controlled non binding prefetching. Prefetching helps tolerate latencies by initiating (often multiple overlapping) data fetches ahead of expected demand misses [4]. On the surface, both techniques seem to target the same types of ....

.... is surprising because prefetching is widely believed to be an effective latency tolerance technique that can already exploit parallelism in the memory system (by sending multiple prefetches in parallel). The read miss clustering transformation is based on a novel adaptation of unroll and jam [22]. Specifically, it extends unroll and jam by mapping memory parallelism in a modern ILP system to the previously studied problem of floating point pipelining [2, 3, 20]. The new transformation aims to cluster multiple expected read misses together within the same instruction window of an ....

[Article contains additional citation context not shown here]

V. S. Pai and S. Adve. Code Transformations to Improve Memory Parallelism. In Proc. of the 32nd Annual Int'l Symposium on Microarchitecture, pages 147--155, Nov. 1999.


Exploiting Instruction-Level Parallelism for Memory System.. - Pai (2000)   Self-citation (Pai Adve)   (Correct)

....processors. 1.2.2 ILP Specific Code Transformations to Improve Multiprocessor Performance The second contribution of this dissertation is to propose software code transformations to improve read miss clustering for systems with out of order processors, while preserving cache locality [PA99, PA00]. We exploit code transformations already known and implemented in compilers for other purposes, providing the analysis needed to relate them to read miss clustering. The key transformation we use is unroll and jam, which was originally proposed for improving floating point pipelining and for ....

Vijay S. Pai and Sarita Adve. Code Transformations to Improve Memory Parallelism. Journal of Instruction Level Parallelism, 2, May 2000.


Exploiting Instruction-Level Parallelism for Memory System.. - Pai (2000)   Self-citation (Pai Adve)   (Correct)

....of current processors. 1.2.2 ILP Specific Code Transformations to Improve Multiprocessor Performance The second contribution of this dissertation is to propose software code transformations to improve read miss clustering for systems with out of order processors, while preserving cache locality [PA99, PA00]. We exploit code transformations already known and implemented in compilers for other purposes, providing the analysis needed to relate them to read miss clustering. The key transformation we use is unroll and jam, which was originally proposed for improving floating point pipelining and ....

....the system; thus, models that do not properly capture these effects may not be able to effectively characterize an ILP based multiprocessor system. Nevertheless, direct execution simulations appear attractive since they are as much as an order of magnitude faster than the detailed simulator RSIM [DPA99]. This portion of the study draws from joint works with Murthy Durbhakula and Parthasarathy Ranganathan. This dissertation then presents two evaluation methodologies to speed up the characterization of ILP multiprocessors. The first is a novel adaptation of direct execution simulation that ....

[Article contains additional citation context not shown here]

Vijay S. Pai and Sarita Adve. Code Transformations to Improve Memory Parallelism. In Proceedings of the 32nd Annual International Symposium on Microarchitecture, pages 147--155, November 1999.


Code Transformations to Improve Memory Parallelism - Pai, Adve (1999)   (6 citations)  Self-citation (Pai Adve)   (Correct)

....under Grant No. 003604 025. Sarita Adve is also supported by an Alfred P. Sloan Research Fellowship. Vijay S. Pai was also supported by a Fannie and John Hertz Foundation Fellowship. This paper extends the authors' paper of the same title published in MICRO-32 (Copyright IEEE, November 1999) [1] by including results with more applications and experiments on a real machine. This paper presents code transformations to improve memory parallelism for systems with out of order processors, while preserving cache locality. We exploit code transformations already known and ....

V. S. Pai and S. Adve, "Code Transformations to Improve Memory Parallelism," in Proceedings of the 32nd Annual International Symposium on Microarchitecture, pp. 147--155, November 1999.


Code Transformations to Improve Memory Parallelism - Pai, Adve (1999)   (6 citations)  Self-citation (Pai Adve)   (Correct)

....Sarita Adve is also supported by an Alfred P. Sloan Research Fellowship. Vijay S. Pai was also supported by a Fannie and John Hertz Foundation Fellowship. This technical report draws from and extends the authors' paper of the same title published in MICRO-32 (Copyright IEEE, November 1999) [15]. In particular, Sections 4 and 5 include extended discussion of experiments on a real machine. though ILP techniques successfully and consistently reduced the CPU component of execution time, their impact on the memory (read) stall component was lower and more application dependent, making read ....

V. S. Pai and S. Adve. Code Transformations to Improve Memory Parallelism. In Proceedings of the 32nd Annual International Symposium on Microarchitecture, November 1999.


Code Transformations to Improve Memory Parallelism - Pai, Adve (1999)   (6 citations)  Self-citation (Pai Adve)   (Correct)

....contention are 1 cycle for L1 hits, 10 cycles for L2 hits, 85 cycles for local memory, 180-260 cycles for remote memory, and 210-310 cycles for cache to cache transfers. We also briefly summarize experimental results using a real machine (Convex Exemplar) with more detail in our extended report [12]. Processor parameters: clock rate 500 MHz; fetch rate 4 instructions/cycle; instruction window 64 instructions in flight; memory queue size 32; outstanding branches 16; functional unit count 2 ALUs, 2 FPUs, 2 address units; functional unit latencies (cycles) 1 (addr. gen., most ALU), 3 ....

....Table 2 summarizes the evaluation workload for the simulated system. The number of processors used for the simulated multiprocessor experiments is based on application scalability, with a limit of 16. The input sizes and processor counts for experiments on the real machine are reported in [12]. Each code is compiled with the Sun SPARC SC4.2 compiler, using the -xO4 optimization level. We incorporate miss clustering transformations by hand, following the algorithms presented. Latbench is based on the lat_mem_rd kernel of lmbench [8]. lat_mem_rd sees inner loop address recurrences from ....

[Article contains additional citation context not shown here]

V. S. Pai and S. Adve. Code Transformations to Improve Memory Parallelism. Technical Report ECE-9909, Rice University, Sep. 1999. http://www.ece.rice.edu/~rsim/pubs/TR9909.ps


Code Transformations to Improve Memory Parallelism - Pai, Adve (1999)   (6 citations)  Self-citation (Pai Adve)   (Correct)

....contention are 1 cycle for L1 hits, 10 cycles for L2 hits, 85 cycles for local memory, 180-260 cycles for remote memory, and 210-310 cycles for cache to cache transfers. We also briefly summarize experimental results using a real machine (Convex Exemplar) with more detail in our extended report [12]. Processor parameters: clock rate 500 MHz; fetch rate 4 instructions/cycle; instruction window 64 instructions in flight; memory queue size 32; outstanding branches 16; functional unit count 2 ALUs, 2 FPUs, 2 address units; functional unit latencies (cycles) 1 (addr. gen., most ALU), 3 (most FPU), 7 ....

....Table 2 summarizes the evaluation workload for the simulated system. The number of processors used for the simulated multiprocessor experiments is based on application scalability, with a limit of 16. The input sizes and processor counts for experiments on the real machine are reported in [12]. Each code is compiled with the Sun SPARC SC4.2 compiler, using the -xO4 optimization level. We incorporate miss clustering transformations by hand, following the algorithms presented. Latbench is based on the lat_mem_rd kernel of lmbench [8]. lat_mem_rd sees inner loop address recurrences from ....

[Article contains additional citation context not shown here]

V. S. Pai and S. Adve. Code Transformations to Improve Memory Parallelism. Technical Report ECE-9909, Rice University, Sep. 1999. http://www.ece.rice.edu/~rsim/pubs/TR9909.ps
