21 citations found.
S. A. McKee, A. Aluwihare, B. H. Clark, R. H. Klenke, T. C. Landon, C. W. Oliver, M. H. Salinas, A. E. Szymkowiak, K. L. Wright, W. A. Wulf, and J. H. Aylor. Design and evaluation of dynamic access ordering hardware. In Proceedings of the 1996.

This paper is cited in the following contexts:
General-Purpose Architectures for Media Processing.. - Parthasarathy..

....be used for these purposes as well. However, our reconfigurable cache organization is general enough to include other applications for the SRAM arrays including instruction reuse. Other reconfigurable approaches like the Impulse Project [20], the Galileo project [15], the Stream Memory Controller [78], and other studies [54, 125] also propose adaptive cache designs to address the problem of inefficient cache usage. However, these approaches primarily focus on application-specific mechanisms to either (i) dynamically reorder memory accesses to exploit parallelism and locality, and/or (ii) ....

S. A. McKee et al. Design and Evaluation of Dynamic Access Ordering Hardware. In Proceedings of the


Stream Scheduling - Kapasi, Mattson, Dally, Owens, Towles (2001)   (7 citations)

....program. [Figure 2: Imagine architecture block diagram] While previous systems have been able to extract high-level information in order to simultaneously compute and perform stream memory operations [1, 5], they use a set of FIFO queues for stream routing instead of a general stream register file. This paper presents an overview of scheduling for a stream architecture, and is explored in detail by Mattson [4]. The next section, Section 2, presents a basic streaming system framework within which ....

S. McKee, C. Oliver, W. Wulf, K. Wright, and J. Aylor. Design and evaluation of dynamic access ordering hardware. In Proc. 10th ACM International Conference on Supercomputing, May 1996.


Impulse: Building a Smarter Memory Controller - Carter, Hsieh, Stoller.. (1999)   (41 citations)

....hardware mechanisms have been proposed to address the problem of increasing memory system overhead. For example, researchers have evaluated the prospects of making the processor cache configurable [25, 26], adding computational power to the memory system [14, 18, 24], and supporting stream buffers [13, 16]. All of these mechanisms promise significant performance improvements; unfortunately, most require significant changes to processors, caches, or memories, and thus have not been adopted in current systems. Impulse supports similar optimizations, but its hardware modifications are localized to the ....

....caches, but would fit in many L2 caches. Impulse can be used to remap x to pages that occupy most of the physically indexed L2 cache, and can remap DATA, ROWS, and COLUMNS to a small number of pages that do not conflict with x. In effect, we can use a small part of the L2 cache as a stream buffer [16] for DATA, ROWS, and COLUMNS. 3.2. Tiled Matrix Algorithms. Dense matrix algorithms form an important class of scientific kernels. For example, LU decomposition and dense Cholesky factorization are dense matrix computational kernels. Such algorithms are tiled (or blocked) in order to increase ....

[Article contains additional citation context not shown here]

S. McKee et al. Design and evaluation of dynamic access ordering hardware. In Proc. of the 10th ACM ICS, Philadelphia, PA, May 1996.


Memory System Support for Image Processing - Zhang, Carter, Hsieh, Kee (1999)   (4 citations)

....this problem with an incremental prefetching technique that reduces stream buffer bandwidth consumption by 50%. In contrast, prefetching within the memory controller itself never wastes system bus bandwidth loading unneeded data onto the processor chip. McKee et al.'s Stream Memory Controller [14] combines programmable stream buffers and prefetching within the memory controller with intelligent DRAM scheduling. This system dynamically reorders vector or stream accesses to exploit parallelism in the memory system and to exploit locality of reference among the DRAM page buffers. In the same ....

S. A. McKee et al. Design and evaluation of dynamic access ordering hardware. In Proceedings of the 1996 International Conference on Supercomputing, May 1996.


Impulse: Building a Smarter Memory Controller - Carter, Hsieh, Stoller.. (1999)   (41 citations)

....hardware mechanisms have been proposed to address the problem of increasing memory system overhead. For example, researchers have evaluated the prospects of making the processor cache configurable [26, 27], adding computational power to the memory system [15, 19, 25], and supporting stream buffers [14, 17]. All of these mechanisms promise significant performance improvements; unfortunately, most require significant changes to processors, caches, or memories, and thus have not been adopted in current systems. Impulse supports similar optimizations, but its hardware modifications are localized to the ....

....caches, but would fit in many L2 caches. Impulse can be used to remap x to pages that occupy most of the physically indexed L2 cache, and can remap DATA, ROWS, and COLUMNS to a small number of pages that do not conflict with x. In effect, we can use a small part of the L2 cache as a stream buffer [17] for DATA, ROWS, and COLUMNS. The resulting performance should approach that of the column labeled Best in Table 1. 3.2 Tiled Matrix Algorithms. Dense matrix algorithms form an important class of scientific kernels. For example, LU decomposition and dense Cholesky factorization are dense matrix ....

[Article contains additional citation context not shown here]

S. McKee et al. Design and evaluation of dynamic access ordering hardware. In Proceedings of the 10th ACM International Conference on Supercomputing, Philadelphia, PA, May 1996.


Impulse: An Adaptable Memory System - Carter, Hsieh, Stoller, Swanson.. (1998)   (2 citations)

....be used to remap x to pages that occupy most of the physically indexed L2 cache, and can remap DATA, ROWS, and COLUMNS to a small number of pages that do not conflict with either x or each other. In effect, we can use a small part of the L2 cache as a stream buffer for DATA, ROWS, and COLUMNS [10]. 4.3 Tile Remapping. Impulse can be used to improve the performance of a tiled matrix multiply. We assume an i-j-k ordering of the matrix multiply loops, as follows: for i = 0 to n-1; for j = 0 to n-1; for k = 0 to n-1; C[i,j] += A[i,k] * B[k,j]. We want to keep the current tile of the C matrix in ....
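
The i-j-k loop nest quoted in the excerpt above can be written out as a short Python sketch. The extracted text has lost its operators, so the accumulate-and-multiply form (`+=`, `*`) is an assumption based on the standard definition of matrix multiplication; the function name `matmul_ijk` and the list-of-lists representation are illustrative and do not come from the cited paper. The sketch shows only the untiled loop order, not Impulse's tile remapping.

```python
def matmul_ijk(a, b, n):
    """Multiply two n x n matrices (lists of lists) using an i-j-k loop order."""
    # Result matrix starts at zero so each entry can be accumulated in place.
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):          # row of C and A
        for j in range(n):      # column of C and B
            for k in range(n):  # inner-product index
                c[i][j] += a[i][k] * b[k][j]
    return c


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    print(matmul_ijk(a, b, 2))
```

With this ordering the innermost loop walks a row of A and a column of B, which is the access pattern that tiling is meant to make cache-friendly.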

....between tiles. 6 Related Work. A number of projects have proposed modifications to conventional CPU or DRAM designs to attack the "memory wall": supporting massive multithreading [1], moving processing power on to DRAM chips [8], building programmable stream buffers and memory controllers [10], or developing configurable architectures [17]. While these projects show promise, it is now almost impossible to prototype non-traditional CPU or cache designs that can perform as well as commodity processors. In addition, the performance of processor-in-memory approaches are handicapped by the ....

[Article contains additional citation context not shown here]

S. McKee et al. Design and evaluation of dynamic access ordering hardware. In Proceedings of the 10th ACM International Conference on Supercomputing, Philadelphia, PA, May 1996.


A New Voting Based Hardware Data Prefetch Scheme - Singh, Mukul, Prasad, Patterson

....cache. Early studies in hardware prefetching focussed on simple one-block-lookahead schemes, i.e., upon referencing block i, block i+1 is to be prefetched. Smith [15] studied variations of this scheme. An extension to this idea, namely stream buffers, was proposed by Jouppi [7]. McKee and Wulf [10] have merged off-chip stream buffers and the memory controller to improve memory system performance by reordering memory accesses. A different prefetching scheme, proposed by Lee [8], is to decode ahead in the instruction stream. More recently, Baer and Chen [1] have come up with the idea of ....

S.A. McKee et al. Design and Evaluation of Dynamic Access Ordering Hardware. ICS, May 1996.


A Performance Comparison of Contemporary DRAM Architectures - Cuppu, Jacob, Davis, Mudge (1999)   (41 citations)

....example, there is a factor of two difference between the average access latency for compress and perl. This effect has been seen before: McKee's work shows that intentionally reordering memory accesses to exploit locality can have an order of magnitude effect on memory system performance [21, 22]. Summary: Coupled with extremely wide buses that hide the effects of limited bandwidth and thus highlight the differences in memory latency, the DRAM architectures perform similarly. As FPM1 and ESDRAM show, the variations in Row Access can be avoided by always closing the row buffer after an ....

S. McKee, et al. "Design and evaluation of dynamic access ordering hardware." In Proc. International Conference on Supercomputing, May 1996.


Algorithmic Foundations for a Parallel Vector Access.. - Mathew, McKee, Carter.. (2000)   (1 citation)  Self-citation (Mckee)

No context found.

S. McKee, et al. Design and evaluation of dynamic access ordering hardware. In Proceedings of the 1996.


Parallel Vector Access: A Technique for Improving Memory System.. - Mathew (2000)   Self-citation (Mckee)

No context found.

MCKEE, S. A., ALUWIHARE, A., CLARK, B. H., KLENKE, R. H., LANDON, T. C., OLIVER, C. W., SALINAS, M. H., SZYMKOWIAK, A. E., WRIGHT, K. L., WULF, W. A., AND AYLOR, J. H. Design and evaluation of dynamic access ordering hardware. In Proceedings of the 10th ACM International Conference on Supercomputing (May 1996), pp. 125--132.


Pointer-Based Prefetching within the Impulse Adaptable.. - Zhang, McKee, Hsieh.. (2000)   (5 citations)  Self-citation (Mckee)

....hardware support for such prefetch operations. For instance, they might augment the ISA with a prefetch instruction [10], redefine a load to a specific register (e.g., to register 0, as in the PA-RISC architectures [15]), or provide programmable prefetch engines [6] or programmable stream buffers [19]. Hardware-only prefetching [2, 9, 12, 14, 29] thus has the advantage of being transparent, and some commercial machines include such mechanisms [5, 7, 28]. However, due to its speculative nature, care must be taken to keep from lowering application performance by increasing contention in the ....

....one risks the possibilities that the prefetched data may replace other needed data or may be evicted before it is used. Contention in the on-chip cache hierarchy can be avoided by buffering prefetched data lower in the memory system. For instance, the Stream Memory Controller of McKee et al. [19] combines prefetching within the memory controller with dynamic access ordering to exploit bank parallelism and locality of reference among the DRAM pages. Other researchers propose to move the prefetching mechanisms all the way down to the DRAM chips. Alexander and Kedem demonstrate significant ....

S. McKee, et al. Design and evaluation of dynamic access ordering hardware. In Proceedings of the


Research Statement - McKee   Self-citation (Mckee)

....an incomplete solution. My research is guided by the belief that a successful approach to the memory latency bandwidth problem requires looking beyond the traditional caching paradigm. My early research involved improving memory performance for streaming (e.g., vector) computations [HPCA 95, ICS 96]. At Virginia, I was the lead graduate student responsible for the design of a memory subsystem called the Stream Memory Controller (SMC). In this system, the compiler generated code to transmit stream parameters to the SMC hardware at run time, and the SMC prefetched read data and buffered write ....

S.A. McKee, C.W. Oliver, Wm.A. Wulf, K.L. Wright, J.H. Aylor, "Design and Evaluation of Dynamic Access Ordering Hardware", Proc. 10th ACM International Conference on Supercomputing, May 1996.


Pointer-Based Prefetching within the Impulse Adaptable.. - Zhang, McKee, Hsieh.. (2000)   (5 citations)  Self-citation (Mckee)

....hardware support for such prefetch operations. For instance, they might augment the ISA with a prefetch instruction [10], redefine a load to a specific register (e.g., to register 0, as in the PA-RISC architectures [15]), or provide programmable prefetch engines [6] or programmable stream buffers [19]. Hardware-only prefetching [2, 9, 12, 14, 29] thus has the advantage of being transparent. However, due to its speculative nature, care must be taken to keep from lowering application performance by increasing contention in the caches and wasting bus bandwidth on useless prefetches. Some ....

....cache lines). When prefetching to cache, the prefetched data may replace other needed data, or may be evicted before it is used. Contention in the on-chip cache hierarchy can be avoided by buffering prefetched data lower in the memory system. For instance, the Stream Memory Controller of McKee et al. [19] combines prefetching within the memory controller with dynamic access ordering to exploit bank parallelism and locality of reference among the DRAM pages. Other researchers propose moving the prefetching mechanisms all the way down to the DRAM chips. Alexander and Kedem demonstrate significant ....

S. McKee, et al. Design and evaluation of dynamic access ordering hardware. In Proc. of the 1996 ICS, May 1996.


Hardware-Only Stream Prefetching and Dynamic Access Ordering - Zhang, McKee (2000)   (5 citations)  Self-citation (Mckee)

....predictable, and this predictability can be exploited to improve the efficiency of the memory subsystem: the memory controller and the DRAM back end. Previous work has examined memory scheduling mechanisms in the context of compiler- or application-supplied information about access patterns [22, 16, 20]. Here we investigate whether it makes sense to reorder accesses within the memory controller in the absence of compiler- or application-supplied access pattern information. For current processor and memory technologies, the processor's natural reference stream provides the ordering mechanism with ....

....exposing those mechanisms to software. For instance, they might augment the ISA with a prefetch instruction [13], redefine a load to a specific register (e.g., to register 0, as in the PA-RISC architectures [18]), or provide programmable prefetch engines [8] or programmable stream buffers [22]. Baer and Chen [2], Fu and Patel [15], and Sklenar [31] propose dynamic vector prefetch units that induce stream parameters at run time. The cache-based sequential hardware prefetching of Dahlgren et al. [11] eliminates the need for detecting strides dynamically. To minimize the number of ....

[Article contains additional citation context not shown here]

S. McKee, et al. Design and evaluation of dynamic access ordering hardware. In Proc. of the 1996 ICS, May 1996.


Algorithmic Foundations for a Parallel Vector Access.. - Mathew, McKee, Carter.. (2000)   (1 citation)  Self-citation (Mckee)

.... The memory system that our work targets returns requested base-stride data as cache lines that can be stored in the processor's normal cache hierarchy [3]. However, our PVA unit is equally applicable to systems that store the requested base-stride data in vector registers [5, 7] or stream buffers [13]. For the sake of clarity, we refer to the chunks of base-stride data manipulated by the memory controller as memory vectors, while we refer to the base-stride data structures manipulated by the program as application vectors. We first review the basic operation of dynamic memory devices, and ....

.... that unroll loops and group accesses within each stream to amortize the cost of each DRAM page miss over several references to that page [14]. The Stream Memory Controller built at the University of Virginia extends Moyer's work to implement access ordering in hardware dynamically at run time [13]. The SMC reorders stream accesses to avoid bank conflicts and bus turnarounds, and to exploit locality of reference within the row buffers of fast page mode DRAM components. The simple reordering scheme used in this proof-of-concept serial, gathering memory controller targets a system with only ....
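
The row-buffer locality argument in the excerpt above can be illustrated with a toy Python model. This is only a sketch under stated assumptions, not the SMC's actual scheduling algorithm: it assumes a single DRAM bank with one open row, a hypothetical 1024-byte row size, 8-byte elements, and a fixed grouping depth of 16 accesses per stream. All names (`row_misses`, `interleaved`, `reordered`) are made up for illustration.

```python
ROW_SIZE = 1024  # bytes per DRAM row (assumed value for this toy model)


def row_misses(addresses, row_size=ROW_SIZE):
    """Count row-buffer misses for a single bank that keeps one row open."""
    open_row = None
    misses = 0
    for addr in addresses:
        row = addr // row_size
        if row != open_row:  # a different row must be opened: page miss
            misses += 1
            open_row = row
    return misses


def interleaved(base_x, base_y, n, elem=8):
    """Natural order of a loop that touches x[i] and then y[i]."""
    order = []
    for i in range(n):
        order.append(base_x + i * elem)
        order.append(base_y + i * elem)
    return order


def reordered(base_x, base_y, n, elem=8, depth=16):
    """Same accesses, but issued in groups of `depth` per stream."""
    order = []
    for start in range(0, n, depth):
        stop = min(start + depth, n)
        order.extend(base_x + i * elem for i in range(start, stop))
        order.extend(base_y + i * elem for i in range(start, stop))
    return order


if __name__ == "__main__":
    natural = interleaved(0, 1 << 20, 128)
    grouped = reordered(0, 1 << 20, 128)
    print("natural order misses:", row_misses(natural))
    print("grouped order misses:", row_misses(grouped))
```

In this model the two streams live in different rows, so the natural interleaving reopens a row on every access, while grouping accesses per stream pays for a row open once per group. Real controllers must also account for multiple banks, bus turnarounds, and buffer capacity, which this sketch ignores.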

S. McKee, et al. Design and evaluation of dynamic access ordering hardware. In Proceedings of the 1996 International Conference on Supercomputing, May 1996.


Design of a Parallel Vector Access Unit for SDRAM Memory.. - Mathew, McKee, Carter.. (2000)   (7 citations)  Self-citation (Mckee)

....results demonstrate performance improvements of 15-54% over a serial controller. Our bank controllers behaviorally resemble CVMS section controllers, but our hardware design and parallel access algorithm (see Section 4.3) differ substantially. The Stream Memory Controller (SMC) of McKee et al. [14] combines programmable stream buffers and prefetching in a memory controller with intelligent DRAM scheduling. Vector data bypass the cache in this system, but the underlying access ordering concepts can be adapted to systems that cache vectors. The SMC dynamically reorders stream vector accesses ....

S. McKee et al. Design and evaluation of dynamic access ordering hardware. In Proceedings of the 10th ACM International Conference on Supercomputing, Philadelphia, PA, May 1996.


Impulse: Memory System Support for Scientific.. - Carter, Hsieh, Stoller, .. (1999)   (2 citations)  Self-citation (Mckee)

....that reduces stream buffer bandwidth consumption by 50% without decreasing performance. 10 In contrast, systems that prefetch within the memory controller itself never waste bus bandwidth fetching unneeded data onto the processor chip. The Dynamic Access Ordering systems studied by McKee et al. [22] and Hong et al. [16] combine programmable stream buffers and prefetching within the memory controller with intelligent DRAM scheduling. For vector or streaming applications with predictable memory reference patterns, these systems dynamically reorder stream accesses to improve bus utilization, to ....

S. A. McKee et al. Design and evaluation of dynamic access ordering hardware. In Proceedings of the 10th ACM International Conference on Supercomputing, Philadelphia, PA, May 1996.


Next-Generation Memory Systems - Wang (2004)

No context found.

S. A. McKee, A. Aluwihare, B. H. Clark, R. H. Klenke, T. C. Landon, C. W. Oliver, M. H. Salinas, A. E. Szymkowiak, K. L. Wright, W. A. Wulf, and J. H. Aylor. Design and evaluation of dynamic access ordering hardware. In Proceedings of the 1996.


Efficient Remapping Mechanisms for an Adaptable Memory System - Zhang (2002)

No context found.

S. A. McKee, A. Aluwihare, B. H. Clark, R. H. Klenke, T. C. Landon, C. W. Oliver, M. H. Salinas, A. E. Szymkowiak, K. L. Wright, W. A. Wulf, and J. H. Aylor. Design and evaluation of dynamic access ordering hardware. In Proceedings of the 1996.


DDR2 and Low Latency Variants - Davis, Mudge, Cuppu, Jacob (2000)   (3 citations)

No context found.

S. McKee, A. Aluwihare, B. Clark, R. Klenke, T. Landon, C. Oliver, M. Salinas, A. Szymkowiak, K. Wright, W. Wulf, and J. Aylor. 1996. "Design and evaluation of dynamic access ordering hardware." In Proc. International Conference on Supercomputing, Philadelphia PA.


Smarter Memory: Improving Bandwidth for Streamed References - McKee, al. (1998)   (7 citations)

No context found.

S.A. McKee et al., "Design and Evaluation of Dynamic Access Ordering Hardware," Proc. ACM SIGARCH Int'l Conf. Supercomputing, ACM Press, New York, 1996, pp. 125-132.