| McKee, S.A., Moyer, S.A., Wulf, Wm.A., Hitchcock, C., "Increasing Memory Bandwidth for Vector Computations", Proc. Conf. on Prog. Lang. and Sys. Arch., Zurich, Switzerland, March 1994; also University of Virginia, TR CS-93-34. |
....of the Obvious. Appeared in Computer Architecture News, 23(1):20-24, March 1995. Our prediction of the memory wall is probably wrong too, but it suggests that we have to start thinking "out of the box". All the techniques that the authors are aware of, including ones we have proposed [McK94, McK94a], provide one-time boosts to either bandwidth or latency. While these delay the date of impact, they don't change the fundamentals. The most convenient resolution to the problem would be the discovery of a cool, dense memory technology whose speed scales with that of processors. We are not aware ....
S.A. McKee, et al., "Increasing Memory Bandwidth for Vector Computations", Proc. Conference on Programming Languages and System Architecture, Zurich, March 1994.
....beneficial impact of access ordering on effective memory bandwidth together with the limitations inherent in implementing the technique statically motivate us to consider an implementation that reorders accesses dynamically at run time. What follows is an overview of the architecture proposed in [McK93b, McK93c]: see those documents for more details. Our discussion is based on the simplified architecture of Figure 1. In this system, memory is interfaced to the processors through a controller labeled MSU for Memory Scheduling Unit. The MSU includes logic to issue memory requests as well as logic to ....
....of memory banks, DRAM speed, benchmark algorithm, and vector length, stride, and alignment with respect to memory banks. Complete uniprocessor results, including a detailed description of each access ordering heuristic, can be found in [McK93a]; highlights of these results are presented in [McK93b, McK93c]. Since our concern here is to correlate the performance predictions of our analytic model with our functional simulation results, we present only the maximum percentage of peak bandwidth attained by any order/issue policy simulated for a given memory system and benchmark. 6. Benchmark Suite ....
McKee, S.A., Moyer, S.A., Wulf, Wm.A., Hitchcock, C., "Increasing Memory Bandwidth for Vector Computations", University of Virginia, TR CS-93-34, August 1993. To appear in Proc. Conf. on Prog. Lang. and Sys. Arch., Zurich, Switzerland, March 1994.
....One way to do this is via access ordering, which we define as any technique for changing the order of memory requests to increase bandwidth. Here we are especially concerned with ordering a set of vector-like stream accesses. For a more thorough discussion of access ordering, see [Moy92, Moy93, McK93a, McK93b]. The performance benefits of doing such static access ordering can be quite dramatic [Moy92, Moy93], but without the kinds of address alignment information that are usually only available at run time, the compiler can't generate the optimal access sequence. The extent to which a compiler can ....
....conflict-free access to interleaved memory [Har89, Rau91, Val91], software prefetching data to the cache [Cal91, Kla91, Soh91], and hardware prefetching vector data to cache [Bae91, Fu91, Jou90, Skl92]. For a more detailed discussion of how these schemes relate to dynamic access ordering, see [McK93b]. The main difference between these techniques and the complementary one we propose here is that we reorder stream accesses to exploit the architectural and component features that make memory systems sensitive to the sequence of requests. 3. The Stream Memory Controller The design space of ....
[Article contains additional citation context not shown here]
McKee, S.A., Moyer, S.A., Wulf, Wm.A., Hitchcock, C., "Increasing Memory Bandwidth for Vector Computations", University of Virginia, TR CS-93-34, August 1993.
....number of memory banks, DRAM speed, benchmark kernel, and vector length, stride, and alignment with respect to memory banks. Complete uniprocessor results, including a detailed description of each access ordering heuristic, can be found in [McK93a]; highlights of these results are presented in [McK94a, McK94b]. Complete shared memory multiprocessor results can be found in [McK94c]. Since our concern here is to correlate the performance bounds of our analytic model with our functional simulation ....
McKee, S.A., Moyer, S.A., Wulf, Wm.A., Hitchcock, C., "Increasing Memory Bandwidth for Vector Computations", Proc. Conf. on Prog. Lang. and Sys. Arch., Zurich, Switzerland, March, 1994; also University of Virginia, Technical Report CS-93-34.
....general structure of the dissertation is illustrated by the tree shown in Figure 1.5. Some of our results have been published previously. The uniprocessor SMC architecture and parts of the corresponding simulation results from Chapter 2 and Chapter 3 were described in [McK94a, McK94b, McK95b]. The analytic models in Chapter 3 and Chapter 4 and a description of the Symmetric Multiprocessor SMC organization introduced in Chapter 4 were first presented in [McK95b]. Parts of the results in Chapter 2 appear in [McK95a]. Complete results for the functional simulations and analytic models ....
S.A. McKee, S.A. Moyer, Wm.A. Wulf, and C. Hitchcock, "Increasing Memory Bandwidth for Vector Computations", Lecture Notes in Computer Science 782: Proceedings of the Conference on Programming Languages and Systems Architectures (PLSA, Zurich, Switzerland), pages 87-104, Springer Verlag, 1994.
....One way to do this is via access ordering, which we define as any technique for changing the order of memory requests to increase bandwidth. Here we are especially concerned with ordering a set of vector-like stream accesses. For a more thorough discussion of access ordering, see [Moy92, Moy93, McK93a, McK93b]. The performance benefits of doing such static access ordering can be quite dramatic [Moy92, Moy93], but without the kinds of address alignment information that are usually only available at run time, the compiler can't generate the optimal access sequence. The extent to which a compiler can ....
....for conflict-free access to interleaved memory [Har89, Rau91, Val91], software prefetching data to the cache [Cal91, Kla91, Soh91], and hardware prefetching vector data to cache [Bae91, Fu91, Jou90, Skl92]. For a more detailed discussion of how these schemes relate to dynamic access ordering, see [McK93b]. The main difference between these techniques and the complementary one we propose here is that we reorder stream accesses to exploit the architectural and component features that make memory systems sensitive to the sequence of requests. 3. The Stream Memory Controller The design space of ....
[Article contains additional citation context not shown here]
McKee, S.A., Moyer, S.A., Wulf, Wm.A., Hitchcock, C., "Increasing Memory Bandwidth for Vector Computations", University of Virginia, TR CS-93-34, August 1993.
....A 71,000 transistor ASIC has been designed and fabricated, and is currently being tested and used to verify expected SMC performance gains. Our results indicate that the fabricated SMC can deliver the expected bandwidth improvements for inner loops of important streaming computations [6] [10] [11]. Our need to use graduate students, our experience and access to MGC tools, and the necessity to use a particular IC fabrication process (0.75 µm HP through MOSIS) forced us to use tools that were not tightly integrated. This led to the development of the design and revision process described ....
McKee, S.A., Moyer, S.A., Wulf, Wm.A., Hitchcock, C., "Increasing Memory Bandwidth for Vector Computations", Lecture Notes in Computer Science 782 (Proc. PLSA, Zurich, Switzerland, March 1994), Springer Verlag, 1994.
....we develop analytic models that bound the performance of any uniprocessor or symmetric multiprocessor memory system on streams. We present highlights of these results, comparing them to the performance of a scheme we have proposed for accessing stream data, the Stream Memory Controller (SMC) [McK94a, McK94b]. There are two independent comparisons: a bus-level simulation, and a gate-level simulation of the SMC's VHDL description. Both forms predict the SMC consistently delivers nearly the maximum attainable bandwidth determined by the analytic bounds. While not reported here, preliminary tests of the ....
....(Appeared in Proceedings of Europar 95, Stockholm, Sweden, August 1995. Lecture Notes in Computer Science 966, S. Haridi, et al., Eds., Springer Verlag, Berlin, 1995, pages 83-99.) ....[McK94a, McK94b]. Complete shared memory multiprocessor results can be found in [McK94c]. Since our concern here is to correlate the performance bounds of our analytic model with our functional simulation results, we present only the maximum percentage of peak bandwidth attained by any order/issue policy ....
McKee, S.A., Moyer, S.A., Wulf, Wm.A., and Hitchcock, C., "Increasing Memory Bandwidth for Vector Computations", Proc. Programming Languages and System Architectures, Zurich, Switzerland, March 1994.
....a basis for comparing the performance improvements of the other schemes. None of the techniques requires heroic compiler technology: the compiler need only detect streams, as in Benitez and Davidson's algorithm [2]. Dynamic access ordering requires a small amount of special-purpose hardware [25], and both static and dynamic access ordering depend on the availability of non-caching load instructions. Although rare, these instructions are available in some commercial processors, such as the Convex C-1 [37] and Intel i860 [15]. Other architectures, such as the DEC Alpha [7], provide a means ....
....on registers and cache. A system that reorders accesses at runtime and provides separate buffer space can reap the benefits of access ordering without these disadvantages, at the expense of adding a relatively small amount of special-purpose hardware. One such scheme is depicted in Figure 1 [23, 25]. In this organization, memory is interfaced to the processor through a controller (or Memory Scheduling Unit) that includes logic to issue memory requests and logic to determine the order of requests during streaming computations. A set of control registers allow the processor ....
[Article contains additional citation context not shown here]
McKee, S.A., et al., "Increasing Memory Bandwidth for Vector Computations", Lecture Notes in Computer Science 782 (PLSA, Zurich, Switzerland, March 1994), Springer Verlag, 1994.
....access ordering at run time [McK94a]. Simulation studies indicate that dynamic access ordering is a valuable technique for improving uniprocessor memory performance for stream computations: the SMC, or Stream Memory Controller, consistently delivers almost the entire available bandwidth [McK93a, McK94b, McK93c]. The applicability of dynamic access ordering is not limited to uniprocessor environments. This paper discusses the effectiveness of dynamic access ordering with respect to the memory performance of symmetric multiprocessor (SMP) systems. Our simulation results show that a modest number of ....
....These performance anomalies are discussed in Section 5.3. Figure 3 shows an example mapping of memory banks to FIFO positions for a stride-one vector when the length of the FIFOs is less than the number of banks. 3.1 Benchmark Suite The benchmark suite used is the same as in previous SMC studies [McK93a, McK93c, McK94a, McK94b], and is described in Figure 4. These benchmarks represent access patterns found in real scientific codes, including the inner loops of blocked algorithms. The suite constitutes a representative subset of all possible access patterns for computations involving a small number of vectors. The hydro ....
[Article contains additional citation context not shown here]
McKee, S.A., Moyer, S.A., Wulf, Wm.A., Hitchcock, C., "Increasing Memory Bandwidth for Vector Computations", University of Virginia, TR CS-93-34, August 1993. To appear in Proc. Conf. on Prog. Lang. and Sys. Arch., Zurich, Switzerland, March 1994.