| S. A. McKee, "Maximizing memory bandwidth for streamed computations," Ph.D. Thesis, School of Engineering and Applied Science, University of Virginia, May 1995. |
.... The first two Sin32F instructions set up two stream buffers. They designate registers f0 and f1 as the heads of the two FIFO queues. Within the loop body, a read from one of these two registers dequeues a data item from the appropriate queue. McKee et al. [McKee95a,McKee95b] extended this work by proposing a Stream Memory Controller (SMC) architecture. In this architecture, multiple stream buffers were used to store: (i) data prefetched from the memory (input queue) and (ii) data to be stored to the memory (output queue). The heads of these FIFO queues appeared to ....
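The dequeue-on-read behavior described in this excerpt can be illustrated with a minimal sketch. It is a software model only, not the cited hardware: the class `StreamBuffer`, its parameters (`base`, `stride`, `length`, `depth`), and the method `read_head` are hypothetical names chosen for illustration, and the prefetch policy (refill after every read) is an assumption.

```python
from collections import deque


class StreamBuffer:
    """Toy model of an input stream buffer: a FIFO fed by strided prefetches.

    All names and the refill policy are illustrative assumptions.
    """

    def __init__(self, memory, base, stride, length, depth=4):
        self.memory = memory
        self.next_index = base      # next memory location to prefetch
        self.stride = stride
        self.remaining = length     # stream elements not yet prefetched
        self.depth = depth          # FIFO capacity
        self.fifo = deque()
        self._refill()

    def _refill(self):
        # Prefetch until the FIFO is full or the stream is exhausted.
        while self.remaining > 0 and len(self.fifo) < self.depth:
            self.fifo.append(self.memory[self.next_index])
            self.next_index += self.stride
            self.remaining -= 1

    def read_head(self):
        # Reading the head dequeues one item, as a read of the
        # designated register does in the excerpt above.
        value = self.fifo.popleft()
        self._refill()
        return value


memory = list(range(100, 120))
x = StreamBuffer(memory, base=0, stride=1, length=4)
y = StreamBuffer(memory, base=10, stride=2, length=4)
sums = [x.read_head() + y.read_head() for _ in range(4)]
```

Each loop iteration consumes one element from each queue without computing an address, which is the property the excerpt attributes to reads of the head registers.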
....to avoid polluting the cache. In addition to the stream buffers, there is also a Memory Scheduling Unit that dynamically reschedules the access requests made by the SBU and the cache. The unit coalesces and reschedules these requests to take advantage of the page access behavior of the DRAM memory [McKee95a,McKee95b]. 2.4 Data Address Generators On the Analog Devices SHARC ADSP 2106x CPU, there are two independent on-chip memory modules: the Program Memory (PM) and the Data Memory (DM). PM is used to store .... [Figure: CPU, Stream Buffer Unit (FIFOs), Cache, Memory Scheduling Unit, memory modules] ....
S. McKee, "Maximizing Memory Bandwidth for Streamed Computations," Ph.D. Thesis, Dept. of Computer Sc., University of Virginia, May 1995.
.... which examined the causes of the increase in cache performance, provided evidence to support our hypothesis. We wanted to examine a memory system where vector-like accesses bypass the cache because our research group has designed a piece of hardware called the Stream Memory Controller (SMC) [7,9], which can increase the speed of vector-like memory accesses that go directly to DRAM. Since previous research [8] had studied the effect of the SMC on the vector-like data, we needed to simulate the memory system where the Data Cache Performance When Vector Like Accesses Bypass the Cache July ....
....whenever it is efficient to do so, grouping together many single-word transactions that go to the same DRAM page. These group transfers drastically reduce the number of page switches in the DRAM, and utilize nearly 100% of the available bandwidth from the DRAM. These results are documented in [7,8,9]. FIGURE 3. Relative Sizes of Memory Transfers (labels: CPU, SMC, Memory, Cache; 1 word transfers, 1 word transfers, 8 word transfers, 1 to 256 word transfers). Data Cache Performance When Vector Like Accesses Bypass the Cache, July 5, 1997. 3.0 Tools and Methodology We received a number of traces from ....
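The effect of grouping accesses by DRAM page can be seen in a small sketch. It is a simplified model and makes several assumptions not stated in the excerpt: a single DRAM bank, a page size of 512 words, two vectors placed in different pages, and a group size of 8 words. The function names are hypothetical.

```python
def count_page_switches(addresses, page_words=512):
    """Count how often consecutive accesses land on a different DRAM page."""
    switches = 0
    current_page = None
    for addr in addresses:
        page = addr // page_words
        if current_page is not None and page != current_page:
            switches += 1
        current_page = page
    return switches


def interleaved(base_a, base_b, n):
    """Natural loop order: a[0], b[0], a[1], b[1], ..."""
    order = []
    for i in range(n):
        order.append(base_a + i)
        order.append(base_b + i)
    return order


def grouped(base_a, base_b, n, group=8):
    """Reordered: `group` words of a, then `group` words of b, and so on."""
    order = []
    for start in range(0, n, group):
        end = min(start + group, n)
        order.extend(base_a + i for i in range(start, end))
        order.extend(base_b + i for i in range(start, end))
    return order


n = 64
base_a, base_b = 0, 4096  # assumed placement: the vectors sit in different pages
natural_switches = count_page_switches(interleaved(base_a, base_b, n))
grouped_switches = count_page_switches(grouped(base_a, base_b, n))
```

Under these assumptions the natural order changes page on every access after the first (127 switches for 128 accesses), while the grouped order changes page only between runs (15 switches). Real controllers with interleaved banks behave differently in detail, so the numbers are only indicative.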
S. A. McKee, "Maximizing Memory Bandwidth for Streamed Computations", Ph.D. thesis, University of Virginia, May 1995. Available through http://www.cs.virginia.edu/techrep
No context found.
MCKEE, S. Maximizing Memory Bandwidth for Streamed Computations. PhD thesis, School of Engineering and Applied Science, University of Virginia, May 1995.
....well in practice. For uniprocessor systems, its simulation performance is competitive with that of more sophisticated policies. More intelligent schemes are required to achieve uniformly good performance on streams whose strides do not hit all memory banks and on multiprocessor systems in general [34]. 6.1.1 Effective Bandwidth for Long Stream Computations Fig. 6 illustrates the measured performance of our prototype system on each of the benchmark kernels with vectors of 16 to 8K elements and with the FIFO depth set at 16. These graphs show the percentage of the peak system bandwidth ....
....show the percentage of the peak system bandwidth exploited for each benchmark. The short-dashed lines labeled limit indicate the combined effect of two performance bounds: SMC startup costs, and unavoidable page misses and bus turnaround delays (derivations of performance bounds are given elsewhere [34]). The long-dashed lines indicate the performance of our software simulations and the solid lines indicate the performance of our prototype hardware. The dotted lines indicate the performance measured when using caching load instructions to access the stream data in the i860's own cache-optimized ....
[Article contains additional citation context not shown here]
S.A. McKee, Maximizing Memory Bandwidth for Streamed Computations, PhD thesis, School of Eng. and Applied Science, Univ. of Virginia, May 1995.
....in practice. For uniprocessor systems, its simulation performance is competitive with that of more sophisticated policies. More intelligent schemes are required to achieve uniformly good performance on streams whose strides do not hit all memory banks, and on multiprocessor systems in general [McK95]. 6.1.1 Effective Bandwidth for Long Stream Computations Figure 6 illustrates the measured performance of our prototype system on each of the benchmark kernels with vectors of 16 to 8K elements and with the FIFO depth set at 16. These graphs show the percentage of the peak system bandwidth ....
....the percentage of the peak system bandwidth exploited for each benchmark. The short-dashed lines labeled limit indicate the combined effect of two performance bounds: SMC startup costs, and unavoidable page misses and bus turnaround delays (derivations of performance bounds are given elsewhere [McK95]). The long-dashed lines indicate the performance of our software simulations, and the solid lines indicate the performance of our prototype hardware. The dotted lines indicate the performance measured when using caching load instructions to access the stream data in the i860's own ....
[Article contains additional citation context not shown here]
S.A. McKee. Maximizing Memory Bandwidth for Streamed Computations. PhD thesis, School of Engineering and Applied Science, University of Virginia, May 1995.
....stream vector accesses and issues them serially to exploit: a) parallelism across dual banks of fast page mode DRAM, and b) locality of reference within DRAM page buffers. For most alignments and strides on uniprocessor systems, simple ordering schemes perform competitively with sophisticated ones [13]. Stream detection is an important design issue for these systems. At one end of the spectrum, the application programmer may be required to identify vectors, as is currently the case in Impulse. Alternatively, the compiler can identify vector accesses and specify them to the memory controller, ....
S. McKee. Maximizing Memory Bandwidth for Streamed Computations. PhD thesis, School of Engineering and Applied Science, University of Virginia, May 1995.
....ordering strategy works well in practice. For uniprocessor systems, its performance is competitive with that of more sophisticated policies. More intelligent schemes are required to achieve uniformly good performance on computations involving streams with strides that do not hit all memory banks [McK95b]. Further details of the design, implementation, and testing of the ASIC and daughterboard can be found elsewhere [McG94,Lan95,SMC96]. 4. PERFORMANCE Figure 5 lists the benchmark kernels used to generate the results presented here. Daxpy, copy, scale, and swap are from the BLAS (Basic Linear ....
....in a format that helps put the bandwidth percentages into perspective for this particular machine: the average number of processor cycles per stream access. The dashed lines labeled limit indicate limits to attainable bandwidth due to SMC startup costs and unavoidable page misses (see [McK95b,SMC96] for derivations of performance bounds). The solid lines indicate the performance of our hardware prototype. The dotted lines indicate the performance measured when using normal caching load instructions to access the stream data in the i860's own cache-optimized memory; and the dot-dash lines ....
[Article contains additional citation context not shown here]
S.A. McKee, "Maximizing Memory Bandwidth for Streamed Computations", Ph.D. Dissertation, University of Virginia, Department of Computer Science, May 1995. http://www.cs.virginia.edu/research/techrep.html.
....practice; for uniprocessor systems, its performance is competitive with that of more sophisticated policies. More intelligent schemes are required to achieve good performance on computations involving streams with strides that do not hit all memory banks, and on multiprocessor systems in general [5]. Further details of the design, implementation, and testing of the initial SMC ASIC and daughterboard can be found in [6][7]. Figure 5 SMC ASIC Layout. Smarter Memory = Better Performance: Improving Effective Bandwidth for Streams DRAFT DO NOT DISTRIBUTE 9 4. Experimental Results For ....
....are varied, we present a tiny fraction of these results here. We then present detailed measurements for the first SMC design we fabricated. Our other reports contain more results, along with discussions of how our approach to the memory bandwidth problem relates to those others have taken [5][7]. To facilitate comparison, results for all systems are presented as a percentage of peak system bandwidth, or that needed for the CPU to perform one memory access each processor cycle. For simplicity, the experiments presented here only consider unit-stride vectors of equal length. These ....
[Article contains additional citation context not shown here]
S.A. McKee, "Maximizing Memory Bandwidth for Streamed Computations", Ph.D. thesis, University of Virginia, May 1995. Available through http://www.cs.virginia.edu/techrep.
....practice; for uniprocessor systems, its performance is competitive with that of more sophisticated policies. More intelligent schemes are required to achieve good performance on computations involving streams with strides that do not hit all memory banks, and on multiprocessor systems in general [McK95b]. Further details of the design, implementation, and testing of the SMC ASIC and daughter board can be found elsewhere [McG94,Lan95]. 4. Performance Figure 6 lists the benchmark kernels used to generate the results presented here. Daxpy, copy, and scale are from the BLAS (Basic Linear Algebra ....
....into perspective for this particular machine: the average number of processor cycles per stream access. The dashed lines labeled attainable bandwidth indicate performance limits due to SMC startup costs, unavoidable page misses, or the cost of moving data between the SMC and CPU chips (see [McK95b] for derivations of performance bounds), and the solid lines indicate the performance of our access ordering hardware. The dotted lines indicate the performance measured when using normal caching load instructions to access the stream data in the i860's own cache-optimized memory; and the ....
[Article contains additional citation context not shown here]
S.A. McKee, "Maximizing Memory Bandwidth for Streamed Computations", Ph.D. Dissertation, University of Virginia, Department of Computer Science, May 1995.
....number of page misses that a computation must incur can be calculated from the FIFO depth (f), the number of streams (s), the number of vectors (v), the number of interleaved modules (m) in the memory system, and the DRAM access costs. See our other publications for the details of these formulas [McK95b,McK95c]. Briefly, the fraction of page misses, p, for a multiple vector computation is bounded by . The maximum percentage of peak bandwidth for the computation can then be calculated as: Figure 11 shows the net effect of these competing performance factors for a vector axpy (vaxpy) computation on ....
S.A. McKee, "Maximizing Memory Bandwidth for Streamed Computations", Ph.D. Dissertation, University of Virginia, May, 1995.
No context found.
S. A. McKee, "Maximizing memory bandwidth for streamed computations," Ph.D. Thesis, School of Engineering and Applied Science, University of Virginia, May 1995.
No context found.
S. A. McKee, "Maximizing memory bandwidth for streamed computations," Ph.D. Thesis, School of Engineering and Applied Science, University of Virginia, May 1995.
No context found.
S. A. McKee, "Maximizing memory bandwidth for streamed computations," Ph.D. Thesis, School of Engineering and Applied Science, University of Virginia, May 1995.
No context found.
S. A. McKee, "Maximizing memory bandwidth for streamed computations," Ph.D. Thesis, School of Engineering and Applied Science, University of Virginia, May 1995.
No context found.
S.A. McKee, Maximizing Memory Bandwidth for Streamed Computations, doctoral thesis, Dept. of Computer Sci., Univ. of Virginia, 1995.