A. McCalpin. Memory Bandwidth and Machine Balance in Current High Performance Computers. IEEE Technical Committee on Computer Architecture newsletter, Dec. 1995.
....the peak bandwidth of microprocessors is typically between one and two Gbyte/s. The peak bandwidth, however, is a theoretical value which is very hard to achieve in real life. Another measure of bandwidth is the sustainable bandwidth of user programs, which is determined with the STREAM benchmark [McC95]. The STREAM benchmark program accesses data in a way which is advantageous for memory systems. Thus, the memory bandwidth achieved with it can be seen as the maximally achievable user program memory bandwidth. Figure 2.2 summarizes the results of the STREAM benchmark on a HP SPP 2200 Convex ....
J.D. McCalpin. Memory Bandwidth and Machine Balance in Current High Performance Computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, December 1995.
....cantly lower than the predicted value of four to seven times the read latency. This indicates active bus snooping. The all-to-all communication again reveals a bottleneck of the HP N Class. 3 Bandwidth. To measure the memory bandwidth, the access patterns from the well-known stream benchmarks [2] are used. Reading from and writing to memory is measured with SUM and FILL respectively. COPY reads and writes at the same time, while DAXPY performs the operation a(i) = a(i) + q*b(i) with two loads and one store. The lengths of the vectors a and b are chosen such that all operations run out of ....
John D. McCalpin. Memory bandwidth and machine balance in current high performance computers. IEEE TCCA, Dec. 1995. http://www.cs.virginia.edu/stream.
....machines are built with the ability to scale to large numbers of nodes. [Figure 1: Performance on the TRIAD operation for several supercomputers (source: [8]). Axes: Num. Processors versus STREAM TRIAD Performance (GB/s). Systems: SX 4, T932, J932, Cray T3E, DEC8400, Origin2000, SPP 1600.] Therefore, at their best, by using parallel programming and compilation techniques the processing power of each processor can be magnified by orders of magnitude in a large parallel system. On the other hand, some problems are difficult to program in parallel or have memory access ....
....or have memory access patterns that make distributed memory parallel systems less effective than their vector counterparts. To illustrate the importance of memory systems, figure 1 presents sustained performance of several PVPs, MPPs, and SMPs. This figure plots data gathered using the STREAM [8] benchmark. In particular, we present the performance of the TRIAD operations (performance is measured in Gigabytes per second). The core of the TRIAD benchmark consists of the following operation: a(i) = b(i) + q * c(i). Figure 1 shows that even when using large numbers of processors, ....
John D. McCalpin. Memory Bandwidth and Machine Balance in Current High Performance Computers. IEEE TCCA Newsletter, December 1995.
....dirty exclusive lines. Finally, Hristea et al. used the microbenchmarks to evaluate NUMA latencies and bandwidths; they present results on the Origin 2000 where the requestor is both at the home node and one hop away. A popular microbenchmark which evaluates pipelined memory bandwidth is STREAM [27]. The kernels in this microbenchmark use four typical operations used by scientific codes written in Fortran: copying an array to another array, multiplying all elements of an array by a fixed value, computing a sum of two arrays, and computing a sum of two arrays where the second array is ....
....of the CPU and memory, but also the type of coherency transaction. For example, accessing an unowned cache line will have a much lower cost than pulling the cache line out of a remote processor's cache. Various microbenchmarks measure memory latencies and bandwidths: the STREAM benchmark [27], the memory latency kernel in the lmbench suite [28], and the microbenchmark used by Hristea et al. [19]. However, there are several problems when trying to use these codes to systematically evaluate a large Origin 2000 system. First, these are all separate executables with different modes of ....
MCCALPIN, J. D. Memory bandwidth and machine balance in current high performance computers. IEEE Technical Committee on Computer Architecture Newsletter (December 1995).
....an upper bound based only on the total memory transfer requirements of the algorithm at memory bus speeds, we obtain a much higher upper bound. In this section, we explore whether these bandwidth numbers can be obtained by examining several simpler benchmarks, known as the STREAM benchmarks [21, 20], and show that they are not. Our conclusion is that our model, which charges the full memory latency to fill the first element of each cache line, is a better match to actual performance than anything based on the peak memory bus bandwidth. As an example, consider the time to load an uncached ....
....it is natural to ask whether the full latency model is justified for streaming applications in practice, given the gap between the effective bandwidth in the full latency model and peak memory bus bandwidth. The standard benchmark for assessing sustainable memory bandwidth is the STREAM benchmark [21, 20]. STREAM consists of four vector kernels operating on long (out of cache) vector operands. We ran the STREAM benchmark on our four evaluation platforms, and wrote several additional kernels intended to mimic the access patterns characteristic of sparse matrix vector multiply. All of the kernels ....
J. D. McCalpin. Memory bandwidth and machine balance in current high performance computers. Newsletter of the IEEE Technical Committee on Computer Architecture, December 1995. http://tab.computer.org/tcca/NEWS/DEC95/DEC95.HTM.
....on scientific and engineering workloads, leading to design decisions optimized for these workloads. For example, commodity processors are primarily optimized to perform well on the SPEC benchmark suite [113], and system designs are focused on scientific and engineering benchmarks such as STREAMS [76] and SPLASH-2 [128]. It is not clear if these design decisions are suitable even with emerging media processing and database workloads. Given the growing importance of these emerging workloads, it becomes particularly important to re-evaluate key system design decisions in the context of these ....
....be very hard. Given all these challenges in studying emerging applications, most uniprocessor studies in the architecture community tend to focus on the SPEC technical benchmark suite [113], while most multiprocessor studies focus mainly on scientific and engineering benchmarks such as STREAMS [76] and SPLASH-2 [128]. These trends have led to:
• A lack of detailed quantitative understanding of the behavior of emerging workloads on state of the art systems.
• A lack of clear consensus on the key performance challenges and best architectural solutions for future systems.
Lack of ....
[Article contains additional citation context not shown here]
John D. McCalpin. Memory Bandwidth and Machine Balance in Current High Performance Computers. In IEEE Technical Committee on Computer Architecture Newsletter, Dec 1995.
....and gives a clearer presentation of the empirical data. Keywords: Compute balance, loop skewing, machine balance, memory locality, scalable locality, storage transformation. 1 Introduction. Microprocessor speed has been growing exponentially faster than main memory speed in the recent past [McC95]. The possibility that this trend may continue raises the question of whether or not there is an upper limit on useful processor speed: at some point, will memory bandwidth limits make additional gains in processor speed irrelevant? In this paper, we define scalable locality, which corresponds to ....
....remarks in Section 8. 2 Machine Balance, Compute Balance, and Scalable Locality. In the discussion that follows, we consider the relative speeds of a processor and memory system in light of the demands made of each when running a given piece of code. We use the term machine balance [CCK88, McC95] to refer to the ratio of the maximum sustainable rate at which a processor can perform floating point arithmetic (typically for data in registers) to the maximum sustainable transfer rate for unit-stride accesses for a memory system (in units of floating point values per second). We wish to ....
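The definition of machine balance quoted in the excerpt above can be illustrated with a minimal sketch. The function name and the sample figures below are hypothetical and are not taken from any of the cited papers; the sketch only assumes 8-byte floating point values when converting a byte rate to a value rate.

```python
# Minimal sketch of the machine balance ratio described in the excerpt:
# sustainable floating point rate divided by sustainable unit-stride
# transfer rate, with the latter expressed in floating point values/s.
# All numbers used with this function are hypothetical illustrations.

WORD_BYTES = 8  # assumption: 64-bit floating point values


def machine_balance(flops_per_second, bytes_per_second):
    """Return floating point operations per floating point value moved."""
    values_per_second = bytes_per_second / WORD_BYTES
    return flops_per_second / values_per_second


# Hypothetical machine: 1 Gflop/s sustained arithmetic, 800 MB/s memory.
example_balance = machine_balance(1e9, 800e6)
print(example_balance)
```

A larger value means the processor can perform more arithmetic per value delivered by memory, so code needs more reuse of each loaded value to keep the processor busy.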
John D. McCalpin. Memory bandwidth and machine balance in current high performance computers. IEEE Technical Committee on Computer Architecture Newsletter, Dec 1995.
....variants. A related research area is dynamic compilation and program specialization, from its most abstract beginnings by Ershov [8] to more recent work, such as [7, 5, 18, 14]. System scoping: McCalpin's STREAM benchmark discovers the machine balance of an architecture via experimentation [20]. In addition to bandwidth, McVoy's and Staelin's lmbench determines a set of system characteristics, such as process creation costs, and context switching overhead [21]. Saavedra and Smith use microbenchmarks to experimentally determine aspects of the system [25]. Gustafson and Snell [15] develop ....
J. D. McCalpin. Memory bandwidth and machine balance in current high performance computers. In IEEE Computer Society Technical Committee on Computer Architecture Newsletter, Dec. 1995.
....each additional KB of working set size increases the context switch performance penalty of a thread by 5.57 µs. If it takes the processor 5.57 µs to read 1 KB of data, 179 MB can be read in one second. This number is in the same order of magnitude as the memory bandwidth that the Stream Benchmark [57] reports on the test machine: 315 MB/s. If we take the additional cost of a HLS context switch to be the difference between the median context switch time for the rebuilt Windows 2000 kernel and the median context switch time for the HLS time sharing scheduler, 4.35 µs, then this cost is exceeded ....
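The arithmetic in the excerpt above can be checked with a short sketch. It assumes the per-KB penalty is 5.57 microseconds and that KB and MB are decimal units (1000 and 1,000,000 bytes); both are assumptions made to reproduce the quoted 179 MB figure, and the helper name is hypothetical.

```python
# Worked check of the excerpt's estimate: reading 1 KB in 5.57 microseconds
# corresponds to roughly 179 MB per second.
# Assumptions: decimal units (1 KB = 1000 bytes, 1 MB = 1e6 bytes) and a
# penalty expressed in microseconds.

def bandwidth_mb_per_s(bytes_moved, seconds):
    """Convert a byte count and an elapsed time into MB/s."""
    return bytes_moved / seconds / 1e6


implied = bandwidth_mb_per_s(1000, 5.57e-6)
reported_stream = 315.0  # MB/s, the Stream figure quoted in the excerpt

print(round(implied, 1))
print(round(reported_stream / implied, 2))
```

The implied rate is within a factor of two of the quoted Stream result, which is what the excerpt means by the same order of magnitude.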
John D. McCalpin. Memory Bandwidth and Machine Balance in Current High Performance Computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, December 1995.
....lmbench includes a smaller, less complex benchmark that produces similar results. ttcp is a widely used benchmark in the Internet community. Our version of the same benchmark routinely delivers bandwidth numbers that are within 2% of the numbers quoted by ttcp. McCalpin's stream benchmark: [McCalpin95] has memory bandwidth measurements and results for a large number of high end systems. We did not use these because we discovered them only after we had results using our versions. We will probably include McCalpin's benchmarks in lmbench in the future. In summary, we rolled our own because we ....
....move twice as much memory as reported by this benchmark; less advanced architectures move three times as much memory: the memory read, the memory read because it is about to be overwritten, and the memory written. The bcopy results reported in Table 2 may be correlated with John McCalpin's stream [McCalpin95] benchmark results in the following manner: the stream benchmark reports all of the memory moved whereas the bcopy benchmark reports the bytes copied. So our numbers should be approximately one half to one third of his numbers. Memory reading is measured by an unrolled loop that sums up a series ....
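The relationship between the two reporting conventions described in the excerpt above can be sketched as follows. The function names are hypothetical, and the traffic factors of 2 and 3 simply follow the excerpt's own description (one read plus one write, or an extra read of the destination before it is overwritten).

```python
# Sketch of the two reporting conventions in the excerpt: a bcopy-style
# figure counts only the bytes copied, while a figure that counts all
# memory moved includes the read, the write, and (on some architectures)
# a read of the destination before it is overwritten.

def reported_bytes_copied_rate(bytes_copied, seconds):
    """Rate in bytes/s when only the copied bytes are counted."""
    return bytes_copied / seconds


def reported_memory_moved_rate(bytes_copied, seconds, reads_destination_first):
    """Rate in bytes/s when every byte moved over the memory bus is counted."""
    traffic_factor = 3 if reads_destination_first else 2
    return traffic_factor * bytes_copied / seconds


copied = reported_bytes_copied_rate(1_000_000, 0.01)
moved_two = reported_memory_moved_rate(1_000_000, 0.01, False)
moved_three = reported_memory_moved_rate(1_000_000, 0.01, True)
print(copied / moved_two, copied / moved_three)
```

For the same copy and the same elapsed time, the bytes-copied figure is one half or one third of the memory-moved figure, matching the range given in the excerpt.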
[Article contains additional citation context not shown here]
John D. McCalpin, "Memory bandwidth and machine balance in current high performance computers," IEEE Technical Committee on Computer Architecture newsletter, to appear, December 1995.
....This implementation decision forces ADAPT to run the entirety of every variant. Furthermore, ADAPT does not currently adapt to data set properties other than size. System scoping: McCalpin introduces the STREAM benchmark, which discovers the machine balance of an architecture via experimentation [77]. In addition to bandwidth, McVoy's and Staelin's lmbench determines a set of system characteristics, such as process creation costs, and context switching overhead [78]. Saavedra and Smith use microbenchmarks to experimentally determine aspects of the system [95]. Automation: Collberg ....
John D. McCalpin. Memory bandwidth and machine balance in current high performance computers. In IEEE Computer Society Technical Committee on Computer Architecture Newsletter, December 1995.
....of the pointer traversing loop on multiple processors. The total bandwidth available to the processors increases linearly, and the latency stays flat until the interconnect saturates. Figure 12. Latency versus bandwidth for multi-processor pointer chasing. 6.2 Bandwidth. The Stream benchmark [7] is the de facto standard memory bandwidth benchmark. It is available in both C and Fortran versions. Stream measures stride 1 memory bandwidth using four vector loops: copy, scale, add, and triad. Multiprocessor operations are done in parallel do for loops. These measurements were done using ....
John McCalpin, "Memory bandwidth and Machine Balance in Current High Performance Computers," IEEE Computer Society Technical Committee on Computer Architecture Newsletter, pp. 19-25, December 1995, http://www.computer.org/tab/tcca/news/dec95/dec95_mccalpin.ps, and http://www.cs.virginia.edu/stream/.
....peak memory bandwidth of 2 GByte/sec compared to its predecessor Digital PWS, which uses an Alpha 21164 chip and has a main memory bandwidth of roughly 900 MByte/sec (see Figure 3). The sustainable bandwidth of user programs, however, is still far away from that. For example, the STREAM bandwidth [16] for the Compaq XP1000 (500 MHz) is only 745 MByte/sec, and the achievable bandwidth for other architectures might be even much lower (see Figure 2). However, memory bandwidth is not the only cause for the performance gap between processor and main memory. The second source of this gap is the ....
J. D. McCalpin. Memory Bandwidth and Machine Balance in Current High Performance Computers. IEEE Computer Society Technical Committee on Computer Architecture Newsletter, December 1995.
....2.5 Speedup of a single Cray C90 processor against single Alpha 21164 processor on the SPECfp92 benchmark suite [AAW 96]. 2.6 STREAM benchmark kernels [McC95]. 3.1 T0 block diagram. 3.2 T0 pipeline structure. Each of the three vector functional units (VMP, VP0, and VP1) contains eight parallel pipelines. ....
....processor.

C     Copy kernel
      DO 30 j = 1,n
         c(j) = a(j)
   30 CONTINUE

C     Scale kernel
      DO 40 j = 1,n
         b(j) = scalar*c(j)
   40 CONTINUE

C     Sum kernel
      DO 50 j = 1,n
         c(j) = a(j) + b(j)
   50 CONTINUE

C     Triad kernel
      DO 60 j = 1,n
         a(j) = b(j) + scalar*c(j)
   60 CONTINUE

Figure 2.6: STREAM benchmark kernels [McC95]

2.5.1 STREAM Benchmark. The STREAM benchmark [McC95] measures the sustainable application memory bandwidth during long unit stride vector operations within the four FORTRAN kernels shown in Figure 2.6. Performance on this benchmark correlates well with performance measured on certain ....
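The four kernels named in the excerpt above can be sketched in a few lines of Python. This is an illustration only, not the benchmark itself: it reads the kernel bodies as c = a, b = scalar*c, c = a + b, and a = b + scalar*c, assumes 8-byte operands, and uses hypothetical names throughout. Interpreter overhead dominates the timings, so the rates it prints say nothing about real memory bandwidth.

```python
import time

WORD_BYTES = 8  # assumption: 64-bit floating point operands

# Words read or written per loop iteration, counted from the kernel bodies.
WORDS_PER_ITERATION = {"copy": 2, "scale": 2, "sum": 3, "triad": 3}


def run_kernels(n, scalar=3.0):
    """Run copy, scale, sum, and triad in order; return arrays and timings."""
    a = [1.0] * n
    b = [2.0] * n
    c = [0.0] * n
    timings = {}

    start = time.perf_counter()
    for j in range(n):
        c[j] = a[j]                      # copy
    timings["copy"] = time.perf_counter() - start

    start = time.perf_counter()
    for j in range(n):
        b[j] = scalar * c[j]             # scale
    timings["scale"] = time.perf_counter() - start

    start = time.perf_counter()
    for j in range(n):
        c[j] = a[j] + b[j]               # sum
    timings["sum"] = time.perf_counter() - start

    start = time.perf_counter()
    for j in range(n):
        a[j] = b[j] + scalar * c[j]      # triad
    timings["triad"] = time.perf_counter() - start

    return a, b, c, timings


def megabytes_per_second(kernel, n, seconds):
    """Convert an elapsed time into MB/s using the per-kernel word count."""
    return WORDS_PER_ITERATION[kernel] * WORD_BYTES * n / seconds / 1e6


if __name__ == "__main__":
    n = 100_000
    _, _, _, timings = run_kernels(n)
    for kernel, seconds in timings.items():
        print(kernel, round(megabytes_per_second(kernel, n, seconds), 1), "MB/s")
```

Counting two words for copy and scale and three for sum and triad is what lets a single elapsed time per loop be turned into a bandwidth figure.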
[Article contains additional citation context not shown here]
J. D. McCalpin. Memory bandwidth and machine balance in current high performance computers. IEEE Computer Society Technical Committee on Computer Architecture Newsletter, December 1995.
....ranking is an excellent representative from the class of irregular problems. Limiting our attention to distributed memory machines (and even more specifically to grids) is motivated by the fact that efficient shared memory machines appear not to be constructable far beyond their actual size (see [8] for a treaty on machine balance). Algorithms. On synchronous parallel computers equipped with a shared memory, PRAMs, the basic approach is pointer jumping. This technique can be used in a list ranking algorithm which runs in O(log N) time with O(N · log N) work on an EREW PRAM. ....
McCalpin, J.D., `Memory Bandwidth and Machine Balance in Current High Performance Computers,' IEEE Technical Committee on Computer Architecture Newsletter, pp. 19--25, 12-1995.
.... = 78; 40, producing a cache array of size 73K (note that O = 6 for Figure 4). Note that our cache requirement grows with (C/B)^d, where d is the dimensionality of the array. Thus, when arrays of many dimensions are time skewed for machines with high C/B (or machine balance [CCK88, McC95]), the cache array may not fit in the L1 cache. We will revisit this issue in Section 4.2. [Figure 5: Time Skewed Iteration Space for Five Point Stencil (axes i, j, t).] Figure 6: Single Tile for Parallel Time Skewing for 2-D Array, ....
John D. McCalpin. Memory bandwidth and machine balance in current high performance computers. IEEE Technical Committee on Computer Architecture Newsletter, Dec 1995.
.... peak performance as the data sizes grow beyond what can fit in the cache [26,27]. This holds for uni-processor machines as the extra memory traffic caused by the cache coherence mechanisms in shared memory multiprocessors introduces extra latency and further reduces their sustained bandwidth [28,29]. A comparison of single CPU dcopy performance in MB/s for unit strides is shown in Figure 3. There are some noteworthy observations about this comparison: The low performance in the large vector limit of the AlphaServer 8400 5/300, a shared memory multiprocessor based on the 300MHz 21164 ....
McCalpin J.D. Memory Bandwidth and Machine Balance in Current High Performance Computers. IEEE Technical Committee on Computer Architecture newsletter, December 1995.
....fundamental problems, list ranking is the most time consuming subroutine. Limiting our attention to distributed memory machines (and even more specifically to grids) is motivated by the fact that efficient shared memory machines appear not to be constructible far beyond their actual size (see [12] for a treaty on machine balance). More generally, we are convinced that for some years to come, it will be highly relevant to consider how to solve problems of size N P Delta M in minimal time on distributed memory computers consisting of P 1000 PUs, with M , 10MB M 1GB, internal memory ....
McCalpin, J.D., `Memory Bandwidth and Machine Balance in Current High Performance Computers,' IEEE Technical Committee on Computer Architecture Newsletter, pp. 19--25, 12-1995.
....(in time) of main memory accesses and floating point arithmetic have undergone a dramatic shift for microprocessor-based computers. One particular metric is the time required to load a word from a unit stride data stream relative to the time required to perform a floating point operation. As shown in [McC95], this ratio has gone from near 1 in 1990 to at least 20 in 1996, with an annual increase near 70% per year averaged across the industry. [Footnote: This work is supported by funds from Haverford College.] This increase in the cost of memory accesses relative to floating point arithmetic has had ....
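The growth figures in the excerpt above can be cross-checked with a compounding sketch. It assumes the annual increase is a percentage (70 percent per year) and that the ratio starts at exactly 1 in 1990; the function names are hypothetical.

```python
# Compounding check for the quoted trend: a ratio near 1 in 1990 growing at
# about 70 percent per year, compared with "at least 20" in 1996.
# Assumption: the starting ratio is exactly 1 and growth is compounded yearly.

def compound(start, annual_rate, years):
    """Value after compounding `annual_rate` (0.70 = 70%) for `years` years."""
    return start * (1.0 + annual_rate) ** years


def implied_annual_rate(start, end, years):
    """Constant yearly growth rate that takes `start` to `end` in `years`."""
    return (end / start) ** (1.0 / years) - 1.0


ratio_1996 = compound(1.0, 0.70, 6)
rate_for_20 = implied_annual_rate(1.0, 20.0, 6)
print(round(ratio_1996, 1), round(rate_for_20, 3))
```

Six years at 70 percent gives a ratio a little above 24, and reaching exactly 20 needs about 65 percent per year, so the two figures in the excerpt are mutually consistent.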
John D. McCalpin. Memory bandwidth and machine balance in current high performance computers. IEEE Technical Committee on Computer Architecture Newsletter, Dec 1995.
No context found.
A. McCalpin. Memory Bandwidth and Machine Balance in Current High Performance Computers. IEEE Technical Committee on Computer Architecture newsletter, Dec. 1995.
No context found.
J. D. McCalpin, "Memory Bandwidth and Machine Balance in Current High Performance Computers," IEEE Technical Committee on Computer Architecture Newsletter, December 1995.
No context found.
J. D. McCalpin. Memory bandwidth and machine balance in current high performance computers. Technical Committee on Computer Architecture (TCCA) Newsletter, Dec. 1995.
No context found.
John D. McCalpin, "Memory bandwidth and machine balance in current high performance computers," IEEE Technical Committee on Computer Architecture newsletter, December 1995.
No context found.
J. D. McCalpin, Memory Bandwidth and Machine Balance in Current High Performance Computers, IEEE Computer Society Technical Committee on Computer Architecture Newsletter, (1995). http://tab.computer.org/tcca/news/dec95/dec95.htm.
No context found.
John D. McCalpin. Memory Bandwidth and Machine Balance in Current High Performance Computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, December 1995. http://www.computer.org/tab/tcca/news/dec95/dec95_mccalpin.ps.