R. H. Saavedra and A. J. Smith. Measuring cache and TLB performance and their effect on benchmark run times. Technical Report CSD-93-767, Feb. 1993.
....of using loop restructuring, but array restructuring can be quite effective at creating sequential access patterns. The number of accesses to the remapped data is an order of magnitude higher than the size of the arrays. Thus remapping performs well for small bands, where the fixed setup cost of copying overwhelms the potential benefit, whereas copying performs well for large bands. The cost model correctly identified this behavior in most, but not all, cases. By correctly choosing when to apply copying (C) versus remapping (R), the cost-model-driven results achieved better performance ....
....model (the Abstract Machine Model) of program execution to characterize machine and application performance, and can predict with good accuracy the running time of a given benchmark on a given machine. However, this work does not consider the effects of the memory subsystem. Saavedra and Smith [14] model memory and TLB costs, but they use published miss ratio data rather than estimating cache miss rates. Ghosh et al. [4] introduce Cache Miss Equations (CMEs), a precise analytical representation of cache misses in a loop nest. This work is complementary to ours and can be used to further ....
R. H. Saavedra and A. J. Smith. Measuring cache and TLB performance and their effect on benchmark run times. IEEE Trans. on Computers, C-44(10):1223--1235, Oct. 1995.
....and caching policy. Fingerprinting also shares much in common with microbenchmarking. Specifically, both perform requests of the underlying system in order to characterize its behavior. For example, with simple probes in microbenchmarks, one can determine parameters of the memory hierarchy [2, 24], processor cycle time [28], and characteristics of disk geometry [26, 30]. In our view, the key difference between fingerprinting and microbenchmarking is that a fingerprint is used to discover the policy or algorithm employed by the underlying layer, whereas a microbenchmark is typically used to ....
R. H. Saavedra and A. J. Smith. Measuring Cache and TLB Performance and Their Effect on Benchmark Runtimes. IEEE Transactions on Computers, 44(10):1223--1235, 1995.
....did and the machine parameters we obtained. The parameters we obtained are shown in Table 3. To measure the size and miss penalty of the L1 and L2 caches, we used the LMbench tool set [14]. To measure the size and miss penalty associated with the TLB, we used the method described by Saavedra et al. [16]. The key idea in this method is to measure the time required for accessing arrays of increasing size, accessed using a stride that equals the page size. For computing cache coherence miss latency, we use a method similar to the one described by Hristea et al. [9]. Finally, we also computed the ....
....analysis of a wavefront application [19]. Micro benchmarking is a popular technique used as a basis for performance prediction. The initial work in this area focused on sequential programs and did not consider the memory hierarchy [17]. The work has since been extended to parallel machines [16, 22]. We combine micro benchmarking with probabilistic models to capture the effects of memory hierarchy, contention and locking. We have used several ideas from Hristea et al. [9] and Saavedra et al. [16] in our micro benchmarking work. Detailed analytical models of cache have been developed. ....
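The page-stride idea in the snippet above can be illustrated without timing real hardware. The following is a minimal sketch, not the cited authors' benchmark: it assumes a fully associative TLB with LRU replacement, and the names `tlb_misses_per_access`, `PAGE` and `ENTRIES` are hypothetical. It sweeps arrays of increasing size with a stride equal to the page size and reports the steady-state miss rate, which jumps once the array spans more pages than the TLB has entries.

```python
from collections import OrderedDict

def tlb_misses_per_access(array_bytes, page_bytes, tlb_entries, passes=4):
    """Simulate a fully associative LRU TLB (an assumption) and return the
    steady-state miss rate for a page-stride sweep over the array."""
    tlb = OrderedDict()  # page number -> None, least recently used first
    misses = accesses = 0
    for p in range(passes):
        for addr in range(0, array_bytes, page_bytes):  # stride == page size
            page = addr // page_bytes
            hit = page in tlb
            if hit:
                tlb.move_to_end(page)        # mark as most recently used
            else:
                tlb[page] = None
                if len(tlb) > tlb_entries:
                    tlb.popitem(last=False)  # evict the least recently used page
            if p > 0:  # the first pass only warms up the TLB
                accesses += 1
                misses += 0 if hit else 1
    return misses / accesses

PAGE = 4096    # hypothetical page size in bytes
ENTRIES = 64   # hypothetical number of TLB entries
rates = {pages: tlb_misses_per_access(pages * PAGE, PAGE, ENTRIES)
         for pages in (16, 32, 64, 65, 128)}
```

In a real measurement the miss rate is not observed directly; the jump shows up as an increase in time per access, and the size of that increase estimates the miss penalty.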
[Article contains additional citation context not shown here]
R. Saavedra and A. Smith. Measuring cache and TLB performance and their effect on benchmark run times. IEEE Transactions on Computers, 44(10):1223--1235, 1995.
....of the underlying system; we believe that this knowledge should be encapsulated in ICLs, so that these techniques can be used by all programmers. However, gray box systems go one step further by combining knowledge with measurements and observations, a technique commonly found in microbenchmarks [3, 33, 39, 40, 42]. We believe there exists a strong duality between microbenchmarks and gray box techniques. First, ICLs often require that underlying components be benchmarked to configure internal thresholds and parameters. Second, understanding the behavior of ICLs requires understanding the behavior of the OS; ....
....the necessary interfaces, many microbenchmarks have been developed that exploit gray box knowledge to allow the user to infer these characteristics. For example, by measuring the completion time of memory accesses with different patterns, one can determine many parameters of the memory hierarchy [3, 33]; by finding the greatest common divisor of the execution time of different expressions, one can determine processor cycle time [39]; by measuring the access time of carefully designed requests, low level characteristics of disk geometry can be inferred [40, 42]. Although gray box ICLs bear ....
R. H. Saavedra and A. J. Smith. Measuring Cache and TLB Performance and Their Effect on Benchmark Runtimes. IEEE Transactions on Computers, 44(10):1223--1235, 1995.
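The greatest-common-divisor idea mentioned above can be sketched as follows. This is an idealized illustration, assuming noise-free execution times that are exact integer multiples of the cycle time, expressed in picoseconds; the function name `estimate_cycle_time_ps` and the sample values are hypothetical and do not come from the cited work.

```python
from functools import reduce
from math import gcd

def estimate_cycle_time_ps(execution_times_ps):
    """Return the GCD of integer execution times given in picoseconds.

    Assumes each time is an exact multiple of the cycle time; real
    measurements are noisy and would need rounding or filtering first.
    """
    return reduce(gcd, execution_times_ps)

# Hypothetical timings: 3, 5 and 8 cycles on a machine with a 2500 ps cycle.
times_ps = [7500, 12500, 20000]
cycle_ps = estimate_cycle_time_ps(times_ps)
```

The estimate is only an upper bound on the cycle time: if every measured expression happens to take an even number of cycles, the GCD comes out at twice the true value, so a varied set of expressions matters.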
....with hbench:OS, which improved some methodological and practical issues of lmbench. We build directly on this line of research, comparing benchmarks similar to those of hbench:OS with comparable Java versions. Other related work, pre-Java, includes microbenchmarks presented by Saavedra and Smith [15]. This work presents microbenchmarks to evaluate memory system performance, exposing performance at the various levels of a computer system's memory hierarchy. A number of Java benchmark suites have appeared. The Jmark suite [16] provides compiler CPU processor tests, as well as a set of ....
Rafael H. Saavedra and Alan J. Smith. Measuring cache and TLB performance and their effect on benchmark run times. IEEE Transactions on Computers, 44(10):1223--1235, Oct. 1995.
....model (the Abstract Machine Model) of program execution to characterize machine and application performance, and can predict with good accuracy the running time of a given benchmark on a given machine. However, this work does not consider the effects of the memory subsystem. Saavedra and Smith [14] model memory and TLB costs, but they use published miss ratio data rather than estimating cache miss rates. Ghosh et al. [4] introduce Cache Miss Equations (CMEs), a precise analytical representation of cache misses in a loop nest. CMEs have some drawbacks as they also cannot handle multiple loop ....
R. H. Saavedra and A. J. Smith. Measuring cache and TLB performance and their effect on benchmark run times. IEEE Trans. on Computers, C-44(10):1223--1235, Oct. 1995.
.... architecture via experimentation [20]. In addition to bandwidth, McVoy's and Staelin's lmbench determines a set of system characteristics, such as process creation costs and context switching overhead [21]. Saavedra and Smith use microbenchmarks to experimentally determine aspects of the system [25]. Gustafson and Snell [15] develop a scalable benchmark, HINT, that can accurately predict a machine's performance via its memory reference capacity. 8 Conclusion In an ideal world, static analysis would not only suffice, but would not limit the universe of approachable input codes. Unfortunately, ....
R. H. Saavedra and A. J. Smith. Measuring cache and TLB performance and their effect on benchmark run times. IEEE Trans. Comput., 44(10):1223--1235, Oct. 1995.
....architectures incur a large degree of inefficiency. This inefficiency is due both to the structure of RISC CPUs, in particular to unnecessary additional instructions and to low utilization of functional units, and to the hierarchical structure of memory systems, in particular to cache misses [13]. Nevertheless, in many low- and medium-level IPPR tasks [3] the problem of cache misses is less critical than in other application fields [2] and the number of cache misses is quite near to the number of compulsory misses [11]. This fact is due to two main reasons: 1. the reuse of data; in many ....
Saavedra R. H. and Smith A. J., Measuring Cache and TLB Performance and Their Effect on Benchmark Runtimes, IEEE Transactions on Computers, vol. 44, no. 10, pp. 1223-1235, Oct. 1995.
....systems image processing can still be considered compute-bound. As a consequence, improvements in processing speed (originated for example by a higher degree of parallelism) will yield improvements of an equal factor in applications. 1 INTRODUCTION The rapid progress of RISC technology [13][41][33][46] has recently given a new impulse to image processing and pattern recognition (IPPR). The availability of a computing power of the same order of magnitude as that previously delivered by massively parallel computers on low cost RISC systems presently makes software based IPPR effective and ....
....is that primary memory accesses are activated only at the occurrence of cache misses, that is exactly at the moment at which the missing data items are needed and therefore the CPU remains idle waiting for memory access completion. Prefetching techniques were proposed to reduce such a latency [33]. In IPPR, where memory accesses are known in advance, it would be possible to plan the loading of cache lines in advance to eliminate the latency of primary memory access, thus overlapping primary memory accesses with other system activities. This behaviour should be supported by proper ....
Saavedra R. H. and Smith A. J., Measuring Cache and TLB Performance and Their Effect on Benchmark Runtimes, IEEE Transactions on Computers, vol. 44, no. 10, pp. 1223-1235, Oct. 1995.
.... via experimentation [77]. In addition to bandwidth, McVoy's and Staelin's lmbench determines a set of system characteristics, such as process creation costs and context switching overhead [78]. Saavedra and Smith use microbenchmarks to experimentally determine aspects of the system [95]. Automation: Collberg develops a strategy for automatically generating a compiler back end [27]. His system discovers many aspects of the underlying system, such as instruction set syntax and semantics, and instruction timings, via experimentation. With this knowledge, his system generates a ....
Rafael H. Saavedra and Alan Jay Smith. Measuring cache and TLB performance and their effect on benchmark run times. IEEE Transactions on Computers, 44(10):1223--1235, October 1995.
....touched by the experiment and 2) the distance (stride) between two consecutive addresses sent to the cache or TLB. Saavedra's micro benchmarks work under the following conditions: instruction caches and data caches are separate, and the lowest available address bits are used to select the cache set [9]. Li and Thomborson extended Saavedra and Smith's research on designing micro benchmarks to measure data cache parameters. Unlike Saavedra and Smith, Li and Thomborson characterized read accesses separately from write, and their benchmarks were valid for a wider range of address mapping functions. ....
....size N, the access stride s, the primary cache capacity C_1, the cache line size B_1, the cache associativity A_1, and the number of sets S_1, we identify six cases of operations. These are summarised in Table 1. Most but not all of these cases were previously identified by Saavedra and Smith [9]. Case 1.1: 1 ≤ 4N ≤ C_1 and 1 ≤ s. The whole array fits in the cache. Regardless of the stride s, there are no cache misses after the array is loaded into the cache for the first time. The execution time per iteration of our inner loop is thus some constant T_1 = T_0. Case 1.2: C 1 4N C 1 ....
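Case 1.1 above (the whole array fits in the cache, so no misses occur after the first load, whatever the stride) can be checked with a small simulation. This is a minimal sketch under stated assumptions, not the cited authors' benchmark: the cache is set-associative with LRU replacement, the low-order line-address bits select the set, and elements are 4 bytes wide (inferred from the term 4N in the snippet). The function name `second_pass_misses` and all parameter values are hypothetical.

```python
def second_pass_misses(n_elems, stride, cache_bytes, line_bytes, assoc, elem_bytes=4):
    """Sweep the array twice and count the misses in the second sweep only."""
    num_sets = cache_bytes // (line_bytes * assoc)
    sets = [[] for _ in range(num_sets)]  # each set: line numbers, LRU first
    misses = 0
    for sweep in range(2):
        for i in range(0, n_elems, stride):
            line = (i * elem_bytes) // line_bytes
            ways = sets[line % num_sets]  # low-order bits select the set
            if line in ways:
                ways.remove(line)         # hit: re-appended below as most recent
            else:
                if sweep == 1:
                    misses += 1
                if len(ways) == assoc:
                    ways.pop(0)           # evict the least recently used line
            ways.append(line)
    return misses

# Hypothetical 8 KB, 2-way cache with 32-byte lines.
C1, B1, A1 = 8192, 32, 2
fits = [second_pass_misses(C1 // 4, s, C1, B1, A1) for s in (1, 2, 8, 64)]
too_big = second_pass_misses(2 * C1 // 4, 8, C1, B1, A1)
```

With these parameters every stride yields zero second-pass misses while the array occupies exactly the cache capacity. Doubling the array and stepping one cache line at a time makes every access miss under LRU.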
Saavedra R.H. and A. J. Smith, "Measuring cache and TLB performance and their effect on benchmark run times," IEEE Trans. Computers 44(10): 1223-1235, 1995.
....the program has no control over which block is removed when a new block arrives. 2.3 Penalty cost model We use a weighted version of the cost model presented in [Katajainen and Traff 1997]. Alternatively, one can see the cost model as a simplification of the Fortran-based model discussed in [Saavedra and Smith 1995]. 1) The cost of all pure C operations is assumed to be the same. 2) Each load and store operation has an extra penalty i if it incurs a miss at level i of the memory hierarchy, i ∈ {1, 2, ...}. Now, if a program executes n pure C operations and incurs m_i misses at memory level ....
....worthwhile to make memory access patterns as local as possible if a speedup by a constant factor is of interest. A micro benchmark similar to ours could be designed to measure other relevant performance characteristics of the memory hierarchy. An interested reader should consult, e.g., the study by Saavedra and Smith [1995], which describes a collection of experiments that can be used to compute many memory hierarchy parameters, including the size of the cache(s) and the TLB, the time needed to satisfy a cache or TLB miss, and the cache and TLB associativity. Table 1. Estimation of and in four different ....
Saavedra, R. H. and Smith, A. J. 1995. Measuring cache and TLB performance and their effect on benchmark runtimes. IEEE Transactions on Computers 44, 1223--1235.
....The Abstract Machine Model The Abstract Machine Model was first described in [14]. It is a model for representing and analyzing Fortran programs. Compared to the PHiPAC metrics, the AMM contains a larger set of software patterns to cover calls to the mathematical library. The model was extended in [13] to deal with the memory hierarchy contention. However, it does not charge for the pipeline stalls. By dropping these, the AMM model does not facilitate the development of high performance algorithms. As with other high level language metrics [15] the AMM metrics were designed to predict the ....
Rafael H. Saavedra and Alan Jay Smith. Measuring Cache and TLB Performance and Their Effect on Benchmark Run Times. ACM Transactions on Computer Systems, 14, November 1996.
....$\mathit{CycleTime} \times \sum_{i}^{\#levels} (M_i \times C_i)$ (3) where $M_i$ represents the number of accesses that miss in the i-th level of the memory hierarchy, and $C_i$ represents the penalty (in machine cycles) for a miss in the i-th level of the memory hierarchy. $C_i$ is computed using microbenchmarking, as in [19]. The compiler computes the number of array references in each statement and aggregates the values across blocks of statements, loops, and procedures. We do not include scalar references in this version of the model because, as mentioned above, we assume that the register files and the first level ....
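The memory-cost term in equation (3) of the snippet above, the cycle time multiplied by the sum over memory levels of misses times miss penalty, can be evaluated as in the following sketch. The left-hand side of the equation is elided in the snippet, so only this product is computed here. The function name `memory_cost_seconds` and the sample numbers are hypothetical, not values from the cited work.

```python
def memory_cost_seconds(cycle_time_s, misses, penalties_cycles):
    """Return cycle_time * sum(M_i * C_i) over the levels of the hierarchy.

    misses[i] is M_i, the number of accesses that miss at level i;
    penalties_cycles[i] is C_i, the miss penalty at level i in machine cycles.
    """
    if len(misses) != len(penalties_cycles):
        raise ValueError("one miss count and one penalty are needed per level")
    return cycle_time_s * sum(m * c for m, c in zip(misses, penalties_cycles))

# Hypothetical two-level hierarchy on a machine with a 2 ns cycle time.
cost = memory_cost_seconds(2e-9, [1_000_000, 50_000], [10, 100])
```

In the snippet the penalties C_i come from microbenchmarking and the miss counts M_i from the compiler's analysis of array references; here both are simply passed in as lists.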
....there have been a few attempts to predict performance at compile time, mostly for applications running on multiprocessor systems. We summarize a few of these attempts, and explain how our work makes use of those results and how our work contrasts with some of the other methods. Saavedra et al. [19-21] have done extensive work in the area of performance prediction for uniprocessors. In [21] the authors present the microbenchmarking concept to measure architectural parameters. We use the same microbenchmarking approach to estimate operation costs (including intrinsic functions) and cache ....
[Article contains additional citation context not shown here]
R. Saavedra and A. Smith. Measuring cache and TLB performance and their effect on benchmark run times. IEEE Transactions on Computers, 44(10):1223--1235, October 1995.
No context found.
R. H. Saavedra and A. J. Smith. Measuring cache and TLB performance and their effect on benchmark run times. Technical Report CSD-93-767, Feb. 1993.
No context found.
R. H. Saavedra and A. J. Smith. Measuring Cache and TLB Performance and Their Effect on Benchmark Runtimes. IEEE Transactions on Computers, 44(10):1223--1235, 1995.
No context found.
R. Saavedra and A. Smith, "Measuring Cache and TLB Performance and Their Effect on Benchmark Run Times," IEEE Trans. Computers, vol. 44, pp. 1223--1235, Oct. 1995.
No context found.
Rafael H. Saavedra and Alan Jay Smith, "Measuring Cache and TLB Performance and Their Effect on Benchmark Runtimes", IEEE Transactions on Computers, vol. 44, no. 10, pp. 1223-1235, 1995.
No context found.
R. Saavedra, A. Smith, Measuring cache and TLB performance and their effect on benchmark run times, IEEE Transactions on Computers 44 (10) (1995) 1223--1235.
No context found.
R. Saavedra and A. Smith. Measuring cache and TLB performance and their effect on benchmark run times. IEEE Transactions on Computers, 44(10):1223--1235, 1995.
No context found.
R. H. Saavedra and A. J. Smith. Measuring cache and TLB performance and their effect on benchmark run times. IEEE Trans. on Computers, C-44(10):1223--1235, Oct. 1995.
No context found.
R.H. Saavedra and A.J. Smith, "Measuring cache and TLB performance and their effect on benchmark runtimes," IEEE Transactions on Computers, 44 (10), pp. 1223-1235, October 1995.
No context found.
R. H. Saavedra and A. J. Smith, "Measuring cache and TLB performance and their effect on benchmark run times," IEEE Trans. on Computers, vol. C-44, pp. 1223--1235, Oct. 1995.
No context found.
R. H. Saavedra and A. J. Smith. Measuring Cache and TLB Performance and Their Effect on Benchmark Runtimes. IEEE Transactions on Computers, 44(10):1223--1235, 1995.
No context found.
R.H. Saavedra and A.J. Smith, "Measuring cache and TLB performance and their effect on benchmark runtimes," IEEE Transactions on Computers, 44 (10), pp. 1223-1235, October 1995.