A. R. Lebeck and D. A. Wood, "Cache Profiling and SPEC Benchmarks: A Case Study," IEEE Computer, pp. 15--26, Oct. 1994.
....be provided by traces or by a machine simulator like SimOS [RHWG95] or SimpleScalar [BA97]. The simulation systems described so far produce summarized cache information of the whole execution of a program (trace). The output typically consists of cache miss and hit rate statistics. The CPROF [LW94] system couples a trace-based uniprocessor cache simulator with source code annotation. Thus, CPROF identifies the source lines and data structures that cause frequent cache misses. Therefore, the source lines and data structures are annotated with the appropriate cache miss statistics which are ....
A. Lebeck and D. Wood. Cache Profiling and the SPEC Benchmarks: A Case Study. IEEE Computer, 27(10):15--26, October 1994.
....There are many examples of using profiling effectively to make static changes to a program that optimize its reference behavior. Profiling with MIN analysis has been used to annotate instructions for prefetching [Abraham 93]. Other techniques restructure the layout of references, both data [Lebeck 94, Calder 98] and instructions [Pettis 90, Hashemi 97]. However, the majority of the restructuring techniques have the goal of eliminating conflict misses in direct-mapped caches while my interest is in caches with larger set associativity. Variations on LRU [Smith 82] have been extensively ....
A. R. Lebeck and D. A. Wood. Cache Profiling and the SPEC Benchmarks: A Case Study. IEEE Computer, 27(10):15--26, October 1994.
....types of misses and join them in a single type called interference misses because the number of this type of misses is affected by the locality. Throughout this paper, the cache misses are measured using trace driven simulation. There are a variety of commercial cache simulators like DineroIII [15]. Nevertheless, we have built our own simulator avoiding the storage of trace files. The simulator is actually built to trace only some sparse algebra codes. This simple simulator [13] generates the references to memory and feeds them directly to a cache simulator module whose configuration (cache ....
A. R. Lebeck and A. D. Wood. Cache profiling and the SPEC benchmarks: A case study. IEEE Computer, pages 15--26, 1994.
....an instruction cache of 512 Bytes and a data cache of 512 Bytes, each of which with block size equal to 16 Bytes, has been used as execution platform. For translating the code the DLX compiler dlxcc has been used. In addition to this tool, the DLX simulator dlxsim and the cache simulator dinero [7] were used for taking the measurements that will be presented in this section. As a first step the execution of the initial form of the code is simulated. The results can be seen in table 2. The number of instruction cache misses is almost 15 times larger than the number of data cache misses and ....
Alvin R. Lebeck, David A. Wood, "Cache Profiling and the SPEC Benchmarks: A Case Study", IEEE Computer,
....three SPEC benchmark suites while the primary goal of this study was to present a comprehensive analysis of the load behavior, from the perspective of the memory subsystem. The complete set of results for this study can be found in [Yi02]. Finally, a number of profiling tools have been proposed [Lebeck94, Anderson97, Mowry97, and Thornock00] to monitor the program caching behavior (cache bottlenecks, hot misses, functional unit contention, etc.). All of these profiling tools use sampling and are memory bound since they require substantial amounts of memory to store the sampled data before it is processed. In our study, we use a ....
A. Lebeck and D. Wood; "Cache Profiling and the SPEC Benchmarks: A Case Study"; IEEE Computer, Vol. 27, No. 10, Pages 15-26, October 1994.
....on HPF to insert instrumentation code that extracts a data trace of array references. The trace is later exposed to a cache simulator before miss correlations are reported [22]. This approach shares its goal of cache correlation with our work, and we are considering collaborative efforts. CProf [19] is a similar tool that relies on post-link-time binary editing through EEL [17, 18] but cannot handle shared library instrumentation or partial traces. Lebeck and Wood also applied binary editing to substitute instructions that reference data in memory with function calls to simulate caches ....
A. R. Lebeck and D. A. Wood. Cache profiling and the SPEC benchmarks: A case study. Computer, 27(10):15--26, Oct. 1994.
....by different sources. A related problem arises when a standard benchmark is the primary mechanism against which a system is evaluated. To show performance improvements, system architects will often tune their system specifically to the benchmark in question (e.g. optimizing memory cache design [LW94a]). Unless the benchmark is particularly representative of real world workloads, such optimizations result in even less confidence in the overall applicability of the evaluation results. Thus, a valid question is whether a perfect evaluation methodology is possible. In an ideal evaluation ....
Alvin R. Lebeck and David A. Wood. Cache profiling and the SPEC benchmarks: A case study. IEEE Computer, 27(10):15--26, 1994.
....space, by subdividing the program according to source level application data structures. Because of the inherent link between memory performance and the access patterns of particular data structures, these statistics can be crucial to reasoning about memory behavior. Echoing this message, CPROF [9], developed independently, also implemented data oriented statistics. Data oriented statistics are especially useful in cases in which a particular data structure may constitute a memory bottleneck, but accesses to it are distributed across several procedures. For example, in Pthor (a SPLASH ....
A. R. Lebeck and D. A. Wood. Cache Profiling and the SPEC Benchmarks: A Case Study. Technical report, Univ. of Wisconsin Computer Sciences Department, Mar. 1992.
....by cache coherence operations). The main data display shows the 2-D matrix of bins sorted by code and data units so the most expensive cell appears in the top left corner. Each bin can be examined in more detail. Data at a finer granularity on a per source reference basis is not available. CPROF [10] is a similar simulator, but is based on instrumentation of binary code. In addition, it refines the interference miss category into conflict and capacity misses, thus helping to distinguish cases where data realignment can help from those where there is just too much data for the cache. The ....
A. Lebeck and D. Wood. Cache profiling and the spec benchmarks: A case study. IEEE Computer, October 1994.
....accesses of a program in execution. It is also influenced by the cache hardware parameters, such as the cache size, cache line size, set associativity and replacement policy. There are many tools to measure the cache miss ratio, either through cache simulation [10] such as Dinero [6], Cprof [8], or sampling from hardware counters like VTune [7, 1]. There are also compiler techniques to estimate the cache miss ratio analytically [4]. In order to minimize cache misses, transformations to improve the data locality [9, 12] such as loop tiling, loop fusion, array padding and array alignment, ....
A. Lebeck and D. Wood. Cache profiling and the spec benchmarks - a case-study. COMPUTER, 27(10):15--, Oct. 1994.
....gaps filled with useless data that are characteristic for padding are no longer required. Kandemir et al. [KCR 98] target exclusively at spatial locality. Merging is a simple but effective way to prevent the arrays from interfering with each other, usually performed by the programmer [HP96, LW94]. The method we present enables the compiler to intermix arrays in a systematic way that goes beyond merging. We have already employed the meeting graph in previous work [Gen98] where we proposed a general way of deriving data layouts from array indexes. In this approach, coloring was performed ....
Alvin R. Lebeck and David A. Wood. Cache profiling and the SPEC benchmarks: A case study. Computer, 27(10):15--26, October 1994.
....However these techniques are very slow (usually several orders of magnitude). For instance, the slowdown exhibited by all simulators surveyed in [22] is in the range of 45-6250. There are some innovative methods that have been proposed with the objective of reducing the exhibited slowdown [14], [12], [24]. However, these methods provide little information (usually only miss ratio); that is, they trade information for speed. There are other tools based on hardware counters (e.g. [1]) provided by some microprocessors. These tools are fast and accurate. However they have no flexibility since ....
A.R. Lebeck and D.A. Wood. Cache profiling and the spec benchmarks: A case study. IEEE Computer, 27(10):15--26, Oct. 1994.
....for the SPEC 92 benchmarks [4]. They concluded that SPEC benchmarks may not represent actual performance of a time-shared, multiprogramming system with operating system interference. This is due to each SPEC benchmark running as the single active user process until completion. Lebeck and Wood [6] used their CPROF cache profiling tool to analyze the cache bottlenecks on the SPEC 92 benchmark suite. CPROF provides cache hot spot information at the source line and data structure level. This information is then used by the programmer to modify the code to improve the program's locality. ....
A. Lebeck and D. Wood. Cache profiling and the spec benchmarks: A case study. IEEE Computer, October 1994.
....runs about 35 faster than the standard algorithm. The plots in Figure 5(b) show an increase in running time of the cache efficient FMV as the subgrid size increases, which is explained by TLB misses. All memory hierarchy simulations were performed using Lebeck's fast cache and cprof simulators [8], for NITER = 4. Figure 6(a) shows the plot of TLB misses, which correlates with the degradation in running times for large subgrid sizes. The reason for the increase in the TLB misses is as follows. Since the size of the solution array is large, each row gets mapped to one or more virtual memory ....
A. R. Lebeck and D. A. Wood. Cache profiling and the SPEC benchmarks: A case study. IEEE Computer, 27(10):15--26, Oct. 1994.
....direct-mapped L3 caches can be ineffective. Many scientific programs, large integer programs like VLSI CAD applications and AI applications reference more data than can fit in cache, and incur a large number of capacity misses [16]. While sophisticated compilers [21, 22, 29] or careful programming [23], in some cases, can help reduce capacity misses, these techniques can put a large burden on users and have not yet entered mainstream computing. Current computer architectures use caches that are close to the processor, making it possible to build caches with short access time. However, this ....
Alvin R. Lebeck and David A. Wood, "Cache Profiling and the SPEC Benchmarks: A Case Study", IEEE Computer, Vol. 27, No. 10, pp. 15-26, October, 1994.
....a single cache line whenever possible. While this can reduce cache traffic and decrease the potential for conflicts for small data sets, the impact is not significant for large data sets, such as arrays in scientific loop nests, because the cache line size is relatively small. Lebeck and Wood [18] present a case study of improving performance with a variety of techniques including data transformations such as padding and memory alignment. However, these transformations are discussed in the context of programmer tuning of application performance. There is no discussion of how to incorporate ....
....data transformations. For example, if hA and hB are identical except for a permutation of rows, the array dimensions corresponding to those rows in one of the arrays may be permuted to obtain compatibility. If hA and hB differ in the stride for one dimension, array compression or expansion [18] along that dimension can be applied. If hA and hB differ in the sign in one dimension, the storage order in that dimension can be reversed for one of the arrays. In conjunction with code transformations (e.g. loop permutation) such data transformations improve the utilization of cache lines, ....
A. Lebeck and D. Wood. Cache profiling and the SPEC benchmarks: A case study. IEEE Computer, 27(10):15--26, 1994.
....an accessed data item will be accessed again in the near future. A program exhibits spatial locality if there is good chance that subsequently accessed data items are located near each other in memory. Most programs tend to exhibit both kinds of locality and typical hit ratios are greater than 90% [18]. Our design techniques will attempt to improve both the temporal and spatial locality of the sorting algorithms. 3 Design and Evaluation Methodology. Cache locality is a good thing. When spatial and temporal locality can be improved at no cost it should always be done. In this paper, however, ....
....is an added advantage that the number of instructions executed for both add and remove min is also reduced. The second optimization is to align the heap array in memory so that all d children lie on the same cache block. This optimization reduces what Lebeck and Wood refer to as alignment misses [18]. The algorithm dynamically chooses between the repeated adds method and Floyd's method for building a heap. If the heap is larger than the cache and repeated adds can offer a reduction in cache misses, it is chosen over Floyd's method. We call this algorithm memory tuned heapsort. 4.2 ....
A. Lebeck and D. Wood. Cache profiling and the spec benchmarks: a case study. Computer, 27(10):15--26, Oct 1994.
....to store a graph. Thus they attempt to normalize for machine effects by using running times relative to the time needed to scan the adjacency structure. More recently several researchers have focused on designing algorithms to improve cache performance by improving the locality of the algorithms [20, 17, 18]. Lebeck and Wood focused on recoding the SPEC benchmarks and also developed a cache profiler to help in the design of faster algorithms. LaMarca and Ladner came up with improved heap and later sorting algorithms by improving locality. They also developed a new methodology for analyzing cache ....
A. Lebeck and D. Wood. Cache profiling and the spec benchmarks: a case study. Computer, 27(10):15-26, 1994.
....than 50 is consistent with the occurrence of cache breaks, i.e. when the CPU does not find the data it is seeking in cache, it must stop processing until it has retrieved the data from main memory and loaded it into cache. This retrieval and cache loading typically costs tens of CPU cycles (Lebeck and Wood 1994). Such cache breaks occur once the memory requirement for N sample paths exceeds the data cache size; under these conditions the CPU must cycle all of the data representing the complete set of sample paths through the data cache with each event. The subsequent gradual reduction, as N is increased ....
Lebeck, A. R. and D. A. Wood. 1994. "Cache Profiling and the SPEC Benchmarks: A Case Study." IEEE Computer Magazine Vol. 27, No. 10 (October).
....among elements of different tiles (of the same array) or different arrays. Conflict misses can seriously degrade program performance; their reduction is generally addressed during the selection of tile sizes. Data alignment techniques for reducing cache misses were reported by Lebeck and Wood [3] in a study of the cache performance of the SPEC92 benchmark suite, where they observed significant speedups (up to 3.4X) even on code that was previously tuned using execution time profilers. Padding is a data alignment technique that involves the insertion of dummy elements in a data structure ....
A.R. Lebeck and D.A. Wood, Cache Profiling and the Spec Benchmarks: A Case Study, Computer, vol. 27, no. 10, Oct. 1994.
....hardware are at a high and continuing pace. Main memories of 128 MB are now affordable and custom CPUs currently can perform over 50 MIPS. They rely on efficient use of registers and cache to tackle the disparity between processor and main memory cycle time, which increase every year with 40 [14]. These hardware trends pose new rules to computer software and to database systems as to what algorithms are efficient. Another trend has been the evolution of operating system functionality towards micro kernels, i.e. those that make part of the Operating System functionality accessible ....
....(DSM) [5] facilitates object evolution, and saves IO on queries that do not use all the relation's attributes, while the extra cost for re-assembling of complex objects before they are given to an application is neglectable in a main memory setting. • use lean bulk operations. Studies like [14, 15] show that cache profiling can speed up a program by a factor of two. Deeply nested function .... [Footnote 1: This functionality is achieved with the mmap(), madvise() and mlock() Unix system calls.] ....
A.R. Lebeck and D.A. Wood. Cache profiling and the spec benchmarks: A case study. IEEE Computer, 27(10):15--26, October 1994.
....promoting efficient caching. The matrix multiplication routine in the BLAS package [DDDH 90] uses blocking to promote the use of caches; it is basically designed to handle non sparse matrices. The need to optimize the use of the cache and general guidelines for doing so have been described in [LeWo 94] but to date these techniques have not been used in published literature to speed up the multiplication of sparse matrices and design multi threaded applications for the same purpose. 2. Caching Issues in Matrix Multiplication Consider the normal 3 nested loop (L1) for multiplying two ....
A.R. Lebeck, D.A. Wood. "Cache Profiling and the SPEC Benchmarks: A Case Study". IEEE Computer, Oct. 1994, 15--26
....than i. 4 Validation and Application The code shown in Figure 1 was rewritten replacing the references to memory by functions that calculate the position to be accessed and write it to a trace file. Later this trace file was fed to the dineroIII cache simulator, integrated into the WARTS toolset [14]. Table 2 displays the prediction deviation Delta for several combinations of the input parameters obtained from the execution on synthetic matrices. For each combination several simulations were made changing the data structures starting addresses. In the table oe is the average deviation of the ....
A.R. Lebeck and D.A. Wood, Cache Profiling and the SPEC Benchmarks: A Case Study, IEEE Computer 27 (1994) 15--26.
....of 2 MB. The system has 512 MB of RAM. The VM page size is 8 KB, and the data TLB is fully associative with 64 entries. The system runs SunOS 5.6, and we used SUN's Workshop Compilers 4.2. In addition to timing runs, we also performed cache simulations using the FAST CACHE and CPROF tools [23, 31]. Figure 6 shows the running times of the various algorithms for a number of different problem sizes ....
A. R. Lebeck and D. A. Wood. Cache profiling and the SPEC benchmarks: A case study. IEEE COMPUTER, 27(10):15--26, October 1994.
....of code that exhibit poor cache behavior, to provide insight to programmers for restructuring code to improve memory performance. Given certain high level cache circuit characteristics, these same tools can also estimate the energy dissipation of different levels of the memory hierarchy. Cprof [26] provides cache miss rate information at a fine grain source code level, and further classifies misses as compulsory, capacity, and conflict types. More recent tools, such as the profiling tool developed by Ammons et al. 27] and Compaq s ProfileMe [28] 16 Selective Cache Ways: On Demand Cache ....
A. Lebeck and D. Wood, "Cache profiling and the SPEC benchmarks: A case study," IEEE Computer, vol. 27, pp. 15--26, October 1994.
....hardware are at a high and continuing pace. Main memories of 128 MB are now affordable and custom CPUs currently can perform over 50 MIPS. They rely on efficient use of registers and cache to tackle the disparity between processor and main memory cycle time, which increase every year with 40 [14]. These hardware trends pose new rules to computer software and to database systems as to what algorithms are efficient. Another trend has been the evolution of operating system functionality towards microkernels, i.e. those that make part of the Operating System functionality accessible to ....
....[Figure 1: Monet's decomposed storage scheme] • use lean bulk operations. Studies like [14, 15] show that cache profiling can speed up a program by a factor of two. Deeply nested function calls cause CPUs to run out of register spaces, leading to similar performance penalties. Tuple oriented database algorithms, as proposed in Volcano, perform a sequence of operations for every tuple in a ....
A.R. Lebeck and D.A. Wood. Cache profiling and the spec benchmarks: A case study. IEEE Computer, 27(10):15--26, October 1994.
....they are reused. Finally, note that these examples point to the usefulness of stream buffers and victim caches [Jou90], methods for writing around caches for references that are not reused [Jou93], and techniques such as data padding by the compiler to achieve the minimal number of conflict misses [Leb94]. This analysis also shows how cache conscious data placement [Cal98] can improve cache performance, and how our framework might work in conjunction with such techniques. ....
....random walks and uses the model to predict the behavior of the miss ratio curve for fully associative caches of varying sizes. Thiebaut and Stone [Thi87] develop an analytic model for cache reload transients footprints in the cache to describe the effects of context switches. Lebeck and Wood [Leb94] describe a cache profiling system and show how it can guide code modifications that reduce cache misses. McKinley and Temam [McK96] take a step towards more detailed analysis by quantifying the locality characteristics of numerical loop nests. Their locality measurements reveal important ....
A.R. Lebeck and D.A. Wood, "Cache Profiling and the SPEC Benchmarks: A Case Study", IEEE Computer, Oct. 1994.
....than the standard algorithm. 4.2 Memory behavior The plots in Figure 5(b) show an increase in running time of the cache efficient FMV as the subgrid size increases, which is explained by TLB misses. All memory hierarchy simulations were performed using Lebeck's fast cache and cprof simulators [11], for NITER = 4. Figure 5(c) shows the plot of TLB misses, which correlates with the degradation in running times for large subgrid sizes. The reason for the increase in the TLB misses is as follows. Since the size of the solution array is large, each row gets mapped to one or more virtual memory ....
A. R. Lebeck and D. A. Wood. Cache profiling and the SPEC benchmarks: A case study. IEEE Computer, 27(10):15--26, Oct. 1994.
....1996; Lam et al. 1991; McKinley and Temam 1996; Temam et al. 1994] thereby precluding effective cache utilization. Conflict misses can be particularly significant in caches with low associativity. In such situations programmers often rely on time-consuming cache profiling and performance tuning [Lebeck and Wood 1994; Martonosi et al. 1992]. There has also been compiler work in tailoring code to reduce conflict misses [Bacon et al. 1994; Coleman and McKinley 1995; Lam et al. 1991]. Unfortunately, conflict misses are highly sensitive to slight variations in problem size and base addresses [Bacon et al. 1994; ....
Lebeck, A. R. and Wood, D. A. 1994. Cache profiling and the SPEC benchmarks: A case study. IEEE Computer , 15--26.
....is affected by the placement of its data structures in memory. Changing this layout can reduce conflict misses and also increase implicit prefetching by improving spatial locality. Consequently, the application performance can be improved. Existing tools that target memory performance [GH93, LW94, MGA95, SB94] present only descriptive information, such as cache statistics, to the user. In contrast, a prescriptive tool can treat this as an optimization problem where the goal is to optimally arrange the program data structures, and directly specify the best layout to use. The first step is ....
....spent when the memory subsystem imposes no delays is reported as the memory overhead of that section. Mtool's main drawback is that it provides only descriptive feedback. In addition, the level of detail at which information is provided does not easily explain the performance problems. Cprof [LW94] is a cache profiling system for sequential programs that presents statistics in terms of both code and data structures. It categorizes cache misses obtained from a simulation of the program into compulsory, capacity or conflict misses and presents them to the programmer. MemSpy [MGA95] also ....
[Article contains additional citation context not shown here]
A. R. Lebeck and D. A. Wood. Cache profiling and the SPEC benchmarks: a case study. IEEE Computer, 27(10), October 1994.
....with R10000 processors at 180MHz are shown in Table 4. We see there is a difference from two to five orders of magnitude, even when we have used a very simplified simulator locally developed. The validity of our simple simulator has been checked using dineroIII, belonging to the WARTS toolset [12]. We are currently working in the integration of our technique in code analysis environments that support it and allow its effective and fast application to real programs. An analyzer based on Polaris [1] that requires no user intervention is being built. It has already been validated with simple ....
A. Lebeck and D. Wood. Cache profiling and the SPEC benchmarks: A case study. IEEE Computer, 27(10):15--26, Oct. 1994.
.... they are positioned in the subroutine call hierarchy, and whether iteration is expressed with loops at all; how data is communicated between subroutines; and, from the point of view of implementing codes on distributed memory architectures, how interprocessor communication is handled (Table 1) [20, 25, 28]. As mentioned previously, the ability to exploit a range of computer architectures with a single source code provides obvious benefits in reducing software costs for an institution charged with developing and maintaining a large code. Common approaches are (1) to maintain separate sources, (2) ....
A. R. Lebeck and D. A. Wood. Cache profiling and the spec benchmarks: A case study. IEEE Computer, 27(10):15--27, October 1994.
....of 2 MB. The system has 512 MB of RAM. The VM page size is 8 KB, and the data TLB is fully associative with 64 entries. The system runs SunOS 5.6, and we used SUN's Workshop Compilers 4.2. In addition to timing runs, we also performed cache simulations using the FAST CACHE and CPROF tools [23, 31]. Figure 6 shows the running times of the various algorithms for a number of different problem sizes ....
A. R. Lebeck and D. A. Wood. Cache profiling and the SPEC benchmarks: A case study. IEEE COMPUTER, 27(10):15--26, October 1994.
....6. C program fragment can be done at the source code level or machine code level. For our framework we need a source code level instrumentation. In the past a variety of different cache profilers were introduced, e.g. MTOOL (Goldberg and Hennessy, 1991), PFC Sim (Callahan et al. 1990), CPROF (Lebeck and Wood, 1994). The novelty of our approach is to compute the trace data symbolically at compile time without executing the program. A symbolic tracefile is a constructive description for all possible memory references in chronological order. It is represented as symbolic expressions and recurrences. In the ....
....chosen a set of C programs as a benchmark. We have adopted the symbolic evaluation framework introduced in (Fahringer and Scholz, 1997; Fahringer and Scholz, 1999) for the programming language C and the cache evaluation. The instrumentation was done by hand although an existing tool such as CPROF (Lebeck and Wood, 1994) could have instrumented the benchmark suite. Our symbolic evaluation framework computed the symbolic tracefiles and symbolically evaluated data caches. In order to compare predictions against real values we ....
Lebeck, A. and D. Wood: 1994, `Cache Profiling and the SPEC Benchmarks: A Case Study'. IEEE Computer 27(10).
....principle, model any computer, gather any statistic, and run any program that the target architecture would run, including the operating system. They easily serve as back ends to traditional debuggers as well as architecture design tools such as cache simulators (Bedichek 1990, Darcy et al. 1992, Lebeck and Wood 1994). Naturally, this flexibility comes at a cost: instruction set simulators are often slow, easily over 3 orders of magnitude slower than native execution. Such poor performance severely hampers their practicality, limiting them to toy benchmarks or very patient users. This has prompted several ....
Lebeck, A. R., and D. A. Wood. 1994. Cache profiling and the SPEC benchmarks: A case study. IEEE Computer, 27(10):15-26.
.... [16]. Memory dependence profiling has been used to aid ILP enhancing optimizations by allowing the compiler to reorder ambiguous memory references [2]. Profiling has also been used to identify procedures, basic blocks or source lines with high memory overheads or cache misses [17], [18], [19], [20]. Profiling information specifying the number of cache misses incurred by each access has been proposed to guide the compiler to selectively prefetch data [21], [22] and has been recently used to hand tune code [20]. 1.2 The IMPACT Compiler The tool described in this thesis is implemented within ....
.... blocks or source lines with high memory overheads or cache misses [17], [18], [19], [20]. Profiling information specifying the number of cache misses incurred by each access has been proposed to guide the compiler to selectively prefetch data [21], [22] and has been recently used to hand tune code [20]. 1.2 The IMPACT Compiler The tool described in this thesis is implemented within the IMPACT compiler [23]. There are three levels of intermediate representation (IR) within IMPACT, which divide the compiler into distinct parts. Pcode is the highest level IR, and is based on a parallel C code ....
A. R. Lebeck and D. A. Wood, "Cache profiling and the spec benchmarks: A case study," IEEE Computer, pp. 15--26, October 1994.
....possible annotations: Retain, Release or WordMode. The first two annotations allow the application to exercise some control over the cache replacement policy. The last annotation allows the application to exercise some control over the blocksize (and associativity) of the cache. We use the CPROF [13] profiling tool to identify suitable annotations. Using our annotations, we see between 13 and 20 speedups for three media benchmarks (epic, pegwit, ijpeg) on a 4-issue dynamically scheduled processor. The remainder of this paper is organized as follows. Section 2 provides background ....
Alvin R. Lebeck and David A. Wood. Cache Profiling and the SPEC Benchmarks: A Case Study. IEEE COMPUTER, 27(10):15--26, October 1994.
No context found.
A. R. Lebeck and D. A. Wood, "Cache Profiling and SPEC Benchmarks: A Case Study," IEEE Computer, pp. 15--26, Oct. 1994.
No context found.
A. Lebeck and D. Wood. Cache profiling and the SPEC benchmarks: A case study. IEEE Computer, 27(10):15--26, 1994.
No context found.
A. R. Lebeck and D. A. Wood. Cache profiling and the SPEC benchmarks: A case study. IEEE Computer, 27(10):15--26, 1994.
No context found.
A. R. Lebeck and D. A. Wood. Cache profiling and the SPEC benchmarks: A case study. IEEE Computer, 27(10):15--26, 1994.
No context found.
A. R. Lebeck and D. A. Wood. Cache profiling and the SPEC benchmarks: A case study. IEEE Computer, pages 15--26, October 1994.
No context found.
A. R. Lebeck and D. A. Wood. Cache profiling and the spec benchmarks: A case study. IEEE Computer, 27(10):15--26, Oct 1994.
No context found.
A. Lebeck, D. Wood, Cache profiling and the SPEC benchmarks: A case study, IEEE Computer 27 (10) (1994) 15--26.
No context found.
Alvin R. Lebeck and David A. Wood. Cache Profiling and the SPEC Benchmarks: A Case Study. IEEE Computer, vol. 27, no. 10, pp. 15-26, October 1994.
No context found.
A.R. Lebeck and D.A. Wood, "Cache Profiling and the SPEC Benchmarks: A Case Study," IEEE Computer, vol. 27, no. 10, pp. 15-26, Oct. 1994.
No context found.
A.R. LEBECK and D.A. WOOD. Cache profiling and the SPEC benchmarks: A case study. Computer, 27(10):15--26, Oct. 1994.
No context found.
A. R. Lebeck and D. A. Wood. Cache profiling and the SPEC benchmarks: A case study. IEEE Computer, 27(10):15--26, 1994.
No context found.
A. R. Lebeck and D. A. Wood. Cache profiling and the spec benchmarks: A case study. IEEE COMPUTER, 27(10):15--26, Oct. 1994.
No context found.
LEB94 A. Lebeck, D. Wood. "Cache profiling and the SPEC benchmarks: A case study", IEEE Computer 27:10, October 1994.