A. Diwan, D. Tarditi, and E. Moss. Memory subsystem performance of programs with intensive heap allocation. ACM Transactions on Computer Systems, 1995.
....line when a store instruction references a location not currently residing in the cache. This organization is used in current workstations (e.g., the DECstation 5000 series) and has been shown to be effective for programs with intensive heap allocation [Koopman et al. 1992] [Reinhold 1993] [Diwan et al. 1995]. We do not use the original SPARCstation 2 cache configuration because it suffers from large variations in cache miss ratios caused by small differences in code and data positioning (we have observed variations of up to 15% of total execution time). With the changed cache configuration, these ....
DIWAN, A., TARDITI, D., AND MOSS, E. B., 1995. Memory Subsystem Performance of Programs with Intensive Heap Allocation. ACM Trans. on Computer Systems 13(3), 244-273, August 1995.
....and subblock placement. The write-allocate policy has been shown to be beneficial for programs with intensive heap allocation; we will discuss it in more detail below. We assume that there is no cost associated with write misses, i.e., that write buffers can absorb almost all writes (see [DTM94, Rei93, Rei94] for justifications of this assumption). Even though many SELF programs allocate objects at a rate of about 1 Mbyte/s, the data cache performance is good. For example, the median miss ratio with a 64K cache is 2.4%, which would lead to an overhead of 6% on a SPARCstation-2-like memory system ....
....For example, the median miss ratio with a 64K cache is 2.4%, which would lead to an overhead of 6% on a SPARCstation-2-like memory system (cache miss time = 26 cycles). Even with an 8K data cache, the median overhead would still be less than 17%. This data is consistent with that of Diwan et al. [DTM94], who have measured allocation-intensive ML programs and found very low data cache overheads for the same cache organization (write-allocate, subblock placement). Similar results have also been reported by Reinhold for Scheme programs [Rei93], by Jouppi for the SPEC benchmark suite [Jou93], and by ....
[Article contains additional citation context not shown here]
Amer Diwan, David Tarditi, and Eliot Moss. Memory Subsystem Performance of Programs with Intensive Heap Allocation. In 21st Annual ACM Symposium on Principles of Programming Languages, p. 1-14, January 1994.
....does not seek to implement incremental garbage collection and does not require tag bits in memory. Their scheme, unlike ours, does not attack the problem of processor memory traffic; it does not attempt to reduce reads or writes to main memory by the application program. Diwan, Tarditi, and Moss [DTM 93] have studied the interaction of heap allocation, caching, and copying garbage collection. They conclude that most current machines support heap allocation poorly, although with large caches, good performance can be achieved by tuning. An earlier study of related issues was performed by Wilson et ....
Amer Diwan, David Tarditi, and Eliot Moss, "Memory subsystem performance of programs with intensive heap allocation," submitted to ACM Transactions on Programming Languages and Systems, August 1993.
....and subblock placement. The write-allocate policy has been shown to be beneficial for programs with intensive heap allocation; we will discuss it in more detail below. We assume that there is no cost associated with write misses, i.e., that write buffers can absorb almost all writes (see [DTM94], [Rei93] or [Rei94] for justifications of this assumption). Even though many SELF programs allocate objects at a rate of about 1 Mbyte/s, the data cache performance is good. For example, the median miss ratio with a 64K cache is 2.4%, which would lead to an overhead of 6% on a ....
....For example, the median miss ratio with a 64K cache is 2.4%, which would lead to an overhead of 6% on a SPARCstation-2-like memory system (cache miss time = 26 cycles). Even with an 8K data cache, the median overhead would still be less than 17%. This data is consistent with that of Diwan et al. [DTM94], who have measured allocation-intensive ML programs and found very low data cache overheads for the same cache organization (write-allocate, subblock placement). Similar results have also been reported by Reinhold for Scheme programs [Rei93], by Jouppi for the SPEC benchmark suite [Jou93], and by ....
[Article contains additional citation context not shown here]
Amer Diwan, David Tarditi, and Eliot Moss. Memory Subsystem Performance of Programs with Intensive Heap Allocation. In 21st Annual ACM Symposium on Principles of Programming Languages, p. 1-14, January 1994.
....cache design may do quite well in a memory hierarchy well suited to its behavior. The literature on garbage collection is considerably more sophisticated in terms of locality studies than the literature on memory allocation, and should not be overlooked. (See, e.g., [Bae73, KLS92, Wil90, WLM92, DTM93, Rei94, GA95, Wil95].) Many of the same issues must arise in conventionally managed heaps as well. .... larger free blocks (coalescing). Equally important are the policy and strategy implications, i.e., whether the allocator properly exploits the regularities in real request streams. In this ....
Amer Diwan, David Tarditi, and Eliot Moss. Memory subsystem performance of programs with intensive heap allocation. Submitted for publication, August 1993.
....smaller environment instead. Implemented straightforwardly, a BCS program allocates all data on the heap. Research in functional languages has shown that this is not necessarily worse than stack allocation given generational copying garbage collection and fast handling of write misses in the cache [10, 3]. Bevemyr and Lindgren have previously shown how to adapt generational copying garbage collection to a standard WAM [4]; we expect to reuse that method. Shao and Appel have shown how to optimize continuation representations for a heap-based implementation of SML [18, 2]. If their results can be ....
A. Diwan, D. Tarditi, E. Moss, Memory Subsystem Performance of Programs with Intensive Heap Allocation, Technical report CMU-CS-93-227, Carnegie-Mellon University, 1993.
....smaller environment instead. Implemented straightforwardly, a BCS program allocates all data on the heap. Research in functional languages has shown that this is not necessarily worse than stack allocation given generational copying garbage collection and fast handling of write misses in the cache [53, 7]. Bevemyr and Lindgren have previously shown how to adapt generational copying garbage collection to a standard WAM [21]; we expect to reuse that method. Shao and Appel have shown how to optimize continuation representations for a heap-based implementation of SML [125, 6]. If their results can be ....
A. Diwan, D. Tarditi, E. Moss, Memory Subsystem Performance of Programs with Intensive Heap Allocation, Technical report CMU-CS-93-227, Carnegie-Mellon University, 1993.
....that some conventional cache designs can achieve this effect. 39 Tarditi and Diwan show that the same effect can be achieved in a more conventional language implementation using generational garbage collection, and demonstrate the value of a cache-to-memory interface supporting high write rates [DTM93]. The difference in locality between moving and nonmoving collectors does not appear to be large at the scale of high-speed cache memories; the type of collector is not as important as rate of allocation and the size of the youngest generation, i.e., how quickly memory is used, reclaimed and ....
....memory. [WLM92] also shows that the allocation of variable binding environments and activation records on the heap can greatly exacerbate cache-level locality problems due to a youngest generation that won't fit in the cache. This is borne out by simulation studies of Standard ML of New Jersey [DTM93] on high-performance processors. It suggests that activation information and binding environments should be allocated on a stack using compile-time analysis [KKR 86] or in a software stack cache [CHO88, WM89, Kel93]. Software stack caches can be used in languages like ML and Scheme, where ....
Amer Diwan, David Tarditi, and Eliot Moss. Memory subsystem performance of programs with intensive heap allocation. Submitted for publication, August 1993.
....with a fixed nursery size of 10^6 bytes (figure 7b). The cache organization is the one found on the DECStation 5000/200. In figure 7a, we note that varying the nursery size has little effect on the cycles per instruction measure, for this organization. This is to be expected, as described in [3]. The actual performance, however, is significantly affected by the choice of nursery size; see figure 8. In figure 7b, most of the effect is due to the instruction cache; generally, SML programs' instruction working sets take a few hundreds of kilobytes. Collection cost model We used the ....
Amer Diwan, David Tarditi, and Eliot Moss. Memory subsystem performance of programs with intensive heap allocation. Submitted for publication, July 1993.
....collector at all. Without such control experiments, it is impossible to argue that a sophisticated collection strategy is desirable, much less necessary. Diwan, Tarditi, and Moss recently published results supporting the proposition that garbage-collected languages can have good cache performance [8, 9]. Using a temporal metric, rather than miss ratios, they studied the cache performance of eight fairly substantial ML programs running in Standard ML of New Jersey [3, 24]. While they used a more accurate cache and memory system simulator than that employed here, they only considered ....
Amer Diwan, David Tarditi, and Eliot Moss. Memory Subsystem Performance of Programs with Intensive Heap Allocation. Technical Report 93-227, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, December 1993.
....with subblock placement allocates a cache line when a store instruction references a location not currently residing in the cache. This organization is common in current workstations (e.g., the DECstation 5000/200) and has been shown to be effective for programs with intensive heap allocation [48]. # The SPARCstation 2's write buffer organization is somewhat unfortunate and incurs high write costs compared to other common designs. To avoid skewing our data, we chose not to model the SS-2 write buffer and instead assume a perfect write buffer. Previous work (e.g., [14, 81]) has shown that ....
....of 1.5. Even with an 8K data cache, the overhead would still be very modest, with a median of 12%. This is quite surprising since most of the programs allocate several MBytes of data. However, our data is consistent with that of Diwan et al., who have measured allocation-intensive ML programs [48] and found very low data cache overheads for the same cache organization (write-allocate, subblock placement). Our data also confirms that of Reinhold [108], who investigated the cache performance of large Lisp programs. Diwan et al. also measured that the data cache overhead of their ML programs ....
Amer Diwan, David Tarditi, and Eliot Moss. Memory Subsystem Performance of Programs with Intensive Heap Allocation. In 21st Annual ACM Symposium on Principles of Programming Languages, p. 1-14, January 1994.
....prior work in understanding the cache behavior of programs, we are not aware of any study that correlates cache behavior to high-level properties such as types. Some prior work tries to understand and improve the cache behavior of heap loads by measuring the cache impact of garbage collection [12, 18, 23, 30, 32, 33]. Mowry and Luk [22] also attempt to improve the effectiveness of latency tolerance techniques by applying them only to cache misses. They identify instructions that are likely to miss in the cache using correlation profiling, which, for instance, predicts whether a load will hit or miss in the ....
A. Diwan, D. Tarditi, and E. Moss. Memory subsystem performance of programs with intensive heap allocation. ACM Transactions on Computer Systems, 1995.
....problem with using the SML/NJ compiler is that it does not use a stack. It uses heap-only allocation: all allocation is done on the heap. In particular, all activation records are allocated on the heap rather than on a call stack. This leads to poor memory system performance on many machines [15, 16]. This is not a problem for demonstrating the thesis, though, since the end result will be to understate the relative performance benefits of better global optimization. The optimizations I am proposing will reduce mostly instruction counts, and not eliminate large numbers of function calls. ....
Amer Diwan, David Tarditi, and Eliot Moss. Memory subsystem performance of programs with intensive heap allocation. Technical Report CMU-CS-93-227, School of Computer Science, Carnegie Mellon University, December 1993.
....the write barrier. The memory simulator modelled the entire memory system of the DECStation 5000/200 [12], which is favorable to programs which heap allocate intensively. A less favorable memory system organization would increase the cost of storage management by increasing the cost of allocation [14, 15]. The remainder of the paper is organized as follows. Section 2 introduces terminology and describes the storage management strategy used by the SML/NJ compiler. Section 3 describes the measurement techniques and benchmark programs. Section 4 presents measurements for eight SML/NJ programs. Section ....
....the SML code as well as the garbage collector, which is written in C. We extended QPT in two ways. First, we modified QPT and the SML/NJ system to produce traces for SML/NJ programs. Second, we added an event tracing facility to QPT. The changes to QPT and the SML/NJ system are described elsewhere [14, 15]. One important change we made to the SML/NJ system was to place code outside the heap so that it was not moved by garbage collection. In the original system, code was placed in the heap and it was moved by garbage collection. Allowing code to be moved makes tracing programs extremely difficult. ....
[Article contains additional citation context not shown here]
Amer Diwan, David Tarditi, and Eliot Moss. Memory subsystem performance of programs with intensive heap allocation. Submitted for publication, November 1993.
....is enough physical memory to avoid paging entirely, the pauses caused by major collections are intolerably long [9, 20]. Cache performance of SML/NJ was also suspected to be bad. It was speculated that 40% of execution time was spent waiting for main memory access, a 66% overhead. Recent work [11] has shown that this is not altogether true, at least for some current architectures and memory subsystem organisations. In particular, the cache performance on the DECStation 5000/200 is reasonably good across all benchmarks reported, at a 17% overhead. Although object behaviour has a bearing on ....
....we compute its derivatives, and then, inverting the logarithmic scale, compute the nursery mortality, object survival and object mortality rates. 2.6 Benchmark programs The benchmark suite we used draws upon Appel's collection, and adds some further scientific programs. Table 1 (adapted from [2, 11]) summarises the individual benchmarks. Due to space constraints, results will be presented and discussed for a subset of the benchmarks consisting of Leroy, Yacc, and ML . 3 Results We shall look at the results obtained from our main two experiments, first the promotion analysis for young ....
Amer Diwan, David Tarditi, and J. Eliot B. Moss. Memory subsystem performance of programs with intensive heap allocation. Submitted for Publication, October 1993.
....write barrier instructions, tag manipulation, etc. The traces were all for the MIPS R3000 architecture and the memory subsystem organization used in the simulations was similar to that of DECStation 5000/200 (which has been previously demonstrated to be favorable for allocation-intensive programs [6]). All programs were written in Standard ML [7], a mostly functional programming language, and were compiled using SML/NJ. The DECStation 5000/200 has a 64K instruction and a 64K data cache. Both the instruction and data caches are direct mapped. The block size of the cache is 1 word but on a read ....
....on the heap rather than on a call stack. Thus, the performance of SML/NJ programs relies heavily on the efficiency of the garbage collector. 2 Actually partial word writes are treated differently, but since there are so few in our programs, we ignore them here without any loss of accuracy [6]. 3 The code size includes 207K for the standard libraries. Program Description CW The Concurrency Workbench [4] is a tool for analyzing networks of finite state processes expressed in Milner's Calculus of Communicating Systems. Lexgen A lexical analyzer generator [2] processing the lexical ....
[Article contains additional citation context not shown here]
Amer Diwan, David Tarditi, and Eliot Moss. Memory subsystem performance of programs with intensive heap allocation. Submitted for publication, July 1993.
....the write barrier. The memory simulator modeled the entire memory system of the DECStation 5000/200 [12], which is favorable to programs which heap allocate intensively. A less favorable memory system organization would increase the cost of storage management by increasing the cost of allocation [14, 15]. The remainder of the paper is organized as follows. Section 2 introduces terminology and describes the storage management strategy used by the SML/NJ compiler. Section 3 describes the measurement techniques and benchmark programs. Section 4 presents measurements for eight SML/NJ programs. Section ....
....the SML code as well as the garbage collector, which is written in C. We extended QPT in two ways. First, we modified QPT and the SML/NJ system to produce traces for SML/NJ programs. Second, we added an event tracing facility to QPT. The changes to QPT and the SML/NJ system are described elsewhere [14, 15]. One important change we made to the SML/NJ system was to place code outside the heap so that it was not moved by garbage collection. In the original system, code was placed in the heap and it was moved by garbage collection. Allowing code to be moved makes tracing programs extremely difficult. ....
[Article contains additional citation context not shown here]
DIWAN, A., TARDITI, D., AND MOSS, E. Memory subsystem performance of programs with intensive heap allocation. Tech. Rep. CMU-CS-93-227, School of Computer Science, Carnegie Mellon University, Dec. 1993. Submitted for publication.
....with subblock placement, the memory subsystem overhead was under 17% for 64K or bigger caches; for caches without subblock placement, the overhead was often as high as 100%. 19 Subsequent work has shown that 512K is large enough to hold the allocation area of most of the benchmark programs [17]. [Figure: cycles per useful instruction (1 to 3) versus I and D cache sizes (8K to 128K); curves for write no alloc/no subblk, write alloc/subblk, and write alloc/no subblk, each at assoc=1 and assoc=2] ....
Amer Diwan, David Tarditi, and Eliot Moss. Memory subsystem performance of programs with intensive heap allocation. Work in progress, October 1993.
No context found.
Diwan, A., Tarditi, D. and Moss, E. Memory Subsystem Performance of Programs with Intensive Heap Allocation. Carnegie Mellon University Technical Report CS TR 93-227. 1993.
No context found.
Amer Diwan, David Tarditi, and J. Eliot B. Moss. Memory subsystem performance of programs with intensive heap allocation. Submitted for Publication, October 1993.