M. Swanson, L. Stoller, and J. B. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proc. of the 25th Annual Int'l Symp. on Computer Architecture (ISCA'98), June 1998.
....hardware resource benefits flatten out once typical working sets can be fully mapped in the TLB. tries to dynamically predict the likely benefits of promoting to a superpage against the cost of doing the promotion to determine when and where to use superpages. A technique called shadow memory [12] has been put forward to mitigate the superpage requirement of large contiguous physical regions of memory which are difficult to manage and have potential adverse swapping effects. The last two superpage designs work on average but can encounter situations where performance is worse than without ....
Mark Swanson, Leigh Stoller, and John Carter. Increasing TLB reach using superpages backed by shadow memory. In Proceedings of the 25th International Symposium on Computer Architecture (ISCA), pages 204--213. ACM, 1998.
....of misses. Bala et al. [3] focus in specifically on interprocess communication activities, and illustrate software techniques for lowering miss penalties on software managed TLBs. Superpaging is another well investigated technique to boost the coverage of the TLB and better utilize its capacity [31, 32, 30, 12]. Studies have looked at hardware and operating system support for providing superpage translations in the TLB. Recent work in this area [12] is investigating memory controller support for remapping pages so that there is more scope for creating superpage entries (without incurring the overheads ....
M. Swanson, L. Stoller, and J. Carter. Increasing TLB reach using Superpages backed by Shadow Memory. In Computer Architecture, pages 204--213, 1998.
....TLB supports 4K, 16K, 64K, 256K, 1M, 4M and 16M page sizes per TLB entry, which means the TLB reach ranges from 512K if all entries use 4K pages to 2G if all entries use 16M pages. Unfortunately, while many processors support multiple page sizes, few operating systems make full use of this feature [21]. Current operating systems usually use a fixed page size. Since a TLB miss can take from tens to hundreds of cycles, the use of large stride data accesses must be refrained. For example, a stride of the same size as a page can generate a TLB miss every access. In order to minimize cache misses, ....
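The reach figures and the stride remark in the context above can be checked with a short sketch. It assumes that TLB reach is simply the number of page mappings times the page size, and it infers 128 mappings from the quoted 512K / 4K ratio; that count, the LRU replacement policy, and all function names are illustrative assumptions, not details taken from the cited paper.

```python
from collections import OrderedDict

KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# Assumption: 128 page mappings, inferred from the quoted 512K reach with 4K pages.
NUM_MAPPINGS = 128


def tlb_reach(num_mappings: int, page_size: int) -> int:
    """Bytes of memory that can be mapped without taking a TLB miss."""
    return num_mappings * page_size


def count_tlb_misses(stride: int, accesses: int, page_size: int,
                     num_mappings: int = NUM_MAPPINGS) -> int:
    """Count misses for a strided access stream on a toy fully associative LRU TLB."""
    tlb = OrderedDict()  # page number -> True, ordered from least to most recently used
    misses = 0
    for i in range(accesses):
        page = (i * stride) // page_size
        if page in tlb:
            tlb.move_to_end(page)        # hit: mark as most recently used
        else:
            misses += 1
            tlb[page] = True
            if len(tlb) > num_mappings:  # evict the least recently used mapping
                tlb.popitem(last=False)
    return misses


if __name__ == "__main__":
    for size in (4 * KIB, 16 * KIB, 64 * KIB, 256 * KIB, MIB, 4 * MIB, 16 * MIB):
        print(f"page size {size // KIB:>6} KiB -> reach {tlb_reach(NUM_MAPPINGS, size) // KIB} KiB")
    # A stride equal to the page size touches a new page on every access.
    print(count_tlb_misses(stride=4 * KIB, accesses=1000, page_size=4 * KIB))
```

Under these assumptions the two endpoints match the quoted range (512K and 2G), and a page-sized stride misses on every access while a small stride misses only once per page touched.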
....Moreover, the derived data transformations only consider the innermost loop level and only a single data layout is associated with each array. Such pre determined data layouts favor one axis of the index space over the others and adjacent data in the unfavored directions become distant in memory [5, 21]. Hence, such layouts can yield large strides generating TLB misses when control returns to an enclosing loop, and also generate many useless cache loads. In the same way, loop transformations that are dedicated to temporal locality optimization in the innermost loop for some references, do not ....
M. R. Swanson, L. Stoller, and J. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture, pp. 204-213, 1998.
....of high memory contention, especially for multi megabyte superpages; if this is combined with no gathering, then we believe there will be few opportunities for promotions. This potential problem is not reflected in their experiments, since they tested under no memory pressure. Swanson et al. [6] devised a hardware based mechanism that avoids the cost of gathering. An extra level of translation in the memory controller allows superpages to be composed of non adjacent physical pages. This approach is not applicable to commodity machines. Subramanian et al. [5] describe the implementation ....
M. Swanson, L. Stoller, and J. B. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proc. of the 25th Annual Int'l Symp. on Computer Architecture (ISCA'98), 1998.
....entry in order to increase the TLB reach. For example, the MIPS R12000 processor TLB supports 4K to 16M page sizes per TLB entry, which means the TLB reach ranges from 512K to 2G. Unfortunately, while many processors support multiple page sizes, few operating systems make full use of this feature [18]. Recent work has provided advances in loop and data transformation theory. By using affine representation of loops, several loop transformations have been unified into a single framework using a matrix representation of these transformations [22]. These techniques consist either in unimodular [1] or ....
M. R. Swanson, L. Stoller, and J. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 204-213, June 1998.
....support variable page sizes. There is extensive literature on the proposal to use superpages mapped extents that are much larger than common current page sizes as a solution. Proposals for architectural support range from the straightforward simple schemes [19] to suggestions by Swanson [23] for an additional level of indirection in page translation to create non contiguous and unaligned superpages. This scheme makes superpage use far more convenient since the memory management system does not have to be designed around it (finding large, contiguous, aligned areas of unused memory is ....
Mark Swanson, Leigh Stoller, and John Carter. Increasing TLB reach using superpages backed by shadow memory. In Computer Architecture News, 1998.
....Moreover, the derived data transformations only consider the innermost loop level and only one unique data layout is associated to each array. Such determined data layouts favor one axis of the index space over the others and neighbors in the unfavored directions become distant in memory [3, 15]. Hence, they can yield large strides when control returns to the enclosing loop, and also yield many useless cache loads of inactive elements. In the same way, the derived loop transformations are dedicated to temporal locality optimization in the innermost loop for one or several references. ....
M. R. Swanson, L. Stoller, and J. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 204-213, June 1998.
.... enables several optimizations that let the application control how, when, and where its data are loaded into the on chip caches: gathering sparse data into dense cache lines, tiling and recoloring data structures without copying, and mapping noncontiguous physical pages to a single TLB entry [49, 11]. The compiler or application programmer inserts system calls to remap data structures, making this a CT;RT;RT approach. Impulse prefetches and buffers data within the memory controller until the CPU requests them, avoiding cache pollution. Most of the approaches outlined in this section ....
....that can perform intelligent access ordering and/or prefetching and buffering within the memory controller, even if they always transmit data in cache line increments to a processor chip with a traditional cache hierarchy. Application driven remapping of physical addresses at the memory controller [49, 11] can be used to give programs more control over how stream data is cached. In the SMC system described here, stream data can be cached by copying it to a portion of memory that has been preallocated in cache, as in Lee's subroutines [32]. Fundamentally, the decision whether or not to cache ....
M.R. Swanson, L.B. Stoller, and J.B. Carter, Increasing TLB Reach Using Superpages Backed by Shadow Memory, Proc. 25th Ann. Int'l Symp. Computer Architecture, pp. 204-213, June 1998.
....reversal, or skewing) when such transformations are legal [6], but this does not help in the matrix transposition example. Second, for large matrix sizes, it may even reduce the effectiveness of translation lookaside buffers (TLBs) because the dilation effect extends to virtual memory pages [5, 37]. Finally, it may cause cache misses due to self interference even when a tiled loop repeatedly accesses a small tile in the array index space, because the canonical layout depends on the matrix size rather than the tile size. Such interference misses are a complicated and non smooth function ....
M. R. Swanson, L. Stoller, and J. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 204--213, June 1998.
.... enables several optimizations that let the application control how, when, and where its data are loaded into the on chip caches: gathering sparse data into dense cache lines, tiling and recoloring data structures without copying, and mapping non contiguous physical pages to a single TLB entry [SSC98, CHS 99]. The compiler or application programmer inserts system calls to remap data structures, making this a (CT;RT;RT) approach. Impulse prefetches and buffers data within the memory controller until the CPU requests them, avoiding cache pollution. Most of the approaches outlined in this ....
....can perform intelligent access ordering and/or prefetching and buffering within the memory controller, even if they always transmit data in cache line increments to a processor chip with a traditional cache hierarchy. Application driven remapping of physical addresses at the memory controller [SSC98, CHS 99] can be used to give programs more control over how stream data is cached. In the SMC system described here, stream data can be cached by copying it to a portion of memory that has been pre allocated in cache, as in Lee's subroutines [Lee93]. Fundamentally, the decision whether or not ....
M.R. Swanson, L.B. Stoller, and J.B. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 204--213, June 1998.
....TLB bound applications by about 50%. Our work extends theirs by measuring the added performance benefit, as well as the effect on the choice of policy, of using hardware support at the memory system to make creating superpages cheaper. The hardware that we model is the Impulse Memory Controller [28], which helps create superpages without copying by adding another level in the memory hierarchy at the memory controller. In Impulse, superpages are built through remapping. Our research shows that combining the work of Romer et al. and the Impulse technology changes the tradeoffs in designing an ....
....performance factors not covered by Romer et al.'s trace based study, such as the detrimental effects of the cache pollution induced by copying. Finally, we find that online superpage promotion achieves performance comparable to the hand coded superpage promotion mechanism employed by Swanson et al. [28]. The remainder of this paper is organized as follows. Section 2 surveys related work. Section 3 explains the two methods used to create superpages, along with the two policies investigated for promoting superpages at run time. Section 4 describes our simulation environment and benchmark suite, ....
M. Swanson, L. Stoller, and J. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proc. of the 25th ISCA, pp. 204-213, June 1998.
....reversal, or skewing) when such transformations are legal [6], but this does not help in the matrix transposition example. Second, for large matrix sizes, it may even reduce the effectiveness of translation lookaside buffers (TLBs) because the dilation effect extends to virtual memory pages [5, 37]. Finally, it may cause cache misses due to self interference even when a tiled loop repeatedly accesses a small tile in the array index space, because the canonical layout depends on the matrix size rather than the tile size. Such interference misses are a complicated and non smooth function ....
M. R. Swanson, L. Stoller, and J. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 204--213, June 1998.
....the problem of limited TLB reach, caused by the disparity between application access patterns and TLB size, but most of them require the addition and use of special purpose hardware. Even the simpler proposed solutions require that the hardware implement superpages. Swanson proposes a mechanism in [10] that adds another level of indirection to page translation to create noncontiguous and unaligned superpages. This scheme makes superpage use far more convenient since the memory management system does not have to be designed around it (finding large, continuous, aligned areas of unused memory is ....
Mark Swanson, Leigh Stoller, and John Carter. Increasing TLB reach using superpages backed by shadow memory. In Computer Architecture News, 1998.
.... by appropriate loop transformations (such as interchange, reversal, or skewing) when such transformations are legal [4]. Second, for large matrix sizes, it may even reduce the effectiveness of translation lookaside buffers (TLBs) because the dilation effect extends to virtual memory pages [3, 54]. Finally, it may cause cache misses due to self interference even when a tiled loop repeatedly accesses a small tile in the array index space, because the canonical layout depends on the matrix size rather than the tile size. Such interference misses are a complicated and non smooth function of ....
M. R. Swanson, L. Stoller, and J. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 204--213, June 1998.
.... by appropriate loop transformations (such as interchange, reversal, or skewing) when such transformations are legal [4]. Second, for large matrix sizes, it may even reduce the effectiveness of translation lookaside buffers (TLBs) because the dilation effect extends to virtual memory pages [3, 56]. Finally, it may cause cache misses due to self interference even when a tiled loop repeatedly accesses a small tile in the array index space, because the canonical layout depends on the matrix size rather than the tile size. Such interference misses are a complicated and nonsmooth function of ....
M. R. Swanson, L. Stoller, and J. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 204--213, June 1998.
No context found.
M.R. Swanson, L.B. Stoller, and J.B. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture, June 1998.
No context found.
M.R. Swanson, L.B. Stoller, and J.B. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture, June 1998.
No context found.
M. Swanson, L. Stoller, and J. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture, June 1998.
....the Impulse Memory Controller. ory or extra pointers, and it is capable of hiding latency even when there is very little work to do for each node. 2 Impulse Background Impulse expands the traditional virtual memory hierarchy by adding address translation hardware to the memory controller (MC) [4, 11, 30, 33]. The operating system can configure the MC to reinterpret unused physical addresses as remapped aliases of real physical addresses. The remapped physical addresses refer to a shadow address space. This virtualization of unused physical addresses can be used to improve the efficiency of the ....
M. Swanson, L. Stoller, and J. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 204--213, June 1998.
....three image processing algorithms: image filtering, image rotation, and ray tracing. We then show how Impulse's ability to remap pages can be used to automatically improve TLB behavior through dynamic superpage creation. Some of these results have been published in prior conference papers [9, 13, 44, 49]. 3.1 Sparse Matrix Vector Product. Sparse matrix vector product (SMVP) is an irregular computational kernel that is critical to many large scientific algorithms. For example, most of the time in conjugate gradient [3] or in the Spark98 earthquake simulations [33] is spent performing SMVP. To ....
M. Swanson, L. Stoller, and J. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 204--213, June 1998.
....the impact of these new architectural features when designing a dynamic superpage promotion mechanism. For example, Swanson et al. demonstrate that applications can create superpages without copying using the Impulse memory controller's physical to physical address remapping capabilities [29]. The resulting system yields a two fold increase in TLB reach and a 5-20% improvement in the performance of a mix of SPECint95 and Splash2 applications. We also extend previous work by employing an execution driven simulation environment that more accurately models both the direct and indirect ....
....and instruction issue width for the two online promotion algorithms proposed by Romer et al. [24]. In this section we describe the Impulse memory controller, compare our processor models to Romer's, and review Romer's promotion policies. 3.1 Impulse Superpage Promotion. The Impulse memory system [29] supports an extra level of address remapping at the MMC (main memory controller): unused physical addresses are remapped into real physical addresses. We refer to the remapped addresses as shadow addresses. From the point of view of the processor, shadow addresses are used in place of real ....
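The shadow-address remapping described in the contexts above can be pictured with a small sketch: the processor sees one contiguous, aligned shadow range (so a single superpage TLB entry covers it), while a table in the memory controller sends each base page of that range to an arbitrary real frame. The class name, the 4 KiB base page size, and the per-page table are illustrative assumptions; the quoted text does not describe the controller's actual data structures.

```python
PAGE_SIZE = 4096  # assumed base page size in bytes


class ShadowSuperpage:
    """Toy model of a superpage backed by non-contiguous real page frames."""

    def __init__(self, shadow_base: int, real_frames: list[int]):
        self.size = PAGE_SIZE * len(real_frames)
        # The shadow range must be aligned like an ordinary superpage,
        # because the processor TLB maps it with a single entry.
        if shadow_base % self.size != 0:
            raise ValueError("shadow base must be aligned to the superpage size")
        self.shadow_base = shadow_base
        self.real_frames = list(real_frames)  # real frame number for each base page

    def translate(self, shadow_addr: int) -> int:
        """Memory-controller step: shadow physical address -> real physical address."""
        offset = shadow_addr - self.shadow_base
        if not 0 <= offset < self.size:
            raise ValueError("address is outside this shadow superpage")
        index, page_offset = divmod(offset, PAGE_SIZE)
        return self.real_frames[index] * PAGE_SIZE + page_offset


if __name__ == "__main__":
    # Four scattered real frames presented as one 16 KiB shadow superpage.
    sp = ShadowSuperpage(0x8000_0000, [7, 2, 19, 4])
    print(hex(sp.translate(0x8000_0000 + PAGE_SIZE + 12)))
```

The point of the design is that only the shadow range needs to be contiguous and aligned; the real frames can stay wherever the operating system originally placed them, so no copying is needed.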
M. Swanson, L. Stoller, and J. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proc. of the 25th ISCA, pp. 204--213, June 1998.
....Prefetch Unit DRAMs DRAMs Memory Controller Cache MMU System bus L1 cache CPU Remapping Figure 1: Organization of the Impulse Memory Controller. 2 Impulse Background Impulse expands the traditional virtual memory hierarchy by adding address translation hardware to the memory controller [4, 11, 30, 33]. The operating system can configure the memory controller to reinterpret unused physical addresses as remapped aliases of real physical addresses. This virtualization of unused physical addresses can be used to improve the efficiency of the processor caches. The remapped physical addresses refer ....
M. Swanson, L. Stoller, and J. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proc. of the 25th ISCA, pp. 204--213, June 1998.
....directly to a physical page. By remapping physical pages in this manner, applications can recolor physical pages without copying as described in Section 3.1. In another publication we have described how direct mappings in Impulse can be used to form superpages from non contiguous physical pages [21]. • Strided physical memory: Impulse allows applications to map a region of shadow addresses to a strided data structure. That is, a shadow address at offset soffset on a shadow region is mapped to a pseudo virtual address pvaddr + stride × soffset, where pvaddr is the starting address of the ....
....should be usable across a variety of memory bound applications. In addition, despite the fact that we use conjugate gradient as our application for two optimizations, we are not comparing optimizations: the two optimizations are usable on different sets of different applications. In previous work [21], we have shown that the Impulse memory remappings can be used to dynamically build superpages and reduce the frequency of TLB faults. Impulse can create superpages from non contiguous user pages: simulations show that this optimization improves the performance of five SPECint95 benchmark programs ....
M. Swanson, L. Stoller, and J. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proc. of the 25th ISCA, June 1998.
....depending on how Impulse is used to access a particular data structure: direct, strided, or scatter-gather. Direct mapping translates a shadow address directly to a physical DRAM address. This mapping can be used to recolor physical pages without copying [8] or to construct superpages dynamically [30]. We discuss no copy page coloring further in Section 3.1. Strided mapping creates dense cache lines from array elements that are not contiguous in physical memory. The mapping function maps an address soffset in shadow space to pseudo virtual address pvaddr + stride × soffset, where pvaddr ....
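The strided mapping quoted above can be illustrated with a minimal sketch. It reads the extraction-damaged formula as pvaddr + stride × soffset and assumes that soffset counts whole elements; the quoted text does not state the units, and the function names and the list standing in for memory are hypothetical.

```python
def strided_map(pvaddr: int, stride: int, soffset: int) -> int:
    """Map an offset in shadow space to a pseudo-virtual address (assumed element units)."""
    return pvaddr + stride * soffset


def gather_dense(memory: list, pvaddr: int, stride: int, count: int) -> list:
    """Collect `count` strided elements into one dense run, as a dense cache line would hold them."""
    return [memory[strided_map(pvaddr, stride, s)] for s in range(count)]


if __name__ == "__main__":
    # A 4x4 row-major matrix stored as a flat list; column 1 is strided by the row length.
    matrix = list(range(100, 116))
    print(gather_dense(matrix, pvaddr=1, stride=4, count=4))
```

In this toy model, consecutive shadow offsets walk down a matrix column, so the processor would see the column as contiguous data even though its elements are four positions apart in the underlying array.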
....performance impact should be even greater on superscalar machines, where memory becomes a bigger bottleneck, and where non memory instructions are effectively cheaper. 11 Flexible remapping support in the Impulse controller can be used to implement a variety of optimizations. In previous work [30], we showed that the Impulse memory remappings can be used to dynamically build superpages and thereby reduce the frequency of TLB faults. Impulse creates superpages from non contiguous user pages. Our simulations show that this optimization improves the performance of five SPECint95 benchmark ....
M. Swanson, L. Stoller, and J. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 204--213, June 1998.
.... performance gap for applications with poor, but predictable, memory locality (e.g. streaming and vector style applications). Impulse enables several optimizations that let an application control how, when, and where its data are loaded into the on chip caches via the following mechanisms [3, 24]: (i) gathering sparse data into dense cache lines, (ii) tiling without copying data, (iii) recoloring data structures without copying, and (iv) mapping non contiguous physical pages to a single TLB entry. To decrease the observed latency of memory accesses, Impulse supports many in flight DRAM ....
....is used to access a particular data structure, this translation can be direct, strided, or scatter-gather. Direct mapping translates a shadow address directly to a physical DRAM address. This mapping can be used to recolor physical pages without copying [3] or to construct superpages dynamically [24]. Strided mapping creates dense cache lines from array elements that are not contiguous. The mapping function maps an address soffset in ....
M. Swanson, L. Stoller, and J. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 204--213, June 1998.
....page. By remapping physical pages in this manner, applications can recolor physical pages without copying: this optimization is described in Section 3.1.2. In another publication we have described how direct mappings in Impulse can be used to form superpages from non contiguous physical pages [22]. • Strided physical memory: Impulse allows applications to map a region of shadow addresses to a strided data structure. That is, a shadow address at offset soffset on a shadow region is mapped to a pseudo virtual address pvaddr + stride × soffset, where pvaddr is the starting address of the ....
....should be usable across a variety of memory bound applications. In addition, despite the fact that we use conjugate gradient as our application for two optimizations, we are not comparing optimizations: the two optimizations are usable on different sets of different applications. In previous work [22], we have shown that the Impulse memory remappings can be used to dynamically build superpages, which can reduce the frequency of TLB faults. Impulse can create superpages from non contiguous user pages: simulations show that this optimization improves the performance of five SPECint95 benchmark ....
M. Swanson, L. Stoller, and J. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture, June 1998.
....in our current design: • Direct mapping: Impulse allows applications to map a shadow page directly to a physical page. By remapping physical pages in this manner, applications can color pages without copying (described in Section 4.2) and form superpages from non contiguous physical pages [14]. • Strided physical memory: Impulse allows applications to map the addresses in a shadow page to strided physical memory. That is, a shadow address at offset soffset on the shadow page is mapped to a physical address paddr + stride × soffset. By mapping sparse, regular data items into packed ....
....7 Conclusions We have described three optimizations that the Impulse memory controller will support for matrix based scientific applications. An Impulse memory controller can also be used to dynamically build superpages, which can save processor TLB entries and reduce the frequency of TLB faults [14]. Because the physical memory associated with a superpage must be contiguous and correctly aligned, it is difficult to map user data using superpages on conventional memory systems. By contrast, Impulse can create superpages for arbitrarily shuffled user data by using shadow memory to map ....
M. Swanson, L. Stoller, and J. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture, June 1998.
....controller allows tiles to be copied virtually. The cost that virtual copying incurs is that read write sharing between virtual copies requires cache flushing to maintain coherence. An Impulse memory controller can be used to dynamically build superpages, so as to save processor TLB entries [13]. This optimization can dramatically improve the performance of applications with large working sets. Unfor 2 In the NAS CG benchmarks, x ranges from 600 kilobytes to 1.2 megabytes, and the other structures range from 100-300 megabytes. Thus, x will not fit in most processor's L1 caches, but ....
M. Swanson, L. Stoller, and J. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture, June 1998.
No context found.
M. Swanson, L. Stoller, and J. B. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proc. of the 25th Annual Int'l Symp. on Computer Architecture (ISCA'98), June 1998.
No context found.
M. Swanson, L. Stoller, J. Carter. Increasing TLB Reach Using Superpages Backed by Shadow Memory. In 25th Annual International Symposium on Computer Architecture, July 1998.
No context found.
M. Swanson, L. Stoller, and J. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proceedings of the 25th International Symposium on Computer Architecture, pages 204--213, Barcelona, Spain, June 1998.
No context found.
M. Swanson, L. Stoller, and J. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proceedings of the 25th International Symposium on Computer Architecture, pages 204--213, Barcelona, Spain, June 1998.
No context found.
M. Swanson, L. Stoller, J. Carter. Increasing TLB Reach Using Superpages Backed by Shadow Memory. In 25th Annual International Symposium on Computer Architecture, July 1998.
No context found.
M. Swanson, L. Stoller, and J. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proceedings of the 25th International Symposium on Computer Architecture, pages 204-213, Barcelona, Spain, June 1998.
No context found.
M. R. Swanson, L. Stoller, and J. Carter. Increasing TLB reach using superpages backed by shadow memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 204--213, June 1998.