Ben Verghese, Scott Devine, Anoop Gupta, and Mendel Rosenblum. Operating system support for improving data locality on CC-NUMA compute servers. In Proceedings of the seventh international conference on Architectural support for programming languages and operating systems, pages 279--289. ACM Press, 1996.
....a runtime system with mechanisms and algorithms that transparently optimize at runtime the page placement of OpenMP programs, using feedback from the compiler, the operating system and dynamic monitoring of the memory reference pattern of the programs. UPMLIB leverages dynamic page migration [12] at user level to correct unfortunate page placement decisions made by the operating system. The notable difference of UPMLIB compared to previously proposed kernel level page migration engines, is that the employed dynamic page migration algorithms correlate the memory reference information ....
B. Verghese, S. Devine, A. Gupta and M. Rosenblum. Operating System Support for Improving Data Locality on CC-NUMA Compute Servers. Proc. of ASPLOS-VII, pp. 279--289, Cambridge (USA), October 1996.
....for programmers and compromises the simplicity of OpenMP. The OpenMP programming model is designed to enable straightforward parallelization of sequential codes, without exporting architectural details to the programmer. Data distribution contradicts this design goal. Dynamic page migration [14] is an operating system mechanism for tuning page placement on distributed shared memory multiprocessors, based on the observed memory reference traces of each program at runtime. The operating system uses per node, per page hardware counters, to identify the node of the system that references ....
B. Verghese, S. Devine, A. Gupta and M. Rosenblum. Operating System Support for Improving Data Locality on CC-NUMA Compute Servers. Proc. of the 7th Int. Conference on Architectural Support for Programming Languages and Operating Systems, pp. 279--289. Cambridge, MA, October 1996.
....likely to provide a prediction for future memory accesses) and if the data distribution engine is able to identify phase changes in the memory access pattern. Although techniques for sampling and decaying memory access history to gauge dynamic memory access patterns have appeared in the literature [14, 18, 20], it is questionable if these techniques form a general solution. A second problem is that dynamic optimization of data distribution is only one aspect of the performance tuning process for scalable shared memory architectures. The balanced distribution of computation among processors is a ....
B. Verghese, S. Devine, A. Gupta, and M. Rosenblum. Operating System Support for Improving Data Locality on CC-NUMA Compute Servers. In Proc. of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), pages 279--289, Cambridge, Massachusetts, Oct. 1996.
....we present later, we show that these techniques can lead to an improvement in performance of two orders of magnitude in some cases. Other more coarse grain approaches for improving locality in general SMP software include automated support for memory page placement, replication and migration [18, 23, 40] and cache affinity aware process scheduling [39, 24, 13, 33, 9]. 1.2 SMP Operating Systems Poor performance of the operating system can have considerable impact on application performance. For example, for parallel workloads studied by Torrellas et al. the operating system accounted for as much ....
Ben Verghese, Scott Devine, Anoop Gupta, and Mendel Rosenblum. Operating system support for improving data locality on CC-NUMA compute servers. In Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 279--289, Cambridge, Massachusetts, 1--5 October 1996. ACM Press.
....scheduling decisions. These approaches explore the benefits of not migrating threads whose data is resident in the local cache [6] 8] and/or local memory [1] 4]. Research in data migration focuses on co-locating data with its accessing threads either by migrating or replicating pages in memory [5][9]. III. A VECTOR MODEL FOR MIGRATION The conflicting goals of improving locality and distributing resource demands can be characterized as vectors. An attraction vector is associated with every object (data or thread) and designates the dominant direction of the object's communication. Moving the ....
B. Verghese, S. Devine, A. Gupta, and M. Rosenblum, "Operating System Support for Improving Data Locality on CC-NUMA Compute Servers," in Proc. of the 7th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, Cambridge, MA, pp. 279--289, Oct. 1996.
....from not being implemented in a fabrication technology tailored to the circuit family. In principle, neither MRAM nor stacked DRAM technologies have this drawback as the memory and logic devices can be fabricated separately. Page migration has been extensively studied in multiprocessor systems [24, 6, 37]. This technique reduces memory access time by moving data pages into the memory of the processor that is accessing them. Our MRAM memory architecture will allow a similar optimization to be made for uniprocessor systems. Lebeck et al. have suggested that page allocation and migration policies can ....
B. Verghese, S. Devine, A. Gupta, and M. Rosenblum. Operating system support for improving data locality on CC-NUMA compute servers. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, 1996.
....operating system as different microarchitectural features were varied. Barroso, Gharachorloo, and Bugnion [9] investigated database and AltaVista search engine workloads on an SMP, focusing on the memory system performance. Other investigations using SimOS do not investigate OS activity at all [56, 85, 55, 34]. Web servers have been the subject of only limited study, due to their relatively recent emergence as a workload of interest. Hu, Nanda, and Yang [38] examined the Apache Web server on an IBM RS/6000 and an IBM SMP, using kernel instrumentation to profile kernel time. Although they execute on ....
VERGHESE, B., DEVINE, S., GUPTA, A., AND ROSENBLUM, M. Operating system support for improving data locality on CC-NUMA compute servers. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (October 1996).
....between processors due to the operating system scheduling strategy. In these cases, the hardware and the operating system should make special provisions for reducing the number of remote memory accesses incurred by the programs. Page level coherence through dynamic page migration and replication [18] is a technique proposed to improve locality on DSM systems by dynamically moving virtual memory pages closer to the processors that actively use the pages more frequently. Previously proposed algorithms for page migration used flat competitive schemes, based on reference counters attached to the ....
B. Verghese, S. Devine, A. Gupta and M. Rosenblum. Operating System Support for Improving Data Locality on ccNUMA Compute Servers. Proc of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 279--289, Cambridge (USA), 1996.
....and replication are a direct analogue to multiprocessor cache coherence, with the virtual memory page serving as the coherence unit. Page migration has been proposed merely as a kernel level mechanism for improving the data locality of applications with dynamic memory reference patterns [15, 34]. In this work, dynamic page migration is put in a radically different context. In particular, page migration is no longer considered as an optimization. It is rather used as the vehicle of a transparent data distribution engine. The key for leveraging dynamic page migration as a data ....
....of page placement schemes which are inferior to first touch. The implementation of page migration in IRIX follows closely the design presented in [34] for the Stanford FLASH multiprocessor. Each physical memory frame is equipped with a set of 11-bit hardware counters. Each set of counters contains one counter per node and some additional logic to compare counters. The counters track the number of accesses from each node to each frame in memory. .... [Footnotes from the excerpted page: (3) mprotect is the UNIX system call for controlling access rights to memory pages. (4) We used the class A problem sizes in the experiments.]
B. Verghese, S. Devine, A. Gupta, and M. Rosenblum. Operating System Support for Improving Data Locality on CC-NUMA Compute Servers. In Proc. of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), pages 279--289, Cambridge, Massachusetts, October 1996.
....programs, if pages with shared data are distant from the threads that access them more frequently upon cache misses. To surmount this problem, vendors provide either data distribution directives as extensions to OpenMP or operating system support to control the placement [1] and dynamic migration [13] of data pages. Offering data distribution directives similar to the ones offered by High Performance Fortran (HPF) [6] has two fundamental shortcomings. First, it is inherently platform dependent and thus hard to standardize and incorporate seamlessly in shared memory programming models like ....
....REDISTRIBUTE directives can be used to dynamically modify the initial data mapping. Dynamic page migration, triggered by specialized hardware that monitors the reference rates from each node to each page in memory, moves pages competitively between nodes, based on the observed reference rates [13]. Page migration can be employed as an optimization for programs with dynamically changing reference patterns, but also as a system tool to automatically fix incorrect placements of pages at runtime. Though transparent to the programmer, it requires complicated policies and algorithms in order to ....
B. Verghese, S. Devine, A. Gupta and M. Rosenblum. Operating System Support for Improving Data Locality on CC-NUMA Compute Servers. Proc. of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 279--289, Cambridge (USA), October 1996.
....across memories also impacts the performance of distributed-memory machines. The techniques presented in this paper are orthogonal to data distribution schemes: DDSMs can leverage proposed user transparent locality enhancement techniques applied to conventional DSM mechanisms, such as migration [23] and prediction speculation [13]. None of these techniques are used in the simulations presented in this paper. In order to study the sensitivity of DDSM performance to different data distributions, two different scenarios are considered. In the first scenario, data is distributed randomly ....
Verghese, B., Devine, S., Gupta, A., and Rosenblum, M. Operating system support for improving data locality on CC-NUMA compute servers. In Proc. 7th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1996.
....mechanism that freezes frequently migrating pages. The pages are later defrosted based on either the time since they were frozen or when there are more remote than local accesses to the page [4]. The migration and replication techniques have shown to benefit even modern CC-NUMA architectures [5] [6] where the ratio between the cost of local and remote accesses is as low as 2 to 3. With modern network technologies, access to a block cached in a remote memory can be many times faster than access to a disk. On a 155 Mbit/s ATM network, Feeley et al. [7] report up to factor 7 depending on the ....
Ben Verghese, Scott Devine, Anoop Gupta, and Mendel Rosenblum. Operating system support for improving data locality on CC-NUMA compute servers. Proceedings of the 7th Symposium on Architectural Support for Programming Languages and Operating Systems, pp. 279-289, October 1996.
....time, focusing on replacement algorithms. Recent studies examined page coloring policies for selecting appropriate physical page frames to minimize cache misses [2, 23]. Other recent work has studied page placement aimed at improving TLB performance [41] or NUMA multiprocessor memory access [27, 28, 45, 12, 3]. Each of these problems bears some resemblance to the issues we face since they all attempt to exploit the flexibility available in mapping virtual to physical pages. 3. HARDWARE POWER MANAGEMENT This section explains various hardware policies for controlling PADRAM power states. Since each ....
B. Verghese, S. Devine, A. Gupta, and M. Rosenblum. Operating system support for improving data locality on CC-NUMA compute servers. In Proceedings, Architectural Support for Programming Languages and Operating Systems, pages 279--289, October 1996.
....the operating system as different microarchitectural features were varied. Barroso, Gharachorloo, and Bugnion [4] investigated database and AltaVista search engine workloads on an SMP, focusing on the memory system performance. Other investigations using SimOS do not investigate OS activity at all [28, 44, 27, 17]. Web servers have been the subject of only limited study, due to their relatively recent emergence as a workload of interest. Hu, Nanda, and Yang [19] examined the Apache Web server on an IBM RS/6000 and an IBM SMP, using kernel instrumentation to profile kernel time. Although they execute on ....
B. Verghese, S. Devine, A. Gupta, and M. Rosenblum. Operating system support for improving data locality on CC-NUMA compute servers. In 7th International Conference on Architectural Support for Programming Languages and Operating Systems, October 1996.
....[22] 23] 24] 26] 27] [33] and decision support (DSS) database workloads [3] 5] 15] 17] 18] 23] 32]. These studies used standard workloads defined by the Transaction Processing Performance Council (TPC), namely TPC-B and TPC-C for OLTP [10] and TPC-D, TPC-H and TPC-R for DSS [10] 30] 31]. Although these benchmarks .... [Footnote from the excerpted page: (1) This work was performed as part of the author's dissertation research. The author's present address is: Storage Systems Program, Hewlett-Packard Laboratories, 1501 Page Mill Road, M S 1U 13, Palo Alto, CA 94304-1126. Her current email address is kkeeton@hpl.hp.com.]
....producing a representative random microbenchmark lies in posing multiple read-only queries. 5. RELATED WORK Many of the studies that use database workloads to evaluate computer architecture innovations have employed the complex OLTP [3] 7] 8] 9] 16] 17] 18] 19] 22] 23] 24] 26] 27] [33] and DSS [3] 5] 15] 17] 18] 23] 31] workloads defined by the TPC. These studies vary in their usage of full-scale data sets versus in-memory data sets. One study provides rules of thumb for using an in-memory version of the TPC-B OLTP benchmark to approximate the processor and memory ....
B. Verghese, S. Devine, A. Gupta and M. Rosenblum. "Operating system support for improving data locality on CC-NUMA compute servers." In Proc. of ASPLOS-VII, pages 279-289, October 1996.
....with the virtual memory page serving as the coherence unit. Page migration was proposed merely as a kernel level mechanism for improving the data locality of applications with dynamic memory reference patterns, initially on non cache coherent and later on cache coherent NUMA multiprocessors [12, 13]. In this work, we apply dynamic page migration in an entirely new context, namely data distribution. In this context, page migration is no longer considered as an optimization. It is rather used as the mechanism for approximating implicitly the functionality of a simple data distribution system. ....
....with and without the page migration engine. This is done primarily to investigate if the IRIX page migration engine is capable of improving the performance of page placement schemes inferior to first touch. The implementation of page migration in IRIX follows closely the scheme presented in [13] for the Stanford FLASH multiprocessor. Each physical memory frame is equipped with a set of 11 bit hardware counters. Each set of counters contains one counter per node in the system and some additional logic to compare counters. The counters track the number of accesses from each node to each ....
B. Verghese, S. Devine, A. Gupta and M. Rosenblum. Operating System Support for Improving Data Locality on CC-NUMA Compute Servers. Proc. of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 279--289. Cambridge, MA, October 1996.
....(DSM) machines, cache locality improvements (through data layout transformations) and memory locality improvements (through data distribution) are complementary. While good cache locality optimization combined with a good page management scheme (such as first touch (FT) policy with page migration [24, 41]) can effectively ensure low memory access costs, merely distributing the data across the memories of the processors in a best possible way may not necessarily ensure good cache locality. We would like to delve a bit more into this interplay between cache locality optimization, page management ....
B. Verghese, S. Devine, A. Gupta, and M. Rosenblum. Operating system support for improving data locality on cc-NUMA compute servers. In Proc. 7th International Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, MA, pages 279--289, October 1996.
....of references from a remote node exceeds that from the home node, the operating system is notified of a possible candidate for the page migration. The Origin also has support for fast page transfer and TLB shootdown, which are considered to be a significant part of the page migration cost [69]. Simple COMA is a software realization of the cache only memory architecture (COMA) and it allows page replication and migration as requested by processors automatically [56]. Falsafi and Wood proposed Reactive NUMA that switches its coherence mechanism between CC-NUMA and S-COMA adaptively ....
B. Verghese, S. Devine, A. Gupta, and M. Rosenblum, "Operating System Support for Improving Data Locality on CC-NUMA Compute Servers," in Proceedings of 7th International Conference on Architectural Support for Programming Languages and Operating Systems, ACM Press, New York, 279--289, October 1996.
....cost of accessing a remote node is incurred on every cache miss. Again, the exact cost is highly variable depending on the workload data access patterns. Workloads with a large memory footprint that are adversely affected due to loss of node affinity can use dynamic page migration and replication [61] to alleviate the cost. Inter cell VCPU migration: The third type of VCPU migration occurs when a VCPU is moved across a cell boundary. This migration has a much more complex implementation. For fault containment reasons, the VCPU data structure must reside in the cell executing the VCPU; ....
....such dependencies. Migrating data pages can also improve performance by reducing the cache miss penalty if pages are migrated to the node from which they were being accessed most frequently. However, a better method for improving memory latency is to use dynamic page migration and replication [61]. This technique can provide a huge performance improvement for workloads that do not fit in the L2 cache. However, we've found that the 4MB cache on the Origin 2000 is sufficient for most workloads. Besides deciding which VCPUs to migrate and where to place them, another performance critical ....
Ben Verghese, Scott Devine, Anoop Gupta, and Mendel Rosenblum. Operating system support for improving data locality on CC-NUMA compute servers. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 279--289, October 1996.
No context found.
Ben Verghese, Scott Devine, Anoop Gupta, and Mendel Rosenblum. Operating system support for improving data locality on CC-NUMA compute servers. In Proceedings of the seventh international conference on Architectural support for programming languages and operating systems, pages 279--289. ACM Press, 1996.
No context found.
B. Verghese, S. Devine, A. Gupta, and M. Rosenblum, "Operating System Support for Improving Data Locality on CC-NUMA Compute Servers," Proc. Architectural Support for Programming Languages and Operating Systems VII, pp. 279-289, Oct. 1996.
No context found.
B. Verghese, S. Devine, A. Gupta, and M. Rosenblum. Operating system support for improving data locality on CC-NUMA compute servers. In Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 279--289, October 1996.
No context found.
B. Verghese, S. Devine, A. Gupta and M. Rosenblum. Operating System Support for Improving Data Locality on CC-NUMA Compute Servers. In Proceedings of 7th Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII), 1996.
No context found.
B. Verghese, S. Devine, A. Gupta and M. Rosenblum. "Operating system support for improving data locality on CC-NUMA compute servers," Proc. of ASPLOS-VII, pages 279-289, October 1996.