R.P. Larowe, C.S. Ellis, Experimental comparisons of memory management policies for NUMA multiprocessors, ACM Transactions on Computer Systems 9 (4) (1991) 319--363.
....Other researchers have been implementing [9, 13] and simulating [23] memory management systems on current NUMA multiprocessors that do not provide the reference information needed by the algorithms considered in this chapter. Some related migration results are reported by LaRowe and Ellis [33]. These results are from experiments on a Butterfly multiprocessor, and use approximations to an access count obtained by sampling and aging reference bits [23]. This sampling and aging requires a periodic software scan of all pages in the system, which introduces some additional overhead. ....
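The "sampling and aging reference bits" approximation mentioned in the excerpt above can be illustrated with the textbook aging scheme. This is only a minimal sketch: the counter width, the function name `age_counters`, and the dictionary layout are assumptions for illustration, not details taken from the cited studies.

```python
def age_counters(counters, ref_bits, width=8):
    """Run one periodic scan over all pages (textbook aging scheme).

    counters: dict mapping page id -> aged counter (hypothetical layout)
    ref_bits: dict mapping page id -> bool, set when the page was referenced
    width:    counter width in bits (assumed; not from the source)
    """
    top_bit = 1 << (width - 1)
    for page in counters:
        # Halve the old history, then record the latest sample in the top bit.
        aged = counters[page] >> 1
        if ref_bits.get(page, False):
            aged |= top_bit
        counters[page] = aged
        # Clear the reference bit so the next scan sees only new references.
        ref_bits[page] = False
    return counters
```

Each scan touches every page once, which is the periodic software overhead the excerpt refers to.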
....is not available from current hardware. If this information is available, the additional software changes required to implement the algorithms can be handled by Mach; existing Mach implementations of NUMA memory management [9, 13] lend support to this conclusion. Preliminary performance studies [7, 33] of these algorithms show that they can improve performance, but the demonstrated performance improvements are relatively small. This has led the authors of one of the studies [33] to conjecture that the performance improvements gained from their approximation to the migration algorithm are not ....
[Article contains additional citation context not shown here]
Richard P. LaRowe, Jr. and Carla S. Ellis. Experimental Comparison of Memory Management Policies for NUMA Multiprocessors. Technical Report CS-1990-10, Department of Computer Science, Duke University, Durham, NC, 1990.
.... in hardware on machines with coherent caches, such as the Symmetry, the Silicon Graphics machine, and the Kendall Square Research multiprocessor, and may be implemented in the operating system on machines lacking coherent caches, like the Butterfly [Bolosky et al. 1989; Cox and Fowler, 1989; LaRowe, Jr. and Ellis, 1991]. Our affinity scheduling algorithm divides the iterations of a loop into chunks of size ⌈N/P⌉, where N is the number of iterations in the loop and P is the number of available processors. The ith chunk of iterations is always placed on the local work queue of processor i. When a processor is ....
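The initial chunk placement described in the excerpt above can be sketched as follows. This assumes the chunk size is the ceiling of N/P (the original notation is damaged in extraction) and covers only the initial assignment, since the excerpt is cut off before it says what a processor does when its queue empties. The function name `initial_chunks` is hypothetical.

```python
def initial_chunks(n_iterations, n_processors):
    """Return one range of loop iterations per processor.

    Chunk i goes to the local work queue of processor i. The chunk size is
    assumed to be ceil(N / P); the last chunks may be short or empty.
    """
    chunk = (n_iterations + n_processors - 1) // n_processors  # ceil(N / P)
    queues = []
    for i in range(n_processors):
        start = min(i * chunk, n_iterations)
        end = min(start + chunk, n_iterations)
        queues.append(range(start, end))
    return queues
```

Because chunk i always lands on processor i, repeated executions of the same loop start each processor on the same iterations.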
R. P. LaRowe, Jr. and C. S. Ellis, "Experimental Comparison of Memory Management Policies for NUMA Multiprocessors," ACM Transactions on Computer Systems, 9(4):319-363, November 1991.
....Figure 1.1: The uniform memory access (UMA) model of a multiprocessor (P_i: processor, MMU_i: memory management unit, C_i: cache memory, M_i: memory). In the non-uniform memory access (NUMA) model, the shared memory is physically distributed among the N processing nodes as shown in Figure 1.2 [96] [102]. The physical memories in each processor together form the logical shared memory. The interconnection network connects processing elements containing processors and memory banks among themselves so that any processor may access a remote memory physically located within another element ....
Richard P. Larowe Jr. and Carla Schlatter Ellis. Experimental Comparison of Memory Management Policies for NUMA Multiprocessors. ACM Transactions on Computer Systems, 9(4):319--363, November 1991.
....working sets of the applications are smaller than 32K bytes. Thus an L3 cache with a size of 64K bytes is sufficient to capture all shared memory blocks. All machine models use the sequential consistency model [17]. The default page allocation scheme is round robin and the page size is 512 bytes [25], due to the small problem size in the evaluated applications. Synchronizations are based on the test-and-test-and-set primitive, which is also used to implement the Lock/Unlock operations [22]. The synchronization variables can be cached for local spinning. Barrier synchronizations are counter based ....
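As a rough illustration of the round-robin page allocation with 512-byte pages mentioned in the excerpt above, the sketch below assumes that consecutive pages are dealt out to nodes in address order. The excerpt does not state this detail, and the name `home_node` is hypothetical.

```python
PAGE_SIZE = 512  # bytes, as stated in the excerpt


def home_node(address, num_nodes, page_size=PAGE_SIZE):
    """Return the node assumed to hold the page containing `address`.

    Assumption: page k is placed on node k mod num_nodes (round robin in
    address order).
    """
    page_number = address // page_size
    return page_number % num_nodes
```

With four nodes, addresses 0 to 511 map to node 0, 512 to 1023 to node 1, and address 2048 wraps back to node 0.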
R. P. Larowe and C. S. Ellis. Experimental comparisons of memory management policies for numa multiprocessors. In ACM Transactions on Computer Systems, pages 319--323, Nov 1991.
....of Mach virtual memory in Chapters 4 and 5, and the differences in the architecture of the Mach and nX virtual memory systems have little effect on the applicability of the ideas presented in this chapter, we omit a detailed description of the DUnX virtual memory system. The reader is referred to [LE90] for more details. A conventional way for multiple users to share the resources of a GP 1000 machine is to assign each user a dedicated subset of the processor nodes (called a cluster). DUnX expands upon this by providing users with the ability to specify and dynamically change the memory ....
R. P. LaRowe Jr. and C. S. Ellis. Experimental comparison of memory management policies for NUMA multiprocessors. Technical Report CS-1990-10, Duke University, April 1990. Submitted.
....write notices, but we take advantage of the globally accessible physical address space for cache fills and for access to the coherent map and the local weak lists. Our use of remote reference to reduce the overhead of coherence management can also be found in work on NUMA memory management [7, 8, 14, 22, 23]. However relaxed consistency greatly reduces the opportunities for profitable remote data reference. In fact, early experiments we have conducted with on line NUMA policies and relaxed consistency have failed badly in their attempt to determine when to use remote reference. On the hardware side ....
R. P. LaRowe Jr. and C. S. Ellis. Experimental Comparison of Memory Management Policies for NUMA Multiprocessors. ACM Transactions on Computer Systems, 9(4):319--363, November 1991.
....The SHRIMP project at Princeton [BLA 94] has studied ways to efficiently map the network interface to virtual memory to achieve low latency and high bandwidth communication in a multicomputer, to support a shared memory abstraction. The distributed shared memory literature for NUMA [LE90, LLG 92, BLA 94, RLD94] (non-uniform memory access) and COMA [FBR93] (cache-only memory access) machines, described above, studies mechanisms to efficiently implement a shared memory abstraction so that applications on parallel hardware make use of aggregate memory. Efficiently ....
P. LaRowe and C. Ellis. Experimental Comparison of Memory Management Policies for NUMA Multiprocessors. Technical Report CS-1990-10, Duke University, April 1990.
....SRAM-based block caches. A detailed study of the block cache design space is beyond the scope of this paper and has been dealt with in great detail in a recent paper [14]. Prior research indicates that CC-NUMA's performance may be very sensitive to the initial data allocation and placement [9]. As such, in this paper we use a first-touch placement policy in all the systems we study. This policy is simple and has been shown to substantially eliminate unnecessary traffic [13]. In this policy, a user-invoked directive on every node initiates page migration and placement at the start of ....
Rick LaRowe and Carla Ellis. Experimental comparison of memory management policies for numa multiprocessors. ACM Transactions on Computer Systems, 9(4):319--363, November 1991.
....best use of the processor resources, and the reduction of different contentions in the execution of a program. Many performance evaluation results, applications and experiences have shown that the NUMA problem causes serious performance degradation for MIN-based multiprocessor systems. See e.g. [7], [10] and [11]. Very little work has been done to evaluate the NUMA problem on the KSR1 multiprocessor system. Our computing experience on the KSR1 with two rings (64 processors) indicates that this HR-based architecture handles various potential contentions more efficiently than the ....
R. P. LaRowe, Jr. C. S. Ellis, "Experimental comparison of memory management policies for NUMA multiprocessors", ACM Transactions on Computer Systems, Vol. 9, No. 4, 1991. pp. 319-363.
....presents our experiments and answers to the above stated questions. Finally, section 4 summarizes our results and presents our conclusions. 2 Previous Work Memory coherence is an active area of research. Several implementations of memory coherence protocols exist in shared memory multiprocessors [15, 7, 5, 12, 13]. A lot of work has also been done in implementing coherent memory in message passing multiprocessors [16, 10]. Bolosky et al. [4] have compared several memory coherence policies. In a companion paper [3] he has shown that the dominant overhead in memory coherence policies is false sharing, which ....
R. P. LaRowe, Jr. and C. S. Ellis. "Experimental Comparison of Memory Management Policies for NUMA Multiprocessors". ACM Transactions on Computer Systems, 9(4):319--363, November 1991.
....to employ memory management policies that allocate data in local memories intelligently, thus increasing data locality and eliminating or at least significantly reducing the need for remote accesses. Work on memory management policies has been based primarily on DSM multiprocessors without caches [41, 64, 66] and software DSMs [1, 11] that enable coherence at the page level. The impact of memory management policies on CC-NUMA system performance was only recently studied in [14, 77]. These studies primarily focus on OS/hardware support needed for dynamic memory management policies involving page ....
....local memory, thus reducing the need to access the remote memory. The focus of this research is to study the impact of memory management on the network and memory system performance of CC-NUMA multiprocessors. Related work in this area is the performance evaluation of memory management policies [41, 64, 66] for distributed shared memory systems without hardware cache coherence. Recently, Verghese et al. [77] studied the effect of different memory management policies on CC-NUMA system performance. This work concentrated on the OS/hardware support required for dynamic page migration and replication ....
R. P. Larowe and C. S. Ellis, "Experimental Comparisons of Memory Management Policies for NUMA Multiprocessors," ACM Transactions on Computer Systems, vol. 9, no. 4, pp. 319-363, Nov. 1991.
....is minimized, which has been a major obstacle to parallel computing with distributed systems. Additionally, since cache coherence is handled by hardware in SCI, no software bookkeeping is necessary. The Architectural Model LNUMA Non Uniform Memory Access (NUMA) distributed shared memory machines [16, 27, 32] are becoming increasingly significant since they support a shared memory paradigm on a large scale. Examples of NUMA machines are KSR1 [5] and BBN TC2000 [17] In these types of systems, the placement and movement of data are critical to system performance. The NUMA Problem , as it is often ....
Jr. R.P. LaRowe and C.S. Ellis. Experimental comparison of memory management policies for numa multiprocessors. ACM Transactions on Computing Systems, pages 319--363, November 1991.
....multiprocessor from the point of view of dsm systems is the so-called non-uniform memory access (numa) machine. In this type of machine there are usually two types of memory: fast local memory and slower global memory. The problems faced in implementing cache consistency in such machines [127] are very similar to those faced in implementing a dsm system. A common form of cache consistency used in these machines (and most interesting from the dsm point of view) is directory based cache coherence [35]. Directory based coherence describes any protocol that doesn't use broadcast, and must ....
Richard P. LaRowe, Jr. and Carla Schlatter Ellis. Experimental Comparison of Memory Management Policies for NUMA Multiprocessors. ACM Transactions on Computer Systems, 9(4):319--363, November 1991.
....in different memory access time. A scheduling mechanism is supported to schedule the processes dynamically among the processors in run time. Several related studies have been conducted to understand and improve the parallel processing performance on a NUMA multiprocessor. LaRowe and Ellis (see [19]) take an experimental approach to compare a wide range of memory management policies on a target NUMA system, the BBN GP1000. Their system experiments conclude that the placement and movement of code and data are crucial to NUMA performance. The performance of a general multistage interconnection ....
R. P. LaRowe, Jr., C. S. Ellis, "Experimental comparison of memory Management policies for NUMA multiprocessors", Technical Report, CS-1990-10, Department of Computer Science, Duke University, 1990.
....time, focusing on replacement algorithms. Recent studies examined page coloring policies for selecting appropriate physical page frames to minimize cache misses [2, 23]. Other recent work has studied page placement aimed at improving TLB performance [41] or NUMA multiprocessor memory access [27, 28, 45, 12, 3]. Each of these problems bears some resemblance to the issues we face since they all attempt to exploit the flexibility available in mapping virtual to physical pages. 3. HARDWARE POWER MANAGEMENT This section explains various hardware policies for controlling PADRAM power states. Since each ....
R. LaRowe and C. Ellis. Experimental comparison of memory management policies for NUMA multiprocessors. ACM Transactions on Computer Systems, 9(4):319--363, Nov. 1991.
....migration and page replication (with a directory based coherence mechanism) as well as flexible policy control, are all features added to nX to create DUnX. In this paper, however, we limit our discourse to those changes necessary to support dynamic policy selection. The reader is referred to [7] for more information on the DUnX virtual memory subsystem and the evaluation of specific policies. In DUnX, nearly all policy decisions are made by seven special policy functions. These functions isolate most policy decisions from the page fault handler and the page scanner source code. The ....
....Our operating system experimentation has been able to happily coexist with a regular user community. The flexible memory management policy control mechanism provided by DUnX has also proven to be successful. Using that mechanism, we have studied over forty different memory management strategies [7]. The mechanism has also been exploited to provide useful data collection tools, and to provide support for programmer supplied memory management hints [9]. Since DUnX is an implementation running on real hardware, the range of system parameters that can be varied and/or measured is somewhat ....
R. P. LaRowe Jr. and C. S. Ellis. Experimental comparison of memory management policies for NUMA multiprocessors. Technical Report CS-1990-10, Duke University, April 1990.
....as a serious problem in several recent studies. Some of the earliest uses of the term false sharing appear in discussion of memory management for NUMA architectures where page granularity migration and replication are employed to take advantage of the faster local memory access times [4, 2, 12]. False sharing has been blamed as a cause of increased coherency overhead in multiprocessor hardware caches with increasing line sizes in workload characterization studies of shared memory reference patterns [11, 10]. Techniques for ameliorating the false sharing problem have also been proposed ....
Richard P. LaRowe and Carla S. Ellis. Experimental Comparison of Memory Management Policies for NUMA Multiprocessors. ACM Transactions on Computer Systems, 9(4):319--363, November 1991.
....over memory management policy. In this section, we give an overview of the DUnX virtual memory system and the support it provides for our experimental policy evaluation. BBN's nX virtual memory system is described in [3] and a more thorough description of the DUnX mechanisms can be found in [28, 30]. A conventional way for multiple users to share the resources of a GP 1000 machine is to assign each user a dedicated subset of the processor nodes (called a cluster). DUnX expands upon this by providing users with the ability to specify and dynamically change the memory management policies used ....
....second. One might expect that the usefulness of the scanners might be increased by tuning this parameter. We explore this possibility in [33]. 6.7 Other Results We have attempted to answer other questions by experimenting with policy alternatives in DUnX. Details of the results are presented in [30]. In this subsection, we summarize those findings. A comparison of push based versus pull based page migration policies shows little difference in overall performance, yet often significant differences in the amount of page migration activity. Our results suggest that push based policies ....
R. P. LaRowe Jr. and C. S. Ellis. Experimental comparison of memory management policies for NUMA multiprocessors. Technical Report CS-1990-10, Duke University, April 1990.
....in a significant additional programming burden. The operating system can play a major role in managing placement through the policies and mechanisms of the virtual memory subsystem (e.g. by migrating and replicating shared pages) OS level NUMA memory management is an area of active research [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 13]. This body of work has demonstrated that dynamic page placement is indeed effective. In fact, the effectiveness question has been the focus of many of the previous studies that took the approach of proposing an algorithm and evaluating its performance (e.g. by comparing it against some static, ....
....distinct points; individual policies that captured various combinations of the large number of factors that we suspected might affect performance. Nearly fifty policies were tested, including (at least approximations of) most of the published policies, and the experimental results were reported in [9]. The DUnX policy space covered both pull and push based page movement, the collection and use of reference history information of various kinds and levels of detail, different mechanisms for triggering new placement decisions, different means of limiting excessive page movement (page ....
[Article contains additional citation context not shown here]
R. P. LaRowe Jr. and C. S. Ellis. Experimental comparison of memory management policies for NUMA multiprocessors. Technical Report CS-1990-10, Duke University, April 1990. To Appear in ACM Transactions on Computer Systems.
....recognized as a serious problem in several recent studies. Some of the earliest uses of the term false sharing appear in discussion of memory management for NUMA architectures where page granularity migration and replication are employed to take advantage of the faster local memory access times [4, 2, 16]. False sharing has been blamed as a cause of increased coherency overhead in multiprocessor hardware caches with increasing line sizes in workload characterization studies of shared memory reference patterns [12, 11]. Techniques for ameliorating the false sharing problem have also been proposed ....
Richard P. LaRowe and Carla S. Ellis. Experimental Comparison of Memory Management Policies for NUMA Multiprocessors. ACM Transactions on Computer Systems, 9(4):319--363, November 1991.
....architectures become more complex and the non-uniformity becomes less well hidden, system software must assume a larger role in providing memory management support for the programmer. The NUMAtic project at Duke has been addressing this problem by investigating the role of the operating system [15, 16, 18, 19, 20]. This is an area of active research by other groups as well [5, 6, 7, 8, 9, 12, 21, 23, 25]. In this paper, we describe a recent experience of designing an application to use the memory management capabilities provided in DUnX (Duke University nX, pronounced "ducks"), our locally developed ....
....to emphasize that this paper reports on a single case study of using the dynamic page placement support in DUnX. It is not intended to serve as justification for dynamic page placement in general or for DUnX specific features in particular. These issues have already been discussed by the authors [19] and others (e.g. [8, 12]). What distinguishes this paper from others in the literature is that it provides an in-depth account of how dynamic page placement can be used to improve the performance of a real UMA application, and compares it to the performance obtained through hand tuning of a NUMA ....
[Article contains additional citation context not shown here]
R. P. LaRowe Jr. and C. S. Ellis. Experimental comparison of memory management policies for NUMA multiprocessors. Technical Report CS-1990-10, Duke University, April 1990. Submitted.
.... Ramanathan and Ni conducted a study of critical factors in NUMA memory management reported in [25]. We have also investigated OS level NUMA memory management through experimentation, both with the USMR programming library for the BBN GP1000 and with our DUnX kernel for the BBN GP1000 and TC2000 [14, 15, 17, 18, 19, 20, 21]. The unique contribution of this current work in relation to previous research is the complementary use of 1) measurements based on a flexible parameterized policy implementation that can explore a wide range of policy behavior and 2) an experimentally validated analytic model. Our focus has been ....
....individual policies that captured various combinations of the large number of factors that we suspected might affect performance. Nearly fifty policies were tested using DUnX, including (at least approximations of) most of the published policies, and the experimental results were reported in [18]. Our early experiences allowed us to prune and consolidate techniques. It appeared that our further investigation of policy issues could best be formulated in the context of a single parameterized policy and studied by varying the parameter settings and measuring the effect on performance. This ....
R. P. LaRowe Jr. and C. S. Ellis. Experimental comparison of memory management policies for NUMA multiprocessors. ACM Transactions on Computer Systems, 9(4):319--363, November 1991.
....the locality issue by migrating or replicating a page of virtual storage to a node that appears to be accessing it frequently. The decision of whether to migrate or replicate a page depends on reference patterns and the type of reference (read or write) being performed. The DUnX operating system [9, 10, 8] is an example of an operating system that supports such dynamic page placement. DUnX is a kernel developed for the BBN Butterfly family of computers. DUnX supports dynamic page placement which means that a page may be migrated or replicated to different nodes in response to observed usage ....
....defined types may also be used to avoid the classic false sharing situation in which de facto private data from multiple processes are packaged together in the same page. 4 msort An Example Using DUSTy A collection of parallel programs comprise the test workload for evaluation of DUnX [9, 10]. We have chosen one of these, the msort program, as our example program. The reason for this choice is that its access patterns are relatively clean and easy to capture in DUSTy constructs, an attractive feature for a first case study. In previous experiments, it has also exhibited intermittent ....
R. P. LaRowe Jr. and C. S. Ellis. Experimental comparison of memory management policies for NUMA multiprocessors. ACM Transactions on Computer Systems, 1991. (to appear).
No context found.
R.P. Larowe, C.S. Ellis, Experimental comparisons of memory management policies for NUMA multiprocessors, ACM Transactions on Computer Systems 9 (4) (1991) 319--363.
No context found.
R. P. LaRowe and C. S. Ellis. Experimental comparison of memory management policies for numa multiprocessors. Technical report, Duke University, April 1990. CS-1990-10.