L. Iftode, J. P. Singh, and K. Li. Understanding Application Performance on Shared Virtual Memory Systems. In Proc. of the ISCA'96, pp. 122--133, May 1996.
....to which it is assigned. Given appropriate data structures, these applications are single writer at page granularity as well, and pages can be allocated among nodes such that writes to shared data are almost all local. The applications have different inherent and induced communication patterns [24,14], which affect their performance and the impact on nodes. Figure 1 shows the execution times of each application in both system configurations for 1, 4, 8, and 16 processors. We see that the current implementation of the first touch placement in CableS. Although this implementation results in ....
L. Iftode, J. P. Singh, and K. Li. Understanding application performance on shared virtual memory. In Proceedings of the 23rd International Symposium on Computer Architecture (ISCA), May 1996.
....far less invalidation traffic and offers FIFO servicing of lock requests to avoid potential starvation. 2.4 Applications Layer We use the SPLASH 2 [15, 9] application suite. A detailed classification and description of the application behavior for SVM systems with uniprocessor nodes is provided in [7]. 64 bit addressing and operating system limitations related to shared memory segments prevent some of the benchmarks from running at specific system configurations. We indicate these configurations in our results with N/A entries. The applications we use are: FFT, LU, Radix, Volrend, ....
....segments prevent some of the benchmarks from running at specific system configurations. We indicate these configurations in our results with N/A entries. The applications we use are: FFT, LU, Radix, Volrend, WaterSpatial, and SampleSort. We use both original versions of SPLASH 2 applications [7] and versions that have been restructured to improve their performance on SVM systems [9]. Finally, in this work we restructure FFT (FFTst) to stagger the transpose phase among processors within each node. This allows FFT to take advantage of the additional bandwidth available in the system. 3 ....
L. Iftode, J. P. Singh, and K. Li. Understanding application performance on shared virtual memory. In Proceedings of the 23rd International Symposium on Computer Architecture (ISCA), May 1996.
....software rather than hardware in tightly coupled systems such as the SGI Origin2000 discussed in Chapter 2. Studies on software coherent shared address space multiprocessors have largely used applications as they were written for hardware cache coherent machines. The performance evaluations so far [43, 22, 46, 61, 31, 89, 35, 8] point out that for certain classes of applications there is a large performance gap between hardware cache coherent and software coherent systems. However, it should be possible to modify or restructure applications to interact better with software coherence protocols and granularities, and to ....
....7, 63] All this research has helped narrow the performance gap between SVM on clusters and hardware DSM systems for an expanding range of applications at the 16 processor scale. Recent studies examine the performance of software shared memory clusters at significant scale (32 to 64 processors) [87, 35, 89, 80, 14]. In this chapter, we present the first scalability study of all software SVM on a modern cluster with 64 processors. Since clusters used for high performance computing typically use SMP nodes, we examine a 64 processor cluster composed of sixteen ....
L. Iftode, J. P. Singh, and K. Li. Understanding application performance on shared virtual memory systems. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 122--133, May 1996.
....which are the sharing patterns that are seen at word granularity. These usually reflect the actual sharing patterns seen on hardware cache coherent machines. These patterns interact with the large granularity of coherence in SVM (a page) to construct the induced sharing patterns of pages [34], which may be very different than the inherent patterns. These key patterns determine how applications interact with different system characteristics and granularities. Regular Applications LU performs the blocked LU factorization of a dense matrix. We begin with the ....
....The home copy is thus kept up to date. Upon a page fault following a causally related acquire operation, the entire page is fetched from the home [89]. The tradeoffs between home-based and traditional LRC protocols have been studied in the literature [46, 34, 89]. Overall, due to software management for communication and coherence, SVM systems suffer high costs in communication, protocol overhead, and synchronization as well as critical section dilations. Interactions such as false sharing and fragmentation can easily occur due to the large page ....
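The distinction drawn in this excerpt between inherent (word granularity) and induced (page granularity) sharing can be illustrated with a small sketch. It is not taken from any of the cited papers: the function name `induced_multi_writer_pages`, the `(processor, byte_address)` trace format, and the 4096-byte default page size are all assumptions made for illustration.

```python
from collections import defaultdict

def induced_multi_writer_pages(writes, page_size=4096):
    """Return the sorted page numbers written by more than one processor.

    `writes` is a hypothetical trace of (processor_id, byte_address) pairs.
    Even when every word has a single writer (the inherent pattern), two
    writers may land on the same page, which is the induced pattern.
    """
    writers = defaultdict(set)
    for proc, addr in writes:
        writers[addr // page_size].add(proc)  # map each write to its page
    return sorted(page for page, procs in writers.items() if len(procs) > 1)

# Each address below is written by exactly one processor.
trace = [(0, 0), (1, 8), (1, 4096)]
shared_at_page_grain = induced_multi_writer_pages(trace)
shared_at_word_grain = induced_multi_writer_pages(trace, page_size=8)
```

With the default page size, page 0 ends up with two writers although no word is shared; shrinking the coherence unit to 8 bytes removes the induced sharing in this toy trace.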
L. Iftode, J. P. Singh, and K. Li. Understanding application performance on shared virtual memory. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.
....the constants, the node to network bandwidth may become a bottleneck if it is not increased considerably when going from a uniprocessor to an SMP node. A recent and promising form of SVM protocols is the class of so called home based protocols (HLRC, AURC, Cashmere) [44, 45, 46, 100, 54, 91]. Chapter 3 describes a protocol for home based SVM across SMP nodes (HLRC SMP, AURC SMP) that accomplishes the goals above. The SVM protocol can operate completely in software, or can exploit hardware support for automatic update (AU) propagation of writes to remote memories (also called ....
....bringing it closer to that of full hardware coherence and making the shared address space model attractive for application users on clusters as well. The most popular form of hardware support used so far is the propagation of fine grained writes to remote memories [16, 50, 37]. Previous work [44, 46] has shown that the home based protocols with hardware support for automatic update (AURC) and without (HLRC) outperform previous all software protocols. It also demonstrated that support for automatic update is beneficial in systems with uniprocessor nodes and customized network interfaces. ....
[Article contains additional citation context not shown here]
L. Iftode, J. P. Singh, and K. Li. Understanding application performance on shared virtual memory. In Proceedings of the 23rd Annual Symposium on Computer Architecture, May 1996.
....in the communication layer. Radix exhibits both these problems to a heightened extent, as well as a lot of false write sharing due to the page granularity. Lock synchronization: Previous work has identified locks and their dilation to be a major performance problem for SVM for many applications [27, 28]. The restructured versions of these applications dramatically reduce the number of locks and hence their effect on performance. In GeNIMA the applications that suffer from high lock synchronization costs are the unrestructured Water nsquared and Barnes original. Both exhibit fine grain locking, and ....
L. Iftode, J. P. Singh, and K. Li. Understanding application performance on shared virtual memory. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.
....of parallel programs for shared address space multiprocessors. Although there are several studies which use the SPLASH 2 programs for performance evaluation, few have been done for Cholesky on clusters. Some studies on clusters of uniprocessors use Cholesky for evaluation of software DSM systems [6][7][8]. These works show that the efficient parallel execution of Cholesky is difficult due to high synchronization overheads. In this paper, we present parallel implementations and performance evaluation of Cholesky on an SMP cluster COMPaS. The original program is already parallelized using ....
L. Iftode, J. P. Singh and K. Li, Understanding Application Performance on Shared Virtual Memory Systems, In Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.
....literature on S DSM that has had an impact on the design of the Cashmere and Shasta systems. The focus of this paper is to understand the performance tradeoffs of fine grain vs. coarse grain software shared memory rather than to design or study a particular S DSM system in isolation. Iftode et al. [11] have characterized the performance and sources of overhead of a large number of S DSM applications, while Jiang et al. [12] have provided insights into the restructuring necessary to achieve good performance for a similar S DSM application suite. Our work builds on theirs by providing insight on ....
L. Iftode, J. P. Singh, and K. Li. Understanding Application Performance on Shared Virtual Memory. In Proceedings of the Twenty-Third Annual International Symposium on Computer Architecture, May 1996.
....components and custom devices. In simulation, their fastest system is only twice as slow as completely customised hardware dsm systems such as the Stanford FLASH system [82]. There is little clear understanding of the performance issues in shared memory and dsm systems. Iftode, et al. [101] simulate the performance of a number of applications on three different platforms: an all software lazy release consistent system, a release consistent system with hardware assistance, and an all hardware numa architecture. They conclude that the software dsm systems perform quite well for a ....
Liviu Iftode, Jaswinder Pal Singh, and Kai Li. Understanding Application Performance on Shared Virtual Memory Systems. In Proceedings of the 23rd Annual International Symposium on Computer Architecture [175], pages 122--133. Also available in Computer Architecture News 24(2), May 1996.
....may not be improved at one stage. This is because the relative size of a problem decreases when more computers or more powerful computers are added. This phenomenon corresponds to the finding at Princeton that DSMs perform surprisingly well for some applications, in particular for large problems [4]. N body In this experiment, 8192 bodies are considered and the number of tasks used in the simulation is four times the number of machines. For parallelism without contention, we use a loop based algorithm and arrange four arrays to temporarily store the result in each iteration. The final ....
L. Iftode, J.P. Singh and K. Li. Understanding Application Performance on Shared Virtual Memory Systems. In proc. of the 23rd Annual International Symposium of Computer Architecture, May 1996.
.... can provide a shared memory abstraction across the machines with the help of the virtual memory system [17]. For certain classes of applications, these software distributed shared memory (DSM) systems can deliver performance which is comparable to hardware cache-coherent machines of a similar scale [6, 11]. However, for applications with larger communication demands, the performance can be disappointing. 0 To appear in HPCA 4, February 1 4, 1998. A key stumbling block to achieving higher performance on software DSMs is the relatively large communication latency. In contrast with tightly coupled ....
....among 512 water molecules in liquid states across 9 time steps using an O(n^2) algorithm. WATER SP performs the same simulation as WATER NSQ except with 4096 water molecules and an O(n) algorithm. Further details on these applications can be found in studies by Woo et al. [26] and Liviu et al. [11]. 3. Prefetching We begin our study by focusing on prefetching alone. The idea behind prefetching is to use knowledge of future access patterns to bring remote data into the local memory before it is actually needed. In particular, we focus on software controlled prefetching, where explicit ....
L. Iftode, J. P. Singh, and K. Li. Understanding Application Performance on Shared Virtual Memory Systems. In Proceedings of the 23rd International Symposium on Computer Architecture, May 1996.
....literature on S DSM that has had an impact on the design of the Cashmere and Shasta systems. The focus of this paper is to understand the performance tradeoffs of fine grain vs. coarse grain software shared memory rather than to design or study a particular S DSM system in isolation. Iftode et al. [10] have characterized the performance and sources of overhead of a large number of applications under S DSM, while Jiang et al. [11] have provided insights into the restructuring necessary to achieve good performance under SDSM for a similar application suite. Our work builds on theirs by providing ....
L. Iftode, J. P. Singh, and K. Li. Understanding Application Performance on Shared Virtual Memory. In Proc. of the Twenty-Third ISCA, May 1996.
....begun to be used, such as Stanford SPLASH [48] and SPLASH2 [54], and most programs are too simple to be used to evaluate software shared virtual memory systems. Performance evaluation for SVM should use a wider range of applications in different classes with regard to sharing patterns at page grain [26]. As benchmarks mature, the summarizing methods need to be revisited too. Shi et al. [47] propose a new scheme which is based on the concept of confidence interval to summarize the evaluating results. Table 1: Summary of 11 Selected Representative SVM Systems. System ....
....of architectural support have already led to new areas for software shared memory research that have recently begun to be investigated. The time is now ripe to focus research attention primarily on the gap between the performance of software shared memory and hardware coherent shared memory [26], and how it might be further alleviated. Some of the key directions for research are: 1. Comparing page grained SVM systems with recently re-popularized fine grained software shared memory approaches that don't rely so heavily on relaxed consistency models. To determine which scheme is more ....
L. Iftode, J. P. Singh, and K. Li. Understanding application performance on shared virtual memory systems. In Proc. of the 23rd Annual Int'l Symp. on Computer Architecture (ISCA'96), pages 122--133, May 1996.
....cation at the page level [13, 33, 34, 35] or at the user defined object level [4, 5, 8, 16, 41, 47, 49]. Another useful abstraction is to define simple communication and synchronization mechanisms. A technique we call batch communication, based on the BSP model [43], restricts access to communicated data until after a barrier ....
.... providing a shared memory abstraction for a distributed system is shared virtual memory (SVM), which provides communication at the level of pages [35]. Though such systems can be efficient for some applications, performance is highly sensitive to the interaction of data structures and page sizes [33]. By sharing data at the page level, extensive false sharing can occur. This inefficiency can be substantially reduced by providing communication between processes at the level of user defined objects [4, 5, 8, 16, 47]. To our knowledge, none of these systems rely on the exclusive use of barrier ....
L. Iftode, J. P. Singh, and K. Li, "Understanding application performance on shared virtual memory systems," in Proceedings of the 23rd Annual International Symposium on Computer Architecture, pp. 122--133, May 1996.
....TreadMarks [4] makes use of the homeless protocol discussed in Section 2 to implement the lazy release consistency model. The popularity of the system has triggered various studies on its design and implementation. Coherence protocols have become one of the major fields of research in DSM. Iftode [13] has studied the homeless protocol in TreadMarks. He observed that requiring a page requester to gather diffs from different processors upon a page fault can be a potential performance bottleneck. Hence he proposed a new cache coherence protocol known as the Automatic Update Release Consistency ....
L. Iftode, J. P. Singh, and K. Li. Understanding Application Performance on Shared Virtual Memory Systems. In Proc. of the 23rd Annual International Symposium on Computer Architecture (ISCA'96), pages 122-133, May 1996.
....be larger on real SVM systems, where the overheads of access violations, i.e. page faults, are higher. 5.2 Detailed Analysis To understand the reasons for the performance differences, it is useful to classify the applications according to their data access patterns and synchronization behavior [29, 1, 12]. In this section, we will first describe application classifications according to the number of writers per coherence unit, spatial data access granularity and temporal synchronization granularity. We will then provide a detailed analysis for each ....
....the applications into single writer and multiple writer applications. Write write false sharing occurs only for multiple writer applications. Coarse grain vs. Fine grain data access Data access granularity affects how the communication to computation ratio changes with the coherence granularity [12]. Applications with coarse grain access tend to access a whole contiguous page at a time. Fine grain applications are likely to scatter reads and writes across multiple pages. Fine grain reads can introduce fragmentation with coarse coherence granularity and/or false sharing. Coarse grain vs. ....
[Article contains additional citation context not shown here]
L. Iftode, J. P. Singh, and Kai Li. Understanding Application Performance on Shared Virtual Memory. In Proceedings of the 23rd Annual Symposium on Computer Architecture, May 1996.
....is called its home. This makes update resolution in AURC extremely simple: a single full page fetch from the home. On the other hand, in standard LRC collections of updates (diffs) are distributed and homeless, making update resolution more difficult to perform. Extensive performance evaluation [16] has shown that AURC outperforms standard LRC in most cases. In this paper we propose two new home based LRC protocols. The first, Home based LRC (HLRC) is similar to the Automatic Update Release Consistency (AURC) protocol. Like AURC, HLRC maintains a home for each page to which all updates are ....
....may still need them. When implementing the protocol on a large scale machine, memory consumption can become a severe problem. To reduce memory consumption the shared virtual memory system must perform garbage collection frequently [21]. 2.2 Automatic Update Release Consistency The AURC protocol [16] implements Lazy Release Consistency without using any diff operations by taking advantage of the SHRIMP multicomputer's automatic update hardware mechanism [5, 6]. Automatic update provides write through communication between local memory and remote memory with zero software overhead. Writes to ....
[Article contains additional citation context not shown here]
L. Iftode, J. P. Singh, and Kai Li. Understanding Application Performance on Shared Virtual Memory. In Proceedings of the 23rd Annual Symposium on Computer Architecture, May 1996.
....the SPLASH 2 [22] application suite. This section briefly describes the basic characteristics of each application relevant to this study. A more detailed classification and description of the application behavior for SVM systems with uniprocessor nodes is provided in the context of AURC and LRC in [9]. The applications can be divided in two groups, regular and irregular. Table 2: Number of page faults, page fetches, local and remote lock acquires and barriers per processor per 10^7 cycles for each application for 1, 4 and 8 processors per node. Application Page Faults Page Fetches 1 4 8 1 4 ....
....among nodes such that writes to shared data are almost all local. In HLRC we do not need to compute diffs, and in AURC we do not need to use a write through cache policy. Protocol action is required only to fetch pages. The applications have different inherent and induced communication patterns [22,9], which affect their performance and the impact on SMP nodes. FFT: The all to all, read based communication in FFT is essentially a transposition of a matrix of complex numbers. We use two problem sizes, 256K(512x512) and 1M(1024x1024) elements. FFT has a high inherent communication to ....
L. Iftode, J. P. Singh, and K. Li. Understanding application performance on shared virtual memory. In Proceedings of the 23rd Annual Symposium on Computer Architecture, May
....6 A. Bilas, D. Jiang, and J.P. Singh 3. APPLICATIONS We use 10 applications from the SPLASH 2 [52] application suite (including different versions of the applications). A detailed classification and description of the application behavior for SVM systems with uniprocessor nodes is provided in [27]. The applications can be divided in two groups, regular and irregular. 3.1 Regular Applications The applications in this category are FFT [2; 52], LU [52], and Ocean [13; 48; 28]. Their common characteristic is that they are optimized to be single writer applications; i.e. a given word of data ....
....affect their performance and the impact on SMP nodes. 3.2 Irregular Applications The irregular applications in our suite are Barnes [3; 21; 46; 28], Radix [10; 23], Raytrace [47; 52], Volrend [39; 52; 28], and Water [52]. In this work we use both original versions of several SPLASH 2 applications [27] as well as versions that have been restructured to improve performance on SVM systems [28]. The same restructurings are found to be very important on large scale hardware coherent machines [29]. Thus, although they are often substantial and algorithmic, they are not specific to SVM. FFT, ....
L. Iftode, J. P. Singh, and Kai Li. Understanding application performance on shared virtual memory. In Proceedings of the 23rd International Symposium on Computer Architecture (ISCA), May 1996.
No context found.
L. Iftode, J. P. Singh, and K. Li. Understanding application performance on shared virtual memory. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.
.... note that in Volrend restructuring also greatly improves false sharing and fragmentation in the image at page granularity, and hence data wait time). Many characteristics of the applications relevant to SVM, including sharing patterns, message frequencies and message sizes, are described in [9, 22, 2]. For FG, not examined in [10], restructuring helps significantly in cases where application access granularity is made larger (e.g. Ocean) since it allows a larger granularity to be ....
....Ocean contiguous behaves similarly, although it is a regular application, due to fine grained (one element) remote accesses at column oriented partition boundaries in the near-neighbor calculations. When diff cost is a problem, hardware support for automatic write propagation [3] can eliminate diffs [12, 8, 9], at the potential cost of contention and/or code instrumentation; we might expect it to help substantially in Water nsquared and Radix. Finally, improving protocol costs halfway (C0P 1 2 , not shown in the figures) usually provides about half or less of the benefit of eliminating them (more on ....
L. Iftode, J. P. Singh, and K. Li. Understanding application performance on shared virtual memory. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.
.... note that in Volrend restructuring also greatly improves false sharing and fragmentation in the image at page granularity, and hence data wait time). Many characteristics of the applications relevant to SVM, including sharing patterns, message frequencies and message sizes, are described in [10, 22, 2]. For FG, not examined in [11], restructuring helps significantly in cases where application access granularity is made larger (e.g. Ocean) since it allows a larger granularity to be ....
....behaves similarly, although it is a regular application, due to fine grained (one element) remote accesses at column oriented partition boundaries in the near-neighbor calculations. When diff cost is a problem, hardware support for automatic write propagation [3] can eliminate diffs [13, 9, 10], at the potential cost of contention and/or code instrumentation; we might expect it to help substantially in Water nsquared and Radix. Finally, improving protocol costs halfway (C0P 1=2 , not shown in the figures) usually provides about half or less of the benefit of eliminating them (more on ....
L. Iftode, J. P. Singh, and K. Li. Understanding application performance on shared virtual memory. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.
....a clean version of the page at the beginning of a synchronization interval, word by word comparison to compute the diff, and applying the diff upon a page fault. In addition to their direct costs, these operations substantially pollute the primary cache and hurt application performance further [20]. 3.3.2 Home Based Protocols: An Intermediate Approach An intermediate form of laziness in data propagation is to propagate the modifications in two stages. A home node is selected for each shared page and modifications are propagated eagerly (at or before a release) to the home only [19]. This ....
.... to perform substantially better on this platform, with the performance gap increasing with the number of processors [47]. Similar results are seen in earlier, simulation-based comparisons between a home based protocol with some hardware support and a no home LRC for a wider range of applications [19, 20]. An alternative multiple writer scheme that shares some properties of home based write collection was used in [26] to implement an ERC protocol (the one in Table 1). On a release, invalidations are sent eagerly to sharers (ERC). Table columns: Benchmarks, LRC, HLRC. Rows: LU 11.5, 13.9; SOR 13, 22.7; Water Nsq 11.7, ....
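The three operations named in this excerpt (keeping a clean copy of the page, comparing word by word to compute the diff, and applying the diff on a page fault) can be sketched as follows. This is an illustrative sketch only, not the implementation of any cited system: the function names `make_twin`, `compute_diff`, and `apply_diff`, the 4-byte word size, and the `(offset, bytes)` diff encoding are assumptions.

```python
WORD = 4  # assumed word size in bytes

def make_twin(page):
    """Save a clean copy of the page at the start of a synchronization interval."""
    return bytes(page)

def compute_diff(twin, page):
    """Compare word by word and record (offset, new_word) for each changed word."""
    diff = []
    for off in range(0, len(page), WORD):
        word = bytes(page[off:off + WORD])
        if word != twin[off:off + WORD]:
            diff.append((off, word))
    return diff

def apply_diff(page, diff):
    """Patch a stale copy of the page in place, as done when servicing a page fault."""
    for off, word in diff:
        page[off:off + len(word)] = word

# A writer modifies one word of a 16-byte toy page during the interval.
page = bytearray(16)
twin = make_twin(page)
page[4:8] = b"\x01\x02\x03\x04"
diff = compute_diff(twin, page)

# A second node holding a stale copy brings it up to date from the diff.
stale = bytearray(16)
apply_diff(stale, diff)
```

Every call in the sketch touches the whole page or the whole diff, which gives a rough sense of why the excerpt describes these operations as costly.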
[Article contains additional citation context not shown here]
L. Iftode, J. P. Singh, and Kai Li. Understanding application performance on shared virtual memory. In Proceedings of the 23rd Annual Symposium on Computer Architecture, May 1996.
....from the home page to a non home page directly to bring the non home copy up to date. Such a data transfer requires very little software overhead, no memory copy, and no additional overhead at the receiving node. A previous paper compares the AURC protocol with a homeless diff based LRC protocol [13] using a simulator, but its evaluation has two limitations from the viewpoint of evaluating memory mapped communication and AU support. First, the protocols it compares differ not only in the update propagation mechanism used (diffs versus hardware supported automatic update) but also in the type ....
....higher end point overhead (and perhaps endpoint contention) in HLRC due to diff computation and bursty propagation, has not been evaluated using an actual implementation because no AU implementation of this type has existed until now. Previous evaluations of AURC were performed using simulations [12, 13] and the software only protocol used for comparison was a homeless LRC protocol. This paper is the first evaluation of the actual performance benefits of AU support for a home based LRC protocol on a real SHRIMP system. 4 Performance Evaluation 4.1 Applications To evaluate the performance of ....
[Article contains additional citation context not shown here]
L. Iftode, J. P. Singh, and Kai Li. Understanding Application Performance on Shared Virtual Memory. In Proceedings of the 23rd Annual Symposium on Computer Architecture, May 1996.
....over total translation lookups) because page pinning cost is not a linear function with respect to the number of pages pinned in a system call. The applications fall in two categories based on their communication patterns: regular, which include FFT and LU, and irregular, which include the rest [25, 34]. Even the simple sequential policy is very effective for most applications. The only exception is FFT which performs a lot of unnecessary pinning/unpinning with 16 page prepinning. FFT is a regular application with a strided access pattern such that it does not access most of the pages that are ....
L. Iftode, J. P. Singh, and Kai Li. Understanding application performance on shared virtual memory. In Proceedings of the 23rd Annual Symposium on Computer Architecture, May 1996.
.... not only in hardware support but also in the manner in which they propagate changes and solve the multiple writer problem (i.e. in the protocol layer). Essentially, every shared page has a home node, and writes observed to a page are propagated to the home at a fine granularity in hardware [13, 15, 21], without interrupting the processor at the home. In the automatic write propagation (also called automatic update or AU) approach, shared pages are mapped write through in the caches so that writes can be snooped off the memory bus. When a node incurs a page fault, the page fault handler retrieves ....
....write propagation (also called automatic update or AU) approach, shared pages are mapped write through in the caches so that writes can be snooped off the memory bus. When a node incurs a page fault, the page fault handler retrieves the page from the home where it is guaranteed to be up to date [13, 15]. Data are kept consistent according to a page based software consistency protocol such as lazy release consistency [20]. Thus, consistency is maintained at page granularity, while there is some hardware support for fine grained communication. The SHRIMP system [6] provides both the write snooping ....
[Article contains additional citation context not shown here]
L. Iftode, J. P. Singh, and K. Li. Understanding application performance on shared virtual memory. In Proceedings of the 23rd Annual Symposium on Computer Architecture, May 1996.
....has also been explored, most recently in the Shasta system [18, 6] on a small scale cluster of SMPs. For the scalability of uniprocessor node systems, a simulation study compared a hardware supported home based protocol (the AURC protocol) with a non home based, original Treadmarks style protocol [9]. It examined both 16- and 32-processor clusters, and studied the impact of problem size. However, it used rather small problem sizes, did not examine application restructuring, and was limited by not being performed on a real system. Another study examined the all software home based HLRC and the ....
L. Iftode, J. P. Singh, and K. Li. Understanding application performance on shared virtual memory systems. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 122--133, May 1996.
....Progress was slow until the release consistency (RC) model [GLL 90] breathed new life into the software .... [Figure residue: diagram of SVM research areas. Labels in order of appearance: ScC, EC, DC, Heterogeneous DSM, Local Access Control, RC, SC, Home based LRC, SW LRC, Adaptive LRC, LRC, ERC, Fine grain, Protocols, Applications, Architectural Support, Home based ScC, Multicast based ScC, Consistency Models, Software dirty bits, Remote Operations. Citation keys shown: BZ91, ISL96c, CBZ91, DWB91, Li92, BH90, Li86, GLL90, IDFL96, ZIL96, ISL96a, JSS97, Kel96, KCZ92, SFL94, SB97, SZB94, DCZ96, CF89, PL93, KS96b.] ....
.... of SVM research indicated in Figure 2: a new consistency model called scope consistency [ISL96c], home based LRC protocols [IDFL96, ZIL96], automatic update architectural support for SVM [IDFL96, BAB 98, IBD 98], and application classification with respect to their behavior under SVM [ISL96a]. In the following sections I review the most important research results in these areas, summarize comparative performance results from literature and identify the lessons learned so far and the key outstanding questions. 1.4 Relaxed Consistency Models A memory consistency model defines ....
[Article contains additional citation context not shown here]
L. Iftode, J. P. Singh, and Kai Li. Understanding Application Performance on Shared Virtual Memory. In Proceedings of the 23rd Annual Symposium on Computer Architecture, May 1996.
....showing that multiple writer protocols are indeed valuable for irregular applications under SVM. 5.2 Detailed Analysis To understand the reasons for the performance differences, it is useful to classify the applications according to their data access patterns and synchronization behavior [31, 1, 12]. In this section, we will first describe application classifications according to the number of writers per coherence unit, spatial data access granularity, and temporal synchronization granularity. We will then provide a detailed analysis for each category of applications. 5.2.1 ....
....applications into single writer and multiple writer applications. Write write false sharing occurs only for multiple writer applications. • Coarse grain vs. Fine grain data access: Data access granularity affects how the communication to computation ratio changes with the coherence granularity [12]. Applications with coarse grain access tend to access a whole contiguous page at a time. Fine grain applications are likely to scatter reads and writes across multiple pages. Fine grain reads can introduce fragmentation with coarse coherence granularity and false sharing. • Coarse grain vs. ....
[Article contains additional citation context not shown here]
L. Iftode, J. P. Singh, and Kai Li. Understanding Application Performance on Shared Virtual Memory. In Proceedings of the 23rd Annual Symposium on Computer Architecture, May 1996.
....all software protocols not only in hardware support but also in the manner in which they propagate changes and solve the multiple writer problem. Essentially, every shared page has a home node, and writes observed to a page are automatically propagated to the home at a fine granularity in hardware [11, 13, 19]. Shared pages are mapped write through in the caches so that writes appear on the memory bus. When a node incurs a page fault, the page fault handler retrieves the page from the home where it is guaranteed to be up to date. Data are kept consistent according to a page based software consistency ....
....definition. Alternative definitions are also possible. 4 Applications We use the SPLASH 2 [26] application suite in our evaluation. A more detailed classification and description of the application behavior for SVM systems with uniprocessor nodes is provided in the context of AURC and LRC in [13]. Here we describe only the characteristics of greatest relevance to this study. The applications can be divided in two groups, regular and irregular. The regular applications are FFT, LU and Ocean. Their common characteristic is that when structured for good performance on shared memory they are ....
[Article contains additional citation context not shown here]
L. Iftode, J. P. Singh, and K. Li. Understanding application performance on shared virtual memory systems. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 122--133, May 1996.
.... not only in hardware support but also in the manner in which they propagate changes and solve the multiple writer problem (i.e. in the protocol layer). Essentially, every shared page has a home node, and writes observed to a page are propagated to the home at a fine granularity in hardware [13, 15, 21], without interrupting the processor at the home. In the automatic write propagation (also called automatic update or AU) approach, shared pages are mapped write through in the caches so that writes can be snooped off the memory bus. When a node incurs a page fault, the page fault handler ....
....write propagation (also called automatic update or AU) approach, shared pages are mapped write through in the caches so that writes can be snooped off the memory bus. When a node incurs a page fault, the page fault handler retrieves the page from the home where it is guaranteed to be up to date [13, 15]. Data are kept consistent according to a page based software consistency protocol such as lazy release consistency [20]. Thus, consistency is maintained at page granularity, while there is some hardware support for fine grained communication. The SHRIMP system [6] provides both the write snooping ....
[Article contains additional citation context not shown here]
L. Iftode, J. P. Singh, and K. Li. Understanding application performance on shared virtual memory. In Proceedings of the 23rd Annual Symposium on Computer Architecture, May 1996.
....can divide the applications in single writer and multiple writer. Write write false sharing occurs only for multiple writer applications. • Coarse grain vs. Fine grain data access: Data access granularity affects how the communication to computation ratio changes with the coherence granularity [11]. Applications with coarse grain access tend to access a whole contiguous page at a time. Fine grain applications are likely to scatter reads and writes across multiple pages. Fine grain reads can introduce fragmentation with coarse coherence granularity and/or false sharing. • Coarse grain vs. ....
L. Iftode, J. P. Singh, and Kai Li. Understanding Application Performance on Shared Virtual Memory. In Proceedings of the 23rd Annual Symposium on Computer Architecture, May 1996.
....The block size of 16 is selected to maintain good load balance. In the experiments, eight processors of JIAJIA obtained a speed up of 2.65 for a 1024×1024 matrix, and 5.96 for a 3072×3072 matrix. This indicates that the best problem size for an SVM system is medium or large, as described in [16]. For small problems, the communication to computation ratio is high, and the performance gained from multiple processors cannot offset the large communication overhead. Despite this, JIAJIA still does better than CVM in both absolute execution time and speedups. This strongly validates the ....
L. Iftode, J. Singh, and K. Li, "Understanding Application Performance on Shared Virtual Memory Systems", in Proceedings of the International Symposium on Computer Architecture, pp. 122-133, May 1996.
....single writer [25]. And at least on the Paragon platform, the home based software multiple writer protocol seems to often substantially outperform the original, non home based one. A simulation based comparison of home based with architectural support versus non home based LRC can be found in [17]. Two recent papers [2, 21] independently propose schemes to improve migratory sharing (the third problem above) by recognizing this sharing pattern and treating it differently. For pages that exhibit migratory sharing and not write write false sharing within a synchronization interval, there is ....
....the significant differences between the prototype results and results obtained in simulations that assume hardware remote read support, illustrating the dependence. Recall that other studies support the advantage of home based protocols both with and without AU support over original TreadMarks LRC [37, 17]. 2.4.3 Communication Parameters In addition to the above forms of support, comparisons between protocols are also affected greatly by the communication mechanisms and parameter values used, and studies should examine these effects as well. In a typical SVM implementation (without a dedicated ....
[Article contains additional citation context not shown here]
L. Iftode, J. P. Singh, and Kai Li. Understanding Application Performance on Shared Virtual Memory. In Proceedings of the 23rd Annual Symposium on Computer Architecture, May 1996.
....software communication and synchronization costs can be high as can protocol overhead, and the performance potential of this approach across a wide range of applications is not well understood. Previous research has studied parallel application performance on particular shared memory systems [11, 12, 7, 5, 17, 2, 14, 20]. Studies on shared virtual memory have largely used applications as they were written for hardware cache coherent machines. The performance evaluations so far point out that SVM is very sensitive to data referencing and communication patterns, and that for certain classes of applications there is ....
....use both regular and irregular applications. Second, we explore applications with a range of behaviors: different inherent communication and data referencing patterns, and different access granularities to data that interact with SVM page granularity to produce different induced sharing patterns [14]. By fine grained access we mean that the accesses to data by a process are not highly spatially contiguous, which usually implies that accesses from different processes (at least compared to page size) are interleaved at quite fine granularity in the address space. As per the classification in ....
[Article contains additional citation context not shown here]
Iftode L., Singh J. P., and Li K. Understanding Application Performance on Shared Virtual Memory Systems. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 122--133, May 1996.
....is called its home. This makes update resolution in AURC extremely simple: a single full page fetch from the home. On the other hand, in standard LRC collections of updates (diffs) are distributed and homeless, making update resolution more difficult to perform. Extensive performance evaluation [16] has shown that AURC outperforms standard LRC in most cases. In this paper we propose two new home based LRC protocols. The first, Home based LRC (HLRC) is similar to the Automatic Update Release Consistency (AURC) protocol. Like AURC, HLRC maintains a home for each page to which all updates are ....
....may still need them. When implementing the protocol on a large scale machine, memory consumption can become a severe problem. To reduce memory consumption the shared virtual memory system must perform garbage collection frequently [21]. 2.2 Automatic Update Release Consistency The AURC protocol [16] implements Lazy Release Consistency without using any diff operations by taking .... [Figure text omitted: panel (a) Standard LRC, a timeline between Node 0 and Node 1 showing Acquire(l), Write(x), Release(l), make twin, diff, and apply steps] ....
[Article contains additional citation context not shown here]
L. Iftode, J. P. Singh, and Kai Li. Understanding Application Performance on Shared Virtual Memory. In Proceedings of the 23rd Annual Symposium on Computer Architecture, May 1996.
....the SPLASH 2 [22] application suite. This section briefly describes the basic characteristics of each application relevant to this study. A more detailed classification and description of the application behavior for SVM systems with uniprocessor nodes is provided in the context of AURC and LRC in [9]. The applications can be divided in two groups, regular and irregular. 4.1 Regular Applications The applications in this category are FFT, LU and Ocean. Their common characteristic is that they are optimized to be single writer applications; a given word of data is written only by the processor ....
....among nodes such that writes to shared data are almost all local. In HLRC we do not need to compute diffs, and in AURC we do not need to use a write through cache policy. Protocol action is required only to fetch pages. The applications have different inherent and induced communication patterns [22, 9], which affect their performance and the impact on SMP nodes.
Application            | Page Faults (1, 4, 8)  | Page Fetches (1, 4, 8) | Local Lock Acquires (1, 4, 8) | Remote Lock Acquires (1, 4, 8) | Barriers (1,4,8)
FFT (20)               | 397.12, 251.89, 270.32 | 393.31, 167.17, 91.59  | 0.00, 0.00, 0.00              | 0.00, 0.00, 0.00               | 1.14
LU (contiguous) (512)  | 81.36 ....
L. Iftode, J. P. Singh, and K. Li. Understanding application performance on shared virtual memory. In Proceedings of the 23rd Annual Symposium on Computer Architecture, May 1996.
....the remote pages. The Automatic Update data bypass the I/O bus and are sent to the I/O card through a private bus. The Automatic Update mechanism is transparent to the user. It is possible to use such minimal hardware support to build an SVM layer that outperforms all software SVM implementations [6, 7]. One such protocol, called AURC (Automatic Update Release Consistency), uses lazy release consistency [10] together with the automatic update mechanism to implement a multiple writer protocol [6]. A similar approach using a directory based scheme has been taken on other systems that provide ....
....the AURC protocol to use hardware coherent SMPs as the basic nodes, and examines the issues in this process. Then, it examines the performance of a few applications with different communication requirements for which the performance on uniprocessor node LRC and AURC systems is well understood [7]. It also compares two extended protocols (MP AURC and MP LRC) with different communication traffic requirements. We find that clustering in SMP nodes does indeed reduce the amount of SVM communication per processor (e.g. page faults) but causes increased aggregate bandwidth requirements per ....
[Article contains additional citation context not shown here]
L. Iftode, J. P. Singh, and Kai Li. Understanding Application Performance on Shared Virtual Memory. In Proceedings of the 23rd Annual Symposium on Computer Architecture, May 1996.
.... for the propagation of writes at fine granularity (a word or a cache line, say) to a remotely mapped page of memory [4, 13]. This facility can be used to accelerate home based protocols by eliminating the need for diffs, leading to a protocol called automatic update release consistency or AURC [16]. Now, when a processor writes to pages that are remotely mapped (i.e. writes to a page whose home memory is remote), these writes are automatically propagated in hardware and merged into the home page, which is thus always kept up to date. At a release, a processor simply needs to ensure that ....
....and protocol data and messages are much smaller than under standard LRC. Studies on different platforms have indicated that home based protocols outperform traditional LRC implementations, at least on the platform and applications tested, and also incur much smaller memory overhead [16, 30]. Having understood the basic protocol ideas, let us proceed to the main goal of this paper, to examine how and how well the protocols can be used to extend a coherent shared address space in software across SMP nodes. 3. Extending Home based Protocols to SMP Nodes. 3.1. Protocol Design. Consider ....
[Article contains additional citation context not shown here]
L. Iftode, J. P. Singh, and K. Li, Understanding application performance on shared virtual memory, in Proceedings of the 23rd Annual Symposium on Computer Architecture, May 1996.
....to the owner's copy (b) communication from the owner's copy to the local copies. Figure 8: Basic Communication Schemes in AURC. ....the network itself is not simulated. We compare the performance of ScC against AURC as implemented on a SHRIMP like machine. Results from previous studies [10, 11] show that AURC substantially outperforms the all software LRC protocol running on the same hardware (the latter does not exploit the automatic update feature). Parameter: Value. Processor clock: 60 MHz. Page size: 4 Kbytes. Data cache: 256 Kbytes. Cache line size: 32 bytes. Write buffer size: 4 words. Memory ....
....the increase in protocol overhead and communication. 6 All software Implementation for ScC The hardware support for automatic update is very valuable for shared virtual memory independent of scope consistency [10], and several shared virtual memory systems have been built using this feature (see [10, 15, 11]). We have seen that scope consistency can be easily built on top of this in software, and has performance benefits. However, it is also interesting to see if ScC can be built on top of an all software protocol, without automatic update support. One way to do this is to emulate the AURC protocol ....
L. Iftode, J. P. Singh, and Kai Li. Understanding Application Performance on Shared Virtual Memory. In Proceedings of the 23rd Annual Symposium on Computer Architecture, May 1996.
....be larger on real SVM systems, where the overheads of access violations, i.e. page faults, are higher. 5.2 Detailed Analysis To understand the reasons for the performance differences, it is useful to classify the applications according to their data access patterns and synchronization behavior [29, 1, 12]. In this section, we will first describe application classifications according to the number of writers per coherence unit, spatial data access granularity and temporal synchronization granularity. We will then provide a detailed analysis for each category of applications. 5.2.1 Classification ....
....[Figure 1, panel (l) Barnes; series: Original, SC, SW LRC, HLRC] Figure 1: Speedups on T0 with 16 nodes. Some numbers are missing because in those cases performance is dramatically affected by the disk swapping and becomes irrelevant for this study. ....computation ratio changes with the coherence granularity [12]. Applications with coarse grain access tend to access a whole contiguous page at a time. Fine grain applications are likely to scatter reads and writes across multiple pages. Fine grain reads can introduce fragmentation with coarse coherence granularity and/or false sharing. • Coarse grain vs. ....
[Article contains additional citation context not shown here]
L. Iftode, J. P. Singh, and Kai Li. Understanding Application Performance on Shared Virtual Memory. In Proceedings of the 23rd Annual Symposium on Computer Architecture, May 1996.
....Consistency used with a home based protocol in which modifications are eagerly propagated to the home node for a page whether supported by hardware automatic update (AU) or in software. These home based protocols have been shown to substantially outperform earlier distributed approaches to LRC [17, 33]. To understand the performance implications we implemented an AU based ScC and an AURC protocol within the TangoLite simulation framework [15]. We conducted detailed simulation studies with five Splash 2 [31] applications, as well as a synthetic benchmark designed to emphasize a communication ....
....two important consistency models which are supported in software shared memory systems. 2.1 Release Consistency The Release Consistency (RC) model was initially introduced for hardware cache coherent multiprocessors [11]. It has since become the cornerstone of software shared virtual memory systems [8, 20, 19, 17, 21]. Release consistency distinguishes between ordinary memory accesses and synchronization accesses defined as either acquire or release. To make this possible the program must be properly labeled [11] (Figure 1). Only the synchronization accesses are strictly ordered (either sequentially [22] or ....
[Article contains additional citation context not shown here]
L. Iftode, J. P. Singh, and Kai Li. Understanding Application Performance on Shared Virtual Memory. In Proceedings of the 23rd Annual Symposium on Computer Architecture, May 1996.
.... locks with the relaxed software protocol (to reduce false sharing at page granularity which would otherwise have destroyed performance anyway). Also there is tremendous serialization at locks because critical sections are greatly dilated by expensive page faults and protocol activity within them [14]. The ORIG, ORIG LOCAL, and UPDATE algorithms use a lot of locking, so they perform very poorly (see the end of section 4 for measurements of dynamic lock counts). Since the PARTREE version is more coarse grained and needs less locking, it performs better. However, the only algorithm that performs ....
L. Iftode, J. P. Singh, K. Li, "Understanding Application Performance on Shared Virtual Memory Systems", In Proceedings of the 23rd International Symposium on Computer Architecture, May 1996.
....application suite to evaluate the HLRC SMP protocol. We now briefly describe the basic characteristics of each application that are relevant to the use of SMP nodes. A more detailed classification and description of the application behavior for SVM systems with uniprocessor nodes is presented in [12]. The applications can be divided in two groups, regular and irregular. The regular applications are FFT, LU and Ocean. Their common characteristic is that they are single writer applications: a given word of data is written only by the processor to which it is assigned. Given appropriate data ....
....well and pages can be allocated among nodes such that writes to shared data are mostly local. There is no need to compute diffs for these applications. Protocol action is required only to fetch pages. The applications have different inherent and induced (at page granularity) communication patterns [24, 12], which affect their performance and the impact on SMP nodes. FFT: The all to all, read based communication in FFT is essentially a transposition of a matrix of complex numbers. LU: The contiguous version of LU is single writer and exhibits a very small communication to computation ratio, but it ....
[Article contains additional citation context not shown here]
L. Iftode, J. P. Singh, and K. Li. Understanding application performance on shared virtual memory. In Proceedings of the 23rd Annual Symposium on Computer Architecture, May 1996.
No context found.
L. Iftode, J. P. Singh, and K. Li. Understanding Application Performance on Shared Virtual Memory Systems. In Proc. of the ISCA'96, pp. 122--133, May 1996.
No context found.
L. Iftode, J. Pal Singh, K. Li. Understanding application performance on shared virtual memory systems. In Proceedings of the 23rd annual international symposium on Computer architecture, pp. 122-133, 1996.
No context found.
L. Iftode, J. P. Singh, and Kai Li. Understanding application performance on shared virtual memory. In Proceedings of the 23rd Annual Symposium on Computer Architecture, May 1996.
No context found.
L. Iftode, J.P. Singh, and K. Li. Understanding Application Performance on Shared Virtual Memory Systems. In Proceedings of the Annual International Symposium on Computer Architecture, pages 122-133, May 1996.
No context found.
L. Iftode, J. Singh, and K. Li, "Understanding Application Performance on Shared Virtual Memory Systems", in Proceedings of the International Symposium on Computer Architecture, pp. 122-133, May 1996.