P. Keleher and C.-W. Tseng, "Enhancing Software DSM for Compiler-Parallelized Applications," in IPPS, August 1997.
....brings large performance penalties. This is a valid concern not only for DSM machines, since large shared memory machines also have a home node concept. However, home node migration would probably make allocation considerations superfluous. 6. RELATED WORK Most DSM systems are either page based [17, 20, 19] or object based [4, 5, 16] while discarding transparency. Jackal manages pages to implement a shared address space in which regions are stored. This allows shared data to be named by virtual addresses to avoid software address translation. For cache coherence, however, Jackal uses small, ....
P. Keleher and C. Tseng. Enhancing software DSM for compiler parallelized applications. In Proceedings of the 11th International Parallel Processing Symposium, April 1997.
....brings large performance penalties. This is a valid concern not only for DSM machines, since large shared memory machines also have a home node concept. However, home node migration would probably make allocation considerations superfluous. 6. RELATED WORK Most DSM systems are either page based [15, 18, 17] or object based [2, 3, 14] while discarding transparency. Jackal manages pages to implement a shared address space in which regions are stored. This allows shared data to be named by virtual addresses to avoid software address translation. For cache coherence, however, Jackal uses small, ....
P. Keleher and C. Tseng. Enhancing software DSM for compiler parallelized applications. In Proceedings of the 11th International Parallel Processing Symposium, April 1997.
....elimination [2]. So far, these advanced analyses have not been used for explicit parallel shared memory programs. Existing research about cooperation between optimizing compilers and software DSM can be divided into three kinds. The first is that a parallelizing compiler targets software DSM [21, 13, 24]. For parallelizable programs, the compiler can use precise communication information. Message vectorization is applicable to regular communication. The compiler can use code generation techniques for the inspector-executor mechanism. Software DSM does not require complex code generation for ....
P. Keleher and C. Tseng. Enhancing Software DSM for Compiler-Parallelized Applications. In Proc. of the 11th International Parallel Processing Symp., Mar. 1996.
....one dimensional counterpart. Unfortunately, two dimensional distributions are not efficiently supported in software distributed shared memory (DSM) systems. Such systems provide an attractive shared memory programming model on a distributed machine [LH89, CBZ91, KDCZ94, Ift98, LCD 97, CDLZ97, KT97]. In a software DSM, two dimensional distributions can cause a large amount of excess communication [footnote text: This work was supported by National Science Foundation CAREER Grants CCR 9733063 and CCR 9876073. Department of Computer Science, The University of Georgia, Athens, GA 30602.] ....
Pete Keleher and Chau-Wen Tseng. Enhancing software DSM for compiler-parallelized applications. In Proceedings of the 11th International Parallel Processing Symposium, April 1997.
....uses a combination of compile time and runtime techniques to determine the placement of objects. Many techniques exist that are related to ours. Early work on runtime techniques for data placement includes Lucco's implementation of Linda, which dynamically monitored the usage of tuples [31]. CVM [25] is a software DSM (similar to TreadMarks) that has been used as a target for a parallelizing compiler. CVM uses a hybrid invalidate update protocol. The compiler determines which pages have a communication behavior for which an update protocol is better than an invalidation protocol (which is the ....
P. Keleher and C-W. Tseng, "Enhancing Software DSM for Compiler-Parallelized Applications," Proc. 11th Int. Parallel Processing Symposium, Geneva, Switzerland, pp. 490-499 (April 1997).
....into the node's memory by sending a message to the node that has the page. Hence, internode communication is done implicitly by the DSM system in a manner similar to how disk I/O is managed in paged operating systems. Recently, DSM systems have received attention as attractive compiler targets [KT97, CDLZ97]. This is because it is much easier to generate code without having to determine at compile time whether data is local or not. Furthermore, when communication patterns cannot be determined at compile time, compilers often have to generate code with all-to-all communication, which is ....
Pete Keleher and Chau-Wen Tseng. Enhancing software DSM for compiler-parallelized applications. In Proceedings of the 11th International Parallel Processing Symposium, April 1997.
....communication in DSM systems. However, their work primarily concerns local (single phase or two adjacent phases) data distribution. Finally, many have studied integrated compiler DSM systems with a focus on elimination of as many consistency actions as possible using compiler information [KT97, LCD 97, CDLZ97, MHS94]. 6 Summary We have described the design and implementation of an integrated compiler run time system for global data distribution in distributed shared memory (DSM) systems. The SUIF Adapt system efficiently supports a larger class of applications than previous ....
Pete Keleher and Chau-Wen Tseng. Enhancing software DSM for compiler-parallelized applications. In Proceedings of the 11th International Parallel Processing Symposium, April 1997.
....shared memory parallelizing compilers with software distributed shared memory (DSM) systems that provide a coherent shared address space in software. Shared memory parallelizing compilers are easy to use, flexible, and can accept a wide range of applications. Results from several recent studies [1, 5] indicate they can approach the performance of current message passing compilers or explicitly parallel message passing programs on distributed memory machines. Unfortunately, load imbalance and synchronization overhead were identified as sources of inefficiency in compiler parallelized programs. ....
....measurements that reducing idle time caused by synchronization is important for achieving good performance. In this paper we investigate a number of compiler techniques for reducing synchronization overhead and load imbalance. Our techniques are evaluated in a prototype compiler runtime system [5] using the CVM [3] software DSM as a compilation target for the SUIF [2] shared memory parallelizing compiler. We develop a number of new techniques, including 1) eliminating barriers by inexpensively detecting communication using local subscript analysis, 2) exploiting lazy release consistency to ....
[Article contains additional citation context not shown here]
P. Keleher and C.-W. Tseng. Enhancing software DSM for compiler-parallelized applications. In Proceedings of the 11th International Parallel Processing Symposium, Geneva, Switzerland, April 1997.
....varies from 2 to 20 . In general, two factors account for the difference. The compiler generated shared memory programs have excess synchronization and additional data communication. The latter is because there is less processor locality in the programs' data access patterns. Keleher and Tseng [10] perform a similar study which also compares the performance of compiler generated DSM programs with compiler generated message passing programs. Instead of using commercial Fortran compilers to compile all the programs, they use the Stanford SUIF [1] parallelizing compiler version 1.0 to generate ....
P. Keleher and C. Tseng. Enhancing software DSM for compiler-parallelized applications. In Proceedings of the 11th International Parallel Processing Symposium, 1997.
....machine, with a single application thread per process. The runtime uses kernel threads. This is the base case and approximates the typical environment used in previous work on DSMs, e.g., TreadMarks has been studied on ATM networked DECstation 5000 240s [2], and CVM results were presented on the IBM SP 2 [13]. P4T4 KT: four processes, one per machine with four application threads per process. Kernel threads are used throughout. Multiple application threads can be scheduled across processors in this case, and multiple requests can be handled by the DSM server thread. P4T4 UT: same as above, but using ....
P. Keleher and C.-W. Tseng. Enhancing Software DSM for Compiler-Parallelized Applications. In Proceedings of International Parallel Processing Symposium, August 1997.
....of such programs differ, many have highly regular sharing behaviors. The set of shared data accessed by individual threads is often invariant from one iteration to the next. This regular behavior can be used by DSMs to predict future accesses, and to move data in advance of subsequent accesses [4, 11, 12]. Such update protocols allow much of the latency of remote data fetches to be hidden. Given reasonably efficient communication, DSMs should be able to achieve good speedups on such applications. The output of parallelizing compilers, such as SUIF [13], is a good source for this type of ....
....memory environments requires far less analysis. Further, the set of applications that can currently be analyzed well enough to turn into a shared memory application is much larger than for message passing applications. By combining parallelizing technology with sophisticated runtime systems [3, 4, 12], we can create a programming environment that is flexible and easy to use. Scientists are not required to write message passing programs or use data parallel languages such as HPF. Instead, they can write sequential programs, rewriting a few computation intensive procedures, and adding ....
[Article contains additional citation context not shown here]
C.-W. Tseng and P. Keleher, "Enhancing Software DSM for Compiler-Parallelized Applications," in 11th International Parallel Processing Symposium, 1997.
No context found.
P. Keleher and C.-W. Tseng. Enhancing software DSM for compiler-parallelized applications. In Proceedings of the 11th International Parallel Processing Symposium, Geneva, Switzerland, April 1997.
....performance in software DSM systems: compiler directed coherence protocols and compile time synchronization elimination. Software DSMs typically use a lazy invalidate protocol to maintain coherence in shared memory. We found compilers can identify opportunities for exploiting customized protocols [24]. A reduction protocol can be applied to shared data modified through associative operations (e.g. addition) to combine results efficiently. A flush update protocol can be used at barriers to aggregate and pre send data communicated between processors in fixed patterns. When used together in ....
P. Keleher and C.-W. Tseng. Enhancing software DSM for compiler-parallelized applications. In Proceedings of the 11th International Parallel Processing Symposium, Geneva, Switzerland, April 1997.
....of such programs differ, many have highly regular sharing behaviors. The set of shared data accessed by individual threads is often invariant from one iteration to the next. This regular behavior can be used by DSMs to predict future accesses, and to move data in advance of subsequent accesses [1 3]. Such update protocols allow much of the latency of remote data fetches to be hidden. Given reasonably efficient communication, DSMs should be able to achieve good speedups on such applications. The output of parallelizing compilers, such as SUIF [4], is a good source for this type of ....
....memory environments requires far less analysis. Further, the set of applications that can currently be analyzed well enough to turn into a shared memory application is much larger than for message passing applications. By combining parallelizing technology with sophisticated runtime systems [1, 3, 5], we can create a programming environment that is flexible and easy to use. Scientists are not required to write message passing programs or use data parallel languages such as HPF. Instead, they can write sequential programs, rewriting a few computation intensive procedures, and adding ....
[Article contains additional citation context not shown here]
C.-W. Tseng and P. Keleher, "Enhancing Software DSM for Compiler-Parallelized Applications," in 11th International Parallel Processing Symposium, 1997.
....gather nonlocal data. A second approach is to combine shared memory compilers (e.g. SUIF [11]) with software DSM systems (e.g. TreadMarks [30], CVM [24]) which provide a shared memory interface. Software DSMs are less efficient than explicit messages, but are much simpler compilation targets [4,25]. In this paper, we examine existing approaches to parallelizing irregular reductions, and propose a new efficient algorithm. 1.1 Irregular Reductions We begin by looking at the example irregular reduction shown in Figure 1. The computation loops over the edges of an irregular graph, computes a ....
....is inside a time step loop t with many repetitions. The number of time steps executed is a function of the application, but is usually quite large. Iterative computations are a boon to software DSMs, which can take advantage of repeated communication patterns to predict prefetches to nonlocal data [25,39]. Second, many irregular scientific computations are adaptive, where the data access pattern may change over time as the computation adapts to data. The example in Figure 1 is adaptive because condition change may be true on some iterations of the time step loop, modifying elements of the index ....
[Article contains additional citation context not shown here]
P. Keleher and C.-W. Tseng. Enhancing software DSM for compiler-parallelized applications. In Proceedings of the 11th International Parallel Processing Symposium, Geneva, Switzerland, April 1997.
....second approach is to combine shared memory compilers (e.g. SUIF [6]) with software distributed shared memory (DSM) systems (e.g. TreadMarks [17], CVM [13]) which provide a shared memory interface. Software DSMs are less efficient than explicit messages, but are much simpler compilation targets [3, 14]. In this paper, we introduce LocalWrite, a new compiler and run time parallelization technique which can improve performance for certain classes of irregular reductions. We evaluate the performance of different parallelization approaches as we vary application characteristics, in order to ....
....of irregular reductions. We evaluate the performance of different parallelization approaches as we vary application characteristics, in order to identify areas in which software DSMs can match or even exceed the efficiency of explicit messages. Experiments are conducted in a prototype system [7, 14] using the CVM [13] software distributed shared memory (DSM) as a compilation target for the SUIF [6] shared memory compiler. Our paper makes the following contributions: develop and evaluate LocalWrite, a new compiler and run time technique for parallelizing irregular reductions based on the ....
[Article contains additional citation context not shown here]
P. Keleher and C.-W. Tseng. Enhancing software DSM for compiler-parallelized applications. In Proceedings of the 11th International Parallel Processing Symposium, Geneva, Switzerland, April 1997.
....rather than to invalidate them. The advantage of such protocols is that subsequent page faults are avoided, but the lack of any selectivity often causes update protocols to move far more data than invalidate protocols. Several researchers have described more selective update protocol variants [3, 32, 33] that might also suffice in this example. However, these protocols effectively encode expected sharing behavior into the underlying protocol. By making such expectations part of the programmable protocol interface, the tape mechanism has far more flexibility. 2.1 Operations As mentioned ....
....can be used to improve data movement at inefficient points in application executions. Our future work with tapes will center on two areas. First, we are exploring the use of compilers to automatically generate tapes interfaces. This work is complementary to recent work in parallelizing compilers [8, 32]. Tapes improve performance by exploiting repetitive access patterns. Identifying such patterns with a high degree of probability in the compiler is much easier than generating explicit message passing code for the data movement. Hence, compiler heuristics that might not be rigorous enough to ....
C.-W. Tseng and P. Keleher, "Enhancing Software DSM for Compiler-Parallelized Applications," in 11th International Parallel Processing Symposium, 1997.
....as well as irregular computations, making software DSMs more attractive on message passing machines. Without compiler support, software DSMs also exploit the iterative nature of scientific applications by prefetching the same non local data used in the previous iteration to eliminate access misses [30, 49]. 8.2 Locality optimizations for irregular codes Data locality has been recognized as a significant performance issue for modern processor architectures. Most researchers have focused on loop transformations on dense matrix codes [37, 47, 43, 50], though recent work has focused on data layout ....
P. Keleher and C.-W. Tseng. Enhancing software DSM for compiler-parallelized applications. In Proceedings of the 11th International Parallel Processing Symposium, Geneva, Switzerland, Apr. 1997.
....programs are portable since they can be run on the large scale parallel machines as well as the low end, but more pervasive multiprocessor workstations. Shared memory parallelizing compilers are easy to use, flexible, and can accept a wide range of applications. Results from several recent studies [4, 14] indicate they can approach the performance of current message passing compilers or explicitly parallel message passing programs on distributed memory machines. However, load imbalance and synchronization overhead were identified as sources of inefficiency when compared with message passing ....
....these measurements that reducing load imbalance caused by synchronization overhead is important for achieving good performance. In this paper we investigate a number of compiler techniques for reducing synchronization overhead and load imbalance. Our techniques are evaluated in a prototype system [14] using the CVM [12] software distributed shared memory (DSM) as a compilation target for the SUIF [9] shared memory compiler. This paper makes the following contributions: • eliminate barriers by inexpensively detecting communication using local subscript analysis • exploiting lazy release ....
[Article contains additional citation context not shown here]
P. Keleher and C.-W. Tseng. Enhancing software DSM for compiler-parallelized applications. In Proceedings of the 11th International Parallel Processing Symposium, Geneva, Switzerland, April 1997.
....and gather nonlocal data. A second approach is to combine shared memory compilers (e.g. SUIF [7]) with software DSM systems (e.g. TreadMarks [19], CVM [15]) which provide a shared memory interface. Software DSMs are less efficient than explicit messages, but are much simpler compilation targets [3, 16]. In this paper, we introduce LOCALWRITE, a new compiler and run time parallelization technique which can improve performance for certain classes of irregular reductions. We evaluate the performance of different parallelization approaches as we vary application characteristics, in order to ....
....of irregular reductions. We evaluate the performance of different parallelization approaches as we vary application characteristics, in order to identify areas in which software DSMs can match or even exceed the efficiency of explicit messages. Experiments are conducted in a prototype system [8, 16] using the CVM [15] software distributed shared memory (DSM) [footnote text: This research was supported by NSF CAREER Development Award #ASC9625531 in New Technologies. The IBM SP 2 and DEC Alpha Cluster were provided by NSF CISE Institutional Infrastructure Award #CDA9401151 and grants from IBM and DEC.] as a ....
[Article contains additional citation context not shown here]
P. Keleher and C.-W. Tseng. Enhancing software DSM for compiler-parallelized applications. In Proceedings of the 11th International Parallel Processing Symposium, Geneva, Switzerland, Apr. 1997.
....from IBM and DEC. [Figure 1. Breakdown of Total Execution Time (16 Processor SP 2); legend: barrier, imbalance, seq wait, comm, OS, application] Shared memory parallelizing compilers are easy to use, flexible, and can accept a wide range of applications. Results from recent studies [3, 14] indicate they can approach the performance of current message passing compilers or explicitly parallel message passing programs on distributed memory machines. However, load imbalance and synchronization overhead were identified as sources of inefficiency when compared with message passing ....
....by synchronization overhead is important for achieving good performance. In this paper we investigate a number of compiler techniques for reducing synchronization overhead and load imbalance for compiler parallelized applications on software DSMs. Our techniques are evaluated in a prototype system [14] using the CVM [12] software distributed shared memory (DSM) as a compilation target for the SUIF [8] shared memory compiler. This paper makes the following contributions: • empirical evaluation of compiler synchronization optimizations for software DSMs • evaluating impact of ....
[Article contains additional citation context not shown here]
P. Keleher and C.-W. Tseng. Enhancing software DSM for compiler-parallelized applications. In Proceedings of the 11th International Parallel Processing Symposium, Geneva, Switzerland, Apr. 1997.
....second approach is to combine shared memory compilers (e.g. SUIF [8]) with software distributed shared memory (DSM) systems (e.g. TreadMarks [24], CVM [18]) which provide a shared memory interface. Software DSMs are less efficient than explicit messages, but are much simpler compilation targets [4, 20]. In this paper, we introduce LOCALWRITE, a new compiler and run time parallelization technique which can improve performance for certain classes of irregular reductions. We evaluate and compare the performance of different approaches to supporting irregular computations as a function of how ....
....is required, as well as how frequently the connection pattern changes during the course of the computation. These parameters are important in identifying areas in which software DSMs can match or even exceed the efficiency of explicit messages. Experiments are conducted in a prototype system [9, 20] using the CVM [17] software distributed shared memory (DSM) as a compilation target for the SUIF [8] shared memory compiler. Our paper makes the following contributions: • develop and evaluate LOCALWRITE, a new compiler and run time technique for parallelizing irregular reductions based on ....
[Article contains additional citation context not shown here]
P. Keleher and C.-W. Tseng. Enhancing software DSM for compiler-parallelized applications. In Proceedings of the 11th International Parallel Processing Symposium, Geneva, Switzerland, April 1997.
No context found.
P. Keleher and C.-W. Tseng, "Enhancing Software DSM for Compiler-Parallelized Applications," in IPPS, August 1997.
No context found.
P. Keleher and C.-W. Tseng, Enhancing Software DSM for Compiler-Parallelized Applications, Proceedings of the Seventh International Symposium on High Performance Distributed Computing, 180-188.