Rohit Chandra, Ding-Kai Chen, Robert Cox, Dror E. Maydan, Nenad Nedeljkovic, and Jennifer M. Anderson. Data distribution support on distributed shared memory multiprocessors. In Proceedings of the ACM SIGPLAN 1997.
....performance may suffer from the latency of remote memory accesses, which is several times higher than the latency of local memory accesses. The parallel processing community has been addressing this problem by incorporating data and thread placement facilities in shared memory programming models [3, 4, 17]. Albeit effective, this solution sacrifices the transparency of the shared memory programming abstraction, by exposing architectural state to the programs. Shared memory programming paradigms are fundamentally based on location transparency for both data and computation. Data distribution ....
....the processors that execute the program. The AFFINITY directive is in analogy to the ONHOME clause of HPF and has been proposed in previous work as an extension to shared memory programming paradigms that helps the programmer express mappings of computation that enforce memory access locality [3, 4]. What we try to circumvent with the work presented in this paper is the requirement to explicitly distribute data in codes like LU, whenever the desired collocation of computation and data can be achieved by letting the operating system place data in memory and in parallel, move the right pieces ....
[Article contains additional citation context not shown here]
R. Chandra, D. Chen, R. Cox, D. Maydan, N. Nedeljkovic, and J. Anderson. Data Distribution Support on Distributed Shared Memory Multiprocessors. In Proc. of the 1997.
....dropping of exclusive lines is beneficial to uniprocessor applications because directory bandwidth is not wasted with replacement hints. In order to support application execution in a ccNUMA environment, the Origin designers decided to implement page migration to improve application performance [7]. Efficient page migration requires a fast data copy mechanism and the ability to globally purge stale TLB entries. Implementing a global TLB shootdown algorithm without hardware support can be very costly. The Origin directory protocol allows for an efficient global TLB purge by supporting directory ....
CHANDRA, R., CHEN, D.-K., COX, R., MAYDAN, D. E., NEDELJKOVIC, N., AND ANDERSON, J. M. Data distribution support on distributed shared memory multiprocessors. In Proceedings of the SIGPLAN '97 Conference on Programming Language Design and Implementation (Las Vegas, NV, June 1997).
....by each code block can be determined by considering individual loop nests that make up the code block. The crucial step in this process is taking into account the parallelization information [2]. Individual nests can either be parallelized explicitly by programmers using compiler directives [1, 4], or can be parallelized automatically (without user intervention) as a result of intraprocedural and interprocedural compiler analyses [2, 10, 11]. In either case, after the parallelization step, our approach determines the data regions (for a given dataset) accessed by each processor involved. ....
R. Chandra, D. Chen, R. Cox, D. Maydan, N. Nedeljkovic, and J. Anderson. Data-distribution support on distributed-shared memory multiprocessors. In Proc. Prog. Lang. Design and Implementation, Las Vegas, NV, 1997.
.... single address space programs to hundreds of processors [16, 18]. It is interesting to note that vendors of commercial NUMA multiprocessors have realized the importance of data distribution and implemented HPF-like, platform-specific data distribution mechanisms, as extensions to FORTRAN and C [8]. Unfortunately, OpenMP, which is nowadays the de facto standard for programming shared memory multiprocessors, provides no means for data distribution. As a consequence, some vendors are proposing the introduction of data distribution directives in OpenMP [4, 21, 23] as the way to achieve the ....
R. Chandra, D. Chen, R. Cox, D. Maydan, N. Nedeljkovic, and J. Anderson. Data Distribution Support on Distributed Shared Memory Multiprocessors. In Proc. of the
....by each code block can be determined by considering individual loop nests that make up the code block. The crucial step in this process is taking into account the parallelization information [2]. Individual nests can either be parallelized explicitly by programmers using compiler directives [1, 7], or can be parallelized automatically (without user intervention) as a result of intraprocedural and interprocedural compiler analyses [2, 20, 21]. In either case, after the parallelization step, our approach determines the data regions (for a given dataset) accessed by each processor involved. ....
R. Chandra, D. Chen, R. Cox, D. Maydan, N. Nedeljkovic, and J. Anderson. Data-distribution support on distributed-shared memory multiprocessors. In Proc. Prog. Lang. Design and Implementation, Las Vegas, NV, 1997.
....of memory intensive programs, if pages with shared data are distant from the threads that access them more frequently upon cache misses. To surmount this problem, vendors provide either data distribution directives as extensions to OpenMP or operating system support to control the placement [1] and dynamic migration [13] of data pages. Offering data distribution directives similar to the ones offered by High Performance Fortran (HPF) [6] has two fundamental shortcomings. First, it is inherently platform dependent and thus hard to standardize and incorporate seamlessly in shared memory ....
R. Chandra et al. Data Distribution Support on Distributed Shared Memory Multiprocessors. Proc. of the
[Figure 5: Histograms of memory accesses in LU. Axis: memory accesses (in millions); panel: LU, OpenMP schedule reuse.] ....the unmodified OpenMP implementation and an implementation that encompasses explicit data distribution directives, provided as extensions to OpenMP by the SGI compiler [2]. The OpenMP implementation of the irregular kernels that uses iteration maps and schedule reuse is compared against the unmodified OpenMP implementation and a well tuned MPI implementation of the same programs. The MPI implementation implements irregular data distributions, including generalized ....
R. Chandra, D. Chen, R. Cox, D. Maydan, N. Nedeljkovic, and J. Anderson. Data Distribution Support on Distributed Shared Memory Multiprocessors. In Proc. of the
....apply run time checks to determine if it is safe to apply a candidate data transformation. We do not investigate this issue in this paper and assume that the data transformations we apply are always legal. Of course, this may not always be true; in those cases techniques proposed by Chandra et al. [13] can be used. The second important issue is the propagation of layout transformations across procedure boundaries. Currently, the scope of our work is limited to one procedure at a time and the experimental results presented later on are obtained on inlined [2] codes. We are working on a ....
R. Chandra, D. Chen, R. Cox, D. Maydan, N. Nedeljkovic, and J. M. Anderson. Data-distribution support on distributed-shared memory multiprocessors. In Proc. Programming Language Design and Implementation (PLDI'97), Las Vegas, NV, 1997.
....and parallel scientific programs. Loop transformations (e.g. loop permutation, fusion, tiling) for sequential dense matrix codes with regular memory access patterns have proven useful [19, 27, 48, 49, 57, 55, 71, 76, 87, 88]. Data layout optimizations (e.g. transpose, padding) also help [2, 3, 13, 18, 39, 69, 70], even for irregular [1, 22, 58] and pointer-based programs [8, 17]. Despite the major advances made in providing software support for improving locality for both sequential and parallel programs, more work remains for advanced scientific computations. 2 Advanced Scientific Applications We begin ....
R. Chandra, D.-K. Chen, R. Cox, D. E. Maydan, N. Nedeljkovic, and J. Anderson. Data distribution support on distributed shared memory multiprocessors. In Proceedings of the SIGPLAN '97 Conference on Programming Language Design and Implementation, Las Vegas, NV, June 1997.
....for controlling the distribution of data among processing nodes. It is interesting to note that some vendors of commercial ccNUMA systems have realized the importance of data distribution and implemented HPF-like, platform-specific data distribution mechanisms in their FORTRAN and C compilers [5]. Since OpenMP has become the de facto standard for parallel programming on shared memory multiprocessors, some vendors are seriously considering the incorporation of data distribution facilities in the OpenMP API [6, 7]. The introduction of data distribution directives in OpenMP contradicts some ....
R. Chandra, D. Chen, R. Cox, D. Maydan, N. Nedeljkovic, and J. Anderson. Data Distribution Support on Distributed Shared Memory Multiprocessors. Proc. of the 1997 ACM Conference on Programming Language Design and Implementation, pp. 334--345. Las Vegas, NV, June 1997.
....the suitable memory layouts for each array such that the loop nests in the program will have spatial locality (as defined earlier) with respect to each reference that they enclose. If this is not possible, we want to maximize the number of references for which this is possible. Chandra et al. [6] indicate that due to some conditions related to storage and sequence assumptions about the arrays and to passing arrays as subroutine arguments, data transformations may not always be legal. We assume no such situation occurs for the example programs given in this paper. Chandra et al. [6] also ....
....et al. [6] indicate that due to some conditions related to storage and sequence assumptions about the arrays and to passing arrays as subroutine arguments, data transformations may not always be legal. We assume no such situation occurs for the example programs given in this paper. Chandra et al. [6] also propose methods on how to cope with storage sequence and parameter passing problems when data transformations are to be applied. Investigating these issues is beyond the scope of this paper. 4.2 Determining Optimal Layouts In this subsection, we present our data space restructuring ....
[Article contains additional citation context not shown here]
R. Chandra, D. Chen, R. Cox, D. Maydan, N. Nedeljkovic, and J. Anderson. Data-distribution support on distributed-shared memory multiprocessors. Proc. SIGPLAN Conf. Programming Language Design & Implementation, Las Vegas, NV, pages 334--345, 1997.
....and parallel scientific programs. Loop transformations (e.g. loop permutation, fusion, tiling) for sequential dense matrix codes with regular memory access patterns have proven useful [16, 25, 42, 43, 52, 51, 63, 67, 76, 77]. Data layout optimizations (e.g. transpose, padding) also help [3, 5, 12, 15, 35, 61, 62], even for irregular programs [1, 20, 53]. Despite the major advances made in providing software support for improving locality for both sequential and parallel programs, more work remains. In the following sections, we discuss three important types of codes where locality support can be improved. ....
R. Chandra, D.-K. Chen, R. Cox, D. E. Maydan, N. Nedeljkovic, and J. Anderson. Data distribution support on distributed shared memory multiprocessors. In Proceedings of the SIGPLAN '97 Conference on Programming Language Design and Implementation, Las Vegas, NV, June 1997.
....also be removed without conducting loop fusion [23]. Array permutation [3] is a recently proposed data reorganization technique which permutes array dimensions in order to increase intra-array address contiguity. Subscripts in array references are permuted accordingly. Recent work by Chandra et al. [7] provides compiler support for programmer-specified data reshaping on CC-NUMA multiprocessors. Such compiler support aims to reduce page-level false sharing by eliminating intra-array address discontiguity. These techniques are complementary to the APP technique. With the APP technique, loop ....
R. Chandra, D.-K. Chen, R. Cox, D.E. Maydan, N. Nedeljkovic, and J.M. Anderson. Data Distribution Support on Distributed Shared Memory Multiprocessors. In ACM SIGPLAN PLDI'97, pp. 334--345, June 1997.
....and not just a single loop nest. Not surprisingly, deciding the optimal layouts is NP-complete. Finally, some problematic constructs like array aliasing and pointers in C and the EQUIVALENCE statement in Fortran may prevent automatic data layout modifications. We refer the reader to Chandra et al. [6] for a study of techniques for ensuring the legality of memory layout transformations. It seems natural to try and combine the benefits of loop and data transformations in improving the memory performance of programs. There have been some efforts aimed at unifying loop and data transformations [2, ....
R. Chandra, D. Chen, R. Cox, D. Maydan, N. Nedeljkovic, and J. Anderson. Data-distribution support on distributed-shared memory multiprocessors. Proc. SIGPLAN Conf. Programming Language Design & Implementation (PLDI'97), Las Vegas, NV, pages 334--345, 1997.
....the suitable memory layouts for each array such that the loop nests in the program will have spatial locality (as defined earlier) with respect to each reference that they enclose. If this is not possible, we want to maximize the number of references for which this is possible. As indicated in [3], due to some conditions related to storage and sequence assumptions about the arrays and to passing arrays as subroutine arguments, data transformations may not always be legal. We assume no such situation occurs for the example programs given in this paper. 4.2 Determining the Optimal Layouts ....
R. Chandra, D. Chen, R. Cox, D. Maydan, N. Nedeljkovic, and J. M. Anderson. Data-distribution support on distributed-shared memory multiprocessors. In Proc. Programming Language Design and Implementation (PLDI), 1997.
No context found.
R. Chandra, D. Chen, R. Cox, D. Maydan, N. Nedeljkovic, and J. Anderson. "Data Distribution Support on Distributed-Shared Memory Multiprocessors." In Proc. Programming Language Design and Implementation, Las Vegas, NV, 1997.
No context found.
R. Chandra, D. Chen, R. Cox, D.E. Maydan, and N. Nedeljkovic. Data distribution support on distributed shared memory multiprocessors. In Proceedings of '97 Conference on Programming Language Design and Implementation, 1997.
....frequency of packing. Some existing language features such as sequence and storage association in Fortran prevent a compiler from accurately detecting all accesses to a transformed array. However, this problem can be safely solved by a combination of compile-time, link-time, and run-time checks described in [7]. Although the compiler support can guarantee the correctness of packing, it needs additional information to decide on the profitability of packing. Our compiler currently relies on a one-line user directive to specify whether packing should be applied, when and where packing should be carried out ....
R. Chandra, D. Chen, R. Cox, D.E. Maydan, and N. Nedeljkovic. Data distribution support on distributed shared memory multiprocessors. In Proceedings of '97 Conference on Programming Language Design and Implementation, 1997.
No context found.
Rohit Chandra, Ding-Kai Chen, Robert Cox, Dror E. Maydan, Nenad Nedeljkovic, and Jennifer M. Anderson. Data distribution support on distributed shared memory multiprocessors. In Proceedings of the ACM SIGPLAN 1997.
No context found.
R. Chandra, D. Chen, R. Cox, D. Maydan, N. Nedeljkovic, and J. Anderson. Data Distribution Support on Distributed Shared Memory Multiprocessors. In Proc. of the 1997.
No context found.
Rohit Chandra, Ding-Kai Chen, Robert Cox, Dror E. Maydan, Nenad Nedeljkovic, and Jennifer M. Anderson. Data distribution support on distributed shared memory multiprocessors. In ACM SIGPLAN '97 Conference on Programming Language Design and Implementation, pages 334-345, June 1997.