
Automatic data layout for High Performance Fortran. Supercomputing ’95 (1995)

by K. Kennedy, U. Kremer
Results 1 - 10 of 73 citing documents

Maximizing Multiprocessor Performance with the SUIF Compiler

by Mary Hall, Jennifer M. Anderson, Saman P. Amarasinghe, Brian R. Murphy, Shih-wei Liao, Edouard Bugnion, Monica S. Lam, 1996
"... This paper presents an overview of the SUIF compiler, which automatically parallelizes and optimizes sequential programs for shared-memory multiprocessors. We describe new technology in this system for locating coarse-grain parallelism and for optimizing multiprocessor memory behavior essential to ..."
Abstract - Cited by 280 (22 self) - Add to MetaCart
This paper presents an overview of the SUIF compiler, which automatically parallelizes and optimizes sequential programs for shared-memory multiprocessors. We describe new technology in this system for locating coarse-grain parallelism and for optimizing multiprocessor memory behavior essential to obtaining good multiprocessor performance. These techniques have a significant impact on the performance of half of the NAS and SPECfp95 benchmark suites. In particular, we achieve the highest SPECfp95 ratio to date of 63.9 on an eight-processor 440MHz Digital AlphaServer.

1 Introduction

Affordable shared-memory multiprocessors can potentially deliver supercomputer-like performance to the general public. Today, these machines are mainly used in a multiprogramming mode, increasing system throughput by running several independent applications in parallel. The multiple processors can also be used together to accelerate the execution of single applications. Automatic parallelization is a promis...

Citation Context

...prefetching to move data into the cache before it is needed. Improving Processor Re-use of Data. The compiler reorganizes the computation so that each processor re-uses the same data as much as possible [2, 3, 12]. This reduces the working set on each processor, thus minimizing capacity misses; it also reduces communication between processors, thus minimizing true sharing misses. To achieve this goal, the compil...

Dynamic feedback: an effective technique for adaptive computing

by Pedro Diniz, Martin Rinard - PLDI ’97: Proceedings of the ACM SIGPLAN, 1997
"... This paper presents dynamic feedback, a technique that enables computations to adapt dynamically to different execution environ-ments. A compiler that uses dynamic feedback produces several different versions of the same source code; each version uses a dif-ferent optimization policy. The generated ..."
Abstract - Cited by 71 (6 self) - Add to MetaCart
This paper presents dynamic feedback, a technique that enables computations to adapt dynamically to different execution environments. A compiler that uses dynamic feedback produces several different versions of the same source code; each version uses a different optimization policy. The generated code alternately performs sampling phases and production phases. Each sampling phase measures the overhead of each version in the current environment. Each production phase uses the version with the least overhead in the previous sampling phase. The computation periodically resamples to adjust dynamically to changes in the environment. We have implemented dynamic feedback in the context of a parallelizing compiler for object-based programs. The generated code uses dynamic feedback to automatically choose the best synchronization optimization policy. Our experimental results show that the synchronization optimization policy has a significant impact on the overall performance of the computation, that the best policy varies from program to program, that the compiler is unable to statically choose the best policy, and that dynamic feedback enables the generated code to exhibit performance that is comparable to that of code that has been manually tuned to use the best policy. We have also performed a theoretical analysis which provides, under certain assumptions, a guaranteed optimality bound for dynamic feedback relative to a hypothetical (and unrealizable) optimal algorithm that uses the best policy at every point during the execution.

Citation Context

...s on the access pattern of the parallel program [12]. The best data distribution of dense matrices in distributed memory machines depends on how the different parts of the program access the matrices [1, 2, 18, 21]. The best concrete data structure to implement a given abstract data type often depends on how it is used [14, 22]. The best algorithm to solve a ...
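The sampling/production alternation described in the abstract above can be sketched in a few lines. This is a generic illustration rather than the Diniz/Rinard implementation: the version functions, phase lengths, and use of clock() for timing are all assumptions made for the sketch.

#include <stdio.h>
#include <time.h>

#define NUM_VERSIONS     3
#define SAMPLE_ITERS    10     /* iterations per version in a sampling phase (assumed) */
#define PRODUCTION_ITERS 1000  /* iterations before the next resample (assumed) */

/* Stand-ins for compiler-generated versions of the same computation,
 * each built under a different synchronization optimization policy. */
static void version_a(void) { /* placeholder body */ }
static void version_b(void) { /* placeholder body */ }
static void version_c(void) { /* placeholder body */ }

static double time_version(void (*f)(void), int iters) {
    clock_t start = clock();
    for (int i = 0; i < iters; i++)
        f();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void) {
    void (*versions[NUM_VERSIONS])(void) = { version_a, version_b, version_c };
    long done = 0, total = 100000;

    while (done < total) {
        /* Sampling phase: measure every version in the current environment. */
        int best = 0;
        double best_time = time_version(versions[0], SAMPLE_ITERS);
        for (int v = 1; v < NUM_VERSIONS; v++) {
            double t = time_version(versions[v], SAMPLE_ITERS);
            if (t < best_time) { best_time = t; best = v; }
        }
        done += (long)NUM_VERSIONS * SAMPLE_ITERS;

        /* Production phase: run the cheapest version until the next resample,
         * so the computation adapts if the environment changes. */
        for (int i = 0; i < PRODUCTION_ITERS && done < total; i++, done++)
            versions[best]();
    }
    printf("finished %ld iterations\n", done);
    return 0;
}

The point of the structure is that the choice of policy is revisited at run time, so a change in the environment shows up in the next sampling phase.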

Compiler-directed page coloring for multiprocessors

by Edouard Bugnion, Jennifer M. Anderson, Todd C. Mowry, Mendel Rosenblum, Monica S. Lam - In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), 1996
"... This paper presents a new technique, compiler-directed page coloring, that eliminates conflict misses in multiprocessor applications. It enables applications to make better use of the increased aggregate cache size available in a multiprocessor. This technique uses the compiler’s knowledge of the ac ..."
Abstract - Cited by 66 (8 self) - Add to MetaCart
This paper presents a new technique, compiler-directed page coloring, that eliminates conflict misses in multiprocessor applications. It enables applications to make better use of the increased aggregate cache size available in a multiprocessor. This technique uses the compiler’s knowledge of the access patterns of the parallelized applications to direct the operating system’s virtual memory page mapping strategy. We demonstrate that this technique can lead to significant performance improvements over two commonly used page mapping strategies for machines with either direct-mapped or two-way set-associative caches. We also show that it is complementary to latency-hiding techniques such as prefetching. We implemented compiler-directed page coloring in the SUIF parallelizing compiler and on two commercial operating systems. We applied the technique to the SPEC95fp benchmark suite, a representative set of numeric programs. We used the SimOS machine simulator to analyze the applications and isolate their performance bottlenecks. We also validated these results on a real machine, an eight-processor 350MHz Digital AlphaServer. Compiler-directed page coloring leads to significant performance improvements for several applications. Overall, our technique improves the SPEC95fp rating for eight processors by 8% over Digital UNIX’s page mapping policy and by 20% over page coloring, a standard page mapping policy. The SUIF compiler achieves a SPEC95fp ratio of 57.4, the highest ratio to date.

Citation Context

...reorder the computation to enhance data locality [7,26]. Recently, there has also been research to minimize communication between processors by clever partitioning of the computation across processors [3,11,15]. Transformations that make data elements accessed by the same processor contiguous in the shared address space have been shown to be useful for enhancing spatial locality and minimizing false sharing...
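As background for the abstract above, the arithmetic behind page coloring is simple: with a physically indexed cache, the page frame number modulo the number of colors determines which cache region a page occupies. The sketch below illustrates this; the cache and page parameters are assumed, and the preference function is a hypothetical stand-in for the compiler-to-OS hint described in the paper, not its actual interface.

#include <stdio.h>

/* Illustrative parameters; not tied to any particular machine. */
#define CACHE_SIZE    (2 * 1024 * 1024)   /* per-processor cache, 2 MB */
#define ASSOCIATIVITY 1                   /* direct-mapped */
#define PAGE_SIZE     (8 * 1024)          /* 8 KB pages */

/* Pages with the same color compete for the same cache region. */
#define NUM_COLORS ((CACHE_SIZE / ASSOCIATIVITY) / PAGE_SIZE)

/* Color of a physical page frame. */
static unsigned color_of_frame(unsigned long frame) {
    return (unsigned)(frame % NUM_COLORS);
}

/* Hypothetical hint the compiler could pass to the OS: map virtual page vp
 * to a frame whose color equals vp's "natural" color, so a processor's
 * contiguous data spreads over the whole cache instead of conflicting. */
static unsigned preferred_color(unsigned long vp) {
    return (unsigned)(vp % NUM_COLORS);
}

int main(void) {
    printf("page colors available: %d\n", (int)NUM_COLORS);
    for (unsigned long vp = 0; vp < 4; vp++)
        printf("virtual page %lu -> preferred color %u (frame %lu matches: %s)\n",
               vp, preferred_color(vp), vp + NUM_COLORS,
               color_of_frame(vp + NUM_COLORS) == preferred_color(vp) ? "yes" : "no");
    return 0;
}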

A Matrix-Based Approach to Global Locality Optimization

by Mahmut Kandemir, Alok Choudhary, J. Ramanujam, Prith Banerjee, 1998
"... Global locality optimization is a technique for improving the cache performance of a sequence of loop nests through a combination of loop and data layout transformations. Pure loop transformations are restricted by data dependences and may not be very successful in optimizing imperfectly nested loop ..."
Abstract - Cited by 35 (10 self) - Add to MetaCart
Global locality optimization is a technique for improving the cache performance of a sequence of loop nests through a combination of loop and data layout transformations. Pure loop transformations are restricted by data dependences and may not be very successful in optimizing imperfectly nested loops and explicitly parallelized programs. Although pure data transformations are not constrained by data dependences, the impact of a data transformation on an array might be program-wide; that is, it can affect all the references to that array in all the loop nests. Therefore, in this paper we argue for an integrated approach that employs both loop and data transformations. The method enjoys the advantages of most of the previous techniques for enhancing locality and is efficient. In our approach, the loop nests in a program are processed one by one and the data layout constraints obtained from one nest are propagated for optimizing the remaining loop nests. We show a simple and effective matrix-based framework to implement this process. The search space that we consider for possible loop transformations can be represented by general non-singular linear transformation matrices and the data layouts that we consider are those that can be expressed using hyperplanes. Experiments with several floating-point programs on an SGI Origin 2000 distributed-shared-memory machine demonstrate the efficacy of our ...

Citation Context

... that exhibit good processor locality need cache optimization techniques. After all, there are a number of powerful automatic data distribution techniques published in the literature (see for example [40, 5, 14, 21, 33, 47, 52] and the references therein), and, for example, the SGI Origin gives the programmer fine-grain control over data distribution, which can be optimized using any of the techniques mentioned. Our answer to...
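The loop-versus-data-transformation trade-off this entry builds on can be made concrete with a toy nest. The example below is generic, not from the paper: the same poor-stride loop is repaired once by a loop interchange (the iteration-space permutation matrix [[0,1],[1,0]]) and once by transposing the array layout, which, unlike the loop transform, affects every other reference to that array in the program.

#include <stdio.h>

#define N 512
static double a[N][N];    /* row-major, as in C */
static double a_t[N][N];  /* transposed copy used by the data-transformed version */

/* Original nest: the innermost loop varies i in a[j][i], so consecutive
 * iterations touch different rows -- poor spatial locality in row-major C. */
static void original(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[j][i] = i + j;
}

/* Loop transformation: interchange the loops (apply the permutation matrix
 * [[0,1],[1,0]] to the iteration space); the innermost loop now walks
 * contiguous memory.  Legal only if data dependences permit it. */
static void loop_transformed(void) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[j][i] = i + j;
}

/* Data transformation: keep the loop order but store the array transposed.
 * This also yields unit stride here, but the layout change is global:
 * every other reference to the array in the program must be rewritten. */
static void data_transformed(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a_t[i][j] = i + j;
}

int main(void) {
    original();
    loop_transformed();
    data_transformed();
    printf("%f %f\n", a[N-1][N-1], a_t[N-1][N-1]);
    return 0;
}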

Using fine-grain threads and run-time decision making in parallel computing

by David K. Lowenthal, Vincent W. Freeh, Gregory R. Andrews - Journal of Parallel and Distributed Computing, 1996
"... Programming distributed-memory multiprocessors and networks of workstations requires deciding what can execute concurrently, how processes communicate, and where data is placed. These decisions can be made statically by a programmer or compiler, or they can be made dynamically at run time. Using run ..."
Abstract - Cited by 33 (14 self) - Add to MetaCart
Programming distributed-memory multiprocessors and networks of workstations requires deciding what can execute concurrently, how processes communicate, and where data is placed. These decisions can be made statically by a programmer or compiler, or they can be made dynamically at run time. Using run-time decisions leads to a simpler interface—because decisions are implicit—and it can lead to better decisions—because more information is available. This paper examines the costs, benefits, and details of making decisions at run time. The starting point is explicit fine-grain parallelism with any number (even thousands) of threads. Five specific techniques are considered: (1) implicitly coarsening the granularity of parallelism, (2) using implicit communication implemented by a distributed shared memory, (3) overlapping computation and communication, (4) adaptively moving threads and data between nodes to minimize communication and balance load, and (5) dynamically remapping data to pages to avoid false sharing. Details are given on the performance of each of these techniques as well as their overall performance on several scientific applications.

Citation Context

...ments statically. They can generally be divided into two categories: using language primitives, such as the ones in HPF [HPF93], or compiler analysis, such as the work reported in [AL93], [GB93], and [KK94]. Language primitives involve the programmer in the choice of data placement; unfortunately, the best placement may be difficult or impossible for the programmer to determine. Compiler analysis also m...
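Technique (5) in the abstract above addresses false sharing. The declarations below illustrate the underlying problem with an assumed 64-byte line size and compile-time padding; the paper's approach instead remaps data to separate pages at run time, which this static sketch does not attempt to show.

#include <stdio.h>

#define CACHE_LINE 64   /* assumed line size; real values vary by machine */

/* Two counters updated by different threads but packed into one cache line:
 * every write by one thread invalidates the line in the other thread's
 * cache, even though no data is logically shared. */
struct counters_false_shared {
    long a;
    long b;
};

/* Separating the counters -- here by static padding, in the paper by
 * remapping each thread's data to its own page at run time -- removes
 * the interference, because a and b now sit on different lines. */
struct counters_padded {
    long a;
    char pad[CACHE_LINE - sizeof(long)];
    long b;
};

int main(void) {
    printf("false-shared layout: %zu bytes, padded layout: %zu bytes\n",
           sizeof(struct counters_false_shared), sizeof(struct counters_padded));
    return 0;
}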

Integrating Loop and Data Transformations for Global Optimisation

by M. F. P. O'Boyle, P. M. W. Knijnenburg - In Proc. International Conference on Parallel Architectures and Compilation Techniques (PACT'98), 1998
"... This paper is concerned with integrating global data transformations and local loop transformations in order to minimise overhead on distributed shared memory machines such as the SGi Origin 2000. By first developing an extended algebraic transformation framework, a new technique to allow the static ..."
Abstract - Cited by 31 (2 self) - Add to MetaCart
This paper is concerned with integrating global data transformations and local loop transformations in order to minimise overhead on distributed shared memory machines such as the SGI Origin 2000. By first developing an extended algebraic transformation framework, a new technique to allow the static application of global data transformations, such as partitioning, to reshaped arrays is presented, eliminating the need for expensive temporary copies and hence eliminating any communication and synchronisation. In addition, by integrating loop and data transformations, any introduced poor spatial locality and expensive array subscripts can be eliminated. A specific performance improving algorithm is implemented giving significant improvements in execution time.

Citation Context

...isolation may perform well, they may perform poorly when combined due to significant communication and synchronisation between loop nests. Another approach is to consider data orientated parallelisation [8], traditionally developed for distributed memory compilation but also used for distributed shared memory [1, 6]. This approach is primarily concerned with mapping arrays to processors and has a global...

Automatic Computation and Data Decomposition for Multiprocessors

by Jennifer-Ann Monique Anderson, 1997
"... Memory subsystem efficiency is critical to achieving high performance on parallel machines. The memory subsystem organization of modern multiprocessor architectures makes their performance highly sensitive to both the distribution of the computation and the layout of the data. A key issue in progr ..."
Abstract - Cited by 29 (0 self) - Add to MetaCart
Memory subsystem efficiency is critical to achieving high performance on parallel machines. The memory subsystem organization of modern multiprocessor architectures makes their performance highly sensitive to both the distribution of the computation and the layout of the data. A key issue in programming these machines is selecting the computation decomposition and data decomposition, the mapping of the computation and data, respectively, across the processors of the machine. A popular approach to the decomposition problem is to require programmers to perform the decomposition analysis themselves, and to communicate that information to the compiler using language extensions. This thesis presents a new compiler algorithm that automatically calculates computation and data decompositions for dense-matrix scientific codes. The core of the algorithm is based on a linear algebra framework for expressing and calculating the computation and data decompositions. Using the linear algebra model, the algorithm generates a system of equations that specifies the conditions the desired decompositions must satisfy. The decompositions are then calculated systematically by solving the system of equations. Since the best decompositions may change as different phases of the program are executed, the algorithm also considers re-organizing the data dynamically. The analysis is performed both within and across procedure boundaries so that entire programs can be analyzed.
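The "system of equations" the abstract refers to can be given a flavor with one illustrative constraint. The notation below is ours (assuming purely affine, offset-free processor mappings) and is not necessarily the thesis's exact formulation: C maps an iteration vector i to a virtual processor, D maps an array index vector to a processor, and a reference accesses element Fi + f in iteration i. Requiring each iteration to execute on the processor that owns the element it accesses gives

\[
  D(F\mathbf{i} + \mathbf{f}) = C\mathbf{i} \quad \text{for all } \mathbf{i}
  \quad\Longleftrightarrow\quad
  DF = C \quad\text{and}\quad D\mathbf{f} = \mathbf{0},
\]

and collecting one such condition per array reference yields a system whose solutions are communication-free computation and data decompositions; when no nontrivial solution exists, some communication or dynamic re-layout must be accepted.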

A Linear Algebra Framework for Automatic Determination of Optimal Data Layouts

by Mahmut Kandemir, Alok Choudhary, Nagaraj Shenoy, Prithviraj Banerjee, J. Ramanujam, 1999
"... This paper presents a data layout optimization technique for sequential and parallel programs based on the theory of hyperplanes from linear algebra. Given a program, our framework automatically determines suitable memory layouts that can be expressed by hyperplanes for each array that is referenced ..."
Abstract - Cited by 25 (5 self) - Add to MetaCart
This paper presents a data layout optimization technique for sequential and parallel programs based on the theory of hyperplanes from linear algebra. Given a program, our framework automatically determines suitable memory layouts that can be expressed by hyperplanes for each array that is referenced. We discuss the cases where data transformations are preferable to loop transformations and show that under certain conditions a loop nest can be optimized for perfect spatial locality by using data transformations. We argue that data transformations can also optimize spatial locality for some arrays without distorting temporal/spatial locality exhibited by others. We divide the problem of optimizing data layout into two independent subproblems: 1) determining optimal static data layouts, and 2) determining data transformation matrices to implement the optimal layouts. By postponing the determination of the transformation matrix to the last stage, our method can be adapted to compilers with different default layouts. We then present an algorithm that considers optimizing parallelism and spatial locality simultaneously. Our results on eight programs on two distributed shared-memory multiprocessors, the Convex Exemplar SPP-2000 and the SGI Origin 2000, show that the layout optimizations are effective in optimizing spatial locality and parallelism.
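To make the hyperplane vocabulary of this abstract concrete, the sketch below linearizes a small 2-D array under three layouts. The correspondence drawn in the comments is our illustration, not a quotation of the paper's formalism; the wrapped-diagonal case is simply an example of a layout that neither canonical order can express.

#include <stdio.h>

#define N 4

/* Row-major: elements of the same row (hyperplane i = const) are contiguous. */
static int off_row(int i, int j)  { return i * N + j; }

/* Column-major: elements of the same column (hyperplane j = const) are contiguous. */
static int off_col(int i, int j)  { return j * N + i; }

/* Wrapped-diagonal layout: elements with i - j congruent mod N are contiguous.
 * A nest that walks the array along diagonals gets unit stride here, which
 * neither row-major nor column-major provides. */
static int off_diag(int i, int j) { return ((i - j + N) % N) * N + j; }

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            printf("(%d,%d): row %2d  col %2d  diag %2d\n",
                   i, j, off_row(i, j), off_col(i, j), off_diag(i, j));
    return 0;
}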

Automatic data layout for distributed-memory machines

by Ken Kennedy, Ulrich Kremer - ACM Transactions on Programming Languages and Systems, 1998
"... The goal of languages like Fortran D or High Performance Fortran (HPF) is to provide a simple yet efficient machine-independent parallel programming model. After the algorithm selection, the data layout choice is the key intellectual challenge in writing an efficient program in such languages. The p ..."
Abstract - Cited by 23 (0 self) - Add to MetaCart
The goal of languages like Fortran D or High Performance Fortran (HPF) is to provide a simple yet efficient machine-independent parallel programming model. After the algorithm selection, the data layout choice is the key intellectual challenge in writing an efficient program in such languages. The performance of a data layout depends on the target compilation system, the target machine, the problem size, and the number of available processors. This makes the choice of a good layout extremely difficult for most users of such languages. If languages such as HPF are to find general acceptance, the need for data layout selection support has to be addressed. We believe that the appropriate way to provide the needed support is through a tool that generates data layout specifications automatically. This article discusses the design and implementation of a data layout selection tool that generates HPF-style data layout specifications automatically. Because layout is done in a tool that is not embedded in the target compiler and hence will be run only a few times during the tuning phase of an application, it can use techniques such as integer programming that may be considered too computationally expensive for inclusion in production compilers. The proposed framework for automatic data layout selection builds and examines search spaces of candidate data layouts. A candidate layout is an efficient layout for

Citation Context

...Philippsen [Philippsen 1995] and Garcia, Ayguadé and Labarta [Garcia et al. 1995]. The latter two works have been based on our previously published experience with 0-1 integer programming [Bixby et al. 1994; Kennedy and Kremer 1995]. In the remainder of this section different approaches to automatic data and computation mappings that were mainly designed for use within an optimizing compiler will be discussed. Many aspects of t...
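As a rough picture of what a 0-1 formulation for layout selection can look like (the variables and costs below are a simplification of ours, not the model of Bixby et al. or Kennedy and Kremer): let x_{p,l} = 1 if candidate layout l is chosen for program phase p, let c_{p,l} be the estimated cost of executing phase p under l, and let r_{l,l'} >= 0 be the cost of remapping data from l to l' between consecutive phases. One then solves

\begin{align*}
\min\;& \sum_{p}\sum_{l} c_{p,l}\,x_{p,l} \;+\; \sum_{p}\sum_{l,l'} r_{l,l'}\,y_{p,l,l'}\\
\text{s.t.}\;& \sum_{l} x_{p,l} = 1 \quad\text{for every phase } p,\\
& y_{p,l,l'} \;\ge\; x_{p,l} + x_{p+1,l'} - 1,\qquad x_{p,l},\,y_{p,l,l'} \in \{0,1\},
\end{align*}

where the y variables linearize the pairing of consecutive layouts. Solving such a program once, inside a standalone tuning tool, is exactly the kind of expense the article argues is acceptable even though it would be too costly inside a production compiler.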

Improving Locality For Adaptive Irregular Scientific Codes

by Hwansoo Han et al., 1999
"... An important class of scientific codes access memory in an irregular manner. Because irregular access patterns reduce temporal and spatial locality, they tend to underutilize caches, resulting in poor performance. Researchers have shown that consecutively packing data relative to traversal order can ..."
Abstract - Cited by 21 (2 self) - Add to MetaCart
An important class of scientific codes access memory in an irregular manner. Because irregular access patterns reduce temporal and spatial locality, they tend to underutilize caches, resulting in poor performance. Researchers have shown that consecutively packing data relative to traversal order can significantly reduce cache miss rates by increasing spatial locality. In this paper, we investigate techniques for using partitioning algorithms to improve locality in adaptive irregular codes. We develop parameters to guide both geometric (RCB) and graph partitioning (METIS) algorithms, and develop a new graph partitioning algorithm based on hierarchical clustering (GPART) which achieves good locality with low overhead. We also examine the effectiveness of locality optimizations for adaptive codes, where connection patterns dynamically change at intervals during program execution. We use a simple cost model to guide locality optimizations when access patterns change. Experiments on irregular scientific codes for a variety of meshes show our partitioning algorithms are effective for static and adaptive codes on both sequential and parallel machines. Improved locality also enhances the effectiveness of LOCALWRITE, a parallelization technique for irregular reductions based on the owner computes rule.

Citation Context

... reduced graph. Analysis and measurements show multi-level graph partitioning algorithms are reasonably fast and produce good partitions, as measured by the number of cut edges which cross partitions [31, 29]. The quality of the partitions is fairly close to that achieved by applying the k-way graph partitioning algorithm to the original graph, but at a fraction of the expense. 3.3 Hierarchical graph clu...
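Whichever partitioner is used (RCB, METIS, or the paper's GPART), the locality benefit comes from the renumbering step that packs each partition's nodes contiguously. The routine below is a generic sketch of that step, assuming a partition assignment is already available; it is not code from the paper.

#include <stdio.h>
#include <stdlib.h>

/* Given part[i] = partition of node i (0..nparts-1), compute new_id[i] so
 * that nodes in the same partition receive consecutive numbers.  Since
 * neighboring mesh nodes tend to fall in the same partition, data accessed
 * together ends up contiguous in memory, improving spatial locality. */
static void pack_by_partition(int nnodes, int nparts,
                              const int *part, int *new_id) {
    int *next = calloc((size_t)nparts + 1, sizeof *next);
    for (int i = 0; i < nnodes; i++)
        next[part[i] + 1]++;            /* count nodes per partition */
    for (int p = 0; p < nparts; p++)
        next[p + 1] += next[p];         /* prefix sums: first slot of each partition */
    for (int i = 0; i < nnodes; i++)
        new_id[i] = next[part[i]]++;    /* hand out consecutive ids */
    free(next);
}

int main(void) {
    int part[8] = { 1, 0, 1, 0, 2, 2, 1, 0 };   /* toy partition assignment */
    int new_id[8];
    pack_by_partition(8, 3, part, new_id);
    for (int i = 0; i < 8; i++)
        printf("node %d (partition %d) -> new id %d\n", i, part[i], new_id[i]);
    /* Node data and edge endpoints are then remapped once:
     *   new_data[new_id[i]] = old_data[i];  edge (u,v) -> (new_id[u], new_id[v]). */
    return 0;
}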
