Results 1 - 10 of 73
Maximizing Multiprocessor Performance with the SUIF Compiler
1996
Cited by 280 (22 self)
Abstract:
This paper presents an overview of the SUIF compiler, which automatically parallelizes and optimizes sequential programs for shared-memory multiprocessors. We describe new technology in this system for locating coarse-grain parallelism and for optimizing multiprocessor memory behavior essential to obtaining good multiprocessor performance. These techniques have a significant impact on the performance of half of the NAS and SPECfp95 benchmark suites. In particular, we achieve the highest SPECfp95 ratio to date of 63.9 on an eight-processor 440MHz Digital AlphaServer.
1 Introduction
Affordable shared-memory multiprocessors can potentially deliver supercomputer-like performance to the general public. Today, these machines are mainly used in a multiprogramming mode, increasing system throughput by running several independent applications in parallel. The multiple processors can also be used together to accelerate the execution of single applications. Automatic parallelization is a promis...
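The coarse-grain parallelism referred to here is, at its simplest, parallelism across whole outer loops rather than individual inner-loop iterations. A minimal sketch of that shape in C, written with an OpenMP pragma purely for illustration (this is not SUIF output, and the function and array names are assumed):

/* Coarse-grain parallelism: the entire outer loop is distributed across
 * processors, so each processor executes a large block of work per
 * scheduling decision.  Compile with an OpenMP-capable compiler (-fopenmp). */
void scale_rows(double *a, long n, long m, const double *s)
{
    #pragma omp parallel for
    for (long i = 0; i < n; i++)          /* rows split across processors */
        for (long j = 0; j < m; j++)      /* inner loop stays sequential  */
            a[i * m + j] *= s[i];
}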
Dynamic feedback: an effective technique for adaptive computing
PLDI '97: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 1997
Cited by 71 (6 self)
Abstract:
This paper presents dynamic feedback, a technique that enables computations to adapt dynamically to different execution environments. A compiler that uses dynamic feedback produces several different versions of the same source code; each version uses a different optimization policy. The generated code alternately performs sampling phases and production phases. Each sampling phase measures the overhead of each version in the current environment. Each production phase uses the version with the least overhead in the previous sampling phase. The computation periodically resamples to adjust dynamically to changes in the environment. We have implemented dynamic feedback in the context of a parallelizing compiler for object-based programs. The generated code uses dynamic feedback to automatically choose the best synchronization optimization policy. Our experimental results show that the synchronization optimization policy has a significant impact on the overall performance of the computation, that the best policy varies from program to program, that the compiler is unable to statically choose the best policy, and that dynamic feedback enables the generated code to exhibit performance that is comparable to that of code that has been manually tuned to use the best policy. We have also performed a theoretical analysis which provides, under certain assumptions, a guaranteed optimality bound for dynamic feedback relative to a hypothetical (and unrealizable) optimal algorithm that uses the best policy at every point during the execution.
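The alternation of sampling and production phases described above is essentially a small control loop. A minimal sketch in C, assuming a table of compiled versions, wall-clock timing, and fixed phase lengths (all illustrative assumptions, not the paper's implementation):

#include <time.h>

#define NVERSIONS 3

typedef void (*version_fn)(void);          /* one compiled optimization policy per entry */
extern version_fn versions[NVERSIONS];     /* assumed: provided by the generated code    */

static double run_timed(version_fn f) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    f();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

void dynamic_feedback(long sampling_iters, long production_iters, long rounds) {
    for (long r = 0; r < rounds; r++) {
        int best = 0;
        double best_time = 1e30;
        /* Sampling phase: measure the overhead of every version in the
         * current environment. */
        for (int v = 0; v < NVERSIONS; v++) {
            double t = 0.0;
            for (long i = 0; i < sampling_iters; i++)
                t += run_timed(versions[v]);
            if (t < best_time) { best_time = t; best = v; }
        }
        /* Production phase: run the version that sampled fastest, then
         * resample to track changes in the environment. */
        for (long i = 0; i < production_iters; i++)
            versions[best]();
    }
}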
Compiler-directed page coloring for multiprocessors
In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), 1996
Cited by 66 (8 self)
Abstract:
This paper presents a new technique, compiler-directed page coloring, that eliminates conflict misses in multiprocessor applications. It enables applications to make better use of the increased aggregate cache size available in a multiprocessor. This technique uses the compiler’s knowledge of the access patterns of the parallelized applications to direct the operating system’s virtual memory page mapping strategy. We demonstrate that this technique can lead to significant performance improvements over two commonly used page mapping strategies for machines with either direct-mapped or two-way set-associative caches. We also show that it is complementary to latency-hiding techniques such as prefetching. We implemented compiler-directed page coloring in the SUIF parallelizing compiler and on two commercial operating systems. We applied the technique to the SPEC95fp benchmark suite, a representative set of numeric programs. We used the SimOS machine simulator to analyze the applications and isolate their performance bottlenecks. We also validated these results on a real machine, an eight-processor 350MHz Digital AlphaServer. Compiler-directed page coloring leads to significant performance improvements for several applications. Overall, our technique improves the SPEC95fp rating for eight processors by 8% over Digital UNIX’s page mapping policy and by 20% over page coloring, a standard page mapping policy. The SUIF compiler achieves a SPEC95fp ratio of 57.4, the highest ratio to date.
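A page "color" here is the cache slice a physical page falls into; conflict misses arise when pages that are used together share a color. A small illustrative sketch in C of how a color is computed (the cache and page sizes below are assumptions, not those of the machines in the paper):

#include <stdint.h>

#define PAGE_SIZE   (8u * 1024)            /* assumed 8 KB pages               */
#define CACHE_SIZE  (1u * 1024 * 1024)     /* assumed 1 MB direct-mapped cache */
#define NUM_COLORS  (CACHE_SIZE / PAGE_SIZE)

/* Color of a physical page: which slice of the cache its lines index into.
 * Two pages of the same color can conflict in a direct-mapped cache. */
static inline uint32_t page_color(uint64_t phys_addr) {
    return (uint32_t)((phys_addr / PAGE_SIZE) % NUM_COLORS);
}

In outline, the compiler-directed idea is that the compiler, knowing how the parallelized loops partition each array, picks a desired color for each virtual page, and the operating system satisfies the hint by allocating a free physical page of that color, so the pages one processor touches together do not share a cache slice.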
A Matrix-Based Approach to Global Locality Optimization
1998
Cited by 35 (10 self)
Abstract:
Global locality optimization is a technique for improving the cache performance of a sequence of loop nests through a combination of loop and data layout transformations. Pure loop transformations are restricted by data dependences and may not be very successful in optimizing imperfectly nested loops and explicitly parallelized programs. Although pure data transformations are not constrained by data dependences, the impact of a data transformation on an array might be program-wide; that is, it can affect all the references to that array in all the loop nests. Therefore, in this paper we argue for an integrated approach that employs both loop and data transformations. The method enjoys the advantages of most of the previous techniques for enhancing locality and is efficient. In our approach, the loop nests in a program are processed one by one, and the data layout constraints obtained from one nest are propagated for optimizing the remaining loop nests. We present a simple and effective matrix-based framework to implement this process. The search space that we consider for possible loop transformations can be represented by general non-singular linear transformation matrices, and the data layouts that we consider are those that can be expressed using hyperplanes. Experiments with several floating-point programs on an SGI Origin 2000 distributed-shared-memory machine demonstrate the efficacy of our
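A small worked example of the two ingredients, with notation assumed for illustration rather than taken from the paper: a loop transformation is a non-singular integer matrix acting on the iteration vector, and a data layout is a hyperplane vector acting on array subscripts.

\[
\vec{\imath} = \begin{pmatrix} i \\ j \end{pmatrix}, \qquad
T = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad
\vec{\imath}\,' = T\,\vec{\imath} = \begin{pmatrix} j \\ i \end{pmatrix},
\]

so a reference $A[F\vec{\imath} + \vec{f}\,]$ is rewritten as $A[F\,T^{-1}\vec{\imath}\,' + \vec{f}\,]$ after the loop transformation. A layout is expressed by a hyperplane vector $g$: elements $d_1, d_2$ with $g^{T} d_1 = g^{T} d_2$ lie in the same layout hyperplane and are stored near each other, so under one common convention $g = (1,\,0)^{T}$ describes a row-major layout of a two-dimensional array. Propagating constraints on $g$ from one optimized nest to the next is what turns the per-nest problems into a global one.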
Using fine-grain threads and run-time decision making in parallel computing
Journal of Parallel and Distributed Computing, 1996
Cited by 33 (14 self)
Abstract:
Programming distributed-memory multiprocessors and networks of workstations requires deciding what can execute concurrently, how processes communicate, and where data is placed. These decisions can be made statically by a programmer or compiler, or they can be made dynamically at run time. Using run-time decisions leads to a simpler interface, because decisions are implicit, and can lead to better decisions, because more information is available. This paper examines the costs, benefits, and details of making decisions at run time. The starting point is explicit fine-grain parallelism with any number (even thousands) of threads. Five specific techniques are considered: (1) implicitly coarsening the granularity of parallelism, (2) using implicit communication implemented by a distributed shared memory, (3) overlapping computation and communication, (4) adaptively moving threads and data between nodes to minimize communication and balance load, and (5) dynamically remapping data to pages to avoid false sharing. Details are given on the performance of each of these techniques, as well as their overall performance on several scientific applications.
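As one concrete illustration of technique (1), implicit granularity coarsening can be pictured as a runtime that folds thousands of logical fine-grain threads onto a handful of worker threads per node. A rough sketch in C with POSIX threads; the unit counts, worker count, and do_unit function are assumptions, not the paper's runtime:

#include <pthread.h>
#include <stdatomic.h>

#define NUNITS   100000     /* logical fine-grain threads (assumed)       */
#define NWORKERS 8          /* physical worker threads per node (assumed) */

extern void do_unit(long u);          /* body of one fine-grain thread */
static atomic_long next_unit;

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        long u = atomic_fetch_add(&next_unit, 1);   /* grab the next unit */
        if (u >= NUNITS)
            break;                                  /* all units consumed */
        do_unit(u);
    }
    return NULL;
}

void run_coarsened(void) {
    pthread_t tid[NWORKERS];
    for (int w = 0; w < NWORKERS; w++)
        pthread_create(&tid[w], NULL, worker, NULL);
    for (int w = 0; w < NWORKERS; w++)
        pthread_join(tid[w], NULL);
    /* Thread-creation cost is paid NWORKERS times, not NUNITS times. */
}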
Integrating Loop and Data Transformations for Global Optimisation
In Proc. International Conference on Parallel Architectures and Compilation Techniques (PACT '98), 1998
Cited by 31 (2 self)
Abstract:
This paper is concerned with integrating global data transformations and local loop transformations in order to minimise overhead on distributed shared memory machines such as the SGI Origin 2000. By first developing an extended algebraic transformation framework, a new technique is presented that allows the static application of global data transformations, such as partitioning, to reshaped arrays, eliminating the need for expensive temporary copies and hence eliminating any communication and synchronisation. In addition, by integrating loop and data transformations, any poor spatial locality or expensive array subscripts introduced by these transformations can be eliminated. A specific performance-improving algorithm is implemented, giving significant improvements in execution time.
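To make the interplay concrete, a small illustrative example (notation assumed, not the paper's): a global data transformation is a matrix $M$ applied to every subscript of an array, and a local loop transformation then repairs whatever access pattern that rewrite leaves behind.

\[
A[\,\vec{s}\,] \;\longrightarrow\; A'[\,M\vec{s}\,], \qquad
M = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},
\]

so a reference $A[i][j]$ becomes $A'[j][i]$ in every nest of the program at once. Any nest that is left walking $A'$ with a non-unit stride, or with more expensive subscript expressions, can then be fixed locally, for example by a loop interchange, provided that nest's data dependences allow it. This is the sense in which data transformations act globally while loop transformations act locally.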
Automatic Computation and Data Decomposition for Multiprocessors
1997
Cited by 29 (0 self)
Abstract:
Memory subsystem efficiency is critical to achieving high performance on parallel machines. The memory subsystem organization of modern multiprocessor architectures makes their performance highly sensitive to both the distribution of the computation and the layout of the data. A key issue in programming these machines is selecting the computation decomposition and data decomposition, the mapping of the computation and data, respectively, across the processors of the machine. A popular approach to the decomposition problem is to require programmers to perform the decomposition analysis themselves, and to communicate that information to the compiler using language extensions. This thesis presents a new compiler algorithm that automatically calculates computation and data decompositions for dense-matrix scientific codes. The core of the algorithm is based on a linear algebra framework for expressing and calculating the computation and data decompositions. Using the linear algebra model, the algorithm generates a system of equations that specifies the conditions the desired decompositions must satisfy. The decompositions are then calculated systematically by solving the system of equations. Since the best decompositions may change as different phases of the program are executed, the algorithm also considers re-organizing the data dynamically. The analysis is performed both within and across procedure boundaries so that entire programs can be analyzed.
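One way to see the kind of linear system the abstract refers to (notation assumed for illustration): write the computation decomposition and data decomposition as affine maps onto a processor grid and require that the processor executing an iteration also owns the data that iteration touches.

\[
C(\vec{\imath}) = C_L\,\vec{\imath} + \vec{c}, \qquad
D(\vec{a}) = D_L\,\vec{a} + \vec{d}, \qquad
\vec{a} = F\,\vec{\imath} + \vec{f}
\]

Demanding $D(F\vec{\imath} + \vec{f}\,) = C(\vec{\imath})$ for all iterations $\vec{\imath}$ gives, for each array reference,

\[
D_L F = C_L, \qquad D_L \vec{f} + \vec{d} = \vec{c},
\]

and the decompositions are obtained by solving the resulting system, relaxing whichever constraints cannot be satisfied without communication.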
A Linear Algebra Framework for Automatic Determination of Optimal Data Layouts
1999
Cited by 25 (5 self)
Abstract:
This paper presents a data layout optimization technique for sequential and parallel programs based on the theory of hyperplanes from linear algebra. Given a program, our framework automatically determines suitable memory layouts that can be expressed by hyperplanes for each array that is referenced. We discuss the cases where data transformations are preferable to loop transformations and show that under certain conditions a loop nest can be optimized for perfect spatial locality by using data transformations. We argue that data transformations can also optimize spatial locality for some arrays without distorting temporal/spatial locality exhibited by others. We divide the problem of optimizing data layout into two independent subproblems: 1) determining optimal static data layouts, and 2) determining data transformation matrices to implement the optimal layouts. By postponing the determination of the transformation matrix to the last stage, our method can be adapted to compilers with different default layouts. We then present an algorithm that considers optimizing parallelism and spatial locality simultaneously. Our results on eight programs on two distributed shared-memory multiprocessors, the Convex Exemplar SPP-2000 and the SGI Origin 2000, show that the layout optimizations are effective in optimizing spatial locality and parallelism.
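The spatial-locality condition behind hyperplane layouts can be stated in one line (symbols assumed for illustration): if $\delta$ is the change in an array's subscript vector between two consecutive iterations of the innermost loop, choose the layout hyperplane vector $g$ so that

\[
g^{T}\,\delta = 0,
\]

i.e., consecutive iterations touch elements lying in the same layout hyperplane and therefore in nearby memory locations. Because the condition constrains only $g$, the concrete data transformation matrix that realizes the layout relative to the compiler's default (row-major or column-major) can be chosen in a separate, final step, which is the decoupling into two subproblems described above.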
Automatic data layout for distributed-memory machines
ACM Transactions on Programming Languages and Systems, 1998
Cited by 23 (0 self)
Abstract:
The goal of languages like Fortran D or High Performance Fortran (HPF) is to provide a simple yet efficient machine-independent parallel programming model. After the algorithm selection, the data layout choice is the key intellectual challenge in writing an efficient program in such languages. The performance of a data layout depends on the target compilation system, the target machine, the problem size, and the number of available processors. This makes the choice of a good layout extremely difficult for most users of such languages. If languages such as HPF are to find general acceptance, the need for data layout selection support has to be addressed. We believe that the appropriate way to provide the needed support is through a tool that generates data layout specifications automatically. This article discusses the design and implementation of a data layout selection tool that generates HPF-style data layout specifications automatically. Because layout is done in a tool that is not embedded in the target compiler and hence will be run only a few times during the tuning phase of an application, it can use techniques such as integer programming that may be considered too computationally expensive for inclusion in production compilers. The proposed framework for automatic data layout selection builds and examines search spaces of candidate data layouts. A candidate layout is an efficient layout for
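A schematic of the kind of 0-1 integer program such an offline tool can afford to solve (variable and cost names are assumptions, not the article's formulation): pick exactly one candidate layout per program phase so that estimated execution cost plus remapping cost is minimized.

\[
\min \;\; \sum_{p} \sum_{\ell \in L_p} c_{p,\ell}\, x_{p,\ell}
\;+\; \sum_{p} \sum_{\ell,\ell'} r_{p,\ell,\ell'}\, y_{p,\ell,\ell'}
\qquad \text{subject to} \quad \sum_{\ell \in L_p} x_{p,\ell} = 1 \;\; \forall p, \quad x, y \in \{0,1\},
\]

where $x_{p,\ell} = 1$ selects candidate layout $\ell$ for phase $p$, $c_{p,\ell}$ is its predicted cost on the target machine and problem size, and $y_{p,\ell,\ell'}$ charges the cost $r$ of remapping data between consecutive phases that choose different layouts (the linking constraints between $x$ and $y$ are omitted here).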
Improving Locality For Adaptive Irregular Scientific Codes
1999
Cited by 21 (2 self)
Abstract:
An important class of scientific codes accesses memory in an irregular manner. Because irregular access patterns reduce temporal and spatial locality, they tend to underutilize caches, resulting in poor performance. Researchers have shown that consecutively packing data relative to traversal order can significantly reduce cache miss rates by increasing spatial locality. In this paper, we investigate techniques for using partitioning algorithms to improve locality in adaptive irregular codes. We develop parameters to guide both geometric (RCB) and graph partitioning (METIS) algorithms, and develop a new graph partitioning algorithm based on hierarchical clustering (GPART), which achieves good locality with low overhead. We also examine the effectiveness of locality optimizations for adaptive codes, where connection patterns dynamically change at intervals during program execution. We use a simple cost model to guide locality optimizations when access patterns change. Experiments on irregular scientific codes for a variety of meshes show our partitioning algorithms are effective for static and adaptive codes on both sequential and parallel machines. Improved locality also enhances the effectiveness of LOCALWRITE, a parallelization technique for irregular reductions based on the owner-computes rule.
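The consecutive-packing idea mentioned above (reordering data in the order it is first touched by the traversal) is simple enough to sketch directly. A minimal illustration in C; the array names and the in-place rewrite are assumptions, not the paper's code:

#include <stdlib.h>
#include <string.h>

/* Reorder node data into first-touch order of the traversal described by
 * index[0..nidx-1], and rewrite the index array to match, so that elements
 * accessed consecutively end up adjacent in memory. */
void pack_by_traversal(double *data, int *index, long nidx, long nnodes)
{
    long   *new_pos = malloc(nnodes * sizeof *new_pos);   /* old id -> new id */
    double *packed  = malloc(nnodes * sizeof *packed);
    long next = 0;

    for (long i = 0; i < nnodes; i++)
        new_pos[i] = -1;

    /* First-touch pass: assign positions in traversal order. */
    for (long i = 0; i < nidx; i++) {
        long old = index[i];
        if (new_pos[old] == -1) {
            new_pos[old] = next;
            packed[next] = data[old];
            next++;
        }
    }
    /* Nodes never touched keep their data in the remaining slots. */
    for (long i = 0; i < nnodes; i++) {
        if (new_pos[i] == -1) {
            new_pos[i] = next;
            packed[next] = data[i];
            next++;
        }
    }
    /* Redirect the traversal and install the packed layout. */
    for (long i = 0; i < nidx; i++)
        index[i] = (int)new_pos[index[i]];
    memcpy(data, packed, nnodes * sizeof *data);

    free(packed);
    free(new_pos);
}

For an adaptive code, the same packing would be reapplied (guided by a cost model, as the abstract describes) whenever the connection pattern changes enough to make the current layout stale.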