Compiler Transformations for High-Performance Computing
- ACM Computing Surveys, 1994
- Cited by 332 (4 self)
In the last three decades a large number of compiler transformations for optimizing programs have been implemented. Most optimizations for uniprocessors reduce the number of instructions executed by the program using transformations based on the analysis of scalar quantities and data-flow techniques. In contrast, optimization for ...
Provably efficient scheduling for languages with fine-grained parallelism
- In Proc. Symposium on Parallel Algorithms and Architectures, 1995
- Cited by 68 (22 self)
Many high-level parallel programming languages allow for fine-grained parallelism. As in the popular work-time framework for parallel algorithm design, programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. A common concern in executing such programs is to schedule tasks to processors dynamically so as to minimize not only the execution time but also the amount of space (memory) needed. Without careful scheduling, the parallel execution on p processors can use more space than a sequential implementation of the same program by a factor of p or more. This paper first identifies a class of parallel schedules that are provably efficient in both time and space. For any ...
The Impact of Synchronization and Granularity on Parallel Systems
- In Int'l. Symp. on Computer Architecture, 1990
- Cited by 40 (4 self)
In this paper, we study the impact of synchronization and granularity on the performance of parallel systems using an execution-driven simulation technique. We find that even though there can be a lot of parallelism at the fine-grain level, synchronization and scheduling strategies determine the ultimate performance of the system. Loop-iteration level parallelism seems to be a more appropriate level when those factors are considered. We also study barrier synchronization and data synchronization at the loop-iteration level and find that both schemes are needed for better performance.
Automatic scalability analysis of parallel programs based on modeling techniques
- In Computer Performance Evaluation: Modelling Techniques and Tools (LNCS 794), 1994
- Cited by 17 (1 self)
When implementing parallel programs for parallel computer systems, the performance scalability of these programs should be tested and analyzed on different computer configurations and problem sizes. Since a complete scalability analysis is too time-consuming and is limited to existing systems, extensions of modeling approaches can be considered for analyzing the behavior of parallel programs under different problem and system scenarios. In this paper, a method for automatic scalability analysis using modeling is presented. Initially, we identify the important problems that arise when attempting to apply modeling techniques to scalability analysis. Based on this study, we define the Parallelization Description Language (PDL) that is used to describe parallel execution attributes of a generic program workload. Based on a parallelization description, stochastic models like graph models or Petri net models can be automatically generated from a generic model to analyze performance for scaled parallel systems as well as scaled input data. The complexity of the graph models produced depends significantly on the type of parallel computation described. We present several computation classes where tractable graph models can be generated and then compare the results of these automatically scaled models with their exact solutions using the PEPP modeling tool.
Parallel Loop Scheduling for High Performance Computers
- High Performance Computing: Technology, Methods, and Applications, 1994
- Cited by 6 (2 self)
This article reviews current loop scheduling algorithms and studies their scheduling overhead versus load balancing tradeoffs. Using analytical models, simulations, and experimental measurements, the performance and the scalability of chunk scheduling, self-scheduling, guided self-scheduling, factoring, and trapezoid self-scheduling are compared.
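As a rough illustration of the overhead versus load-balancing tradeoff named in this abstract, the sketch below lists the chunk sizes that four of the schemes would dispatch for a loop of `n` iterations on `p` processors. It uses the commonly cited textbook formulations (guided self-scheduling takes ceil(R/p) of the R remaining iterations; factoring hands out batches of p chunks of size ceil(R/(2p))), not code from the article. The function names are hypothetical, and trapezoid self-scheduling is omitted.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)

def fixed_chunks(n, k):
    """Chunk scheduling: each dispatch hands out k iterations (the last may be smaller)."""
    sizes, remaining = [], n
    while remaining > 0:
        size = min(k, remaining)
        sizes.append(size)
        remaining -= size
    return sizes

def self_scheduling_chunks(n):
    """Self-scheduling: one iteration per dispatch (best balance, most overhead)."""
    return fixed_chunks(n, 1)

def guided_chunks(n, p):
    """Guided self-scheduling: each dispatch takes ceil(remaining / p) iterations."""
    sizes, remaining = [], n
    while remaining > 0:
        size = ceil_div(remaining, p)
        sizes.append(size)
        remaining -= size
    return sizes

def factoring_chunks(n, p):
    """Factoring: batches of p equal chunks, each of size ceil(remaining / (2p))."""
    sizes, remaining = [], n
    while remaining > 0:
        size = ceil_div(remaining, 2 * p)
        for _ in range(p):
            if remaining == 0:
                break
            take = min(size, remaining)
            sizes.append(take)
            remaining -= take
    return sizes
```

Fewer, larger chunks mean fewer scheduling operations; smaller chunks toward the end of the loop leave more room to even out the load across processors.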
Parameter Estimation for a Generalized Parallel Loop Scheduling Algorithm
- Department of Computer Science, University of Minnesota, 1995
- Cited by 4 (1 self)
Algorithms that dynamically schedule parallel loop iterations in a shared-memory multiprocessor have been proposed to balance the processors' workload while maintaining low scheduling overhead. However, none of the existing strategies perform well for all types of loops on all types of system architectures. We present a generalized loop scheduling algorithm that can be adjusted to match the loop characteristics to the system environment. A new method of simulation using the Genetic Algorithm is developed to determine appropriate scheduling parameters. This approach allows us to quickly choose sets of scheduling parameters for different loops executing on different systems. Stochastic simulations show that our parameterized strategies perform at least as well as the best existing algorithms for different combinations of loop iteration characteristics and system assumptions. Our generalized strategy is thus more robust than existing strategies. Keywords: Parallel loop scheduling; Perform...
Extracting data flow information for parallelizing FORTRAN nested loop kernels
- 1994
- Cited by 1 (0 self)
Currently available parallelizing FORTRAN compilers expend a large amount of effort in determining data independent statements in a program such that these statements can be scheduled in parallel without need for synchronisation. This thesis hypothesises that it is just as important to derive exact data flow information about the data dependencies where they exist. We focus on the specific problem of imperative nested loop parallelization by describing a direct method for determining the distance vectors of the inter-loop data dependencies in an n-nested loop kernel. These distance vectors define dependence arcs between iterations, which are represented as points in n-dimensional Euclidean space. To demonstrate some of the benefits gained from deriving such exact data flow information about a nested loop computation, we show how implicit task graph information about the computation can be deduced. Deriving the implicit task graph of the computation enables the parallelization of a class ...
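To make the notion of a distance vector concrete, here is a minimal sketch for the simplest textbook case only: one write `A[I + write_offset]` and one read `A[I + read_offset]` in the same loop body, where `I` is the iteration vector and both offsets are constants. It is not the thesis's direct method, which the abstract does not spell out; the function name and the offset representation are assumptions made for illustration.

```python
def distance_vector(write_offset, read_offset):
    """Classify the dependence between a write A[I + write_offset] and a
    read A[I + read_offset] inside the same n-nested loop body.

    Returns (kind, distance), where distance is sink iteration minus
    source iteration, so it is never lexicographically negative.
    """
    if len(write_offset) != len(read_offset):
        raise ValueError("offsets must have one entry per loop level")
    # The element written in iteration I_w is read in iteration
    # I_r = I_w + (write_offset - read_offset).
    d = tuple(w - r for w, r in zip(write_offset, read_offset))
    if all(c == 0 for c in d):
        return ("loop-independent", d)
    first_nonzero = next(c for c in d if c != 0)
    if first_nonzero > 0:
        # The write happens in an earlier iteration than the read.
        return ("flow", d)
    # The read happens first, so the arc runs from the read to the write.
    return ("anti", tuple(-c for c in d))
```

For a statement such as `A[i][j] = A[i-1][j+1] + 1`, calling `distance_vector((0, 0), (-1, 1))` classifies the dependence as a flow dependence with distance `(1, -1)`: iteration `(i, j)` consumes the value produced by iteration `(i-1, j+1)`.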
Simulation Of Static And Dynamic Task Scheduling On Multiprocessor Systems
- 1994
- Cited by 1 (0 self)
Contents:
1 Introduction
2 The Problem of Task Scheduling on Parallel Processing Systems
2.1 Defining Task Scheduling on Parallel Processing Systems
2.2 Review of Research on the Problem of Task Scheduling
3 Program Model: The Hierarchical Task Graph
3.1 Introduction
3.2 Directed Acyclic Graphs
3.3 Task Graph Representation of Computer Programs
3.4 Hierarchical Representation of Task Graphs
3.5 Hierarchical Task Graph Generator ...
Cache-Based Data Distribution Constrained Scheduling
- 1994
The primary goal of processor scheduling is to assign tasks in a parallel program to processors so as to minimize the execution time. Most existing approaches to processor scheduling for multiprocessors assume that the execution time of each task is fixed and is independent of processor scheduling. In this paper, we argue that the execution time of a given task is not fixed but is critically dependent on the performance of the caches, which have become an essential component of shared-memory multiprocessors, and we propose a scheduling algorithm called the data distribution constrained scheduling algorithm. The proposed algorithm tries to maximize the number of cache hits by scheduling the processors so that the task that brings a memory block into the cache and the tasks that subsequently access the same memory block are executed on the same processor. Keywords: Cache, scheduling, multiprocessor, tasks. 1. Introduction. The desire and need for more computing power have motivated...
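The idea of keeping a memory block's first user and its later users on the same processor can be sketched with a simple greedy affinity heuristic. This is an illustrative stand-in, not the paper's data distribution constrained scheduling algorithm, whose details the abstract does not give. The task representation, the function name `affinity_schedule`, and the simplifications (unbounded caches, no precedence constraints, unit task cost) are all assumptions.

```python
def affinity_schedule(tasks, num_procs):
    """Greedy sketch: place each task on the processor whose (hypothetical,
    unbounded) cache already holds the most of the task's memory blocks.

    tasks: list of (name, set_of_block_ids) in dispatch order.
    Returns a dict mapping task name to processor index.
    """
    cached = [set() for _ in range(num_procs)]  # blocks touched on each processor
    load = [0] * num_procs                      # number of tasks placed so far
    assignment = {}
    for name, blocks in tasks:
        # Prefer the largest cache overlap; break ties by lower load,
        # then by lower processor index.
        best = max(
            range(num_procs),
            key=lambda q: (len(cached[q] & blocks), -load[q], -q),
        )
        assignment[name] = best
        cached[best] |= blocks
        load[best] += 1
    return assignment
```

With tasks `t1` and `t3` both touching block `"A"` and `t2` and `t4` both touching block `"B"`, two processors end up with one block each: the task that first loads a block and the task that reuses it share a processor, which is the behaviour the abstract describes.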

