Results 1 – 5 of 5
An Optimal Scheduling Scheme for Tiling in Distributed Systems
"... Abstract — There exist several scheduling schemes for parallelizing loops without dependences for shared and distributed memory systems. However, efficiently parallelizing loops with dependences is a more complicated task. This becomes even more difficult when the loops are executed on a distributed ..."
Abstract
- Add to MetaCart
(Show Context)
There exist several scheduling schemes for parallelizing loops without dependences on shared- and distributed-memory systems. Efficiently parallelizing loops with dependences, however, is a more complicated task, and it becomes even harder when the loops are executed on a distributed-memory cluster, where communication and synchronization can become a bottleneck. The problem lies in the processor idle time that occurs during the beginning and final stages of the execution. In this paper we propose a new scheduling scheme that minimizes processor idle time, thus improving load balancing and performance. The scheme is applied to two-dimensional iteration spaces with dependences and follows a tiled wavefront pattern in which the tile size gradually decreases in all dimensions. We have tested the proposed scheme on a dedicated, homogeneous cluster of workstations and verified that it significantly improves execution times over scheduling with traditional tiling.
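The abstract's core idea can be illustrated with a small sketch. Tiles on the same anti-diagonal of a 2D tiled iteration space with uniform dependences are independent and can run concurrently (the wavefront), and shrinking tile sizes toward the end keeps processors busy while the pipeline drains. The halving rule below is only an illustrative stand-in; the paper's actual tile-size formula is not reproduced here.

```python
def decreasing_tiles(extent, first, shrink=2, min_size=1):
    """Split [0, extent) into tiles whose widths shrink by 'shrink'
    (an assumed, simplified decrease rule)."""
    bounds, lo, size = [], 0, first
    while lo < extent:
        hi = min(lo + size, extent)
        bounds.append((lo, hi))
        lo = hi
        size = max(min_size, size // shrink)
    return bounds

def wavefront_schedule(n_ti, n_tj):
    """Group tile coordinates by anti-diagonal: every tile in step s
    depends only on tiles in steps < s, so each step's tiles can
    execute concurrently."""
    steps = {}
    for a in range(n_ti):
        for b in range(n_tj):
            steps.setdefault(a + b, []).append((a, b))
    return [steps[s] for s in sorted(steps)]
```

Applying `decreasing_tiles` on both axes means the late wavefront steps consist of many small tiles, so the final stages expose more parallelism than uniform tiling would.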
Affine Loop Optimization Based on Modulo Unrolling in Chapel
"... This paper presents modulo unrolling without unrolling (mod-ulo unrolling WU), a method for message aggregation for parallel loops in message passing programs that use affine ar-ray accesses in Chapel, a Partitioned Global Address Space (PGAS) parallel programming language. Messages incur a non-triv ..."
Abstract
- Add to MetaCart
(Show Context)
This paper presents modulo unrolling without unrolling (modulo unrolling WU), a method of message aggregation for parallel loops with affine array accesses in message-passing programs written in Chapel, a Partitioned Global Address Space (PGAS) parallel programming language. Messages incur a non-trivial run-time overhead, a significant component of which is independent of the message size; aggregating messages therefore improves performance. Our optimization for message aggregation is based on a technique known as modulo unrolling, pioneered by Barua [3], whose original purpose was to ensure a statically predictable single tile number for each memory reference on tiled architectures such as the MIT Raw Machine [18]. Modulo unrolling WU applies to data distributed in a cyclic or block-cyclic manner.
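A hedged sketch of the underlying observation: under a cyclic distribution over P locales, element i lives on locale i % P, so for an affine access a[s*i + o], unrolling the loop by P makes each unrolled copy touch one statically known locale, and its remote elements can be fetched in a single aggregate message instead of one message per element. The helper names below are illustrative, not Chapel's runtime API.

```python
def owner(index, num_locales):
    """Cyclic distribution: element i lives on locale i % P."""
    return index % num_locales

def aggregate_requests(stride, offset, n_iters, num_locales):
    """Group the indices touched by a[stride*i + offset], i in
    [0, n_iters), into one batch per owning locale. Each batch
    corresponds to one aggregated message."""
    batches = {}
    for i in range(n_iters):
        idx = stride * i + offset
        batches.setdefault(owner(idx, num_locales), []).append(idx)
    return batches
```

Because (stride*(k*P + u) + offset) % P is independent of k, each unrolled copy u always addresses the same locale, which is exactly the static predictability the modulo-unrolling transformation exploits.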
Suppressing Independent Loops in Packing/Unpacking Loop Nest to Reduce Message Size for Message-Passing Code
"... Abstract- In this paper we experiment with two optimization techniques we are considering implementing in a parallelizing compiler that generates parallel code for a distributed-memory system. We have found that there are two problems that often arise from the automatically generated message-passing ..."
Abstract
- Add to MetaCart
(Show Context)
In this paper we experiment with two optimization techniques we are considering implementing in a parallelizing compiler that generates parallel code for distributed-memory systems. We have found two problems that often arise in automatically generated message-passing code: 1) messages contain redundant data, and 2) the same data is sometimes transmitted to different processors, yet the messages are repacked for each processor. Our experiments demonstrate that it is indeed worthwhile to suppress the packing of redundant information into a message: not only did it improve performance, it also allowed us to run the program on a larger input size. We also found that it is not worthwhile to suppress the repacking of the same message, because message size is a greater factor in the performance of a message-passing program than the number of instructions executed.
Communication-aware Supernode Shape
"... Abstract — In this paper we revisit the supernode-shape selec-tion problem, that has been widely discussed in bibliography. In general, the selection of the supernode transformation greatly affects the parallel execution time of the transformed algorithm. Since the minimization of the overall parall ..."
Abstract
- Add to MetaCart
(Show Context)
In this paper we revisit the supernode-shape selection problem, which has been widely discussed in the literature. In general, the choice of supernode transformation greatly affects the parallel execution time of the transformed algorithm. Since minimizing the overall parallel execution time via an appropriate supernode transformation is very difficult, researchers have focused on scheduling-aware supernode transformations that maximize parallelism during execution. In this paper we argue that the communication volume of the transformed algorithm is an important criterion whose minimization should be given high priority. For this reason we define the metric of per-process communication volume and propose a method to minimize this metric by selecting a communication-aware supernode shape. Our approach is equivalent to defining a proper Cartesian process grid with MPI_Cart_create, which means it can be incorporated into applications in a straightforward manner. Our experimental results illustrate that selecting the tile shape with the proposed method significantly reduces total parallel execution time due to the minimized communication volume, despite requiring a few more parallel execution steps.
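A simplified stand-in for the idea: for an N1 x N2 iteration space split over P processes in a p1 x p2 Cartesian grid with nearest-neighbour halo exchange, each process communicates roughly 2*(N1/p1 + N2/p2) boundary elements per step. Choosing the factorization of P that minimizes this surface term is a hedged sketch of communication-aware shape selection (the paper's metric and method are more general); the resulting dims could then be passed to MPI_Cart_create.

```python
def best_grid(num_procs, n1, n2):
    """Pick the p1 x p2 factorization of num_procs that minimizes the
    per-process halo-exchange volume 2*(n1/p1 + n2/p2)."""
    best = None
    for p1 in range(1, num_procs + 1):
        if num_procs % p1:
            continue
        p2 = num_procs // p1
        volume = 2 * (n1 / p1 + n2 / p2)   # per-process boundary size
        if best is None or volume < best[0]:
            best = (volume, (p1, p2))
    return best[1]
```

For a square domain the minimizer is the squarest grid, while an elongated domain pulls the grid toward splitting its long axis, which is the intuitive effect of communication-aware shape selection.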
Coarse-grain Parallel Execution for 2-dimensional PDE Problems
"... This paper presents a new approach for the execution of coarse-grain (tiled) parallel SPMD code for applications derived from the explicit discretization of 2-dimensional PDE problems with finite-differencing schemes. Tiling transformation is an efficient loop transformation to achieve coarse-grain ..."
Abstract
- Add to MetaCart
(Show Context)
This paper presents a new approach to executing coarse-grain (tiled) parallel SPMD code for applications derived from the explicit discretization of two-dimensional PDE problems with finite-differencing schemes. Tiling is an efficient loop transformation for achieving coarse-grain parallelism in such algorithms, and rectangular tile shapes are the only shapes that program developers can feasibly apply by hand. However, rectangular tiling transformations are not always valid due to data dependences, and thus require an appropriate skewing transformation prior to tiling to enable rectangular tile shapes. We employ cyclic mapping of tiles to processes and propose a method for determining an efficient rectangular tiling transformation for a fixed number of processes for two-dimensional, skewed PDE problems. Our experimental results confirm the merit of coarse-grain execution for this family of applications and indicate that the proposed method selects highly efficient tiling transformations.
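Why skewing is needed can be shown on a small example. For a 1D three-point stencil iterated over time, the dependences (1,-1), (1,0), (1,1) make rectangular tiles in (t, i) invalid, but the skew i' = i + t turns them into (1,0), (1,1), (1,2), all lexicographically non-negative, so rectangular tiling of the skewed space is legal. The sketch below (an assumed minimal stencil, not the paper's benchmark) checks that the skewed loop order reproduces the original computation.

```python
def stencil(u0, steps):
    """Reference: unskewed time loop over a three-point average
    stencil with clamped boundaries."""
    u = list(u0)
    n = len(u)
    for _ in range(steps):
        u = [(u[max(i - 1, 0)] + u[i] + u[min(i + 1, n - 1)]) / 3
             for i in range(n)]
    return u

def stencil_skewed(u0, steps):
    """Same computation traversed with the spatial index skewed by t,
    the order a rectangular tiling of the skewed space would follow."""
    u = list(u0)
    n = len(u)
    for t in range(steps):
        nxt = [0.0] * n
        for ip in range(t, t + n):      # skewed index i' = i + t
            i = ip - t
            nxt[i] = (u[max(i - 1, 0)] + u[i] + u[min(i + 1, n - 1)]) / 3
        u = nxt
    return u
```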