Results 1–2 of 2
Caching Puts and Gets in a PGAS Language Runtime
"... Abstract—We investigated a software cache for PGAS PUT and GET operations. The cache is implemented as a software write-back cache with dirty bits, local memory consistency operations, and programmer-guided prefetch. This cache sup-ports programmer productivity while enabling communication aggregati ..."
Abstract
- Add to MetaCart
(Show Context)
Abstract—We investigated a software cache for PGAS PUT and GET operations. The cache is implemented as a software write-back cache with dirty bits, local memory-consistency operations, and programmer-guided prefetch. This cache supports programmer productivity while enabling communication aggregation and overlap. We evaluated an implementation of this cache for remote data within the Chapel programming language. The cache provides a 2x speedup for several distributed-memory application benchmarks written in Chapel across a variety of network configurations. In addition, we observed that improvements to compiler optimization did not remove the benefit of the cache.
Keywords—PGAS; cache; remote data; communication aggregation; communication overlap; prefetch; Chapel
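The cache design this abstract describes — local write-back lines with dirty bits, an explicit flush for memory consistency, and programmer-guided prefetch — can be sketched as follows. This is a hypothetical illustration of the general technique, not the Chapel runtime's actual implementation; all class and method names (`RemoteMemory`, `WriteBackCache`, `flush`, `prefetch`) are invented for the example.

```python
class RemoteMemory:
    """Stands in for a remote node's address space; counts network ops."""
    def __init__(self):
        self.data = {}
        self.puts = 0   # remote writes issued over the network
        self.gets = 0   # remote reads issued over the network

    def put(self, addr, value):
        self.puts += 1
        self.data[addr] = value

    def get(self, addr):
        self.gets += 1
        return self.data.get(addr, 0)


class WriteBackCache:
    """Software write-back cache for PGAS-style PUT/GET (sketch)."""
    def __init__(self, remote):
        self.remote = remote
        self.lines = {}     # addr -> cached value
        self.dirty = set()  # dirty bits: addrs not yet written back

    def get(self, addr):
        if addr not in self.lines:           # miss: one remote GET
            self.lines[addr] = self.remote.get(addr)
        return self.lines[addr]              # hit: no network traffic

    def put(self, addr, value):
        self.lines[addr] = value             # write locally only
        self.dirty.add(addr)                 # set the dirty bit

    def flush(self):
        """Local consistency op: write back all dirty lines at once,
        aggregating many fine-grained PUTs into one burst."""
        for addr in self.dirty:
            self.remote.put(addr, self.lines[addr])
        self.dirty.clear()

    def prefetch(self, addrs):
        """Programmer-guided prefetch: bulk-fetch lines before use."""
        for addr in addrs:
            if addr not in self.lines:
                self.lines[addr] = self.remote.get(addr)


remote = RemoteMemory()
cache = WriteBackCache(remote)
for i in range(10):
    cache.put(i, i * i)   # ten logical PUTs, zero network traffic so far
cache.flush()             # one aggregation point: 10 remote puts issued
for i in range(10):
    cache.get(i)          # all cache hits: zero remote gets
print(remote.puts, remote.gets)  # -> 10 0
```

The point of the sketch is the communication pattern: without the cache, each PUT and GET above would be a separate network operation; with it, writes are batched at `flush` and reads after a `prefetch` (or a prior write) cost nothing.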
Improving Data Locality for Irregular Partitioned Global Address Space Parallel Programs
"... This paper describes a technique for improving the data ref-erence locality of parallel programs using the Partitioned Global Address Space (PGAS) model of computation. One of the principal challenges in writing PGAS parallel appli-cations is maximizing communication efficiency. This work describes ..."
Abstract
- Add to MetaCart
(Show Context)
This paper describes a technique for improving the data reference locality of parallel programs using the Partitioned Global Address Space (PGAS) model of computation. One of the principal challenges in writing PGAS parallel applications is maximizing communication efficiency. This work describes an on-line technique based on run-time data reference profiling to organize fine-grained data elements into locality-aware blocks suitable for coarse-grained communication. This technique is applicable to parallel applications with large, irregular, pointer-based data structures. The described system can perform automatic data relayout using the locality-aware mapping, either within iterative (timestep-based) applications or as a collective data relayout operation. An empirical evaluation of the approach shows that the technique is useful in increasing data reference locality and improves performance by 10–17% on the SPLASH-2 Barnes-Hut tree benchmark.
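The core idea here — profile which processor touches each fine-grained element at run time, then relayout elements into per-processor blocks for coarse-grained transfer — can be sketched in a few lines. This is a simplified, hypothetical illustration of the general approach, not the paper's system; the function names `profile_references` and `relayout` and the trace format are assumptions made for the example.

```python
from collections import defaultdict

def profile_references(access_trace):
    """Run-time reference profiling (sketch).
    access_trace: list of (processor, element_id) pairs observed at run time.
    Returns, for each element, the processor that references it most often."""
    counts = defaultdict(lambda: defaultdict(int))
    for proc, elem in access_trace:
        counts[elem][proc] += 1
    return {elem: max(c, key=c.get) for elem, c in counts.items()}

def relayout(elements, owner):
    """Locality-aware relayout (sketch): group elements with the same
    dominant processor into contiguous blocks, so each processor can
    fetch its working set with one coarse-grained transfer instead of
    many fine-grained ones."""
    blocks = defaultdict(list)
    for elem in elements:
        blocks[owner.get(elem, 0)].append(elem)
    order = []
    for proc in sorted(blocks):   # concatenate per-processor blocks
        order.extend(blocks[proc])
    return order

# Processor 0 mostly touches 'a' and 'c'; processor 1 touches 'b' and 'd'.
trace = [(0, 'a'), (0, 'c'), (1, 'b'), (1, 'd'), (0, 'c'), (1, 'b')]
owner = profile_references(trace)
print(relayout(['a', 'b', 'c', 'd'], owner))  # -> ['a', 'c', 'b', 'd']
```

In a timestep-based application, the profile from one iteration can drive the relayout applied before the next; as a collective operation, all processors would exchange their profiles and relayout the shared data once.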