Improving Compiler and Run-Time Support for Adaptive Irregular Codes
In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 1998
Cited by 30 (8 self)
Abstract
Irregular reductions form the core of adaptive irregular codes. On distributed-memory multiprocessors, they are parallelized either using sophisticated run-time systems (e.g., CHAOS, PILAR) or the shared-memory interface supported by software DSMs (e.g., CVM, TreadMarks). We introduce LOCALWRITE, a new technique based on the owner-computes rule, which eliminates the need for buffers or synchronized writes but may replicate computation. We evaluate its performance for irregular codes while varying connectivity, locality, and adaptivity. LOCALWRITE improves performance by 50–150% compared to using replicated buffers, and can match or exceed gather/scatter for applications with low locality or high adaptivity.
1 Introduction
Scientists are beginning to exploit parallelism to provide the computing power they need for research and development. As they attempt to model more complex problems, irregular adaptive computations become increasingly important. The core of these applications is fre...
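The owner-computes idea behind LOCALWRITE can be made concrete with a minimal sequential sketch. It assumes an edge-based reduction in which each edge adds a contribution to one endpoint and subtracts it from the other; the names `localwrite_reduction`, `force`, and `owner` are hypothetical, and the outer loop over `p` only simulates the processors. Each simulated processor evaluates every edge that touches a node it owns and writes only to those nodes, so no buffers or synchronized writes are needed, while edges whose endpoints have different owners are evaluated twice.

```python
from typing import Callable, List, Sequence, Tuple

Edge = Tuple[int, int]

def sequential_reduction(n_nodes: int, edges: Sequence[Edge],
                         force: Callable[[int, int], float]) -> List[float]:
    """Reference irregular reduction: edge (u, v) adds f to u and subtracts f from v."""
    acc = [0.0] * n_nodes
    for u, v in edges:
        f = force(u, v)
        acc[u] += f
        acc[v] -= f
    return acc

def localwrite_reduction(n_nodes: int, edges: Sequence[Edge],
                         force: Callable[[int, int], float],
                         owner: Sequence[int], n_procs: int) -> Tuple[List[float], int]:
    """Owner-computes sketch: processor p only ever writes nodes with owner[node] == p."""
    acc = [0.0] * n_nodes
    evaluations = 0
    for p in range(n_procs):          # simulated processors; in parallel these run concurrently
        for u, v in edges:            # a real system would precompute p's edge list once
            owns_u, owns_v = owner[u] == p, owner[v] == p
            if not (owns_u or owns_v):
                continue              # edge touches no node owned by p
            f = force(u, v)           # recomputed by both owners for cross-partition edges
            evaluations += 1
            if owns_u:
                acc[u] += f
            if owns_v:
                acc[v] -= f
    return acc, evaluations

# Usage: 4 nodes split across 2 processors; the edge computation is a made-up stand-in.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
owner = [0, 0, 1, 1]
force = lambda u, v: float(u + 2 * v)
result, evals = localwrite_reduction(4, edges, force, owner, 2)
print(result, evals)   # [3.0, 3.0, -1.0, -5.0] 8
```

The 8 evaluations for 5 edges show the replicated computation: the three cross-partition edges (1, 2), (3, 0), and (0, 2) are each computed by both owners, yet every node is written by exactly one processor.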
Improving Compiler and Run-Time Support for Irregular Reductions
1998
Cited by 21 (1 self)
Abstract
Compilers for distributed-memory multiprocessors parallelize irregular reductions either by generating calls to sophisticated run-time systems or by relying on the shared-memory interface supported by software DSMs. Run-time systems gather/scatter nonlocal results (e.g., CHAOS, PILAR), while software DSMs apply local reductions to replicated buffers (e.g., CVM, TreadMarks). We introduce LOCALWRITE, a new technique for parallelizing irregular reductions based on the owner-computes rule. It eliminates the need for buffers or synchronized writes, but may replicate computation. We investigate the impact of connectivity (node/edge ratio), locality (accesses to local data), and adaptivity (edge modifications) on the relative performance of these approaches. LOCALWRITE improves performance by 50–150% compared to using replicated buffers. Gather/scatter using CHAOS generally provides the best performance, but LOCALWRITE can outperform CHAOS for applications with low locality or high adaptivity. We also find that the flush-update coherence protocol can improve performance by 15–25% for software DSMs over an invalidate protocol.
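For contrast, the replicated-buffer strategy used with software DSMs can be sketched in the same style. This is a minimal illustration under assumptions, not CVM's or TreadMarks' implementation: the edge list is block-partitioned across simulated processors, each one accumulates into a private copy of the entire reduction array, and the copies are summed afterwards. The `locality` helper is a hypothetical proxy for the locality measure above, counting the fraction of edges whose two endpoints share an owner.

```python
from typing import Callable, List, Sequence, Tuple

Edge = Tuple[int, int]

def replicated_buffer_reduction(n_nodes: int, edges: Sequence[Edge],
                                force: Callable[[int, int], float],
                                n_procs: int) -> List[float]:
    chunk = -(-len(edges) // n_procs)            # ceil: contiguous block of edges per processor
    buffers = []
    for p in range(n_procs):
        local = [0.0] * n_nodes                  # private replica of the whole array
        for u, v in edges[p * chunk:(p + 1) * chunk]:
            f = force(u, v)                      # every edge is evaluated exactly once overall
            local[u] += f
            local[v] -= f
        buffers.append(local)
    # Combine phase: element-wise sum of all replicas; this is the step that
    # requires communication/synchronization on a real machine.
    return [sum(buf[i] for buf in buffers) for i in range(n_nodes)]

def locality(edges: Sequence[Edge], owner: Sequence[int]) -> float:
    """Hypothetical metric: fraction of edges whose endpoints have the same owner."""
    return sum(owner[u] == owner[v] for u, v in edges) / len(edges)

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
force = lambda u, v: float(u + 2 * v)            # made-up edge computation
print(replicated_buffer_reduction(4, edges, force, 2))   # [3.0, 3.0, -1.0, -5.0]
print(locality(edges, [0, 0, 1, 1]))                     # 0.4
```

Here every edge is evaluated once, but each processor carries a full-size buffer and the combine step touches every element; LOCALWRITE trades that overhead for recomputing cross-partition edges, which matters more as locality drops.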
Compiler Optimization Techniques for OpenMP Programs
Scientific Programming, 2001
Cited by 16 (0 self)
Abstract
In this paper, we present compiler optimization techniques for explicitly parallel programs written with the OpenMP API. To enable optimizations across threads, we designed dataflow analysis techniques that effectively model the interaction between threads. The structured description of parallelism and the relaxed memory consistency in OpenMP make the analyses effective and efficient. We show algorithms for reaching definitions analysis, memory synchronization analysis, and cross-loop data dependence analysis for parallel loops. Our primary target is a compiler-directed software DSM system, where aggressive compiler optimizations for the software-implemented coherence scheme are crucial for good performance. We also show optimizations applicable to general OpenMP implementations, namely redundant barrier removal and privatization of dynamically allocated objects. We consider compiler optimizations beneficial for performance portability across various platforms and for non-expert programmers. Experimental results for the coherence optimization in a compiler-directed software DSM system show that aggressive compiler optimizations are quite effective for a shared-write-intensive program, because the coherence-induced communication volume in such a program is much larger than that in shared-read-intensive programs.
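Redundant barrier removal can be illustrated with a small check. This sketch is an assumption-laden illustration, not the paper's algorithm: it assumes both loops have the same iteration count and are split into contiguous blocks, one per thread (one common way implementations realize `schedule(static)`), and it enumerates the elements each iteration touches instead of reasoning symbolically as a compiler would. The barrier is treated as removable when no element involved in a cross-loop dependence is touched by two different threads. `barrier_is_redundant` and the access callbacks are hypothetical names.

```python
from collections import defaultdict
from typing import Callable, Iterable

Access = Callable[[int], Iterable[int]]   # iteration index -> array elements touched

def barrier_is_redundant(n_iters: int, n_threads: int,
                         r1: Access, w1: Access, r2: Access, w2: Access) -> bool:
    """r1/w1: elements read/written by iteration i of loop 1; r2/w2: same for loop 2."""
    chunk = -(-n_iters // n_threads)      # contiguous block of iterations per thread
    def thread_of(i: int) -> int:
        return i // chunk

    readers1 = defaultdict(set)           # element -> threads that read it in loop 1
    writers1 = defaultdict(set)           # element -> threads that wrote it in loop 1
    for i in range(n_iters):
        for a in r1(i):
            readers1[a].add(thread_of(i))
        for a in w1(i):
            writers1[a].add(thread_of(i))

    for j in range(n_iters):
        t = thread_of(j)
        for a in r2(j):                   # flow dependence: loop 2 reads what loop 1 wrote
            if writers1[a] - {t}:
                return False
        for a in w2(j):                   # anti/output dependence on loop 1's accesses
            if (readers1[a] | writers1[a]) - {t}:
                return False
    return True

# Only array a is tracked; loop 2 writes a separate array b.
no_access = lambda i: []
# Loop 1: a[i] = ...;  loop 2: b[i] = a[i]  -> the same thread produces and consumes a[i]
print(barrier_is_redundant(8, 2, no_access, lambda i: [i], lambda i: [i], no_access))      # True
# Loop 2 reads a[i + 1] instead: iteration 3 (thread 0) needs a[4] from thread 1
print(barrier_is_redundant(8, 2, no_access, lambda i: [i], lambda i: [i + 1], no_access))  # False
```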
Efficient compiler and run-time support for parallel irregular reductions
2000
CAS-DSM: A Compiler Assisted Software Distributed Shared Memory
Int. J. Parallel Program.
Cited by 1 (1 self)
Abstract
Traditional software Distributed Shared Memory (DSM) systems rely on virtual memory management mechanisms to detect accesses to shared memory locations and maintain their consistency. The resulting involvement of the OS kernel, and its significant associated overhead, can be avoided by careful compile-time analysis and code instrumentation. In this paper, we propose such a Compiler Assisted Software support approach (CAS-DSM). In the CAS-DSM implementation, the involvement of the OS kernel is avoided by instrumenting the application code at the source level. The overhead caused by executing the instrumented code is reduced through several aggressive compile-time optimizations. Finally, we also address the issue of reducing certain overheads in the polling-based implementation of receiving asynchronous messages. We used SUIF, a public domain compiler tool, to implement the compile-time analysis, instrumentation, and optimizations. We modified CVM, a publicly available software DSM, to support the instrumentation inserted by the compiler. A detailed performance evaluation of CAS-DSM is reported using a set of SPLASH/SPLASH-2 parallel application benchmarks on a distributed-memory IBM SP-2 machine. CAS-DSM achieved moderate to good performance improvements for most of the applications compared to the original CVM implementation. Reducing the overheads in the polling-based implementation improves the performance of CAS-DSM significantly, resulting in an overall improvement of 12–52% over the original CVM implementation.
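The core idea, replacing page-fault detection with checks inserted into the program source, can be sketched as follows. This is a hypothetical model, not CAS-DSM's or CVM's code: `SharedArray`, `check_read`, `check_write`, and the block states are invented for illustration, and `_fetch` only counts what would be a message to the block's home node. `sum_naive` places a check before every shared load; `sum_hoisted` shows the kind of compile-time optimization that performs one check per coherence block, which assumes (as under release consistency) that a block's state cannot change inside the loop because no synchronization occurs there.

```python
from enum import Enum

class State(Enum):
    INVALID = 0
    READ_ONLY = 1
    READ_WRITE = 2

class SharedArray:
    """Hypothetical shared region whose coherence is checked by instrumented code."""

    def __init__(self, values, block: int):
        self.data = list(values)
        self.block = block                              # coherence unit, in elements
        self.state = [State.INVALID] * (-(-len(self.data) // block))
        self.checks = 0                                 # instrumentation calls executed
        self.fetches = 0                                # simulated remote fetches

    def _fetch(self, b: int) -> None:
        self.fetches += 1                               # stand-in for a request to b's home node

    def check_read(self, i: int) -> None:
        self.checks += 1
        b = i // self.block
        if self.state[b] is State.INVALID:
            self._fetch(b)
            self.state[b] = State.READ_ONLY

    def check_write(self, i: int) -> None:
        self.checks += 1
        b = i // self.block
        if self.state[b] is State.INVALID:
            self._fetch(b)
        self.state[b] = State.READ_WRITE                # a real DSM would also record the write

def sum_naive(arr: SharedArray) -> int:
    total = 0
    for i in range(len(arr.data)):
        arr.check_read(i)                               # check inserted before every shared load
        total += arr.data[i]
    return total

def sum_hoisted(arr: SharedArray) -> int:
    total = 0
    n = len(arr.data)
    for start in range(0, n, arr.block):
        arr.check_read(start)                           # one check covers the whole block
        for i in range(start, min(start + arr.block, n)):
            total += arr.data[i]
    return total

a, b = SharedArray(range(1000), 256), SharedArray(range(1000), 256)
print(sum_naive(a), a.checks, a.fetches)     # 499500 1000 4
print(sum_hoisted(b), b.checks, b.fetches)   # 499500 4 4
```

Both versions fetch the same four blocks; the optimization only removes redundant checks, which is where the instrumentation overhead comes from.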
A Measurement-Based Algorithm for Mapping Consistency Protocols to Shared Data
In the Second International Workshop on Software Distributed Shared Memory (in conjunction with the International Conference on Supercomputing), 2000
Cited by 1 (1 self)
Abstract
Distributed Shared Memory (DSM) systems often provide exactly one consistency protocol for all shared data [2, 4]. Recent systems adaptively select consistency protocols using heuristic analysis of recent access patterns [1, 11, 12, 14, 17]. Although approaches based on access patterns can significantly improve application performance, other factors also influence the performance of a consistency protocol, such as network bandwidth, congestion, latency, and topology. These factors become particularly important in a wide-area environment, where access pattern analysis alone can lead to suboptimal performance [6]. In this paper, we describe an algorithm for selecting consistency protocols on a per-segment or per-object basis. It uses a low-overhead, measurement-based performance metric to determine the most appropriate consistency protocol for each segment. Because the measurements are based on "observed" performance, the system can react to any LAN or WAN environment in which the application is running.
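A per-segment selector driven by observed performance might look like the sketch below. It is a minimal illustration under stated assumptions: the protocol names are placeholders, the cost is whatever low-overhead metric the system measures (the exact metric is not described here), and the rule of trying each protocol once and then picking the lowest mean cost is a simple stand-in for the authors' algorithm. `ProtocolSelector`, `record`, and `choose` are hypothetical names.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Sequence

class ProtocolSelector:
    """Hypothetical per-segment consistency-protocol selector based on observed cost."""

    def __init__(self, protocols: Sequence[str]):
        self.protocols = list(protocols)
        # segment -> protocol -> list of measured costs (units are an assumption)
        self.samples: Dict[str, Dict[str, List[float]]] = defaultdict(lambda: defaultdict(list))

    def record(self, segment: str, protocol: str, cost: float) -> None:
        self.samples[segment][protocol].append(cost)

    def choose(self, segment: str) -> str:
        seen = self.samples[segment]
        for p in self.protocols:           # measure every protocol at least once
            if not seen[p]:
                return p
        return min(self.protocols, key=lambda p: mean(seen[p]))  # lowest observed mean cost

sel = ProtocolSelector(["invalidate", "update"])
print(sel.choose("seg0"))                   # invalidate (not yet measured)
sel.record("seg0", "invalidate", 5.0)
sel.record("seg0", "invalidate", 7.0)
print(sel.choose("seg0"))                   # update (not yet measured)
sel.record("seg0", "update", 3.0)
print(sel.choose("seg0"))                   # update (mean 3.0 < 6.0)
sel.record("seg1", "invalidate", 1.0)
sel.record("seg1", "update", 4.0)
print(sel.choose("seg1"))                   # invalidate
```

Because the choice is keyed by segment, two segments of the same program can end up with different protocols, and measurements taken in a different LAN or WAN setting can change the choice without any access-pattern heuristics.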
Abstract Of Dissertation
2002
Abstract
A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the College of Engineering at the University of Kentucky. By Christopher Stephen Diaz, Marblehead, Massachusetts. Director: Dr. James Griffioen, Professor of Computer Science. Lexington, Kentucky, 2002.
ADAPTIVE CONSISTENCY PROTOCOLS FOR LOCAL AND WIDE AREA NETWORKS
While Distributed Shared Memory (DSM) systems provide runtime speedup for compute-intensive applications, performance is often limited by the use of a suboptimal consistency protocol. DSM systems traditionally either provide only one consistency protocol, whose performance the application must accept even when it is suboptimal, or require the application programmer to select the protocol, burdening the programmer with analyzing the algorithm and the performance effects of sharing data. To address these problems, recent research proposes that the DSM system choose the consistency protocol adaptively. However, these approaches, which are based on data access patterns, may choose a suboptimal protocol and actually degrade performance. Furthermore, factors unknown until runtime may affect consistency protocol performance, so for the same application, a protocol chosen by access pattern may be optimal in one environment but suboptimal in another.
Summary
Abstract
We have prepared three monoclonal antibodies against human epidermal keratins. These antibodies were highly specific for keratins and, in combination, recognized all major epidermal keratins of several mammalian species. We have used these antibodies to study the tissue distribution of epidermis-related keratins. In various mammalian epithelia, the antibodies recognized seven classes of keratins defined by their immunological reactivity and size. The 40, 46 and 52 kilodalton (kd) keratin classes were present in almost all epithelia; the 50 kd and 58 kd keratin classes were detected in all stratified squamous epithelia, but not in any simple epithelia; and the 56 kd and 65–67 kd keratin classes were unique to keratinized epidermis. Thus the expression of specific keratin classes appeared to correlate with different types of epithelial differentiation (simple versus stratified; keratinized versus nonkeratinized).