Results 1 - 8 of 8
SUIF Explorer: an interactive and interprocedural parallelizer
, 1999
Abstract - Cited by 55 (5 self)
The SUIF Explorer is an interactive parallelization tool that is more effective than previous systems in minimizing the number of lines of code that require programmer assistance. First, the interprocedural analyses in the SUIF system are successful in parallelizing many coarse-grain loops, thus minimizing the number of spurious dependences requiring attention. Second, the system uses dynamic execution analyzers to identify those important loops that are likely to be parallelizable. Third, the SUIF Explorer is the first to apply program slicing to aid programmers in interactive parallelization. The system guides the programmer in the parallelization process using a set of sophisticated visualization techniques. This paper demonstrates the effectiveness of the SUIF Explorer with three case studies. The programmer was able to speed up all three programs by examining only a small fraction of the program and privatizing a few variables.
Experiment Management Support for Performance Tuning
- PROCEEDINGS OF THE SC’97 CONFERENCE
, 1997
Abstract - Cited by 14 (2 self)
The development of a high-performance parallel system or application is an evolutionary process -- both the code and the environment go through many changes during a program's lifetime -- and at each change, a key question for developers is: how and how much did the performance change? No existing performance tool provides the necessary functionality to answer this question. This paper reports on the design and preliminary implementation of a tool which views each execution as a scientific experiment and provides the functionality to answer questions about a program's performance which span more than a single execution or environment. We report results of using our tool with an actual performance tuning study and with a scientific application run in changing environments. Our goal is to use historic program performance data to develop techniques for parallel program performance diagnosis.
Improving the Speedup of Parallel and Distributed Applications on Clusters and Multi-Clusters
, 2003
Abstract - Cited by 4 (2 self)
In parallel and distributed computing, clusters are increasingly used for compute- and I/O-intensive applications. As we add computing resources to a parallel application, one of the fundamental questions is how well the application scales, both with regard to speedup and to increasing the problem size. This dissertation reports on two main issues impacting scaling. The first is the end-to-end communication latency. The other is the configuration and mapping of the application onto a cluster topology and architecture. Several factors were studied to determine their impact on end-to-end latency, including protocols, workload, and locating communication endpoints at user-, kernel-, and interrupt-level. The dominating contribution to latency comes from complex protocols, such as TCP/IP, which do not take advantage of properties in the interconnect to reduce the amount of processing for communication. The latency reduction from choosing a protocol with lower overhead is approximately an order of magnitude larger than the reduction
Experiment Management Support for Parallel Performance Tuning
- UNIVERSITY OF WISCONSIN - MADISON
, 1999
On-Line Debugging and Performance Monitoring with Barriers
- In 15th International Parallel and Distributed Processing Symposium (IPDPS)
, 2001
Abstract - Cited by 1 (1 self)
We introduce the Stupid Barrier Tricks (SBT) library for on-line debugging and performance monitoring of shared-memory parallel programs. Single-program-multiple-data (SPMD) programs often use barriers to synchronize threads of execution and to delimit the start and end of different phases of computation. Through the novel (and simple) named barriers construct, dynamic performance warnings, and integration with lightweight performance counter libraries, SBT helps programmers localize deadlocks and performance bottlenecks in their programs. SBT is a portable library that currently supports both POSIX threads and SGI Irix sproc threads. SBT also supports both the PCL and Irix libperfex performance counter libraries. For production runs, the SBT overheads can be eliminated using conditional compilation.
Hardware support for flexible distributed shared memory
- IEEE Transactions on Computers
, 1998
Abstract - Cited by 1 (0 self)
Workstation-based parallel systems are attractive due to their low cost and competitive uniprocessor performance. However, supporting a cache-coherent global address space on these systems involves significant overheads. We examine two approaches to coping with these overheads. First, DSM-specific hardware can be added to the off-the-shelf component base to reduce overheads. Second, application-specific coherence protocols can avoid some overheads by exploiting programmer (or compiler) knowledge of an application’s communication patterns. To explore the interaction between these approaches, we simulated four designs that add DSM acceleration hardware to a collection of off-the-shelf workstation nodes. Three of the designs support user-level software coherence protocols, enabling application-specific protocol optimizations. To verify the feasibility of our hardware approach, we constructed a prototype of the simplest design. Measured speedups from the prototype match simulation results closely. We find that, even with aggressive DSM hardware support, custom protocols can provide significant speedups for some applications. In addition, the custom protocols are generally effective at reducing the impact of other overheads, including those due to less aggressive hardware support and larger network latencies. However, for three of our benchmarks, the additional hardware acceleration provided by our most aggressive design avoids the need to develop more efficient custom protocols. Index Terms: Parallel systems, distributed shared memory, cache coherence protocols, fine-grain cache coherence, coherence protocol optimization, workstation clusters.
TAPE: A Transactional Application Profiling Environment
- In ICS ’05: Proceedings of the 19th Annual International Conference on Supercomputing
, 2005
Abstract
parallel programming model that uses transactions as the basic unit of parallel work and communication. TCC simplifies the development of correct parallel code because hardware provides transaction atomicity and ordering. Nevertheless, the programmer or a dynamic compiler must still optimize the parallel code for performance. This paper presents TAPE, a hardware and software infrastructure for profiling in TCC systems. TAPE extends the hardware for transactional execution to identify performance impediments such as dependence violations, buffer overflows, and work imbalance. It filters infrequent events to reduce resource requirements and allows the programmer to focus on the most important bottlenecks. We demonstrate that TAPE introduces minimal die area and performance overhead and can be used continuously, even for production runs. Moreover, we demonstrate how to leverage the profiling information to guide optimization for a set of parallel applications. TAPE accurately identifies the source code location and type of the most important bottlenecks, allowing a programmer to achieve maximum parallel speedup with a few profiling steps.
Performance Tuning Software . . .
- JOURNAL OF SUPERCOMPUTING
Abstract
Small organisations can now have access to high raw processing power using networks of workstations (NOW) as parallel computing platforms. Software Distributed Shared Memory (Software DSM) packages have been developed to facilitate the programming of such systems. However, because of the high interprocess latencies in a NOW, the performance of a software DSM application is more susceptible to the partitioning of the problem than what might be expected. This paper presents an approach for a tool to visualise the execution of a program in a way that highlights performance bottlenecks. The tool associates identified bottlenecks with the corresponding source code lines in order to determine what piece of code is the cause of poor performance. The visualisation technique is demonstrated in two case studies. They clearly show that the visualisation is indeed useful and provides an effective way to acquire an understanding of what characterises an application's sharing behaviour.

