The SCALASCA performance toolset architecture
- In International Workshop on Scalable Tools for High-End Computing (STHEC), 2008
Cited by 12 (7 self)
SCALASCA is a performance toolset that has been specifically designed to analyze parallel application execution behavior on large-scale systems. It offers an incremental performance-analysis procedure that integrates runtime summaries with in-depth studies of concurrent behavior via event tracing, adopting a strategy of successively refined measurement configurations. Distinctive features are its ability to identify wait states in applications with very large numbers of processes and combine these with efficiently summarized local measurements. In this article, we review the current toolset architecture, emphasizing its scalable design and the role of the different components in transforming raw measurement data into knowledge of application execution behavior. The scalability and effectiveness of SCALASCA are then surveyed from experience measuring and analyzing real-world applications on a range of computer systems.
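To make the notion of a wait state concrete, the following minimal Python sketch estimates the time a receiver spends blocked because the matching send started late. It is an illustration only, not the toolset's implementation: the `MessageEvents` record, its field names, and both functions are hypothetical, and the simple "late sender" rule used here is an assumption about how such a wait state could be quantified from trace timestamps.

```python
from dataclasses import dataclass


@dataclass
class MessageEvents:
    """Timestamps (in seconds) for one point-to-point message (hypothetical record)."""
    send_enter: float  # sender enters the send operation
    recv_enter: float  # receiver enters the receive operation
    recv_exit: float   # receiver leaves the receive operation


def late_sender_wait(msg: MessageEvents) -> float:
    """Time the receiver waited because the send started after the receive.

    The result is clipped to the interval [0, time spent inside the receive],
    so it can never exceed the duration of the receive operation itself.
    """
    wait = msg.send_enter - msg.recv_enter
    time_in_recv = msg.recv_exit - msg.recv_enter
    return max(0.0, min(wait, time_in_recv))


def total_late_sender_wait(messages: list[MessageEvents]) -> float:
    """Sum the per-message waiting time over a list of messages."""
    return sum(late_sender_wait(m) for m in messages)
```

A receive entered at 2.0 s whose matching send starts at 5.0 s would be charged 3.0 s of waiting time under this rule; a receive entered after the send has started would be charged none.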
Timestamp synchronization for event traces of large-scale message-passing applications
- In Proceedings of the 14th European PVM/MPI Conference, 2007
Cited by 9 (7 self)
Identifying wait states in event traces of message-passing applications requires measuring temporal displacements between concurrent events. In the absence of synchronized hardware clocks, linear interpolation techniques can already account for differences in offset and drift, assuming that the drift of an individual processor is not time dependent. However, inaccuracies and drifts varying in time can still cause violations of the logical event ordering. The controlled logical clock algorithm accounts for such violations in point-to-point communication by shifting message events in time as much as needed while trying to preserve the length of intervals between local events. In this article, we describe how the controlled logical clock is extended to collective communication to enable a more complete correction of realistic message-passing traces. In addition, we present a parallel version of the algorithm that is intended to scale to thousands of application processes and outline its implementation within the framework of the Scalasca toolkit.
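The linear interpolation step mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the offset between a local clock and a master clock was measured once at the start and once at the end of the run, and that drift is constant in between. The function names and parameters are hypothetical. The second function shows only a naive forward shift that restores the send-before-receive ordering; it is deliberately simpler than the controlled logical clock, which additionally tries to preserve the length of intervals between local events.

```python
def make_linear_correction(t_start, offset_start, t_end, offset_end):
    """Build a function that maps local timestamps to master-clock time.

    offset_* is (master time - local time), measured at local times
    t_start and t_end. Drift is assumed constant between the two points.
    """
    if t_end == t_start:
        raise ValueError("the two offset measurements must differ in time")
    drift = (offset_end - offset_start) / (t_end - t_start)

    def correct(t_local):
        # Offset at t_local, linearly interpolated between the measurements.
        return t_local + offset_start + drift * (t_local - t_start)

    return correct


def enforce_clock_condition(send_time, recv_time, min_latency):
    """Naive repair: move a receive forward so it follows its send.

    Returns the receive time unchanged when the ordering already holds.
    """
    return max(recv_time, send_time + min_latency)
```

For example, with an offset of 2.0 s at local time 0.0 and 3.0 s at local time 100.0, a local timestamp of 50.0 is interpolated to about 52.5 on the master clock. If a receive is still stamped earlier than its matching send after interpolation, `enforce_clock_condition` moves it to the send time plus an assumed minimum latency.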
Preserving Time in Large-Scale Communication Traces
- 2008
Cited by 8 (5 self)
Analyzing the performance of large-scale scientific applications is becoming increasingly difficult due to the sheer size of performance data gathered. Recent work on scalable communication tracing applies online interprocess compression to address this problem. Yet, analysis of communication traces requires knowledge about time progression that cannot trivially be encoded in a scalable manner during compression. We develop scalable time stamp encoding schemes for communication traces. At the same time, our work contributes novel insights into the scalable representation of time stamped data. We show that our representations capture sufficient information to enable what-if explorations of architectural variations and analysis for path-based timing irregularities while not requiring excessive disk space. We evaluate the ability of several time-stamped compressed MPI trace approaches to enable accurate timed replay of communication events. We evaluate our timing methods against various stages of compression to study the effects of compression on timing accuracy. Specifically, we measure accuracy by comparing the original application execution times with the compressed trace replay times. Our lossless traces are orders of magnitude smaller, if not near constant size, regardless of the number of nodes while preserving timing information suitable for application tuning or assessing requirements of future procurements. Our results prove time-preserving tracing without loss of communication information can scale in the number of nodes and time steps, which is a result without precedent.
Verifying Causality Between Distant Performance Phenomena in Large-Scale MPI Applications
- In Proc. of the 17th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), 2009
Cited by 8 (4 self)
In message-passing applications, the temporal or spatial distance between cause and symptom of a performance problem constitutes a major difficulty in deriving helpful conclusions from performance data. Just knowing the locations of wait states in the program is often insufficient to understand the reason for their occurrence. We present a method for verifying hypotheses on causality between temporally or spatially distant performance phenomena in message-passing applications without altering the application itself. The verification is accomplished by modifying MPI event traces and using them to simulate the hypothetical message-passing behavior. By performing a parallel real-time reenactment of the communication to be simulated using the original execution configuration, we can achieve high scalability and good predictive accuracy in relation to the measured behavior. Not relying on a potentially complex model of the message-passing subsystem, our method is also platform independent.
Automatic trace-based performance analysis of metacomputing applications
- In International Parallel and Distributed Processing Symposium, 2007
Cited by 7 (2 self)
The processing power and memory capacity of independent and heterogeneous parallel machines can be combined to form a single parallel system that is more powerful than any of its constituents. However, achieving satisfactory application performance on such a metacomputer is hard because the high latency of inter-machine communication as well as differences in hardware of constituent machines may introduce various types of wait states. In our earlier work, we have demonstrated that automatic pattern search in event traces can identify the sources of wait states in parallel applications running on a single computer. In this article, we describe how this approach can be extended to metacomputing environments with special emphasis on performance problems related to inter-machine communication. In addition, we demonstrate the benefits of our solution using a real-world multi-physics application.
SCALASCA Parallel Performance Analyses of SPEC MPI2007 Applications
Cited by 7 (7 self)
The SPEC MPI2007 1.0 benchmark suite provides a rich variety of message-passing HPC application kernels to compare the performance of parallel/distributed computer systems. Its 13 applications use a representative cross-section of programming languages (C/C++/Fortran, often combined) and MPI programming patterns (e.g., blocking vs. non-blocking vs. persistent point-to-point communication, with or without extensive collective communication). This offers a basis with which to examine the effectiveness of parallel performance tools using real-world applications that have already been extensively optimized and tuned (at least for sequential execution), but which may still have parallelization inefficiencies and scalability problems. In this context, the Scalasca toolset for scalable performance analysis of large-scale parallel applications, which has been extended to distinguish iteration/timestep phases, is evaluated with this suite on an IBM SP2 ‘Regatta’ system, and found to be effective at identifying significant performance improvement opportunities. Keywords: Parallel/distributed systems; Benchmark suite; Performance measurement & analysis tools; Application tracing & profiling.
A Parallel Trace-Data Interface for Scalable Performance Analysis
Cited by 6 (6 self)
Automatic trace analysis is an effective method of identifying complex performance phenomena in parallel applications. To simplify the development of complex trace-analysis algorithms, the EARL library interface offers high-level access to individual events contained in a global trace file. However, as the size of parallel systems grows further and the number of processors used by individual applications is continuously raised, the traditional approach of analyzing a single global trace file becomes increasingly constrained by the large number of events. To enable scalable trace analysis, we present a new design of the aforementioned EARL interface that accesses multiple local trace files in parallel while offering means to conveniently exchange events between processes. This article describes the modified view of the trace data as well as related programming abstractions provided by the new PEARL library interface and discusses its application in performance analysis.
Performance analysis and tuning of the XNS CFD solver on Blue Gene/L
Cited by 5 (5 self)
The XNS computational fluid dynamics code was running successfully on Blue Gene/L; however, its scalability was unsatisfactory until the first Jülich Blue Gene/L Scaling Workshop provided an opportunity for the application developers and performance analysts to start working together. Investigation of solver performance pinpointed a communication bottleneck that appeared with approximately 900 processes, and subsequent remediation allowed the application to continue scaling with a four-fold simulation performance improvement at 4,096 processes. This experience also validated the Scalasca performance analysis toolset when working with a complex application at large scale, and helped direct the development of more comprehensive analyses. Performance properties have now been incorporated to automatically quantify point-to-point synchronisation time and wait states in scan operations, both of which were significant for XNS on Blue Gene/L.
Scalable collation and presentation of call-path profile data with CUBE
- In Parallel Computing: Architectures, Algorithms and Applications: Proc. Parallel Computing (ParCo’07, Jülich/Aachen
Cited by 5 (3 self)
Developing performance-analysis tools for parallel applications running on thousands of processors is extremely challenging due to the vast amount of performance data generated, which may conflict with available processing capacity, memory limitations, and file system performance, especially when large numbers of files have to be written simultaneously. In this article, we describe how the scalability of CUBE, a presentation component for call-path profiles in the SCALASCA toolkit, has been improved to more efficiently handle data sets from thousands of processes. First, the speed of writing suitable input data sets has been increased by eliminating the need to create large numbers of temporary files. Second, CUBE’s capacity to hold and display data sets has been raised by shrinking their memory footprint. Third, after introducing a flexible client-server architecture, it is no longer necessary to move large data sets between the parallel machine where they have been created and the desktop system where they are displayed. Finally, CUBE’s interactive response times have been reduced by optimizing the algorithms used to calculate aggregate metrics. All improvements are explained in detail and validated using experimental results.
Performance simulation of non-blocking communication in message-passing applications
- In Proc. of the 2nd Workshop on Productivity and Performance (PROPER, in conjunction with Euro-Par 2009), 2009
Cited by 3 (3 self)
In our previous work [1], we introduced performance simulation as an instrument to verify hypotheses on causality between locally and spatially distant performance phenomena without altering the application itself. This is accomplished by modifying MPI event traces and using them to simulate hypothetical message-passing behavior. Here, we present enhancements to our approach, which was previously restricted to blocking communication, that now allow us to correctly simulate MPI non-blocking communication. We enhanced the underlying trace data format to record communication requests, and extended the simulator to even retain the inherently non-deterministic behavior of operations such as MPI_Waitany.

