Results 11 - 20 of 35
Performance monitoring in a Myrinet-connected SHRIMP cluster
In Proc. of 2nd SIGMETRICS Symposium on Parallel and Distributed Tools, 1998
"... Performance monitoring is a crucial aspect of parallel programming. Extracting the best possible performance from the system is the main goal of parallel programming, and monitoring tools are often essential to achieving that goal. A common tradeoff arises in determining at which system level to mon ..."
Cited by 18 (2 self)
Performance monitoring is a crucial aspect of parallel programming. Extracting the best possible performance from the system is the main goal of parallel programming, and monitoring tools are often essential to achieving that goal. A common tradeoff arises in determining at which system level to monitor performance information and present results. High-level monitoring approaches can often gather data directly tied to the software programming model, but may abstract away crucial low-level hardware details. Low-level monitoring approaches can gather fairly complete performance information about the underlying system, but often at the expense of portability and flexibility. In this paper we discuss a compromise approach between the portability and flexibility of high-level monitoring and the detailed data awareness of low-level monitoring. We present a firmware-based performance monitor designed for a Myrinet-connected SHRIMP cluster. This monitor combines the portability and flexibility typically found in software-based monitors with the detailed, low-level information traditionally available only to hardware monitors. As with hardware approaches, ours introduces little monitoring perturbation. Since it includes a software-based global clock, the monitor can track inter-node latencies accurately. Our tool is flexible and can monitor applications with a wide range of communication abstractions, though we focus here on its use with shared virtual memory applications. The portability and flexibility of this firmware-based monitoring strategy make it a very promising approach for gathering low-level statistics about parallel program performance.
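The abstract does not say how the software-based global clock works; a standard way to compare timestamps taken on different nodes is round-trip offset estimation (as in Cristian's algorithm/NTP). The following minimal C sketch shows that arithmetic with made-up timestamps; it is illustrative, not the SHRIMP monitor's actual firmware.

```c
/* Sketch: estimating the offset between two node clocks so timestamps
 * from different nodes can be compared.  Node A records t1, sends a
 * probe; node B stamps its arrival t2 and replies at t3; A receives the
 * reply at t4.  Assuming a symmetric link,
 *   offset(B - A) ~ ((t2 - t1) + (t3 - t4)) / 2.
 * A later packet's one-way latency is then recv_B - (send_A + offset). */
#include <stdio.h>

typedef long long ts_t;  /* timestamps, e.g. in microseconds */

static ts_t clock_offset(ts_t t1, ts_t t2, ts_t t3, ts_t t4)
{
    return ((t2 - t1) + (t3 - t4)) / 2;
}

static ts_t one_way_latency(ts_t send_a, ts_t recv_b, ts_t offset_b)
{
    return recv_b - (send_a + offset_b);
}

int main(void)
{
    /* Hypothetical probe: B's clock runs 500us ahead, link delay 20us. */
    ts_t off = clock_offset(1000, 1520, 1530, 1050);
    printf("estimated offset: %lld us\n", off);                      /* 500 */

    /* A later packet sent at A-time 2000, received at B-time 2521. */
    printf("one-way latency: %lld us\n", one_way_latency(2000, 2521, off)); /* 21 */
    return 0;
}
```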
The Performance and Scalability of Distributed Shared Memory Cache Coherence Protocols
1998
"... that I have read this dissertation and that in my opinion it is ..."
Cited by 17 (6 self)
SIP: Performance tuning through source code interdependence
In Euro-Par ’02, 2002
"... Abstract. The gap between CPU peak performance and achieved ap-plication performance widens as CPU complexity, as well as the gap between CPU cycle time and DRAM access time, increases. While ad-vanced compilers can perform many optimizations to better utilize the cache system, the application progr ..."
Cited by 15 (3 self)
The gap between CPU peak performance and achieved application performance widens as CPU complexity, as well as the gap between CPU cycle time and DRAM access time, increases. While advanced compilers can perform many optimizations to better utilize the cache system, the application programmer is still required to do some of the optimizations needed for efficient execution. Therefore, profiling should be performed on optimized binary code and performance problems reported to the programmer in an intuitive way. Existing performance tools do not have adequate functionality to address these needs. Here we introduce source interdependence profiling, SIP, as a paradigm to collect and present performance data to the programmer. SIP identifies the performance problems that remain after compiler optimization and gives intuitive hints at the source-code level as to how they can be avoided. Instead of just collecting information about the events directly caused by each source-code statement, SIP also presents data about events from some interdependent statements of the source code. A first SIP prototype tool has been implemented. It supports both C and Fortran programs. We describe how the tool was used to improve the performance of the SPEC CPU2000 183.equake application by 59 percent.
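One way to read "events from interdependent statements" is that a miss is charged not only to the statement that stalls but also to a related statement, such as the one that last wrote the data. The C sketch below illustrates that attribution idea on a toy trace; the data structures and the last-writer heuristic are assumptions for illustration, not SIP's actual implementation.

```c
/* Sketch: charge a read miss both to the reading statement and to the
 * statement that last wrote the same location, so the report points at
 * where a fix (e.g. a better data layout in the writer) might belong. */
#include <stdio.h>
#include <string.h>

#define NSTMT 4
#define NADDR 8

static int last_writer[NADDR];   /* addr -> statement that last wrote it */
static int charged[NSTMT];       /* miss events charged per statement    */

static void record_access(int stmt, int addr, int is_write, int is_miss)
{
    if (is_miss) {
        charged[stmt]++;                       /* direct cost         */
        if (!is_write && last_writer[addr] >= 0)
            charged[last_writer[addr]]++;      /* interdependent cost */
    }
    if (is_write)
        last_writer[addr] = stmt;
}

int main(void)
{
    memset(last_writer, -1, sizeof last_writer);
    record_access(0, 3, 1, 0);   /* stmt 0 writes addr 3 (hit)         */
    record_access(2, 3, 0, 1);   /* stmt 2 later misses reading addr 3 */
    for (int s = 0; s < NSTMT; s++)
        printf("stmt %d: %d charged miss events\n", s, charged[s]);
    return 0;
}
```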
Monitoring Shared Virtual Memory Performance on a Myrinet-based PC Cluster
In International Conference on Supercomputing, 1998
"... Network-connected clusters of PCs or workstations are becoming a widespread parallel computing platform. Performance methodologies that use either simulation or high-level software instrumentation cannot adequately measure the detailed behavior of such systems. The availability of new network techno ..."
Cited by 10 (3 self)
Network-connected clusters of PCs or workstations are becoming a widespread parallel computing platform. Performance methodologies that use either simulation or high-level software instrumentation cannot adequately measure the detailed behavior of such systems. The availability of new network technologies based on programmable network interfaces opens a new avenue of research in analyzing and improving the performance of software shared memory protocols. We have developed monitoring firmware embedded in the programmable network interfaces of a Myrinet-based PC cluster. Timestamps on network packets facilitate the collection of low-level statistics on, e.g., network latencies, interrupt handler times, and inter-node synchronization. This paper describes our use of the low-level software performance monitor to measure and understand the performance of a Shared Virtual Memory (SVM) system implemented on a Myrinet-based cluster, running the SPLASH-2 benchmarks. We measured time spent in vari...
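Once packet timestamps from different nodes share a clock (see the offset sketch above), an SVM-style remote operation can be broken into components. This C sketch shows that accounting on a hypothetical page fetch; the event names and numbers are illustrative, not measurements from the paper.

```c
/* Sketch: decomposing a remote page fetch from four packet timestamps,
 * assuming the clocks are already aligned to a common timebase. */
#include <stdio.h>

typedef long long ts_t;

struct fetch {
    ts_t req_sent;     /* faulting node sends the page request */
    ts_t req_recv;     /* home node's NI receives the request  */
    ts_t reply_sent;   /* home node sends the page             */
    ts_t reply_recv;   /* faulting node receives the page      */
};

static void breakdown(const struct fetch *f)
{
    printf("request wire time : %lld us\n", f->req_recv   - f->req_sent);
    printf("remote handling   : %lld us\n", f->reply_sent - f->req_recv);
    printf("reply wire time   : %lld us\n", f->reply_recv - f->reply_sent);
    printf("end-to-end fetch  : %lld us\n", f->reply_recv - f->req_sent);
}

int main(void)
{
    struct fetch f = { 0, 18, 93, 112 };   /* made-up sample */
    breakdown(&f);
    return 0;
}
```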
OS Support for Improving Data Locality on CC-NUMA Compute Servers
In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, 1996
"... The dominant architecture for the next generation of cache-coherent shared-memory multiprocessors is CC-NUMA (cache-coherent non-uniform memory architecture). These machines are attractive as compute servers, because they provide transparent access to local and remote memory. However, the access lat ..."
Cited by 9 (0 self)
The dominant architecture for the next generation of cache-coherent shared-memory multiprocessors is CC-NUMA (cache-coherent non-uniform memory architecture). These machines are attractive as compute servers because they provide transparent access to local and remote memory. However, the access latency to remote memory is 3-5 times the latency to local memory. Given the large remote access latencies, data locality is potentially the most important performance issue. In compute-server workloads, when moving processes between nodes for load balancing, the OS needs to perform page migration and page replication to maintain data locality. Through trace analysis and actual runs of realistic workloads, we study the potential performance improvements provided by OS-supported dynamic migration and replication. Analyzing our kernel-based implementation of the policy, we provide a detailed breakdown of the costs and point out the functions using the most time. We study alternatives to using full cache-miss information to drive the policy, and show that sampling of cache misses can be used to reduce cost without compromising performance, and that TLB misses are an inconsistent approximation for cache misses. Finally, our workload runs show that OS-supported dynamic page migration and page replication can substantially increase performance, by as much as 29% in some workloads.
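A miss-driven placement policy of the kind the abstract describes can be sketched in a few lines: per page, count sampled misses by node; a read-mostly hot page is a replication candidate, and a page dominated by one remote node is a migration candidate. The thresholds and structure below are assumptions for illustration, not the paper's kernel policy.

```c
/* Sketch: deciding page placement from sampled per-node miss counters. */
#include <stdio.h>

#define NNODES 4

enum action { STAY, MIGRATE, REPLICATE };

struct page_stats {
    int miss[NNODES];   /* sampled misses per node   */
    int writes;         /* sampled write misses      */
    int home;           /* node currently holding it */
};

static enum action decide(const struct page_stats *p, int threshold)
{
    int total = 0, top = 0, top_node = 0;
    for (int n = 0; n < NNODES; n++) {
        total += p->miss[n];
        if (p->miss[n] > top) { top = p->miss[n]; top_node = n; }
    }
    if (total < threshold)
        return STAY;                           /* too cold to bother        */
    if (p->writes == 0)
        return REPLICATE;                      /* read-only and hot         */
    if (top_node != p->home && top > total / 2)
        return MIGRATE;                        /* one remote node dominates */
    return STAY;
}

int main(void)
{
    struct page_stats hot_ro = { {9, 8, 7, 6},  0, 0 };
    struct page_stats pulled = { {1, 25, 2, 0}, 5, 0 };
    printf("read-only hot page -> %d (2 = replicate)\n", decide(&hot_ro, 16));
    printf("remote-dominated   -> %d (1 = migrate)\n",   decide(&pulled, 16));
    return 0;
}
```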
Owl: Next-generation system monitoring
In ACM Computing Frontiers, 2005
"... As microarchitectural and system complexity grows, comprehending system behavior becomes increasingly difficult, and often requires obtaining and sifting through voluminous event traces or coordinating results from multiple, nonlocalized sources. Owl is a proposed framework that overcomes limitation ..."
Cited by 8 (0 self)
As microarchitectural and system complexity grows, comprehending system behavior becomes increasingly difficult, and often requires obtaining and sifting through voluminous event traces or coordinating results from multiple, nonlocalized sources. Owl is a proposed framework that overcomes the limitations traditional performance counters and monitoring facilities face in dealing with such complexity by pervasively deploying programmable monitoring elements throughout a system. The design exploits reconfigurable or programmable logic to realize hardware monitors located at event sources, such as memory buses. These monitors run and write back results autonomously with respect to the CPU, mitigating the system impact of interrupt-driven monitoring and the need to communicate irrelevant events to higher levels of the system. The monitors are designed to snoop any kind of system transaction, e.g., within the core, on a bus, across the wire, or within I/O devices.
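Conceptually, such a monitoring element is a small programmable filter at an event source: it matches transactions against a rule and accumulates results without involving the CPU per event. The C model below captures that shape; the transaction and rule layouts are hypothetical, not Owl's design.

```c
/* Sketch: a programmable probe that snoops a transaction stream and
 * counts matches, writing results back lazily rather than per event. */
#include <stdint.h>
#include <stdio.h>

struct txn  { uint64_t addr; uint8_t kind; };    /* kind: 0=read, 1=write */

struct rule {                                    /* the probe's "program" */
    uint64_t base, len;    /* address window to watch */
    uint8_t  kind;         /* which transaction kind  */
};

struct probe {
    struct rule r;
    uint64_t    count;
};

static void snoop(struct probe *p, const struct txn *t)
{
    if (t->kind == p->r.kind &&
        t->addr >= p->r.base && t->addr < p->r.base + p->r.len)
        p->count++;
}

int main(void)
{
    /* Watch writes to a hypothetical 4KB device window. */
    struct probe p = { { 0x10000, 0x1000, 1 }, 0 };
    struct txn trace[] = { {0x10010,1}, {0x20000,1}, {0x10ff0,0}, {0x10ff0,1} };
    for (unsigned i = 0; i < sizeof trace / sizeof *trace; i++)
        snoop(&p, &trace[i]);
    printf("matched transactions: %llu\n", (unsigned long long)p.count);
    return 0;
}
```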
Mechanisms for Distributed Shared Memory
1996
"... Distributed shared memory (DSM) systems simplify the task of writing distributedmemory parallel programs by automating data distribution and communication. Unfortunately, DSM systems control memory and communication using fixed policies, even when programmers or compilers could manage these resource ..."
Cited by 7 (1 self)
Distributed shared memory (DSM) systems simplify the task of writing distributed-memory parallel programs by automating data distribution and communication. Unfortunately, DSM systems control memory and communication using fixed policies, even when programmers or compilers could manage these resources more efficiently. This thesis proposes a new approach that lets users efficiently manage communication and memory on DSM systems. Systems provide primitive DSM mechanisms without binding them to fixed protocols (policies). Standard shared-memory programs use default protocols similar to those found in current DSM machines. Unlike current systems, these protocols are implemented in unprivileged software. Programmers and compilers are free to modify or replace them with optimized custom protocols that manage memory and communication directly and efficiently. To explore this new approach, this thesis identifies a set of mechanisms for distributed shared memory and develops Tempest, a portab...
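The mechanism/policy split can be pictured as the system delivering raw events (e.g. access faults) to whatever user-level handler the program installs. The C sketch below models that structure; the function names and handler interface are illustrative assumptions, not Tempest's actual API.

```c
/* Sketch: the runtime provides the mechanism (fault delivery); an
 * ordinary, replaceable user-level handler provides the policy. */
#include <stdio.h>

enum access { READ_FAULT, WRITE_FAULT };

typedef void (*fault_handler)(int page, enum access a);
static fault_handler installed;

static void install_handler(fault_handler h) { installed = h; }
static void raise_fault(int page, enum access a) { installed(page, a); }

/* One possible policy: a default invalidate-style protocol... */
static void default_protocol(int page, enum access a)
{
    printf("default: fetch page %d %s\n", page,
           a == WRITE_FAULT ? "exclusive" : "shared");
}

/* ...and a custom protocol a compiler or programmer might substitute,
 * e.g. bulk-fetching neighbouring pages in one message. */
static void custom_protocol(int page, enum access a)
{
    (void)a;
    printf("custom: fetch pages %d-%d in one message\n", page, page + 3);
}

int main(void)
{
    install_handler(default_protocol);
    raise_fault(7, READ_FAULT);
    install_handler(custom_protocol);   /* unprivileged policy swap */
    raise_fault(7, READ_FAULT);
    return 0;
}
```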
Optimizing Data Locality for SCI-based PC-Clusters with the SMiLE Monitoring Approach
In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), 1999
"... Modern low-latency and high-bandwidth interconnects like the Scalable Coherent Interface (SCI) deliver high communication performance for parallel and distributed systems. However, the performance of an SCI-based compute cluster with NUMA characteristics depends on the efficient use of local memory ..."
Cited by 7 (6 self)
Modern low-latency and high-bandwidth interconnects like the Scalable Coherent Interface (SCI) deliver high communication performance for parallel and distributed systems. However, the performance of an SCI-based compute cluster with NUMA characteristics depends on the efficient use of local memory accesses. Therefore, programming and tool environments for such systems with distributed shared memory (DSM) should enable and exploit data locality. In this paper, we present an event-driven hybrid monitoring approach for an SCI-based PC cluster with hardware-supported DSM. The core of the concept is a hardware monitor that can observe the fine-grained communication in such a parallel system with minimal impact on the system. The hardware monitor delivers detailed real-time information about the communication and runtime behavior of the program under examination. The monitoring system thus lets the user evaluate the network behavior, and hence the data locality, of the parallel program.
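A common shape for such a hybrid monitor is a hardware producer depositing fine-grained access events into a buffer that software drains to compute locality statistics. The C sketch below models that consumer side; the ring size, event layout, and locality metric are assumptions for illustration, not the SMiLE hardware.

```c
/* Sketch: software drains a hardware-filled event ring and aggregates
 * a local/remote access ratio as a simple data-locality indicator. */
#include <stdio.h>
#include <stdint.h>

struct event { uint32_t page; uint8_t remote; };   /* remote=1: remote access */

#define RING 64
static struct event ring[RING];
static unsigned head, tail;                        /* head: hardware producer */

static int pop(struct event *e)                    /* software consumer */
{
    if (tail == head) return 0;
    *e = ring[tail++ % RING];
    return 1;
}

int main(void)
{
    /* Pretend the hardware produced a few events for page 42. */
    ring[head++ % RING] = (struct event){ 42, 1 };
    ring[head++ % RING] = (struct event){ 42, 1 };
    ring[head++ % RING] = (struct event){ 42, 0 };

    unsigned local = 0, remote = 0;
    struct event e;
    while (pop(&e))
        e.remote ? remote++ : local++;

    /* Pages with a poor ratio are candidates for better placement. */
    printf("local %u, remote %u -> locality %.0f%%\n",
           local, remote, 100.0 * local / (local + remote));
    return 0;
}
```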
Performance monitoring for run-time management of reconfigurable devices
In International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), 2005
"... High-performance computing (HPC) systems with hardware-reconfigurable devices have the potential to achieve major performance increases over parallel computing systems based solely on traditional processors. However, providing services upon which users of traditional HPC systems have come to depend ..."
Cited by 6 (2 self)
High-performance computing (HPC) systems with hardware-reconfigurable devices have the potential to achieve major performance increases over parallel computing systems based solely on traditional processors. However, providing the services upon which users of traditional HPC systems have come to depend is essential for large-scale reconfigurable computing (RC) systems to become mainstream. Along with critical needs such as management services, core libraries, and user-friendly interfaces, mechanisms for system resource monitoring to support debugging, performance analysis, and optimization are an important feature of conventional HPC systems that is currently lacking in their RC counterparts. This paper presents the concept of hardware monitoring probes within the CARMA framework for RC-based HPC and examines several design options. Experimental results analyze probe design considerations on a case-study application.
Kismet: Parallel Speedup Estimates for Serial Programs
"... Software engineers now face the difficult task of refactoring serial programs for parallel execution on multicore processors. Currently, they are offered little guidance as to how much benefit may come from this task, or how close they are to the best possible parallelization. This paper presents Ki ..."
Cited by 6 (0 self)
Software engineers now face the difficult task of refactoring serial programs for parallel execution on multicore processors. Currently, they are offered little guidance as to how much benefit may come from this task, or how close they are to the best possible parallelization. This paper presents Kismet, a tool that creates parallel speedup estimates for unparallelized serial programs. Kismet differs from previous approaches in that it does not require any manual analysis or modification of the program. This difference allows quick analysis of many programs, avoiding wasted engineering effort on those that are fundamentally limited. To accomplish this task, Kismet builds upon the hierarchical critical path analysis (HCPA) technique, a recently developed dynamic analysis that localizes parallelism to each of the potentially nested regions in the target program. It then uses a parallel execution time model to compute an approximate upper bound for performance, modeling constraints that stem from both hardware parameters and internal program structure. Our evaluation applies Kismet to eight high-parallelism NAS Parallel Benchmarks running on a 32-core AMD multicore system, five low-parallelism SpecInt benchmarks, and six medium-parallelism benchmarks running on the fine-grained MIT Raw processor. The results are compelling. Kismet is able to significantly improve the accuracy of parallel speedup estimates relative to prior work based on critical path analysis.
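The core bound behind critical-path analysis is simple: a region with total work W and critical path CP cannot finish faster than max(CP, W/P) on P cores, so speedup <= W / max(CP, W/P). The C sketch below evaluates that bound for a hypothetical region; Kismet's full model (nested regions, overheads) is richer than this.

```c
/* Sketch: upper-bound speedup from work and critical-path length. */
#include <stdio.h>

static double speedup_bound(double work, double cp, int cores)
{
    double t = work / cores;
    if (t < cp) t = cp;              /* can't beat the critical path */
    return work / t;
}

int main(void)
{
    /* Hypothetical region: 1e9 units of work, 5e7 on the critical
     * path, i.e. self-parallelism = 20. */
    for (int p = 1; p <= 64; p *= 4)
        printf("%2d cores -> speedup <= %.1f\n",
               p, speedup_bound(1e9, 5e7, p));
    return 0;   /* bound saturates at 20 once p exceeds the parallelism */
}
```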