Binary Analysis for Measurement and Attribution of Program Performance
Cited by 3 (1 self)
Modern programs frequently employ sophisticated modular designs. As a result, performance problems cannot be identified from costs attributed to routines in isolation; understanding code performance requires information about a routine’s calling context. Existing performance tools fall short in this respect. Prior strategies for attributing context-sensitive performance at the source level either compromise measurement accuracy, remain too close to the binary, or require custom compilers. To understand the performance of fully optimized modular code, we developed two novel binary analysis techniques: 1) on-the-fly analysis of optimized machine code to enable minimally intrusive and accurate attribution of costs to dynamic calling contexts; and 2) post-mortem analysis of optimized machine code and its debugging sections to recover its program structure and reconstruct a mapping back to its source code. By combining the recovered static program structure with dynamic calling context information, we can accurately attribute performance metrics to calling contexts, procedures, loops, and inlined instances of procedures. We demonstrate that the fusion of this information provides unique insight into the performance of complex modular codes. This work is implemented in the HPCToolkit performance tools.
Kitsune: Efficient, general-purpose dynamic software updating for C
, 2012
Cited by 3 (2 self)
Dynamic software updating (DSU) systems allow programs to be updated while running, thereby allowing developers to add features and fix bugs without downtime. This paper introduces Kitsune, a new DSU system for C whose design has three notable features. First, Kitsune’s updating mechanism updates the whole program, not individual functions. This mechanism is more flexible than most prior approaches and places no restrictions on data representations or allowed compiler optimizations. Second, Kitsune makes the important aspects of updating explicit in the program text, making its semantics easy to understand while keeping programmer work to a minimum. Finally, the programmer can write simple specifications to direct Kitsune to generate code that traverses and transforms old-version state for use by the new code; such state transformation is often necessary, and is significantly more difficult in prior DSU systems. We have used Kitsune to update five popular, open-source, single- and multi-threaded programs, and find that few program changes are required to use Kitsune, and that it incurs essentially no performance overhead.
Asymmetries in Multi-Core Systems – Or Why We Need Better Performance Measurement Units
Cited by 2 (1 self)
Future exascale systems will be based on multi-core processors, but even today’s multi-core processors can be asymmetric and exhibit limitations and bottlenecks that are different from those found on a symmetric multiprocessor. In this paper we investigate the performance of a cluster node based on the Intel Xeon E5345 quad-core processor and note that despite the symmetry implied by the programming model, the available memory bandwidth is not shared equally among the cores. Consequently, applications experience substantial performance variance and slow-downs when the tasks (threads) are mapped to cores in a naive manner. An operating system scheduler could mitigate these effects by taking into account the memory bus structure but needs accurate information from the performance monitoring unit, as the asymmetry is not directly exposed in the processor’s instruction set manual. Current performance monitoring units are quite inflexible and change from one processor to the next, so higher levels of the software tool chain are discouraged from using them. The next generation of Nehalem-based multicore systems poses similar challenges, and the development of portable performance monitoring units will be crucial if applications want to use the performance potential of exascale systems. We expect this situation to remain unchanged as long as memory is slow relative to the processor.
Memory management in NUMA multicore systems: Trapped between cache contention and interconnect overhead
- In Proceedings of ISMM’11
Cited by 2 (2 self)
Multiprocessors based on processors with multiple cores usually include a non-uniform memory architecture (NUMA); even current 2-processor systems with 8 cores exhibit non-uniform memory access times. As the cores of a processor share a common cache, the issues of memory management and process mapping must be revisited. We find that optimizing only for data locality can counteract the benefits of cache contention avoidance and vice versa. Therefore, system software must take both data locality and cache contention into account to achieve good performance, and memory management cannot be decoupled from process scheduling. We present a detailed analysis of a commercially available NUMA-multicore architecture, the Intel Nehalem. We describe two scheduling algorithms: maximum-local, which optimizes for maximum data locality, and its extension, N-MASS, which reduces data locality to avoid the performance degradation caused by cache contention. N-MASS is fine-tuned to support memory management on NUMA-multicores and improves performance by up to 32%, and by 7% on average, over the default setup in current Linux implementations.
Can Linear Approximation Improve Performance Prediction?
- In: Proceedings of EPEW 2011
, 2011
Cited by 2 (1 self)
Software performance evaluation relies on the ability of simple models to predict the performance of complex systems. Often, however, the models are not capturing potentially relevant effects in system behavior, such as sharing of memory caches or sharing of cores by hardware threads. The goal of this paper is to investigate whether and to what degree a simple linear adjustment of service demands in software performance models captures these effects and thus improves accuracy. Outlined experiments explore the limits of the approach on two hardware platforms that include shared caches and hardware threads, with results indicating that the approach can improve throughput prediction accuracy significantly, but can also lead to loss of accuracy when the performance models are otherwise defective.
Evaluating the Accuracy of Java Profilers
Cited by 2 (0 self)
Performance analysts profile their programs to find methods that are worth optimizing: the “hot” methods. This paper shows that four commonly-used Java profilers (xprof, hprof, jprofile, and yourkit) often disagree on the identity of the hot methods. If two profilers disagree, at least one must be incorrect. Thus, there is a good chance that a profiler will mislead a performance analyst into wasting time optimizing a cold method with little or no performance improvement. This paper uses causality analysis to evaluate profilers and to gain insight into the source of their incorrectness. It shows that these profilers all violate a fundamental requirement for sampling-based profilers: to be correct, a sampling-based profiler must collect samples randomly. We show that a proof-of-concept profiler, which collects samples randomly, does not suffer from the above problems. Specifically, we show, using a number of case studies, that our profiler correctly identifies methods that are important to optimize; in some cases other profilers report that these methods are cold and thus not worth optimizing.
Inferred Call Path Profiling
, 2009
Cited by 2 (0 self)
Prior work has found call path profiles to be useful for optimizers and programmer-productivity tools. Unfortunately, previous approaches for collecting path profiles are expensive: they need to either execute additional instructions (to track calls and returns) or they need to walk the stack. The state-of-the-art techniques for call path profiling slow down the program by 7% (for C programs) and 20% (for Java programs). This paper describes an innovative technique that collects minimal information from the running program and later (offline) infers the full call paths from this information. The key insight behind our approach is that information readily available during program execution (the height of the call stack and the identity of the currently executing function) is a good indicator of calling context. We call this pair a context identifier. Because more than one call path may have the same context identifier, we show how to disambiguate context identifiers by changing the sizes of function activation records. This disambiguation has no overhead in terms of executed instructions. We evaluate our approach on the SPEC CPU 2006 C++ and C benchmarks. We show that collecting context identifiers slows down programs by 0.17% (geometric mean). We can map these context identifiers to the correct unique call path 80% of the time for C++ programs and 95% of the time for C programs.
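The context-identifier idea in this abstract can be illustrated with a toy model. The sketch below is not the authors' implementation: the call graph, the frame sizes, and the helper names (`enumerate_paths`, `context_table`, `ambiguous`) are all hypothetical. It maps every bounded call path to a (stack height, current function) pair, reports pairs shared by more than one path, and shows how changing one activation-record size can remove the ambiguity.

```python
from collections import defaultdict


def enumerate_paths(call_graph, root, max_depth):
    """Yield every call path (as a tuple) from root, up to max_depth frames."""
    pending = [(root,)]
    while pending:
        path = pending.pop()
        yield path
        if len(path) < max_depth:
            for callee in call_graph.get(path[-1], ()):
                pending.append(path + (callee,))


def context_table(call_graph, frame_size, root="main", max_depth=8):
    """Group call paths by context identifier (stack height, current function)."""
    table = defaultdict(list)
    for path in enumerate_paths(call_graph, root, max_depth):
        height = sum(frame_size[f] for f in path)
        table[(height, path[-1])].append(path)
    return table


def ambiguous(table):
    """Return only the identifiers that more than one call path maps to."""
    return {cid: paths for cid, paths in table.items() if len(paths) > 1}


# Hypothetical program: leaf is reachable through a and through b.
CALL_GRAPH = {"main": ["a", "b"], "a": ["leaf"], "b": ["leaf"]}
FRAME_SIZE = {"main": 32, "a": 16, "b": 16, "leaf": 8}

# With equal frame sizes for a and b, both paths to leaf share one identifier.
before = ambiguous(context_table(CALL_GRAPH, FRAME_SIZE))

# Padding b's activation record gives each path a distinct stack height.
after = ambiguous(context_table(CALL_GRAPH, {**FRAME_SIZE, "b": 24}))
```

In this toy model the padding costs stack space but adds no executed instructions, which mirrors the claim in the abstract; how frame sizes are actually chosen in the paper is not covered here.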
Predicting Performance via Automated Feature-Interaction Detection
Cited by 2 (2 self)
Customizable programs and program families provide user-selectable features to allow users to tailor a program to an application scenario. Knowing in advance which feature selection yields the best performance is difficult because a direct measurement of all possible feature combinations is infeasible. Our work aims at predicting program performance based on selected features. However, when features interact, accurate predictions are challenging. An interaction occurs when a particular feature combination has an unexpected influence on performance. We present a method that automatically detects performance-relevant feature interactions to improve prediction accuracy. To this end, we propose three heuristics to reduce the number of measurements required to detect interactions. Our evaluation consists of six real-world case studies from varying domains (e.g., databases, encoding libraries, and web servers) using different configuration techniques (e.g., configuration files and preprocessor flags). Results show an average prediction accuracy of 95%.
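A minimal sketch of what a feature interaction means for prediction, assuming a purely additive baseline model and exhaustive pairwise checks. This is not one of the paper's three heuristics, which are not described in the abstract; the feature names, the `toy_measure` function, and the threshold are hypothetical.

```python
from itertools import combinations


def feature_deltas(measure, features):
    """Measure the base configuration and each feature's individual effect."""
    base = measure(frozenset())
    return base, {f: measure(frozenset([f])) - base for f in features}


def predict(base, deltas, selection, interactions=None):
    """Additive prediction plus any known interaction terms that apply."""
    total = base + sum(deltas[f] for f in selection)
    for pair, delta in (interactions or {}).items():
        if pair <= selection:
            total += delta
    return total


def detect_pairwise_interactions(measure, features, threshold):
    """Flag feature pairs whose measured cost deviates from the additive model."""
    base, deltas = feature_deltas(measure, features)
    found = {}
    for a, b in combinations(sorted(features), 2):
        pair = frozenset([a, b])
        residual = measure(pair) - predict(base, deltas, pair)
        if abs(residual) > threshold:
            found[pair] = residual
    return base, deltas, found


def toy_measure(selection):
    """Hypothetical benchmark: compress and encrypt cost extra when combined."""
    cost = {"compress": 20, "encrypt": 30, "index": -10}
    total = 100 + sum(cost[f] for f in selection)
    if {"compress", "encrypt"} <= selection:
        total += 15
    return total


FEATURES = ["compress", "encrypt", "index"]
base, deltas, found = detect_pairwise_interactions(toy_measure, FEATURES, threshold=5)
full = frozenset(FEATURES)
predicted = predict(base, deltas, full, found)
```

Checking every pair needs a number of measurements quadratic in the feature count, which is the cost that measurement-reducing heuristics such as those mentioned in the abstract aim to avoid.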
Exact temporal characterization of 10 Gbps optical wide-area network using high precision instrumentation
- In Proceedings of the 10th Internet Measurement Conference (IMC ’10)
, 2010
Cited by 1 (1 self)
We design and implement a novel class of highly precise network instrumentation and apply this tool to perform the first exact packet-timing measurements of a wide-area network ever undertaken, capturing 10 Gigabit Ethernet packets in flight on optical fiber. Through principled design, we improve timing precision by two to six orders of magnitude over existing techniques. Our observations contest several common assumptions about behavior of wide-area networks and the relationship between their input and output traffic flows. Further, we identify and characterize emergent packet chains as a mechanism to explain previously observed anomalous packet loss on receiver endpoints of such networks.
Automated Program Repair through the Evolution of Assembly Code
- Proc. IEEE/ACM Int’l Conf. Automated Software Eng
, 2010
Cited by 1 (1 self)
A method is described for automatically repairing legacy software at the assembly code level using evolutionary computation. The technique is demonstrated on Java byte code and x86 assembly programs, showing how to find program variations that correct defects while retaining desired behavior. Test cases are used to demonstrate the defect and define required functionality. The paper explores two advantages of assembly-level repair over earlier work at the source code level: the ability to repair programs written in many different languages, and the ability to repair bugs that were previously intractable. The paper reports experimental results showing reasonable performance of assembly language repair even on non-trivial programs.

