Results 1 - 10 of 144
Confidence Estimation for Speculation Control
- In 25th Annual International Symposium on Computer Architecture, 1998
Cited by 128 (3 self)
Modern processors improve instruction level parallelism by speculation. The outcome of data and control decisions is predicted, and the operations are speculatively executed and only committed if the original predictions were correct. There are a number of other ways that processor resources could be used, such as threading or eager execution. As the use of speculation increases, we believe more processors will need some form of speculation control to balance the benefits of speculation against other possible activities. Confidence estimation is one technique that can be exploited by architects for speculation control. In this paper, we introduce performance metrics to compare confidence estimation mechanisms, and argue that these metrics are appropriate for speculation control. We compare a number of confidence estimation mechanisms, focusing on mechanisms that have a small implementation cost and gain benefit by exploiting characteristics of branch predictors, such as clustering of ...
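One of the low-cost confidence estimators the abstract alludes to can be sketched as a per-branch resetting counter: count consecutive correct predictions, and treat the branch as confident only above a threshold. This is an illustrative sketch, not the paper's exact design; the table size, ceiling, and threshold values are invented.

```python
class ConfidenceEstimator:
    """Per-branch resetting counter: counts consecutive correct predictions.

    A speculation controller would query is_confident(pc) before committing
    resources to a predicted path. Parameters are illustrative.
    """

    def __init__(self, table_bits=10, ceiling=15, threshold=8):
        self.table = [0] * (1 << table_bits)
        self.mask = (1 << table_bits) - 1
        self.ceiling = ceiling      # saturation value of each counter
        self.threshold = threshold  # confidence cutoff

    def is_confident(self, pc):
        return self.table[pc & self.mask] >= self.threshold

    def update(self, pc, prediction_correct):
        i = pc & self.mask
        if prediction_correct:
            self.table[i] = min(self.table[i] + 1, self.ceiling)
        else:
            self.table[i] = 0  # reset on a mispredict
```

The reset-on-mispredict rule is what makes the estimator cheap: one small counter per table entry, updated at branch resolution.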
User-level Internet Path Diagnosis
- SOSP'03, 2003
Cited by 101 (14 self)
Diagnosing faults in the Internet is arduous and time-consuming, in part because the network is composed of diverse components spread across many administrative domains. We consider an extreme form of this problem: can end users, with no special privileges, identify and pinpoint faults inside the network that degrade the performance of their applications? To answer this question, we present both an architecture for user-level Internet path diagnosis and a practical tool to diagnose paths in the current Internet. Our architecture requires only a small amount of network support, yet it is nearly as complete as analyzing a packet trace collected at all routers along the path. Our tool, tulip, diagnoses reordering, loss and significant queuing events by leveraging well deployed but little exploited router features that approximate our architecture. Tulip can locate points of reordering and loss to within three hops and queuing to within four hops on most paths that we measured. This granularity is comparable to that of a hypothetical network tomography tool that uses 65 diverse hosts to localize faults on a given path. We conclude by proposing several simple changes to the Internet to further improve its diagnostic capabilities.
iWatcher: Efficient Architectural Support for Software Debugging
- In Proceedings of the 31st International Symposium on Computer Architecture (ISCA), 2004
Cited by 82 (12 self)
Recent impressive performance improvements in computer architecture have not led to significant gains in ease of debugging. Software debugging often relies on inserting run-time software checks. In many cases, however, it is hard to find the root cause of a bug. Moreover, program execution typically slows down significantly, often by 10-100 times.
A Hardware-Driven Profiling Scheme for Identifying Program Hot Spots to Support Runtime Optimization
- In Proceedings of the 26th Annual International Symposium on Computer Architecture, 1999
Cited by 80 (4 self)
This paper presents a novel hardware-based approach for identifying, profiling, and monitoring hot spots in order to support runtime optimization of general-purpose programs. The proposed approach consists of a set of tightly coupled hardware tables and control logic modules that are placed in the retirement stage of the processor pipeline, removed from the critical path. The features of the proposed design include rapid detection of program hot spots after changes in execution behavior, runtime-tunable selection criteria for hot spot detection, and negligible overhead during application execution. Experiments using several SPEC95 benchmarks, as well as several large WindowsNT applications, demonstrate the promise of the proposed design.
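The core idea of detection with runtime-tunable criteria can be approximated in software: count retired branches over a fixed window and flag those whose execution count crosses a threshold. This is a software analogue for illustration only; the window and threshold parameters are invented, not the paper's hardware table design.

```python
from collections import Counter

def find_hot_spots(retired_branch_pcs, window=1000, exec_threshold=50):
    """Flag branch PCs whose execution count within any window exceeds a
    tunable threshold -- a software analogue of a hardware hot-spot table.

    retired_branch_pcs: sequence of branch PCs in retirement order.
    Returns the set of PCs flagged as hot. Parameters are illustrative.
    """
    hot = set()
    for start in range(0, len(retired_branch_pcs), window):
        counts = Counter(retired_branch_pcs[start:start + window])
        hot.update(pc for pc, n in counts.items() if n >= exec_threshold)
    return hot
```

Because the check runs per window, a phase change in the program shows up as a change in the flagged set within one window's worth of retirement.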
HPCView: A tool for top-down analysis of node performance
- The Journal of Supercomputing, 2002
Quanto: Tracking Energy in Networked Embedded Systems
Cited by 71 (11 self)
We present Quanto, a network-wide time and energy profiler for embedded network devices. By combining well-defined interfaces for hardware power states, fast high-resolution energy metering, and causal tracking of programmer-defined activities, Quanto can map how energy and time are spent on nodes and across a network. Implementing Quanto on the TinyOS operating system required modifying under 350 lines of code and adding 1275 new lines. We show that being able to take fine-grained energy consumption measurements as fast as reading a counter allows developers to precisely quantify the effects of low-level system implementation decisions, such as using DMA versus direct bus operations, or the effect of external interference on the power draw of a low duty-cycle radio. Finally, Quanto is lightweight enough that it has a minimal effect on system behavior: each sample takes 100 CPU cycles and 12 bytes of RAM.
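The accounting the abstract describes reduces to integrating per-power-state draw over the intervals each programmer-defined activity is active. A minimal sketch, with invented state names and power draws (not Quanto's API):

```python
# Illustrative power draw per hardware state, in milliwatts (invented values).
POWER_MW = {"cpu_active": 5.4, "cpu_idle": 0.1, "radio_rx": 17.0}

def energy_by_activity(intervals):
    """Attribute energy to activities from (activity, power_state, duration_s)
    records, as an activity tracker might emit. Returns millijoules per
    activity: energy = power(state) * time, summed over intervals."""
    totals = {}
    for activity, state, duration_s in intervals:
        totals[activity] = totals.get(activity, 0.0) + POWER_MW[state] * duration_s
    return totals
```

The hard part Quanto solves is not this arithmetic but the causal tracking that labels each interval with the right activity across nodes.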
Rapidly selecting good compiler optimizations using performance counters
- In Proceedings of the 5th Annual International Symposium on Code Generation and Optimization (CGO), 2007
Cited by 66 (24 self)
Applying the right compiler optimizations to a particular program can have a significant impact on program performance. Due to the non-linear interaction of compiler optimizations, however, determining the best setting is nontrivial. There have been several proposed techniques that search the space of compiler options to find good solutions; however, such approaches can be expensive. This paper proposes a different approach using performance counters as a means of determining good compiler optimization settings. This is achieved by learning a model off-line, which can then be used to determine good settings for any new program. We show that such an approach outperforms the state-of-the-art and is two orders of magnitude faster on average. Furthermore, we show that our performance counter based approach outperforms techniques based on static code features. Finally, we show that such improvements are stable across varying input data sets. Using our technique we achieve a 10% improvement over the highest optimization setting of the commercial PathScale EKOPath 2.3.1 optimizing compiler on the SPEC benchmark suite on a recent AMD Athlon 64 3700+ platform in just three evaluations.
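The flavor of such a counter-based model can be sketched as nearest-neighbor lookup: characterize a new program by a vector of performance-counter readings, then reuse the best-known flag setting of the most similar training program. This is a simplification for illustration; the counter features, flag strings, and distance metric are invented, not the paper's learned model.

```python
import math

def best_flags(new_counters, training_set):
    """Pick compiler flags for a new program by counter-vector similarity.

    new_counters: list of normalized performance-counter readings.
    training_set: list of (counter_vector, best_flag_string) pairs gathered
    off-line. Returns the flags of the nearest training program.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    _, flags = min(training_set, key=lambda entry: dist(entry[0], new_counters))
    return flags
```

The payoff claimed in the abstract comes from the lookup being cheap: a few profiled runs of the new program replace a long search over flag combinations.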
AccMon: Automatically Detecting Memory-related Bugs via Program Counter-based Invariants
- In 37th International Symposium on Microarchitecture (MICRO), 2004
Cited by 65 (12 self)
This paper makes two contributions to architectural support for software debugging. First, it proposes a novel statistics-based, on-the-fly bug detection method called PC-based invariant detection. The idea is based on the observation that, in most programs, a given memory location is typically accessed by only a few instructions. Therefore, by capturing the invariant of the set of PCs that normally access a given variable, we can detect accesses by outlier instructions, which are often caused by memory corruption, buffer overflow, stack smashing or other memory-related bugs. Since this method is statistics-based, it can detect bugs that do not violate any programming rules and that, therefore, are likely to be missed by many existing tools. The second contribution is a novel architectural extension called the Check Look-aside Buffer (CLB). The CLB uses a Bloom filter to reduce monitoring overheads in the recently proposed iWatcher architectural framework for software debugging. The CLB significantly reduces the overhead of PC-based invariant debugging.
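The PC-based invariant idea can be sketched directly: during a training phase, record which instruction addresses touch each memory location; afterwards, an access from a PC outside that set is flagged as a candidate bug. A minimal software sketch, with a training/detection split of my own choosing rather than AccMon's statistical machinery:

```python
from collections import defaultdict

class PCInvariantDetector:
    """Track the set of PCs that access each address; flag outlier PCs.

    During training, every (pc, addr) pair extends the invariant. After
    training, an access from an unseen PC is reported as anomalous --
    the signature of corruption, overflow, or stack smashing.
    """

    def __init__(self):
        self.pcs_for_addr = defaultdict(set)
        self.training = True

    def access(self, pc, addr):
        """Record an access; return True if it looks anomalous."""
        known = self.pcs_for_addr[addr]
        if self.training or pc in known:
            known.add(pc)
            return False
        return True  # outlier PC: candidate memory bug
```

Hardware support (the CLB in the abstract) exists precisely because doing this set lookup in software on every memory access is expensive.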
A Survey of Adaptive Optimization in Virtual Machines
- Proceedings of the IEEE, 93(2), 2005. Special Issue on Program Generation, Optimization, and Adaptation
Cited by 59 (6 self)
Virtual machines face significant performance challenges beyond those confronted by traditional static optimizers. First, portable program representations and dynamic language features, such as dynamic class loading, force the deferral of most optimizations until runtime, inducing runtime optimization overhead. Second, modular ...
A performance counter architecture for computing accurate CPI components
- In ASPLOS, 2006
Cited by 55 (8 self)
A common way of representing processor performance is to use Cycles per Instruction (CPI) 'stacks', which break performance into a baseline CPI plus a number of individual miss-event CPI components. CPI stacks can be very helpful in gaining insight into the behavior of an application on a given microprocessor; consequently, they are widely used by software application developers and computer architects. However, computing CPI stacks on superscalar out-of-order processors is challenging because of various overlaps among execution and miss events (cache misses, TLB misses, and branch mispredictions). This paper shows that meaningful and accurate CPI stacks can be computed for superscalar out-of-order processors. Using interval analysis, a novel method for analyzing out-of-order processor performance, we gain understanding into the performance impact of the various miss events. Based on this understanding, we propose a novel way of architecting hardware performance counters for building accurate CPI stacks. The additional hardware for implementing these counters is limited and comparable to existing hardware performance counter architectures while being significantly more accurate than previous approaches.
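Assembling a CPI stack is simple arithmetic once counters attribute cycles to each miss event without overlap, which is exactly what the proposed counter architecture aims to provide. A sketch under that assumption, with invented event names:

```python
def cpi_stack(total_cycles, retired_insns, event_cycles):
    """Build a CPI stack from per-event cycle counters.

    event_cycles: cycles attributed to each miss event, e.g.
    {"dcache": ..., "branch": ...}, assumed non-overlapping. Whatever
    cycles remain after subtracting all events form the baseline CPI.
    """
    stack = {ev: cycles / retired_insns for ev, cycles in event_cycles.items()}
    stack["base"] = (total_cycles - sum(event_cycles.values())) / retired_insns
    return stack
```

On a real out-of-order machine the non-overlap assumption is the hard part: a cache miss hidden under other work should contribute nothing to its component, which naive event counters get wrong.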