Results 1 - 6 of 6
Metric: Memory tracing via dynamic binary rewriting to identify cache inefficiencies
- ACM Transactions on Programming Languages and Systems, 2007
- Cited by 3 (1 self)
With the diverging improvements in CPU speeds and memory access latencies, detecting and removing memory access bottlenecks becomes increasingly important. In this work we present METRIC, a software framework for isolating and understanding such bottlenecks using partial access traces. METRIC extracts access traces from executing programs without special compiler or linker support. We make four primary contributions. First, we present a framework for extracting partial access traces based on dynamic binary rewriting of the executing application. Second, we introduce a novel algorithm for compressing these traces. The algorithm generates constant space representations for regular accesses occurring in nested loop structures. Third, we use these traces for offline incremental memory hierarchy simulation. We extract symbolic information from the application executable and use this to generate detailed source-code correlated statistics including per-reference metrics, cache evictor information and stream metrics. Finally, we demonstrate how this information can be used to isolate and understand memory access inefficiencies. This illustrates a potential advantage of METRIC over compile-time analysis for sample codes, particularly when interprocedural analysis is required. Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors—compilers; optimization;
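The METRIC abstract mentions constant-space representations for regular accesses in loops. The following is a minimal sketch of that general idea, not METRIC's actual algorithm: it greedily folds an address trace into hypothetical `(start, stride, count)` descriptors and handles only a single loop level, whereas the abstract describes support for nested loop structures. The names `Run`, `compress`, and `expand` are assumptions made for illustration.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    """Hypothetical descriptor for a run of regularly strided accesses."""
    start: int   # first address in the run
    stride: int  # constant difference between consecutive addresses
    count: int   # number of accesses the run covers


def compress(addresses: List[int]) -> List[Run]:
    """Greedily fold an address trace into constant-stride runs."""
    runs: List[Run] = []
    i, n = 0, len(addresses)
    while i < n:
        start = addresses[i]
        if i + 1 == n:
            # A single trailing access has no stride to infer.
            runs.append(Run(start, 0, 1))
            break
        stride = addresses[i + 1] - start
        j = i + 1
        # Extend the run while the stride stays constant.
        while j + 1 < n and addresses[j + 1] - addresses[j] == stride:
            j += 1
        runs.append(Run(start, stride, j - i + 1))
        i = j + 1
    return runs


def expand(runs: List[Run]) -> List[int]:
    """Reconstruct the original trace from its runs (lossless)."""
    return [r.start + k * r.stride for r in runs for k in range(r.count)]
```

Under these assumptions, a loop that walks an array of 8-byte elements produces a single descriptor regardless of trip count, which is what makes the representation constant in space for regular accesses.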
Owl: Next-generation system monitoring
- In ACM Computing Frontiers, 2005
- Cited by 3 (0 self)
As microarchitectural and system complexity grows, comprehending system behavior becomes increasingly difficult, and often requires obtaining and sifting through voluminous event traces or coordinating results from multiple, nonlocalized sources. Owl is a proposed framework that overcomes limitations faced by traditional performance counters and monitoring facilities in dealing with such complexity by pervasively deploying programmable monitoring elements throughout a system. The design exploits reconfigurable or programmable logic to realize hardware monitors located at event sources, such as memory buses. These monitors run and write back results autonomously with respect to the CPU, mitigating the system impact of interrupt-driven monitoring and the need to communicate irrelevant events to higher levels of the system. The monitors are designed to snoop any kind of system transaction, e.g., within the core, on a bus, across the wire, or within I/O devices.
A hybrid hardware/software approach to efficiently determine cache coherence bottlenecks
- In International Conference on Supercomputing, 2005 (accepted)
- Cited by 2 (2 self)
High-end computing increasingly relies on shared-memory multiprocessors (SMPs), such as clusters of SMPs, nodes of chip multiprocessors (CMPs) or large-scale single-system image (SSI) SMPs. In such systems, performance is often affected by the sharing pattern of data within applications and its impact on cache coherence. Sharing patterns that result in frequent invalidations followed by subsequent coherence misses create cache coherence bottlenecks with significant performance penalties. Past work on identifying coherence bottlenecks based on tracing memory accesses incurs considerable runtime overhead and does not scale well with increasing problem sizes, which makes it infeasible to use with real-world programs. In this paper, we introduce a novel low-cost, hardware-assisted approach to determine coherence bottlenecks in shared-memory OpenMP applications. We assess the merits of our approach on a
Analysis of Cache Coherence Bottlenecks with Hybrid Hardware/Software Techniques
- Cited by 1 (0 self)
Application performance on high-performance shared-memory systems is often limited by sharing patterns resulting in cache-coherence bottlenecks. Current approaches to identify coherence bottlenecks incur considerable run-time overhead and do not scale. We present two novel hardware-assisted coherence-analysis techniques that reduce trace sizes by two orders of magnitude over full traces. First, hardware performance monitoring is combined with capturing stores in software to provide a lossy-trace mechanism, which is an order of magnitude faster than software-instrumentation-based full-tracing and retains accuracy. Second, selected long-latency loads are instrumented via binary rewriting, which provides even higher accuracy and control over tracing but requires additional overhead.
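To make concrete what a coherence analysis over a memory trace counts, here is a minimal sketch under simplifying assumptions: infinite per-thread caches, a basic write-invalidate protocol, and a hypothetical trace format of `(thread_id, op, address)` tuples with `op` being `"ld"` or `"st"`. It is not the hardware-assisted technique the abstract describes. It only illustrates how invalidations followed by later accesses turn into coherence misses attributed to a cache line.

```python
from collections import Counter, defaultdict
from typing import Iterable, Tuple

LINE_SIZE = 64  # assumed cache-line size in bytes


def coherence_misses(trace: Iterable[Tuple[int, str, int]],
                     line_size: int = LINE_SIZE) -> Counter:
    """Count coherence misses per cache line for a multi-threaded trace."""
    valid = defaultdict(set)        # line -> threads holding a valid copy
    invalidated = defaultdict(set)  # line -> threads invalidated by a remote store
    misses: Counter = Counter()
    for tid, op, addr in trace:
        line = addr // line_size
        if tid in invalidated[line]:
            # This thread's copy was invalidated by another thread's store.
            misses[line] += 1
            invalidated[line].discard(tid)
        if op == "st":
            # A store invalidates every other valid copy of the line.
            invalidated[line] |= valid[line] - {tid}
            valid[line] = {tid}
        else:
            valid[line].add(tid)
    return misses
```

With this model, two threads that alternately store to different words of the same 64-byte line (false sharing) accumulate one miss per ownership hand-off, while the same stores to separate lines produce none.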
A SEMANTICS-BASED APPROACH TO OPTIMIZING UNSTRUCTURED MESH ABSTRACTIONS
- 2008
Computational scientists are frequently confronted with a choice: implement algorithms using high-level abstractions, such as matrices and mesh entities, for greater programming productivity or code them using low-level language constructs for greater execution efficiency. We have observed that the cost of implementing a representative unstructured mesh code with high-level abstractions is poor computational intensity (the ratio of floating-point operations to memory accesses). Related scientific applications frequently produce little “science per cycle” because their abstractions both introduce additional overhead and hinder compiler analysis and subsequent optimization. Our work exploits the semantics of abstractions, as employed in unstructured mesh codes, to overcome these limitations and to guide a series of manual, domain-specific optimizations that significantly improve computational intensity. We propose a framework for the automation of such high-level optimizations within the ROSE source-to-source compiler infrastructure. The specification of optimizations is left to domain experts and library writers who best understand
Low Overhead Spatial and Temporal Data Locality Analysis
- 2003
Performance is increasingly sensitive to cache behavior because of the growing gap between processor cycle time and memory latency. To improve performance, applications need to be optimized for data locality. Run-time analysis of spatial and temporal data locality can be used to facilitate this and should help both manual tuning and feedback-based compiler optimizations. Identifying cache behavior of individual data structures further enhances the optimization process. Current methods to perform such analysis include simulation combined with set sampling or time sampling, and hardware monitoring. Sampling often suffers from either poor accuracy or large run-time overhead, while hardware measurements have limited flexibility. We present DLTune, a prototype tool that performs spatial and temporal data-locality analysis at run time. It measures both spatial and temporal locality for the entire application and individual data structures in a single run, and effectively exposes poor data locality based on miss ratio estimates of fully-associative caches. The tool is based on an elaborate and novel sampling technique that allows all information to be collected in a single run with an overall sampling rate as low as one memory reference in ten million and an average slowdown below five on large workloads.
1 Introduction
The use of caches and advanced memory hierarchies in modern computers assumes locality of data to perform well. Poor data locality causes the processor to stall frequently because many data accesses miss in the cache. Analysis of data locality helps in designing efficient software, both when used in a compiler framework to guide optimization decisions, and as a programming aid in the form of profiling tools. It is essential that such a tool is fast enough to allow for real-sized data sets and provides short turn-around times.
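The DLTune abstract refers to miss ratio estimates of fully-associative caches. One classical way to obtain such estimates is LRU stack (reuse) distance analysis. The sketch below computes it exhaustively over a whole trace of cache-line identifiers, so it is not DLTune's sampling technique. The function names and the trace format are assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def stack_distances(lines: Sequence[int]) -> List[Optional[int]]:
    """Return the LRU stack distance of each access (None for a first touch)."""
    stack: List[int] = []  # most recently used line is at the end
    dists: List[Optional[int]] = []
    for line in lines:
        if line in stack:
            pos = stack.index(line)
            # Number of distinct lines touched since the previous use.
            dists.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            dists.append(None)  # cold miss: never seen before
        stack.append(line)
    return dists


def miss_ratio(dists: Sequence[Optional[int]], cache_lines: int) -> float:
    """Miss ratio of a fully-associative LRU cache holding `cache_lines` lines."""
    misses = sum(1 for d in dists if d is None or d >= cache_lines)
    return misses / len(dists)
```

Because the distances do not depend on cache size, one pass over the trace yields the miss ratio for every fully-associative LRU capacity. This simple list-based version takes quadratic time in the worst case and is only meant to show the definition.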

