Memory Forwarding: Enabling Aggressive Layout Optimizations by Guaranteeing the Safety of Data Relocation
- IN PROCEEDINGS OF THE 26TH INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE
, 1999
Cited by 33 (1 self)
By optimizing data layout at run-time, we can potentially enhance the performance of caches by actively creating spatial locality, facilitating prefetching, and avoiding cache conflicts and false sharing. Unfortunately, it is extremely difficult to guarantee that such optimizations are safe in practice on today's machines, since accurately updating all pointers to an object requires perfect alias information, which is well beyond the scope of the compiler for languages such as C. To overcome this limitation, we propose a technique called memory forwarding which effectively adds a new layer of indirection within the memory system whenever necessary to guarantee that data relocation is always safe. Because actual forwarding rarely occurs (it exists as a safety net), the mechanism can be implemented as an exception in modern superscalar processors. Our experimental results demonstrate that the aggressive layout optimizations enabled by memory forwarding can result in significant speedups--...
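The indirection described in this abstract can be illustrated with a small software model. This is only a sketch under assumptions not stated in the abstract: the per-word forwarding flag, the class name `ForwardingMemory`, and its methods are all hypothetical, and the paper's mechanism is in hardware rather than in a Python list.

```python
class ForwardingMemory:
    """Toy word-addressed memory with a hypothetical per-word forwarding flag."""

    def __init__(self, size):
        self.words = [0] * size
        # Assumption: one flag per word marks "this word holds a forwarding address".
        self.forwarded = [False] * size

    def resolve(self, addr):
        # Follow forwarding addresses until a word holding real data is reached.
        while self.forwarded[addr]:
            addr = self.words[addr]
        return addr

    def load(self, addr):
        return self.words[self.resolve(addr)]

    def store(self, addr, value):
        self.words[self.resolve(addr)] = value

    def relocate(self, old, new, length):
        # Copy each word to its new home, then leave a forwarding address
        # behind so that stale pointers to the old location stay valid.
        for i in range(length):
            src = self.resolve(old + i)
            dst = new + i
            if src == dst:
                continue  # already at the destination
            self.words[dst] = self.words[src]
            self.forwarded[dst] = False
            self.words[src] = dst
            self.forwarded[src] = True


mem = ForwardingMemory(16)
mem.store(2, 7)
mem.store(3, 9)
mem.relocate(2, 10, 2)   # move a two-word object from address 2 to address 10
stale = mem.load(2)      # a stale pointer still reads the relocated data
mem.store(3, 42)         # a store through a stale pointer lands at the new home
```

In this model only accesses through stale pointers pay for the extra hop, which mirrors the abstract's point that forwarding acts as a rarely used safety net.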
METRIC: Tracking Down Inefficiencies in the Memory Hierarchy via Binary Rewriting
, 2003
Cited by 29 (15 self)
In this paper, we present METRIC, an environment for determining memory inefficiencies by examining data traces. METRIC is designed to alter the performance behavior of applications that are mostly constrained by their latency to resolve memory references. We make four primary contributions in this paper. First, we present methods to extract partial data traces from running applications by observing their memory behavior via dynamic binary rewriting. Second, we present a methodology to represent partial data traces in constant space for regular references through a novel technique for online compression of reference streams. Third, we employ offline cache simulation to derive indications about memory performance bottlenecks from partial data traces. By exploiting summarized memory metrics, by-reference metrics as well as cache evictor information, we can pin-point the sources of performance problems. Fourth, we demonstrate the ability to derive opportunities for optimizations and assess their benefits in several experiments resulting in up to 40% lower miss ratios.
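The idea of storing regular reference streams in constant space can be illustrated with a simple stride-run encoding. This is a generic stand-in, not METRIC's actual compression scheme: the function names and the `(start, stride, count)` descriptor layout are assumptions made for this sketch.

```python
def compress_strided(addresses):
    """Greedily pack an address stream into (start, stride, count) runs."""
    runs = []
    for addr in addresses:
        if runs:
            start, stride, count = runs[-1]
            if count == 1:
                # The second address of a run fixes its stride.
                runs[-1] = (start, addr - start, 2)
                continue
            if addr == start + stride * count:
                # The address continues the current regular pattern.
                runs[-1] = (start, stride, count + 1)
                continue
        # Pattern broken (or first address): open a new run.
        runs.append((addr, 0, 1))
    return runs


def expand(runs):
    """Reconstruct the original address stream from the run descriptors."""
    return [start + stride * i for start, stride, count in runs for i in range(count)]


trace = [100, 108, 116, 124, 500, 504, 508]
runs = compress_strided(trace)
```

A perfectly regular stream of any length collapses to a single descriptor, which is the sense in which the representation is constant-space; irregular streams degrade toward one descriptor per one or two references.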
Concurrent event handling through multithreading
- doi: 10.1109/12.795220. [Online]. Available: http://dx.doi.org/10.1109/12.795220 826 PROCEEDINGS OF THE FEDCSIS
, 1999
Balanced Multithreading: Increasing Throughput via a Low Cost Multithreading Hierarchy
- In Proceedings of the 37th International Symposium on Microarchitecture. IEEE
, 2004
Cited by 21 (3 self)
A simultaneous multithreading (SMT) processor can issue instructions from several threads every cycle, allowing it to effectively hide various instruction latencies; this effect increases with the number of simultaneous contexts supported. However, each added context on an SMT processor incurs a cost in complexity, which may lead to an increase in pipeline length or a decrease in the maximum clock rate. This paper presents new designs for multithreaded processors which combine a conservative SMT implementation with a coarse-grained multithreading capability. By presenting more virtual contexts to the operating system and user than are supported in the core pipeline, the new designs can take advantage of the memory parallelism present in workloads with many threads, while avoiding the performance penalties inherent in a many-context SMT processor design. A design with 4 virtual contexts, but which is based on a 2-context SMT processor core, gains an additional 26% throughput when 4 threads are run together.
Colorama: Architectural support for data-centric synchronization
- In Proceedings of the 13th International Symposium on High-Performance Computer Architecture
, 2007
Cited by 20 (4 self)
With the advent of ubiquitous multi-core architectures, a major challenge is to simplify parallel programming. One way to tame one of the main sources of programming complexity, namely synchronization, is transactional memory (TM). However, we argue that TM does not go far enough, since the programmer still needs non-local reasoning to decide where to place transactions in the code. A significant improvement to the art is Data-Centric Synchronization (DCS), where the programmer uses local reasoning to assign synchronization constraints to data. Based on these, the system automatically infers critical sections and inserts synchronization operations. This paper proposes novel architectural support to make DCS feasible, and describes its programming model and interface. The proposal, called Colorama, needs only modest hardware extensions, supports general-purpose, pointer-based languages such as C/C++ and, in our opinion, can substantially simplify the task of writing new parallel programs.
Compiler Orchestrated Prefetching via Speculation and Predication
- In ASPLOS-XI: Proceedings of the 11th international conference on Architectural
, 2004
Cited by 20 (3 self)
This paper introduces a compiler-orchestrated prefetching system as a unified framework geared toward ameliorating the gap between processing speeds and memory access latencies. We focus the scope of the optimization on specific subsets of the program dependence graph that succinctly characterize the memory access pattern of both regular array-based applications and irregular pointer-intensive programs. We illustrate how program embedded precomputation via speculative execution can accurately predict and effectively prefetch future memory references with negligible overhead. The proposed techniques reduce the total running time of seven SPEC benchmarks and two OLDEN benchmarks by 27% on an Itanium 2 processor. The improvements are in addition to several state-of-the-art optimizations including software pipelining and data prefetching. In addition, we use cycle-accurate simulations to identify important and lightweight architectural innovations that further mitigate the memory system bottleneck. In particular, we focus on the notoriously challenging class of pointer-chasing applications, and demonstrate how they may benefit from a novel scheme of sentineled prefetching. Our results for twelve SPEC benchmarks demonstrate that 45% of the processor stalls that are caused by the memory system are avoidable. The techniques in this paper can effectively mask long memory latencies with little instruction overhead, and can readily contribute to the performance of processors today.
Shared-memory performance profiling
- In Sixth ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming
, 1997
Cited by 15 (4 self)
This paper describes a new approach to finding performance bottlenecks in shared-memory parallel programs and its embodiment in the Paradyn Parallel Performance Tools running with the Blizzard fine-grain distributed shared memory system. This approach exploits the underlying system’s cache coherence protocol to detect data sharing patterns that indicate potential performance bottlenecks and presents performance measurements in a data-centric manner. As a demonstration, Paradyn helped us improve the performance of a new shared-memory application program by a factor of four.
Architectural Adaptation for Application-Specific Locality Optimizations
- IN PROCEEDINGS OF THE 1997 IEEE INTERNATIONAL CONFERENCE ON COMPUTER DESIGN
, 1997
Cited by 14 (4 self)
We propose a machine architecture that integrates programmable logic into key components of the system with the goal of customizing architectural mechanisms and policies to match an application. This approach presents an improvement over the traditional approach of exploiting programmable logic as a separate co-processor by preserving machine usability through software, and over traditional computer architecture by providing application-specific hardware assists. We present two case studies of architectural customization to enhance latency tolerance and efficiently utilize network bisection on multiprocessors for sparse matrix computations. We demonstrate that using application-specific hardware assists and policies can provide substantial improvements in performance on a per-application basis. Based on these preliminary results, we propose that an application-driven machine customization provides a promising approach to achieve high performance and combat performance fragility.
Pangaea: a tightly-coupled IA32 heterogeneous chip multiprocessor in
- ACM
Cited by 14 (0 self)
Moore’s Law and the drive towards performance efficiency have led to the on-chip integration of general-purpose cores with special-purpose accelerators. Pangaea is a heterogeneous CMP design for non-rendering workloads that integrates IA32 CPU cores with non-IA32 GPU-class multicores, extending the current state-of-the-art CPU-GPU integration that physically “fuses” existing CPU and GPU designs. Pangaea introduces (1) a resource repartitioning of the GPU, where the hardware budget dedicated for 3D-specific graphics processing is used to build more general-purpose GPU cores, and (2) a 3-instruction extension to the IA32 ISA that supports tighter architectural integration and fine-grain shared memory collaborative multithreading between the IA32 CPU cores and the non-IA32 GPU cores. We implement Pangaea and the current CPU-GPU designs in fully-functional synthesizable RTL based on the production quality RTL of an IA32 CPU and an Intel GMA X4500 GPU. On a 65 nm ASIC process technology, the legacy graphics-specific fixed-function hardware has the area of 9 GPU cores and total power consumption of 5 GPU cores. With the ISA extensions, the latency from the time an IA32 core spawns a GPU thread to the time the thread begins execution is reduced from thousands of cycles to fewer than 30 cycles. Pangaea is synthesized on an FPGA-based prototype and runs off-the-shelf IA32 OSes. A set of general-purpose non-graphics workloads demonstrate speedups of up to 8.8×.