Results 1 - 9 of 9
Static Analysis and Compiler Design for Idempotent Processing
Abstract - Cited by 5 (3 self)
Recovery functionality has many applications in computing systems, from speculation recovery in modern microprocessors to fault recovery in high-reliability systems. Modern systems commonly recover using checkpoints. However, checkpoints introduce overheads, add complexity, and often save more state than necessary. This paper develops a novel compiler technique to recover program state without the overheads of explicit checkpoints. The technique breaks programs into idempotent regions—regions that can be freely re-executed—which allows recovery without checkpointed state. Leveraging the property of idempotence, recovery can be obtained by simple re-execution. We develop static analysis techniques to construct these regions and demonstrate low overheads and large region sizes for an LLVM-based implementation. Across a set of diverse benchmark suites, we construct idempotent regions close in size to those that could be obtained with perfect runtime information. Although the resulting code runs more slowly, typical performance overheads are in the range of just 2-12%. We call this paradigm of executing entire programs as a series of idempotent regions idempotent processing, and it has many applications in computer systems. As a concrete example, we demonstrate it applied to the problem of compiler-automated hardware fault recovery. In comparison to two other state-of-the-art techniques, redundant execution and checkpoint-logging, our idempotent processing technique outperforms both by over 15%.
Idempotent Code Generation: Implementation, Analysis, and Evaluation
Abstract - Cited by 4 (1 self)
Leveraging idempotence for efficient recovery is of emerging interest in compiler design. In particular, identifying semantically idempotent code and then compiling such code to preserve the semantic idempotence property enables recovery with substantially lower overheads than competing software techniques. However, the efficacy of this technique depends on application-, architecture-, and compiler-specific factors that are not well understood. In this paper, we develop algorithms for the code generation of idempotent code regions and evaluate these algorithms considering how they are impacted by these factors. Without optimizing for these factors, we find that typical performance overheads fall in the range of roughly 10-15%. However, manipulating application idempotent region size typically improves the run-time performance of compiled code by 2-10%, differences in the architecture instruction set affect performance by up to 15%, and knowing in the compiler whether control flow side-effects can or cannot occur can impact performance by up to 10%. Overall, we find that, with small idempotent regions and careful architecture- and application-specific tuning, it is possible to bring compiler performance overheads consistently down into the single-digit percentage range. The absolute best performance occurs when constructing the largest possible idempotent regions; to this end, however, better compiler support is needed. In the interest of spurring development in this area, we open-source our LLVM compiler implementation and make it available as a research tool.
Software Data-Triggered Threads
Abstract - Cited by 3 (2 self)
The data-triggered threads (DTT) programming and execution model can increase parallelism and eliminate redundant computation. However, the initial proposal requires significant architecture support, which impedes existing applications and architectures from taking advantage of this model. This work proposes a pure software solution that supports the DTT model without any hardware support. This research uses a prototype compiler and runtime libraries running on top of existing machines. Several enhancements to the initial software implementation are presented, which further improve the performance. The software runtime system improves the performance of serial C SPEC benchmarks by 15% on a Nehalem processor, but by over 7X over the full suite of single-thread applications. It is shown that the DTT model can work in conjunction with traditional parallelism. The DTT model provides up to 64X speedup over parallel applications exploiting traditional parallelism.
CDTT: Compiler-Generated Data-Triggered Threads
Abstract - Cited by 1 (0 self)
This paper presents CDTT, a compiler framework that takes C/C++ code and automatically generates a binary that eliminates dynamically redundant code without programmer intervention. It does so by exploiting underlying hardware or software support for the data-triggered threads (DTT) programming and execution model. With the help of idempotence analysis and inter-procedural name dependence analysis, CDTT identifies potential code regions and composes support thread functions that execute as soon as live-in data changes. CDTT can also use profile data to target the elimination of redundant computation. The compiled binary running on top of a software runtime system can achieve nearly the same level of performance as careful hand-coded modifications in most benchmarks. CDTT improves the performance of serial C SPEC benchmarks by as much as 57% (average 11%) on a Nehalem processor.
iThreads: A Threading Library for Parallel Incremental Computation
Abstract
Incremental computation strives for efficient successive runs of applications by re-executing only those parts of the computation that are affected by a given input change instead of recomputing everything from scratch. To realize these benefits automatically, we describe iThreads, a threading library for parallel incremental computation. iThreads supports unmodified shared-memory multithreaded programs: it can be used as a replacement for pthreads by a simple exchange of dynamically linked libraries, without even recompiling the application code. To enable such an interface, we designed algorithms and an implementation to operate at the compiled binary code level by leveraging MMU-assisted memory access tracking and process-based thread isolation. Our evaluation on a multicore platform using applications from the PARSEC and Phoenix benchmarks and two case studies shows significant performance gains.
Data-triggered Multithreading for Near-Data Processing
Abstract
Data-centric computing becomes increasingly important because of the rapid growth of application data. In this work, we introduce the DTM (Data-Triggered Multithreading) programming model that extends the DTT (Data-Triggered Thread) model and is fully compatible with existing C/C++ programs. The DTM model naturally attaches computation to data. Therefore, the runtime system can dynamically allocate the computing resource that provides affinity and locality. We demonstrate the potential of the DTM model to improve response time and improve scalability over the traditional multithreaded programming model.
Rhythm: Harnessing Data Parallel Hardware for
Abstract
Trends in increasing web traffic demand an increase in server throughput while preserving energy efficiency and total cost of ownership. Present work in optimizing data center efficiency primarily focuses on the data center as a whole, using off-the-shelf hardware for individual servers. Server capacity is typically increased by adding more machines, which is cheap, though inefficient in the long run in terms of energy and area. Our work builds on the observation that server workload execution patterns are not completely unique across multiple requests. We present a framework—called Rhythm—for high throughput servers that can exploit similarity across requests to improve server performance and power/energy efficiency by launching data parallel executions for request cohorts. An implementation of the SPECWeb Banking workload using Rhythm on NVIDIA GPUs provides a basis for evaluating both software and hardware for future cohort-based servers. Our evaluation of Rhythm on future server platforms shows that it achieves 4× the throughput (reqs/sec) of a Core i7 at efficiencies (reqs/Joule) comparable to a dual-core ARM Cortex A9. A Rhythm implementation that generates transposed responses achieves 8× the i7 throughput while processing 2.5× more requests/Joule compared to the A9.
Evaluating Private vs. Shared Last-Level Caches for Energy Efficiency in Asymmetric Multi-Cores
Abstract
In this work we explore the tradeoffs between energy and performance for several last-level cache configurations in an asymmetric multi-core system. We show that for switching threads between cores at intervals on the order of 100k or more instructions, the performance difference is negligible when private last-level caches are used in place of shared last-level caches. Thus, last-level caches can be matched to meet the needs of their host core in order to improve energy efficiency. In particular, we show that when private last-level caches are used to maintain thread state, in conjunction with energy-saving optimizations, the energy delay product of the last-level caches can be reduced by 25% on average for switching frequencies on the order of an operating system scheduling quantum—e.g., every 1 million instructions. Further, the optimizations we propose, such as power-state-aware data forwarding, are simple to implement, and the necessary support for them is already present in most current architectures.