Results 1 - 10
of
37
Runtime specialization with optimistic heap analysis
- In Proceedings of the ACM Conference on Object-Oriented Programming Systems, Languages and Applications
, 2005
"... We describe a highly practical program specializer for Java programs. The specializer is powerful, because it specializes optimistically, using (potentially transient) constants in the heap; it is precise, because it specializes using data structures that are only partially invariant; it is deployab ..."
Abstract
-
Cited by 18 (0 self)
- Add to MetaCart
(Show Context)
We describe a highly practical program specializer for Java programs. The specializer is powerful, because it specializes optimistically, using (potentially transient) constants in the heap; it is precise, because it specializes using data structures that are only partially invariant; it is deployable, because it is hidden in a JIT compiler and does not require any user annotations or offline preprocessing; it is simple, because it uses existing JIT compiler ingredients; and it is fast, because it specializes programs in under 1s. These properties are the result of (1) a new algorithm for selecting specializable code fragments, based on a notion of influence; (2) a precise store profile for identifying constant heap locations; and (3) an efficient invalidation mechanism for monitoring optimistic assumptions about heap constants. Our implementation of the specializer in the Jikes RVM has low overhead, selects specialization points that would be chosen manually, and produces speedups ranging from a factor of 1.2 to 6.4, comparable with annotationguided specializers.
A Self-Repairing Prefetcher in an Event-Driven Dynamic Optimization Framework
- IN PROCEEDINGS OF THE INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION (CGO 2006)
, 2006
"... Software prefetching has been demonstrated as a powerful technique to tolerate long load latencies. However, to be effective, prefetching must target the most critical (frequently missing) loads, and prefetch them sufficiently far in advance. This is difficult to do correctly with a static optimizer ..."
Abstract
-
Cited by 16 (4 self)
- Add to MetaCart
Software prefetching has been demonstrated as a powerful technique to tolerate long load latencies. However, to be effective, prefetching must target the most critical (frequently missing) loads, and prefetch them sufficiently far in advance. This is difficult to do correctly with a static optimizer, because locality characteristics and cache latencies vary across data inputs and across different machines. This paper presents a mechanism that dynamically inserts prefetch instructions into frequently executed hot traces. Hot traces are dynamically analyzed to identify delinquent loads and the appropriate prefetch distance for those loads. Those prefetches are then inserted into the hot trace. The low overhead of the event-driven dynamic optimization system allows the optimizer to continuously monitor the performance of the software prefetches. This is done to find an accurate and stable prefetch distance and to adapt to changes in program behavior using what we call Self-Repairing prefetching. Relative to the baseline hardware stride prefetching, we find a total 23 % improvement when we use the self-repairing mechanism to adaptively discover the best prefetch distance for each load, which is 12 % better performance than dynamic prefetching techniques without adaptive repairing.
Predictor virtualization
- In Proceedings of the 13th International conference on Architectural Support for Programming Languages and Operating Systems
, 2008
"... Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, ..."
Abstract
-
Cited by 13 (7 self)
- Add to MetaCart
(Show Context)
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Scenario based optimization: A framework for statically enabling online optimizations
- In CGO ’09: Proceedings of the 2009 International Symposium on Code Generation and Optimization
, 2009
"... Abstract—Online optimization allows the continuous restructuring and adaptation of an executing application using live information about its execution environment. The further advancement of performance monitoring hardware presents new opportunities for online optimization techniques. While managed ..."
Abstract
-
Cited by 12 (3 self)
- Add to MetaCart
(Show Context)
Abstract—Online optimization allows the continuous restructuring and adaptation of an executing application using live information about its execution environment. The further advancement of performance monitoring hardware presents new opportunities for online optimization techniques. While managed runtime environments lend themselves nicely to online and dynamic optimizations, such techniques remain difficult to successfully achieve at the binary level. Binary level Dynamic optimizers introduce virtual layers which at the binary level produces overhead that is often prohibitive. Another challenge comes from the lack of source level information that can significantly aid optimization. In this paper we present a new static/dynamic collaborative approach to online optimization that takes advantage of the strength of static optimization and the adaptive nature of dynamic optimization techniques. We call this hybrid optimization framework Scenario Based Optimization (SBO). Statically we multiversion and specialize functions for different dynamic scenarios that can be identified by monitoring microarchitectural events. Using these events to infer the current scenario, we dynamically reroute execution to the relevant code tuned to that scenario. We have implemented our static SBO infrastructure in GCC 4.3.1 and designed our Dynamic Introspection Engine using the Perfmon2 infrastructure. To demonstrate the effectiveness of our Scenario Based Optimization framework we have designed an SBO optimization we call the Online Aiding and Abetting of Aggressive Optimizations (OAAAO). Using SPEC2006 this optimization shows a speedup of 7 % to 10 % for a number of benchmarks and in our best case (h264ref) a 17 % speedup over native execution compiled at the-O2 optimization level. I.
Software Data Spreading: Leveraging Distributed Caches to Improve Single Thread Performance
"... Single thread performance remains an important consideration even for multicore, multiprocessor systems. As a result, techniques for improving single thread performance using multiple cores have received considerable attention. This work describes a technique, software data spreading, that leverages ..."
Abstract
-
Cited by 11 (5 self)
- Add to MetaCart
Single thread performance remains an important consideration even for multicore, multiprocessor systems. As a result, techniques for improving single thread performance using multiple cores have received considerable attention. This work describes a technique, software data spreading, that leverages the cache capacity of extra cores and extra sockets rather than their computational resources. Software data spreading is a software-only technique that uses compiler-directed thread migration to aggregate cache capacity across cores and chips and improve performance. This paper describes an automated scheme that applies data spreading to various types of loops. Experiments with a set of SPEC2000, SPEC2006, NAS, and microbenchmark workloads show that data spreading can provide speedup of over 2, averaging 17 % for the SPEC and NAS applications on two systems. In addition, despite using more cores for the same computation, data spreading actually saves power since it reduces access to DRAM. D.3.4 [Programming Lan-Categories and Subject Descriptors
Data-Triggered Threads: Eliminating Redundant Computation
"... This paper introduces the concept of data-triggered threads. Unlike threads in parallel programs in conventional programming models, these threads are initiated on a change to a memory location. This enables increased parallelism and the elimination of redundant, unnecessary computation. This paper ..."
Abstract
-
Cited by 9 (3 self)
- Add to MetaCart
(Show Context)
This paper introduces the concept of data-triggered threads. Unlike threads in parallel programs in conventional programming models, these threads are initiated on a change to a memory location. This enables increased parallelism and the elimination of redundant, unnecessary computation. This paper focuses primarily on the latter. It is shown that 78 % of all loads fetch redundant data, leading to a high incidence of redundant computation. By expressing computation through data-triggered threads, that computation is executed once when the data changes, and is skipped whenever the data does not change. The set of C SPEC benchmarks show performance speedup of up to 5.9X, and averaging 46%. 1.
Performance driven data cache prefetching in a dynamic software optimization system
- In Proceeding of the 2007 ACM International Conference on Supercomputing (ICS’07
, 2007
"... Software or hardware data cache prefetching is an efficient way to hide cache miss latency. However effectiveness of the issued prefetches have to be monitored in order to maximize their positive impact while minimizing their negative impact on performance. In previous proposed dynamic frameworks, t ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
(Show Context)
Software or hardware data cache prefetching is an efficient way to hide cache miss latency. However effectiveness of the issued prefetches have to be monitored in order to maximize their positive impact while minimizing their negative impact on performance. In previous proposed dynamic frameworks, the monitoring scheme is either achieved using processor performance counters or using specific hardware. In this work, we propose a prefetching strategy which does not use any specific hardware component or processor performance counter. Our dynamic framework wants to be portable on any modern processor architecture providing at least a prefetch instruction. Opportunity and effectiveness of prefetching loads is simply guided by the time spent to effectively obtain the data. Every load of a program is monitored periodically and can be either associated to a dynamically inserted prefetch instruction or not. It can be associated to a prefetch instruction at some disjoint periods of the whole program run as soon as it is efficient. Our framework has been implemented for Itanium-2 machines. It involves several dynamic instrumentations of the binary code whose overhead is limited to only 4 % on average. On a large set of benchmarks, our system is able to speed up some programs by 2%-143%.
Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators
- Proc. Int’l Conf. Hardware/Software Codesign and System Synthesis (CODES/ISSS 07), ACM Press, 2007
"... We present a dynamic optimization technique, thread warping, that uses a single processor on a multiprocessor system to dynamically synthesize threads into custom accelerator circuits on FPGAs (field-programmable gate arrays). Building on dynamic synthesis for single-processor single-thread systems, ..."
Abstract
-
Cited by 6 (1 self)
- Add to MetaCart
(Show Context)
We present a dynamic optimization technique, thread warping, that uses a single processor on a multiprocessor system to dynamically synthesize threads into custom accelerator circuits on FPGAs (field-programmable gate arrays). Building on dynamic synthesis for single-processor single-thread systems, known as warp processing, thread warping improves performances of multiprocessor systems by speeding up individual threads and by allowing more threads to execute concurrently. Furthermore, thread warping maintains the important separation of function from architecture, enabling portability of applications to architectures with different quantities of microprocessors and FPGA—an advantage not shared by static compilation/synthesis approaches. We introduce a framework of architecture, CAD tools, and operating system that together support thread warping. We summarize experiments on an extensive architectural simulation framework we developed, showing application speedups of 4x to 502x, averaging 130x compared to a multiprocessor system having four ARM11 microprocessors, for eight benchmark applications. Even compared to a 64-processor system, thread warping achieves 11x speedup.
Protean Code: Achieving Near-free Online Code Transformations for Warehouse Scale Computers
"... Abstract—Rampant dynamism due to load fluctuations, co-runner changes, and varying levels of interference poses a threat to application quality of service (QoS) and has limited our ability to allow co-locations in modern warehouse scale computers (WSCs). Instruction set features such as the non-temp ..."
Abstract
-
Cited by 6 (2 self)
- Add to MetaCart
Abstract—Rampant dynamism due to load fluctuations, co-runner changes, and varying levels of interference poses a threat to application quality of service (QoS) and has limited our ability to allow co-locations in modern warehouse scale computers (WSCs). Instruction set features such as the non-temporal memory access hints found in modern ISAs (both ARM and x86) may be useful in mitigating these effects. However, despite the challenge of this dynamism and the availability of an instruction set mechanism that might help address the problem, a key capability missing in the system software stack in modern WSCs is the ability to dynamically transform (and re-transform) the executing application code to apply these instruction set features when necessary. In this work we introduce protean code, a novel approach for enacting arbitrary compiler transformations at runtime for native programs running on commodity hardware with negligible (<1%) overhead. The fundamental insight behind the underlying mechanism of protean code is that, instead of maintaining full control throughout the program’s execution as with traditional dynamic optimizers, protean code allows the original binary to execute continuously and diverts control flow only at a set of virtualized points, allowing rapid and seamless rerouting to the new code variants. In addition, the protean code compiler embeds IR with high-level semantic information into the program, empowering the dynamic compiler to perform rich analysis and transformations online with little overhead. Using a fully functional protean code compiler and runtime built on LLVM, we design PC3D, Protean Code for Cache Contention in Datacenters. PC3D dynamically employs non-temporal access hints to achieve utilization improvements of up to 2.8x (1.5x on average) higher than state-of-the-art contention mitigation runtime techniques at a QoS target of 98%. Keywords-compiler; dynamic compiler; optimization; cache; resource sharing; datacenter; warehouse scale computer I.