Results 1 - 10 of 30
Automatic thread extraction with decoupled software pipelining
- In Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture
, 2005
"... {ottoni, ram, astoler, august}@princeton.edu Abstract Until recently, a steadily rising clock rate and otheruniprocessor microarchitectural improvements could be relied upon to consistently deliver increasing performance fora wide range of applications. Current difficulties in maintaining this trend ..."
Abstract
-
Cited by 101 (18 self)
- Add to MetaCart
(Show Context)
Until recently, a steadily rising clock rate and other uniprocessor microarchitectural improvements could be relied upon to consistently deliver increasing performance for a wide range of applications. Current difficulties in maintaining this trend have led microprocessor manufacturers to add value by incorporating multiple processors on a chip. Unfortunately, since decades of compiler research have not succeeded in delivering automatic threading for prevalent code properties, this approach demonstrates no improvement for a large class of existing codes. To find useful work for chip multiprocessors, we propose an automatic approach to thread extraction, called Decoupled Software Pipelining (DSWP). DSWP exploits the fine-grained pipeline parallelism lurking in most applications to extract long-running, concurrently executing threads. Use of the non-speculative and truly decoupled threads produced by DSWP can increase execution efficiency and provide significant latency tolerance, mitigating design complexity by reducing inter-core communication and per-core resource requirements. Using our initial fully automatic compiler implementation and a validated processor model, we prove the concept by demonstrating significant gains for dual-core chip multiprocessor models running a variety of codes. We then explore simple opportunities missed by our initial compiler implementation which suggest a promising future for this approach.
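To make the pipeline-parallel decomposition concrete, here is a minimal sketch in plain C with pthreads (an illustration, not the paper's compiler output; the locked ring buffer and all identifiers are assumptions standing in for DSWP's synchronization array). One thread keeps the loop's cross-iteration dependence, a linked-list traversal, while a second thread overlaps the per-node work:

    /* Hypothetical two-stage pipeline in the spirit of DSWP.  Stage 1
     * performs the pointer chasing; stage 2 consumes each payload
     * through a decoupling queue.  The queue is a simple locked ring
     * buffer invented for this sketch. */
    #include <pthread.h>
    #include <stdio.h>

    #define QSIZE 64

    typedef struct node { int payload; struct node *next; } node;

    static int queue[QSIZE];
    static int head = 0, tail = 0, done = 0;
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t nonfull = PTHREAD_COND_INITIALIZER;

    static void enqueue(int v) {
        pthread_mutex_lock(&m);
        while ((tail + 1) % QSIZE == head)
            pthread_cond_wait(&nonfull, &m);
        queue[tail] = v;
        tail = (tail + 1) % QSIZE;
        pthread_cond_signal(&nonempty);
        pthread_mutex_unlock(&m);
    }

    static int dequeue(int *v) {
        pthread_mutex_lock(&m);
        while (head == tail && !done)
            pthread_cond_wait(&nonempty, &m);
        if (head == tail) { pthread_mutex_unlock(&m); return 0; }
        *v = queue[head];
        head = (head + 1) % QSIZE;
        pthread_cond_signal(&nonfull);
        pthread_mutex_unlock(&m);
        return 1;
    }

    /* Stage 1: the traversal, the part that must stay sequential. */
    static void *produce(void *arg) {
        for (node *n = arg; n; n = n->next)
            enqueue(n->payload);
        pthread_mutex_lock(&m);
        done = 1;
        pthread_cond_broadcast(&nonempty);
        pthread_mutex_unlock(&m);
        return NULL;
    }

    /* Stage 2: per-node work, overlapped with the traversal above. */
    static void *consume(void *arg) {
        long sum = 0;
        int v;
        while (dequeue(&v))
            sum += v;               /* stand-in for real work */
        printf("sum = %ld\n", sum);
        (void)arg;
        return NULL;
    }

    int main(void) {
        node nodes[100];
        for (int i = 0; i < 100; i++) {
            nodes[i].payload = i;
            nodes[i].next = (i < 99) ? &nodes[i + 1] : NULL;
        }
        pthread_t p, c;
        pthread_create(&p, NULL, produce, nodes);
        pthread_create(&c, NULL, consume, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }

The decoupling is the point: the producer can run many iterations ahead of the consumer, so a cache miss in either stage stalls only that stage rather than the whole loop.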
Revisiting the sequential programming model for multi-core
- In Proceedings of the 40th Annual ACM/IEEE International Symposium on Microarchitecture
, 2007
"... Single-threaded programming is already considered a complicated task. The move to multi-threaded programming only increases the complexity and cost involved in software development due to rewriting legacy code, training of the programmer, increased debugging of the program, and efforts to avoid race ..."
Abstract
-
Cited by 73 (11 self)
- Add to MetaCart
(Show Context)
Single-threaded programming is already considered a complicated task. The move to multi-threaded programming only increases the complexity and cost involved in software development due to rewriting legacy code, training of the programmer, increased debugging of the program, and efforts to avoid race conditions, deadlocks, and other problems associated with parallel programming. To address these costs, other approaches, such as automatic thread extraction, have been explored. Unfortunately, the amount of parallelism that has been automatically extracted is generally insufficient to keep many cores busy. This paper argues that this lack of parallelism is not an intrinsic limitation of the sequential programming model, but rather occurs for two reasons. First, there exists no framework for automatic thread extraction that brings together key existing state-of-the-art compiler and hardware techniques. This paper shows that such a framework can yield scalable parallelization on several SPEC CINT2000 benchmarks. Second, existing sequential programming languages force programmers to define a single legal program outcome, rather than allowing for a range of legal outcomes. This paper shows that natural extensions to the sequential programming model enable parallelization for the remainder of the SPEC CINT2000 suite. Our experience demonstrates that, by changing only 60 source code lines, all of the C benchmarks in the SPEC CINT2000 suite were parallelizable by automatic thread extraction. This process, constrained by the limits of modern optimizing compilers, yielded a speedup of 454% on these applications.
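As a hedged illustration of what "a range of legal outcomes" could look like in source code (the pragma name and syntax below are invented for this sketch and are not the paper's actual extension): marking a function whose calls may be reordered tells the compiler that any interleaving of its side effects is a correct result.

    /* Hypothetical annotation: calls to rand_next() may be reordered
     * across iterations -- every interleaving is a legal outcome.
     * The #pragma is invented for illustration. */
    #pragma commutative(rand_next)
    unsigned rand_next(unsigned *seed) {
        *seed = *seed * 1103515245u + 12345u;  /* cross-call dependence */
        return (*seed >> 16) & 0x7fff;
    }

    void fill(unsigned *out, int n, unsigned *seed) {
        /* Without the annotation, the seed update serializes this
         * loop; with it, iterations may run in any order and the
         * compiler may still call the result correct. */
        for (int i = 0; i < n; i++)
            out[i] = rand_next(seed);
    }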
Superblock formation using static program analysis
- In Proceedings of the 26th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-26)
, 1993
"... Compile-time code transformations which expose instruction-level parallelism (ILP) typically take into account the constraints imposed byallexecution scenarios in the program. However, there areaddi tional opportunities to increase ILP along some execution sequences if the constraints from alternati ..."
Abstract
-
Cited by 51 (9 self)
- Add to MetaCart
(Show Context)
Compile-time code transformations which expose instruction-level parallelism (ILP) typically take into account the constraints imposed by all execution scenarios in the program. However, there are additional opportunities to increase ILP along some execution sequences if the constraints from alternative execution sequences can be ignored. Traditionally, profile information has been used to identify important execution sequences for aggressive compiler optimization and scheduling. This paper presents a set of static program analysis heuristics used in the IMPACT compiler to identify execution sequences for aggressive optimization. We show that the static program analysis heuristics identify execution sequences without hazardous conditions that tend to prohibit compiler optimizations. As a result, the static program analysis approach often achieves optimization results comparable to profile information in spite of its inferior branch prediction accuracies. This observation makes a strong case for using static program analysis with or without profile information to facilitate aggressive compiler optimization and scheduling.
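A sketch of the kind of hazard-avoiding heuristic the abstract describes (the structure and weights are invented for illustration, not IMPACT's actual heuristics): when growing a superblock, prefer the successor block that is free of constructs that would block later optimization.

    /* Pick the successor to extend the trace with, or NULL to stop. */
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct block {
        bool has_call;            /* unknown call: may clobber memory  */
        bool has_ambiguous_store; /* store the compiler cannot analyze */
        bool exits_loop;          /* leaving the loop is usually cold  */
        struct block *succ[2];
        int nsucc;
    } block;

    static block *pick_successor(const block *b) {
        block *best = NULL;
        int best_score = -1;
        for (int i = 0; i < b->nsucc; i++) {
            block *s = b->succ[i];
            int score = 100;                     /* start optimistic  */
            if (s->has_call)            score -= 60;
            if (s->has_ambiguous_store) score -= 30;
            if (s->exits_loop)          score -= 50;
            if (score > best_score) { best_score = score; best = s; }
        }
        return best_score > 0 ? best : NULL;     /* all hazardous: end */
    }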
The importance of prepass code scheduling for superscalar and superpipelined processors
- IEEE Trans. Comput.
, 1995
"... ..."
(Show Context)
Unrolling-Based Optimizations for Modulo Scheduling
- In Proceedings of the 28th International Symposium on Microarchitecture
, 1995
"... Modulo scheduling is a method for overlapping successive iterations of a loop in order to find sufficient instruction-level parallelism to fully utilize high-issue-rate processors. The achieved throughput modulo scheduled loop depends on the resource requirements, the dependence pattern, and the reg ..."
Abstract
-
Cited by 34 (3 self)
- Add to MetaCart
(Show Context)
Modulo scheduling is a method for overlapping successive iterations of a loop in order to find sufficient instruction-level parallelism to fully utilize high-issue-rate processors. The achieved throughput of a modulo scheduled loop depends on the resource requirements, the dependence pattern, and the register requirements of the computation in the loop body. Traditionally, unrolling followed by acyclic scheduling of the unrolled body has been an alternative to modulo scheduling. However, there are benefits to unrolling even if the loop is to be modulo scheduled. Unrolling can improve the throughput by allowing a smaller non-integral effective initiation interval to be achieved. After unrolling, optimizations can be applied to the loop that reduce both the resource requirements and the height of the critical paths. Together, unrolling and unrolling-based optimizations can enable the completion of multiple iterations per cycle in some cases. This paper describes the benefits of unrolling and ...
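A worked example of the non-integral effective initiation interval (all numbers invented for illustration): suppose the loop body issues 3 memory operations on a machine with 2 memory ports. Since the initiation interval (II) must be an integer number of cycles,

    \[ \mathrm{ResMII} = \lceil 3/2 \rceil = 2 \ \text{cycles per iteration.} \]

Unrolling by 2 doubles the body to 6 memory operations but amortizes the rounding:

    \[ \lceil 6/2 \rceil = 3 \ \text{cycles per 2 iterations} = 1.5 \ \text{cycles per iteration,} \]

an effective II of 1.5 that no integral-II schedule of the rolled loop can reach.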
Data Preload for Superscalar and VLIW Processors
, 1993
"... ... decreased the average number of clock cycles per instruction. As a result, each execution cycle has become more significant to overall system performance. To maximize the effectiveness of each cycle, one must expose instruction-level parallelism and employ memory latency tolerant techniques. How ..."
Abstract
-
Cited by 23 (1 self)
- Add to MetaCart
... decreased the average number of clock cycles per instruction. As a result, each execution cycle has become more significant to overall system performance. To maximize the effectiveness of each cycle, one must expose instruction-level parallelism and employ memory latency tolerant techniques. However, without special architecture support, a superscalar compiler cannot effectively accomplish these two tasks in the presence of control and memory access dependences. Preloading is a class of architectural support which allows memory reads to be performed early in spite of potential violation of control and memory access dependences. With preload support, a superscalar compiler can perform more aggressive code reordering to provide increased tolerance of cache and memory access latencies and to increase instruction-level parallelism. This thesis discusses the architectural features and compiler support required to effectively utilize preload instructions to increase the overall system performance. The first hardware support is preload register update, a data preload support for load scheduling to reduce first-level cache hit latency. Preload register update keeps the load destination
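A hedged sketch of the reordering preload support permits (the __preload macro below is invented to model the semantics in plain C; it is not a real intrinsic, and on real hardware it would be an early, non-faulting load): the compiler hoists a load above a possibly-aliasing store and repairs the value if the dependence actually existed.

    /* Models an architectural preload with an ordinary load so the
     * sketch compiles; the name and semantics are assumptions. */
    #define __preload(p) (*(p))

    int f(int *p, int *q) {
        int v = __preload(p);  /* issued early to hide load latency   */
        *q = 42;               /* may or may not alias *p             */
        if (p == q)
            v = 42;            /* recover if the store really aliased */
        return v + 1;          /* same result as the unreordered code */
    }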
Optimization of machine descriptions for efficient use
- In Proceedings of the 29th International Symposium on Microarchitecture
, 1996
"... A machine description facility allows compiler writers to specify machine execution constraints to the optimization and scheduling phases of an instruction-level parallelism (ILP) optimizing compiler. The machine description (MDES) facility should support quick development and easy maintenance of ma ..."
Abstract
-
Cited by 22 (4 self)
- Add to MetaCart
A machine description facility allows compiler writers to specify machine execution constraints to the optimization and scheduling phases of an instruction-level parallelism (ILP) optimizing compiler. The machine description (MDES) facility should support quick development and easy maintenance of machine execution constraint descriptions by compiler writers. However, the facility should also allow compact representation and efficient usage of the MDES during compilation. This paper advocates a model that allows compiler writers to develop the MDES in a high-level language, which is then translated into a low-level representation for efficient use by the compiler. The discrepancy between the requirements of the high-level language and the low-level representation is reconciled with a collection of transformations that derive efficient low-level representations from the easy-to-understand high-level descriptions. In order to support these transformations, a novel approach to representing machine execution constraints has been developed. Detailed and precise descriptions of the execution constraints for the HP PA7100, Intel Pentium, Sun Super-SPARC, and AMD-K5 processors are analyzed to show the advantage of using this new representation. The results show that performing these transformations and utilizing the new representation allow easy-to-maintain detailed descriptions written in high-level languages to be efficiently used by ILP-optimizing compilers.
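A sketch of the kind of low-level form such a high-level MDES might be translated into (the layout and names are assumptions for illustration, not the IMPACT MDES format): per-opcode reservation tables of resource bit vectors, checked against a scoreboard during scheduling.

    #include <stdint.h>
    #include <stdbool.h>

    #define MAX_LATENCY 8

    typedef struct {
        uint32_t used[MAX_LATENCY]; /* bit i set: resource i busy then */
        int cycles;                 /* cycles the table spans          */
    } resv_table;

    /* Can `op` issue at cycle `t` given resources already committed? */
    static bool fits(const uint32_t sched[], int horizon,
                     const resv_table *op, int t) {
        for (int c = 0; c < op->cycles; c++) {
            if (t + c >= horizon) return false;
            if (sched[t + c] & op->used[c]) return false; /* conflict */
        }
        return true;
    }

    /* Commit the reservation once an issue slot is found. */
    static void reserve(uint32_t sched[], const resv_table *op, int t) {
        for (int c = 0; c < op->cycles; c++)
            sched[t + c] |= op->used[c];
    }

Bit-vector AND makes the common query, "does this instruction conflict with what is already scheduled at cycle t?", a handful of machine operations, which is the efficiency concern the abstract raises.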
Improving Instruction-level Parallelism by Loop Unrolling and Dynamic Memory Disambiguation
- In Proceedings of the 28th Annual International Symposium on Microarchitecture
, 1995
"... Exploitation of instruction-level parallelism is an effective mechanism for improving the performance of modern super-scalar/VLIW processors. Various software techniques can be applied to increase instruction-level parallelism. This paper describes and evaluates a software technique, dynamic memory ..."
Abstract
-
Cited by 17 (1 self)
- Add to MetaCart
(Show Context)
Exploitation of instruction-level parallelism is an effective mechanism for improving the performance of modern superscalar/VLIW processors. Various software techniques can be applied to increase instruction-level parallelism. This paper describes and evaluates a software technique, dynamic memory disambiguation, that permits loops containing loads and stores to be scheduled more aggressively, thereby exposing more instruction-level parallelism. The results of our evaluation show that when dynamic memory disambiguation is applied in conjunction with loop unrolling, register renaming, and static memory disambiguation, the ILP of memory-intensive benchmarks can be increased by as much as 300 percent over loops where dynamic memory disambiguation is not performed. Our measurements also indicate that for the programs that benefit the most from these optimizations, the register usage does not exceed the number of registers on most high-performance processors. Keywords: loop unrolling, dyn...
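The core transformation is easy to show in source form (written out by hand here for illustration; the paper applies it in the compiler, and the function and names below are assumptions): a runtime overlap test selects between an aggressively schedulable loop version and a conservative fallback.

    #include <stddef.h>
    #include <stdint.h>

    /* Fast path: restrict tells the compiler the arrays are disjoint,
     * so it may unroll, reorder loads past stores, and pipeline. */
    static void daxpy_disjoint(double *restrict a, const double *restrict b,
                               double k, size_t n) {
        for (size_t i = 0; i < n; i++)
            a[i] += k * b[i];
    }

    void daxpy(double *a, const double *b, double k, size_t n) {
        /* Dynamic disambiguation: one overlap check at run time. */
        if ((uintptr_t)(a + n) <= (uintptr_t)b ||
            (uintptr_t)(b + n) <= (uintptr_t)a) {
            daxpy_disjoint(a, b, k, n);
        } else {
            for (size_t i = 0; i < n; i++)  /* conservative fallback */
                a[i] += k * b[i];
        }
    }

The check costs two compares per call; the payoff is that the hot version can be scheduled as if no load/store in the loop ever conflicts.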
Memory Bank Disambiguation using Modulo Unrolling for Raw Machines
- In Proceedings of the ACM/IEEE Fifth International Conference on High Performance Computing (HiPC)
, 1998
"... The Raw approach of replicated processor tiles interconnected with a fast static mesh network provides a simple, scalable design that maximizes the resources available in next generation processor technology. In Raw ..."
Abstract
-
Cited by 15 (5 self)
- Add to MetaCart
The Raw approach of replicated processor tiles interconnected with a fast static mesh network provides a simple, scalable design that maximizes the resources available in next generation processor technology. In Raw
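The abstract is truncated, but the technique the title names can be sketched (assuming, for illustration, an array low-order interleaved across 4 banks so that a[i] resides in bank i % 4, with the base aligned to bank 0): unrolling by the number of banks makes the bank of every memory reference in the loop body a compile-time constant, so each access can be routed to a fixed tile.

    /* Modulo unrolling for static bank disambiguation (illustrative).
     * Rolled, the bank of a[i] is unknown at compile time; unrolled
     * by 4, each reference in the body always hits the same bank. */
    void scale(int *a, int n) {
        int i;
        for (i = 0; i + 3 < n; i += 4) {
            a[i]     *= 2;   /* always bank 0 (base bank-0 aligned) */
            a[i + 1] *= 2;   /* always bank 1 */
            a[i + 2] *= 2;   /* always bank 2 */
            a[i + 3] *= 2;   /* always bank 3 */
        }
        for (; i < n; i++)   /* remainder iterations */
            a[i] *= 2;
    }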
Using Profile Information to Assist Advanced Compiler Optimization and Scheduling
"... Compilers for superscalar and VLIW processors must expose sufficient instruction-level parallelism in order to achieve high performance. Compiletime code transformations which expose instruction-level parallelism typically take into account the constraints imposed by all execution scenarios in the p ..."
Abstract
-
Cited by 13 (2 self)
- Add to MetaCart
Compilers for superscalar and VLIW processors must expose sufficient instruction-level parallelism in order to achieve high performance. Compile-time code transformations which expose instruction-level parallelism typically take into account the constraints imposed by all execution scenarios in the program. However, there are additional opportunities to increase instruction-level parallelism along the frequent execution scenario at the expense of the less frequent execution sequences. Profile information identifies these important execution sequences in a program. In this paper, two major categories of profile information are studied: control-flow and memory-dependence. Profile-based transformations have been incorporated into the IMPACT compiler. These transformations include global optimization, acyclic global scheduling, and software pipelining. The effectiveness of these profile-based techniques is evaluated for a range of superscalar and VLIW processors.
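A small illustration of control-flow profile use (the counts are invented): when profiling shows a branch is almost never taken, the compiler can form a superblock along the hot path and move the cold code out of the scheduled trace.

    /* Suppose the profile says the error branch fires 2 times in
     * 1,000,000 iterations.  The compiler then builds a superblock
     * from the hot path below and sinks the cold return out of line,
     * freeing the loop body for unrolling and aggressive scheduling. */
    int process(const int *buf, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            if (buf[i] < 0)      /* profile: ~0.0002% taken -> cold */
                return -1;       /* excluded from the superblock    */
            sum += buf[i];       /* hot path: unrolled, scheduled   */
        }
        return sum;
    }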