Revisiting the sequential programming model for multi-core
In Proceedings of the 40th Annual ACM/IEEE International Symposium on Microarchitecture, 2007
Cited by 33 (4 self)
Single-threaded programming is already considered a complicated task. The move to multi-threaded programming only increases the complexity and cost involved in software development due to rewriting legacy code, training of the programmer, increased debugging of the program, and efforts to avoid race conditions, deadlocks, and other problems associated with parallel programming. To address these costs, other approaches, such as automatic thread extraction, have been explored. Unfortunately, the amount of parallelism that has been automatically extracted is generally insufficient to keep many cores busy. This paper argues that this lack of parallelism is not an intrinsic limitation of the sequential programming model, but rather occurs for two reasons. First, there exists no framework for automatic thread extraction that brings together key existing state-of-the-art compiler and hardware techniques. This paper shows that such a framework can yield scalable parallelization on several SPEC CINT2000 benchmarks. Second, existing sequential programming languages force programmers to define a single legal program outcome, rather than allowing for a range of legal outcomes. This paper shows that natural extensions to the sequential programming model enable parallelization for the remainder of the SPEC CINT2000 suite. Our experience demonstrates that, by changing only 60 source code lines, all of the C benchmarks in the SPEC CINT2000 suite were parallelizable by automatic thread extraction. This process, constrained by the limits of modern optimizing compilers, yielded a speedup of 454% on these applications.
Uncovering hidden loop level parallelism in sequential applications
In Proc. of the 14th International Symposium on High-Performance Computer Architecture, 2008
Cited by 18 (4 self)
As multicore systems become the dominant mainstream computing technology, one of the most difficult challenges the industry faces is the software. Applications with large amounts of explicit thread-level parallelism naturally scale performance with the number of cores, but single-threaded applications realize little to no gains with additional cores. One solution to this problem is automatic parallelization that frees the programmer from the difficult task of parallel programming and offers hope for handling the vast amount of legacy single-threaded software. There is a long history of automatic parallelization for scientific applications, but the techniques have generally failed in the context of general-purpose software. Thread-level speculation overcomes the problem of memory dependence analysis by speculating unlikely dependences that serialize execution. However, this approach has led to only modest performance gains. In this paper, we take another look at exploiting loop-level parallelism in single-threaded applications. We show that substantial amounts of loop-level parallelism are available in general-purpose applications, but it lurks beneath the surface and is often obfuscated by a small number of data and control dependences. We adapt and extend several code transformations from the instruction-level and scientific parallelization communities to uncover the hidden parallelism. Our results show that 61% of the dynamic execution of studied benchmarks can be parallelized with our techniques compared to 27% using traditional thread-level speculation techniques, resulting in a speedup of 1.84 on a four-core system compared to 1.41 without transformations.
Program demultiplexing: Data-flow based speculative parallelization of methods in sequential programs
In ISCA’06: Proceedings of the 33rd International Symposium on Computer Architecture, 2006
Cited by 14 (0 self)
We present Program Demultiplexing (PD), an execution paradigm that creates concurrency in sequential programs by "demultiplexing" methods (functions or subroutines). Call sites of a demultiplexed method in the program are associated with handlers that allow the method to be separated from the sequential program and executed on an auxiliary processor. The demultiplexed execution of a method (and its handler) is speculative and occurs when the inputs of the method are (speculatively) available, which is typically far in advance of when the method is actually called in the sequential execution. A trigger, composed of predicates that are based on program counters and memory write addresses, launches the speculative execution of the method on another processor. Our implementation of PD is based on a full-system execution-based chip multi-processor simulator with software to generate triggers and handlers from an x86 program binary. We evaluate eight integer benchmarks from the SPEC2000 suite (programs written in C with no explicit concurrency and/or motivation to create concurrency) and achieve a harmonic mean speedup of 1.8x with our implementation of PD.
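The idea of running a method early and validating it at the real call site can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's mechanism: the class `SpeculativeCall` and its `trigger` and `call` methods are hypothetical names, the "trigger" is an explicit call rather than a predicate on program counters and write addresses, and everything runs on one thread instead of an auxiliary processor.

```python
class SpeculativeCall:
    """Toy model: execute a function early, reuse the result if inputs still match."""

    def __init__(self, fn):
        self.fn = fn
        self.cached_inputs = None   # inputs seen by the early (speculative) run
        self.cached_result = None   # result produced by the early run
        self.hits = 0               # speculation validated at the call site
        self.misses = 0             # speculation discarded, function re-executed

    def trigger(self, *inputs):
        # Hypothetical trigger point: the inputs are believed to be final,
        # so the method is executed ahead of its sequential call site.
        self.cached_inputs = inputs
        self.cached_result = self.fn(*inputs)

    def call(self, *inputs):
        # The real call site: validate the speculation against actual inputs.
        if self.cached_inputs == inputs:
            self.hits += 1
            return self.cached_result
        self.misses += 1
        return self.fn(*inputs)     # fall back to normal sequential execution
```

The sketch assumes the function is pure with respect to its arguments; a real system would also have to track memory state read and written by the method.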
Loop Selection for Thread-Level Speculation
In Proceedings of the 18th International Workshop on Languages and Compilers for Parallel Computing, 2005
Cited by 12 (6 self)
Thread-level speculation (TLS) allows potentially dependent threads to speculatively execute in parallel, thus making it easier for the compiler to extract parallel threads. However, the high cost associated with unbalanced load, failed speculation, and inter-thread value communication makes it difficult to obtain the desired performance unless the speculative threads are carefully chosen. In this paper, we focus on extracting parallel threads from loops in general-purpose applications because loops, with their regular structures and significant coverage on execution time, are ideal candidates for extracting parallel threads. General-purpose applications, however, usually contain a large number of nested loops with unpredictable parallel performance and dynamic behavior, thus making it difficult to decide which set of loops should be parallelized to improve overall program performance. Our proposed loop selection algorithm addresses all these difficulties. We have found that (i) with the aid of profiling information, compiler analyses can achieve a reasonably accurate estimation of the performance of parallel execution, and that (ii) different invocations of a loop may behave differently, and exploiting this dynamic behavior can further improve performance. With a judicious choice of loops, we can improve the overall program performance of SPEC2000 integer benchmarks by as much as 20%.
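One way to picture the selection problem is as a choice over a loop-nesting tree. The sketch below is not the authors' algorithm; it assumes, for illustration only, that each loop carries a single hypothetical `benefit` estimate (for example, cycles saved according to a profile) and that a loop and its nested loops cannot both be parallelized. Under those assumptions a bottom-up pass picks the better of "this loop" and "the best choice among its children".

```python
from dataclasses import dataclass, field


@dataclass
class Loop:
    name: str
    benefit: float                      # hypothetical estimated gain if parallelized
    children: list = field(default_factory=list)


def select(loop):
    """Return (best_total_benefit, chosen_loop_names) for the nest rooted at loop."""
    child_total, child_names = 0.0, []
    for child in loop.children:
        gain, names = select(child)
        child_total += gain
        child_names += names
    # Assumed constraint: choosing this loop excludes every loop nested inside it.
    if loop.benefit > child_total:
        return loop.benefit, [loop.name]
    return child_total, child_names
```

A loop with a non-positive estimate and no profitable inner loops is simply left sequential, since the empty choice has benefit zero.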
Commutativity Analysis for Software Parallelization: Letting Program Transformations See the Big Picture
In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2009
Cited by 12 (0 self)
Extracting performance from many-core architectures requires software engineers to create multi-threaded applications, which significantly complicates the already daunting task of software development. One solution to this problem is automatic compile-time parallelization, which can ease the burden on software developers in many situations. Clearly, automatic parallelization in its present form is not suitable for many application domains and new compiler analyses are needed to address its shortcomings. In this paper, we present one such analysis: a new approach for detecting commutative functions. Commutative functions are sections of code that can be executed in any order without affecting the outcome of the application, e.g., inserting elements into a set. Previous research on this topic had one significant limitation, in that commutative functions had to produce identical memory layouts. This prevented previous techniques from detecting functions like malloc, which may return different pointers depending on the order in which it is called, but these differing results do not affect the overall output of the application. Our new commutativity analysis correctly identifies these situations to better facilitate automatic parallelization. We demonstrate that this analysis can automatically extract significant amounts of parallelism from many applications, and where it is ineffective it can provide software developers a useful list of functions that may be commutative provided semantic program changes that are not automatable.
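The two situations described in this abstract can be shown with a small sketch. It is an illustration only, not the paper's analysis: `build_set`, `BumpAllocator`, and `run` are hypothetical names, and the bump allocator is a toy stand-in for malloc. Set insertion gives the same result in every order, while the allocator returns different addresses depending on call order even though the observable output does not change.

```python
import itertools


def build_set(items):
    """Insert items one at a time; the final set does not depend on the order."""
    result = set()
    for item in items:
        result.add(item)
    return result


class BumpAllocator:
    """Toy stand-in for malloc: hands out increasing addresses."""

    def __init__(self):
        self.next_addr = 0x1000

    def alloc(self, size):
        addr = self.next_addr
        self.next_addr += size
        return addr


def run(order):
    """Allocate one buffer per request in the given order.

    Returns (addresses, output): the addresses depend on the order,
    the observable output (names and sizes) does not.
    """
    heap = BumpAllocator()
    sizes = {"a": 16, "b": 32, "c": 8}
    addrs = {name: heap.alloc(sizes[name]) for name in order}
    output = sorted((name, sizes[name]) for name in addrs)
    return addrs, output
```

Comparing `run(["a", "b", "c"])` with `run(["c", "b", "a"])` gives different address maps but the same output, which is the kind of difference in memory layout that the abstract says earlier techniques could not tolerate.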
Spice: Speculative Parallel Iteration Chunk Execution
Cited by 6 (1 self)
The recent trend in the processor industry of packing multiple processor cores in a chip has increased the importance of automatic techniques for extracting thread-level parallelism. A promising approach for extracting thread-level parallelism in general-purpose applications is to apply memory alias or value speculation to break dependences amongst threads and execute them concurrently.
Exploiting Speculative Thread-Level Parallelism in Data Compression Applications
Cited by 3 (2 self)
Although hardware support for Thread-Level Speculation (TLS) can ease the compiler’s tasks in creating parallel programs by allowing the compiler to create potentially dependent parallel threads, advanced compiler optimization techniques must be developed and judiciously applied to achieve the desired performance. In this paper, we closely examine two data compression benchmarks, GZIP and BZIP2, and propose, implement, and evaluate new compiler optimization techniques to eliminate performance bottlenecks in their parallel execution and improve their performance. The proposed techniques (i) remove the critical forwarding path created by synchronizing memory-resident values; (ii) identify and categorize reduction-like variables whose intermediate results are used within loops, and propose code transformation to remove the inter-thread data dependences caused by these variables; and (iii) transform the program to eliminate stalls caused by variations in thread size. While no previous work has reported significant performance improvement on parallelizing these two benchmarks, we are able to achieve up to 36% performance improvement for GZIP and 37% for BZIP2.
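The dependence that a reduction variable creates, and the usual way to remove it, can be sketched as follows. This is the textbook transformation for a plain sum, shown as an assumption-laden illustration: it does not cover the harder case named in the abstract, where intermediate values of the variable are read inside the loop, and the function names are hypothetical. Python threads are used only to show the structure and are not expected to give a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def sequential_sum(values):
    total = 0
    for v in values:
        total += v          # every iteration depends on the previous value of total
    return total


def chunked_sum(values, workers=4):
    """Give each worker a private partial sum, then combine the partials."""
    size = max(1, -(-len(values) // workers))            # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, chunks))           # no shared accumulator
    return sum(partials)                                 # final combine step
```

Because integer addition is associative and commutative, the chunked version returns the same value as the sequential loop; the inter-thread dependence on `total` is replaced by one combine step at the end.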
Bosschere. Detecting the existence of coarse-grain parallelism in general-purpose programs
In Proceedings of the First Workshop on Programmability Issues for Multi-Core Computers, MULTIPROG-1
Cited by 3 (2 self)
With the rise of chip-multiprocessors, the problem of parallelizing general-purpose programs has once again been placed on the research agenda. In the 1980s and early 1990s, great successes were obtained in extracting parallelism from the inner loops of scientific computations. General-purpose programs, however, stayed out of reach due to the complexity of their control flow and data dependences. More recently, thread-level speculation (TLS) has been touted as the definitive solution for general-purpose programs. TLS again targets inner loops. The program complexity issue is handled by checking and resolving dependences at runtime using complex hardware support. However, results so far have been disappointing and limit studies predict very low potential speedups, in one study just 18%. In this paper we advocate a completely different approach. We show that significant amounts of coarse-grain parallelism exist in the outer program loops, even in general-purpose programs. This coarse-grain parallelism can be exploited efficiently on CMPs without additional hardware support. This paper presents a technique to extract coarse-grain parallelism from the outer program loops. Application of this technique to the MiBench and SPEC CPU2000 benchmarks shows that significant amounts of outer-loop parallelism exist. This leads to a speedup of 5.18 for bzip2 compression and 11.8 for an MPEG2 encoder on a Sun UltraSPARC T1 CMP. The parallelization effort was limited to 10 to 20 person-hours per benchmark while we had no prior knowledge of the programs.
Compiler and Hardware Support for Reducing the Synchronization of Speculative Threads
Cited by 3 (0 self)
Thread-Level Speculation (TLS) allows us to automatically parallelize general-purpose programs by supporting parallel execution of threads that might not actually be independent. In this article we focus on one important limitation of program performance under TLS: stalls due to synchronizing and forwarding scalar values between speculative threads that would otherwise cause frequent data dependences and hence failed speculation. Using SPECint benchmarks that have been automatically transformed by our compiler to exploit TLS, we present, evaluate in detail, and compare both compiler and hardware techniques for improving the communication of scalar values. We find that through our dataflow algorithms for three increasingly aggressive instruction scheduling techniques, the compiler can drastically reduce the critical forwarding path introduced by the synchronization and forwarding of scalar values. We also show that hardware techniques for reducing synchronization can be complementary to compiler scheduling, but that the additional performance benefits are minimal and are generally not worth the cost.
Microprocessors in the Era of Terascale Integration
Cited by 3 (0 self)
Moore’s Law will soon deliver tera-scale level transistor integration capacity. Power, variability, reliability, aging, and testing will pose barriers and challenges to harnessing this integration capacity. Advances in microarchitecture and programming systems discussed in this paper are potential solutions.

