Results 1 - 10 of 64
Transactional memory coherence and consistency
In ISCA, 2004. Cited by 233 (17 self).
"... In this paper, we propose a new shared memory model: Transactional ..."
Programming with transactional coherence and consistency (TCC)
In ASPLOS-XI: Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems, 2004. Cited by 75 (9 self).
Transactional Coherence and Consistency (TCC) offers a way to simplify parallel programming by executing all code within transactions. In TCC systems, transactions serve as the fundamental unit of parallel work, communication, and coherence. As each transaction completes, it writes all of its newly produced state to shared memory atomically, while restarting other processors that have speculatively read stale data. With this mechanism, a TCC-based system automatically handles data synchronization correctly, without programmer intervention. To gain the benefits of TCC, programs must be decomposed into transactions. We describe two basic programming language constructs for decomposing programs into transactions: a loop conversion syntax and a general transaction-forking mechanism. With these constructs, writing correct parallel programs requires only small, incremental changes to correct sequential programs. The performance of these programs may then easily be optimized, based on feedback from real program execution, using a few simple techniques.
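A minimal sketch of the commit-time mechanism this abstract describes, assuming a hypothetical Transaction class rather than TCC's actual constructs: each transaction buffers its writes, and committing one atomically publishes its write set while restarting transactions that speculatively read data the commit overwrites.

# Illustrative sketch of TCC-style optimistic execution. Class and method
# names are assumptions, not TCC's real interface.

class Transaction:
    def __init__(self, name):
        self.name = name
        self.read_set = set()
        self.write_set = {}          # address -> buffered value

    def read(self, memory, addr):
        self.read_set.add(addr)
        # Reads see this transaction's own buffered writes first.
        return self.write_set.get(addr, memory.get(addr))

    def write(self, addr, value):
        self.write_set[addr] = value  # buffered until commit


def commit(memory, committing, others):
    """Apply the write set atomically; restart conflicting transactions."""
    memory.update(committing.write_set)
    restarted = [t for t in others
                 if committing.write_set.keys() & t.read_set]
    for t in restarted:
        t.read_set.clear()
        t.write_set.clear()           # t must re-execute from its start
    return restarted


# Usage: t2 speculatively reads x, then t1 commits a new x, so t2 is restarted.
memory = {"x": 0}
t1, t2 = Transaction("t1"), Transaction("t2")
t2.read(memory, "x")
t1.write("x", 42)
print([t.name for t in commit(memory, t1, [t2])])    # ['t2']
print(memory)                                         # {'x': 42}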
POSH: A TLS compiler that exploits program structure
In Proceedings of the 11th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2006. Cited by 65 (7 self).
As multi-core architectures with Thread-Level Speculation (TLS) are becoming better understood, it is important to focus on TLS compilation. TLS compilers are interesting in that, while they do not need to fully prove the independence of concurrent tasks, they make choices of where and when to generate speculative tasks that are crucial to overall TLS performance. This paper presents POSH, a new, fully automated TLS compiler built on top of gcc. POSH is based on two design decisions. First, to partition the code into tasks, it leverages the code structures created by the programmer, namely subroutines and loops. Second, it uses a simple profiling pass to discard ineffective tasks. With the code generated by POSH, a simulated TLS chip multiprocessor with 4 superscalar cores delivers an average speedup of 1.30 for the SPECint 2000 applications. Moreover, an estimated 26% of this speedup is a result of the implicit data prefetching provided by squashed tasks.
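A hedged illustration of the second design decision, the profiling pass that discards ineffective tasks. The task fields and thresholds below are assumptions for illustration, not POSH's actual heuristics.

# Sketch: keep only candidate tasks (formed from loops and subroutines) whose
# profiled behaviour suggests a benefit. Thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class CandidateTask:
    name: str                 # originating loop or subroutine
    avg_instructions: float   # profiled task size
    squash_rate: float        # fraction of speculative runs that were squashed

def select_tasks(candidates, min_size=100.0, max_squash=0.5):
    """Keep tasks large enough to amortize spawn overhead and rarely squashed."""
    return [t for t in candidates
            if t.avg_instructions >= min_size and t.squash_rate <= max_squash]

profile = [
    CandidateTask("loop@foo.c:120", avg_instructions=850, squash_rate=0.10),
    CandidateTask("call bar()",     avg_instructions=40,  squash_rate=0.05),
    CandidateTask("loop@baz.c:77",  avg_instructions=600, squash_rate=0.80),
]
print([t.name for t in select_tasks(profile)])   # ['loop@foo.c:120']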
Towards a holistic approach to auto-parallelization: integrating profile-driven parallelism detection and machine-learning based mapping
In PLDI, 2009. Cited by 49 (3 self).
Compiler-based auto-parallelization is a much-studied area, yet it has still not found widespread application. This is largely due to poor exploitation of application parallelism, resulting in performance levels far below those a skilled expert programmer could achieve. We have identified two weaknesses in traditional parallelizing compilers and propose a novel, integrated approach, resulting in significant performance improvements of the generated parallel code. Using profile-driven parallelism detection, we overcome the limitations of static analysis, enabling us to identify more application parallelism and only rely on the user for final approval. In addition, we replace the traditional target-specific and inflexible mapping heuristics with a machine-learning-based prediction mechanism, resulting in better mapping decisions while providing more scope for adaptation to different target architectures.
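A small sketch of what profile-driven parallelism detection can look like, under the simplifying assumption that only cross-iteration flow dependences observed on the profiled input are checked; the function and data layout are illustrative, not the paper's tool, and a real system would still ask the user for final approval as the abstract notes.

# Sketch: a loop is a parallel candidate if, on the profiled run, no iteration
# reads a location written by an earlier iteration.

def has_cross_iteration_flow(iterations):
    """iterations: list of (read_addrs, write_addrs) sets, one per iteration."""
    written_so_far = set()
    for reads, writes in iterations:
        if reads & written_so_far:
            return True               # flow dependence across iterations
        written_so_far |= writes
    return False

# Iteration i of `a[i] = a[i] + 1` touches only its own element: parallel.
independent = [({i}, {i}) for i in range(8)]
# Iteration i of `a[i] = a[i-1] + 1` reads the previous write: sequential.
dependent = [({i - 1}, {i}) for i in range(1, 8)]
print(has_cross_iteration_flow(independent))  # False -> candidate for parallelization
print(has_cross_iteration_flow(dependent))    # True  -> not a candidate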
A cost-driven compilation framework for speculative parallelization of sequential programs
In ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation (PLDI'04), 2004. Cited by 45 (4 self).
The emerging hardware support for thread-level speculation opens new opportunities to parallelize sequential programs beyond the traditional limits. By speculating that many data dependences are unlikely during runtime, consecutive iterations of a sequential loop can be executed speculatively in parallel. Runtime parallelism is obtained when the speculation is correct. To take full advantage of this new execution model, a program needs to be programmed or compiled in such a way that it exhibits a high degree of speculative thread-level parallelism. We propose a comprehensive cost-driven compilation framework to perform speculative parallelization. Based on a misspeculation cost model, the compiler aggressively transforms loops into optimal speculative parallel loops and selects only those loops whose speculative parallel execution is likely to improve program performance.
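A worked sketch of a misspeculation cost model in this spirit; the formula, parameter names, and numbers are assumptions for illustration, not the paper's actual model.

# Sketch: select a loop for speculative parallelization only if its expected
# parallel time, including squashed and re-executed iterations, beats
# sequential execution.

def expected_speedup(iter_time, n_iters, n_threads, p_misspec, squash_penalty):
    sequential = iter_time * n_iters
    # Each misspeculated iteration pays a squash penalty and re-executes.
    expected_iter = iter_time + p_misspec * (squash_penalty + iter_time)
    parallel = expected_iter * n_iters / n_threads
    return sequential / parallel

def select(loop_profiles, n_threads=4, threshold=1.0):
    return [name for name, prof in loop_profiles.items()
            if expected_speedup(n_threads=n_threads, **prof) > threshold]

profiles = {
    "loop_A": dict(iter_time=100, n_iters=1000, p_misspec=0.05, squash_penalty=300),
    "loop_B": dict(iter_time=100, n_iters=1000, p_misspec=0.90, squash_penalty=300),
}
print(select(profiles))   # ['loop_A'] under these illustrative numbers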
Uncovering hidden loop level parallelism in sequential applications
In Proc. of the 14th International Symposium on High-Performance Computer Architecture, 2008. Cited by 43 (6 self).
As multicore systems become the dominant mainstream computing technology, one of the most difficult challenges the industry faces is the software. Applications with large amounts of explicit thread-level parallelism naturally scale performance with the number of cores, but single-threaded applications realize little to no gains from additional cores. One solution to this problem is automatic parallelization, which frees the programmer from the difficult task of parallel programming and offers hope for handling the vast amount of legacy single-threaded software. There is a long history of automatic parallelization for scientific applications, but the techniques have generally failed in the context of general-purpose software. Thread-level speculation overcomes the problem of memory dependence analysis by speculating on unlikely dependences that serialize execution. However, this approach has led to only modest performance gains. In this paper, we take another look at exploiting loop-level parallelism in single-threaded applications. We show that substantial amounts of loop-level parallelism are available in general-purpose applications, but they lurk beneath the surface and are often obfuscated by a small number of data and control dependences. We adapt and extend several code transformations from the instruction-level and scientific parallelization communities to uncover the hidden parallelism. Our results show that 61% of the dynamic execution of the studied benchmarks can be parallelized with our techniques, compared to 27% using traditional thread-level speculation techniques, resulting in a speedup of 1.84 on a four-core system compared to 1.41 without transformations.
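One classic transformation of the kind such work adapts from the scientific parallelization community is reduction privatization, sketched below: a single accumulator creates a cross-iteration dependence that serializes the loop, but per-worker partial sums remove it. The example only illustrates the dependence structure; in CPython the GIL prevents a real speedup for this pure-Python loop, and the abstract does not claim this specific transformation.

# Sketch: removing an accumulator dependence by privatizing it into
# independent partial sums that are combined once at the end.
from concurrent.futures import ThreadPoolExecutor

def sequential_sum(data):
    total = 0
    for x in data:
        total += x            # every iteration depends on the previous one
    return total

def parallel_sum(data, n_workers=4):
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial = list(pool.map(sum, chunks))   # independent partial sums
    return sum(partial)                          # combine once at the end

data = list(range(1_000_000))
assert sequential_sum(data) == parallel_sum(data)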
Exposing speculative thread parallelism in SPEC2000
In Proc. of PPoPP'05, 2005. Cited by 32 (0 self).
As improving the performance of single-threaded processors becomes increasingly difficult, consumer desktop processors are moving toward multi-core designs. One way to enhance the performance of chip multiprocessors that has received considerable attention is the use of thread-level speculation (TLS). As a case study, we manually parallelized several of the SPEC CPU2000 floating point and integer applications using TLS. The use of manual parallelization enabled us to apply techniques and programmer expertise that are beyond the current capabilities of automated parallelizers. With the experience gained from this, we provide insight into ways to aggressively apply TLS to parallelize applications for high performance. This information can help guide future advanced TLS compiler design. For each application, we discuss how and where parallelism was located within the application, the impediments to extracting this parallelism using TLS, and the code transformations that were required to overcome these impediments. We also generalize these experiences to a discussion of common hindrances to TLS parallelization, and describe methods of programming that help expose application parallelism to TLS systems. These guidelines can assist developers of uniprocessor programs in creating applications that port easily to TLS systems and yield good performance. By using manual parallelization on SPEC2000, we provide guidance on where thread-level parallelism exists in these well-known benchmarks, what limits its extraction, how to reduce these limitations, and what performance can be expected on these applications from a chip multiprocessor system with TLS.
Commutativity Analysis for Software Parallelization: Letting Program Transformations See the Big Picture
In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2009. Cited by 25 (0 self).
Extracting performance from many-core architectures requires software engineers to create multi-threaded applications, which significantly complicates the already daunting task of software development. One solution to this problem is automatic compile-time parallelization, which can ease the burden on software developers in many situations. Clearly, automatic parallelization in its present form is not suitable for many application domains, and new compiler analyses are needed to address its shortcomings. In this paper, we present one such analysis: a new approach for detecting commutative functions. Commutative functions are sections of code that can be executed in any order without affecting the outcome of the application, e.g., inserting elements into a set. Previous research on this topic had one significant limitation: commutative functions were required to produce identical memory layouts. This prevented previous techniques from detecting functions like malloc, which may return different pointers depending on the order in which it is called, even though these differing results do not affect the overall output of the application. Our new commutativity analysis correctly identifies these situations to better facilitate automatic parallelization. We demonstrate that this analysis can automatically extract significant amounts of parallelism from many applications, and where it is ineffective it can provide software developers with a useful list of functions that may be commutative given semantic program changes that cannot be automated.
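The key distinction in this abstract, that two calls commute if swapping their order leaves the observable result unchanged even when internal memory layout differs, can be sketched with a dynamic test. The test below is an illustrative stand-in, not the paper's static analysis.

# Sketch: compare observable results of running two operations in both orders.

def commutes(make_state, op_a, op_b, observe):
    s1 = make_state(); op_a(s1); op_b(s1)
    s2 = make_state(); op_b(s2); op_a(s2)
    return observe(s1) == observe(s2)

# Inserting into a set commutes: the observable contents are order-independent.
print(commutes(set, lambda s: s.add(1), lambda s: s.add(2), frozenset))    # True
# Appending to a list does not: the observable order differs.
print(commutes(list, lambda l: l.append(1), lambda l: l.append(2), tuple))  # False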
Modeling optimistic concurrency using quantitative dependence analysis
In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008. Cited by 24 (0 self).
This work presents a quantitative approach to analyze parallelization opportunities in programs with irregular memory access where potential data dependences mask available parallelism. The model captures data and causal dependencies among critical sections as algorithmic properties and quantifies them as a density computed over the number of executed instructions. The model abstracts from runtime aspects such as scheduling, the number of threads, and concurrency control used in a particular parallelization. We illustrate the model on several applications requiring ordered and unordered execution of critical sections. We describe a run-time tool that computes the dependence densities from a deterministic single-threaded program execution. This density metric provides insights into the potential for optimistic parallelization, opportunities for algorithmic scheduling, and performance defects due to synchronization bottlenecks. Based on the results of our analysis, we classify applications into three categories with low, medium, and high dependence densities. Applications with low dependence density are naturally good candidates for optimistic concurrency, applications with medium density may require a scheduler that is aware of the algorithmic dependencies for optimistic concurrency to be effective, and applications with high dependence density may not be suitable for parallelization.
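An illustrative computation in the spirit of this metric: from a single-threaded trace of critical-section instances, count the pairs that conflict on shared data and normalize by the number of executed instructions. The exact definition used here is an assumption, not the paper's formula.

# Sketch of a dependence-density calculation over a recorded trace.
from itertools import combinations

def dependence_density(trace, total_instructions):
    """trace: list of (read_set, write_set) per executed critical section."""
    dependent_pairs = 0
    for (r1, w1), (r2, w2) in combinations(trace, 2):
        if (w1 & r2) or (w2 & r1) or (w1 & w2):   # any data conflict
            dependent_pairs += 1
    return dependent_pairs / total_instructions

low  = [({"a"}, {"a"}), ({"b"}, {"b"}), ({"c"}, {"c"})]   # disjoint data
high = [({"x"}, {"x"}), ({"x"}, {"x"}), ({"x"}, {"x"})]   # all touch x
print(dependence_density(low, 10_000))    # 0.0    -> good optimistic candidate
print(dependence_density(high, 10_000))   # 0.0003 -> conflicts present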