SLAW: A scalable locality-aware adaptive work-stealing scheduler
- In 24th IEEE International Symposium on Parallel and Distributed Processing (IPDPS), 2010
Cited by 38 (2 self)
Recent trends have made it clear that processor makers are committed to multi-core chip designs. The number of cores per chip is increasing, while there is little or no increase in clock speed. This parallelism trend poses a significant and urgent challenge for computer software, because programs have to be written or transformed into a multi-threaded form to take full advantage of future hardware advances. Task parallelism has been identified as one of the prerequisites for software productivity. In task parallelism, programmers focus on decomposing the problem into sub-computations that can run in parallel and leave the compiler and runtime to handle the scheduling details. This separation of concerns between task decomposition and scheduling provides productivity to the programmer but poses challenges to the runtime scheduler. Our thesis is that work-stealing schedulers with adaptive scheduling policies and locality-awareness can provide a scalable and robust runtime foundation for multi-core task parallelism. We evaluate our thesis using the new Scalable Locality-aware
Design Principles for End-to-End Multicore Schedulers
Cited by 12 (2 self)
As personal computing devices become increasingly parallel multiprocessors, the requirements for operating system schedulers change considerably. Future general-purpose machines will need to handle a dynamic, bursty, and interactive mix of parallel programs sharing a heterogeneous multicore machine. We argue that a key challenge for such machines is rethinking scheduling as an end-to-end problem integrating components from the hardware and kernel up to the programming language runtimes and applications themselves. We present several design principles for future OS schedulers, and discuss the implications of each for OS and runtime interfaces and structure. We illustrate the implementation challenges that result by describing the concrete choices we have made in the Barrelfish multikernel. This allows us to present one coherent scheduling design for an entire multicore machine, while at the same time drawing conclusions we think are applicable to the design of any general-purpose multicore OS.
Parallelism orchestration using DoPE: the degree of parallelism executive
- In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2011
Cited by 9 (1 self)
In writing parallel programs, programmers expose parallelism and optimize it to meet a particular performance goal on a single platform under an assumed set of workload characteristics. In the field, changing workload characteristics, new parallel platforms, and deployments with different performance goals make the programmer’s development-time choices suboptimal. To address this problem, this paper presents the Degree of Parallelism Executive (DoPE), an API and run-time system that separates the concern of exposing parallelism from that of optimizing it. Using the DoPE API, the application developer expresses parallelism options. During program execution, DoPE’s run-time system uses this information to dynamically optimize the parallelism options in response to the facts on the ground. We easily port several emerging parallel applications to DoPE’s API and demonstrate the DoPE run-time system’s effectiveness in dynamically optimizing the parallelism for a variety of performance goals.
A System for Flexible Parallel Execution
Cited by 8 (1 self)
growth in transistor density combined with diminishing returns from uniprocessor improvements has compelled the industry to transition to multicore architectures. To realize the performance potential of multicore architectures, programs must be parallelized effectively. The efficiency of parallel program execution depends on the execution environment, comprising workload, platform, and performance goal. In writing parallel programs, most programmers and compilers expose parallelism and optimize it to meet a particular performance goal on a single platform under an assumed set of workload characteristics. In the field, changing workload characteristics, new parallel platforms, and deployments with different performance goals make the programmer’s or compiler’s development-time or compile-time choices suboptimal. This dissertation presents Parcae, a generally applicable holistic system for platform-wide dynamic parallelism tuning. Parcae includes: 1. the Nona compiler, which applies a variety of auto-parallelization techniques to create flexible parallel programs whose tasks can be efficiently paused, reconfigured,
Loop parallelism: a new skeleton perspective on data parallel patterns
- In Proc. of Intl. Euromicro PDP 2014: Parallel Distributed and network-based Processing, 2014
Cited by 6 (2 self)
Traditionally, skeleton-based parallel programming frameworks support data parallelism by providing the programmer with a comprehensive set of data parallel skeletons, based on different variants of map and reduce patterns. On the other hand, more conventional parallel programming frameworks provide application programmers with the possibility to introduce parallelism in the execution of loops with a relatively small programming effort. In this work, we discuss a “ParallelFor” skeleton provided within the FastFlow framework and aimed at filling the usability and expressivity gap between the classical data parallel skeleton approach and the loop parallelisation facilities offered by frameworks such as OpenMP and Intel TBB. By exploiting the low run-time overhead of the FastFlow parallel skeletons and the new facilities offered by the C++11 standard, our ParallelFor skeleton succeeds in obtaining comparable or better performance than both OpenMP and TBB on the Intel Phi many-core and Intel Nehalem multi-core for the set of benchmarks considered, while requiring a comparable programming effort.
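To make the idea of a loop-parallelism skeleton concrete, here is a minimal Python sketch of a chunked parallel-for. It is not FastFlow's ParallelFor API: the function name `parallel_for` and the parameters `workers` and `grain` are hypothetical, and a standard-library thread pool stands in for the framework's run-time. It only illustrates the interface shape, where the programmer supplies a loop body and an index range and the helper handles the partitioning.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_for(first, last, body, workers=4, grain=None):
    """Call body(i) for every i in range(first, last), split into chunks.

    Hypothetical helper, not FastFlow's API: each chunk of `grain`
    consecutive indices is submitted to the pool as one task.
    """
    n = last - first
    if n <= 0:
        return
    if grain is None:
        grain = max(1, -(-n // workers))  # ceil(n / workers)

    def run_chunk(lo):
        for i in range(lo, min(lo + grain, last)):
            body(i)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for all chunks and re-raises any exception from a chunk.
        list(pool.map(run_chunk, range(first, last, grain)))


data = list(range(10))
squares = [0] * len(data)


def square_at(i):
    # Each index is written by exactly one task, so no locking is needed.
    squares[i] = data[i] * data[i]


parallel_for(0, len(data), square_at, workers=3)
```

In CPython the global interpreter lock limits the speedup for CPU-bound bodies, so this sketch shows the programming model rather than the performance behaviour reported in the abstract.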
A Machine Learning-Based Approach for Thread Mapping on Transactional Memory Applications
- In High Performance Computing Conference (HiPC), 2011
Cited by 5 (0 self)
Thread mapping has been extensively used as a technique to efficiently exploit memory hierarchy on modern chip-multiprocessors. It places threads on cores in order to amortize memory latency and/or to reduce memory contention. However, efficient thread mapping relies upon matching application behavior with system characteristics. In particular, Software Transactional Memory (STM) applications introduce another dimension due to their runtime system support. Existing STM systems implement several conflict detection and resolution mechanisms, which lead STM applications to behave differently for each combination of these mechanisms. In this paper we propose a machine learning-based approach to automatically infer a suitable thread mapping strategy for transactional memory applications. First, we profile several STM applications from the STAMP benchmark suite considering application, STM system and platform features to build a set of input instances. Then, such data feeds a machine learning algorithm, which produces a decision tree able to predict the most suitable thread mapping strategy for new unobserved instances. Results show that our approach improves performance up to 18.46% compared to the worst case and up to 6.37% over the Linux default thread mapping strategy. Keywords: machine learning; software transactional memory; thread mapping.
Portable Mapping of Data Parallel Programs to OpenCL for Heterogeneous Systems
Cited by 5 (1 self)
General-purpose GPU-based systems are highly attractive as they give potentially massive performance at little cost. Realizing such potential is challenging due to the complexity of programming. This paper presents a compiler-based approach to automatically generate optimized OpenCL code from data-parallel OpenMP programs for GPUs. Such an approach brings together the benefits of a clear high-level language (OpenMP) and an emerging standard (OpenCL) for heterogeneous multi-cores. A key feature of our scheme is that it leverages existing transformations, especially data transformations, to improve performance on GPU architectures and uses predictive modeling to automatically determine if it is worthwhile running the OpenCL code on the GPU or OpenMP code on the multi-core host. We applied our approach to the entire NAS parallel benchmark suite and evaluated it on two distinct GPU-based systems: Core i7/NVIDIA GeForce GTX 580 and Core i7/AMD Radeon 7970. We achieved average (up to) speedups of 4.51x and 4.20x (143x and 67x) respectively over a sequential baseline. This is, on average, a factor of 1.63 and 1.56 faster than a hand-coded, GPU-specific OpenCL implementation developed by independent expert programmers.
Exploiting Inter-Sequence Correlations for Program Behavior Prediction
Cited by 4 (3 self)
Prediction of program dynamic behaviors is fundamental to program optimizations, resource management, and architecture reconfigurations. Most existing predictors are based on locality of program behaviors, subject to some inherent limitations. In this paper, we revisit the design philosophy and systematically explore a second source of clues: statistical correlations between the behavior sequences of different program entities. Concentrating on loops, we examine the correlations’ existence, strength, and value in enhancing the design of program behavior predictors. We create the first taxonomy of program behavior sequence patterns. We develop a new form of predictors, named sequence predictors, to effectively translate the correlations into large-scope, proactive predictions of program behavior sequences. We demonstrate the usefulness of the prediction in dynamic version selection and loop importance estimation, showing 19% average speedup on a number of real-world utility applications. By taking scope and timing of behavior prediction as the first-order design objectives, the new approach overcomes limitations of existing program behavior predictors, opening up many new opportunities for runtime optimizations at various layers of computing.
Automatically tuning parallel and parallelized programs
- Languages and Compilers for Parallel Computing, 2010
Cited by 4 (0 self)
In today’s multicore era, parallelization of serial code is essential in order to exploit the architectures’ performance potential. Parallelization, especially of legacy code, however, proves to be a challenge as manual efforts must either be directed towards algorithmic modifications or towards analysis of computationally intensive sections of code for the best possible parallel performance, both of which are difficult and time-consuming. Automatic parallelization uses sophisticated compile-time techniques in order to identify parallelism in serial programs, thus reducing the burden on the program developer. Similar sophistication is needed to improve the performance of hand-parallelized programs. A key difficulty is that optimizing compilers are generally unable to estimate the performance of an application or even a program section at compile-time, and so the task of performance improvement invariably rests with the developer. Automatic tuning uses static analysis and runtime performance metrics to determine the best possible compile-time approach for optimal application performance. This paper describes an offline tuning approach that uses a source-to-source parallelizing compiler, Cetus, and a tuning framework to tune parallel application performance. The implementation uses an existing, generic tuning algorithm called Combined Elimination to study the effect of serializing parallelizable loops based on measured whole-program execution time, and provides as an outcome a combination of parallel loops that ensures performance equal to or better than that of the original program. We evaluated our algorithm on a suite of hand-parallelized C benchmarks from the SPEC OMP2001 and NAS Parallel benchmarks and provide two sets of results. The first ignores hand-parallelized loops and only tunes application performance based on Cetus-parallelized loops. The second set of results considers the tuning of additional parallelism in hand-parallelized code. We show that our implementation always performs near-equal to or better than serial code while tuning only Cetus-parallelized loops, and equal to or better than hand-parallelized code while tuning additional parallelism.
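The idea of deciding which loops to serialize from measured whole-program time can be sketched as a simple greedy elimination loop. This is a simplified illustration, not the published Combined Elimination algorithm and not the Cetus tuning framework: the function `tune_parallel_loops`, the loop names, and the cost model with its numbers are all invented for the example, and a real tuner would run and time the program instead of calling `model_time`.

```python
def tune_parallel_loops(loops, measure):
    """Greedily serialize loops while doing so lowers the measured run time.

    loops:   iterable of loop identifiers, all initially parallel
    measure: callable(frozenset of parallel loops) -> run time (lower is better)
    Returns (set of loops left parallel, run time of that configuration).
    """
    parallel = frozenset(loops)
    best_time = measure(parallel)
    while parallel:
        # Try serializing each remaining parallel loop, one at a time.
        trials = {loop: measure(parallel - {loop}) for loop in sorted(parallel)}
        candidate = min(trials, key=trials.get)
        if trials[candidate] >= best_time:
            break  # no single serialization helps any more
        parallel = parallel - {candidate}
        best_time = trials[candidate]
    return parallel, best_time


# Synthetic cost model (invented numbers): per-loop time when serial or parallel.
SERIAL = {"L1": 10.0, "L2": 2.0, "L3": 8.0, "L4": 1.0}
PARALLEL = {"L1": 3.0, "L2": 2.5, "L3": 2.0, "L4": 1.8}


def model_time(parallel_loops):
    return sum(
        PARALLEL[name] if name in parallel_loops else SERIAL[name]
        for name in SERIAL
    )


chosen, tuned = tune_parallel_loops(SERIAL, model_time)
baseline = model_time(frozenset(SERIAL))
```

Because a loop is serialized only when the measurement strictly improves, the returned configuration is never slower than the all-parallel starting point under the same measurements, which mirrors the "equal or better" property the abstract describes.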
Auto-tuning parallel skeletons
- Parallel Processing Letters, 2012
Cited by 3 (0 self)
Parallel skeletons are a structured parallel programming abstraction that provides programmers with a predefined set of algorithmic templates that can be combined, nested and parameterized with sequential code to produce complex programs. The implementation of these skeletons is currently a manual process, requiring human expertise to choose suitable implementation parameters that provide good performance. This paper presents an empirical exploration of the optimization space of the FastFlow parallel skeleton framework. We performed this using a Monte Carlo search of a random subset of the space, for a representative set of platforms and programs. The results show that the space is program- and platform-dependent and non-linear, and that automatic search achieves a significant average speedup in program execution time of 1.6× over a human expert. An exploratory data analysis of the results shows a linear dependence between two of the parameters, and that another two parameters have little effect on performance. These properties are then used to reduce the size of the space by a factor of 6, reducing the cost of the search. This provides a starting point for automatically optimizing parallel skeleton programs without the need for human expertise, and with a large improvement in execution time compared to that achievable using human expert tuning.
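A Monte Carlo search over a random subset of a configuration space can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the parameter names (`workers`, `queue_len`, `chunk`), their values, and the `fake_runtime` cost model are invented and are not FastFlow's actual tuning parameters or measurements. A real search would build and time the program for each sampled configuration.

```python
import itertools
import random


def random_search(space, cost, samples, seed=0):
    """Evaluate `samples` distinct random configurations; return the cheapest.

    space: dict mapping parameter name -> list of allowed values
    cost:  callable(config dict) -> measured cost (lower is better)
    Returns (best config, its cost, size of the full space).
    """
    names = sorted(space)
    grid = list(itertools.product(*(space[n] for n in names)))
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    picked = rng.sample(grid, min(samples, len(grid)))
    configs = [dict(zip(names, values)) for values in picked]
    best = min(configs, key=cost)
    return best, cost(best), len(grid)


# Hypothetical parameters and a made-up cost model, for illustration only.
SPACE = {
    "workers": [1, 2, 4, 8, 16],
    "queue_len": [64, 256, 1024],
    "chunk": [1, 8, 64, 512],
}


def fake_runtime(cfg):
    return (
        100.0 / cfg["workers"]
        + 0.5 * cfg["workers"]
        + 2000.0 / (cfg["queue_len"] * cfg["chunk"])
        + 0.01 * cfg["chunk"]
    )


best, best_cost, size = random_search(SPACE, fake_runtime, samples=20)
```

Sampling 20 of the 60 configurations evaluates a third of this toy space. Dropping a parameter that turns out to have little effect, as the abstract reports for two of the parameters studied, shrinks the grid multiplicatively and makes the same sample budget cover more of what remains.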