Mapping parallelism to multi-cores: a machine learning based approach, in PPoPP (2009)

by Z Wang, M O’Boyle
Results 1 - 10 of 30

SLAW: A scalable locality-aware adaptive work-stealing scheduler

by Yi Guo, D. Cooper, Lin Zhong - In 24th IEEE International Symposium on Parallel and Distributed Processing (IPDPS), 2010
"... Recent trend has made it clear that the processor makers are committed to the multi-core chip designs. The number of cores per chip is increasing, while there is little or no increase in the clock speed. This parallelism trend poses a significant and urgent challenge on computer software because pro ..."
Abstract - Cited by 38 (2 self) - Add to MetaCart
Recent trends have made it clear that processor makers are committed to multi-core chip designs. The number of cores per chip is increasing, while there is little or no increase in clock speed. This parallelism trend poses a significant and urgent challenge to computer software, because programs have to be written or transformed into a multi-threaded form to take full advantage of future hardware advances. Task parallelism has been identified as one of the prerequisites for software productivity. In task parallelism, programmers focus on decomposing the problem into sub-computations that can run in parallel and leave the compiler and runtime to handle the scheduling details. This separation of concerns between task decomposition and scheduling provides productivity to the programmer but poses challenges to the runtime scheduler. Our thesis is that work-stealing schedulers with adaptive scheduling policies and locality-awareness can provide a scalable and robust runtime foundation for multi-core task parallelism. We evaluate our thesis using the new Scalable Locality-aware Adaptive Work-stealing (SLAW) scheduler.
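
The core mechanism referenced here is work stealing: each worker keeps its own double-ended task queue and idle workers take tasks from the opposite end of a victim's queue. The sketch below is a minimal, illustrative Python rendering of that idea only; SLAW's actual contributions (adaptive work-first/help-first policies and locality-aware stealing) are not modeled, and all names are invented for illustration.

# Minimal work-stealing sketch (illustrative only; SLAW adds adaptive
# policies and locality-aware stealing on top of this basic scheme).
import random
import threading
from collections import deque

NUM_WORKERS = 4

class Worker:
    def __init__(self, wid, workers):
        self.wid = wid
        self.workers = workers          # shared list of all workers
        self.tasks = deque()            # local double-ended task queue
        self.lock = threading.Lock()

    def push(self, task):
        with self.lock:
            self.tasks.append(task)     # owner pushes/pops at the "bottom"

    def pop(self):
        with self.lock:
            return self.tasks.pop() if self.tasks else None

    def steal(self):
        with self.lock:
            return self.tasks.popleft() if self.tasks else None  # thieves take the "top"

    def run(self):
        while True:
            task = self.pop()
            if task is None:
                victim = random.choice(self.workers)   # pick a random victim to steal from
                task = victim.steal() if victim is not self else None
            if task is None:
                break                   # nothing found on this attempt; stop (simplified termination)
            task()                      # execute the task

workers = []
workers.extend(Worker(i, workers) for i in range(NUM_WORKERS))
for i in range(100):
    workers[0].push(lambda i=i: None)   # seed all work on worker 0
threads = [threading.Thread(target=w.run) for w in workers]
for t in threads: t.start()
for t in threads: t.join()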

Citation Context

...this kind of approach has the potential risk of coming up with tasks that are too coarse-grain, such that the load balancing problem arises again. Some techniques are proposed to find the best granularity [90, 88]. SLAW currently does not change the granularity of the task at runtime. Instead, it tries to schedule the given tasks in an efficient way by policy adaptation. The technique to improve performance by...

Design Principles for End-to-End Multicore Schedulers

by Simon Peter, Adrian Schüpbach, Paul Barham, Andrew Baumann, Rebecca Isaacs, Tim Harris, Timothy Roscoe
"... As personal computing devices become increasingly parallel multiprocessors, the requirements for operating system schedulers change considerably. Future generalpurpose machines will need to handle a dynamic, bursty, and interactive mix of parallel programs sharing a heterogeneous multicore machine. ..."
Abstract - Cited by 12 (2 self) - Add to MetaCart
As personal computing devices become increasingly parallel multiprocessors, the requirements for operating system schedulers change considerably. Future general-purpose machines will need to handle a dynamic, bursty, and interactive mix of parallel programs sharing a heterogeneous multicore machine. We argue that a key challenge for such machines is rethinking scheduling as an end-to-end problem integrating components from the hardware and kernel up to the programming language runtimes and applications themselves. We present several design principles for future OS schedulers, and discuss the implications of each for OS and runtime interfaces and structure. We illustrate the implementation challenges that result by describing the concrete choices we have made in the Barrelfish multikernel. This allows us to present one coherent scheduling design for an entire multicore machine, while at the same time drawing conclusions we think are applicable to the design of any general-purpose multicore OS.

Citation Context

...fectively allocate resources. For example, MapReduce applications follow fixed data-flow phases, and it is possible to determine this information at compile-time for programming paradigms like OpenMP [26]. Implementation: At startup, or during execution, Barrelfish applications may present a scheduling manifest to the planner, containing a specification of predicted long-term resource requirements, expr...

Parallelism orchestration using DoPE: the degree of parallelism executive

by Arun Raman, Hanjun Kim, Taewook Oh, Jae W. Lee, David I. August - In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2011
"... In writing parallel programs, programmers expose parallelism and optimize it to meet a particular performance goal on a single platform under an assumed set of workload characteristics. In the field, changing workload characteristics, new parallel platforms, and deployments with different performanc ..."
Abstract - Cited by 9 (1 self) - Add to MetaCart
In writing parallel programs, programmers expose parallelism and optimize it to meet a particular performance goal on a single platform under an assumed set of workload characteristics. In the field, changing workload characteristics, new parallel platforms, and deployments with different performance goals make the programmer’s development-time choices suboptimal. To address this problem, this paper presents the Degree of Parallelism Executive (DoPE), an API and run-time system that separates the concern of exposing parallelism from that of optimizing it. Using the DoPE API, the application developer expresses parallelism options. During program execution, DoPE’s run-time system uses this information to dynamically optimize the parallelism options in response to the facts on the ground. We easily port several emerging parallel applications to DoPE’s API and demonstrate the DoPE run-time system’s effectiveness in dynamically optimizing the parallelism for a variety of performance goals.
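
To make the separation of concerns concrete, the sketch below shows the general shape of such an interface: the developer declares the parallelism options for a region, and a runtime policy picks a degree of parallelism (DOP) and adapts it across invocations. This is a hypothetical Python illustration, not DoPE's actual API; the names Runtime, run_region and choose_dop are assumptions.

# Hypothetical sketch of the developer/runtime split argued for above.
import time
from concurrent.futures import ThreadPoolExecutor

class Runtime:
    def __init__(self, dop_options):
        self.dop_options = dop_options    # e.g. [1, 2, 4] worker threads
        self.history = {}                 # dop -> last measured region time

    def run_region(self, work_items, body):
        dop = self.choose_dop()
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=dop) as pool:
            list(pool.map(body, work_items))
        self.history[dop] = time.perf_counter() - start
        return dop

    def choose_dop(self):
        # trivial policy: try each option once, then keep the fastest so far
        untried = [d for d in self.dop_options if d not in self.history]
        if untried:
            return untried[0]
        return min(self.history, key=self.history.get)

rt = Runtime(dop_options=[1, 2, 4])
for _ in range(6):                         # repeated invocations let the policy adapt
    rt.run_region(range(1000), lambda x: x * x)

The point of the sketch is only the shape of the interface: the loop body never mentions a thread count, so the runtime is free to change the DOP as goals or load change.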

Citation Context

...to respond to system events. 9. Related Work. Parallelization Libraries: Several interfaces and associated runtime systems have been proposed to adapt parallel program execution to run-time variability [4, 8, 9, 13, 20, 26, 29, 35, 37, 38]. However, each interface is tied to a specific performance goal, specific mechanism of adaptation, or a specific application/platform domain. OpenMP [22], Cilk [5], and Intel TBB [26] support task pa...

A System for Flexible Parallel Execution

by Arun Raman
"... growth in transistor density combined with diminishing returns from uniprocessor improvements has compelled the industry to transition to multicore architectures. To realize the performance potential of multicore architectures, programs must be parallelized effectively. The efficiency of parallel pr ..."
Abstract - Cited by 8 (1 self) - Add to MetaCart
growth in transistor density combined with diminishing returns from uniprocessor improvements has compelled the industry to transition to multicore architectures. To realize the performance potential of multicore architectures, programs must be parallelized effectively. The efficiency of parallel program execution depends on the execution environment comprised of workload, platform, and performance goal. In writing parallel programs, most programmers and compilers expose parallelism and optimize it to meet a particular performance goal on a single platform under an assumed set of workload characteristics. In the field, changing workload characteristics, new parallel platforms, and deployments with different performance goals make the programmer’s or compiler’s development-time or compile-time choices suboptimal. This dissertation presents Parcae, a generally applicable holistic system for platform-wide dynamic parallelism tuning. Parcae includes: 1. the Nona compiler, which applies a variety of auto-parallelization techniques to create flexible parallel programs whose tasks can be efficiently paused, reconfigured, …

Citation Context

...in-specific models. 9.1.1 General-purpose Parallel Programming Models: Several interfaces and associated run-time systems have been proposed to adapt parallel program execution to run-time variability [16, 25, 28, 45, 59, 77, 85, 90, 94, 97, 99]. However, each interface is tied to a specific performance goal, specific mechanism of adaptation, or a specific parallelism type. Most run-time systems enabling the parallel programming interfaces a...

Loop parallelism: a new skeleton perspective on data parallel patterns

by M. Danelutto, M. Torquati - In Proc. of Intl. Euromicro PDP 2014: Parallel, Distributed and Network-based Processing, 2014
"... Abstract—Traditionally, skeleton based parallel programming frameworks support data parallelism by providing the pro-grammer with a comprehensive set of data parallel skeletons, based on different variants of map and reduce patterns. On the other side, more conventional parallel programming framewor ..."
Abstract - Cited by 6 (2 self) - Add to MetaCart
Traditionally, skeleton-based parallel programming frameworks support data parallelism by providing the programmer with a comprehensive set of data parallel skeletons, based on different variants of map and reduce patterns. On the other side, more conventional parallel programming frameworks provide application programmers with the possibility to introduce parallelism in the execution of loops with a relatively small programming effort. In this work, we discuss a “ParallelFor” skeleton provided within the FastFlow framework and aimed at filling the usability and expressivity gap between the classical data parallel skeleton approach and the loop parallelisation facilities offered by frameworks such as OpenMP and Intel TBB. By exploiting the low run-time overhead of the FastFlow parallel skeletons and the new facilities offered by the C++11 standard, our ParallelFor skeleton succeeds in obtaining comparable or better performance than both OpenMP and TBB on the Intel Phi many-core and Intel Nehalem multi-core for the set of benchmarks considered, while requiring a comparable programming effort.
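
The ParallelFor skeleton itself is a C++11/FastFlow construct; as a language-neutral illustration of the underlying idea (the skeleton owns chunking and worker management, the user supplies only the loop body), a rough Python sketch might look as follows. The function name parallel_for and its parameters are illustrative assumptions, not FastFlow's interface.

# Illustrative loop-parallel "skeleton": the helper hides chunking and the
# worker pool; the caller provides only the per-iteration body.
from concurrent.futures import ThreadPoolExecutor

def parallel_for(first, last, body, workers=4, chunk=64):
    """Apply body(i) for i in [first, last) using a pool of workers."""
    def run_chunk(lo):
        for i in range(lo, min(lo + chunk, last)):
            body(i)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(run_chunk, range(first, last, chunk)))

out = [0] * 1000
parallel_for(0, 1000, lambda i: out.__setitem__(i, i * i))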

Citation Context

...Loop parallelism is a topic that has been repeatedly investigated over the years using different approaches and techniques for iterations scheduling [5], [6], [7]. In this paper we concentrate for performance comparison on OpenMP [2] and TBB [8], which represent to a major extent, the most widely used and studied frameworks for loop parallelisations. A. O...

A Machine Learning-Based Approach for Thread Mapping on Transactional Memory Applications

by Christiane Pousa Ribeiro - In High Performance Computing Conference (HiPC), 2011
"... Abstract—Thread mapping has been extensively used as a technique to efficiently exploit memory hierarchy on modern chip-multiprocessors. It places threads on cores in order to amortize memory latency and/or to reduce memory contention. However, efficient thread mapping relies upon matching ap-plicat ..."
Abstract - Cited by 5 (0 self) - Add to MetaCart
Thread mapping has been extensively used as a technique to efficiently exploit the memory hierarchy on modern chip-multiprocessors. It places threads on cores in order to amortize memory latency and/or to reduce memory contention. However, efficient thread mapping relies upon matching application behavior with system characteristics. Software Transactional Memory (STM) applications in particular introduce another dimension due to their runtime system support. Existing STM systems implement several conflict detection and resolution mechanisms, which leads STM applications to behave differently for each combination of these mechanisms. In this paper we propose a machine learning-based approach to automatically infer a suitable thread mapping strategy for transactional memory applications. First, we profile several STM applications from the STAMP benchmark suite considering application, STM system and platform features to build a set of input instances. Then, such data feeds a machine learning algorithm, which produces a decision tree able to predict the most suitable thread mapping strategy for new unobserved instances. Results show that our approach improves performance by up to 18.46% compared to the worst case and up to 6.37% over the Linux default thread mapping strategy. Keywords: machine learning; software transactional memory; thread mapping.
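
The pipeline described (profiled features feeding a decision tree that predicts a thread mapping strategy) could be sketched as below, assuming scikit-learn is available; the feature names, training rows and strategy labels are invented for illustration and do not come from the paper.

# Illustrative sketch: profile features -> decision tree -> predicted mapping.
from sklearn.tree import DecisionTreeClassifier

# Each row: [tx_abort_ratio, llc_miss_ratio, tx_length, cores_per_socket]
X = [
    [0.05, 0.10, 120, 8],
    [0.40, 0.02, 300, 8],
    [0.02, 0.35,  80, 4],
    [0.55, 0.30, 500, 4],
]
# Thread mapping strategies the classifier chooses between
y = ["linux_default", "compact", "scatter", "round_robin"]

model = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(model.predict([[0.30, 0.05, 250, 8]]))   # strategy for a new, unseen application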

Citation Context

...iously trained ML-based prediction mechanism to each parallel loop candidate in order to select a scheduling policy from the four options implemented by OpenMP (cyclic, dynamic, guided or static). In [25], the authors proposed a ML approach to thread mapping on parallel applications developed with OpenMP. Using the machine learning approach, the proposed solution is capable of predicting the number of...

Portable Mapping of Data Parallel Programs to OpenCL for Heterogeneous Systems

by Dominik Grewe, Zheng Wang, Michael F. P. O’Boyle
"... General purpose GPU based systems are highly attractive as they give potentially massive performance at little cost. Realizing such potential is challenging due to the complexity of programming. This paper presents a compiler based approach to automatically generate optimized OpenCL code from data-p ..."
Abstract - Cited by 5 (1 self) - Add to MetaCart
General-purpose GPU-based systems are highly attractive as they give potentially massive performance at little cost. Realizing such potential is challenging due to the complexity of programming. This paper presents a compiler-based approach to automatically generate optimized OpenCL code from data-parallel OpenMP programs for GPUs. Such an approach brings together the benefits of a clear high-level language (OpenMP) and an emerging standard (OpenCL) for heterogeneous multi-cores. A key feature of our scheme is that it leverages existing transformations, especially data transformations, to improve performance on GPU architectures and uses predictive modeling to automatically determine if it is worthwhile running the OpenCL code on the GPU or the OpenMP code on the multi-core host. We applied our approach to the entire NAS parallel benchmark suite and evaluated it on two distinct GPU-based systems: Core i7/NVIDIA GeForce GTX 580 and Core i7/AMD Radeon 7970. We achieved average (up to) speedups of 4.51x and 4.20x (143x and 67x) respectively over a sequential baseline. This is, on average, a factor of 1.63 and 1.56 times faster than a hand-coded, GPU-specific OpenCL implementation developed by independent expert programmers.
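
The device-selection step (predictive modeling that decides whether a kernel should run as OpenCL on the GPU or as OpenMP on the host) might be sketched as follows; the features, training data and use of a decision tree are illustrative assumptions, not the paper's actual model.

# Illustrative device-selection model: code features -> "gpu" or "cpu".
from sklearn.tree import DecisionTreeClassifier

# [compute_to_memory_ratio, transfer_bytes_per_work_item, iteration_count]
X = [
    [8.0, 4,   1_000_000],
    [0.5, 64,     10_000],
    [6.0, 8,     500_000],
    [0.8, 32,      5_000],
]
y = ["gpu", "cpu", "gpu", "cpu"]

device_model = DecisionTreeClassifier().fit(X, y)
print(device_model.predict([[4.0, 16, 200_000]]))   # -> predicted best device for a new kernel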

Citation Context

... the data is already on the GPU. Predictive Modeling: In addition to optimizing sequential programs [6], recent studies have shown that predictive modeling is effective in optimizing parallel programs [25, 26]. The Qilin [19] compiler uses off-line profiling to create a regression model that is employed to predict a data parallel program’s execution time. Unlike Qilin, our approach does not require any pro...

Exploiting Inter-Sequence Correlations for Program Behavior Prediction

by Bo Wu, Zhijia Zhao, Xipeng Shen, Yulian Jiang, Yaoqing Gao, Raul Silvera
"... Prediction of program dynamic behaviors is fundamental to program optimizations, resource management, and architecture reconfigurations. Most existing predictors are based on locality of program behaviors, subject to some inherent limitations. In this paper, we revisit the design philosophy and syst ..."
Abstract - Cited by 4 (3 self) - Add to MetaCart
Prediction of program dynamic behaviors is fundamental to program optimizations, resource management, and architecture reconfigurations. Most existing predictors are based on locality of program behaviors, subject to some inherent limitations. In this paper, we revisit the design philosophy and systematically explore a second source of clues: statistical correlations between the behavior sequences of different program entities. Concentrating on loops, we examine the correlations’ existence, strength, and value in enhancing the design of program behavior predictors. We create the first taxonomy of program behavior sequence patterns. We develop a new form of predictors, named sequence predictors, to effectively translate the correlations into large-scope, proactive predictions of program behavior sequences. We demonstrate the usefulness of the prediction in dynamic version selection and loop importance estimation, showing 19% average speedup on a number of real-world utility applications. By taking scope and timing of behavior prediction as the first-order design objectives, the new approach overcomes limitations of existing program behavior predictors, opening up many new opportunities for runtime optimizations at various layers of computing.
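
A toy sketch of the idea, under the assumption that one loop's behavior sequence (for example, its trip counts) is linearly correlated with another's: once the cross-loop correlation is learned, observing the first loop yields a proactive prediction of the second loop's entire upcoming sequence. The data and the linear model are invented for illustration.

# Toy "sequence predictor": predict loop B's sequence from loop A's sequence.
import numpy as np

# Observed during a profiling run: trip counts of two correlated loops
loop_a = np.array([100, 200, 150, 400, 250])
loop_b = np.array([ 55, 105,  80, 205, 130])

# Fit the cross-loop correlation (here a simple least-squares line)
slope, intercept = np.polyfit(loop_a, loop_b, 1)

# In a later run, loop A executes first; its sequence lets us predict loop B's
# whole upcoming sequence before loop B runs (large-scope, proactive prediction).
new_loop_a = np.array([300, 120, 500])
predicted_loop_b = slope * new_loop_a + intercept
print(predicted_loop_b.round())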

Citation Context

...hat uses Machine Learning (ML) techniques to assist program optimizations. Examples include prediction of the optimization levels for a Java method [6], prediction of suitable parallelization schemes [26], and so on. This current work is complementary to these previous techniques. These studies typically build predictive models mapping from program code features (e.g., portions of various instructions...

Automatically tuning parallel and parallelized programs

by Chirag Dave, Rudolf Eigenmann - Languages and Compilers for Parallel Computing, 2010
"... Abstract. In today’s multicore era, parallelization of serial code is es-sential in order to exploit the architectures ’ performance potential. Par-allelization, especially of legacy code, however, proves to be a challenge as manual efforts must either be directed towards algorithmic modifica-tions ..."
Abstract - Cited by 4 (0 self) - Add to MetaCart
In today’s multicore era, parallelization of serial code is essential in order to exploit the architectures’ performance potential. Parallelization, especially of legacy code, however, proves to be a challenge, as manual efforts must either be directed towards algorithmic modifications or towards analysis of computationally intensive sections of code for the best possible parallel performance, both of which are difficult and time-consuming. Automatic parallelization uses sophisticated compile-time techniques in order to identify parallelism in serial programs, thus reducing the burden on the program developer. Similar sophistication is needed to improve the performance of hand-parallelized programs. A key difficulty is that optimizing compilers are generally unable to estimate the performance of an application or even a program section at compile time, and so the task of performance improvement invariably rests with the developer. Automatic tuning uses static analysis and runtime performance metrics to determine the best possible compile-time approach for optimal application performance. This paper describes an offline tuning approach that uses a source-to-source parallelizing compiler, Cetus, and a tuning framework to tune parallel application performance. The implementation uses an existing, generic tuning algorithm called Combined Elimination to study the effect of serializing parallelizable loops based on measured whole-program execution time, and provides as an outcome a combination of parallel loops that equals or improves the performance of the original program. We evaluated our algorithm on a suite of hand-parallelized C benchmarks from the SPEC OMP2001 and NAS Parallel benchmarks and provide two sets of results. The first ignores hand-parallelized loops and only tunes application performance based on Cetus-parallelized loops. The second set of results considers the tuning of additional parallelism in hand-parallelized code. We show that our implementation always performs near-equal to or better than serial code while tuning only Cetus-parallelized loops, and equal to or better than hand-parallelized code while tuning additional parallelism.
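
A sketch in the spirit of this offline tuning loop: start with every candidate loop parallel, repeatedly measure the whole program with one loop serialized at a time, and keep the serialization that helps most. This greedy variant only approximates Combined Elimination, and run_program below is a stand-in for building with Cetus and timing the resulting binary; the loop names and timings are invented.

# Greedy serialization tuning sketch (an approximation of Combined Elimination).
import random

CANDIDATE_LOOPS = ["loop1", "loop2", "loop3", "loop4"]

def run_program(parallel_loops):
    """Placeholder for 'build with these loops parallel and measure run time'."""
    return 10.0 - 0.5 * len(parallel_loops) + random.uniform(0, 0.2) \
           + (2.0 if "loop3" in parallel_loops else 0.0)   # pretend loop3 hurts when parallel

parallel = set(CANDIDATE_LOOPS)
best_time = run_program(parallel)
improved = True
while improved:
    improved = False
    # Try serializing each remaining parallel loop and keep the best single change
    trials = {loop: run_program(parallel - {loop}) for loop in parallel}
    loop, t = min(trials.items(), key=lambda kv: kv[1])
    if t < best_time:
        parallel.discard(loop)
        best_time = t
        improved = True
print("loops kept parallel:", sorted(parallel), "time:", round(best_time, 2))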

Citation Context

... execution time. The two methods actually complement each other, and such complementary approaches have been proposed that use run-time profile information to drive optimization tuning at compile time [3, 7]. Our work also aims at a third goal. Most of today’s tuning systems apply a chosen compilation variant to the whole program. An important un-met need is to customize compiler techniques on a section-...

Auto-tuning parallel skeletons

by Alexander Collins, Christian Fensch, Hugh Leather - Parallel Processing Letters, 2012
"... ABSTRACT Parallel skeletons are a structured parallel programming abstraction that provide programmers with a predefined set of algorithmic templates that can be combined, nested and parameterized with sequential code to produce complex programs. The implementation of these skeletons is currently a ..."
Abstract - Cited by 3 (0 self) - Add to MetaCart
Parallel skeletons are a structured parallel programming abstraction that provides programmers with a predefined set of algorithmic templates that can be combined, nested and parameterized with sequential code to produce complex programs. The implementation of these skeletons is currently a manual process, requiring human expertise to choose suitable implementation parameters that provide good performance. This paper presents an empirical exploration of the optimization space of the FastFlow parallel skeleton framework. We performed this using a Monte Carlo search of a random subset of the space, for a representative set of platforms and programs. The results show that the space is program and platform dependent, non-linear, and that automatic search achieves a significant average speedup in program execution time of 1.6× over a human expert. An exploratory data analysis of the results shows a linear dependence between two of the parameters, and that another two parameters have little effect on performance. These properties are then used to reduce the size of the space by a factor of 6, reducing the cost of the search. This provides a starting point for automatically optimizing parallel skeleton programs without the need for human expertise, and with a large improvement in execution time compared to that achievable using human expert tuning.
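
The Monte Carlo search over the skeleton parameter space amounts to sampling random configurations, measuring each, and keeping the best, as in the sketch below; the parameter names and the measure placeholder are invented for illustration and are not FastFlow's actual tuning knobs.

# Random (Monte Carlo) search over a skeleton configuration space.
import random

SPACE = {
    "num_workers": [1, 2, 4, 8, 16, 32],
    "chunk_size":  [1, 16, 64, 256, 1024],
    "scheduling":  ["static", "dynamic"],
    "mapping":     ["compact", "scatter"],
}

def measure(cfg):
    """Placeholder: run the skeleton program with cfg and return execution time."""
    return random.uniform(1.0, 10.0) / (1 + cfg["num_workers"] ** 0.5)

best_cfg, best_time = None, float("inf")
for _ in range(100):                               # a random subset of the space
    cfg = {k: random.choice(v) for k, v in SPACE.items()}
    t = measure(cfg)
    if t < best_time:
        best_cfg, best_time = cfg, t
print(best_cfg, round(best_time, 3))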

Citation Context

... Their method requires 3,000 iterations and achieves about 60% of the best obtainable performance. Dastgeer et al. [19] use machine learning to autotune a skeleton for simple data parallel operations. However, this work only evaluates autotuning for one parameter and with one application. In addition, several groups have looked at autotuning of generic parallel programming frameworks. Contreras and Martonosi investigate how the scheduler in Intel’s Threading Building Blocks library can be optimized [20]; while Wang and O’Boyle optimize thread number and scheduling strategy for OpenMP programs [21]. 8. Conclusions and Future Work The experiments carried out have demonstrated that there is significant scope for improvement in performance over a human expert, with an average speedup across all the programs and platforms of 1.6×. This is an impressive result, as human expertise is usually capable of doing as well as, if not better than, a compiler. They also demonstrate that the space is non-linear, and program and platform dependent, which makes it very difficult for a human expert to manually tune programs. Our exploratory analysis of the optimization space performs an initial step in de...
