CiteSeerX

Towards a Holistic Approach to Auto-parallelization: Integrating Profile-driven Parallelism Detection and Machine-learning Based Mapping. (2009)

by G Tournavitis, Z Wang, B Franke, M O’Boyle
Results 1 - 10 of 49

Parallelizing sequential programs with statistical accuracy tests

by Sasa Misailovic, Deokhwan Kim, Martin Rinard
Abstract - Cited by 19 (15 self)
We present QuickStep, a novel system for parallelizing sequential programs. Unlike standard parallelizing compilers (which are designed to preserve the semantics of the original sequential computation), QuickStep is instead designed to generate (potentially nondeterministic) parallel programs that produce acceptably accurate results acceptably often. The freedom to generate parallel programs whose output may differ (within statistical accuracy bounds) from the output of the sequential program enables a dramatic simplification of the compiler, a dramatic increase in the range of applications that it can parallelize, and a significant expansion in the range of parallel programs that it can legally generate. Results from our benchmark set of applications show that QuickStep can automatically generate acceptably accurate and efficient parallel programs—the automatically generated parallel versions of five of our six benchmark applications run between 5.0 and 7.8 times faster on eight cores than the original sequential versions. These applications and parallelizations contain features (such as the use of modern object-oriented programming constructs or desirable parallelizations with infrequent but acceptable data races) that place them inherently beyond the reach of standard approaches.
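The "acceptably accurate, acceptably often" criterion in this abstract can be illustrated with a small statistical test. The function names, the scalar-output assumption, and the thresholds below are purely illustrative and are not QuickStep's actual machinery:

```python
def relative_distortion(reference, observed):
    """Relative distortion of one parallel output vs. the sequential reference."""
    if reference == 0:
        return abs(observed)
    return abs(observed - reference) / abs(reference)

def acceptably_accurate(seq_output, parallel_outputs, bound=0.01, min_pass_rate=0.9):
    """Accept a candidate parallelization only if enough of its (possibly
    nondeterministic) runs produce output within the distortion bound.

    This is a toy sketch of the idea: compare each parallel run against the
    sequential output and require at least min_pass_rate of runs to pass.
    """
    passes = sum(1 for out in parallel_outputs
                 if relative_distortion(seq_output, out) <= bound)
    return passes / len(parallel_outputs) >= min_pass_rate
```

A parallelization whose runs stay within 1% of the sequential result would be kept; one with systematic 10% distortion would be rejected.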

Citation Context

...on representative inputs, dynamically observe the memory access patterns, then use the observed access patterns to suggest potential parallelizations that do not violate the observed data dependences [14, 37, 42]. The dynamic analysis may be augmented with a static analysis to recognize parallel patterns such as reductions. These potential parallelizations are typically presented to the developer for approval...

Kremlin: Rethinking and rebooting gprof for the multicore age

by Saturnino Garcia, Donghwan Jeon, Christopher M Louie, Michael Bedford Taylor - In PLDI , 2011
Abstract - Cited by 18 (3 self)
Abstract not found

Citation Context

...pared Kremlin’s plans against those parallelized by humans in the SPEC OMP2001 versions. For art and ammp, SPEC OMP versions benefit from serial optimizations compared to their SPEC 2000 counterparts [38]. To exclude the effect of serial optimizations, we applied those optimizations on the SPEC 2000 code before running Kremlin. Our evaluation included only third-party benchmarks that have preexisting ...

Milepost GCC: machine learning enabled self-tuning compiler

by Grigori Fursin, Yuriy Kashnikov, Abdul Wahid Memon, Zbigniew Chamski, Olivier Temam, Mircea Namolaru, Elad Yom-Tov, Bilha Mendelson, Ayal Zaks, Eric Courtois, Francois Bodin, Phil Barnard, Elton Ashton, Edwin Bonilla, John Thomson, Christopher K. I. Williams , 2009
Abstract - Cited by 13 (1 self)
Abstract not found

Citation Context

... sequences of optimization passes or polyhedral transformations [59, 78]. We started combining Milepost technology with machine-learning based autoparallelization and predictive scheduling techniques [60, 51, 76]. We have also started investigating staged compilation techniques to balance between static and dynamic optimizations using machine learning in LLVM or Milepost GCC4CIL connected to Mono virtual mach...

SD³: A scalable approach to dynamic data-dependence profiling

by Minjang Kim, Hyesoon Kim, Chi-keung Luk , 2010
Abstract - Cited by 11 (1 self)
Abstract—As multicore processors are deployed in mainstream computing, the need for software tools to help parallelize programs is increasing dramatically. Data-dependence profiling is an important technique to exploit parallelism in programs. More specifically, manual or automatic parallelization can use the outcomes of data-dependence profiling to guide where to parallelize in a program. However, state-of-the-art data-dependence profiling techniques are not scalable as they suffer from two major issues when profiling large and long-running applications: (1) runtime overhead and (2) memory overhead. Existing data-dependence profilers are either unable to profile large-scale applications or only report very limited information. In this paper, we propose a scalable approach to data-dependence profiling that addresses both runtime and memory overhead in a single framework. Our technique, called SD³, reduces the runtime overhead by parallelizing the dependence profiling step itself. To reduce the memory overhead, we compress memory accesses that exhibit stride patterns and compute data dependences directly in a compressed format. We demonstrate that SD³ reduces the runtime overhead when profiling SPEC 2006 by a factor of 4.1× and 9.7× on eight cores and 32 cores, respectively. For the memory overhead, we successfully profile SPEC 2006 with the reference input, while the previous approaches fail even with the train input. In some cases, we observe more than a 20× improvement in memory consumption and a 16× speedup in profiling time when 32 cores are used. Keywords: profiling, data dependence, parallel programming, program analysis, compression, parallelization.
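The stride-compression idea in this abstract can be sketched with a toy example: detect a fixed-stride pattern in an address stream, then test two compressed patterns for a possible dependence without expanding them back into address lists. All names are illustrative, and the conservative may-overlap test below is far simpler than SD³'s actual algorithm, which handles nested and mixed access patterns:

```python
from math import gcd

def compress(addresses):
    """Detect a single positive-stride pattern (base, stride, count) in an
    address stream; returns None if the stream is not a fixed-stride
    sequence. (The real profiler handles far more general patterns.)"""
    if len(addresses) < 2:
        return None
    stride = addresses[1] - addresses[0]
    if stride <= 0 or any(b - a != stride for a, b in zip(addresses, addresses[1:])):
        return None
    return (addresses[0], stride, len(addresses))

def may_conflict(p, q):
    """Conservative dependence test directly on two compressed patterns.
    Reports True when the two arithmetic progressions could share an
    address (may-overlap), using only range and gcd arithmetic."""
    (b1, s1, n1), (b2, s2, n2) = p, q
    lo1, hi1 = b1, b1 + s1 * (n1 - 1)
    lo2, hi2 = b2, b2 + s2 * (n2 - 1)
    if max(lo1, lo2) > min(hi1, hi2):          # address ranges disjoint
        return False
    return (b2 - b1) % gcd(s1, s2) == 0        # progressions can meet
```

For example, the streams 100,104,108,112 and 102,106,110 compress to (100, 4, 4) and (102, 4, 3), and the gcd test proves they never touch the same address, with no per-access comparison.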

Citation Context

...nce all memory addresses are resolved in runtime. Data-dependence profiling has already been used in parallelization efforts like speculative multithreading [4, 7, 17, 23, 29] and finding parallelism [15, 22, 25, 27, 31]. It is also being employed in the commercial tools we mentioned. However, the current algorithm for data-dependence profiling incurs significant costs of time and memory overhead. Surprisingly, altho...

Collective Tuning Initiative: automating and accelerating development and optimization of computing systems

by Grigori Fursin - GCC DEVELOPERS' SUMMIT , 2009
Abstract - Cited by 10 (6 self)
Computing systems rarely deliver best possible performance due to ever increasing hardware and software complexity and limitations of the current optimization technology. Additional code and architecture optimizations are often required to improve execution time, size, power consumption, reliability and other important characteristics of computing systems. However, it is often a tedious, repetitive, isolated and time consuming process. In order to automate, simplify and systematize program optimization and architecture design, we are developing open-source modular plugin-based Collective Tuning Infrastructure

Citation Context

...mization, dynamic data partitioning and predictive scheduling, empirical iterative compilation, statistical analysis, machine learning and decision trees together with program and dataset features [50, 62, 56, 51, 47, 60, 45]. More information about collaborative UNIDAPT R&D is available at [30]. 5 Usage Scenarios 5.1 Manual sharing of optimization cases Collective tuning infrastructure provides multiple ways to optimize ...

Dynamic trace-based analysis of vectorization potential of applications

by Justin Holewinski, Ragavendar Ramamurthi, Mahesh Ravishankar, Naznin Fauzia, Atanas Rountev, P. Sadayappan - In: ACM SIGPLAN Conf. on Programming Language Design and Implementation , 2012
Abstract - Cited by 9 (0 self)
Recent hardware trends with GPUs and the increasing vector lengths of SSE-like ISA extensions for multicore CPUs imply that effective exploitation of SIMD parallelism is critical for achieving high performance on emerging and future architectures. A vast majority of existing applications were developed without any attention by their developers towards effective vectorizability of the codes. While developers of production compilers such as GNU gcc, Intel icc, PGI pgcc, and IBM xlc have invested considerable effort and made significant advances in enhancing automatic vectorization capabilities, these compilers still cannot effectively vectorize many existing scientific and engineering codes. It is therefore of considerable interest to analyze existing applications to assess the inherent latent potential for SIMD parallelism, exploitable through further compiler advances and/or via manual code changes.

Citation Context

...zation of loops, where shadow locations are used to track dynamic dependences across loop iterations [22]. Several other approaches of similar nature have been investigated in more recent work (e.g., [2, 19, 30, 31, 35, 39]). The efficient collection of dynamic data-dependence information has also been explored by previous work. Tallam et al. use runtime control flow information to reconstruct significant portions of th...
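The shadow-location technique mentioned at the start of this excerpt can be sketched in a few lines. The trace format and names here are invented for illustration; real profilers shadow memory at instruction granularity:

```python
def loop_carried_dependences(trace):
    """Shadow-location sketch: for each address, remember the iteration of
    the last write; any access to that address from a later iteration is a
    loop-carried (true or output) dependence. `trace` is a list of
    (iteration, op, address) tuples with op in {'R', 'W'}."""
    last_write = {}   # shadow state: address -> iteration of last write
    deps = set()
    for it, op, addr in trace:
        if addr in last_write and last_write[addr] != it:
            deps.add((last_write[addr], it, addr))   # cross-iteration dep
        if op == 'W':
            last_write[addr] = it
    return deps
```

A write to address 8 in iteration 0 followed by a read of address 8 in iteration 1 would be reported as the dependence (0, 1, 8).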

Transforming GCC into a research-friendly environment: plugins for optimization tuning and reordering, function cloning and program instrumentation

by Yuanjie Huang, Liang Peng, Chengyong Wu, Yuriy Kashnikov, Jörn Rennecke, Grigori Fursin - 2ND INTERNATIONAL WORKSHOP ON GCC RESEARCH OPPORTUNITIES (GROW'10) , 2010
Abstract - Cited by 7 (2 self)
Abstract not found

Alter: exploiting breakable dependences for parallelization

by Abhishek Udupa , Kaushik Rajan , William Thies - In PLDI , 2011
Abstract - Cited by 6 (0 self)
Abstract For decades, compilers have relied on dependence analysis to determine the legality of their transformations. While this conservative approach has enabled many robust optimizations, when it comes to parallelization there are many opportunities that can only be exploited by changing or reordering the dependences in the program. This paper presents ALTER: a system for identifying and enforcing parallelism that violates certain dependences while preserving overall program functionality. Based on programmer annotations, ALTER exploits new parallelism in loops by reordering iterations or allowing stale reads. ALTER can also infer which annotations are likely to benefit the program by using a test-driven framework. Our evaluation of ALTER demonstrates that it uncovers parallelism that is beyond the reach of existing static and dynamic tools. Across a selection of 12 performance-intensive loops, 9 of which have loop-carried dependences, ALTER obtains an average speedup of 2.0x on 4 cores.

Citation Context

...plex induction variables (such as iterators through a linked list) are difficult to detect automatically. Second, even if dependences are precisely identified (e.g., using dynamic dependence analysis [43], program annotations [45], or speculative parallelization [11, 15, 21, 27, 33, 39, 44]), there remain many programs in which memory dependences are accidental artifacts of the implementation and shou...

Automatic parallelization with statistical accuracy bounds

by Sasa Misailovic, Deokhwan Kim, Martin Rinard , 2010
Abstract - Cited by 6 (4 self)
Traditional parallelizing compilers are designed to generate parallel programs that produce identical outputs as the original sequential program. The difficulty of performing the program analysis required to satisfy this goal and the restricted space of possible target parallel programs have both posed significant obstacles to the development of effective parallelizing compilers. The QuickStep compiler is instead designed to generate parallel programs that satisfy statistical accuracy guarantees. The freedom to generate parallel programs whose output may differ (within statistical accuracy bounds) from the output of the sequential program enables a dramatic simplification of the compiler and a significant expansion in the range of parallel programs that it can legally generate. QuickStep exploits this flexibility to take a fundamentally different approach from traditional parallelizing compilers. It

Citation Context

...on representative inputs, dynamically observe the memory access patterns, then use the observed access patterns to suggest potential parallelizations that do not violate the observed data dependences [34, 30, 10]. These potential parallelizations are then typically presented to the developer for approval. It is possible to use QuickStep in a similar way, to explore potential parallelizations that are then cer...

A Machine Learning-Based Approach for Thread Mapping on Transactional Memory Applications

by Christiane Pousa Ribeiro - In High Performance Computing Conference (HiPC) , 2011
Abstract - Cited by 5 (0 self)
Abstract—Thread mapping has been extensively used as a technique to efficiently exploit memory hierarchy on modern chip-multiprocessors. It places threads on cores in order to amortize memory latency and/or to reduce memory contention. However, efficient thread mapping relies upon matching application behavior with system characteristics. Particularly, Software Transactional Memory (STM) applications introduce another dimension due to their runtime system support. Existing STM systems implement several conflict detection and resolution mechanisms, which leads STM applications to behave differently for each combination of these mechanisms. In this paper we propose a machine learning-based approach to automatically infer a suitable thread mapping strategy for transactional memory applications. First, we profile several STM applications from the STAMP benchmark suite considering application, STM system and platform features to build a set of input instances. Then, such data feeds a machine learning algorithm, which produces a decision tree able to predict the most suitable thread mapping strategy for new unobserved instances. Results show that our approach improves performance up to 18.46% compared to the worst case and up to 6.37% over the Linux default thread mapping strategy. Keywords: machine learning; software transactional memory; thread mapping.
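The shape of the learned predictor described above can be sketched as a hand-written decision tree. The feature names, thresholds, and strategy labels below are hypothetical; the paper derives its actual tree from STAMP profiles and STM runtime features:

```python
# All feature names, thresholds, and strategy names are illustrative only.
def predict_mapping(features):
    """Minimal decision-tree sketch: route a profiled application to a
    thread-mapping strategy based on its features."""
    if features["abort_ratio"] > 0.3:      # high transaction contention
        return "compact"                   # co-locate threads to share caches
    if features["llc_miss_rate"] > 0.1:    # memory-bound behavior
        return "scatter"                   # spread threads across sockets
    return "linux_default"                 # fall back to the OS scheduler
```

A profiled application with many aborted transactions would be steered toward a compact placement, while a memory-bound one would be scattered for bandwidth.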

Citation Context

...sion, the program is then passed to the second stage which is responsible for mapping tasks to both GPUs and CPUs. The number of tasks to be mapped to GPUs and CPUs is determined by the predictor. In [24], the authors proposed a two-staged parallelization approach combining profiling-driven parallelism detection and ML-based mapping to generate OpenMP annotated parallel programs. In this method, first...


Developed at and hosted by The College of Information Sciences and Technology

© 2007-2019 The Pennsylvania State University