Results 1 - 10 of 49
Parallelizing sequential programs with statistical accuracy tests
"... We present QuickStep, a novel system for parallelizing sequential programs. Unlike standard parallelizing compilers (which are designed to preserve the semantics of the original sequential computation), QuickStep is instead designed to generate (potentially nondeterministic) parallel programs that p ..."
Cited by 19 (15 self)
We present QuickStep, a novel system for parallelizing sequential programs. Unlike standard parallelizing compilers (which are designed to preserve the semantics of the original sequential computation), QuickStep is instead designed to generate (potentially nondeterministic) parallel programs that produce acceptably accurate results acceptably often. The freedom to generate parallel programs whose output may differ (within statistical accuracy bounds) from the output of the sequential program enables a dramatic simplification of the compiler, a dramatic increase in the range of applications that it can parallelize, and a significant expansion in the range of parallel programs that it can legally generate. Results from our benchmark set of applications show that QuickStep can automatically generate acceptably accurate and efficient parallel programs—the automatically generated parallel versions of five of our six benchmark applications run between 5.0 and 7.8 times faster on eight cores than the original sequential versions. These applications and parallelizations contain features (such as the use of modern object-oriented programming constructs or desirable parallelizations with infrequent but acceptable data races) that place them inherently beyond the reach of standard approaches.
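To make the idea concrete, here is a minimal sketch of the kind of statistical accuracy test such a system could apply to a candidate parallelization; the distortion metric, bound, and trial count are assumptions for illustration, not QuickStep's actual implementation:

```python
import statistics

def distortion(parallel_out: float, sequential_out: float) -> float:
    """Relative difference between one parallel execution's output and
    the sequential baseline (a stand-in for a real accuracy metric)."""
    if sequential_out == 0.0:
        return abs(parallel_out)
    return abs(parallel_out - sequential_out) / abs(sequential_out)

def acceptably_accurate(run_parallel, sequential_out, bound=0.01, trials=30):
    """Accept a (potentially nondeterministic) parallelization if the mean
    distortion observed over `trials` executions stays within `bound`."""
    samples = [distortion(run_parallel(), sequential_out) for _ in range(trials)]
    return statistics.mean(samples) <= bound

# Hypothetical usage, where run_parallel re-executes the candidate program:
#   accept = acceptably_accurate(run_parallel, sequential_out=42.0)
```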
Kremlin: Rethinking and rebooting gprof for the multicore age
In PLDI, 2011.
"... ..."
(Show Context)
Milepost GCC: machine learning enabled self-tuning compiler
2009.
"... Contact: ..."
(Show Context)
SD3: A scalable approach to dynamic data-dependence profiling
2010.
"... Abstract—As multicore processors are deployed in mainstream computing, the need for software tools to help parallelize programs is increasing dramatically. Data-dependence profiling is an important technique to exploit parallelism in programs. More specifically, manual or automatic parallelization c ..."
Cited by 11 (1 self)
As multicore processors are deployed in mainstream computing, the need for software tools to help parallelize programs is increasing dramatically. Data-dependence profiling is an important technique for exploiting parallelism in programs: manual or automatic parallelization can use its results to decide where to parallelize a program. However, state-of-the-art data-dependence profiling techniques are not scalable, as they suffer from two major issues when profiling large and long-running applications: (1) runtime overhead and (2) memory overhead. Existing data-dependence profilers are either unable to profile large-scale applications or report only very limited information. In this paper, we propose a scalable approach to data-dependence profiling that addresses both runtime and memory overhead in a single framework. Our technique, called SD3, reduces the runtime overhead by parallelizing the dependence profiling step itself. To reduce the memory overhead, we compress memory accesses that exhibit stride patterns and compute data dependences directly in the compressed format. We demonstrate that SD3 reduces the runtime overhead when profiling SPEC 2006 by a factor of 4.1× and 9.7× on eight and 32 cores, respectively. For the memory overhead, we successfully profile SPEC 2006 with the reference input, whereas previous approaches fail even with the train input. In some cases, we observe more than a 20× improvement in memory consumption and a 16× speedup in profiling time when 32 cores are used.
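The stride-compression idea can be illustrated with a small sketch: represent a run of regular accesses as (base, step, count) and test two compressed runs for possible overlap with a GCD-based check. This is a simplified, conservative stand-in for SD3's actual dependence test; the names and the assumption of positive steps are ours:

```python
from dataclasses import dataclass
from math import gcd

@dataclass
class Stride:
    """Compressed form of the accesses base, base+step, ..., base+(count-1)*step."""
    base: int
    step: int   # assumed positive here
    count: int

    @property
    def last(self) -> int:
        return self.base + (self.count - 1) * self.step

def may_overlap(a: Stride, b: Stride) -> bool:
    """Conservative test on compressed accesses: two strided runs can share
    an address only if their ranges intersect and the offset between their
    bases is a multiple of gcd(step_a, step_b)."""
    if a.last < b.base or b.last < a.base:
        return False
    return (b.base - a.base) % gcd(a.step, b.step) == 0

# Writes to bytes 0, 4, 8, ... never meet reads of bytes 2, 6, 10, ...:
print(may_overlap(Stride(0, 4, 100), Stride(2, 4, 100)))  # False
```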
Collective Tuning Initiative: Automating and Accelerating Development and Optimization of Computing Systems
In GCC Developers' Summit, 2009.
"... Computing systems rarely deliver best possible performance due to ever increasing hardware and software complexity and limitations of the current optimization technology. Additional code and architecture optimizations are often required to improve execution time, size, power consumption, reliability ..."
Cited by 10 (6 self)
Computing systems rarely deliver the best possible performance due to ever-increasing hardware and software complexity and the limitations of current optimization technology. Additional code and architecture optimizations are often required to improve the execution time, size, power consumption, reliability, and other important characteristics of computing systems. However, this optimization is often a tedious, repetitive, isolated, and time-consuming process. In order to automate, simplify, and systematize program optimization and architecture design, we are developing the open-source, modular, plugin-based Collective Tuning Infrastructure.
Dynamic trace-based analysis of vectorization potential of applications
In ACM SIGPLAN Conference on Programming Language Design and Implementation, 2012.
"... Recent hardware trends with GPUs and the increasing vector lengths of SSE-like ISA extensions for multicore CPUs imply that effective exploitation of SIMD parallelism is critical for achieving high performance on emerging and future architectures. A vast ma-jority of existing applications were devel ..."
Cited by 9 (0 self)
Recent hardware trends with GPUs and the increasing vector lengths of SSE-like ISA extensions for multicore CPUs imply that effective exploitation of SIMD parallelism is critical for achieving high performance on emerging and future architectures. A vast majority of existing applications were developed without any attention by their developers to effective vectorizability of the code. While developers of production compilers such as GNU gcc, Intel icc, PGI pgcc, and IBM xlc have invested considerable effort and made significant advances in enhancing automatic vectorization capabilities, these compilers still cannot effectively vectorize many existing scientific and engineering codes. It is therefore of considerable interest to analyze existing applications to assess their latent potential for SIMD parallelism, exploitable through further compiler advances and/or via manual code changes.
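As a rough illustration of what a trace-based vectorizability analysis must detect, the sketch below scans a toy dynamic trace for loop-carried dependences, the accesses that prevent iterations from being packed into SIMD lanes; the trace format and function are hypothetical, not the paper's tool:

```python
from collections import defaultdict

def loop_carried_deps(trace):
    """Scan a dynamic trace of (iteration, address, is_write) records and
    report addresses touched in more than one iteration with at least one
    write -- the loop-carried dependences that block vectorization."""
    touched = defaultdict(list)  # address -> [(iteration, is_write), ...]
    for it, addr, is_write in trace:
        touched[addr].append((it, is_write))
    deps = set()
    for addr, accs in touched.items():
        iterations = {it for it, _ in accs}
        if len(iterations) > 1 and any(w for _, w in accs):
            deps.add(addr)
    return deps

# Toy trace of a[i] = a[i-1] + 1: each iteration reads what the previous wrote.
trace = [(0, 100, True), (1, 100, False), (1, 104, True),
         (2, 104, False), (2, 108, True)]
print(loop_carried_deps(trace))  # {100, 104}
```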
Transforming GCC into a research-friendly environment: plugins for optimization tuning and reordering, function cloning and program instrumentation
In 2nd International Workshop on GCC Research Opportunities (GROW'10), 2010.
"... ..."
Alter: exploiting breakable dependences for parallelization
In PLDI, 2011.
"... Abstract For decades, compilers have relied on dependence analysis to determine the legality of their transformations. While this conservative approach has enabled many robust optimizations, when it comes to parallelization there are many opportunities that can only be exploited by changing or re-o ..."
Cited by 6 (0 self)
For decades, compilers have relied on dependence analysis to determine the legality of their transformations. While this conservative approach has enabled many robust optimizations, when it comes to parallelization there are many opportunities that can only be exploited by changing or reordering the dependences in the program. This paper presents ALTER: a system for identifying and enforcing parallelism that violates certain dependences while preserving overall program functionality. Based on programmer annotations, ALTER exploits new parallelism in loops by reordering iterations or allowing stale reads. ALTER can also infer which annotations are likely to benefit the program by using a test-driven framework. Our evaluation of ALTER demonstrates that it uncovers parallelism that is beyond the reach of existing static and dynamic tools. Across a selection of 12 performance-intensive loops, 9 of which have loop-carried dependences, ALTER obtains an average speedup of 2.0x on 4 cores.
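A minimal sketch of the test-driven flavor of this approach, with hypothetical names: run each annotated variant on test inputs and keep those whose (possibly dependence-violating) outputs an application-specific checker still accepts:

```python
def surviving_variants(variants, checker, test_inputs):
    """Test-driven selection: run each candidate parallelization on the test
    inputs and keep those whose outputs the checker accepts, even though
    they may violate dependences the original program had."""
    return [name for name, run in variants
            if all(checker(run(x), x) for x in test_inputs)]

# Hypothetical example: a sum reduction tolerates reordered iterations
# (its dependence on iteration order is "breakable" up to rounding).
forward = lambda xs: sum(xs)
reordered = lambda xs: sum(reversed(xs))
checker = lambda out, xs: abs(out - forward(xs)) <= 1e-9
print(surviving_variants([("reordered", reordered)], checker, [[0.1] * 10]))
# ['reordered']
```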
Automatic parallelization with statistical accuracy bounds
2010.
"... Traditional parallelizing compilers are designed to generate paral-lel programs that produce identical outputs as the original sequen-tial program. The difficulty of performing the program analysis re-quired to satisfy this goal and the restricted space of possible target parallel programs have both ..."
Cited by 6 (4 self)
Traditional parallelizing compilers are designed to generate parallel programs that produce the same output as the original sequential program. The difficulty of performing the program analysis required to satisfy this goal and the restricted space of possible target parallel programs have both posed significant obstacles to the development of effective parallelizing compilers. The QuickStep compiler is instead designed to generate parallel programs that satisfy statistical accuracy guarantees. The freedom to generate parallel programs whose output may differ (within statistical accuracy bounds) from the output of the sequential program enables a dramatic simplification of the compiler and a significant expansion in the range of parallel programs that it can legally generate. QuickStep exploits this flexibility to take a fundamentally different approach from traditional parallelizing compilers.
A Machine Learning-Based Approach for Thread Mapping on Transactional Memory Applications
In High Performance Computing Conference (HiPC), 2011.
"... Abstract—Thread mapping has been extensively used as a technique to efficiently exploit memory hierarchy on modern chip-multiprocessors. It places threads on cores in order to amortize memory latency and/or to reduce memory contention. However, efficient thread mapping relies upon matching ap-plicat ..."
Cited by 5 (0 self)
Thread mapping has been extensively used as a technique to efficiently exploit the memory hierarchy on modern chip multiprocessors. It places threads on cores in order to amortize memory latency and/or to reduce memory contention. However, efficient thread mapping relies upon matching application behavior with system characteristics. In particular, Software Transactional Memory (STM) applications introduce another dimension due to their runtime system support. Existing STM systems implement several conflict detection and resolution mechanisms, which leads STM applications to behave differently for each combination of these mechanisms. In this paper we propose a machine learning-based approach to automatically infer a suitable thread mapping strategy for transactional memory applications. First, we profile several STM applications from the STAMP benchmark suite, considering application, STM system, and platform features to build a set of input instances. Then this data feeds a machine learning algorithm, which produces a decision tree able to predict the most suitable thread mapping strategy for new, unobserved instances. Results show that our approach improves performance by up to 18.46% compared to the worst case and by up to 6.37% over the Linux default thread mapping strategy.
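The prediction step can be sketched with an off-the-shelf decision tree; the feature set, training rows, and strategy labels below are invented for illustration and are not the paper's data:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical per-application profile features: [abort_ratio, mean_tx_length,
# cache_misses_per_kilo_instruction, thread_count]; labels name candidate
# thread mapping strategies.
X = [[0.05,  120,  2.0,  8],
     [0.40,  900, 15.0,  8],
     [0.10,  300,  9.0, 16],
     [0.60, 1500, 20.0, 16]]
y = ["linux_default", "compact", "scatter", "round_robin"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(tree.predict([[0.50, 1000, 18.0, 16]]))  # strategy for an unseen run
```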