Results 1 - 10
of
76
Blocking and Array Contraction Across Arbitrarily Nested Loops Using Affine Partitioning
, 2001
"... Applicable to arbitrary sequences and nests of loops, affine partitioning is a program transformation framework that unifies many previously proposed loop transformations, including unimodular transforms, fusion, fission, reindexing, scaling and statement reordering. Algorithms based on affine parti ..."
Abstract
-
Cited by 80 (1 self)
- Add to MetaCart
(Show Context)
Applicable to arbitrary sequences and nests of loops, affine partitioning is a program transformation framework that unifies many previously proposed loop transformations, including unimodular transforms, fusion, fission, reindexing, scaling and statement reordering. Algorithms based on affine partitioning have been shown to be effective for parallelization and communication minimization. This paper presents algorithms that improve data locality using affine partitioning. Blocking and array contraction are two important optimizations that have been shown to be useful for data locality. Blocking creates a set of inner loops so that data brought into the faster levels of the memory hierarchy can be reused. Array contraction reduces an array to a scalar variable and thereby reduces the number of memory operations executed and the memory footprint. Loop transforms are often necessary to make blocking and array contraction possible.
Revisiting the sequential programming model for multi-core
- In Proceedings of the 40th Annual ACM/IEEE International Symposium on Microarchitecture
, 2007
"... Single-threaded programming is already considered a complicated task. The move to multi-threaded programming only increases the complexity and cost involved in software development due to rewriting legacy code, training of the programmer, increased debugging of the program, and efforts to avoid race ..."
Abstract
-
Cited by 73 (11 self)
- Add to MetaCart
(Show Context)
Single-threaded programming is already considered a complicated task. The move to multi-threaded programming only increases the complexity and cost involved in software development due to rewriting legacy code, training of the programmer, increased debugging of the program, and efforts to avoid race conditions, deadlocks, and other problems associated with parallel programming. To address these costs, other approaches, such as automatic thread extraction, have been explored. Unfortunately, the amount of parallelism that has been automatically extracted is generally insufficient to keep many cores busy. This paper argues that this lack of parallelism is not an intrinsic limitation of the sequential programming model, but rather occurs for two reasons. First, there exists no framework for automatic thread extraction that brings together key existing state-of-the-art compiler and hardware techniques. This paper shows that such a framework can yield scalable parallelization on several SPEC CINT2000 benchmarks. Second, existing sequential programming languages force programmers to define a single legal program outcome, rather than allowing for a range of legal outcomes. This paper shows that natural extensions to the sequential programming model enable parallelization for the remainder of the SPEC CINT2000 suite. Our experience demonstrates that, by changing only 60 source code lines, all of the C benchmarks in the SPEC CINT2000 suite were parallelizable by automatic thread extraction. This process, constrained by the limits of modern optimizing compilers, yielded a speedup of 454 % on these applications. 1
Finding effective optimization phase sequences
- In Proceedings of the 2003 ACM SIGPLAN conference on Language, Compiler, and Tool for Embedded Systems
, 2003
"... It has long been known that a single ordering of optimization phases will not produce the best code for every application. This phase ordering problem can be more severe when generating code for embedded systems due to the need to meet conflicting constraints on time, code size, and power consumptio ..."
Abstract
-
Cited by 71 (12 self)
- Add to MetaCart
(Show Context)
It has long been known that a single ordering of optimization phases will not produce the best code for every application. This phase ordering problem can be more severe when generating code for embedded systems due to the need to meet conflicting constraints on time, code size, and power consumption. Given that many embedded application developers are willing to spend time tuning an application, we believe a viable approach is to allow the developer to steer the process of optimizing a function. In this paper, we describe support in VISTA, an interactive compilation system, for finding effective sequences of optimization phases. VISTA provides the user with dynamic and static performance information that can be used during an interactive compilation session to gauge the progress of improving the code. In addition, VISTA provides support for automatically using performance information to select the best optimization sequence among several attempted. One such feature is the use of a genetic algorithm to search for the most efficient sequence based on specified fitness criteria. We hav e included a number of experimental results that evaluate the effectiveness of using a genetic algorithm in VISTA to find effective optimization phase sequences.
Tracking Pointers with Path and Context Sensitivity for Bug Detection in C Programs
- in ACM SIGSOFT Symposium on the Foundations of Software Engineering
, 2003
"... This paper proposes a pointer alias analysis for automatic error detection. State-of-the-art pointer alias analyses are either too slow or too imprecise for finding errors in real-life programs. We propose a hybrid pointer analysis that tracks actively manipulated pointers held in local variables an ..."
Abstract
-
Cited by 45 (3 self)
- Add to MetaCart
(Show Context)
This paper proposes a pointer alias analysis for automatic error detection. State-of-the-art pointer alias analyses are either too slow or too imprecise for finding errors in real-life programs. We propose a hybrid pointer analysis that tracks actively manipulated pointers held in local variables and parameters accurately with path and context sensitivity and handles pointers stored in recursive data structures less precisely but efficiently. We make the unsound assumption that pointers passed into a procedure, in parameters, global variables, and locations reached by applying simple access paths to parameters and global variables, are all distinct from each other and from any other locations. This assumption matches the semantics of many functions, reduces spurious aliases and speeds up the analysis. We present a program representation...
Refactoring sequential java code for concurrency via concurrent libraries.
- In 31st International Conference on Software Engineering (ICSE),
, 2009
"... Abstract Parallelizing existing sequential programs to run efficiently on multicores is hard. The Java 5 package java.util.concurrent (j.u.c.) supports writing concurrent programs: much of the complexity of writing threadsafe and scalable programs is hidden in the library. To use this package, prog ..."
Abstract
-
Cited by 38 (7 self)
- Add to MetaCart
(Show Context)
Abstract Parallelizing existing sequential programs to run efficiently on multicores is hard. The Java 5 package java.util.concurrent (j.u.c.) supports writing concurrent programs: much of the complexity of writing threadsafe and scalable programs is hidden in the library. To use this package, programmers still need to reengineer existing code. This is tedious because it requires changing many lines of code, is error-prone because programmers can use the wrong APIs, and is omission-prone because programmers can miss opportunities to use the enhanced APIs. This paper presents our tool, CONCURRENCER, that enables programmers to refactor sequential code into parallel code that uses three j.u.c. concurrent utilities. CONCURRENCER does not require any program annotations. Its transformations span multiple, non-adjacent, program statements. A find-and-replace tool can not perform such transformations, which require program analysis. Empirical evaluation shows that CONCURRENCER refactors code effectively: CON-CURRENCER correctly identifies and applies transformations that some open-source developers overlooked, and the converted code exhibits good speedup.
Exposing speculative thread parallelism in SPEC2000
- In Proc. of PPoPP’05
, 2005
"... As increasing the performance of single-threaded processors becomes increasingly difficult, consumer desktop processors are moving toward multi-core designs. One way to enhance the performance of chip multiprocessors that has received considerable attention is the use of thread-level speculation (TL ..."
Abstract
-
Cited by 32 (0 self)
- Add to MetaCart
(Show Context)
As increasing the performance of single-threaded processors becomes increasingly difficult, consumer desktop processors are moving toward multi-core designs. One way to enhance the performance of chip multiprocessors that has received considerable attention is the use of thread-level speculation (TLS). As a case study, we manually parallelized several of the SPEC CPU2000 floating point and integer applications using TLS. The use of manual parallelization enabled us to apply techniques and programmer expertise that are beyond the current capabilities of automated parallelizers. With the experience gained from this, we provide insight into ways to aggressively apply TLS to parallelize applications for high performance. This information can help guide future advanced TLS compiler design. For each application, we discuss how and where parallelism was located within the application, the impediments to extracting this parallelism using TLS, and the code transformations that were required to overcome these impediments. We also generalize these experiences to a discussion of common hindrances to TLS parallelization, and describe methods of programming that help expose application parallelism to TLS systems. These guidelines can assist developers of uniprocessor programs to create applications that can easily port to TLS systems and yield good performance. By using manual parallelization on SPEC2000, we provide guidance on where thread-level parallelism exists in these well known benchmarks, what limits its extraction, how to reduce these limitations and what performance can be expected on these applications from a chip multiprocessor system with TLS.
An Empirical Study of Static Program Slice Size
"... This article presents results from a study of all slices from 43 programs, ranging up to 136,000 lines of code in size. The study investigates the effect of five aspects that affect slice size. Three slicing algorithms are used to study two algorithmic aspects: calling-context treatment and slice gr ..."
Abstract
-
Cited by 29 (4 self)
- Add to MetaCart
(Show Context)
This article presents results from a study of all slices from 43 programs, ranging up to 136,000 lines of code in size. The study investigates the effect of five aspects that affect slice size. Three slicing algorithms are used to study two algorithmic aspects: calling-context treatment and slice granularity. The remaining three aspects affect the upstream dependencies considered by the slicer. These include collapsing structure fields, removal of dead code, and the influence of points-to analysis. The results show that for the most precise slicer, the average slice contains just under one-third of the program. Furthermore, ignoring calling context causes a 50 % increase in slice size, and while (coarse-grained) function-level slices are 33 % larger than corresponding statement-level slices, they may be useful predictors of the (finer-grained) statement-level slice size. Finally, upstream analyses have an order of magnitude less influence on slice size.
Kremlin: Rethinking and rebooting gprof for the multicore age
- In PLDI
, 2011
"... ..."
(Show Context)
Resolution, Optimization, and Encoding of Pointer Variables for the Behavioral Synthesis from C
, 2001
"... As designers may model mixed hardware--software systems using a subset of or ++, we present SpC, a solution to synthesize and optimize hardware models with pointers. In hardware, a pointer is not only the address of data in memory, but it may also reference data mapped to registers, ports, or wires. ..."
Abstract
-
Cited by 13 (2 self)
- Add to MetaCart
As designers may model mixed hardware--software systems using a subset of or ++, we present SpC, a solution to synthesize and optimize hardware models with pointers. In hardware, a pointer is not only the address of data in memory, but it may also reference data mapped to registers, ports, or wires. Pointer analysis is used to find the set of locations each pointer may reference in a program at compile time. In this paper, we address the problem of synthesizing and optimizing pointers to multiple variables or array elements. The value of the pointers are encoded and branching statements are used to dynamically access data referenced by pointers. A heuristic is used to efficiently encode the values of the pointers. Compiler techniques are also used to reduce storage before loads and stores. An implementation using the SUIF framework (Wilson et al., 1994; SUIF Compiler Framework) is presented, followed by some case studies and experimental results.