| S. Chatterjee. Compiling nested data-parallel programs for shared-memory multiprocessors. ACM Trans. Prog. Lang. Syst., 15(3):400--462, July 1993. |
....converts a nested parallel program to an equivalent flat program without reducing the parallelism specified in the original program. Despite the importance of this transformation, on its own, it is not sufficient for producing good code for modern processors. Subsequent work, in particular Chatterjee's thesis [9] and the second author's thesis ....
[Fig. 1. Architecture space. Axes: scalar / vector; single memory / shared memory / distributed memory. Machines shown: Workstations (SPARC etc.), VX 1 (Fujitsu), E6500 (Sun), T94 (Cray), Origin 2000 (SGI), T3E (Cray), AP 3000 (Fujitsu), VPP 700 (Fujitsu), Workstation Clusters.]
....summarises and evaluates the benchmarks. Finally, Section 6 briefly reviews related work and concludes. 2 The Architecture Space. Previous work addressed the implementation of nested data parallelism on a range of different machines, such as vector processors [3, 10], shared memory multiprocessors [9, 10], and distributed memory machines [6, 14]. However, the implementations, while being based on flattening, used different optimisation techniques and enjoyed various levels of success. We are investigating to what extent we can target the full range of architectures with a uniform compilation system ....
S. Chatterjee. Compiling nested data-parallel programs for shared-memory multiprocessors. ACM Transactions on Programming Languages and Systems, 15(3), July 1993.
....to tackle the mentioned problems. We extend flattening such that it maps the new structures to efficient, flat data parallel code. Our extension fits easily into existing formalizations and implementations of flattening; in particular, the optimization techniques of previous work [PP93,PPW95,KS96,Cha93,Kel98] remain applicable. This paper makes the following three main contributions: (1) it demonstrates the usefulness of recursive types for nested data parallel languages (Section 2); (2) it formally specifies our extension of flattening including user defined recursive types (Section 3); and (3) it ....
....operations on nested vectors more efficiently. We delay the discussion of these primitives as it benefits from knowing the target data representation. We choose Fkl as the target language as there are optimizing code generation techniques mapping it on different parallel architectures [BCH 93,Cha93,Kel98]. 3.2 Concrete Data Representation. Before presenting the instantiation procedure for polymorphic primitives, we discuss an efficient target representation of nested vectors, vectors of tuples, and vectors of recursive types. Nested vectors. We can represent nested vectors of basic type ....
S. Chatterjee. Compiling nested data-parallel programs for shared-memory multiprocessors. ACM Trans. on Prog. Lang. and Systems, 15(3), 1993.
....elements. In fact, CMU's implementation does not generate an executable, but instead emits VCODE, which is interpreted at runtime by an interpreter linked to their C Vector Library (CVL) [3]. Unfortunately, this approach, while working fine for vector computers, is not satisfactory for shared memory [7] and distributed memory machines [14]. The problems with this approach have three main causes: (1) processor caches are badly utilized, (2) communication operations cannot be merged, and (3) data distribution is too rigid. In short, the program cannot be optimized for the memory hierarchy. ....
....the deficiency of the library approach, consider that in the two dimensional Barnes Hut algorithm, we achieve a speedup of about a factor of 10 by fusion. 6 Related Work and Conclusions. With regard to implementing nested parallelism, Chatterjee et al.'s work, which covers shared memory machines [7], is probably closest to ours; however, their work is less general and not easily adapted for distributed memory. In the BMF community, p bounded lists are used to model block distributions of lists [22]; this technique is related to distributed types, but the latter are more general as they (a) ....
S. Chatterjee. Compiling nested data-parallel programs for shared-memory multiprocessors. ACM Transactions on Programming Languages and Systems, 15(3):400--462, July 1993.
....its properties. Finally, section 5 describes the setup and results of the experiments evaluating the effectiveness of our techniques. 2 Related Work. Several researchers have studied different variations of the synchronization elimination problem in the context of compiling data parallel programs [7, 10, 13, 14, 16, 17]. Our approach shares some similarities with the work of Hatcher and Quinn [10]. They use a data parallel language that assigns a private address space to each virtual processor. Data from other virtual processors can only be accessed by explicit communication. Hence, synchronizations are only ....
....are not explicitly visible; compilers need sophisticated data dependence analysis to capture access interferences. Furthermore, our solution is more general than the work of Hatcher and Quinn because our restructuring algorithm works on (sub)expressions and is extremely fine grained. The article [7] by Chatterjee focuses on the compilation of VCODE for shared memory multiprocessors. VCODE is a low level, data parallel vector language intended to serve as the target for optimizing compilers of higher level languages. It is based on the shared address space paradigm and allows for nested ....
S. Chatterjee. Compiling nested data-parallel programs for shared memory multiprocessors. ACM TOPLAS, 15(3):400--462, July 1993.
....and collections allows the MultiSet to be a reusable abstraction that presents a concurrent interface. This combination supports reusable libraries of concurrent abstractions. 4.4 Discussion. Collections in ICC represent a unification of collections as distributed arrays of objects as in [29, 10] and the aggregate approach as in [18]. The array approach is more compatible with the preexisting C notion of arrays and offers the advantage of separating the collection and constituent types. This can allow distinct members to be defined upon each type. A drawback to the independence of the ....
S. Chatterjee. Compiling nested data parallel programs for shared memory multiprocessors. ACM Transactions on Programming Languages and Systems, 15(3), 1993.
....to be a reusable abstraction that presents a concurrent interface. This combination makes libraries of concurrent abstractions possible, a key technology for concurrent programming. 4.4 Discussion. Collections in ICC represent a unification of collections as distributed arrays of objects as in [32, 12] and the aggregate approach as in [21]. The array approach is more compatible with the preexisting C notion of arrays and offers the advantage of separating the collection and constituent types. This can allow distinct members to be defined upon each type. A drawback to the independence of the ....
S. Chatterjee. Compiling nested data parallel programs for shared memory multiprocessors. ACM Transactions on Programming Languages and Systems, 15(3), 1993.
....irregular applications. The alignment adds a significant task to program formulation, and generally forces the programmer to build structures quite different from sequential program versions. This increases the programming effort. More flexible models of data parallelism such as nested parallelism [3, 8] are more promising. Another body of related work is a series of application studies for cache coherent shared memory systems, which employ a model of shared address space and threads [42, 37]. While we have leveraged these studies, borrowing the algorithmic structure and data locality and load ....
S. Chatterjee. Compiling nested data parallel programs for shared memory multiprocessors. ACM Transactions on Programming Languages and Systems, 15(3), 1993.
No context found.
S. Chatterjee. Compiling nested data-parallel programs for shared-memory multiprocessors. ACM Trans. Prog. Lang. Syst., 15(3):400--462, July 1993.
No context found.
Chatterjee, S. Compiling nested data-parallel programs for shared memory multiprocessors. ACM Trans. Program. Lang. Syst. 15, 3 (July 1993), 400--462.
....that multiple generators can match with a single accumulator and vice versa. Furthermore, generator/accumulator pairs that exhibit the same piecewise structural behavior are conformable and should be placed in the same piecewise execution loop for best performance. Chatterjee's size inference [8] can be used to identify conforming operations. Restrict operations require the introduction of additional loops to provide the data dependent number of input pieces necessary for restrict to generate an output piece. Finally, piecewise execution loops require complex control flow mechanisms to ....
....Waters speculated on extending his program transformations to handle nested series expressions, he did not implement it. In 1993, Chatterjee compiled nested data parallel programs to increase code granularity and relax lock step synchrony so the programs could effectively execute on MIMD machines [8]. Although his compiler did not implement the fixed memory evaluation of Abrams, he was the first to apply temporary elimination in the context of nested data parallel programs. His system used loop fusion to eliminate intermediate temporary storage from transformed NESL programs. 7.2 System ....
S. Chatterjee. Compiling nested data-parallel programs for shared-memory multiprocessors. ACM Trans. Prog. Lang. Syst., 15(3):400--462, July 1993.
....work compared with a sequential implementation. However, full parallelism and optimal load balance are easily achieved in this approach. Compile time techniques to fuse data parallel operations can reduce the number of barrier synchronizations, decrease space requirements, and improve reuse [12, 24]. The two approaches are illustrated for a nested data parallel computation and its associated dependence graph 1 in Fig. 3. Here G and H denote assignment statements that can not introduce additional dependences, since there can be no data dependences between iterations of FORALL loops. In ....
S. Chatterjee. Compiling nested data-parallel programs for shared-memory multiprocessors. ACM Trans. Prog. Lang. Syst., 15(3):400--462, July 1993.
....work compared with a sequential implementation. However, full parallelism and optimal load balance are easily achieved in this approach. Compile time techniques to fuse data parallel operations can reduce the number of barrier synchronizations, decrease space requirements, and improve reuse [12, 24]. The two approaches are illustrated for a nested data parallel computation and its associated dependence graph 1 in Fig. 3. Here G and H denote assignment statements that cannot introduce additional dependences, since there can be no data dependences between iterations of FORALL loops. In Fig. ....
S. Chatterjee. Compiling nested data-parallel programs for shared-memory multiprocessors. ACM Trans. Prog. Lang. Syst., 15(3):400--462, July 1993.
....problem in two or more dimensions is the same as the Geometric Steiner Tree problem, which is NP hard for the grid and discrete metrics. If common subexpressions are allowed, the exact optimization problem is again NP complete (Gilbert and Schreiber 1991; Mace 1987). In a companion paper (Chatterjee et al. 1993b), we extend our algorithms (heuristically) to basic blocks, which can be described as directed acyclic graphs. 1.1.2 Communication Cost. We assume that the cost of communicating data between two different positions p and q can be described as w · d(p, q), where w is the amount of data ....
....Our model makes all intermediate values explicit, as does (for example) three address code (Aho et al. 1986). We expect that a later phase of compilation will perform storage optimization. Cytron et al. (1991) describe such optimizations in the context of static single assignment form; Chatterjee (1993) discusses similar optimizations for compiling fine grained data parallel programs for shared memory multiprocessors. We believe that such techniques can be adapted to the storage optimization problem for array expressions. ....
[Article contains additional citation context not shown here]
Chatterjee, S. 1993. Compiling nested data-parallel programs for shared-memory multiprocessors. ACM Trans. Program. Lang. Syst. 15, 3 (July), 400--462.
....work compared with a sequential implementation. However, full parallelism and optimal load balance are easily achieved in this approach. Compile time techniques to fuse data parallel operations can reduce the number of barrier synchronizations, decrease space requirements, and improve reuse [14, 31, 19]. To flatten the sparse matrix vector product smvp, we replace the nested sequence representation of A with a linearized (flattened) representation (A', s). Here A' is an array of r pairs, indexed by val and col, partitioned into rows of A by s, i.e. s is an array of n integers equal to ....
S. Chatterjee. Compiling nested data-parallel programs for shared-memory multiprocessors. ACM Trans. Prog. Lang. Syst., 15(3):400--462, July 1993.
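The flattened representation described in the last excerpt can be illustrated with a minimal Python sketch. It is not taken from any of the cited papers: the function names (smvp_nested, flatten, smvp_flat) are hypothetical, and because the excerpt is cut off before it defines s, the sketch assumes that s holds the number of nonzero entries in each row.

```python
# Illustrative sketch only; names and the meaning of `seg` are assumptions.

def smvp_nested(rows, x):
    """Sparse matrix-vector product on the nested representation:
    one inner list of (val, col) pairs per matrix row."""
    return [sum(val * x[col] for val, col in row) for row in rows]


def flatten(rows):
    """Replace the nested rows by one flat array of (val, col) pairs plus a
    segment descriptor. ASSUMPTION: the descriptor stores row lengths."""
    pairs = [pair for row in rows for pair in row]
    seg = [len(row) for row in rows]
    return pairs, seg


def smvp_flat(pairs, seg, x):
    """Same product on the flat representation."""
    # Step 1: a single flat data-parallel multiply over all nonzeros.
    products = [val * x[col] for val, col in pairs]
    # Step 2: a segmented sum that recovers one result per original row.
    result = []
    start = 0
    for length in seg:
        result.append(sum(products[start:start + length]))
        start += length
    return result


rows = [[(2.0, 0), (1.0, 2)], [], [(3.0, 1)]]
x = [1.0, 2.0, 3.0]
pairs, seg = flatten(rows)
print(smvp_flat(pairs, seg, x) == smvp_nested(rows, x))
```

The multiply in step 1 runs over all nonzeros at once, regardless of how unevenly they are spread across rows; only the segmented sum needs the row structure, which is the property that makes the flat form attractive for load balancing.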