J. E. Moreira, "On the Implementation and Effectiveness of Autoscheduling for Shared-Memory Multiprocessors", Ph.D. thesis, Center for Supercomputing Research and Development, University of Illinois, Urbana, IL, Jan. 1995.
....by hand. They are based on asynchronous parallel functions that let the caller thread continue while the function is executed by another thread of control. Programmers have to learn new syntactic constructs and new semantics to use these language extensions. On the other hand, [Poly93][More95] propose to automatically parallelize applications written in standard languages (e.g. C and FORTRAN). They are based on the nano threads programming model, as defined by [Poly89a]. Nano threaded applications are able to exploit both loop and functional parallelism. Using a parallelizing compiler ....
....to record all the information available about control and data dependencies. The HTG makes it possible to handle parallel loop iterations and parallel functions equally. Although the program decomposition allows the application to adapt to a variable number of processors, the first implementation by [More95] does not consider this fact. We have designed and implemented a prototype of the complete nano threads execution environment based on a user level library (the nano threads library) and a user level CPU manager that controls the allocation of processors to the running applications ....
[Article contains additional citation context not shown here]
José E. Moreira, "On the Implementation and Effectiveness of Autoscheduling for Shared-Memory Multiprocessors", Ph.D. thesis, Department of Electrical and Computer Engineering, Univ. of Illinois at Urbana-Champaign, 1995.
....schedulers. In this work we used NthLib, a nano Threads Library [Mart96] built on top of QuickThreads, for the implementation and performance evaluation of the proposed scheduling policy. Most well known scheduling policies address simple loop parallelism, and although there has been evidence [More95] that functional parallelism can be present in substantial amounts in certain applications, little has been done to address the simultaneous exploitation of functional or irregular and loop parallelism, both in terms of compiler support and scheduling policies. Although classical DAG scheduling ....
....heuristic algorithms for loop parallelism. Execution time is used as the metric for comparison between different scheduling policies. In our experiments we used an application suite consisting of five application codes that have been used previously by other researchers for similar comparisons [More95]. These are two real life, complete applications, namely Computational Fluid Dynamics (CFD) and Molecular Dynamics (MDJ), a Complex Matrix Multiply (CMM), and two kernels that are found in several large applications, namely LU Decomposition (LU) and Adjoint Convolution (AC). We have incorporated ....
[Article contains additional citation context not shown here]
J.E. Moreira, "On the Implementation and Effectiveness of Autoscheduling for Shared-Memory Multiprocessors", PhD. thesis, Department of Electrical and Computer Engineering, Univ. of Illinois at Urbana-Champaign, 1995.
....schedulers. In this work we used NthLib, a nano Threads Library [Mart96] built on top of QuickThreads, for the implementation and performance evaluation of the proposed scheduling policy. Most well known scheduling policies address simple loop parallelism, and although there has been evidence [More95] that functional parallelism can be present in substantial amounts in certain applications, little has been done to address the simultaneous exploitation of functional or irregular and loop parallelism, both in terms of compiler support and scheduling policies. Although classical DAG scheduling ....
....for loop parallelism. Execution time is used as the metric for comparison between different scheduling policies. 4.1 Framework In our experiments we used an application suite consisting of five application codes that have been used previously by other researchers for similar comparisons [More95]. These are two real life, complete applications, namely Computational Fluid Dynamics (CFD) and Molecular Dynamics (MDJ), a Complex Matrix Multiply (CMM), and two kernels that are found in several large applications, namely LU Decomposition (LU) and Adjoint Convolution (AC). We have incorporated ....
J.E. Moreira, "On the Implementation and Effectiveness of Autoscheduling for Shared-Memory Multiprocessors", PhD. thesis, Department of Electrical and Computer Engineering, Univ. of Illinois at Urbana-Champaign, 1995.
....application through data and control dependence analysis and generates an intermediate representation of the parallel application in the form of a hierarchical task graph (HTG) [2][6]. [Footnote: This work has been supported by the Ministry of Education of Spain (CICYT) under contracts TIC95 0492 and TIC94 0439.] We plan to use the Parafrase 2 compiler [9] to generate executable code from the HTG intermediate representation. 2. Objectives Objectives of this paper are to study the viability of the nano threads parallel programming model demonstrating that It is possible to build an efficient ....
Moreira, J. E.: On the Implementation and Effectiveness of Autoscheduling for Shared-Memory Multiprocessors. PhD. thesis, Department of Electrical and Computer Engineering, Univ. of Illinois at Urbana-Champaign, 1995.
....applications by hand. They are based on asynchronous parallel functions that let the caller thread continue while the function is executed by another thread of control. Programmers have to learn new syntactic constructs and new semantics to use these language extensions. On the other hand, [16][13] propose to automatically parallelize applications written in standard languages (e.g. C and FORTRAN). They are based on the nano threads programming model, as defined in [14]. Nano threaded applications are able to exploit both loop and functional parallelism. Using a parallelizing compiler such ....
....user level control flow. The application maintains a user level queue of ready to run nano threads. The execution of an HTG begins with the main node being prepared for execution and inserted in the ready queue as a nano thread. Virtual processors assigned to the application use auto scheduling [13] to execute the nano threads selected from the ready queue. 2.4. Application adaptability to the available resources The overall execution of a nano threaded application is able to adapt to changes in the number of processors assigned to it. The adaptation is dynamic, at run time, and includes ....
J. E. Moreira, "On the Implementation and Effectiveness of Autoscheduling for Shared-Memory Multiprocessors", Ph.D. thesis, Department of Electrical and Computer Engineering, Univ. of Illinois at Urbana-Champaign, 1995.
....Fig. 2. Hierarchy of ready queues and enqueuing of complex matrix multiplication tasks. 4 Evaluation Framework We evaluate our technique with a number of application and synthetic benchmarks. The application benchmarks presented here were used in a previous study of autoscheduling [9]. This gives us the opportunity for a quantitative analysis of our results in direct comparison with existing work in the field. The purpose of the synthetic benchmarks is to demonstrate the performance implications of autoscheduling and the ability of our method to overcome them. The application ....
....of size 256 × 256. The molecular dynamics kernel consists of 14 parallel functions, each one executing a parallel loop with 1000 iterations. More details on the structure of these benchmarks, as well as their performance under autoscheduling on a bus based multiprocessor, can be found in [9]. Our first synthetic benchmark consists of 8 parallel dot products. Each dot product is computed with two vectors of 64 Kilobytes of double precision elements. A detailed description of the benchmark can be found in [11]. Two levels of functional and data parallelism are exploitable, similarly to ....
[Article contains additional citation context not shown here]
J. Moreira, On the Implementation and Effectiveness of Autoscheduling for Shared-Memory Multiprocessors, PhD Thesis, University of Illinois at Urbana-Champaign, Department of Electrical and Computer Engineering, 1995.
....and benefits of our approach. The parallelizing compiler identifies the maximum parallelism contained in the application through data and control dependence analysis and generates an intermediate representation of the parallel application taking the form of a Hierarchical Task Graph (HTG) [More95, Poly93a]. From the HTG the compiler generates code using the services offered by the NthLib library. At this point, the compiler statically determines the finest granularity of parallel tasks worth exploiting, bearing in mind the efficiency of the user level package implementation. At run time, the ....
J.E. Moreira, "On the Implementation and Effectiveness of Autoscheduling for Shared-Memory Multiprocessors", PhD. thesis, Department of Electrical and Computer Engineering, Univ. of Illinois at Urbana-Champaign, 1995.
....called the Hierarchical Task Graph (HTG) [Girk92]. The HTG representation has the ability to capture different levels of both structured (loop level) and unstructured (task level) parallelism. The runtime system uses the HTG representation to apply an autoscheduling execution mechanism [More95]. This mechanism generates drive code, which creates and schedules nano threads that execute the parallel tasks represented by the HTG. The runtime system operates in close coordination with the operating system, in order to dynamically control the granularity of the generated nano threads. ....
....This means that nano threads use standard mechanisms to access local and global data. These mechanisms can be highly optimized by existing C compilers. Moreover, they avoid the overhead of explicit memory allocation of activation frames from the heap and the maintenance of a cactus stack [More95]. Therefore, they simplify memory management at user level. NthLib allocates memory from the heap only for nano thread stacks. Stacks are recycled in a memory pool, in order to avoid the overhead of invoking the operating system memory allocator each time a new nano thread is created. Context ....
J. Moreira, On the Implementation and Effectiveness of Autoscheduling for Shared-Memory Multiprocessors, PhD Thesis, University of Illinois at Urbana-Champaign, Department of Electrical and Computer Engineering, 1995.
....program and its parallel work. chunk scheduling (BCS) The performance of balanced chunk scheduling will be compared with that of chunk scheduling and self scheduling. The reported performance is the result of compiling and running our examples on a system developed at the University of Illinois [Moreira 1995]. It is often the characteristics of loop iterations that make one scheduling scheme perform better than the other. In the case of coarse grained iterations with variable execution times, self scheduling performs much better than chunk scheduling, while fine grained iterations with constant ....
Moreira, J. E. 1995. On the implementation and effectiveness of autoscheduling for shared-memory multiprocessors. Ph.D. thesis, Univ. of Illinois at Urbana-Champaign, Urbana, Ill.
....of modern programming languages, and lends itself to optimizations from the target compiler. Thread nesting and communication of variables from a parent thread to a child can be done by parameter passing to the stack of the child thread. The overall organization resembles a cactus stack [12]. Relying solely on per thread stacks and forcing a full thread switch on every invocation of a parallel task increases the overhead up to a point which may be unacceptable for fine grain programs. The rest of this subsection is devoted to the optimizations applied in the nanothreads runtime ....
....among processors. The implementation of static and dynamic work descriptors is described in Section 4.2. 4.2 Synchronization, Run Queues, Memory Pools and Work Descriptors The nanothreads runtime system employs a dependence driven execution model, derived from parallelizing compiler technology [12]. Conceptually, a parallel task is activated for execution as soon as other tasks on which the task is data and/or control dependent have completed their execution. Practically, a thread is created as soon as the control flow of the program assures that the thread will actually be executed. The ....
J. Moreira, On the Implementation and Effectiveness of Autoscheduling, PhD Dissertation, University of Illinois at Urbana-Champaign, 1995.
....to complete a particular task. The worst imbalance is a function of the largest integer, so reducing the size of tasks promotes load balancing, albeit at the cost of increased overhead. 3.3 Issues In The Implementation Of An Autoscheduling Library Current implementations of autoscheduling [Mor95, Bec93] are done at the machine level. This requires rewriting a compiler to generate machine specific autoscheduling code. A library implementation of autoscheduling would allow a user, or a machine independent source to source code restructurer, to write code to take advantage of autoscheduling. ....
José Moreira. On the Implementation and Effectiveness of Autoscheduling for Shared-Memory Multiprocessors. PhD thesis, University of Illinois, 1995.
....parallelism at run time, while software pipelining creates instruction level parallelism at compile time. Unfortunately, neither of these concepts deals with the issue of providing parallelism at a higher level, so that a multiprocessor machine can execute parallel code. Autoscheduling [18] is a technique that produces parallelism that can be exploited across many processors. Autoscheduling embeds drive code into the program to allow task scheduling to be performed at run time. The hierarchical task graph (HTG) which contains the control and data dependencies present at different ....
....tasks can reduce, or even completely eliminate, the benefit of running tasks in parallel. In this chapter, the relationship between the overhead for executing parallel tasks and the speedup of the code as compared to the serial case will be analyzed. An execution driven simulator, as described in [18], will be used to show this relation on different benchmark codes. 4.2 Overview of Simulator Starting from the HTG representation, the autoscheduling compiler creates autoscheduling C code for a given program. This C code is linked to instrumented versions of run time libraries and data ....
[Article contains additional citation context not shown here]
J. Moreira, On the implementation and effectiveness of autoscheduling for shared-memory multiprocessors, Ph.D. dissertation, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, 1995.
.... [Poly89b] extended to support OpenMP-like directives [Aygu97] identifies the maximum parallelism contained in the application through data and control dependence analysis and generates an intermediate representation of the parallel application taking the form of a Hierarchical Task Graph (HTG) [More95][Poly93b]. From the HTG the compiler generates the executable code. The HTG is composed of simple and compound nodes. Simple nodes contain sets of operations that need to be executed sequentially or do not represent enough work to be executed in parallel. Compound nodes contain complex ....
José E. Moreira, "On the Implementation and Effectiveness of Autoscheduling for Shared-Memory Multiprocessors", Ph.D. thesis, Department of Electrical and Computer Engineering, Univ. of Illinois at Urbana-Champaign, 1995.
....(Section 4), HTG visualization tool (Section 5), and performance prediction and visualization tools (Section 6). Finally, Section 7 summarizes the project workplan for the next period. 2 Parafrase 2 Initial Status This section summarizes the main features of the Parafrase 2 system. Parafrase 2 [More95, Pol85, Pol89] is a source to source multilanguage restructuring compiler. It provides a reliable, portable, and efficient research tool for experimenting with program transformations and other compiler techniques for shared memory parallel computers. It uses an aggressive approach for dependence ....
J.E. Moreira, On the Implementation and Effectiveness of Autoscheduling for Shared-Memory Multiprocessors, PhD. thesis, Department of Electrical and Computer Engineering, Univ. of Illinois at Urbana-Champaign, 1995.
....be done much faster for user level threads than for kernel level threads. There are also implementations with a single level of scheduling. Examples of systems that use the fork join model to support reconfiguration are Cray Multitasking [11], Process Control [12], and Minos [5]. Autoscheduling [13, 14] has shown how an efficient fork join model can support macro dataflow execution on time variant processor partitions. The work on fork join models mentioned above is all in the context of shared memory multiprocessors, which eliminates the need for dynamically changing the binding between data ....
Moreira, J. E. On the Implementation and Effectiveness of Autoscheduling for Shared-Memory Multiprocessors. PhD thesis, University of Illinois at Urbana-Champaign, 1995.
No context found.
J. E. Moreira, "On the Implementation and Effectiveness of Autoscheduling for Shared-Memory Multiprocessors", Ph.D. thesis, Center for Supercomputing Research and Development, University of Illinois, Urbana, IL, Jan. 1995.
No context found.
J. E. Moreira and C. D. Polychronopoulos, "On the Implementation and Effectiveness of Autoscheduling," Tech. Rep. CSRD-1372, Center for Supercomputing Research and Development, University of Illinois, Urbana, IL, May 1994.