18 citations found.
C. J. Beckmann, "Hardware and software for functional and fine grain parallelism," Ph.D. dissertation, Department of Electrical and Computer Engineering, University of Illinois, Urbana, IL, CSRD Report 1346, 1993.

This paper is cited in the following contexts:
The Performance Impact of Granularity Control - And Functional Parallelism   (Correct)

....for this research is a low overhead threads model based on user level scheduling. Keywords: dynamic scheduling, functional parallelism, task granularity, parallel processing, threads. 1 Introduction The magnitude to which runtime overhead affects performance has been widely demonstrated [2, 3, 12]. In order to alleviate this problem [12] and other subsequent studies provided an environment that allows the user to control the number of parallel tasks a given parallel application generates. Given a fixed number of resources, a user or This work was supported by the Office of Naval Research ....

....in the HTG is smaller than a certain size. The minimum size depends on the per task overhead of the system. The details of selecting an appropriate minimum size are beyond the scope of this paper. Static minimum granularity control can be implemented through task merging, a process described in [3]. Another aspect of static granularity control is to help prevent unnecessarily conservative dynamic granularity control decisions. When the overhead for task scheduling is negligible, as compared to the task size, there is little advantage in serializing the task. On the other hand, a dynamic ....

Carl J. Beckmann. Hardware and Software for Functional and Fine Grain Parallelism. PhD thesis, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, 1993.


Analysis of Several Scheduling Algorithms under the.. - Martorell, Jesus (1997)   (2 citations)  (Correct)

....C function is generated for each node in the HTG. Nano thread local variables are allocated on the stack. Output dependencies are embedded in the generated code as calls to the run time library. Input dependencies are represented by a per node counter of pending data dependencies, as in [Beck93]. A C function is generated both for simple and compound nodes. Code generated for simple nodes performs the set of operations contained in the node. Code generated for a compound node consists of a control function (associated to the start node) that sequences the execution of the internal ....

Carl J. Beckmann, "Hardware and Software for Functional and Fine Grain Parallelism", Ph.D. thesis, Department of Electrical and Computer Engineering, Univ. of Illinois at Urbana-Champaign, 1993.


Integrating Library Modules into Special Purpose Parallel.. - Rauber, Rünger (1997)   (1 citation)  (Correct)

....use hierarchical task graphs to represent the functional parallelism in programs. The task graph is formed by extracting the loop structure of the program and representing the body of each loop by an acyclic control flow graph. Algorithms for the dynamic scheduling of task graphs are presented in [7, 6]. A similar graph, the hierarchical macro dataflow graph (MDG), is used in the Paradigm compiler to represent data and task parallelism [24, 22]. Nodes in the MDG correspond to basic parallel tasks or loop constructs; edges in the MDG correspond to precedence constraints that exist between tasks. ....

C. Beckmann. Hardware and Software for Functional and Fine Grain Parallelism. PhD thesis, University of Illinois, Urbana-Champaign, IL, 1994.


On The Implementation And Effectiveness Of Autoscheduling For.. - Moreira (1995)   (16 citations)  (Correct)

....capability for representing parallel loops, allow the dynamic exploitation of parallelism. The formal definition of the HTG and construction algorithms can be found in [57, 89, 90]. In this section we present the HTG as an abstract program model. The following definitions are based on the work of [89, 90, 91]. The hierarchical task graph is a directed acyclic graph (DAG) H = (X, A) where X is a set of nodes and A is a set of arcs. Let X = {x_1, x_2, ..., x_n}. Each node x_i ∈ X, i = 1, ..., n, is a task, a computation to be performed. For that reason, an HTG is an acyclic task graph (ATG) A ....

....enqueue task x_j } } The access to PROCESSED has to be protected by locks because multiple processors may try to update it simultaneously. For the same reason, the new value of PROCESSED has to be stored in a temporary. This has been proven to guarantee that each task will be enqueued only once [91]. The entry block of each task must evaluate its firing tag to determine if the body of the task is to be executed. One obvious simplification is that tasks without control dependences do not have to perform this test. The code, for a task x_i with control dependences, is if (BRANCHES FIRE x_i ....

[Article contains additional citation context not shown here]

Carl J. Beckmann, Hardware and Software for Functional and Fine Grain Parallelism. Ph.D. thesis, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, 1993.
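
The lock-and-temporary rule quoted in the second excerpt of this entry can be illustrated with a minimal Python sketch. It is not the cited paper's code. It assumes, purely for illustration, that the shared state is a per-task count of pending predecessors; the names Task, signal_completion, and demo are hypothetical. Each predecessor updates the shared value under a lock and copies the new value into a local temporary, then tests the temporary rather than re-reading the shared value, so only one predecessor can observe the transition to zero.

```python
import threading


class Task:
    """Hypothetical task record; 'pending' stands in for the shared state."""

    def __init__(self, name, pending):
        self.name = name
        self.pending = pending          # shared, updated by many threads
        self.lock = threading.Lock()    # protects 'pending'


def signal_completion(task, ready_queue):
    """Called by each predecessor when it finishes."""
    with task.lock:
        task.pending -= 1
        remaining = task.pending        # temporary copy taken under the lock
    # Test the temporary, not task.pending: re-reading the shared value
    # here could let two predecessors both see zero and enqueue twice.
    if remaining == 0:
        ready_queue.append(task.name)


def demo(num_predecessors=8):
    """Run all predecessors concurrently and return the ready queue."""
    task = Task("x_j", num_predecessors)
    ready = []
    threads = [
        threading.Thread(target=signal_completion, args=(task, ready))
        for _ in range(num_predecessors)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ready
```

Calling demo() returns a list holding the task name exactly once, whatever order the predecessor threads finish in.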


Simulation Of Static And Dynamic Task Scheduling On.. - Dimitriou (1994)   (Correct)

....obviously give better results. For example, by moving the scheduling decisions from a central scheduler to the tasks themselves, through compiler injected code or even dedicated hardware, we achieve the scheme called auto scheduling [Polychronopoulos, 1990; Beckmann and Polychronopoulos, 1992; Beckmann, 1993]. In our work, we have implemented both schemes. In the centralized scheme, scheduler requests are put in a FIFO queue, to be serviced by a dedicated processor. Scheduler availability is then determined by that processor, and services are all sequential. In the distributed scheme, the scheduler ....

Beckmann, C. J. (1993). Hardware and Software for Functional and Fine Grain Parallelism. Ph.D. dissertation, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign.


On the Implementation and Effectiveness of Autoscheduling - Moreira (1995)   (16 citations)  (Correct)

....by the compiler for each schedulable unit (task). The (embedded) scheduler can therefore be highly optimized for each particular program, making parallel processing affordable at fine granularity levels. Although the abstract model of autoscheduling has been previously developed and analyzed [31, 21, 13, 4, 7], in this paper we address the design issues and implementation details of autoscheduling for shared memory architectures. We implement the model and conduct experiments to measure the effectiveness of the approach for the first time. We consider specifically as target architecture a ....

....a program (described below) as its execution vehicle. The efficiency in scheduling provided by the drive code makes affordable parallel processing of tasks as small as 100 instructions without the need of special hardware, and can be applied to instruction level parallelism with hardware support [7]. Moreover, it allows a program to control the size of parallel tasks at run time, in order to better suit the dynamic environment conditions. There can be two sources of dynamic conditions: • Program: the behavior of a program is in general dependent on its input data, and therefore the best ....

[Article contains additional citation context not shown here]

Carl J. Beckmann. Hardware and Software for Functional and Fine Grain Parallelism. PhD thesis, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, 1993.


Analysis of Several Scheduling Algorithms under the.. - Martorell, al. (1997)   (2 citations)  (Correct)

....A C function is generated for both simple and compound nodes. Local variables are allocated on the stack. Output dependencies are embedded in the generated code as calls to the run time library. Input data dependencies are represented by a per node counter of pending data dependencies, as in [2]. Following the previous scheme, the emitted code is an executable representation of the HTG. All services needed by the generated code such as creation of parallelism, control of dependencies, thread management, etc. can be provided by a run time user level thread package or by direct code ....

C. J. Beckmann, "Hardware and Software for Functional and Fine Grain Parallelism", Ph.D. thesis, Dep. of Elec. and Comp. Eng., Univ. of Illinois at Urbana-Champaign, 1993.


Autoscheduling in a Shared Memory Multiprocessor - Moreira, Polychronopoulos   (Correct)

....facilitating task granularity control. It is the information on control and data dependences that allows the exploitation of functional (task level) parallelism, in addition to data (loop level) parallelism. The definitions, properties, and construction mechanism of the HTG presented here are from [21, 11, 16, 6]. The hierarchical task graph is a directed acyclic graph HTG = (HV, HE) with unique nodes START and STOP ∈ HV. Each node in HV can be one of the following types: Simple, Compound, Simple loop, and Compound loop, representing, respectively, a task that has no subtasks (a basic block, or a ....

.... the set of nodes that may be enabled by the execution of such branch; we call this set BranCand(k, l). The candidates are all the nodes that are data and control dependent on node k, plus all the nodes that are data dependent on nodes bypassed by branch l, minus those nodes bypassed by branch l [6]. Let TG be a given task graph (a certain level of an HTG) with n nodes. A bit vector DONE[1:n] is used to represent the data dependences. DONE[i] is set to TRUE whenever the data dependences originating from node x_i are satisfied. Another bit vector, CONT[1:n], is used to represent the ....

[Article contains additional citation context not shown here]

Carl J. Beckmann. Hardware and Software for Functional and Fine Grain Parallelism. PhD thesis, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, 1993.


The Potential of Exploiting Coarse-Grain Task Parallelism.. - Hordijk, Corporaal   (Correct)

....developed and these exploit only a limited amount of coarse grain task parallelism [5, 6, 7]. Most of the research has been focussed on generation of data independent tasks. The performance of these exploitation techniques is still poor, especially for tasks with larger granularity; e.g. Beckmann [8] reports a parallelism of about 1.5 for coarse grain tasks. The explanation of these disappointing results is given by the fact that function calls are only in 18 percent data independent from each other [6]. For basic blocks and individual statements this is 45 and 76 respectively. It must ....

....parallelism and task parallelism, is shown for the basic machine models. The BASE model hardly exploits more parallelism than the available instruction level parallelism. This directly supports the conclusion of Beckmann's study which reported the tightly coupled behavior of procedures [8]. The DATA and CTL models naturally extract more task parallelism than the BASE model, and the DATA model performs somewhat better than the CTL model. However, the ALL model has superior speedups. This simply shows that most coarse-grain tasks are both control and data dependent on each other. ....

Carl Josef Beckmann. Hardware and Software for Functional and Fine Grain Parallelism. PhD thesis, University of Illinois at Urbana-Champaign, Center for Supercomputing Research and Development, 1994.


Nano-Threads: Programming Model Specification - Labarta (1998)   (Correct)

....and updated at run time depending on the system conditions. Output data precedences are represented through successor nano threads; several successors can be specified for each nano thread. A per thread counter represents the remaining unresolved input data precedences for each nano thread [Beck93]. The counter is initialized at thread creation. Every time a predecessor terminates, the counter is decremented. When the counter reaches zero, the nano thread is ready for execution. Due to several optimizations done while creating parallelism it may occur that new nano threads can dynamically ....

C.J. Beckmann, "Hardware and Software for Functional and Fine Grain Parallelism", Ph.D. thesis, Department of Electrical and Computer Engineering, Univ. of Illinois at Urbana-Champaign, 1993.
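
The per-thread counter described in the excerpt above can be sketched in a few lines of Python. This is an illustrative, single-threaded simulation under stated assumptions, not the nano-threads run-time library: the function name run_with_counters and the dictionary representation of successors are hypothetical, and "executing" a task just records its name.

```python
from collections import deque


def run_with_counters(successors):
    """Simulate counter-based scheduling.

    successors maps each task name to the list of tasks that depend on it.
    Every task must appear as a key, even if it has no successors.
    """
    # Counter initialised at creation: number of unresolved input precedences.
    pending = {task: 0 for task in successors}
    for succs in successors.values():
        for s in succs:
            pending[s] += 1

    # Tasks with no predecessors are ready immediately.
    ready = deque(task for task, count in pending.items() if count == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)              # stand-in for executing the task
        for s in successors[task]:
            pending[s] -= 1             # a predecessor has terminated
            if pending[s] == 0:         # counter reached zero: ready to run
                ready.append(s)
    return order


# Tasks a and b both feed c, which feeds d.
graph = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
order = run_with_counters(graph)
```

In this small graph, c becomes ready only after both a and b have terminated, and d only after c.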


The Impact of Data Communication and Control Synchronization.. - Jeroen Hordijk (1996)   (1 citation)  (Correct)

....misprediction. Most of the research into detection and exploitation of task parallelism from imperative languages has been focussed on generation of data independent tasks. The performance of these exploitation techniques is still poor, especially for tasks with larger granularity; e.g. Beckmann [6] reports a parallelism of about 1.5 for coarse grain tasks. The explanation of these disappointing results is given by the fact that function calls are only in 18 percent data independent [4]. For basic blocks and individual statements this is 45 and 76 respectively. It must be concluded ....

Carl Josef Beckmann. Hardware and Software for Functional and Fine Grain Parallelism. PhD thesis, University of Illinois at Urbana-Champaign, Center for Supercomputing Research and Development, 1994.


The Performance Impact of Granularity Control and Functional.. - José Moreira   (Correct)

....for this research is a low overhead threads model based on user level scheduling. Keywords: dynamic scheduling, functional parallelism, task granularity, parallel processing, threads. 1 Introduction The magnitude to which runtime overhead affects performance has been widely demonstrated [2, 3, 12]. In order to alleviate this problem [12] and other subsequent studies provided an environment that allows the user to control the number of parallel tasks a given parallel application generates. Given a fixed number of resources, a user or This work was supported by the Office of Naval ....

....structure, thus facilitating task granularity control. Information on control and data dependences allows the exploitation of functional (task level) parallelism, in addition to data (loop level) parallelism. A brief summary of the properties of the HTG is given here and details can be found in [3, 8, 9, 10, 19]. The hierarchical task graph is a directed acyclic graph HTG = (HV, HE) with unique nodes START and STOP ∈ HV, the set of vertices. Its edges, HE, are a union of control (HC) and data dependence (HD) arcs: HE = HC ∪ HD. The nodes represent program tasks and can be of three types: simple, ....

[Article contains additional citation context not shown here]

Carl J. Beckmann. Hardware and Software for Functional and Fine Grain Parallelism. PhD thesis, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, 1993.


Efficient Scheduling Of Parallel Tasks In A Multiprogramming.. - Schouten (1995)   (1 citation)  (Correct)

....complete a particular task. The worst imbalance is a function of the largest integer, so reducing the size of tasks promotes load balancing, albeit at the cost of increased overhead. 3.3 Issues In The Implementation Of An Autoscheduling Library. Current implementations of autoscheduling [Mor95, Bec93] are done at the machine level. This requires rewriting a compiler to generate machine specific autoscheduling code. A library implementation of autoscheduling would allow a user, or a machine independent source to source code restructurer to write code to take advantage of autoscheduling. ....

....to one of the nodes above. [Figure 3.5: Class declaration for ATG.] Every task function, h_p, corresponding to a node p contains an exit block that signals its completion to the nodes that are (control or data) dependent on it. The counter algorithm [Bec93] is used to implement autoscheduling, and works as follows. For every edge e_pq, task function h_p contains in its exit block an instruction that decrements the dependence counter for h_q. If e_pq is a data dependence, the counter is decremented by one. If e_pq is a control dependence, then ....

[Article contains additional citation context not shown here]

Carl Josef Beckmann. Hardware and Software for Functional and Fine Grain Parallelism. PhD thesis, UIUC, Oct 1993.


n-RTL Implementation - April Universitat (1998)   (Correct)

....HTG nodes at the same level of the hierarchy are created in reverse order to correctly setup the successors for each (predecessor) node. Several successors can be specified for each nano thread. A per thread counter represents the remaining unresolved input data dependencies for each nano thread [Beck93]. The counter is initialized at thread creation. Every time a predecessor terminates, the counter is decremented. When the counter reaches zero, the nano thread is ready for execution. Due to several optimizations done while creating parallelism it may occur that new nano threads can dynamically ....

Carl J. Beckmann, "Hardware and Software for Functional and Fine Grain Parallelism", Ph.D. thesis, Department of Electrical and Computer Engineering, Univ. of Illinois at Urbana-Champaign, 1993.


Autoscheduling in a Distributed Shared-Memory Environment - José Moreira (1994)   (8 citations)  (Correct)

....structure, thus facilitating task granularity control. Information on control and data dependences allows the exploitation of functional (task level) parallelism in addition to data (loop level) parallelism. A brief summary of the properties of the HTG is given here, and details can be found in [3, 7, 16, 22]. The hierarchical task graph is a directed acyclic graph HTG = (HV, HE) with unique nodes START and STOP ∈ HV, the set of vertices. Its edges, HE, are a union of control (HC) and data dependence (HD) arcs: HE = HC ∪ HD. The nodes represent tasks of a program and can be of three types: ....

Carl J. Beckmann. Hardware and Software for Functional and Fine Grain Parallelism. PhD thesis, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, 1993.


Explicit Dynamic Scheduling: A Practical Micro-Dataflow.. - Beckmann.. (1993)   Self-citation (Beckmann)   (Correct)

....subject only to dependences detected at run time. Unlike these schemes, the EDS hardware is not burdened with detecting dependences. Each loop's dependence graph is computed (and optimized) by the compiler and passed explicitly to the hardware via opcode extensions and scheduling vector tables [6, 7]. This both simplifies the hardware, and avoids potentially introducing false dependences due to hardware resource constraints. This follows proven RISC design principles: keep the hardware simple and fast; and let the compiler do as much of the work as possible. EDS also has performance ....

....dependences between nanotasks within the loop body. A nanotask is a chain of one or more sequentially dependent instructions that execute under the same control dependence conditions. Nanotasks are detected by the compiler as the maximal grains that do not limit the exploitable parallelism [7]. Instruction issuing within a nanotask is sequential, but the initiation of nanotasks is subject to the loop's dependence graph and is carried out by the TGSU. (Footnote 1: Execution outside of innermost loops is similar to a conventional machine.) Unlike a thread or activation, a nanotask has no ....

[Article contains additional citation context not shown here]

Carl Josef Beckmann. Hardware and Software for Functional and Fine Grain Parallelism. PhD thesis, University of Illinois at Urbana-Champaign, Center for Supercomputing Research and Development, September 1993.


Memory Latency Reduction via Data Prefetching and Data Forwarding .. - Poulsen (1994)   (Correct)

No context found.

C. J. Beckmann, "Hardware and software for functional and fine grain parallelism," Ph.D. dissertation, Department of Electrical and Computer Engineering, University of Illinois, Urbana, IL, CSRD Report 1346, 1993.


Permission to Make Digital Or Hard Copies of All Or Part.. - Personal Or Classroom   (Correct)

No context found.

Carl J. Beckmann. Hardware and Software for Functional and Fine Grain Parallelism. PhD thesis, University of Illinois at Urbana-Champaign, April 1994.

CiteSeer.IST - Copyright Penn State and NEC