48 citations found.
S. F. Hummel, E. Schonberg, and L. E. Flynn, `Factoring: a method for scheduling parallel loops'. Comm. ACM 35(8), 90-101 (1992).


This paper is cited in the following contexts:
ParC - An Extension of C for Shared Memory Parallel.. - Ben-Asher, Feitelson..   (Correct)

....to generate more efficient code. Note that this construct does not add functionality, it is useful only for optimizations. The implementation can choose to balance the load between the P activities statically, by allocating n/P iterates to each, or dynamically, by using chunked self scheduling [11, 12, 13]. This is a further optimization that does not change the semantics. It was decided to add a special construct to the language instead of just adding a hint to the compiler, as found in many commercial parallel languages, because of the semantic implications. lparfor explicitly implies that the ....

S. F. Hummel, E. Schonberg, and L. E. Flynn, `Factoring: a method for scheduling parallel loops'. Comm. ACM 35(8), 90-101 (1992).


Adaptive Scheduling of Master/Worker Applications on Distributed.. - Shao (2001)   (10 citations)  (Correct)

.... from both parameter prediction errors when using a fixed allocation strategy and the overhead costs when using a self scheduling allocation strategy, other work allocation techniques have been proposed in the literature to draw on the advantages of the two techniques in various combinations [58][45][39]. A common theme in all proposed techniques is to reduce the number of transfers in the basic self scheduling approach by allocating work units in groups, while increasing application tolerance to real time variances over the fixed approach by allocating the groups during execution. The ....

....than GSS, but then uses a linear decrease in subsequent allocation sizes. Because of this behavior, TSS can be expected to require more total allocation steps to be made compared with GSS in allocating the same number of work units. Factoring (FAC2) is a strategy introduced by Hummel and Flynn [45] which allocates work units in groups that are organized into rounds. Each round consists of P allocation sequences, where P is the total number of processors. Half of all the remaining work units are allocated to active worker processes in each round, resulting in unallocated work units ....

Hummel, S. F., Schonberg, E., and Flynn, L. E. Factoring: A method for scheduling parallel loops. Communications of the ACM 35, 8 (Aug. 1992), 90--101.
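
The round structure described in the FAC2 context above can be illustrated with a minimal sketch. It assumes that each round hands out P equal chunks that together cover about half of the remaining work units, with the chunk size rounded up; the function name `factoring_chunks` and the rounding choice are illustrative assumptions, not taken from the cited papers.

```python
import math


def factoring_chunks(total, workers):
    """Return the chunk sizes handed out by a FAC2-style schedule.

    Assumption: each round allocates `workers` equal chunks that together
    cover roughly half of the work units still unallocated.
    """
    chunks = []
    remaining = total
    while remaining > 0:
        # Half of the remaining work, split evenly over the workers.
        size = max(1, math.ceil(remaining / (2 * workers)))
        for _ in range(workers):
            if remaining == 0:
                break
            take = min(size, remaining)
            chunks.append(take)
            remaining -= take
    return chunks


if __name__ == "__main__":
    print(factoring_chunks(1000, 4))
```

With 1000 work units and 4 workers, the first round in this sketch consists of four chunks of 125, which leaves 500 units for the later, smaller rounds.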


Affinity Scheduling of Unbalanced Workloads - Srikant Subramaniam And (1994)   (14 citations)  (Correct)

....present in each region of the loop. In general, choosing an appropriate value of k is quite difficult, and an optimal choice would require detailed information about the loop and the machine environment that may be difficult to obtain in practice. In variable self scheduling algorithms [4] [6] [11] [16] processors initially obtain large chunks but take increasingly smaller chunks as loop iterations become depleted. By taking many iterations in each initial chunk, these methods offer the possibility of low overhead since relatively few scheduling operations are required. By taking ....

S. F. Hummel, E. Schonberg, L. E. Flynn, "Factoring: A Method for Scheduling Parallel Loops", Communications of the ACM, Vol. 35, No. 8 (August 1992), pp. 90-101.


Metacomputing with MILAN - Baratloo, Dasgupta, Karamcheti, Kedem (1999)   (6 citations)  (Correct)

....execute the next task (of a bunch) while the results of the previous task are being sent back on the network. Finally, bunching allows the programmer to write fine grained parallel programs that are automatically and transparently executed in a coarse grained manner. We have implemented factoring [19], an algorithm that computes the bunch size based on the number of remaining tasks and the number of currently available machines. 2.4 Preemptive Scheduling Eager scheduling provides load balancing and fault isolation in a dynamic environment. However, our description so far has considered only ....

S. F. Hummel, E. Edith Schonberg, and L. E. Flynn. Factoring: A method for scheduling parallel loops. Communications of the ACM, 35(8):90--101, Aug. 1992.


Charlotte: Metacomputing on the Web - Baratloo, Karaul, Kedem, Wyckoff (1996)   (83 citations)  (Correct)

....fault masking and load balancing. We employ dynamic granularity management (or bunching for short) to mask network latencies associated with the process of assigning tasks to machines. Bunching extends self scheduling by assigning a set of tasks (a bunch) at once. We have implemented Factoring [20] which computes the bunch size based on the number of remaining tasks and the number of currently available machines, which was shown effective for executing parallel programs on networks of workstations in Calypso [4]. Bunching has three benefits. First, it reduces the number of task assignments, ....

S. F. Hummel, E. Schonberg, and L. E. Flynn. Factoring: a method for scheduling parallel loops. Communications of the ACM, 1992.


Charlotte: Metacomputing on the Web - Baratloo Karaul Kedem (1996)   (83 citations)  (Correct)

....fault masking and load balancing. We employ dynamic granularity management (or bunching for short) to mask network latencies associated with the process of assigning tasks to machines. Bunching extends self scheduling by assigning a set of tasks (a bunch) at once. We have implemented Factoring [16] which computes the bunch size based on the number of remaining tasks and the number of currently available machines, which was shown effective for executing parallel programs on networks of workstations in Calypso [3]. Bunching has three benefits. First, it reduces the number of task ....

S. F. Hummel, E. Schonberg, and L. E. Flynn, Factoring: a method for scheduling parallel loops, Communications of the ACM, 1992.


Declustering and Load-Balancing Methods for.. - Shekhar, Ravada..   (4 citations)  (Correct)

....be enough to achieve good load balance. In such a case, both static partitioning and DLB techniques can be used. Wang [32] used dynamic allocation of work at different levels (e.g, polygons, edges) for map overlay computation. In addition, several dynamic load balancing methods have been developed [12, 20, 23, 25] for load balancing in different applications. Data Partitioning for map overlay [32] spatial join, and access methods [18, 19] is not related to the work presented in this paper. Declustering and dynamic load balancing for extended spatial data have not received adequate attention in the ....

....transferred between a donor processor and an idle processor. This granularity may depend on the size of the remaining work, the number of processors, the cost of the work transfer, and the accuracy in estimating the remaining work. Several strategies like self scheduling [12] factoring scheduling [20], and chunk scheduling [23] exist for determining the amount of work to be transferred. Also, the simplest case of transferring one piece of work at a time is also considered in some cases. If communication cost is negligible or very small when compared to the average cost of solving the ....

S. F. Hummel, E. Schonberg, and L. E. Flynn. Factoring - a method for scheduling parallel loops. Communications of the ACM, 35(8):90--101, August 1992.


Parallelizing Spatial Databases on Shared-Memory.. - Shekhar, Ravada..   (Correct)

....[Figure 4, panels (a) and (b): A small pool may result in a high static load imbalance. A large pool may result in processor idling.] Granularity of Transfers and Data Partitioning Method: Several strategies like self scheduling [4], factoring scheduling [6], and chunk scheduling [7] exist for determining the amount of work to be transferred during DLB. The first two scheduling strategies are mostly used in pool based DLB methods while chunk scheduling is applicable in both peer based and pool based DLB methods. The cost of synchronization effects ....

S. F. Hummel, E. Schonberg, and L. E. Flynn. Factoring - a method for scheduling parallel loops. Communications of the ACM, 35(8):90--101, August 1992.


Load-Balancing in High Performance GIS.. - Shekhar, Ravada.. (1995)   (Correct)

....determines how much work is transferred between a donor processor and an idle processor. This granularity may depend on the size of the remaining work, the number of processors, and the accuracy in estimating the remaining work. Several strategies like self scheduling [11] factoring scheduling [15], and chunk scheduling [18] exist for determining the amount of work to be transferred. In case of a work transfer, the number of messages and the amount of information exchanged between the processors determines the communication overhead for that work transfer. Since the data needed to ....

S. F. Hummel, E. Schonberg, and L. E. Flynn. Factoring - a method for scheduling parallel loops. Communications of the ACM, 35(8):90--101, August 1992.


Interprocedural Analysis to Support Static Scheduling of.. - Nguyen, Li   (Correct)

....iterations to processors at run time. Moreover, dynamic scheduling is known to incur a heavier scheduling overhead due to its need to access and to modify a global iteration queue. Dynamic scheduling, on the other hand, can potentially achieve a more balanced workload among the processors [KW85, HSF92, PK87, FYTZ87, TN91, ML92, LTSS93]. In applications where the workload is irregularly distributed among the parallel loop iterations, it is quite possible that a program's total execution time can be shorter under dynamic scheduling. However, many parallel loops in well known benchmarking ....

S. F. Hummel, E. Schonberg, and L. E. Flynn. Factoring: A method for scheduling parallel loops. CACM, 35(8):90--101, August 1992.


Dynamic Scheduling Techniques for Heterogeneous Computing.. - Hamidzadeh, Lilja, Atif (1995)   (7 citations)  (Correct)

....iterations to the first processor, (N - N/p)/p iterations to the next processor, and so on. As a result, at each scheduling step, the requesting processor is allocated approximately 1/p of the remaining iterations. Several other variations of this declining chunk size strategy have been proposed [27,28], including one that combines an initial static allocation with dynamic load rebalancing to improve the memory referencing locality of the processors [29]. While these loop scheduling heuristics have been developed for shared memory multiprocessors, there have been related heuristics proposed for ....

S. F. Hummel, E. Schonberg, and L. E. Flynn, "Factoring: A Method for Scheduling Parallel Loops," Communications of the ACM,vol. 35, no. 8, pp. 90-101, August 1992.
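
The declining chunk size rule in the context above, where each request receives approximately 1/p of the remaining iterations, can be sketched as follows. The function name `gss_chunks` and the use of a ceiling are assumptions made for illustration; the snippet itself only says "approximately".

```python
import math


def gss_chunks(total, processors):
    """Chunk sizes for a guided-self-scheduling-style rule.

    Assumption: each scheduling step hands the requesting processor
    ceil(remaining / processors) iterations.
    """
    chunks = []
    remaining = total
    while remaining > 0:
        take = math.ceil(remaining / processors)
        chunks.append(take)
        remaining -= take
    return chunks


if __name__ == "__main__":
    print(gss_chunks(100, 4))
```

For 100 iterations and 4 processors this sketch starts with chunks of 25, 19 and 14 iterations and tails off into single-iteration chunks.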


Declustering and Load-Balancing Methods for.. - Shekhar, Ravada..   (4 citations)  (Correct)

....might not be enough to achieve a good load balance. In such a case, both static partitioning and DLB techniques can be used. Wang [32] used the dynamic allocation of work at different levels (e.g, polygons, edges) for mapoverlay computation. In addition, several dynamic methods have been developed [12, 20, 23, 25] for load balancing in different applications. Data Partitioning for map overlay [32] spatial join, and access methods [18, 19] is not related to the work presented in this paper. Declustering and dynamic load balancing for extended spatial data have not received adequate attention in the ....

....transferred between a donor processor and an idle processor. This granularity may depend on the size of the remaining work, the number of processors, the cost of the work transfer, and the accuracy in estimating the remaining work. Several strategies like self scheduling [12] factoring scheduling [20], and chunk scheduling [23] exist for determining the amount of work to be transferred. Also, the simplest case of transferring one chunk of work at a time is also considered in some cases. In our discussion, we assume that the chunks contain the object IDs and not the actual objects themselves. ....

S. F. Hummel, E. Schonberg, and L. E. Flynn. Factoring - a method for scheduling parallel loops. Communications of the ACM, 35(8):90--101, August 1992.


A Comparative Study of Load Sharing on Networks of.. - Anatol Piotrowski And (1997)   (Correct)

....plays an important role in improving performance of parallel applications in NOW based systems. Load sharing and parallel task scheduling in distributed memory parallel systems have received considerable attention [1, 3]. In multiprocessors, parallel loop scheduling has been studied extensively [6, 10, 12]. However, we are not aware of any studies in the context of NOW based systems. The purpose of this study is to evaluate performance of various load balancing algorithms in a NOW based parallel system environment. We study performance of five load sharing algorithms that use fixed, variable, and ....

....the data in Figure 2 suggests that the variable task granularity algorithms tend to create more tasks than the fixed granularity algorithm, we show in Section 4 that the opposite is true for the parameters used in our experiments. Several strategies have been proposed to decrease the task sizes [6, 10, 12]. The following three strategies were implemented: guided self scheduling [10], factoring [6], and trapezoidal self scheduling [12]. 3.2.1 Guided Self Scheduling: In guided self scheduling (GSS) task size is a function of the remaining columns. Typically, task size is set to 1/G of the remaining ....

[Article contains additional citation context not shown here]

S. F. Hummel, E. Schonberg, and L. E. Flynn, "Factoring: A Method for Scheduling Parallel Loops," Comm. ACM, Vol. 35, No. 8, August 1992, pp. 90--101.


Performance of a Parallel Application on a Network of.. - Anatol Piotrowski And (1997)   (Correct)

....no load sharing. We will discuss the effect of task granularity in Section 4.5. We can divide load sharing algorithms into two groups depending on whether the algorithm uses fixed or variable granularity. We have implemented three variable granularity algorithms: guided self scheduling [9], factoring [5], and trapezoidal self scheduling [11]. The focus of this paper is on fixed granularity algorithms. Performance of variable granularity algorithms is reported in [8]. We now describe three fixed granularity algorithms that are based on the pool of tasks paradigm. Broadcast: In this algorithm, the ....

S. F. Hummel, E. Schonberg, and L. E. Flynn, "Factoring: A Method for Scheduling Parallel Loops," Comm. ACM, Vol. 35, No. 8, August 1992, pp. 90--101.


Scheduling of Wavefront Parallelism on Scalable.. - Manjikian, Abdelrahman (1996)   (14 citations)  (Correct)

....exists a large body of work dealing with the scheduling of parallel DOALL loops. Many scheduling strategies have been proposed to strike a balance between load balance and scheduling overhead. Examples include static scheduling [1], self scheduling [1], guided self scheduling [11], and factoring [6], to name a few. However, these strategies are not applicable when loops carry dependences, such as in wavefront parallelism. Markatos and LeBlanc [10] propose Affinity based Scheduling (AFS) and Li et al. [7] propose Locality based Dynamic Scheduling (LDS) to schedule parallel DOALL loop ....

S. F. Hummel, E. Schonberg, and L. E. Flynn. Factoring: A method for scheduling parallel loops. Comm. of the ACM, 35(8):90--101, August 1992.


Job Scheduling in Multiprogrammed Parallel Systems - Feitelson (1997)   (16 citations)  (Correct)

....cannot be used as effectively [487, 388, 389, 106]. This includes use of the local cache on each PE. It might seem possible to improve the performance of a global queue by removing a number of threads at once each time the queue is accessed, thus amortizing the cost of contention and overhead [468, 426, 281]. However, this is much more relevant to second level scheduling by the application, after PEs have been allocated, than to scheduling from a global queue by the operating system (see Section 4.4). The reason is that an application knows that all the threads belong to the same application, and ....

.... that the distribution of task completion times was quite wide, due to memory contention effects [170]. [Figure 19: The chunk sizes created by different self scheduling schemes, for 1000 chores and P = 4 (data partly from [281]). Labels recovered from the figure: static 250, guided 250, taper 190, factoring 125, trapezoid 125.] .... the time to compute each chore may differ from the others, the first large chunks may take more time than expected, leading to load imbalance. Taper tries to solve this problem by using runtime statistics of the mean and standard deviation of chore execution times to set the chunk size [379] ....

[Article contains additional citation context not shown here]

S. F. Hummel, E. Schonberg, and L. E. Flynn, "Factoring: a method for scheduling parallel loops". Comm. ACM 35(8), pp. 90--101, Aug 1992.
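
The Figure 19 labels quoted in the context above (static 250, guided 250, factoring 125, trapezoid 125, for 1000 chores and P = 4) match the first chunk sizes given by the commonly cited default formulas for these schemes. The sketch below uses those formulas as an assumption; they are not stated in the snippet, and taper is left out because it depends on runtime statistics that the snippet does not give.

```python
import math


def first_chunk(scheme, chores, processors):
    """First chunk size under commonly cited default rules (assumed here)."""
    if scheme == "static":
        # One equal share per processor.
        return math.ceil(chores / processors)
    if scheme == "guided":
        # 1/P of the remaining work; everything remains at the start.
        return math.ceil(chores / processors)
    if scheme == "factoring":
        # Half of the remaining work split over P processors.
        return math.ceil(chores / (2 * processors))
    if scheme == "trapezoid":
        # Conservative starting size N / (2P), decreasing linearly afterwards.
        return chores // (2 * processors)
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    for name in ("static", "guided", "factoring", "trapezoid"):
        print(name, first_chunk(name, 1000, 4))
```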


Job Scheduling in Multiprogrammed Parallel Systems - Feitelson (1997)   (16 citations)  (Correct)

....allow a PE to be idle if there is work waiting at another PE [101, 205]. The term load balancing is only meaningful when PEs have individual loads, as when local queues are used. .... number of threads at once each time the queue is accessed, thus amortizing the cost of contention and overhead [284, 263, 173]. However, this is much more relevant to second level scheduling by the application, after PEs have been allocated, than to scheduling from a global queue by the operating system. The reason is that an application knows that all the threads belong to the same application, and knows the number of ....

S. F. Hummel, E. Schonberg, and L. E. Flynn, "Factoring: a method for scheduling parallel loops". Comm. ACM 35(8), pp. 90--101, Aug 1992.


Metacomputing with MILAN - Baratloo, Dasgupta, Karamcheti, Kedem (1999)   (6 citations)  (Correct)

....execute the next task (of a bunch) while the results of the previous task are being sent back on the network. Finally, bunching allows the programmer to write fine grained parallel programs that are automatically and transparently executed in a coarse grained manner. We have implemented factoring [19], an algorithm that computes the bunch size based on the number of remaining tasks and the number of currently available machines. Eager scheduling provides load balancing and fault isolation in a dynamic environment. However, our description so far has considered only non preemptive tasks which ....

S. F. Hummel, E. Edith Schonberg, and L. E. Flynn. Factoring: A method for scheduling parallel loops. Communications of the ACM, 35(8):90--101, Aug. 1992.


Affinity Scheduling of Unbalanced Workloads - Saskatoon (1993)   (Correct)

....2.2.2 Loop Scheduling: Loops are a rich source of parallelism in scientific code. Parallelizing compilers for sequential programs have been particularly successful in determining when loop iterations can be executed in parallel. Thus, loop scheduling has received considerable attention [22] [30] [34] [39] [43] [49] [56] [62]. The fundamental trade off in scheduling loop iterations on multiple processors is that of maintaining balanced processor workloads, without excessive scheduling overhead. Loop iterations can be scheduled either statically (at compile time) or dynamically (at ....

....iteration space, rather than blocks of consecutive iterations. In this manner, the potential for load imbalance that arises when iteration execution times vary widely, and the execution times of consecutive iterations are correlated, is significantly reduced. Factoring, proposed by Flynn et al. [30], was specifically designed to handle iterations with execution time variance. In factoring, iterations are scheduled in batches each containing P (the number of processors) equal size chunks. The total number of iterations per batch is a fixed ratio of those remaining and hence the name ....

S. F. Hummel, E. Schonberg, L. E. Flynn, "Factoring: A Method for Scheduling Parallel Loops", Communications of the ACM, Vol. 35, No. 8 (August 1992), pp. 90-101.


Scheduling Schemes for Data Farming - Fleury, Downton, Clark   (Correct)

....1 processors is that time at the end of the final round while the last processor finishes. Suppose k_i = n/(p y_i) (16), with fixed y_i = 2. With k_i so set the task size decreases exponentially and half the work is statically allocated in the first round, which is the Factoring regime [34]. Assume that the finishing times at the first scheduling round are distributed as E[X_{i:p}], i = 1, 2, ..., p, with the r.v. X being the execution times. An approximate solution to setting y_i, thus restricting the idle time in the general case, is shown in Appendix B. 4.3 Heuristic ....

S. F. Hummel, E. Schonberg, and L. E. Flynn. Factoring: A method for scheduling parallel loops. Communications of the ACM, 35(8):90--101, 1992.
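
The claim in the context above, that a fixed y_i = 2 makes the task size decrease exponentially and places half the work in the first round, can be checked with a short derivation. It assumes that n in the task-size rule stands for the work remaining at the start of round i (written n_i below, with n_1 = n the total) and that rounding is ignored; both are assumptions, since the snippet does not define its symbols.

```latex
\begin{align*}
k_i &= \frac{n_i}{p\,y_i} = \frac{n_i}{2p}, \qquad n_1 = n,\\
n_{i+1} &= n_i - p\,k_i = \frac{n_i}{2}
  \quad\Longrightarrow\quad n_i = \frac{n}{2^{\,i-1}},\\
k_i &= \frac{n}{p\,2^{\,i}}, \qquad p\,k_1 = \frac{n}{2}.
\end{align*}
```

Under these assumptions the task size halves from one round to the next, and the p tasks of the first round together account for n/2 of the work.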


The Importance of Locality in Scheduling and Load Balancing for.. - Keckler (1994)   (1 citation)  (Correct)

....K is selected to limit the number of total accesses to the global iteration variable and cause all processors to finish at approximately the same time. Polychronopoulos and Kuck's Guided Self Scheduling (GSS) [28] allows K to decrease, but assumes that the task length is fixed. Hummel et al. [11] propose a refinement that provides better load balancing by dividing the iterations up into phases. Each chunk in a given phase contains the same number of tasks, but the chunk size in successive phases decreases. Tzen and Ni propose Trapezoid Self Scheduling [34] in which chunk size decreases ....

....have been proposed to reduce load imbalance when tasks have variable execution lengths. Lucco [21] computes the best task chunk size based on a measure of the task length variance. As discussed in more detail in the next section, when variance is large, chunk size is reduced. Hummel et al. [11] divide the iterations into batches of p equal sized chunks. This guarantees that all processors initially get the same number of iterations. Like GSS, the chunk size in each batch is reduced during execution of the loop. Both of these techniques improve load balance by decreasing the chunk size, ....

Hummel, S. F., Schonberg, E., and Flynn, L. E. Factoring: A method for scheduling parallel loops. Communications of the ACM 35, 8 (August 1992), 90--101.


Scheduling Policies to Support Distributed 3D Multimedia.. - Thu Nguyen (1998)   (1 citation)  (Correct)

....Additionally, we do not make any assumption about the maximum service time of individual tasks (other than that it is shorter than the frame time) and do not use task service time information in any of our policies. Our work is related to efforts in loop scheduling for parallel processors (e.g. [22, 7, 26, 14, 16, 20, 29]) in that the basic problem a loop scheduling discipline must solve is how to balance the performance loss due to processors going idle when there is work left to be done against the overhead of finding that work. Our environment differs from loop scheduling, however, as our overheads are ....

S. F. Hummel, E. Schonberg, and L. E. Flynn. Factoring: A Method for Scheduling Parallel Loops. Communications of the ACM, 35(8):90--101, Aug. 1992.


Structured Performability Analysis Of Fault Tolerant Parallel.. - Dougherty (1998)   (Correct)

....2.2.1 Continuous Time Markov Chains (CTMCs): Failure and repair rates are assumed homogeneous, meaning that each component fails at rate λ and is repaired at rate μ. Component failure rate is obtained from the Mean Time To Failure (MTTF) assuming the system lifetime is exponentially distributed [50]: λ = 1/MTTF. The component repair rate is derived from the Mean Time To Repair (MTTR): μ = 1/MTTR. It is also assumed that an unlimited repair facility exists, implying that μ_i = iμ, where i counts the number of components which have failed of a p component system (0 ≤ i ≤ p). Figure 6: ....

....time found among the subcomponents. SIMD implies that all subcomponents are running the same program. Execution times should be consistent among these subcomponents. Also, load balancing techniques reduce synchronization penalties, further supporting the expectation of consistent execution times [27, 50]. There is no IPC among SIMD subcomponents. The IPC for the SIMD component is stated in section 3.3.1 for parallel type constructs. These observations lead to the following mean function for expected SIMD execution time: T_SIMD(N, p) = (1/p) Σ_{i=1}^{p} T_i(N, p). The peak processing ....

[Article contains additional citation context not shown here]

Hummel, S.F., Schonberg, E., and Flynn, L.E. "Factoring: A method for scheduling parallel loops." Communications of the ACM, Vol. 35, No. 8, August 1992, pp. 90 - 101.


Scheduling Non-Uniform Parallel Loops on Highly Parallel.. - Orlando, Perego   (Correct)

....fields such as sparse matrix computations, image processing, Monte Carlo calculations. Although the problem of finding an optimal schedule for non uniform parallel loops is NP hard [7], many effective dynamic scheduling heuristics for shared memory multiprocessors have been proposed and experimented [10, 11, 8, 5, 2]. A less frequently studied problem is loop scheduling for distributed memory architectures [7, 2], which involves strong relationships with other factors such as data partitioning and locality of references. In this paper we consider distributed memory machines that only supply mechanisms for ....

....low scalability. The effective dynamic techniques that have been developed for shared memory machines cannot be, in fact, simply adapted to distributed memory environments. The single queue, which is used on shared memory implementations to store the iteration indexes (Self Scheduling techniques [10, 11, 8, 5, 2]) should introduce a centralization point in the distributed implementation, thus jeopardizing scalability. Another problem is related to the allocation of data. In fact, to avoid data transfer overheads, it should be needed to replicate the full data set in the local memories of all the ....

[Article contains additional citation context not shown here]

S.F. Hummel, E. Schonberg, and L.E. Flynn. Factoring: A Method for Scheduling Parallel Loops. Communications of the ACM, 35(8):90-- 101, August 1992.


Evaluation of Loop Scheduling Algorithms on Distributed Memory .. - Teebu Philip   (Correct)

....an independent loop across processors in a parallel processing system. The potential benefit of loop parallelism on system performance has been analyzed by a number of researchers. Dynamic DOALL loop scheduling methods such as Self Scheduling (SS) [8], Guided Self Scheduling (GSS) [10], Factoring (FAC) [6], and Trapezoid Self Scheduling (TSS) [13] have been proposed for data independent loops that can be easily implemented into parallel compilers. A closer look at all the prior loop scheduling methods reveals that these dynamic algorithms have been proposed and studied only on shared memory platforms. ....

....c min iterations are assigned to each processor per scheduling step. A problem with GSS is that it allocates too many iterations to the first few chunks, resulting in poor load balancing at the beginning and longer overall execution times in distributed systems. 3.2.3 Factoring: Factoring (FAC) [5, 6] is also a decreasing chunk size allocation algorithm. Factoring tries to allocate half of the remaining iterations evenly among all p processors for every scheduling step. All processors will execute the same number of iterations at approximately the same time. In other words, all processors are ....

S.F. Hummel, E. Schonberg, and L.E. Flynn, "Factoring: A Method for Scheduling Parallel Loops," Communications of the ACM, Vol. 35, No. 8, pp. 90-101, August 1992.


A Template for Non-Uniform Parallel Loops Based on Dynamic.. - Orlando, Perego (1995)   (3 citations)  (Correct)

....respect to the schedule of iteration. In UMA shared memory multiprocessors, where, in principle, shared data are at the same distance from any processor, the usual implementation schemes rely on completely dynamic self scheduling techniques, based on the existence of a global queue of iterations [14, 16, 12, 7, 4]. The introduction of caches makes this scheme unsuitable even for UMA multiprocessors, since it does not guarantee the exploitation of locality [12]. In distributed memory multiprocessors, data distribution and iteration scheduling are also more strictly related. Data parallel languages such as ....

....iterations and P is the number of processors involved, are fetched at each time by an idle processor [14]. Trapezoid Self Scheduling [16] was proposed by Tzen and Ni to reduce the number of synchronizations by linearly decreasing the chunk size. Hummel, Schonberg and Flynn presented Factoring [7], which requires that P consecutive chunks of size k, where k = u/(2·P), are inserted into the shared queue when it becomes empty. Due to improvements in processor architectures with the exploitation of fine grain parallelism, processors are getting faster at a higher rate than memories and ....

S.F. Hummel, E. Schonberg, and L.E. Flynn. Factoring: A Method for Scheduling Parallel Loops. Communications of the ACM, 35(8):90--101, August 1992.
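
The queue-based formulation in the context above, where P consecutive chunks of equal size are inserted into the shared queue whenever it becomes empty, can be sketched as a small lock-protected queue. The chunk size is taken here as half of the unscheduled iterations divided evenly over P and rounded up, which is an assumed reading of the size expression in the snippet; the class name `FactoringQueue` and its methods are hypothetical.

```python
import math
import threading
from collections import deque


class FactoringQueue:
    """Shared queue of iteration ranges, refilled in batches of P chunks."""

    def __init__(self, total, processors):
        self._unscheduled = total      # iterations not yet placed in the queue
        self._processors = processors
        self._next = 0                 # first iteration index not yet queued
        self._queue = deque()
        self._lock = threading.Lock()

    def _refill(self):
        # Assumed chunk size: half of the unscheduled work over P processors.
        k = math.ceil(self._unscheduled / (2 * self._processors))
        for _ in range(self._processors):
            size = min(k, self._unscheduled)
            if size == 0:
                break
            self._queue.append((self._next, self._next + size))
            self._next += size
            self._unscheduled -= size

    def fetch(self):
        """Return a half-open range (start, end), or None when work is done."""
        with self._lock:
            if not self._queue:
                self._refill()
            return self._queue.popleft() if self._queue else None


if __name__ == "__main__":
    q = FactoringQueue(100, 4)
    while (chunk := q.fetch()) is not None:
        print(chunk)
```

Refilling only when the queue is empty keeps the batch size computation off the common path, so most fetches only pay for the lock and a single pop.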


Parallel Loop Scheduling for High Performance Computers - Yue, Lilja (1994)   (3 citations)  (Correct)

....reason for uneven workload is system interrupts, such as page faults or context switches, which cause some iterations to take longer to execute than others. To balance the processors' workloads, a variety of algorithms have been developed to dynamically assign iterations to idle processors [7,9,15,16]. Since the performance of these algorithms depends on the execution environment, the loop structure, and the specific implementation details, finding the best algorithm for .... [Footnote: This work was supported in part by the National Science Foundation under grant no. MIP 9221900, and by Army Research ....]

....these algorithms using the analytical models. To make the comparisons more realistic, the algorithms were also implemented on a Silicon Graphics Onyx multiprocessor system. The scheduling algorithms we study include chunk scheduling [9], self scheduling [3], guided self scheduling [15], factoring [7], and trapezoid self scheduling [16]. This article is organized as follows: Section 2 reviews some techniques for exploiting loop level parallelism and provides some useful background information. Section 3 describes the scheduling algorithms and develops the analytical models. It also discusses ....

[Article contains additional citation context not shown here]

S. Hummel, E. Schonberg, and L. Flynn. Factoring: A method for scheduling parallel loops. Communications of the ACM, 35(8):90--101, Aug. 1992.


An Efficient Template for the Highly Parallel Implementation.. - Orlando, Perego (1996)   (Correct)

....respect to the schedule of iteration. In UMA shared memory multiprocessors, where, in principle, shared data are at the same distance from any processor, the usual implementation schemes rely on completely dynamic self scheduling techniques, based on the existence of a global queue of iterations [8, 9, 3]. The introduction of caches makes this scheme unsuitable even for UMA multiprocessors, since it does not guarantee the exploitation of locality [6]. In distributed memory multiprocessors, data distribution and iteration scheduling are also more strictly related. Data parallel languages such as ....

S.F. Hummel, E. Schonberg, and L.E. Flynn. Factoring: A Method for Scheduling Parallel Loops. CACM, 35(8):90--101, Aug. 1992.


SUPPLE: an Efficient Run-Time Support for Non-Uniform Parallel .. - Orlando, Perego (1996)   (2 citations)  (Correct)

....for effectively balancing the load, and send them to the destinations that asked for load migration. In terms of our support, this means choosing the most appropriate number of chunks that must be moved to grant a given migration request. We use a modified Factoring scheme to determine this number [16]. Factoring is a Self Scheduling heuristic formerly proposed to address the efficient implementation of parallel loops on shared memory multiprocessors. It provides a way to determine the appropriate number of iterations that each processor must fetch at each access from a central queue storing ....

....and P is the number of processors involved, are fetched at each time by an idle processor [23]. Trapezoid Self Scheduling [24] has been proposed by Tzen and Ni to reduce the number of synchronizations by linearly decreasing the chunk size. Hummel, Schonberg and Flynn have presented Factoring [16], the policy adopted also in SUPPLE to implement the task selection strategy of our load balancer. As mentioned in Section 3.2, Factoring requires that P consecutive chunks of size k, where k = u/(2P), are inserted into the shared queue when it becomes empty. The introduction of large ....

S.F. Hummel, E. Schonberg, and L.E. Flynn, "Factoring: A Method for Scheduling Parallel Loops," Comm. of the ACM, vol. 35, no. 8, pp. 90--101, Aug. 1992.
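
The queue-refill behaviour described in the context above can be sketched as follows. This is a minimal illustration, not the SUPPLE implementation: it reads the garbled chunk-size expression as k = u/(2P), takes u to be the number of still-unscheduled iterations, and rounds up. All three readings are assumptions, and the class and method names are hypothetical.

```python
import math
from collections import deque


class FactoringQueue:
    """Shared queue refilled with P equal chunks whenever it runs empty."""

    def __init__(self, total_iterations, num_procs):
        self.unscheduled = total_iterations  # u: iterations not yet chunked
        self.num_procs = num_procs           # P
        self.next_index = 0                  # first iteration not yet chunked
        self.queue = deque()

    def _refill(self):
        # Assumed chunk size: k = ceil(u / (2P)), i.e. about half of the
        # unscheduled iterations split evenly over P chunks.
        k = math.ceil(self.unscheduled / (2 * self.num_procs))
        for _ in range(self.num_procs):
            size = min(k, self.unscheduled)
            if size == 0:
                break
            # Chunks are half-open ranges [start, end) of iteration indices.
            self.queue.append((self.next_index, self.next_index + size))
            self.next_index += size
            self.unscheduled -= size

    def fetch(self):
        """Return the next chunk, or None when every iteration is scheduled."""
        if not self.queue and self.unscheduled > 0:
            self._refill()
        return self.queue.popleft() if self.queue else None


if __name__ == "__main__":
    q = FactoringQueue(total_iterations=100, num_procs=4)
    while (chunk := q.fetch()) is not None:
        print(chunk)
```

The sketch is single-threaded; on a real shared memory machine `fetch` would need a lock or an atomic update around the queue.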


Scheduling Data-Parallel Computations on Heterogeneous and.. - Orlando, Perego (1997)   (1 citation)  (Correct)

....of load imbalance when this is introduced into uniform data parallel computations by heterogeneous and/or time shared environments. Weighted Factoring [6] has been conceived for heterogeneous distributed environments (NOWs) and is a variant of the well known factoring self scheduling scheme [7] designed for shared memory environments. The main drawback of the proposal is still the presence of a centralized scheduler processor. Moreover, it considers the relative speeds of processors involved as an input of the load balancing technique, even though these speeds may change during the ....

....used by underloaded processors to choose a partner to be asked for further work skips terminated processors. Once an overloaded processor decides to grant a migration request, it must choose the most appropriate number of chunks to be migrated. To this end, SUPPLE uses a modified Factoring scheme [7], which is a Self Scheduling heuristic formerly proposed to address the efficient implementation of parallel loops on shared memory multiprocessors, where a single queue of iterations is concurrently accessed. In our case, rather than a single queue, we have multiple shared queues, one for each ....

S.F. Hummel, E. Schonberg, and L.E. Flynn. Factoring: A Method for Scheduling Parallel Loops. Comm. of the ACM, 35(8):90--101, Aug. 1992.


A Framework for Space and Time Efficient Scheduling of.. - Narlikar, Blelloch (1996)   (Correct)

....currently considering methods to further improve the scheduling algorithm, particularly to provide better support for fine grained computations. At present, fine grained iterations of innermost loops are statically grouped into fixed size chunks. A dynamic, decreasing size chunking scheme such as [24, 25, 34] can be used instead. We are considering ways of automatically introducing such coarsening at runtime through the scheduling algorithm; for example, by allowing the execution order to differ to a limited extent from the order dictated by the 1DF numbers. We also plan to reduce contention due to a ....

S. F. Hummel, E. Schonberg, and L. E. Flynn. Factoring: a method for scheduling parallel loops. Communications of the ACM, 35(8):90--101, Aug 1992.


Space-Efficient Implementation of Nested Parallelism - Narlikar, Blelloch (1996)   (8 citations)  (Correct)

....techniques. These results show that delaying big allocations significantly changes the order of execution of the threads, and results in much lower memory usage, especially as the number of processors increases. into fixed size chunks. A dynamic, decreasing size chunking scheme such as [24, 26, 38] can be used instead. We are considering ways of automatically introducing such coarsening at runtime through the scheduling algorithm; for example, by allowing the execution order to differ to a limited extent from the order dictated by the 1df numbers. We also plan to reduce contention for the ....

S. F. Hummel, E. Schonberg, and L. E. Flynn. Factoring: a method for scheduling parallel loops. Communications of the ACM, 35(8):90--101, Aug 1992.


Space-Efficient Scheduling of Nested Parallelism - Narlikar, Blelloch (1999)   (4 citations)  (Correct)

....1998] We are currently working on methods to further improve the scheduling algorithm, particularly to provide better support for fine grained threads. At present, fine grained iterations of innermost loops are statically grouped into fixed size chunks. A dynamic, decreasing size chunking scheme [Hummel et al. 1992; Kuck 1987; Tzen and Ni 1993] can be used instead. We are working on an algorithm to automatically coarsen the computations at runtime by allowing the execution order to differ to a limited extent from the 1df numbers, and by using ordered, per processor queues. Preliminary results indicate that ....

Hummel, S. F., Schonberg, E., and Flynn, L. E. 1992. Factoring: a method for scheduling parallel loops. Commun. ACM 35, 8 (Aug.), 90--101.


A Support for Non-Uniform Parallel Loops and its Application.. - Orlando, Perego (1997)   (Correct)

....for effectively balancing the load, and send them to the destinations that asked for load migration. In terms of our support, this means choosing the most appropriate number of chunks that must be moved to grant a given migration request. We use a modified Factoring scheme to determine this number [6]. Factoring is a Self Scheduling heuristic formerly proposed to address the efficient implementation of parallel loops on shared memory multiprocessors. It provides a way to determine the appropriate number of iterations that each processor must fetch from a central queue, and improves other ....

S.F. Hummel, E. Schonberg, and L.E. Flynn. Factoring: A Method for Scheduling Parallel Loops. Comm. of the ACM, 35(8):90--101, Aug. 1992.


Combining Static and Dynamic Scheduling on Distributed-Memory.. - Oscar Plata (1994)   (8 citations)  (Correct)

....the high synchronization overhead found in SS by scheduling chunks of more than one iteration as units. GSS [PK87] schedules large chunks at the beginning of the loop (low synchronization overhead) and small ones toward the end of the loop, trying to balance the workload. In Factoring scheduling [HSF92] the allocation of iterations to processors proceeds in phases. In each phase, a part of the remaining iterations is divided equally among the available processors. Trapezoid scheduling [TN93] is a variation of GSS where the size of the successive chunks decreases linearly instead of ....

Hummel, S. F., Schonberg, E. and Flynn, L. E., "Factoring, a Method for Scheduling Parallel Loops", Communications of the ACM, vol. 35, no. 8, Aug. 1992, pp. 90--101.


A Performability Model for Applications using Checkpointing - John Dougherty (1996)   (Correct)

....Again the discussion begins with expected execution time, assuming no failures. The computational density and disk density remain constant; however, the problem space is partitioned such that each processor receives n/p units, assuming an appropriate scheduling load balancing approach [9, 14, 17]. This is witnessed in the following: Tsimd = Tcmp + Tdsk + Tcmm + Tsyn = an pw bn pd gn n Tsyn (18) where Tcmp and Tdsk are defined as with sequential processing, Tcmm represents communication overhead time, and Tsyn represents synchronization overhead time. Tcmp can be viewed as the time ....

....which is generated by the relative behaviors of the processing elements which are assumed heterogeneous and nondedicated. For this study, Tsyn will be set to zero, and all results are assumed to be lower bounds on time. Research to capture and understand this elusive term has been reported in [6, 9, 14, 17]. Using the above assumptions of local disks and no synchronization penalty, then Tsimd = andn# #bnwn# #gnpd pwdn (19) resulting in the performance equation Psimd = apwdn adn# #bwn# #gpd (20) There is no fault tolerance ....

Hummel, S.F., Schonberg, E., and Flynn, L.E. "Factoring: a method for scheduling parallel loops." Communications of the ACM, Vol. 35, No. 8, August 1992, pp. 90 - 101.


Exploiting Partial Replication in Unbalanced Parallel Loop.. - Orlando, Perego (1995)   (2 citations)  (Correct)

....fields such as sparse matrix computations, image processing, and Montecarlo calculations. Although the problem of finding an optimal schedule for non uniform parallel loops is NP hard [10], many effective dynamic scheduling heuristics for shared memory multiprocessors have been proposed and tested [12,14,11,7,4]. A less frequently studied problem is loop scheduling for distributed memory architectures [10,4], which involves strong relationships with other important issues such as data partitioning and locality of references. On the other hand, the general problem of load balancing for these ....

....iterations and P is the number of processors involved, are fetched at each time by an idle processor [12]. Trapezoid Self Scheduling [14] was proposed by Tzen and Ni to reduce the number of synchronizations by linearly decreasing the chunk size. Hummel, Schonberg and Flynn presented Factoring [7], which requires that P consecutive chunks of size k, where k = u/(2P), are inserted into the shared queue when it becomes empty. Due to improvements in processor architectures with the exploitation of fine grain parallelism, processors are getting faster at a higher rate than memories and ....

[Article contains additional citation context not shown here]

S.F. Hummel, E. Schonberg, and L.E. Flynn. Factoring: A Method for Scheduling Parallel Loops. Communications of the ACM, 35(8):90--101, August 1992.


Building Worthy Parallel Applications Using Networked Computers - A .. - Shi   (Correct)

.... G_i: ith tuple size; R_{i+1} = R_i - G_i, until R_i = 1. For example, if N=1000, P=2, we have tuples of the following sizes: 500, 250, 125, 63, 32, 16, 8, 4, 2, 1. GSS puts too much work in the beginning. It performs poorly even when processors are of the same power and work distribution is relatively uniform [8,9]. Factoring [8]: Assuming there are P parallel workers, a threshold t > 0 and a real value f (0 < f <= 1), the factored tuple sizes are calculated as follows: R_0 = N, G_i = R_i * f / P, R_{i+1} = R_i - P * G_i, until R_i < t. For example, if N=1000, P=2, f=0.5, t=1, we have the following tuple sizes: ....

.... R_{i+1} = R_i - G_i, until R_i = 1. For example, if N=1000, P=2, we have tuples of the following sizes: 500, 250, 125, 63, 32, 16, 8, 4, 2, 1. GSS puts too much work in the beginning. It performs poorly even when processors are of the same power and work distribution is relatively uniform [8,9]. Factoring [8]: Assuming there are P parallel workers, a threshold t > 0 and a real value f (0 < f <= 1), the factored tuple sizes are calculated as follows: R_0 = N, G_i = R_i * f / P, R_{i+1} = R_i - P * G_i, until R_i < t. For example, if N=1000, P=2, f=0.5, t=1, we have the following tuple sizes: ....

[Article contains additional citation context not shown here]

S.F. Hummel, E. Schonberg, and L.E. Flynn, "Factoring -- A Method for Scheduling Parallel Loops," CACM, Vol. 35, No. 8 (August 1992), 90-101.
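
The factoring recurrence quoted in the contexts above can be tried out with a short sketch. It assumes the garbled formulas read R_0 = N, G_i = R_i * f / P and R_{i+1} = R_i - P * G_i, repeated while R_i >= t. Rounding G_i up, clamping the last tuples to the work that remains, and handing out any remainder below the threshold as one final tuple are choices made here for illustration; the function name is hypothetical.

```python
import math


def factoring_tuple_sizes(n, p, f=0.5, t=1):
    """Return the tuple sizes handed out for n work units and p workers."""
    if n < 0 or p < 1 or t < 1 or not 0 < f <= 1:
        raise ValueError("need n >= 0, p >= 1, t >= 1 and 0 < f <= 1")
    sizes = []
    remaining = n  # R_i
    while remaining >= t:
        # G_i: a fraction f of the remaining work, split over p workers
        # (rounded up, at least 1; the rounding rule is an assumption).
        g = max(1, math.ceil(remaining * f / p))
        for _ in range(p):  # one round hands out p tuples of size G_i
            if remaining == 0:
                break
            take = min(g, remaining)
            sizes.append(take)
            remaining -= take
    if remaining > 0:
        # Assumption: work left below the threshold goes out as one tuple.
        sizes.append(remaining)
    return sizes


if __name__ == "__main__":
    print(factoring_tuple_sizes(1000, 2, f=0.5, t=1))
```

Under these assumptions, N=1000, P=2, f=0.5, t=1 yields rounds of two equal tuples each, starting 250, 250, 125, 125, and the sizes sum to 1000.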


A Comparison of Implementation Strategies for Non-Uniform.. - Orlando, Perego (1997)   (Correct)

....i.e. the memory coherent local cache, is in this case implicit, and occurs at run time when a given data element is actually accessed. Many Self Scheduling policies have been proposed that are aimed at reducing synchronizations and contention overheads while achieving a good load balance [18, 24, 8]. Recently, however, it has been proved that, also on UMA parallel architectures, dynamic iteration assignment should be guided not only by considering the load balance goal, but also data reuse and locality exploitation. Markatos and LeBlanc [12] have investigated a scheduling strategy, based on ....

....only from those that have not yet communicated their termination. Once an overloaded processor decides to grant a migration request, it must choose the most appropriate number of chunks that must be sent to the asking processor. SUPPLE uses a modified Factoring scheme to determine this number [8]. Factoring is a Self Scheduling heuristic formerly proposed to address the efficient implementation of parallel loops on shared memory multiprocessors. It provides a way to determine the appropriate number of iterations that each processor must fetch at each access from a central queue which ....

S.F. Hummel, E. Schonberg, and L.E. Flynn. Factoring: A Method for Scheduling Parallel Loops. Comm. of the ACM, 35(8):90--101, Aug. 1992.


Load Balancing in Software Distributed Shared Memory Systems - Lai, Shieh, Ueng, Kok, Kung (1997)   (Correct)

....memory of target processor in a DSM system just as that in shared memory systems. Accordingly, barrier synchronization point of parallel program for DSM systems is one of the best selection with similarity to the loop end of parallel loop in the loop scheduling for shared memory machines [10][11][16][17][18] such that rescheduling is infrequent but sufficient. Meanwhile, it is easy for the system to detect the idleness if a lightly loaded or fast processor has all its subtasks arrived at the synchronization barrier. It is noted that the terms lightly loaded and fast are synonyms ....

....scheduling in DSM is emulated by message exchanging on the network, a number of network overhead and management bottleneck compared to the local thread queue in affinity scheduling will probably be imposed. At the other extreme, the algorithms such as the guided scheduling [11], factoring [16], and distributed self scheduling [17], whose intention is to reschedule loop iterations to idle processors, are not practicable because the individual concerned in DSM systems are as large as a thread. Particularly, in this algorithm, each processor in the system possesses a local scheduler ....

S. F. Hummel, E. Schonberg, and L. E. Flynn. Factoring: A method for scheduling parallel loops. Communications of the ACM, Vol. 35, No. 8, August 1992, pages 90-101.


An Empirical Study of the Workload Distribution Under Static.. - Zhiyuan Li (1994)   (2 citations)  (Correct)

....affinity dynamic scheduling algorithm report similar results for small kernels [8]. Although these kernels are small, their results indicate that our conclusion may well extend to the NUMA architectures. The LDS authors also compare their results with previous results obtained on the RP3 machine [4] and suggest that the memory allocation on NUMA architectures may considerably affect the comparison between static scheduling and dynamic scheduling. 2.3 Variants of scheduling policies. Under static scheduling, the compiler assigns a set of DOALL loop iterations to each processor by ....

....workload than static scheduling where the operation counts of individual iterations vary significantly. Dynamic scheduling has several variants whose main difference is in the number of iterations each processor may request. The guided self scheduling (GSS) [10] and some of its close variants [4], [6] exhibit the following optimality: assuming each iteration takes exactly the same amount of time to execute, then all processors finish executing the parallel loop within the time difference of one iteration from each other. Obviously, if different iterations may take different amounts of time ....

S. F. Hummel, E. Schonberg, and L. E. Flynn. Factoring: A method for scheduling parallel loops. CACM, 35(8):90--101, August 1992.


Scheduling Data-Parallel Computations on Heterogeneous and.. - Orlando, Perego (1997)   (1 citation)  (Correct)

....problem of load imbalance when this is introduced into uniform data parallel computations by heterogeneous and/or time shared environments. Weighted Factoring [8] has been conceived for heterogeneous distributed environments, and is a variant of the well known factoring self scheduling scheme [9] designed for shared memory environments. The main drawback of the proposal is still the presence of a centralized scheduler processor which may jeopardize the scalability of the approach. Moreover, it considers the relative speeds of processors involved as an input of the load balancing ....

....from those that have not yet communicated their termination. Once an overloaded processor decides to grant a migration request, it must choose the most appropriate number of chunks that have to be sent to the asking processor. SUPPLE uses a modified Factoring scheme to determine this number [9]. Factoring is a Self Scheduling heuristic formerly proposed to address the efficient implementation of parallel loops on shared memory multiprocessors. It provides a way to determine the appropriate number of iterations that each processor must fetch at each access from a central queue which ....

S.F. Hummel, E. Schonberg, and L.E. Flynn. Factoring: A Method for Scheduling Parallel Loops. Comm. of the ACM, 35(8):90--101, Aug. 1992.


Using Networks of Workstations for Database Query Operations - Dandamudi (1997)   (Correct)

....done with the 200 tuples, it will be given the next set of, say 150, tuples and so on. Several variable task granularity algorithms have been proposed for loop scheduling in multiprocessor systems. Some examples are the guided self scheduling [17], trapezoidal self scheduling [23], and factoring [14]. The advantage of these schemes is that, if a workstation is particularly slow, work can be diverted to other workstations dynamically. Last, we can develop learning based algorithms that set the task granularity proportional to the amount of time taken by the last task sent to the workstation. ....

S. F. Hummel, E. Schonberg, and L. E. Flynn, "Factoring: A Method for Scheduling Parallel Loops," Comm. ACM, Vol. 35, No. 8, August 1992, pp. 90--101.


A Comparison of Implementation Strategies for Non-Uniform.. - Orlando, Perego (1998)   (Correct)

....processor, i.e. the coherent local cache, is in this case implicit, and occurs at run time when a given data element is actually accessed. Many Self Scheduling policies have been proposed that are aimed at reducing synchronizations and contention overheads while achieving a good load balance [17, 22, 7]. Also on UMA multiprocessors, however, dynamic iteration assignment should be guided not only by considering the load balance goal, but also data reuse and locality exploitation [11]. Dynamic approaches can also be adopted on distributed memory, message passing machines. At run time a ....

....processors skips those processors that have already communicated their termination. Once an overloaded processor decides to grant a migration request, it must choose the most appropriate number of chunks that must be sent to the asking processor. SUPPLE uses a slightly modified Factoring [7] scheme to locally determine this number: an overloaded processor replies to a request for further work by sending k/(2P) chunks, where k is the number of chunks currently stored in Q, and P is the number of processors. The policy exploited by SUPPLE to manage data coherence and termination ....

S.F. Hummel, E. Schonberg, and L.E. Flynn. Factoring: A Method for Scheduling Parallel Loops. CACM, 35(8):90--101, 1992.
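
The migration rule in the context above can be illustrated with a small sketch. It reads the garbled expression as k/(2P) chunks, with k the number of chunks currently in the local queue Q. Rounding up, and taking the chunks from the tail of the queue, are assumptions made here; the function name is hypothetical and this is not SUPPLE's code.

```python
import math
from collections import deque


def grant_migration(local_queue, num_procs):
    """Remove and return the chunks an overloaded processor would send."""
    k = len(local_queue)  # chunks currently stored in Q
    # Assumed rounding: ceil(k / (2P)), so a non-empty queue sends at least one.
    count = math.ceil(k / (2 * num_procs))
    # Assumption: migrated chunks are taken from the tail of the queue.
    return [local_queue.pop() for _ in range(count)]


if __name__ == "__main__":
    q = deque(range(10))          # ten chunk identifiers, purely illustrative
    print(grant_migration(q, 2))  # chunks handed to the requesting processor
    print(len(q))                 # chunks kept locally
```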


Hardware And Software For Functional And Fine Grain Parallelism - Beckmann (1993)   (16 citations)  (Correct)

....that scheduling is nonpreemptive, i.e. tasks are run to completion. 2.2.1 Scheduling parallel loops. Loops are generally regarded as the largest source of parallelism within ordinary programs, and hence dynamic scheduling of parallel loops has received much attention in the literature to date [88, 13, 108, 51]. Dynamic scheduling of loops is not a topic of this thesis, since an array of existing techniques is available for our use. The loop scheduling problem consists of assigning iterations of the loop to processors for execution. Since all iterations of a parallel loop are independent, guaranteeing ....

....within C iteration times of each other, as opposed to within a single iteration time in the case of self scheduling. To remedy this, several algorithms decrease the chunk size as the loop nears completion. Guided self scheduling (GSS) [88], trapezoidal self scheduling (TSS) [108], and factoring [51] are all examples of this. These techniques differ in the initial chunk size, the way in which chunk size is decreased, robustness to certain anomalies, and total number of scheduling operations. They generally guarantee load balance within one or two iteration times, while incurring significantly ....

S. F. Hummel, E. Schonberg, and L. E. Flynn. Factoring: A method for scheduling parallel loops. Communications of the ACM, 35(8):90--101, August 1992.
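
As an illustration of the decreasing chunk sizes mentioned in the context above, here is a minimal sketch of guided self scheduling, in which each request receives ceil(R/P) of the R remaining iterations. The function name is hypothetical, and the sketch only computes the chunk sequence; it does not model execution times or contention.

```python
import math


def gss_chunks(n, p):
    """Return the successive GSS chunk sizes for n iterations and p processors."""
    chunks = []
    remaining = n
    while remaining > 0:
        size = math.ceil(remaining / p)  # 1/P of what is left, rounded up
        chunks.append(size)
        remaining -= size
    return chunks


if __name__ == "__main__":
    chunks = gss_chunks(100, 4)
    print(chunks)
    # Number of scheduling operations, versus 100 for one-iteration self scheduling.
    print(len(chunks))
```

For 100 iterations on 4 processors this gives 14 scheduling operations, starting with a chunk of 25 and ending with single iterations.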


SUPPLE: an Efficient Run-Time Support for Non-Uniform Parallel .. - Orlando, Perego (1996)   (2 citations)  (Correct)

....for effectively balancing the load, and send them to the destinations that asked for load migration. In terms of our support, this means choosing the most appropriate number of chunks that must be moved to grant a given migration request. We use a modified Factoring scheme to determine this number [16]. Factoring is a Self Scheduling heuristic formerly proposed to address the efficient implementation of parallel loops on shared memory multiprocessors. It provides a way to determine the appropriate number of iterations that each processor must fetch at each access from a central queue storing ....

....and P is the number of processors involved, are fetched at each time by an idle processor [20]. Trapezoid Self Scheduling [21] has been proposed by Tzen and Ni to reduce the number of synchronizations by linearly decreasing the chunk size. Hummel, Schonberg and Flynn have presented Factoring [16], the policy adopted also in SUPPLE to implement the task selection strategy of our load balancer. As mentioned in Section 3.2, Factoring requires that P consecutive chunks of size k, where k = u/(2P), are inserted into the shared queue when it becomes empty. The introduction of large caches ....

S.F. Hummel, E. Schonberg, and L.E. Flynn, "Factoring: A Method for Scheduling Parallel Loops," Comm. of the ACM, vol. 35, no. 8, pp. 90--101, Aug. 1992.


Increasing Chunk Size Loop Scheduling Algorithms for Data.. - Philip (1995)   (Correct)

....of iterations sampled uniformly throughout the iteration space rather than blocks of consecutive iterations. Guided Self Scheduling was designed for shared memory models. Efficient Message Passing Schedulers [35] have extended GSS for distributed memory models also. 2.2.4 Factoring. Factoring (FAC) [22] is also a decreasing chunk size allocation algorithm. Factoring tries to allocate half of the remaining iterations evenly among all p processors for every scheduling step. All processors will execute the same number of iterations at approximately the same time. In other words, all processors are ....

S.F. Hummel, E. Schonberg, and L.E. Flynn, "Factoring: A Method for Scheduling Parallel Loops," Communications of the ACM, Vol. 35, No. 8, pp. 90-101, August 1992.


Parallel Query Processing - Yu, Chen, Wolf, Turek (1993)   (7 citations)  (Correct)

No context found.

S. Hummel, E. Schonberg, and L. Flynn. Factoring: A Method for Scheduling Parallel Loops. Communications of the ACM, pages 90--101, August 1992.

CiteSeer.IST - Copyright Penn State and NEC