V. Kumar, A.Y. Grama, and N. R. Vempaty, "Scalable load balancing techniques for parallel computers," Journal of Parallel and Distributed Computing, vol. 22, pp. 60--79, 1994.
....runs out of work, it makes a request to another processor for work. Other schemes use a farming model, where one processor is used to manage the search space while other processors perform tasks that are farmed out by the manager. For a complete summary of load balancing schemes see [KGR,94]. Most load balancing schemes were originally designed for distributed systems and are message based. The schemes have been successfully ported to tightly coupled systems by using shared memory and additional synchronization for sending or receiving work. However, these message based models are ....
....a free thread would either request work, steal work [BLL,94] or be assigned work. Instead, the inexpensive cost of thread creation is exploited within an allowable limit, as required by the run time conditions of a search. The scheme has similarities to multi level load balancing described in [KGR,94] since the threads become arranged in the form of a tree. However, it differs because the Dynamic Thread Creation is completely dynamic. The threads are not statically ordered. Instead, the structure of the thread tree is formed automatically and will change with the execution of the program. ....
Kumar, V.; Grama, A.; Rao, V. N.: Scalable load balancing techniques for parallel computers, Journal of Distributed Computing, 22(1):60-79, March, 1994.
....asynchronous, workload moved without global synchronization and without a global view of the system; key issues were how transfers are initiated, and what information is used to govern those initiations. More recent work has a different view of workload, but has continued in the asynchronous vein [24, 12]. A synchronous view of remapping was taken in work on decision policies that focus on when to remap [18, 19, 20]; balancing the delay cost of remapping against the anticipated performance gain is the essence of these policies. Globally synchronous remapping techniques are developed in [5, 25] ....
V. Kumar, A. Grama, and N. Vempaty. Scalable load balancing techniques for parallel computers. Journal of Parallel and Distributed Computing, 22:60-79, 1994.
....attempts to determine appropriate shares of work to be distributed to processors at application startup, while dynamic load balancing may redistribute part of the work during the execution to compensate for differences in processor performance. Classical dynamic load balancing strategies [9] seem well suited to grid environments, as computing and network resources always change. However, such dynamic load balancing strategies may fail because of the many redistributions they imply. Moreover, data dependencies may slow down the fastest processor to the speed of the slowest one [3]. Static strategies ....
V. Kumar, A. Y. Grama, V. N. Rao, Scalable Load Balancing Techniques for Parallel Computers, Journal of Parallel and Distributed Computing, 1994
....attempts to determine appropriate shares of work to distribute to processors at application startup, while dynamic load balancing may redistribute part of the work during the execution to compensate for differences in processor performance. Classical dynamic load balancing strategies [8] seem well suited to grid environments, as computing and network resources always change. Unfortunately, they may fail because of the many redistributions they imply. Moreover, data dependences may slow down the fastest processor to the speed of the slowest one [2]. Static strategies may be very useful when the ....
V. Kumar, A. Y. Grama, V. N. Rao, Scalable Load Balancing Techniques for Parallel Computers, Journal of Parallel and Distributed Computing, 1994
....that can exploit these possibilities. Many researchers have tried to find parallelization techniques for AI applications and they have mainly focused on ways to distribute the search tree among the existing processors [3,4,10,12,13]. These techniques, which have been enriched with load balancing [9] and operator reordering [5,12], produce quite efficient parallel algorithms. In this paper, we show that GRT examines only a small subpart of the search tree and thus methods relying on tree distribution cannot be applied efficiently to this planner. We present a different approach, which ....
....be working on promising parts of the search tree while the others contribute little or nothing to the process of finding a solution. Moreover, the communication is necessary for load balancing, since the local agenda of a processor may become empty if many non expandable states have been examined [9]. Load balancing includes the transfer of states from one local agenda to another, in order to equalize the workload in all processors. This transfer can be performed directly or via a global memory structure, called blackboard. In [9] Kumar et al. review a number of receiver and sender initiated ....
[Article contains additional citation context not shown here]
V. Kumar, A. Y. Grama and V. N. Rao, "Scalable Load Balancing Techniques for Parallel Computers", Journal of Parallel and Distributed Computing, Volume 22, Number 1, pp. 60-79, 1994.
....supplies new subproblems by requesting other PEs to split their subproblem. Idle PEs receiving a request either reject the request or redirect it to another PE. Figure 1 shows pseudocode for such a generic tree splitting algorithm. This approach has proved useful under a variety of circumstances [7, 24, 25, 13, 5, 21, 14, 29, 27, 2, 10, 28]. A major advantage of receiver induced tree splitting is that load balancing only takes place when necessary. If the sequential execution time T seq is large, the average size of a transmitted subproblem is also fairly large (i.e. it represents a large execution time). Productive work done on a ....
....these neighborhood polling schemes is that highly loaded PEs are quickly surrounded by a cluster of busy PEs and are therefore unable to transmit work; subproblem transmissions at the border of these clusters only involve small subproblems which are not worth the effort of communicating them. In [13, 14] a variety of other partner selection schemes is analyzed. There seems to be a dilemma between schemes based on local information on the one hand which may produce many vain requests to idle PEs, and global selection schemes on the other hand which incur additional message traffic and often suffer ....
V. Kumar and G. Y. Ananth. Scalable load balancing techniques for parallel computers. Technical Report TR 91-55, University of Minnesota, 1991.
....its subproblem and transmits one part to the idle PE. Our approach towards analyzing RP is somewhat atypical. We do not present implementation results or comparisons between a variety of new and old algorithms as it is usual in many papers with practical orientation. This has been done elsewhere [3, 5, 2, 6, 15, 14]. Nor do we delve deep into a particular detail that is most beautiful, difficult, interesting or tractable from a theoretical point of view. Instead, we try to throw light on as many aspects of the algorithm as possible; hoping that this approach helps to understand and implement the algorithm in ....
....probability is a stronger notion than the more customary notion of average case behavior for our purposes. 1.2 Related work. In [4] it is proved that for d(n) ∈ O(1) and fl(w) g(w)g O(1) RP has an isoefficiency function in O(n log n) with high probability. Much tighter is the result in [5]: If fl(w) g(w)g O(1) and h(w) ∈ O(log w) the isoefficiency function of RP is in O(n d(n) log n) on the average. This already indicates a quite good scalability. But it falls short of explaining why RP is in practice more efficient than a deterministic algorithm introduced in the same ....
[Article contains additional citation context not shown here]
V. Kumar and G. Y. Ananth. Scalable load balancing techniques for parallel computers. Technical Report TR 91-55, University of Minnesota, 1991.
.... algorithms use a simpler approach regarding tree decomposition by requiring all splits to occur before calls to work (in our terminology). However, this is only efficient for some applications since in the worst case a huge number of subproblems may have to be generated or communicated (e.g. [14, 5, 24]). Random polling belongs to a family of receiver initiated load balancing algorithms which have the advantage of splitting subproblems only on demand by idle PEs. This adaptive approach has been used successfully for a variety of purposes such as parallel functional [1] and logic programming [12] or ....
....strategy turns out to be crucial. The apparently economic option to poll neighbors in the interconnection network can be extremely inefficient since it leads to a buildup of clusters of busy PEs shielding large subproblems from being split [23]. Polling PEs in a global round robin fashion [14] avoids this because no large subproblems can hide. Execution times T par 2 O P hT count can be achieved where T count is the time for incrementing a global counter. However, even sophisticated distributed counting algorithms have T count (T rout log P / log log P) [31]. It was long ....
[Article contains additional citation context not shown here]
V. Kumar and G. Y. Ananth. Scalable load balancing techniques for parallel computers. Technical Report TR 91-55, University of Minnesota, 1991.
.... algorithms use a simpler approach regarding tree decomposition by requiring all splits to occur before calls to work (in our terminology). However, this is only efficient for some applications since in the worst case a huge number of subproblems may have to be generated or communicated (e.g. [18, 7, 27]). Random polling belongs to a family of receiver initiated load balancing algorithms which have the advantage of splitting subproblems only on demand by idle PEs. This adaptive approach has been used successfully for a variety of purposes such as parallel functional [1] and logic programming [16] or ....
....strategy turns out to be crucial. The apparently economic option to poll neighbors in the interconnection network can be extremely inefficient since it leads to a buildup of clusters of busy PEs shielding large subproblems from being split [26]. Polling PEs in a global round robin fashion [18] avoids this because no large subproblems can hide. Execution times T par 2 O( hT count ) can be achieved where T count is the time for incrementing a global counter. However, even sophisticated distributed counting algorithms have T count (T rout log P / log log P) [36]. It was long ....
[Article contains additional citation context not shown here]
V. Kumar and G. Y. Ananth. Scalable load balancing techniques for parallel computers. Technical Report TR 91-55, University of Minnesota, 1991.
....new work from load balancer
WHILE subproblem is not empty DO
    IF there is a load request THEN
        split subproblem
        send one part to the initiator of the request
    do some work on subproblem
Fig. 1. Receiver induced tree splitting.
This approach has proved useful under a variety of circumstances [6, 14, 5, 22, 15, 25, 2, 11, 26, 7]. A major advantage of receiver induced tree splitting is that load balancing only takes place when necessary. If the sequential execution time T seq is large, the average size of a transmitted subproblem is also fairly large (i.e. it represents a large execution time). Productive work done on a ....
....a desired efficiency. In this respect there are large differences between different load balancing strategies. We use the behavior of the lower order terms as a measure for the scalability of an algorithm: the smaller these terms, the smaller the problem size required for good efficiency. In [14] it is shown that sending requests to neighboring PEs has quite poor scalability except for the combination of low diameter interconnection networks (e.g. hypercubes) and a work splitting function which produces subproblems of nearly equal size. The basic problem of these neighborhood polling ....
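The receiver induced tree splitting loop quoted in this excerpt can be illustrated with a small round-based simulation. This is only a sketch under stated assumptions, not the cited authors' implementation: a subproblem is modelled as a plain count of work units, `split` halves it, idle PEs pick their partner at random, and the names `split` and `simulate` are hypothetical.

```python
import random


def split(work):
    """Split a subproblem (modelled here as a count of work units) in two."""
    sent = work // 2
    return work - sent, sent


def simulate(num_pes, total_work, seed=0):
    """Round-based sketch of receiver induced tree splitting.

    Assumptions (not from the excerpt): all work starts on PE 0, idle PEs
    poll a random partner, and a busy PE serves at most one request per round.
    Returns (rounds, work_units_done_per_pe).
    """
    rng = random.Random(seed)
    work = [0] * num_pes
    work[0] = total_work
    done = [0] * num_pes
    rounds = 0
    while any(work):
        rounds += 1
        # Idle PEs each send one load request to some other PE.
        requests = {}
        for pe in range(num_pes):
            if work[pe] == 0 and num_pes > 1:
                target = rng.choice([q for q in range(num_pes) if q != pe])
                requests.setdefault(target, []).append(pe)
        received = [0] * num_pes
        for pe in range(num_pes):
            if work[pe] == 0:
                continue  # requests sent to an idle PE are simply rejected
            # IF there is a load request THEN split and send one part.
            for requester in requests.get(pe, [])[:1]:
                work[pe], part = split(work[pe])
                received[requester] += part
            # Do some work on the (remaining) subproblem.
            work[pe] -= 1
            done[pe] += 1
        # Transferred parts become available in the next round.
        for pe in range(num_pes):
            work[pe] += received[pe]
    return rounds, done
```

Calling `simulate(4, 100)` returns the number of rounds needed and how many units each PE processed; the per-PE counts always sum to the initial amount of work, since splitting only moves work between PEs.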
[Article contains additional citation context not shown here]
V. Kumar and G. Y. Ananth. Scalable load balancing techniques for parallel computers. Technical Report TR 91-55, University of Minnesota, 1991.
....The detection of load imbalance may also be expensive since this often requires some form of global communication. A good survey of techniques is given in [40]. Kumar et al. have analyzed the scalability properties of a number of dynamic load balancing schemes on a range of architectures [47]. Near optimal load balancing strategies are presented and analyzed for the hypercube, mesh, and networks of workstations. Nicol and Reynolds have analyzed the dynamic load balancing problem at a much coarser level [66]. The authors present a decision model for the application of dynamic load ....
....balancing may also be needed if the problem is irregular and the workload distribution unpredictable. We could rerun the partitioning and placement algorithm in these cases. However, this is a global strategy that is not scalable. More scalable dynamic load balancing strategies are given in [47]. An important part of dynamic load balancing is to determine when it is beneficial to perform the load rebalancing. We could extend the callback mechanism to add additional information that would be useful in making this decision. A callback such as cycles left could return the iterations ....
V. Kumar, A.Y. Grama, and V.N. Rao, "Scalable Load Balancing Techniques for Parallel Computers," Journal of Parallel and Distributed Computing, Vol. 22, July 1994.
....Compared to RA, two communications are needed per load balancing step. However, the first is a small request message. Compared to RA, work gets transferred only if warranted according to the target processor's (non-)idle status. 4 Simple Scalability Analysis. We will use the isoefficiency model of [4] to address scalability issues, particularly with respect to load balancing. In this model, all of the load W is initially in one processor; the analysis focuses on the work needed to distribute the load over all the processors. Note that, for our application, this may be thought of as the type of ....
.... arbitrarily small) such that 1 ; 0:5; so that the load balancing step leaves both workers with a portion bounded by (1 )w. With respect to receiver initiated load balancing, let V (p) denote the number of requests needed for each worker to receive at least one request, as in [4]. If the total error would remain constant, then as a result of load balancing, the amount of error at any processor does not exceed (1 )W after V (p) load balancing steps; this is used in [4] to estimate the number of steps needed to attain a certain threshold, since under the given assumptions ....
[Article contains additional citation context not shown here]
V. Kumar, A. Y. Grama, and N. R. Vempaty. Scalable load balancing techniques for parallel computers. Journal of Parallel and Distributed Computing, 22(1):60-79, 1994.
....poor performance. Another option is to divide the workload dynamically so that a thread with no work can obtain work from a busy thread. This is often referred to as dynamic load balancing. Several dynamic load balancing strategies exist and can be easily adapted for multithreading. See [KGR94] for an in depth discussion. The approaches described here use the Dynamic Thread Creation scheme [ZY98a]. Informally, this scheme removes the general communication paradigms that are associated with traditional load balancing schemes by exploiting the inexpensive cost of thread creation. Here, ....
V. Kumar, A. Grama, and V.N. Rao, Scalable load balancing techniques for parallel computers, Journal of Distributed Computing, 7 (1994).
....[10] can lead to load imbalances and consequently a poor speedup. Another option is to divide the search space dynamically so that a thread with no work can obtain work from a busy thread. This is often referred to as dynamic load balancing. Several dynamic load balancing schemes exist [4][5][6][11], for example receiver initiated schemes like Nearest Neighbor, Global Round Robin and Random Polling. Most of these schemes were originally designed for distributed systems and are message based. The schemes have been successfully ported to tightly coupled systems by using shared memory and ....
....either request work, steal work [7] or be assigned work. Instead, we exploit the inexpensive cost of thread creation, and allow thread creation, within an allowable limit, as required by the run time conditions of a search. The scheme has similarities to multi level load balancing described in [11], since the threads become arranged in the form of a tree. However, it differs because the scheme is completely dynamic. The threads are not statically ordered. Instead, the structure of the thread tree is formed automatically and will change with the execution of the program. This continually ....
V. Kumar, A. Grama, and V.N. Rao, Scalable load balancing techniques for parallel computers, Journal of Distributed Computing, 22 (1), 1994, 60-79.
....introduced by the algorithm reaches 36% for 8 processors. This is determined by three factors: (1) the relatively high communication costs considered (1:5 0:005 L milliseconds for messages of size L bytes); (2) the cost of the dynamic load balancing mechanism for a network of workstations [16]; and (3) the small granularity of the subproblems. We shall see that for a larger problem (Table 1) the overhead is much lower (15.58% of the total execution for 100 processors, of which 13.67% are load balancing costs, 0.78% communication time and 1.13% list contraction time). Furthermore, ....
V. Kumar, A. Y. Grama, and N. R. Vempaty. Scalable load balancing techniques for parallel computers. Journal of Parallel and Distributed Computing, 22(1):60--79, July 1994.
....Figure 1 includes the case where a global priority queue is used (across the processors) if get region() and put regions() manipulate the global queue. A global priority queue such as the one described in [13,19] can be used. 2.2 Load Balancing. Our load balancing approach is receiver initiated [15], in which the controller acts as an intermediator, keeping a list of the identifiers of idle workers and passing these to workers with work to share. The actual negotiation aspect of sharing work is handled by pairs of workers. In particular, when a worker detects that its local estimate of the ....
V. Kumar, A. Y. Grama and N. R. Vempaty, Scalable load balancing techniques for parallel computers, Journal of Parallel and Distributed Computing, 22 (1) (1994), pp. 60-79.
.... view of the nature of our application: it is easier to determine a processor that is close to idle, which can be done independent of the other processors, than to determine how busy a processor is (relative to the loads of other processors). A survey of dynamic load balancing schemes was given in [6]. We give a brief overview in Table 1. ARR (Asynchronous Round Robin): idle processor requests work from processor with id = target, where target is local and incremented by 1 (mod p) at each request. GRR (Global Round Robin): shared target variable kept at processor 0; idle processor ....
....for work, every processor is requested at least once; and TO = the total communication overhead (summed over all processors) assuming that processor 0 has W units of work initially and all other processors are idle. Using W = O(D log D) (total work done by sequential algorithm) and V (p) from [6], we calculated the communication overhead TO UcommV (p) log W for the load balancing methods ARR, GRR M, RP and SB and for our global heap (GH) and list their asymptotic values (for the hypercube) in Table 2. The correARR p 2 log p log D GRR M p log p log D RP p log 2 p log D (avg. SB ....
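The partner selection rules summarised in this excerpt can be sketched as three small classes. This is a minimal illustration, not code from the cited work: the class and method names are hypothetical, the starting value of the ARR counter and the read-then-increment behaviour of GRR are assumptions (the excerpt is truncated before it describes GRR fully), and random polling is included only because other excerpts name it alongside these schemes.

```python
import random


class AsyncRoundRobin:
    """ARR: every processor keeps its own local target counter."""

    def __init__(self, me, p):
        self.p = p
        # Assumed starting point; the excerpt does not state the initial value.
        self.target = (me + 1) % p

    def next_target(self):
        t = self.target
        # "incremented by 1 (mod p) at each request"
        self.target = (self.target + 1) % self.p
        return t


class GlobalRoundRobin:
    """GRR: a single shared target counter (kept at processor 0 in the excerpt)."""

    def __init__(self, p):
        self.p = p
        self.target = 0

    def next_target(self):
        # Assumed behaviour: read the shared counter, then advance it, so that
        # consecutive requests from any processors go to consecutive targets.
        t = self.target
        self.target = (self.target + 1) % self.p
        return t


class RandomPolling:
    """RP: each request goes to a uniformly random other processor (needs p >= 2)."""

    def __init__(self, me, p, seed=None):
        self.me = me
        self.p = p
        self.rng = random.Random(seed)

    def next_target(self):
        t = self.rng.randrange(self.p - 1)
        # Shift past our own id so that we never poll ourselves.
        return t if t < self.me else t + 1
```

Read literally, the ARR rule does not skip the requesting processor's own id, so the sketch does not either. In a real system the GRR counter is a shared resource, and access to it is where the contention mentioned in several of the excerpts arises.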
A. Y. Grama, V. Kumar and V. N. Rao. Scalable Load Balancing Techniques for Parallel Computers. Journal of Parallel and Distributed Computing, Vol. 22, No. 1, pp.60-79, 1994.
....to another is small compared to its processing time. 3) It is not possible or is very difficult to estimate the processing time for a piece of work. These are characteristics of many search algorithms used in artificial intelligence and operations research and many divide and conquer algorithms [6]. Randomized methods are of interest because of their simplicity, ease of implementation, and good performance. Since in the above applications different work pieces can be of widely differing and unpredictable sizes and/or quality, in general, they may require either combined dynamic quantitative ....
V. Kumar, A. Grama, and V. N. Rao. "Scalable load balancing techniques for parallel computers" TR 91-55, CSci. Dept., U. of Minnesota, 1991.
....pattern, for N = 31, c = 4, f = 1 and S = 4, in PLA LOHA QE. Finally, in Fig. 4 we plot the isoefficiency curves for the various algorithms for normally distributed data. A lower bound on the isoefficiency of any load balancing scheme for the hypercube architecture is Θ(P log P) [7]. It can be seen that the isoefficiency functions for PLA GOHA QE and PLA LOHA QE are close to the lower bound of Θ(P log P) and are far better than those of other algorithms. Note that the isoefficiency curves for PLA LOHA QE and PLA GOHA QE meet each other at 1024 processors. From the ....
V. Kumar, G.Y. Ananth and V.N. Rao, "Scalable Load Balancing Techniques for Parallel Computers," Technical Report, No.TR 91-55, Comp. Sci. Dept., Univ. of Minnesota, Mpls, MN 55455.
....introduced by the algorithm reaches 36% for 8 processors. This is determined by three factors: (1) the relatively high communication costs considered (1:5 0:005 L milliseconds for messages of size L bytes); (2) the cost of the dynamic load balancing mechanism for a network of workstations [18]; and (3) the small granularity of the subproblems. We will see that for a larger problem (Table 1) the overhead is much lower (15.58% of the total execution for 100 processors, of which 13.67% are load balancing costs, 0.78% communication time and 1.13% list contraction time). Furthermore, this ....
V. Kumar, A. Grama, V. Rao. 1994. Scalable load balancing techniques for parallel computers. Journal of Parallel and Distributed Computing, 22(1):60-79.
....away actor creations since Fibonacci actors are purely functional. The computation tree of the Fibonacci program has a great deal of load imbalance. Table 6.2 and Figure 6.3 compare two execution results with and without dynamic load balancing (DLB). A receiver initiated random polling scheme [83] is used for dynamic load balancing. As Figure 6.3 shows, the version with DLB performs worse on partitions of a small size due to the overhead for extra book keeping. However, it eventually outperforms the version without DLB as the size increases. 6.4 Systolic Matrix Multiplication The systolic ....
V. Kumar, A. Y. Grama, and V. N. Rao. Scalable Load Balancing Techniques for Parallel Computers. Technical Report 91-55, CS Dept., University of Minnesota, 1991. available via ftp ftp.cs.umn.edu:/users/kumar/lbMIMD.ps.Z.
....get new work from load balancer
WHILE subproblem is not empty DO
    IF there is a load request THEN
        split subproblem
        send one part to the initiator of the request
    do some work on subproblem
Figure 1: Receiver induced tree splitting.
This approach has proved useful under a variety of circumstances [3, 16, 17, 6, 2, 13, 7, 5, 20, 18, 19]. A major advantage of receiver induced tree splitting is that load balancing only takes place when necessary. Also, in the beginning, the size (execution time) of a transmitted subproblem will be fairly large; subsequent productive work done on the migrated subproblem will make up for the expense of ....
....neighborhood polling schemes is that highly loaded PEs will quickly be surrounded by a cluster of busy PEs and are therefore unable to transmit work; subproblem transmissions at the border of these clusters only involve small subproblems which are not worth the effort of communicating them. In [6, 7] a variety of other partner selection schemes is analyzed. There is a tradeoff between schemes based on local information which may produce many vain requests to idle processors, and global selection schemes which incur additional message traffic and often suffer from contention at centralized ....
[Article contains additional citation context not shown here]
V. Kumar and G. Y. Ananth. Scalable load balancing techniques for parallel computers. Technical Report TR 91-55, University of Minnesota, 1991.
....to be redistributed such that a maximum number of PEs can continue processing. Load balancing is the task of equilibrating the load as evenly as possible. The easier task of load sharing is to supply each PE at least with some load. For MIMD computers, many approaches have already been studied [4]. Recently, parallel B&B has been investigated for SIMD computers. A global approach is to map busy to idle PEs in a one to one correspondence and to transfer nodes by global communication. This approach has been applied to the IDA* algorithm in [6, 7]. Static or dynamic trigger conditions ....
Kumar V., Ananth G. Y., Rao V. N., 1991, "Scalable Load balancing techniques for parallel computers", Technical Report 91-55, Department of Computer Science, University of Minnesota.
.... size to the number of processors required to maintain a system's efficiency, and enables the determination of scalability with respect to the number of processors [14]. Isoefficiency analysis has been used to characterize the scalability of some load distribution policies for different machines [20]. Another important property is stability (see section 4.1). Formally, a stable load distribution algorithm is defined as one in which a bounded input produces a bounded output. More informally, we can say that stability is the ability of the algorithm to detect when the effects of further actions ....
....However, both referred papers present policies with better results. In [24, 36] a similar approach is presented in which the destination nodes are randomly chosen from a subdomain of the system (possibly nearest neighbours) instead of the whole system. [20] presents an isoefficiency study of a random location policy in several different machine architectures. 3.5 Probing. This is a demand driven location policy: a node willing to participate in a task transfer probes another node to find out if it is a suitable partner. Nodes can be probed either ....
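For readers unfamiliar with the term, the isoefficiency relation that these excerpts refer to can be written out as below. This is the standard textbook form rather than anything taken from the citing paper, and the symbols are assumed notation: W is the problem size measured as sequential work, p the number of processors, T_o the total overhead summed over all processors, and E the efficiency.

```latex
E = \frac{W}{W + T_o(W, p)}
\quad\Longrightarrow\quad
W = \frac{E}{1 - E}\, T_o(W, p)
```

With E held fixed, the rate at which W must grow with p to satisfy the second equation is the isoefficiency function; the more slowly it grows, the more scalable the load distribution scheme.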
[Article contains additional citation context not shown here]
Kumar, V., and Grama, A. Scalable load balancing techniques for parallel computers. Tech. rep., University of Minnesota, 1992.
....= U_i(T_n) − U_i(T_{n−1}). Thus, if Φ(i) 0 or Φ(i) 0 for all i ∈ {1, N}, the load balancing process is aborted. [Laboratoire PRiSM, M. Benaichouche] In most studies of the load regulation problem, notably those of Kumar [18], Abderahman [1], Felten [12], Gengler [13], and Lüling and Monien [22, 23], the communication cost incurred by load balancing has a strong influence on the expected performance. Consequently, we turned to an analytical study of this factor, in ....
V. Kumar, G.Y. Ananth, and V.N. Rao. Scalable load balancing techniques for parallel computers. Journal Of Parallel And Distributed Computing., 1991.
....algorithm which finds all the solutions of CSPs (total exploration of the search tree) and splits different parts of the tree to allocate them to processes in a network of computers. There has been a lot of research on parallel processing of backtracking search for the general tree search problem [3, 2, 6, 4]. Good experimental results have been reported and theoretical works about the load balancing among processes are numerous, but we have found very little theoretical work about the minimization of the communication cost and we have decided to focus our interest on it. 2. The Distributed Algorithm The ....
V. Kumar, A. Grama, and V. Rao. Scalable load balancing techniques for parallel computers. Journal of Parallel and Distributed Computing, 22(1), 1994.
....when the local load is below a given threshold. Their technique, however, needs to maintain global knowledge of the load either on all the nodes involved, or on a subset of the nodes called domain. Another interesting work that presents general load balancing techniques is the one by Kumar et al. [10]. Among others, they introduced a technique that adopts a global round robin policy to select the processing node to which a further task must be requested. The technique does not assume any knowledge of the load, and thus it might not be very accurate in scheduling decisions. On the other hand, ....
....of local chunks not yet scheduled, 2. the average cost of each local chunk, and 3. the cost of chunk migration which depends on the specific parallel machine. Since it is the receiver of a remote chunk that starts chunk migration, our technique can be defined as a receiver initiated technique [10, 18]. According to the framework proposed by Willebeek-LeMair and Reeves [18], the technique can be characterized as follows: • Processor Load Evaluation. Since the first part of our technique is static, the average cost of local chunks can be determined by a very simple code instrumentation, which ....
[Article contains additional citation context not shown here]
V. Kumar, A.Y. Grama, and N. Rao Vempaty. Scalable Load Balancing Techniques for Parallel Computers. Journal of Parallel and Distributed Computing, 22:60--79, 1994.
....supplies new subproblems by requesting other PEs to split their subproblem. Idle PEs receiving a request either reject the request or redirect it to another PE. Figure 1 shows pseudocode for such a generic tree splitting algorithm. This approach has proved useful under a variety of circumstances [7, 24, 25, 13, 5, 21, 14, 29, 27, 2, 10, 28]. A major advantage of receiver induced tree splitting is that load balancing only takes place when necessary. If the sequential execution time T seq is large, the average size of a transmitted subproblem is also fairly large (i.e. it represents a large execution time). Productive work done on a ....
....these neighborhood polling schemes is that highly loaded PEs are quickly surrounded by a cluster of busy PEs and are therefore unable to transmit work; subproblem transmissions at the border of these clusters only involve small subproblems which are not worth the effort of communicating them. In [13, 14] a variety of other partner selection schemes is analyzed. There seems to be a dilemma between schemes based on local information on the one hand which may produce many vain requests to idle PEs, and global selection schemes on the other hand which incur additional message traffic and often suffer ....
V. Kumar and G. Y. Ananth. Scalable load balancing techniques for parallel computers. Technical Report TR 91-55, University of Minnesota, 1991.
....and scaling. These results show that convergence is rapid, accuracy is limited only by machine precision, and superlinear speedup can be achieved for cases of practical interest in CFD. 2 A Parabolic Model A number of articles have proposed solutions to the load balancing problem in recent years [2, 5, 10, 12, 13, 15, 17, 18, 22]. Many of these solutions are reliable and efficient for computer systems with small numbers of processors. Unfortunately many of them do not scale well to systems with large numbers of processors. It is well known that in a scalable algorithm the amount of work performed in parallel grows more ....
Kumar, V., Ananth, G. Y. & Rao, V. N. Scalable load balancing techniques for parallel computers. Preprint 92-021. Army High Performance Computing Research Center, Minneapolis, MN, 1992.
....owner computes rule. Each processor has a local queue Q to store its own chunks. At the beginning, and until a load imbalance has been detected, chunks are statically scheduled by fetching them from Q. The dynamic part of the scheduling policy can be classified as a receiver initiated technique [5] and starts when a processor understands that queue Q is becoming empty. Another queue QR, initially empty, is used at this point to store the chunks received from more loaded partners. Figure 1 sketches the SPMD implementation of an iterated parallel loop. In particular, the figure shows the ....
V. Kumar, A.Y. Grama, and N. Rao Vempaty. Scalable Load Balancing Techniques for Parallel Computers. J. of Parallel and Distr. Comp., 22:60--79, 1994.
....algorithm is fully distributed and asynchronous. Chunk migration decisions are taken on the basis of local information only. It can be classified as a receiver initiated load balancing technique, since it is the receiver of remote chunks that asks overloaded processors for work migration [14, 15]. According to the framework proposed by Willebeek-LeMair and Reeves [15], our load balancing technique can be characterized by considering the strategies exploited for processor load evaluation, load balancing profitability determination, task migration, and task selection. Processor Load ....
....by the destination (receiver) of the chunks. The receiver selects the processor to be asked for further chunks by using a round robin policy. This criterion was chosen for its simplicity and also because tests showed that it was the best policy to adopt. As discussed in the paper by Kumar et al. [14], it is more important not to spend too much time in making a decision than always making the best decision. SUPPLE uses the same Threshold parameter mentioned above to reduce overheads of our task migration strategy. Overheads should derive from requests for remote chunks which cannot be served ....
[Article contains additional citation context not shown here]
V. Kumar, A.Y. Grama, and N. Rao Vempaty, "Scalable Load Balancing Techniques for Parallel Computers," J. of Parallel and Distr. Comp., vol. 22, pp. 60--79, 1994.
....is needed. When all the solutions of a problem have to be found by expanding the whole search tree, satisfactory performance has been achieved: the speedup can be close to linear, that is, the resolution time is almost divided by the number of engaged processes. This is established in theory [KGR94, Prc96] and verified in practice [FM87, FK88, RKR87]. However, an additional difficulty has to be faced when only one solution or the best solution is expected: some processes running in parallel may explore a part of the search tree which would not have been examined during a sequential search, and this may produce ....
....to the search for all solutions. To parallelize the expansion of the whole tree, the root nodes of the highest subtrees are allocated to the processes. Each process runs a DFS of its own subtree(s). As soon as it terminates, it asks another process for one or several new root nodes. In [KGR94] it is proven that the communication overhead can be made to grow only polynomially while guaranteeing equitable work sharing. This confirms previous experimental results that show a speedup close to linear for problems solved by this kind of algorithm [FM87, FK88, RKR87]. Applied to CSPs, ....
V. Kumar, A. Grama, and V. Rao. Scalable load balancing techniques for parallel computers. Journal of Parallel and Distributed Computing, 22(1), 1994.
....by a surrounding area, whose size depends on the specific stencil features. Migrated chunks and data tiles are stored by each receiving processor in a queue RQ, called remote. Since load balancing is started by underloaded processors, our technique can be classified as receiver initiated [8, 19]. In the following we detail our hybrid scheduling algorithm. During the initial static phase, each processor only executes local chunks in Q and measures their computational cost. Note that, since the possible load imbalance may only derive from different speeds of the processors involved, ....
V. Kumar, A.Y. Grama, and N. Rao Vempaty. Scalable Load Balancing Techniques for Parallel Computers. J. of Parallel and Distr. Comp., 22:60--79, 1994.
....Processor A initializes the visibility computations of packet P, sends it around the ring and returns it to Processor B afterwards. Processor A repeats the random polling until all processors have finished the current iteration. The method of random polling was chosen as the work of Kumar et al. [Kuma94] shows that this method is in general superior to all other methods considered by them in a comprehensive survey, especially when used with massively parallel computer systems. As idle processors do not generate any packets themselves but only introduce one packet from another ring at a time this ....
Vipin Kumar, Ananth Y. Grama, Nageshwara Rao Vempaty, "Scalable Load Balancing Techniques for Parallel Computers", Journal of Parallel and Distributed Computing 22, 60-79 (1994).
....assignments can be found in the subdirectories C and F90, see below in section 3.1. 2 Dynamic load balancing The sections below briefly discuss techniques for load balancing. More details about these techniques can be found in the articles discussed in connection with lecture 13 ([1] and [2]). The algorithms in sections 2.1, 2.2 and 2.3 require a method to split the work. In our simulated problem this is quite easy: we just send half of the work that is left to the processor requiring more work. In the final section below, 2.5, a different type of load balancing scheme is considered. ....
Vipin Kumar, Ananth Y. Grama and Vempaty Nageshwara Rao, Scalable Load Balancing Techniques for Parallel Computers. Journal of Parallel and Distributed Computing, 22(1):60--79, 1994.
....of stability $\Phi$: $\Phi(i) = U_i(T_n) - U_i(T_{n-1})$. Thus, if for all processors ($i = 1, \dots, N$) $\Phi(i) \geq 0$ or $\Phi(i) \leq 0$, then the load balancing process is aborted. In most studies of the load regulation problem, in particular those of Kumar [KAR91], Abderahman [AM88], Felten [Fel88], Gengler [GC92], as well as that of Luling and Monien [LM89, LM92], the communication overhead incurred by load balancing has a strong influence on the expected performance. Consequently, we turned to an analytical study ....
V. Kumar, G.Y. Ananth, and V.N. Rao. Scalable load balancing techniques for parallel computers. Journal Of Parallel And Distributed Computing., 1991.
....of 11,405,773 actors. Moreover, its computation tree has a great deal of load imbalance. Table 4 shows the execution results when using a dynamic load balancing scheme. Since Fibonacci actors are purely functional, actor creations were optimized away. Receiver initiated random polling scheme [25] is used for dynamic load balancing. As a point of comparison, executing the Fibonacci of 33 using the Cilk system [6] takes 73.16 seconds on the same Sparc processor and an optimized C version completes in 8.49 seconds. 7.3 Systolic Matrix Multiplication The systolic matrix multiplication ....
V. Kumar, A. Y. Grama, and V. N. Rao. Scalable Load Balancing Techniques for Parallel Computers. Technical Report 91-55, CS Dept., University of Minnesota, 1991. available via ftp ftp.cs.umn.edu:/users/kumar/lbMIMD.ps.Z.
....such that $\psi w > cw$ and $(1 - \psi) w > cw$. The role of c is to set a bound on the load imbalance resulting from work splitting. Instances of applications which conform to these characteristics are found in depth first search of large unstructured trees used for solving discrete optimization problems [6]. Some parallel algorithms for solving this problem employ the following dynamic load balancing strategy. All work is initially assigned to one processor. An idle processor Pi selects a processor Pa using some selection criterion and sends it a work request. If processor Pa has no work, then it ....
....it a work request. If processor Pa has no work, then it responds with a reject message; else, it partitions its work into two parts and sends one of the pieces to Pi. This process continues until all processors exhaust the available work. Various selection criteria have been proposed in literature [6, 8]. One technique referred to as Global Round Robin (GRR) maintains a global pointer G located at one of the processors. This pointer initially points to the first processor in the ensemble. Each time an idle processor needs to select P, it reads the current value of G, and requests work from P. ....
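The request/reject protocol and the Global Round Robin pointer described in the excerpt above can be sketched as a small sequential simulation. This is only an illustrative sketch under stated assumptions: work is modelled as an integer number of units, a donor hands over half of what it holds, a donor holding a single unit rejects because it cannot split, and the pointer G advances by one after each read (the excerpt is cut off before it says how G is updated). The names `grr_balance` and `work` are hypothetical and do not come from the cited papers.

```python
def grr_balance(work, max_rounds=10_000):
    """Simulate Global Round Robin (GRR) work requests.

    work[i] is the number of work units held by processor i (assumed model).
    In each round every busy processor consumes one unit, and every idle
    processor reads the global pointer G, requests work from that processor,
    and G moves on by one position (assumed update rule).
    Returns (rounds, requests, rejects).
    """
    work = list(work)
    n = len(work)
    g = 0  # global pointer, initially the first processor
    rounds = requests = rejects = 0
    while sum(work) > 0 and rounds < max_rounds:
        rounds += 1
        for i in range(n):
            if work[i] > 0:
                work[i] -= 1              # busy: perform one unit of work
                continue
            target = g                    # idle: read G ...
            g = (g + 1) % n               # ... and advance it
            requests += 1
            if target == i or work[target] < 2:
                rejects += 1              # nothing to split: reject message
            else:
                share = work[target] // 2  # donor splits its work in two
                work[target] -= share
                work[i] += share
    return rounds, requests, rejects
```

For example, `grr_balance([100, 0, 0, 0])` starts with all work on one processor, as in the strategy quoted above, and reports how many rounds, requests, and rejects the simulation needed.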
[Article contains additional citation context not shown here]
Vipin Kumar, Ananth Grama, and V. Nageshwara Rao. Scalable load balancing techniques for parallel computers. Technical Report 91-55, Computer Science Department, University of Minnesota, 1991.
....the location policy seeks out a receiver node to receive the job selected by the selection policy (described below). If the node is an eligible receiver, the location policy looks for an eligible sender node. Load balancing approaches which focus on the use of the location policy are described in [8, 9]. Once a node becomes an eligible sender, a selection policy is used to pick which of the queued jobs is to be transferred to the receiver node. The selection policy uses several criteria to evaluate the queued jobs. Its goal is to select a job that reduces the local load, incurs as little cost ....
V. Kumar, A. Gramma, and V. Rao. Scalable load balancing techniques for parallel computers. Journal of Parallel and Distributed Computing, 22(1):60--79, July 1994.
....to result in significant load imbalance among processors. The core of parallel formulations of DFS algorithms is thus a dynamic load balancing technique that minimizes inter processor communication and processor idling. A number of load distribution techniques have been developed for parallel DFS [25, 41, 22, 17, 21]. Load balancing techniques may be receiver initiated or sender initiated. In receiver initiated techniques, when a processor becomes idle, it requests a selected processor for work. Many different selection policies have been proposed such as use of a centralized server, random polling [12] ....
....techniques may be receiver initiated or sender initiated. In receiver initiated techniques, when a processor becomes idle, it requests work from a selected processor. Many different selection policies have been proposed, such as use of a centralized server, random polling [12], nearest neighbor [25], global round robin [25], and asynchronous round robin [12, 25]. In contrast to receiver initiated load balancing, in sender initiated schemes, a processor sends work to selected processors as it is generated. Selection strategies for sender initiated work transfers include hierarchical techniques ....
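Three of the donor selection policies named in the excerpt above can be sketched as follows. This is a hedged illustration, not code from the cited papers: processors are assumed to be numbered 0 to n-1 with n at least 2, the nearest neighbor policy assumes a ring topology (the excerpt does not name one), and all function and class names are hypothetical.

```python
import random


def random_polling(me, n, rng=random):
    """Pick a donor uniformly at random among the other n - 1 processors."""
    donor = rng.randrange(n - 1)          # value in 0 .. n-2
    return donor if donor < me else donor + 1  # skip our own index


class AsyncRoundRobin:
    """Each processor keeps a private pointer and advances it per request."""

    def __init__(self, me, n):
        self.me, self.n = me, n
        self.target = (me + 1) % n

    def next_donor(self):
        donor = self.target
        self.target = (self.target + 1) % self.n
        if self.target == self.me:        # never point at ourselves
            self.target = (self.target + 1) % self.n
        return donor


def nearest_neighbor_ring(me, n, step):
    """Alternate between the two ring neighbours (assumed ring topology)."""
    return (me + 1) % n if step % 2 == 0 else (me - 1) % n
```

Unlike a global round robin pointer, none of these policies needs shared state: each idle processor decides locally whom to ask.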
[Article contains additional citation context not shown here]
Vipin Kumar, Ananth Grama, and V. Nageshwara Rao. Scalable load balancing techniques for parallel computers. Journal of Parallel and Distributed Computing, 22(1):60--79, July 1994.
....the real time requirements imposed by many interactive applications. Hence, parallelization of GIS is essential in meeting the high performance requirements of several real time applications. A GIS operation can be parallelized either by function partitioning [2, 3, 5, 30] or by data partitioning [4, 8, 13, 17, 19, 25, 32, 33]. Function partitioning uses specialized data structures (e.g., distributed data structures) and algorithms which may be different from their sequential counterparts. Data partitioning techniques divide the data among different processors and independently execute the sequential algorithm on each ....
....be enough to achieve good load balance. In such a case, both static partitioning and DLB techniques can be used. Wang [32] used dynamic allocation of work at different levels (e.g., polygons, edges) for map overlay computation. In addition, several dynamic load balancing methods have been developed [12, 20, 23, 25] for load balancing in different applications. Data partitioning for map overlay [32], spatial join, and access methods [18, 19] is not related to the work presented in this paper. Declustering and dynamic load balancing for extended spatial data have not received adequate attention in the ....
[Article contains additional citation context not shown here]
V. Kumar, A. Grama, and V. N. Rao. Scalable load balancing techniques for parallel computers. Journal of Distributed Computing, 7, March 1994.
....HP GIS applications are at least an order of magnitude larger than these simple maps. Hence, we need more refined approaches like parallel algorithms, which deliver the required performance. Processing of GIS range queries can be parallelized by function partitioning [1, 2, 4] or data partitioning [3, 7, 9, 13, 14, 20, 26, 27]. Function partitioning uses specialized parallel data structures and algorithms which may be different from their sequential counterparts. Data partitioning techniques divide the spatial data (e.g., points, lines, polygons) among different processors, and independently execute the sequential ....
....the sequential algorithm on each processor. In addition, the processors may exchange partial results during the run time. In this paper, we only focus on data partitioning techniques. Spatial data can be partitioned and allocated statically [3, 7, 8, 9, 10, 13, 14, 17, 22, 27] or dynamically [20, 26]. Static partitioning (or declustering) and load balancing methods divide and allocate the data prior to the computation process. In contrast, dynamic load balancing techniques divide and/or allocate work at run time. It has been shown that customized declustering techniques based on space ....
[Article contains additional citation context not shown here]
V. Kumar, A. Grama, and V. N. Rao. Scalable load balancing techniques for parallel computers. Journal of Distributed Computing, 7, March 1994.
....problem involving higher dimensional Cspace. Fortunately a great deal of work has been done in developing parallel search algorithms capable of solving similar problems [10, 9]. Many of the algorithms developed deliver linear speedup with increasing problem and processor size on various problems [2, 8]. It would seem that parallel motion planning methods which use such parallel search schemes should be able to deliver such performance as well. This is due to the following observation. Amdahl's law states that if s is the serial fraction of an algorithm then, no matter how many processors are ....
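The excerpt above is cut off mid-sentence, but the standard statement of Amdahl's law can be illustrated with a short calculation. The sketch below uses the usual textbook form, speedup = 1 / (s + (1 - s) / p), which is bounded above by 1 / s for any processor count p; the function name `amdahl_speedup` is hypothetical and is not taken from the citing paper.

```python
def amdahl_speedup(s, p):
    """Speedup predicted by Amdahl's law.

    s: serial fraction of the algorithm, 0 <= s <= 1
    p: number of processors, p >= 1
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("serial fraction must lie in [0, 1]")
    if p < 1:
        raise ValueError("processor count must be at least 1")
    return 1.0 / (s + (1.0 - s) / p)


if __name__ == "__main__":
    # With a 10% serial fraction the speedup never reaches 1 / 0.1 = 10,
    # however many processors are added.
    for p in (1, 10, 100, 1000):
        print(p, round(amdahl_speedup(0.1, p), 3))
```

The smaller the serial fraction s, the higher this ceiling, which is why search algorithms with a negligible serial part can show close to linear speedup.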
V. Kumar, G. Ananth, and V. Rao. Scalable load balancing techniques for parallel computers. The Journal of Parallel and Distributed Computing, (to appear), 1993.
....Search overhead is caused because sequential and parallel DFS search the nodes in a different order. Communication overhead is dependent upon the target architecture and the load balancing technique. The communication overhead in parallel DFS was analyzed in our previously published papers [17, 16, 4, 14], and was experimentally validated on a variety of problems and architectures. In this paper, we only analyze search overhead. However, in the experiments, which were run only on real multiprocessors, both overheads are incurred. Hence the overall speedup observed in experiments may be less ....
....avoided under these nodes. This method is also referred to as ordered DFS [13]. 3 Parallel DFS There are many different parallel formulations of DFS [7, 15, 19, 34, 2, 8, 23] that are suitable for execution on asynchronous MIMD multiprocessors. The formulation discussed here is used quite commonly [22, 23, 2, 4, 14]. In this formulation, each processor searches a disjoint part of the search space. Whenever a processor completely searches its assigned part, it requests work from a busy processor. The busy processor splits its remaining search space into two pieces and gives one piece to the requesting ....
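The formulation quoted above (disjoint parts of the search space, with a busy processor splitting its remaining work for an idle one) can be sketched as a round-based sequential simulation. The details here are assumptions made for illustration only: each processor keeps a stack of unexplored nodes, a donor hands over every second stack entry, and an idle processor asks whichever processor currently has the longest stack. The names `split_stack`, `parallel_dfs`, and `children` are hypothetical.

```python
def split_stack(stack):
    """Split a DFS stack of unexplored nodes into (kept, given) pieces.

    Handing over alternate entries is one possible splitting rule,
    chosen here purely for illustration.
    """
    return stack[0::2], stack[1::2]


def parallel_dfs(root, children, n_procs):
    """Simulate parallel DFS; return the nodes expanded by each processor.

    children(node) must return a list of the node's child nodes.
    """
    stacks = [[root]] + [[] for _ in range(n_procs - 1)]
    expanded = [0] * n_procs
    while any(stacks):
        for p in range(n_procs):
            if stacks[p]:
                node = stacks[p].pop()            # expand the deepest node
                expanded[p] += 1
                stacks[p].extend(children(node))
            else:
                # Idle: request work from the processor with the most work.
                donor = max(range(n_procs), key=lambda q: len(stacks[q]))
                if len(stacks[donor]) >= 2:       # otherwise: request rejected
                    stacks[donor], stacks[p] = split_stack(stacks[donor])
    return expanded
```

Because every node is pushed once and popped once, the per-processor counts always sum to the size of the tree, whatever splitting rule is used; only the balance between processors changes.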
Vipin Kumar, Ananth Grama, and V. Nageshwara Rao. Scalable load balancing techniques for parallel computers. Technical report, Tech Report 91-55, Computer Science Department, University of Minnesota, 1991.
....the real time requirements imposed by many interactive applications. Hence, parallelization of GIS is essential in meeting the high performance requirements of several real time applications. A GIS operation can be parallelized either by function partitioning [2, 3, 5, 30] or by data partitioning [4, 8, 13, 17, 19, 25, 32, 33]. Function partitioning uses specialized data structures (e.g., distributed data structures) and algorithms which may be different from their sequential counterparts. Data partitioning techniques divide the data among different processors and independently execute the sequential algorithm on each ....
....might not be enough to achieve a good load balance. In such a case, both static partitioning and DLB techniques can be used. Wang [32] used the dynamic allocation of work at different levels (e.g., polygons, edges) for map overlay computation. In addition, several dynamic methods have been developed [12, 20, 23, 25] for load balancing in different applications. Data partitioning for map overlay [32], spatial join, and access methods [18, 19] is not related to the work presented in this paper. Declustering and dynamic load balancing for extended spatial data have not received adequate attention in the ....
[Article contains additional citation context not shown here]
V. Kumar, A. Grama, and V. N. Rao. Scalable load balancing techniques for parallel computers. Journal of Distributed Computing, 7, March 1994.
.... conclusions [10, 11, 18, 20]. The isoefficiency metric has been found to be quite useful in characterizing scalability of a number of algorithms [9, 21, 32, 38, 41, 42]. In particular, it has helped determine optimal load balancing schemes for tree search for a variety of MIMD architectures [20, 8, 17]. In this paper, we present new methods for load balancing of unstructured tree computations on large scale SIMD machines, and analyze the scalability of these and pre-existing schemes. An efficient formulation of tree search on a SIMD machine comprises two major components: i) a triggering ....
....assumption for the splitting mechanism: if work w at one processor is split into two parts $\psi w$ and $(1 - \psi) w$, then $1 - \alpha > \psi > \alpha$, where $\alpha$ is an arbitrarily small constant. We call this splitting mechanism the alpha splitting mechanism. As demonstrated by experiments on MIMD machines [25, 1, 8, 17, 23], it is possible to find alpha splitting mechanisms for most tree search problems. The total number of nodes expanded in parallel search can often be higher or lower than the number of nodes expanded by serial search [33, 30, 23], leading to speedup anomalies. Here we study the performance of these ....
[Article contains additional citation context not shown here]
Vipin Kumar, Ananth Grama, and V. Nageshwara Rao. Scalable Load Balancing Techniques for Parallel Computers. Technical report, Tech Report 91-55, Computer Science Department, University of Minnesota, 1991.
No context found.
V. Kumar, A.Y. Grama, and N. R. Vempaty, "Scalable load balancing techniques for parallel computers," Journal of Parallel and Distributed Computing, vol. 22, pp. 60--79, 1994.
No context found.
V. Kumar, A. Y. Grama, and N. R. Vempaty. Scalable load balancing techniques for parallel computers. Journal of Parallel and Distributed Computing, 22(1):60--79, 1994.
No context found.
Kumar, V., Grama, A., and Rao, V., 1994, Scalable Load Balancing Techniques for Parallel Computers. Journal of Parallel and Distributed Computing, 22: 60-79.
No context found.
Kumar, V., Ananth, G. Y. & Rao, V. N. Scalable load balancing techniques for parallel computers. Preprint 92-021. Army High Performance Computing Research Center, Minneapolis, MN, 1992.