A. C. Dusseau, et al., "Effective Distributed Scheduling of Parallel Workloads", ACM Sigmetrics 96, Philadelphia, May 1996, pp. 25-36.
....migrations makes the migration too complex and fragile, and as a result, very few projects have made it into commercial products. VCPU migration in our Virtual Clusters approach is much simpler because it does not require any support from the operating system. Most of the previous CPU managers [20, 22, 65] have been designed for message passing based distributed systems, not shared memory multiprocessors. The increased level of sharing in shared memory machines drastically reduces the cost of migrating tasks and maintaining load information, making these other policies suboptimal for such systems. ....
Andrea C. Dusseau, Remzi H. Arpaci, and David E. Culler. Effective distributed scheduling of parallel workloads. In Proceedings of 1996.
.... Though such algorithms will work in some multiprogrammed environments, in particular those that employ static space partitioning [15, 30] or coscheduling [18, 30, 33] they do not work in the multiprogrammed environments being supported by modern shared memory multiprocessors and operating systems [9, 15, 17, 23]. The problem lies in the assumption that a fixed collection of processors are fully available to perform a given computation. This research is supported in part by the Defense Advanced Research Projects Agency (DARPA) under Grant F30602 97 10150 from the U.S. Air Force Research Laboratory. In ....
....are scheduled simultaneously, thereby giving the computation the illusion of running on a dedicated machine. Interestingly, it has recently been shown that in networks of workstations coscheduling can be achieved with little or no modification to existing multiprocessor operating systems [17, 35]. Unfortunately, for some job mixes, coscheduling is not appropriate. For example, a job mix consisting of one parallel computation and one serial computation cannot be coscheduled efficiently. With process control [36] processors are dynamically partitioned among the running computations so that ....
[Article contains additional citation context not shown here]
Andrea C. Dusseau, Remzi H. Arpaci, and David E. Culler. Effective distributed scheduling of parallel workloads. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 25--36, Philadelphia, Pennsylvania, May 1996.
....things like program tracing, error reporting or even letting user dynamically interfere with the system scheduler. C 3.1.2. Scheduler The scheduler module is aimed at management of the set of interpretive frames generated by the frame processor, determination of the execution modes of them [24], and coordination of the execution sequence. As mentioned earlier, a frame may be executed in one of the three distinct ways: locally interpreted by interpretive module, locally executed by an invocation to a local runtime function, or sent to some registered remote system, which has specific ....
Andrea C. Dusseau, Remzi H. Arpaci, David E. Culler, Effective Distributed Scheduling of Parallel Workloads, Sigmetrics'96, May 1996, Philadelphia, PA
....prediction Resource supply prediction, i.e. load prediction, is found in some form or other in every adaptive system. Typically, prediction is implicit in these systems: offered load or other system indicators are used as control signals that drive load balancing [58], scheduling [27], process migration [28, 45] or resource allocation [89]. A few systems use explicit resource prediction. The Running Time Advisor [25] predicts the running time of a job on any host in a distributed system, given its nominal execution time: i.e. it predicts resource supply and performance, but ....
Andrea C. Dusseau, Remzi H. Arpaci, and David E. Culler. Effective distributed scheduling of parallel workloads. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (ACM SIGMETRICS '96), pages 25--36, Philadelphia, PA, May 1996.
....a nonscheduled process for synchronization communication and will minimize the waiting time at the synchronization points. However, the nondedicated feature of workstations and the low communication bandwidth make the implementation of complete coscheduling on NOWs expensive and not realistic [7]. Using quantified and deterministic system information such as a power weight, and the power preservation in each workstation, we address the three NOW scheduling issues by designing a scheduling scheme called self coordinated local scheduling. This scheme coordinates parallel processes ....
....Predicative coscheduling uses the recent history of communications among processes to predict coming communication activities of each process. When a process is scheduled on one node, an attempt is made to schedule its correspondents on other nodes for simultaneous execution. Dusseau et al. [7] address the scheduling issue from another perspective. In their study, they find that local scheduling is a feasible alternative to coscheduling for parallel applications with barrier synchronization. They propose a blocking algorithm, called the two phased fixed spin policy to avoid the ....
Dusseau, A. C., Arpaci, R., and Culler, D. Effective distributed scheduling of parallel workloads. Proceedings of the 1996 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. May 1996, pp. 25--36.
....network load prediction to be addressed. Moreover, the higher overhead due to non local job scheduling is more suitable for coarser grain guest jobs than the ones addressed in this study. Finally, scheduling parallel computations on batch parallel systems has attracted considerable attention [3, 6, 13, 10]. The usual metric to be optimized here is global batch throughput. However, Subhlok et al. [13] propose strategies to minimize response time for individual applications. They take both communication load and computation load into account and select a pool of workstations and communication links ....
A. C. Dusseau, R. H. Arpaci, and D. E. Culler. Effective distributed scheduling of parallel workloads. In Proc. 1996 ACM SIGMETRICS Intl. Conf. on Measurement & Modeling of Computer Systems, Philadelphia, PA, 1996.
.... one amortize the cost of establishing trust across many surrogates in a neighborhood How is load balancing on surrogates done Is surrogate allocation to be done based on an admission control or best effort approach How relevant is previous work on load balancing on networks of workstations [2] What are the implications for scalability How dense does the fixed infrastructure have to be to avoid overloads during periods of peak demand How does one discover the presence of surrogates Of the many proposed service discovery mechanisms such as JINI [10] UPnP [6] and BlueTooth ....
A. C. Dusseau, R. H. Arpaci, and D. E. Culler. Effective Distributed Scheduling of Parallel Workloads. In Proceedings of 1996.
....to support adaptive applications, has a surprisingly short history. [Figure 9. RTSA scheduling performance versus slack factor: (a) fraction of deadlines met, (b) fraction of deadlines met when predicted.] The parallel computing community has studied application level load balancing for some time [23, 25, 10], but this work has treated prediction only implicitly. The operating systems community has studied existing workloads [20, 11, 16, 13] to support distributed load sharing, and developed innovative system level scheduling policies based on queueing theoretic models [13]. In contrast to these two ....
Dusseau, A. C., R. H. Arpaci, and D. E. Culler: 1996, `Effective Distributed Scheduling of Parallel Workloads'. In: Proceedings of the 1996 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. pp. 25--36.
....is typically even higher. 8. Related work Work on the explicit prediction of the dynamic behavior of distributed systems, particularly to support adaptive applications, has a surprisingly short history. The parallel computing community has studied application level load balancing for some time [20, 22, 10], but this work has treated prediction only implicitly. The operating systems community has studied existing workloads [18, 11, 15, 13] to support distributed load sharing, and developed innovative system level scheduling policies based on queueing theoretic models [13]. In contrast to these two ....
A. C. Dusseau, R. H. Arpaci, and D. E. Culler. Effective distributed scheduling of parallel workloads. In Proceedings of the 1996.
....processes that communicate frequently are identified, and it is assured that the corresponding threads are all activated at the same time. Similar schemes in which co scheduling is triggered by communication events were described by Sobalvarro and Weihl [83] and by Dusseau, Arpaci, and Culler [15]. Taking system load and minimum and maximum parallelism of each job into account as well, still higher throughputs can be sustained [77]. Chiang et al. [8] show that use of knowledge of some job characteristics plus permission to use a single preemption per job allows run to completion policies ....
A. Dusseau, R. H. Arpaci, and D. E. Culler, "Effective distributed scheduling of parallel workloads". In SIGMETRICS Conf. Measurement & Modeling of Comput. Syst., pp. 25--36, May 1996.
....scheduling is not a very attractive scalable option for off the shelf clusters, since it requires periodic synchronization across the nodes to coordinate the effort. Longer time quanta to offset this cost can decrease the responsiveness of the system. As a result, there have been recent efforts [5, 19, 15] to design a new class of scheduling mechanisms broadly referred to as dynamic coscheduling which approximate coscheduled execution without explicitly synchronizing the nodes. These techniques use local events (e.g. message arrival) to estimate what is happening at remote nodes, and ....
....difficult to perform an extensive performance evaluation across the entire spectrum. As a result, it is difficult to say which approaches perform best, and under what conditions, and thus many important design and performance questions remain open. Previous studies have used a set of static jobs [19, 5, 15] or a very specific dynamic arrival pattern [24] on a small scale parallel system. While one could use experimentation [15] and simulation [1, 24] to study small scale systems, the suitability of these mechanisms for large scale systems has not been explored, which is one of the motivating factors ....
[Article contains additional citation context not shown here]
A.C. Dusseau, R.H. Arpaci, D.E. Culler. Effective distributed scheduling of parallel workloads. Proceedings of ACM SIGMETRICS Conference on Measurement & Modeling of Computer Systems, 25--36, 1996.
....related to local task control, as performed by the NLS. Depending on the level of inter NLS synchronization, different scheduling variations are possible. Explicit gang scheduling systems always run all the tasks of a job simultaneously [5 10, 17, 21] In contrast, communication driven systems [3, 19, 20] are more loosely coupled, and schedule tasks based on message arrival. Independent of the inter task scheduling approach, we address those cases in which all processes within a task are closely coupled, like the csh example of the first paragraph. In these situations, all processes of a task must ....
A. C. Dusseau, R. H. Arpaci, and D. E. Culler. Effective Distributed Scheduling of Parallel Workloads. In ACM SIGMETRICS'96 Conference on the Measurement and Modeling of Computer Systems, 1996.
....CVM problem. In Section 4, we then fully investigated implications from our results and the technique we used to reduce message delay. Finally, we presented our conclusion for similar polling structures and execution environments in Section 5. Previous studies of the message notification delay [2, 3] reduce its effect by applying special algorithms to re schedule and synchronize incoming messages with polling processes. Many target message passing parallel programs, in which synchronization delay is highly important. Significant improvements can be achieved with these approaches, but they ....
A.C. Dusseau, R.H. Arpaci, and D.E. Culler. Effective Distributed Scheduling of Parallel Workloads. In Sigmetrics'96 Conference on the Measurement and Modeling of Computer Systems. 1996.
....each other and then some processes have to be blocked when communicating or synchronising with non scheduled processes on other processors. This effect can lead to a great degradation in overall system performance [4, 6, 9, 11, 13]. One method to alleviate this problem is to use two phase blocking [8, 22], which is also called implicit coscheduling in [8]. In this method a process waiting for communication spins for some time in the hope that the process to be communicated with on the other processor is also scheduled, and then blocks if a response has not been received. The reported experimental ....
....when communicating or synchronising with non scheduled processes on other processors. This effect can lead to a great degradation in overall system performance [4, 6, 9, 11, 13]. One method to alleviate this problem is to use two phase blocking [8, 22], which is also called implicit coscheduling in [8]. In this method a process waiting for communication spins for some time in the hope that the process to be communicated with on the other processor is also scheduled, and then blocks if a response has not been received. The reported experimental results show that for parallel workloads this ....
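The two phase blocking described in the excerpt above (spin for a while, then block if nothing has arrived) can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm from the cited paper: the function name `two_phase_wait`, the use of a Python `threading.Event` to stand in for message arrival, and the fixed spin time passed by the caller are all hypothetical choices made for the example.

```python
import threading
import time


def two_phase_wait(arrived, spin_seconds, block_timeout=None):
    """Wait for `arrived` (a threading.Event standing in for a message).

    Phase 1: busy-wait (spin) for at most `spin_seconds`, hoping the
    sender is currently scheduled and will respond quickly.
    Phase 2: give up the processor and block until the event is set
    (or until `block_timeout` seconds pass, if a timeout is given).

    Returns "spin" if the event was seen while spinning, "block" if it
    was seen while blocked, and "timeout" if it never arrived.
    """
    deadline = time.monotonic() + spin_seconds
    while time.monotonic() < deadline:
        if arrived.is_set():
            return "spin"
    # Spin budget exhausted: block so another process can use the CPU.
    if arrived.wait(block_timeout):
        return "block"
    return "timeout"
```

A caller would pick `spin_seconds` relative to the cost of a context switch; the excerpts above do not give a value, so none is assumed here.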
A. C. Dusseau, R. H. Arpaci and D. E. Culler, Effective distributed scheduling of parallel workloads, Proceedings of ACM SIGMETRICS'96 International Conference, 1996.
.... to develop different techniques in an attempt to adapt the traditional uniprocessor time shared scheduler to the new situation of mixing local and parallel workloads [1, 10]. Basically, there are two methods of making use of these CPU idle cycles, task migration [6, 7] and time slicing scheduling [8, 9]. In a NOW, in accordance with the research carried out by Arpaci [10], task migration overheads and the unpredictable behavior of local users may lower the effectiveness of this method. In a time slicing environment, two issues must be addressed: how to coordinate the simultaneous execution of ....
A.C. Dusseau, R.H. Arpaci and D.E. Culler. "Effective Distributed Scheduling of Parallel Workloads". ACM SIGMETRICS'96, 1996.
....of the application to determine how the allocated partition will be used and whether multiple threads will be interleaved on individual processors or not. The program developer makes the appropriate decision according to the computation, synchronization and communication needs of the problem [7, 20]. The .... [Table 1: The application characteristics and load parameters used by the scheduling .... Application characteristics: maximum parallelism p max; execution time on one processor T (1); execution time on p 0 processors T (p). Load parameters: number of waiting jobs; number of running jobs.]
Andrea C. Dusseau, Remzi H. Arpaci, and David E. Culler. Effective Distributed Scheduling of Parallel Workloads. In ACM SIGMETRICS Conf. Measurement and Modeling of Computer Systems, May 1996. To appear.
....avenue for achieving coordinated parallel scheduling in an environment that coexists with autonomous node schedulers. 1 Introduction Coordinated scheduling of parallel jobs across the nodes of a multiprocessor is well known to produce benefits in both system and individual job efficiency [12, 18, 4, 5, 17, 3]. Without coordinated scheduling, the processes constituting a parallel job suffer high communication latencies because of processor thrashing [12]. While multiprocessor systems typically address these problems with a mix of batch, gang, and timesharing scheduling (based on kernel scheduler ....
....scheduler with spinning and spin block synchronization. These results indicate that dynamic coscheduling, spin block, and the combination of dynamic coscheduling with spin block synchronization can effectively achieve coscheduling. The effectiveness of spin block has been previously documented in [3] (where it was called implicit scheduling) and our measurements confirm their results. In addition, our work demonstrates that DCS achieves coscheduling with both spinning and spin block synchronization, where implicit scheduling requires processes to block awaiting message arrivals for ....
[Article contains additional citation context not shown here]
Andrea C. Dusseau, Remzi H. Arpaci, and David E. Culler. Effective distributed scheduling of parallel workloads. In ACM SIGMETRICS '96 Conference on the Measurement and Modeling of Computer Systems, 1996. Available from http://www.cs.berkeley.edu/~dusseau/Papers/sigmetrics96.ps.
....Msgs Msgs Proc Quantum Enum 50 Non blocking 254 Water 10 Blocking 12 LU 10 Blocking 3 Barnes 50 Blocking 28 Table 2. Characteristics of the real applications ments with more exhaustive sets of parameter values to quantify some of these effects, but such studies have been done before [6, 9, 18], and here we are more interested in the qualitative difference in behavior at extreme ends of the application spectrum. 4.2 Real Applications The takeaway giveaway experiments are applied to four real applications as well. One, Enum, finds the total number of solutions to the triangle puzzle (a ....
A. Dusseau, R. Arpaci, and D. Culler. Effective Distributed Scheduling of Parallel Workloads. In Proceedings of ACM SIGMETRICS
....row a quantum, and then repeat. Each pattern in the figure is a different job, white space is packing loss. New jobs form new rows if they don't fit into existing ones, and departing jobs will cause rows to be collapsed together if possible. implement forms of time slicing in NOW environments [DAC96, ADCM98, SPWC98]. Time slicing techniques can be distinguished by whether they use a global queue for the entire system, or local queues on each processor. Local queues are more common in distributed memory machines, because of the communication overhead involved in maintaining a shared global ....
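The row packing described in the excerpt above (a new job joins an existing row if it fits, otherwise it opens a new row, and unused slots count as packing loss) can be sketched as follows. This is a rough illustration only: the first-fit placement rule, the list-of-lists matrix layout, and the names `place_job` and `packing_loss` are assumptions made for the example, and the collapsing of rows when jobs depart is not modeled.

```python
def place_job(rows, job_id, size, num_procs):
    """Place a job needing `size` processors into a gang-scheduling matrix.

    `rows` is a list of rows; each row is a list of length `num_procs`
    whose entries are a job id or None (an idle slot). First fit is an
    assumed policy: the job goes into the first row with enough idle
    slots, and a new row is appended if no existing row can hold it.
    Returns the index of the row that received the job.
    """
    if size < 1 or size > num_procs:
        raise ValueError("job size must be between 1 and num_procs")
    for index, row in enumerate(rows):
        free = [p for p, owner in enumerate(row) if owner is None]
        if len(free) >= size:
            for p in free[:size]:
                row[p] = job_id
            return index
    new_row = [None] * num_procs
    for p in range(size):
        new_row[p] = job_id
    rows.append(new_row)
    return len(rows) - 1


def packing_loss(rows):
    """Fraction of matrix slots left idle (the 'white space')."""
    total = sum(len(row) for row in rows)
    if total == 0:
        return 0.0
    idle = sum(1 for row in rows for owner in row if owner is None)
    return idle / total


if __name__ == "__main__":
    matrix = []
    for name, size in [("A", 3), ("B", 2), ("C", 1)]:
        place_job(matrix, name, size, num_procs=4)
    for row in matrix:
        print(row)
    print("packing loss:", packing_loss(matrix))
```

Each row would then run for one time quantum in turn, which is the repeat-per-row behavior the excerpt refers to.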
....have a much lower cost per processor than large parallel computers. For this reason, they have been touted as the parallel computing platform of the future [ACP95]. There is certainly evidence that parallel computing is feasible on this type of platform [CADAD 97]. Much research [ADCM98, DAC96, ADC97, WZ97, KL97, LS93, ADV 95, SPWC98, DZ97, AS97, AFKT98] has been devoted to determining how to schedule parallel applications effectively in a NOW environment. In theory a NOW could be as powerful as a large, tightly coupled parallel machine, but there are some problems with the ....
[Article contains additional citation context not shown here]
Andrea C. Dusseau, Remzi H. Arpaci, and David E. Culler. Effective Distributed Scheduling of Parallel Workloads. In Proceedings of
.... Feitelson proposed the name lazy timesharing for this kind of preemption [3]. There is consensus in the scheduling research community that gang scheduling is a necessary feature for timesharing parallel applications with significant communication (although Dusseau et al. argue to the contrary [2]). Gang scheduling is available with PScheD, but not under the vanilla UNICOS mk scheduler. However, the timesharing strategy used by PScheD is very different from lazy timesharing; it is based on fixed quantum lengths that are shorter than lazy timesharing calls for on the order of seconds, ....
Andrea C. Dusseau, Remzi H. Arpaci, and David E. Culler. Effective distributed scheduling of parallel workloads. In Proceedings of the ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, pages 25--36, May 1996.
....the concept of wide area distributed computing. iv Chapter 1 Introduction The ever growing need for more computing power and the advent of distributed computing has caused a shift from mainframes, supercomputers and individual computing resources to networks of shared computing resources [1, 2, 6, 11, 13]. Taking these developments one step further, we can envision the formation of a world wide metacomputer offering vast amounts of heterogeneous computing resources to users all over the world in a universal way. Examples of attempts to build such a metacomputer are the worldwide flock of Condors ....
A. C. Dusseau, R. H. Arpaci, D. E. Culler, "Effective Distributed Scheduling of Parallel Workloads," Proceedings of the ACM Sigmetrics 1996, pp 25-36
....by context switching, since it takes a long time to wait for messages by busy wait. There is a communication method that deals with both fine grain and coarse grain parallel processes appropriately, with which context switching is conducted after waiting for messages several times by busy wait [3]. The communication library of AP Linux is equipped with this method, and its effect has already been confirmed [4] Even if this method is adopted, however, it is still important in the case of fine grain parallel processes to conduct co scheduling for using the busy wait effectively. A method ....
....0.4 ms for 8 processes. Although they are added to parallel processing as overhead every time, they are considered negligible because they account for 0.4 of 200 ms. This is also considered negligible when compared with the time lag of synchronization. 5 Related Research Implicit co scheduling [3] and dynamic co scheduling [9, 10] resemble our moderate co scheduling. Implicit co scheduling and dynamic co scheduling are demand based co scheduling, that is, they rely on a local scheduler to run the parallel process when it receives a message. They suppose that the scheduler quickly ....
A.C. Dusseau, R.H. Arpaci, and D.E. Culler. Effective Distributed Scheduling of Parallel Workloads. SIGMETRICS'96, 1996.
....can reserve and use in preference to interactive jobs. This strategy is almost opposite of our approach, which promotes interactive jobs. Prior studies that investigated running parallel programs on shared workstation clusters also employed fairly conservative eviction policies. Dusseau, et al. [7] used a policy based on immediate eviction. They were able to use a cluster of 60 machines to achieve the performance of a dedicated parallel computer with 32 processors. Acha et al. [1] used a different approach that reconfigured the parallel job to use fewer nodes when one became unavailable. ....
A. C. Dusseau, R. H. Arpaci, and D. E. Culler, "Effective distributed scheduling of parallel workloads," SIGMETRICS. May 1996, Philadelphia, PA, pp. 25-36.
....as described in Appendix A will be less expensive than the execution of instruction sequences to record accesses to shared data structures. 1 A description of dynamic coscheduling was first published in [21]. 2.5 Implicit Scheduling. Implicit scheduling is the name given by Dusseau et al. [6] to their algorithms for adaptively modifying the spin times in spin block message receipt to achieve good performance on bulk synchronous applications (those which perform regular barriers, possibly with other communication taking place in between barriers). This work has some bearing on ....
....communication taking place in between barriers). This work has some bearing on our own, because of the similarity of problems and experimental platforms, and so we will treat it here at some length and revisit it in sections describing our experimental results. The experiments described in [6] were performed using the Solaris 2.4 scheduler code in a simulation of 32 workstations running 3 parallel jobs, each having one process residing on each of the 32 workstations. The workloads were SPMD, consisting of loops of four phases each: in the first phase, a variable amount of computation ....
[Article contains additional citation context not shown here]
Andrea C. Dusseau, Remzi H. Arpaci, and David E. Culler. Effective distributed scheduling of parallel workloads. In ACM SIGMETRICS '96 Conference on the Measurement and Modeling of Computer Systems, 1996. Available from http://www.cs.berkeley.edu/~dusseau/Papers/sigmetrics96.ps.
....approach to time sharing a NOW is to leave to each workstation the task of scheduling its processes (both sequential and parallel) independently from the other machines in the NOW. Unfortunately, this form of uncoordinated scheduling severely hurts parallel applications performance [1, 6]: satisfactory performance can be achieved only if communicating processes are simultaneously scheduled (coscheduled [14]) on the respective workstations. For instance, the completion time of a barrier synchronization is greatly reduced if participating processes arrive simultaneously at the ....
....of a dedicated cluster, coscheduling ensures that such a coordination is achieved, while uncoordinated scheduling cannot guarantee a good coordination among processes. A class of coscheduling strategies (that we call implicitly controlled or implicit for brevity) that have been recently proposed [6, 7, 13, 15] rely on local schedulers (to properly handle interactive and I O bound jobs mixes) and use the communication behavior of parallel processes to make scheduling decisions aimed at achieving a satisfactory approximation of coscheduling (of course, being the scheduling decision based on local ....
[Article contains additional citation context not shown here]
A. Dusseau, R. Arpaci, and D. Culler. Effective Distributed Scheduling of Parallel Workloads. In Proceedings of ACM SIGMETRICS'96, 1996.
....in such a system. The performance of the various scheduling disciplines for shared clusters depends on the characteristics of the workstation cluster such as time of day, day of week, and schedule of the primary users. We use the traces of the utilization patterns of existing workstations. Dusseau[4] and Acha[1] have also used this approach. Also, to evaluate our scheduling policy, we need data about individual requests for processors, at the granularity of scheduler dispatch records because of the fine grained interaction between foreground and background processes. It is not practical to ....
....The upper right graph shows the variance in the run burst. The bottom two graphs in Figure 3 show the idle duration mean and variance, respectively. 3.2 Coarse Grain Workload Analysis To generate the long term variations in processor utilization, we use the traces collected by Arpaci et al. [4]. These traces cover data from 132 machines measured over 40 days, and contain samples every two seconds of: CPU usage, memory usage, keyboard activity, and a Boolean indicating idle non idle state. An idle interval is a period of time with the CPU less than 10 used and no keyboard action for 1 ....
[Article contains additional citation context not shown here]
A. C. Dusseau, R. H. Arpaci, and D. E. Culler, "Effective distributed scheduling of parallel workloads," SIGMETRICS. May 1996, Philadelphia, PA, pp. 25-36.
....11 a network of workstations for a mixture of parallel and sequential jobs. They have measured contention effects in order to show the necessity of mechanisms like co scheduling and migration of parallel jobs to idle workstations, but they do not predict contention effects. In fact, both [16] and [17] propose co scheduling strategies (or an approximation) for parallel applications executing in network of workstations. In [26] Harchol Balter and Downey show that preemptive migration, in which running processes may be suspended, moved to a remote host, and restarted, can minimize the effect ....
....the same priority as one another and to be scheduled locally in round robin fashion on the timeshared systems. In practice, most operating systems executing on workstations employ a priority based scheduling strategy that reduces to a round robin policy when the executing applications are CPU bound [17]. Since we assume the applications to be coarse grain, considering a round robin local scheduler for the time shared systems is a reasonable assumption. Assumptions about the Usage Our calculation of slowdown factors currently requires information about all the applications executing on the ....
A. C. Dusseau, R. H. Arpaci, and D. E. Culler, "Effective Distributed Scheduling of Parallel Workloads," in Proceedings of ACM SIGMETRICS '96, pp. 25-36, May 1996.
.... approach is attractive due to its ease of construction, the performance of fine grain communicating jobs is severely impacted because scheduling is not coordinated across processors [10]. An intermediate approach developed at UC Berkeley and MIT in recent years is implicit or dynamic coscheduling [1, 5, 15, 20]. With implicit coscheduling, each local scheduler makes independent decisions that dynamically coordinate the scheduling actions of cooperating processes across processors. These actions are based on local events that occur naturally within communicating applications. For example, on message ....
....of a representative subset of MPI 2 on a detailed (register level) simulation model [18] The simulation environment includes a standard version of MPI 2 and a multitasking one that implements the main features of our proposed methodology. 4. 1 Characteristics of the Synthetic Workloads As in [5], the workloads used consist of a collection of single program multiple data (SPMD) parallel jobs that alternate phases of purely local computation with phases of interprocess communication. A parallel job consists of a group of P processes where each process is mapped onto a processor throughout ....
A. C. Dusseau, R. H. Arpaci, and D. E. Culler. Effective Distributed Scheduling of Parallel Workloads. In Proceedings of the 1996 ACM Sigmetrics International Conference on Measurement and Modeling of Computer Systems, Philadelphia, PA, May 1996.
....and load balancing functions of the front end are contained within the smartclient. It monitors availability and load on servers to choose resources for remote execution. Servers can also push load information to well established clients. For parallel applications, we utilize implicit coscheduling [32], rather than explicit gang scheduling, to achieve an efficient context, so the schedulers on the servers do not need to know in advance that they are being asked to run part of a parallel application. The smart client can determine the envelope of the application, including binaries and libraries ....
A. C. Dusseau, R. H. Arpaci, and D. E. Culler. Effective Distributed Scheduling of Parallel Workloads. In Proceedings of 1996 ACM Sigmetrics International Conference on Measurement and Modeling of Computer Systems, pages 25--36, May 1996.
.... performance of fine grained communication jobs can be orders of magnitude worse than with explicit coscheduling because the scheduling is not coordinated across processors [8]. An intermediate approach initially developed at UC Berkeley and MIT in recent years is implicit or dynamic coscheduling [1, 4, 14]. With implicit coscheduling, each local scheduler makes independent decisions that dynamically coordinate the scheduling actions of cooperating processes across processors. These actions are based on local events that occur naturally within communicating applications. For example, on message ....
....version of MPI 2 and a multitasking version which implements the main features of buffered coscheduling. Because the design space of our problem is too large to explore exhaustively, we fix the workload and system characteristics and vary the computational granularity and load imbalance. As in [4], our workloads consist of a collection of single program multiple data (SPMD) parallel jobs that alternate phases of purely local computation with phases of interprocess communication. A parallel job generated by one such program consists of a group of P processes where each process is mapped on ....
Andrea C. Dusseau, Remzi H. Arpaci, and David E. Culler. Effective Distributed Scheduling of Parallel Workloads. In Proceedings of the 1996 ACM Sigmetrics International Conference on Measurement and Modeling of Computer Systems, Philadelphia, PA, May 1996.
....its processes. Although attractive due to its ease of construction, the performance of fine-grain communicating jobs degrades significantly because scheduling is not coordinated across processors [7]. An intermediate approach developed at UC Berkeley and MIT is implicit or dynamic coscheduling [1, 4, 12, 16] where each local scheduler makes decisions that dynamically coordinate the scheduling actions of cooperating processes across processors. These actions are based on local events that occur naturally within communicating applications. For example, on message arrival, a processor speculatively ....
....the main features of our proposed methodology. It is worth noting that the multitasking MPI 2 version is actually much simpler than the sequential one because the buffering of the communication primitives greatly simplifies run time support. 3.1 Characteristics of the Synthetic Workloads As in [4], the workloads used consist of a collection of single program multiple data (SPMD) parallel jobs that alternate phases of purely local computation with phases of interprocess communication. A parallel job consists of a group of P processes where each process is mapped onto a processor throughout ....
A. C. Dusseau, R. H. Arpaci, and D. E. Culler. Effective Distributed Scheduling of Parallel Workloads. In Proceedings of the 1996 ACM Sigmetrics International Conference on Measurement and Modeling of Computer Systems, Philadelphia, PA, May 1996.
....address the problem of coscheduling. There is little point in communicating extremely rapidly to a remote process that must be scheduled before it can respond. Coscheduling refers to techniques that seek to schedule simultaneously the processes constituting a computation on different processors [23, 63]. In certain highly integrated parallel computers, coscheduling is achieved by using a batch scheduler: processors are space shared, so that only one computation uses a processor at a time. Alternatively, the schedulers on the different systems can communicate, or the application itself can ....
Andrea C. Dusseau, Remzi H. Arpaci, and David E. Culler. Effective distributed scheduling of parallel workloads. In ACM SIGMETRICS '96 Conference on the Measurement and Modeling of Computer Systems, 1996.
....11 a network of workstations for a mixture of parallel and sequential jobs. They have measured contention effects in order to show the necessity of mechanisms like co scheduling and migration of parallel jobs to idle workstations, but they do not predict contention effects. In fact, both [16] and [17] propose co scheduling strategies (or an approximation) for parallel applications executing in network of workstations. In [26] Harchol Balter and Downey show that preemptive migration, in which running processes may be suspended, moved to a remote host, and restarted, can minimize the effect ....
....the same priority as one another and to be scheduled locally in round robin fashion on the timeshared systems. In practice, most operating systems executing on workstations employ a priority based scheduling strategy that reduces to a round robin policy when the executing applications are CPU bound [17]. Since we assume the applications to be coarse grain, considering a round robin local scheduler for the time shared systems is a reasonable assumption. Assumptions about the Usage Our calculation of slowdown factors currently requires information about all the applications executing on the ....
A. C. Dusseau, R. H. Arpaci, and D. E. Culler, "Effective Distributed Scheduling of Parallel Workloads," in Proceedings of ACM SIGMETRICS '96, pp. 25-36, May 1996.
....achieving good parallel application performance in such environments, without sacrificing local scheduling autonomy. Specifically, we present new theoretical results on optimal decision making for systems that use implicit coscheduling [4]. There has been a wealth of recent research in this area [3, 4, 7, 5, 6]. We classify methods of scheduling processes in clusters of workstations in the following three categories: • Local Process Scheduling: Each workstation independently schedules its processes based only on local constraints. This approach is the least complex because it does not require any ....
....each sees similar or related communication behavior by local processes that are part of parallel applications. There are two major forms of implicit coscheduling in the literature. The first is dynamic coscheduling [7] which is based on message arrivals only. The second is two phase spin blocking [3, 4], which makes use of several types of information, such as response time, the nature of message arrivals, and the amount of scheduling progress made by each process. This paper develops a theoretical framework for analyzing the spin blocking implicit coscheduling. Several reports have shown that ....
A. C. Dusseau, R. H. Arpaci, and D. E. Culler. Effective distributed scheduling of parallel workloads. In The 1996
....the same priority as one another and are scheduled locally in round robin fashion on the timeshared systems. In practice, most operating systems executing on workstations employ a priority based scheduling strategy that reduces to a round robin policy when the executing applications are CPU bound [7]. Since we assume the applications to be coarse grained, considering a round robin local scheduler for the time shared systems is a reasonable assumption. We assume that communication between machines involves the transference of large bursts, requires data conversion, and uses TCP IP via ....
A. C. Dusseau, R. H. Arpaci, and D. E. Culler, "Effective Distributed Scheduling of Parallel Workloads," in Proceedings of ACM SIGMETRICS'96, pp. 25--36, May 1996.
....of itself. This is based on an application model where relatively long phases of computation are interleaved with phases of intense communication. The crucial observation is that the first processes entering the communication phase will necessarily block, waiting for the others to catch up [172]. When the last process enters the communication phase, all the rest will have relatively high priorities due to the multi feedback queueing mechanism typically used on workstations. Therefore they will tend to run immediately when unblocked, allowing them to complete the communication phase ....
A. Dusseau, R. H. Arpaci, and D. E. Culler, "Effective distributed scheduling of parallel workloads". In SIGMETRICS Conf. Measurement & Modeling of Comput. Syst., pp. 25--36, May 1996.
....care of itself. This is based on an application model where relatively long phases of computation are interleaved with phases of intense communication. The crucial observation is that the first processes entering the communication phase will necessarily block, waiting for the others to catch up [98]. When the last process enters the communication phase, all the rest will have relatively high priorities due to the multi feedback queueing mechanism typically used on workstations. Therefore they will tend to run immediately when unblocked, allowing them to complete the communication phase ....
A. Dusseau, R. H. Arpaci, and D. E. Culler, "Effective distributed scheduling of parallel workloads". In SIGMETRICS Conf. Measurement & Modeling of Comput. Syst., pp. 25--36, May 1996.
....their overall effect on the performance is determined. 3.3 Each to some communication Assume there are N workstations in the system, each workstation sends messages to m specific workstations. Examples of such communication patterns are neighbor communications and transpose communications [3]. Let A_{ij} denote the jth (1 ≤ j ≤ m) workstation to which workstation i sends messages. Workstation A_{ij} receives n_{ij} messages by all other nodes at one time. If there are local user communication traffic flows, and parallel traffic flows are managed properly without congestion at any ....
A.C. Dusseau, R.H. Arpaci, and D.E. Culler, "Effective Distributed Scheduling of Parallel Workloads", Sigmetrics 96, May 23-26, Philadelphia, 1996.
....As the performance provided by networking technologies dramatically increases, solutions for high performance fine-grained distributed computing start to emerge. Computing based on clusters, or on networks of workstations, greatly increases the performance of a variety of applications at low costs [1, 2, 3, 4]. The performance of such clusters relies heavily on low communication latency. For example, applications on clusters frequently rely on reliable multicast protocols to disseminate the state of the computation and to manage the state of the system. These protocols typically involve several rounds ....
Dusseau, Arpaci, Culler, Effective Distributed Scheduling of Parallel Workloads, SIGMETRICS '96 Conference on Measurements and Modeling, 1996
....may span multiple computers and/or UNIX processes. Communications between threads may be performed through shared memory, message passing, and/or other means. Concurrent scheduling of a job's threads has been shown to improve the efficiency of both the individual parallel jobs and the system [3, 13]. The job's perspective is similar to that of a dedicated machine during the time slices of its execution. Some reduction in I/O bandwidth may be experienced due to interference from other jobs, but CPU and memory resources should be dedicated. Job efficiency improvements result from a reduction ....
....Most SMP computers schedule each process independently, which works well for a workload consisting of many independent processes. However, the solution of large problems is dependent upon the use of parallel jobs, which suffer significant inefficiencies without the benefit of concurrent scheduling [3, 13]. Parallel job development efforts at the National Energy Research Supercomputer Center (NERSC) illustrate difficulties in parallel job scheduling [2]. In order to encourage parallel job development, NERSC provided dedicated time to parallel jobs on a Cray C90 computer. Several of the parallel ....
A. C. Dusseau, R. H. Arpaci and D. E. Culler, Effective Distributed Scheduling of Parallel Workloads. ACM SIGMETRICS `96 Conference on the Measurement and Modeling of Computer Systems, 1996.
.... an analysis of the performance of dynamic partitioning [85]. Dussa et al. compared space slicing against no partitioning, and found that space partitioning pays off [15]. On the other hand, coscheduling is compared to local scheduling and is found to be superior by Dusseau, Arpaci, and Culler [16]. 2.2.7 Knowledge based scheduling Majumdar, Eager and Bunt showed that, under high variability service time distributions, round robin (RR) was far better than FCFS, but that policies based on knowledge of the processing requirement (such as least work first) were still better. Knowledge of the ....
A. Dusseau, R. H. Arpaci, and D. E. Culler, "Effective distributed scheduling of parallel workloads". In SIGMETRICS Conf. Measurement & Modeling of Comput. Syst., pp. 25--36, May 1996.
No context found.
Andrea C. Dusseau, Remzi H. Arpaci, and David E. Culler. Effective Distributed Scheduling of Parallel Workloads. In Proceedings of 1996.
No context found.
A. Dusseau, R. Arpaci, D. Culler. Effective Distributed Scheduling of Parallel Workloads. In Proceedings of 1996 ACM Sigmetrics International Conference on Measurement and Modeling of Computer Systems, 1996.
No context found.
A. Dusseau, R. Arpaci, D. Culler. Effective Distributed Scheduling of Parallel Workloads. In Proceedings of 1996 ACM Sigmetrics International Conference on Measurement and Modeling of Computer Systems, 1996.
No context found.
Andrea C. Dusseau, Remzi H. Arpaci, and David E. Culler. Effective Distributed Scheduling of Parallel Workloads. In Proceedings of the 1996 ACM SIGMETRICS Conference, 1996.
....in a linear fashion to the waiting processes. A process at a barrier waits until it has received this completion message before continuing. 1 Note that the behavior of SIMplicity in the experiments presented in this dissertation differs in two primary areas from those originally presented in [47]. First, in the previous simulations, a sleeping process was given kernel priority (above 59) when a message arrived; thus, this process was almost always promptly scheduled. Second, the process immediately slept again if the arriving message was not the event for which the process was waiting. ....
....effectively than applications in which processes communicate in subgroups. As described in Section 6.4.2, all to all communication patterns encourage either all or none of the processes to remain scheduled. This effect matches the simulation results in Figure 8.7 and the simulations reported in [47]: when processes did not spin the full baseline amount, processes communicating in an all to all Transpose pattern stayed coordinated through the communication phase more often than those communicating in the NEWS pattern. Fourth, the Random pattern with infrequent barriers and infrequent ....
Andrea C. Dusseau, Remzi H. Arpaci, and David E. Culler. Effective Distributed Scheduling of Parallel Workloads. In Proceedings of 1996 ACM Sigmetrics International Conference on Measurement and Modeling of Computer Systems, 1996.
No context found.
A. C. Dusseau, et al., "Effective Distributed Scheduling of Parallel Workloads", ACM Sigmetrics 96, Philadelphia, May 1996, pp. 25-36.
No context found.
A. C. Dusseau, R. H. Arpaci, and D. E. Culler. Effective distributed scheduling of parallel workloads. In Proceedings of ACM SIGMETRICS, 1996.
No context found.
Andrea C. Dusseau, Remzi H. Arpaci, and David E. Culler. Effective distributed scheduling of parallel workloads. In ACM SIGMETRICS '96 Conference on the Measurement and Modeling of Computer Systems, pages 25--36, 1996. Available from http://www.cs.berkeley.edu/~dusseau/Papers/sigmetrics96.ps.
No context found.
A. C. Dusseau, R. H. Arpaci, and D. E. Culler, "Effective Distributed Scheduling of Parallel Workloads," in Sigmetrics'96 Conference on the Measurement and Modeling of Computer Systems, 1996.