Results 1 - 10 of 60
Backfilling Using System-Generated Predictions Rather Than User Runtime Estimates
- IEEE Trans. Parallel & Distributed Syst.
, 2007
Towards characterizing cloud backend workloads: insights from Google compute clusters
- ACM SIGMETRICS Performance Evaluation Review
, 2010
Cited by 54 (4 self)
Abstract:
The advent of cloud computing promises highly available, efficient, and flexible computing services for applications such as web search, email, voice over IP, and web search alerts. Our experience at Google is that realizing the promises of cloud computing requires an extremely scalable backend consisting of many large compute clusters that are shared by application tasks with diverse service level requirements for throughput, latency, and jitter. These considerations impact (a) capacity planning to determine which machine resources must grow and by how much and (b) task scheduling to achieve high machine utilization and to meet service level objectives. Both capacity planning and task scheduling require a good understanding of task resource consumption (e.g., CPU and memory usage). This in turn demands simple and accurate approaches to ...
The impact of data replication on job scheduling performance in the Data Grid
- Future Generation Computer Systems
, 2006
Cited by 30 (0 self)
Abstract:
In the Data Grid environment, the primary goal of data replication is to shorten the data access time experienced by the job and consequently reduce the job turnaround time. After introducing a Data Grid architecture that supports efficient data access for the Grid job, the dynamic data replication algorithms are put forward. Combined with different Grid scheduling heuristics, the performances of the data replication algorithms are evaluated with various simulations. The simulation results demonstrate that the dynamic replication algorithms can reduce the job turnaround time remarkably. In particular, the combination of shortest turnaround time scheduling heuristic (STT) and centralized dynamic replication with response-time oriented replica placement (CDR RTPlace) exhibits remarkable performance in diverse system environments and job workloads. © 2005 Elsevier B.V. All rights reserved.
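A shortest-turnaround-time (STT) style dispatch rule, as mentioned in the abstract above, can be sketched roughly as follows. The cost model, site names, and numbers are illustrative assumptions, not taken from the paper:

```python
# Hypothetical STT-style heuristic: send the job to the site whose estimated
# turnaround (queue wait + data transfer + compute) is smallest. If a site
# already holds a replica of the input data, the transfer term is zero.
def stt_site(input_size_gb, compute_secs, sites):
    # sites: list of (name, queue_wait_s, bandwidth_gb_per_s, has_replica)
    def turnaround(site):
        name, wait, bandwidth, has_replica = site
        transfer = 0.0 if has_replica else input_size_gb / bandwidth
        return wait + transfer + compute_secs
    return min(sites, key=turnaround)[0]

sites = [
    ("site-a", 10.0, 0.1, False),   # short queue, but 50 GB at 0.1 GB/s
    ("site-b", 120.0, 1.0, True),   # long queue, replica already in place
    ("site-c", 30.0, 0.5, False),
]
print(stt_site(50.0, 60.0, sites))  # site-b: 120 + 0 + 60 = 180 s wins
```

This toy example also shows why dynamic replication shortens turnaround: placing a replica at a site removes the transfer term from that site's estimate.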
Peer-to-peer grid computing with the OurGrid community
- in: Proceedings of the SBRC 2005 - IV Salão de Ferramentas
, 2005
Cited by 25 (1 self)
Abstract. For a number of research and commercial computational problems, it is possible to use as much computing power as available to speed the resolution of the problem through parallel processing. Grid computing has done much in the direction of enabling users to use the computing power of resources across administrative boundaries for solving this kind of problem. However, not much has been done to solve the precedent problem of gaining access to resources spread across several institutions. We have addressed this issue in the OurGrid Toolkit by developing the OurGrid Community, a peer-to-peer network for sharing computational power. The goal of this system is to provide easy access to large amounts of computational resources for anyone who needs them. All participants contribute idle resources to form a shared pool from which all can benefit. To motivate the contribution to this pool, the OurGrid Community uses an allocation mechanism that rewards the peers that donate more to the system. This paper describes the OurGrid Community and its first deployment in a grid across Brazil called Pauá, which is presently being used by several Brazilian research institutes.
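The incentive mechanism mentioned above (rewarding peers that donate more) can be sketched, very loosely, as a local ledger of favors. The class, peer names, and numbers below are illustrative assumptions, not OurGrid's actual implementation:

```python
# Illustrative reciprocation-based allocation: when idle resources are
# contended, grant them to the requesting peer with the highest favor
# balance (resources donated to us minus resources it consumed from us).
from collections import defaultdict

class FavorLedger:
    def __init__(self):
        self.balance = defaultdict(float)  # peer -> favors we owe that peer

    def record_donation(self, donor, amount):
        self.balance[donor] += amount

    def record_consumption(self, consumer, amount):
        # Balances never go negative: free riders simply stay at zero.
        self.balance[consumer] = max(0.0, self.balance[consumer] - amount)

    def choose_requester(self, requesters):
        # Prefer the peer to whom we owe the most favors.
        return max(requesters, key=lambda p: self.balance[p])

ledger = FavorLedger()
ledger.record_donation("peerA", 10.0)
ledger.record_donation("peerB", 4.0)
ledger.record_consumption("peerA", 3.0)
print(ledger.choose_requester(["peerA", "peerB"]))  # peerA: balance 7 > 4
```

Because each peer keeps its own ledger locally, no trusted central accounting service is needed, which fits the peer-to-peer setting.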
On the Efficacy, Efficiency and Emergent Behavior of Task Replication in Large Distributed Systems
- Parallel Computing
, 2007
Cited by 23 (1 self)
Abstract: Large distributed systems challenge traditional schedulers, as it is often hard to determine a priori how long each task will take to complete on each resource, information that is input for such schedulers. Task replication has been applied in a variety of scenarios as a way to circumvent this problem. Task replication consists of dispatching multiple replicas of a task and using the result from the first replica to finish. Replication schedulers (i.e., schedulers that employ task replication) are able to achieve good performance even in the absence of information on tasks and resources. They are also of smaller complexity than traditional schedulers, making them better suited for large distributed systems. On the other hand, replication schedulers waste cycles with the replicas that are not the first to finish. Moreover, this extra consumption of resources raises severe concerns about the system-wide performance of a distributed system with multiple, competing replication schedulers. This paper presents a comprehensive study of task replication, comparing replication schedulers against traditional information-based schedulers, and establishing their efficacy (the performance delivered to the application), efficiency (the amount of resources wasted), and emergent behavior (the system-wide behavior of a system with multiple replication schedulers). We also introduce a simple access control strategy that can be implemented locally by each resource and greatly improves the overall performance of a system on which multiple replication schedulers compete for resources.
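The core mechanism described in the abstract, dispatch several replicas and keep whichever finishes first, can be sketched with standard Python concurrency primitives. The task model and resource speeds are illustrative assumptions:

```python
# Sketch of a replication scheduler: run replicas of one task on several
# resources of a priori unknown speed and return the first result to arrive.
import concurrent.futures
import time

def run_task(resource_speed):
    # Simulated task whose runtime depends on the (unknown) resource speed.
    time.sleep(0.05 / resource_speed)
    return resource_speed

def schedule_with_replication(resource_speeds, replicas=3):
    chosen = resource_speeds[:replicas]
    with concurrent.futures.ThreadPoolExecutor(max_workers=replicas) as pool:
        futures = [pool.submit(run_task, s) for s in chosen]
        done, not_done = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        for f in not_done:
            f.cancel()  # best effort; replicas already running waste cycles
        return next(iter(done)).result()

print(schedule_with_replication([0.5, 1.0, 2.0, 4.0]))
```

The replicas that are not first to finish are exactly the wasted cycles the abstract is concerned with: cancellation only prevents queued replicas from starting, not running ones from consuming resources.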
Performance Evaluation of Scheduling Policies for Volunteer Computing
Cited by 18 (2 self)
Abstract:
BOINC, a middleware system for volunteer computing, allows hosts to be attached to multiple projects. Each host periodically requests jobs from project servers and executes the jobs. This process involves four interrelated policies: 1) of the runnable jobs on a host, which to execute? 2) when and from what project should a host request more work? 3) what jobs should a server send in response to a given request? 4) how to estimate the remaining runtime of a job? In this paper, we consider several alternatives for each of these policies. Using simulation, we study various combinations of policies, comparing them on the basis of several performance metrics and over a range of parameters such as job length variability, deadline slack, and number of attached projects.
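As a toy illustration of policy 1), one simple candidate is earliest-deadline-first among the runnable jobs. BOINC's real policies also weigh factors such as per-project resource shares, so treat this purely as a sketch with made-up project names:

```python
# Illustrative earliest-deadline-first choice among runnable jobs.
def pick_job(runnable_jobs):
    # runnable_jobs: list of (project_name, deadline_hours) pairs
    return min(runnable_jobs, key=lambda job: job[1])

jobs = [("climate", 72.0), ("seti", 24.0), ("folding", 48.0)]
print(pick_job(jobs))  # ("seti", 24.0): the earliest deadline
```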
Online scheduling in grids
- In IEEE International Symposium on Parallel and Distributed Processing (IPDPS)
, 2008
Cited by 16 (3 self)
Abstract:
This paper addresses nonclairvoyant and non-preemptive online job scheduling in Grids. In the applied basic model, the Grid system consists of a large number of identical processors that are divided into several machines. Jobs are independent, they have a fixed degree of parallelism, and they are submitted over time. Further, a job can only be executed on the processors belonging to the same machine. It is our goal to minimize the total makespan. We show that the performance of Garey and Graham's list scheduling algorithm is significantly worse in Grids than in multiprocessors. Then we present a Grid scheduling algorithm that guarantees a competitive factor of 5. This algorithm can be implemented using a "job stealing" approach and may be well suited to serve as a starting point for Grid scheduling algorithms in real systems.
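Graham-style list scheduling, whose Grid behavior the paper analyzes, reduces in the sequential-job special case to "give the next job to the machine that frees up first". The sketch below shows only that special case; the paper's model additionally confines parallel jobs to single machines:

```python
# Greedy list scheduling on identical machines (sequential-job case):
# assign each job, in list order, to the machine that becomes free earliest,
# and return the resulting makespan.
import heapq

def list_schedule(job_runtimes, num_machines):
    free_times = [0.0] * num_machines  # finish time of each machine so far
    heapq.heapify(free_times)
    for runtime in job_runtimes:
        earliest = heapq.heappop(free_times)
        heapq.heappush(free_times, earliest + runtime)
    return max(free_times)

print(list_schedule([3, 1, 4, 1, 5, 9, 2, 6], 3))  # makespan 12.0
```

On a single multiprocessor this greedy rule is within a factor of 2 of optimal; the paper's point is that the guarantee degrades once jobs must fit inside individual machines of a Grid.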
Analyzing and adjusting user runtime estimates to improve job scheduling on the Blue Gene/P
- In 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
, 2010
Cited by 14 (3 self)
Abstract—Backfilling and short-job-first are widely acknowledged enhancements to the simple but popular first-come, first-served job scheduling policy. However, both enhancements depend on user-provided estimates of job runtime, which research has repeatedly shown to be inaccurate. We have investigated the effects of this inaccuracy on backfilling and different queue prioritization policies, determining which part of the scheduling policy is most sensitive. Using these results, we have designed and implemented several estimation-adjusting schemes based on historical data. We have evaluated these schemes using workload traces from the Blue Gene/P system at Argonne National Laboratory. Our experimental results demonstrate that dynamically adjusting job runtime estimates can improve job scheduling performance by up to 20%.
Keywords: job scheduling; runtime estimates; Blue Gene
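One simple instance of a historical estimation-adjusting scheme is to scale each user's estimate by that user's mean actual-to-estimate ratio over past jobs. This particular scheme and its numbers are illustrative assumptions, not necessarily among those the paper evaluates:

```python
# Illustrative runtime-estimate adjustment from per-user history.
from collections import defaultdict

class EstimateAdjuster:
    def __init__(self):
        self.ratios = defaultdict(list)  # user -> past actual/estimate ratios

    def record(self, user, estimate, actual):
        if estimate > 0:
            self.ratios[user].append(actual / estimate)

    def adjust(self, user, estimate):
        history = self.ratios[user]
        if not history:
            return estimate  # no history yet: trust the user's estimate
        return estimate * sum(history) / len(history)

adj = EstimateAdjuster()
adj.record("alice", estimate=100, actual=40)   # ratio 0.4
adj.record("alice", estimate=200, actual=100)  # ratio 0.5
print(adj.adjust("alice", 300))  # mean ratio 0.45 -> adjusted estimate 135.0
```

Since users habitually overestimate (to avoid having jobs killed), such a correction typically shrinks estimates, which gives backfilling more accurate holes to fill.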
Heuristic for resources allocation on utility computing infrastructures
- in Proc. 6th International Workshop on Middleware for Grid Computing, ACM
, 2008
Cited by 12 (3 self)
Abstract:
The use of utility on-demand computing infrastructures, such as Amazon's Elastic Clouds [1], is a viable solution to speed lengthy parallel computing problems for those without access to other cluster or grid infrastructures. With a suitable middleware, bag-of-tasks problems could be easily deployed over a pool of virtual computers created on such infrastructures. In bag-of-tasks problems, as there is no communication between tasks, the number of concurrent tasks is allowed to vary over time. In a utility computing infrastructure, if too many virtual computers are created, the speedups are high but may not be cost effective; if too few computers are created, the cost is low but speedups fall below expectations. Without previous knowledge of the processing time of each task, it is difficult to determine how many machines should be created. In this paper, we present a heuristic to optimize the number of machines that should be allocated to process tasks so that for a given budget the speedups are maximal. We have simulated the proposed heuristics against real and theoretical workloads and evaluated the ratios between number of allocated hosts, charged times, speedups and processing times. With the proposed heuristics, it is possible to obtain speedups in line with the number of allocated computers, while being charged approximately the same predefined budget.
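The budget-versus-speedup trade-off the abstract describes can be made concrete with a toy search over candidate machine counts. The cost model (whole-hour charging, greedy longest-first task assignment) and the numbers are illustrative assumptions, not the paper's heuristic:

```python
# Toy budget-constrained choice of machine count for a bag of tasks:
# for each candidate count, estimate the makespan and the charged cost,
# then keep the cheapest-in-time option that stays within budget.
import math

def estimate_cost_and_makespan(task_runtimes, num_machines, price_per_hour):
    # Greedy longest-first assignment, then charge each busy machine for
    # its busy time rounded up to whole hours (runtimes are in hours).
    loads = [0.0] * num_machines
    for t in sorted(task_runtimes, reverse=True):
        loads[loads.index(min(loads))] += t
    cost = sum(math.ceil(l) for l in loads if l > 0) * price_per_hour
    return cost, max(loads)

def pick_machine_count(task_runtimes, budget, price_per_hour):
    best = None  # (num_machines, makespan, cost)
    for m in range(1, len(task_runtimes) + 1):
        cost, makespan = estimate_cost_and_makespan(task_runtimes, m,
                                                    price_per_hour)
        if cost <= budget and (best is None or makespan < best[1]):
            best = (m, makespan, cost)
    return best

print(pick_machine_count([2.0, 1.5, 3.0, 0.5, 2.5],
                         budget=10, price_per_hour=1))  # (4, 3.0, 10)
```

The example shows the effect the abstract describes: four machines cut the makespan from 9.5 hours to 3.0 for the same 10-unit charge, while a fifth machine would break the budget without helping.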
Dynamic Fractional Resource Scheduling for HPC Workloads
Cited by 10 (2 self)
Abstract — We propose a novel job scheduling approach for homogeneous cluster computing platforms. Its key feature is the use of virtual machine technology for sharing resources in a precise and controlled manner. We justify our approach and propose several job scheduling algorithms. We present results obtained in simulations for synthetic and real-world High Performance Computing (HPC) workloads, in which we compare our proposed algorithms with standard batch scheduling algorithms. We find that our approach provides drastic performance improvements over batch scheduling. In particular, we identify a few promising algorithms that perform well across most experimental scenarios. Our results demonstrate that virtualization technology coupled with lightweight scheduling strategies affords dramatic improvements in performance for HPC workloads.