Results 1 - 10 of 35
Case study for running HPC applications in public clouds
- in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC '10), 2010
"... ABSTRACT Cloud computing is emerging as an alternative computing platform to bridge the gap between scientists' growing computational demands and their computing capabilities. A scientist who wants to run HPC applications can obtain massive computing resources 'in the cloud' quickly ..."
Cited by 36 (0 self)
Abstract: Cloud computing is emerging as an alternative computing platform to bridge the gap between scientists' growing computational demands and their computing capabilities. A scientist who wants to run HPC applications can obtain massive computing resources 'in the cloud' quickly (in minutes), as opposed to the days or weeks it normally takes under traditional business processes. Due to the popularity of Amazon EC2, most HPC-in-the-cloud research has been conducted using EC2 as a target platform. Previous work has not investigated how results might depend upon the cloud platform used. In this paper, we extend previous research to three public cloud computing platforms. In addition to running classical benchmarks, we also port a 'full-size' NASA climate prediction application into the cloud and compare our results with those from dedicated HPC systems. Our results show that 1) virtualization technology, which is widely used by cloud computing, adds little performance overhead; 2) most current public clouds are not designed for running scientific applications, primarily because of their poor networking capabilities. However, a cloud with a moderately better network than EC2's would deliver a significant performance improvement. Our observations will help to quantify the improvement from using fast networks for running HPC in the cloud, and indicate a promising trend of HPC capability in future private science clouds. We also discuss techniques that will help scientists best utilize public cloud platforms despite current deficiencies.
Modeling and Synthesizing Task Placement Constraints in Google Compute Clusters
"... Evaluating the performance of large compute clusters requires benchmarks with representative workloads. At Google, performance benchmarks are used to obtain performance metrics such as task scheduling delays and machine resource utilizations to assess changes in application codes, machine configurat ..."
Cited by 28 (2 self)
Abstract: Evaluating the performance of large compute clusters requires benchmarks with representative workloads. At Google, performance benchmarks are used to obtain performance metrics such as task scheduling delays and machine resource utilizations to assess changes in application codes, machine configurations, and scheduling algorithms. Existing approaches to workload characterization for high performance computing and grids focus on task resource requirements for CPU, memory, disk, I/O, network, etc. Such resource requirements address how much resource is consumed by a task. However, in addition to resource requirements, Google workloads commonly include task placement constraints that determine which machine resources are consumed by tasks. Task placement constraints arise because of task dependencies such as those related to hardware architecture and kernel version.
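The notion of a placement constraint can be made concrete with a small sketch. The attribute names, operators, and data layout below are hypothetical assumptions for illustration, not Google's actual constraint schema; the sketch only shows how a scheduler might test whether a machine's attributes satisfy a task's constraints, assuming each constraint is an attribute/operator/value triple.

```python
# Minimal sketch: checking a task's placement constraints against machine attributes.
# Attribute names and operators are illustrative assumptions, not Google's schema.
from dataclasses import dataclass

@dataclass
class Constraint:
    attribute: str   # e.g. "kernel_version" or "arch" (hypothetical names)
    op: str          # one of "==", "!=", ">=", "<="
    value: str

def satisfies(machine: dict, constraints: list[Constraint]) -> bool:
    """Return True only if the machine's attributes satisfy every constraint."""
    ops = {
        "==": lambda a, b: a == b,
        "!=": lambda a, b: a != b,
        ">=": lambda a, b: a >= b,   # plain string comparison; a real scheduler
        "<=": lambda a, b: a <= b,   # would compare versions more carefully
    }
    for c in constraints:
        if c.attribute not in machine:
            return False
        if not ops[c.op](machine[c.attribute], c.value):
            return False
    return True

# Example: a task that requires an x86_64 machine with a recent-enough kernel.
machine = {"kernel_version": "2.6.32", "arch": "x86_64", "num_disks": "2"}
task_constraints = [Constraint("arch", "==", "x86_64"),
                    Constraint("kernel_version", ">=", "2.6.30")]
print(satisfies(machine, task_constraints))  # True
```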
Trace-Based Evaluation of Job Runtime and Queue Wait Time Predictions in Grids
"... Large-scale distributed computing systems such as grids are serving a growing number of scientists. These environments bring about not only the advantages of an economy of scale, but also the challenges of resource and workload heterogeneity. A consequence of these two forms of heterogeneity is that ..."
Cited by 16 (6 self)
Abstract: Large-scale distributed computing systems such as grids are serving a growing number of scientists. These environments bring not only the advantages of an economy of scale, but also the challenges of resource and workload heterogeneity. A consequence of these two forms of heterogeneity is that job runtimes and queue wait times are highly variable, which generally reduces system performance and makes grids difficult for the common scientist to use. Predicting job runtimes and queue wait times has been widely studied for parallel environments. However, there is no detailed investigation of how the proposed prediction methods perform in grids, whose resource structure and workload characteristics are very different from those in parallel systems. In this paper, we assess the performance and benefit of predicting job runtimes and queue wait times in grids based on traces gathered from various research and production grid environments. First, we evaluate the performance of simple yet widely used time series prediction methods and the effect of applying them to different types of job classes (e.g., all jobs submitted by single users or to single sites). Then, we investigate the performance of two kinds of queue wait time prediction methods for grids. Last, we investigate whether prediction-based grid-level scheduling policies can perform better than policies that do not use predictions.
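As a hedged illustration of the "simple yet widely used time series prediction methods" mentioned above, the sketch below predicts each job's runtime as the mean of the last few runtimes observed in the same job class and reports the mean absolute error over a trace. It is an assumed sliding-window baseline, not the specific estimators evaluated in the paper.

```python
# Illustrative sketch: a sliding-window mean predictor for job runtimes,
# the kind of method often used as a time-series baseline on workload traces.
from collections import deque

def predict_runtimes(runtimes, window=5):
    """Predict each job's runtime as the mean of the previous `window` runtimes
    in the same job class; return (predictions, mean absolute error)."""
    history = deque(maxlen=window)
    predictions, errors = [], []
    for actual in runtimes:
        pred = sum(history) / len(history) if history else actual
        predictions.append(pred)
        errors.append(abs(pred - actual))
        history.append(actual)
    mae = sum(errors) / len(errors) if errors else 0.0
    return predictions, mae

# Example: runtimes (seconds) of jobs submitted by a single user.
trace = [120, 130, 125, 400, 410, 405, 130, 128]
preds, mae = predict_runtimes(trace, window=3)
print(preds, mae)
```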
Modeling job lifespan delays in volunteer computing projects
- in Proceedings of the 9th IEEE International Symposium on Cluster Computing and the Grid (CCGrid), 2009
"... Volunteer Computing (VC) projects harness the power of computers owned by volunteers across the Internet to perform hundreds of thousands of independent jobs. In VC projects, the path leading from the generation of jobs to the validation of the job results is characterized by delays hid-den in the j ..."
Cited by 14 (3 self)
Abstract: Volunteer Computing (VC) projects harness the power of computers owned by volunteers across the Internet to perform hundreds of thousands of independent jobs. In VC projects, the path leading from the generation of jobs to the validation of the job results is characterized by delays hidden in the job lifespan, i.e., distribution delay, in-progress delay, and validation delay. These delays are difficult to estimate because of the dynamic behavior and heterogeneity of VC resources. A wrong estimation of these delays can cause a loss of project throughput and increased job latency in VC projects. In this paper, we evaluate the accuracy of several probabilistic methods to model the upper time bounds of these delays. We show how our selected models predict up-and-down trends in traces from existing VC projects. The use of our models provides valuable insights for selecting project deadlines and making scheduling decisions. By accurately predicting job lifespan delays, our models lead to more efficient resource use, higher project throughput, and lower job latency in VC projects.
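One simple way to think about an "upper time bound" on a delay, offered purely as a sketch rather than as the probabilistic models the paper actually evaluates, is an empirical upper quantile of delays observed in a recent trace window:

```python
# Minimal sketch: an empirical upper-quantile bound on observed lifespan delays.
# The paper compares several probabilistic models; this percentile bound is only
# an illustration of the idea of an upper time bound.
def upper_bound(delays, quantile=0.95):
    """Return the delay below which `quantile` of the observed samples fall."""
    if not delays:
        raise ValueError("need at least one observed delay")
    ordered = sorted(delays)
    idx = min(int(quantile * len(ordered)), len(ordered) - 1)
    return ordered[idx]

# Example: in-progress delays (hours) observed for recent jobs; a project deadline
# set at this bound should cover roughly 95% of jobs.
recent_delays = [2.1, 3.4, 2.8, 5.0, 2.2, 7.9, 3.1, 2.6, 4.4, 3.0]
print(upper_bound(recent_delays, 0.95))
```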
VGrADS: Enabling e-Science Workflows on Grids and Clouds with Fault Tolerance
"... Today’s scientific workflows use distributed heterogeneous resources through diverse grid and cloud interfaces that are often hard to program. In addition, especially for time-sensitive critical applications, predictable quality of service is necessary across these distributed resources. VGrADS ’ vi ..."
Cited by 11 (2 self)
Abstract: Today's scientific workflows use distributed heterogeneous resources through diverse grid and cloud interfaces that are often hard to program. In addition, especially for time-sensitive critical applications, predictable quality of service is necessary across these distributed resources. VGrADS' virtual grid execution system (vgES) provides a uniform qualitative resource abstraction over grid and cloud systems. We apply vgES to scheduling a set of deadline-sensitive weather forecasting workflows. Specifically, this paper reports on our experiences with (1) virtualized reservations for batch-queue systems, (2) coordinated usage of TeraGrid (batch queue), Amazon EC2 (cloud), our own clusters (batch queue), and Eucalyptus (cloud) resources, and (3) fault tolerance through automated task replication. The combined effect of these techniques was to enable a new workflow planning method that balances performance, reliability, and cost considerations. The results point toward improved resource selection and execution management support for a variety of e-Science applications over grids and cloud systems.
SWARM: Scheduling Large-Scale Jobs over the Loosely-Coupled HPC Clusters
- in Proceedings of the Fourth IEEE International Conference on eScience (eScience '08), Indianapolis, 2008
"... Abstract — Compute-intensive scientific applications are heavily reliant on the available quantity of computing resources. The Grid paradigm provides a large scale computing environment for scientific users. However, conventional Grid job submission tools do not provide a high-level job scheduling e ..."
Cited by 10 (4 self)
Abstract: Compute-intensive scientific applications are heavily reliant on the available quantity of computing resources. The Grid paradigm provides a large-scale computing environment for scientific users. However, conventional Grid job submission tools do not provide a high-level job scheduling environment for these users across multiple institutions. For extremely large numbers of jobs, a more scalable job scheduling framework that can leverage highly distributed clusters and supercomputers is required. In this paper, we propose a high-level job scheduling Web service framework, Swarm. Swarm is developed for scientific applications that must submit a massive number of high-throughput jobs or workflows to highly distributed computing clusters. The Swarm service itself is designed to be …
VARQ: Virtual advance reservations for queues
- in HPDC '08: Proceedings of the 17th International Symposium on High Performance Distributed Computing, 2008
"... In high-performance computing (HPC) settings, in which multiprocessor machines are shared among users with potentially competing resource demands, processors are allocated to user workload using space sharing. Typically, users interact with a given machine by submitting their jobs to a centralized b ..."
Cited by 9 (1 self)
Abstract: In high-performance computing (HPC) settings, in which multiprocessor machines are shared among users with potentially competing resource demands, processors are allocated to user workload using space sharing. Typically, users interact with a given machine by submitting their jobs to a centralized batch scheduler that implements a site-specific, and often partially hidden, policy designed to maximize machine utilization while providing tolerable turn-around times. In practice, while most HPC systems experience good utilization levels, the amount of time individual jobs spend waiting to begin execution has been shown to be highly variable and difficult to predict, leading to user confusion and/or frustration. One method proposed for dealing with this uncertainty is to allow users who are willing to plan ahead to make "advance reservations" for processor resources. To date, however, few if any HPC centers provide an advance reservation capability to their general user populations, for fear (supported by previous research) that diminished machine utilization will occur if and when advance reservations are introduced. In this work, we describe VARQ, a new method for job scheduling that provides users with probabilistic "virtual" advance reservations using only existing best-effort batch schedulers and policies. VARQ functions as an overlay, submitting jobs that are indistinguishable from the normal workload serviced by a scheduler. We describe the statistical methods we use to implement VARQ, detail an empirical evaluation of its effectiveness in a number of HPC settings, and explore the potential future impact of VARQ should it become widely used. Without requiring HPC sites to support advance reservations, we find that VARQ can implement a reservation capability probabilistically and that the effects of this probabilistic approach are unlikely to negatively affect resource utilization.
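To make the idea of a probabilistic "virtual" reservation concrete, here is a minimal sketch under simple assumptions: given queue wait times observed on the target batch system, choose a submit-ahead lead time so that, empirically, a job would have started by the desired reservation start with a given confidence. This is an assumed empirical-frequency illustration, not VARQ's actual statistical machinery.

```python
# Sketch: pick a submit-ahead lead time from observed queue waits so a job is
# likely (with the requested confidence) to have started by the reservation time.
def start_probability(observed_waits, lead):
    """Fraction of past jobs that began executing within `lead` seconds of submission."""
    return sum(w <= lead for w in observed_waits) / len(observed_waits)

def choose_lead(observed_waits, confidence=0.95, step=300):
    """Smallest lead time (in `step`-second increments) meeting the confidence target."""
    lead = 0
    while start_probability(observed_waits, lead) < confidence:
        lead += step
    return lead

# Example: recent queue waits (seconds) on the target batch system.
waits = [300, 1200, 450, 900, 2400, 600, 750, 1800, 500, 1100]
submit_ahead = choose_lead(waits, confidence=0.95)
print(f"Submit the job {submit_ahead} s before the reservation start")
```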
SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Computing Infrastructures
2012
"... Exploitation of Best E ort Distributed Computing Infrastructures (BE-DCIs) allow operators to maximize the utilization of the infrastructures, and users to access the unused resources at relatively low cost. Because providers do not guarantee that the computing resources remain available to the user ..."
Cited by 5 (1 self)
Abstract: Exploitation of Best Effort Distributed Computing Infrastructures (BE-DCIs) allows operators to maximize the utilization of the infrastructures, and users to access unused resources at relatively low cost. Because providers do not guarantee that the computing resources remain available to the user during the entire execution of their applications, they offer a diminished Quality of Service (QoS) compared to traditional infrastructures. Profiling the execution of Bag-of-Tasks (BoT) applications on several kinds of BE-DCIs demonstrates that their task completion rate drops near the end of the execution. In this report, we present the SpeQuloS service, which enhances the QoS of BoT applications executed on BE-DCIs by reducing the execution time, improving its stability, and reporting a predicted completion time to users. SpeQuloS monitors the execution of the BoT on the BE-DCIs, and dynamically supplies fast and reliable Cloud resources when the critical part of the BoT is executed. We present the design and development of the framework and several strategies to decide when and how Cloud resources should be provisioned. Performance evaluation using simulations shows that SpeQuloS fulfills its objectives. It speeds up the execution of BoTs, in the best cases by a factor greater than 2, while offloading less than 2.5% of the workload to the Cloud. We report on preliminary …
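The monitoring-and-provisioning decision described above can be sketched as follows. The thresholds, the definition of the "tail," and the worker-count heuristic are assumptions made only for illustration; they are not the actual SpeQuloS strategies evaluated in the report.

```python
# Sketch of a tail-detection and cloud-provisioning decision for a BoT execution.
# Thresholds and the sizing heuristic are illustrative assumptions.
def should_provision_cloud(completed, total, throughput,
                           tail_fraction=0.9, min_throughput=1.0):
    """Trigger cloud provisioning when most tasks are done but the remaining
    tasks are completing too slowly (tasks per minute below `min_throughput`)."""
    done = completed / total
    return done >= tail_fraction and throughput < min_throughput

def cloud_workers_needed(remaining_tasks, task_minutes, deadline_minutes):
    """Rough worker count so the remaining work fits in the time left."""
    work = remaining_tasks * task_minutes
    return max(1, -(-work // deadline_minutes))  # ceiling division

# Example: 9,200 of 10,000 tasks done, throughput has dropped to 0.4 tasks/min,
# and the user wants the BoT finished within 2 hours.
if should_provision_cloud(9200, 10000, throughput=0.4):
    print(cloud_workers_needed(800, task_minutes=5, deadline_minutes=120))
```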
Enabling Large Scale Scientific Computations for Expressed Sequence Tag Sequencing over Grid and Cloud Computing Clusters
- in PPAM 2009: Eighth International Conference on Parallel Processing and Applied Mathematics, Wroclaw, 2009
"... Abstract. Compute-intensive biological applications are heavily reliant on the availability of computing resources. Grid based HPC clusters and emerging Cloud computing clusters provide a large scale computing environment for scientific users. However, large scale biological application often involv ..."
Cited by 4 (0 self)
Abstract: Compute-intensive biological applications are heavily reliant on the availability of computing resources. Grid-based HPC clusters and emerging Cloud computing clusters provide a large-scale computing environment for scientific users. However, large-scale biological applications often involve various types of computational tasks which can benefit from different types of computing clusters. Therefore, a high-level job scheduling environment is required that integrates Grid-style HPC clusters and Cloud computing clusters and manages jobs according to their characteristics. In this paper, we propose a Web service framework for high-level job scheduling, Swarm. Swarm is developed for scientific applications that must submit a massive number of high-throughput jobs or workflows to highly distributed computing clusters. Swarm allows users to submit jobs to both Grid HPC and Cloud computing clusters. The Swarm service itself is designed to be extensible, lightweight, and easily installable on a desktop or a small server. As a Web service, derivative services based on Swarm can be straightforwardly integrated with Web portals and science gateways. This paper provides the motivation for this research, the architecture of the Swarm framework, and a performance evaluation of the system prototype.
The QuakeSim Portal and Services: New Approaches to Science Gateway Development Techniques
"... Abstract: Traditional techniques in building science portals and gateways are being challenged by new techniques such as Web 2.0 and Cloud Computing. This paper discusses some of our efforts to evaluate these techniques as we evolve the QuakeSim architecture. We believe that architecturally both tra ..."
Cited by 3 (2 self)
Abstract: Traditional techniques for building science portals and gateways are being challenged by new techniques such as Web 2.0 and Cloud Computing. This paper discusses some of our efforts to evaluate these techniques as we evolve the QuakeSim architecture. We believe that, architecturally, traditional and newer approaches to Gateways are very similar, giving us a path for moving to hybrid approaches. In this paper, we specifically evaluate techniques for building interactive user interfaces that rely on remote services; architectural approaches for managing massive job submissions that can include both parallel and serial jobs; and an architectural prototype for building component-based containers compatible with emerging standards.