S.K.S.: Spatio-temporal thermal-aware job scheduling to minimize energy consumption in virtualized heterogeneous data centers
- Computer Networks (Elsevier), Special Issue on Resource Management in Heterogeneous Data Centers
Cited by 11 (7 self)
Job scheduling in data centers can be considered from a cyber-physical point of view, as it affects the data center’s computing performance (i.e. the cyber aspect) and energy efficiency (the physical aspect). Driven by the growing needs to green contemporary data centers, this paper uses recent technological advances in data center virtualization and proposes cyber-physical, spatio-temporal (i.e. start time and servers assigned), thermal-aware job scheduling algorithms that minimize the energy consumption of the data center under performance constraints (i.e. deadlines). Savings are possible by being able to temporally “spread” the workload, assign it to energy-efficient computing equipment, and further reduce the heat recirculation and therefore the load on the cooling systems. This paper provides three categories of thermal-aware energy-saving scheduling techniques: a) FCFS-Backfill-XInt and FCFS-Backfill-LRH, thermal-aware job placement enhancements to the popular first-come first-serve with back-filling (FCFS-backfill) scheduling policy; b) EDF-LRH, an online earliest-deadline-first scheduling algorithm with thermal-aware placement; and c) an offline genetic algorithm for SCheduling to minimize thermal cross-INTerference (SCINT), which is suited for batch scheduling of backlogs. Simulation results, based on real job logs from the ASU Fulton HPC data center, show that the thermal-aware enhancements to FCFS-backfill achieve up to 25% savings compared to FCFS-backfill with first-fit placement, depending on the intensity of the incoming workload, while SCINT achieves up to 60% savings. The performance of EDF-LRH nears that of the offline SCINT for low loads, and it degrades to the performance of FCFS-backfill for high loads. However, EDF-LRH requires milliseconds
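To make the idea of earliest-deadline-first ordering combined with thermal-aware placement concrete, here is a minimal sketch. It is not the paper's EDF-LRH algorithm: the abstract does not define LRH, so the sketch assumes it means preferring servers with the lowest heat-recirculation contribution. The `Job` fields, the per-server `heat_recirculation` scores, and the single-pass placement (no start times, no backfilling) are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    deadline: float       # hypothetical: absolute deadline
    servers_needed: int   # hypothetical: number of servers requested


def edf_thermal_placement(jobs, heat_recirculation):
    """Place jobs in earliest-deadline-first order on the free servers
    with the lowest (assumed) heat-recirculation score."""
    # Servers sorted from least to most recirculated heat.
    free = sorted(heat_recirculation, key=heat_recirculation.get)
    placement = {}
    for job in sorted(jobs, key=lambda j: j.deadline):
        if job.servers_needed > len(free):
            continue  # not enough free servers; the job waits in this sketch
        placement[job.name] = free[:job.servers_needed]
        free = free[job.servers_needed:]
    return placement


jobs = [Job("a", deadline=10.0, servers_needed=2),
        Job("b", deadline=5.0, servers_needed=1)]
scores = {"s1": 0.3, "s2": 0.1, "s3": 0.2}
print(edf_thermal_placement(jobs, scores))
```

In this toy run the job with the earlier deadline is placed first and therefore receives the server with the lowest score; a real scheduler would also choose start times, which is the temporal half of the spatio-temporal problem.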
Trace-Based Evaluation of Job Runtime and Queue Wait Time Predictions in Grids
Cited by 7 (3 self)
Large-scale distributed computing systems such as grids are serving a growing number of scientists. These environments bring about not only the advantages of an economy of scale, but also the challenges of resource and workload heterogeneity. A consequence of these two forms of heterogeneity is that job runtimes and queue wait times are highly variable, which generally reduces system performance and makes grids difficult to use by the common scientist. Predicting job runtimes and queue wait times has been widely studied for parallel environments. However, there is no detailed investigation on how the proposed prediction methods perform in grids, whose resource structure and workload characteristics are very different from those in parallel systems. In this paper, we assess the performance and benefit of predicting job runtimes and queue wait times in grids based on traces gathered from various research and production grid environments. First, we evaluate the performance of simple yet widely used time series prediction methods and the effect of applying them to different types of job classes (e.g., all jobs submitted by single users or to single sites). Then, we investigate the performance of two kinds of queue wait time prediction methods for grids. Last, we investigate whether prediction-based grid-level scheduling policies can have better performance than policies that do not use predictions.
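As an illustration of what a simple time series predictor applied per job class might look like, the sketch below predicts a job's runtime as the mean of the last k observed runtimes in the same class (for example, the same user). The abstract does not name the specific methods evaluated, so the choice of a last-k mean, the class name `LastKMeanPredictor`, and the parameter `k` are assumptions made for illustration only.

```python
from collections import defaultdict, deque


class LastKMeanPredictor:
    """Predict a runtime as the mean of the last k runtimes seen
    for the same job class (e.g. a user name or a site name)."""

    def __init__(self, k=2):
        # One bounded history per job class; old values fall off the left.
        self.history = defaultdict(lambda: deque(maxlen=k))

    def observe(self, job_class, runtime):
        self.history[job_class].append(runtime)

    def predict(self, job_class):
        past = self.history.get(job_class)
        if not past:
            return None  # no history yet for this class
        return sum(past) / len(past)


predictor = LastKMeanPredictor(k=2)
for runtime in (100.0, 200.0, 400.0):
    predictor.observe("alice", runtime)
print(predictor.predict("alice"))  # mean of the two most recent runtimes
```

Changing the key passed as `job_class` (user, site, or a constant for all jobs) is one way to compare how the grouping of jobs affects prediction quality.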
C-meter: A framework for performance analysis of computing clouds
- in Proceedings of CCGRID’09, 2009
Cited by 6 (2 self)
Cloud computing has emerged as a new technology that provides large amounts of computing and data storage capacity to its users with a promise of increased scalability, high availability, and reduced administration and maintenance costs. As the use of cloud computing environments increases, it becomes crucial to understand the performance of these environments. So, it is of great importance to assess the performance of computing clouds in terms of various metrics, such as the overhead of acquiring and releasing the virtual computing resources, and other virtualization and network communications overheads. To address these issues, we have designed and implemented C-Meter, which is a portable, extensible, and easy-to-use framework for generating and submitting test workloads to computing clouds. In this paper, first we state the requirements for frameworks to assess the performance of computing clouds. Then, we present the architecture of the C-Meter framework and discuss several resource management alternatives. Finally, we present our early experiences with C-Meter in Amazon EC2. We show how C-Meter can be used for assessing the overhead of acquiring and releasing the virtual computing resources, for comparing different configurations, for evaluating different scheduling algorithms and for determining the costs of the experiments.
Secretly Monopolizing the CPU Without Superuser Privileges
Cited by 6 (0 self)
We describe a “cheat” attack, allowing an ordinary process to hijack any desirable percentage of the CPU cycles without requiring superuser/administrator privileges. Moreover, the nature of the attack is such that, at least in some systems, listing the active processes will erroneously show the cheating process as not using any CPU resources: the “missing” cycles would either be attributed to some other process or not be reported at all (if the machine is otherwise idle). Thus, certain malicious operations generally believed to have required overcoming the hardships of obtaining root access and installing a rootkit can actually be launched by non-privileged users in a straightforward manner, thereby making the job of a malicious adversary that much easier. We show that most major general-purpose operating systems are vulnerable to the cheat attack, due to a combination of how they account for CPU usage and how they use this information to prioritize competing processes. Furthermore, recent scheduler changes attempting to better support interactive workloads increase the vulnerability to the attack, and naive steps taken by certain systems to reduce the danger are easily circumvented. We show that the attack can nevertheless be defeated, and we demonstrate this by implementing a patch for Linux that eliminates the problem with negligible overhead.
Rescheduling co-allocation requests based on flexible advance reservations and processor remapping
- In Proceedings of the 9th IEEE/ACM International Conference on Grid Computing (GRID’08), Tsukuba, Japan, 2008. IEEE Computer Society
Cited by 5 (3 self)
Large-scale computing environments, such as TeraGrid, Distributed ASCI Supercomputer (DAS), and Grid’5000, have been using resource co-allocation to execute applications on multiple sites. Their schedulers work with requests that contain imprecise estimations provided by users. This lack of accuracy generates fragments inside the scheduling queues that can be filled by rescheduling both local and multi-site requests. Current resource co-allocation solutions rely on advance reservations to ensure that users can access all the resources at the same time. These co-allocation requests cannot be rescheduled if they are based on rigid advance reservations. In this work, we investigate the impact of rescheduling co-allocation requests based on flexible advance reservations and processor remapping. The metascheduler can modify the start time of each job component and remap the number of processors they use in each site. The experimental results show that local jobs may not fill all the fragments in the scheduling queues and hence rescheduling co-allocation requests reduces response time of both local and multi-site jobs. Moreover, we have observed in some scenarios that processor remapping increases the chances of placing the tasks of multi-site jobs into a single cluster, thus eliminating the inter-cluster network overhead.
Resource Allocation using Virtual Clusters
Cited by 5 (3 self)
We propose a novel approach for sharing cluster resources among competing jobs. The key advantage of our approach over current solutions is that it increases cluster utilization while optimizing a user-centric metric that captures both notions of performance and fairness. We motivate and formalize the corresponding resource allocation problem, determine its complexity, and propose several algorithms to solve it in the case of a static workload that consists of sequential jobs. Via extensive simulation experiments we identify an algorithm that runs quickly, that is always on par with or better than its competitors, and that produces resource allocations that are close to optimal. We find that the extension of our approach to parallel jobs leads to similarly good results. Finally, we explain how to extend our work to dynamic workloads.
Research Directions in Energy-Sustainable Cyber-Physical Systems
Cited by 4 (3 self)
An overview of sustainable computing is provided and different approaches towards design and verification of energy-sustainable computing (i.e. sustainable computing from an energy consumption perspective) are discussed for Cyber-Physical Systems (CPSs), i.e. systems with strong coupling between computing components and non-computing processes in the physical environment. A major issue in this regard is the inter-dependencies of the non-computing processes on the computing components and vice versa, and the verification of the CPSs’ sustainability without real deployment. The trends and dependencies of energy consumption for both computing and non-computing components are conceptualized. Based on this conceptualization, CPS resource management algorithms are categorized according to: (i) computing workload execution and arrival profiles supported, (ii) knowledge of workload profiles during management decision making, (iii) support of power management in the computing components, and (iv) assumptions on non-computing process behavior. These categories are then discussed along with their pros and cons for two representative CPSs: data centers and Body Sensor Networks (BSNs). Model-based engineering is used to verify CPS sustainability before real deployment. Several research directions and open problems are further discussed for design and verification of sustainable CPSs. Key words: cyber-physical systems, sustainability, model-based engineering
Feitelson, Reducing Performance Evaluation Sensitivity and Variability by Input Shaking
, 2007
Cited by 4 (2 self)
Simulations sometimes lead to observed sensitivity to configuration parameters as well as inconsistent performance results. The question is then what is the true effect and what is a coincidental artifact of the evaluation. The shaking methodology answers this by executing multiple simulations under small perturbations to the input workload, and calculating the average performance result; if the effect persists we can be more confident that it is real, whereas if it disappears it was an artifact. We present several examples where the sensitivity that appears in results based on a single evaluation is eliminated or considerably reduced by the shaking methodology. While our examples come from evaluations of scheduling algorithms for supercomputers, we believe the method has wider applicability.
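The shaking idea described in this abstract can be sketched in a few lines: perturb the input workload slightly, rerun the simulation several times, and average the results. The sketch below is only an illustration under stated assumptions. It perturbs job arrival times with uniform noise, and the function names, the `max_shift` and `runs` parameters, and the toy single-server FCFS simulator are all hypothetical; the abstract does not specify how the paper perturbs the workload.

```python
import random
from statistics import mean


def shake(arrival_times, max_shift, rng):
    """Return a perturbed copy of the workload: each arrival time is
    shifted by a small uniform random amount (assumed perturbation)."""
    shifted = (max(0.0, t + rng.uniform(-max_shift, max_shift))
               for t in arrival_times)
    return sorted(shifted)


def shaken_average(simulate, arrival_times, runs=20, max_shift=1.0, seed=0):
    """Run the simulation on several shaken workloads and average the metric."""
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    return mean(simulate(shake(arrival_times, max_shift, rng))
                for _ in range(runs))


def fcfs_mean_wait(arrival_times, service_time=2.0):
    """Toy simulator: mean wait on a single first-come first-serve server."""
    server_free_at, waits = 0.0, []
    for t in arrival_times:
        start = max(t, server_free_at)
        waits.append(start - t)
        server_free_at = start + service_time
    return mean(waits)


workload = [0.0, 1.0, 2.0]
print(fcfs_mean_wait(workload))                  # single evaluation
print(shaken_average(fcfs_mean_wait, workload))  # average over shaken inputs
```

Comparing the single-evaluation result with the shaken average for two configurations is the kind of check the abstract describes: a difference that survives shaking is more likely to be a real effect.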
Using checkpointing to recover from poor multi-site parallel job scheduling decisions
- In The 5th Workshop on Middleware for Grid Computing at the ACM/IFIP/USENIX 8th International Middleware Conference
Cited by 2 (2 self)
Recent research in multi-site parallel job scheduling leverages user-provided estimates of job communication characteristics to effectively partition the job across multiple clusters. Previous research addressed the impact of inaccuracies in these estimates on overall system performance and found that multi-site scheduling techniques benefit from these estimates, even in the presence of considerable inaccuracy. While these results are encouraging, there are many instances where these errors result in poor scheduling decisions that cause network over-subscription. This situation can lead to significantly degraded application runtime performance and turnaround time. In this paper, we explore the use of job checkpointing to selectively stop offending jobs in order to alleviate network congestion and subsequently restart them when (and where) sufficient network resources are available. We then characterize the conditions and the extent to which checkpointing improves overall performance. We demonstrate that checkpointing is beneficial even when the overhead of doing so is costly.
The impact of error in user-provided bandwidth estimates on multi-site parallel job scheduling performance
- In The 19th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS 2007)
, 2007
Cited by 1 (1 self)
Multi-cluster schedulers can dramatically improve average job turn-around time performance by making use of fragmented node resources available throughout the grid. By carefully mapping jobs across potentially many clusters, jobs that would otherwise wait in the queue for local cluster resources can begin execution much earlier, thereby improving system utilization and reducing average queue waiting time. Recent research in this area leverages user-provided estimates of job communication characteristics to effectively partition the job across cluster boundaries. In this paper, we address the impact of inaccuracies in these estimates on overall system performance. Furthermore, we demonstrate that multi-site job scheduling techniques benefit from these estimates, even in the presence of considerable inaccuracy.

