Results 1 - 4 of 4
Desprez, Self-healing of workflow activity incidents on distributed computing infrastructures, Future Generation Computer Systems 29 (8), 2013
"... Abstract Distributed computing infrastructures are commonly used through scientific gateways, but operating these gateways requires important human intervention to handle operational incidents. This paper presents a self-healing process that quantifies incident degrees of workflow activities from m ..."
Abstract
-
Cited by 3 (2 self)
Abstract: Distributed computing infrastructures are commonly used through scientific gateways, but operating these gateways requires significant human intervention to handle operational incidents. This paper presents a self-healing process that quantifies incident degrees of workflow activities from metrics measuring the long-tail effect, application efficiency, data transfer issues, and site-specific problems. These metrics are simple enough to be computed online and make few assumptions about application or resource characteristics. Based on their degree, incidents are classified into levels and associated with sets of healing actions, which are selected using association rules that model correlations between incident levels. We specifically study the long-tail effect and propose a new algorithm to control task replication. The healing process is parameterized on real application traces acquired in production on the European Grid Infrastructure. Experimental results obtained in the Virtual Imaging Platform show that the proposed method speeds up execution by up to a factor of 4, consumes up to 26% less resource time than a control execution, and properly detects unrecoverable errors.
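The abstract does not spell out the metrics or the replication algorithm, so the sketch below only illustrates the general idea of an online long-tail metric and a replication trigger that need no prior model of task or resource behaviour. The function names, the threshold, and the use of the median completed-task duration are assumptions for illustration, not the paper's method.

# Hedged sketch (not the paper's algorithm): an online long-tail metric and a
# replication trigger that require no prior knowledge of task or resource
# characteristics.
from statistics import median

def long_tail_degree(completed_durations, running_elapsed, slack=2.0):
    """Fraction of running tasks whose elapsed time already exceeds
    `slack` times the median duration of completed tasks (0 to 1)."""
    completed_durations = list(completed_durations)
    running_elapsed = list(running_elapsed)
    if not completed_durations or not running_elapsed:
        return 0.0
    reference = slack * median(completed_durations)
    late = sum(1 for t in running_elapsed if t > reference)
    return late / len(running_elapsed)

def replication_candidates(running, completed_durations, slack=2.0):
    """Running tasks that look like long-tail outliers and could be
    replicated on another resource (first replica to finish wins)."""
    if not completed_durations:
        return []
    reference = slack * median(completed_durations)
    return [task for task, elapsed in running.items() if elapsed > reference]

# Example: t6 has run far longer than the median completed task, so it is
# flagged as a long-tail incident and a replication candidate.
done = [100, 110, 95, 120]          # durations of completed tasks (seconds)
running = {"t5": 130, "t6": 400}    # elapsed times of running tasks (seconds)
print(long_tail_degree(done, running.values()))    # 0.5
print(replication_candidates(running, done))       # ['t6']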
"... A science-gateway workload archive application to the self-healing of workflow incidents ..."
Abstract
- Add to MetaCart
(Show Context)
A science-gateway workload archive application to the self-healing of workflow incidents
Workflow fairness control on online and non-clairvoyant distributed computing platforms, in: 19th International Conference Euro-Par 2013, Aachen, Germany, 2013 (author manuscript)
"... Abstract. Fairly allocating distributed computing resources among workflow executions is critical to multi-user platforms. However, this problem remains mostly studied in clairvoyant and offline conditions, where task durations on resources are known, or the workload and available resources do not v ..."
Abstract
- Add to MetaCart
(Show Context)
Abstract: Fairly allocating distributed computing resources among workflow executions is critical to multi-user platforms. However, this problem has mostly been studied under clairvoyant and offline conditions, where task durations on resources are known, or where the workload and available resources do not vary over time. We consider a non-clairvoyant, online fairness problem in which the platform workload, task costs, and resource characteristics are unknown and non-stationary. We propose a fairness control loop that assigns task priorities based on the fraction of pending work in the workflows. Workflow characteristics and performance on the target resources are estimated progressively, as information becomes available during execution. Our method is implemented and evaluated on 4 different applications executed in production conditions on the European Grid Infrastructure. Results show that our technique reduces slowdown variability by a factor of 3 to 7 compared to first-come-first-served scheduling.
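As a rough illustration of the priority idea (not the paper's actual controller), the following sketch gives higher priority to workflows with a larger estimated fraction of pending work, re-evaluated as tasks complete. The Workflow fields, the priority scale, and the rounding are assumptions.

# Hedged sketch (not the paper's controller): prioritise workflows by their
# estimated fraction of pending work.
from dataclasses import dataclass

@dataclass
class Workflow:
    name: str
    total_tasks: int       # tasks known so far (may grow during execution)
    completed_tasks: int

def fraction_pending(wf):
    """Estimated fraction of the workflow's work still to be done."""
    if wf.total_tasks == 0:
        return 0.0
    return 1.0 - wf.completed_tasks / wf.total_tasks

def assign_priorities(workflows, max_priority=10):
    """Give higher scheduler priority to workflows that are further behind,
    so that concurrent executions progress at comparable rates."""
    return {wf.name: round(max_priority * fraction_pending(wf))
            for wf in workflows}

# Example: wfB has most of its work pending, so it gets the higher priority.
wfs = [Workflow("wfA", 100, 90), Workflow("wfB", 100, 20)]
print(assign_priorities(wfs))   # {'wfA': 1, 'wfB': 8}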
Online Task Resource Consumption Prediction for Scientific Workflows
"... Estimates of task runtime, disk space usage, and memory consumption, are commonly used by scheduling and resource provisioning algorithms to support efficient and reliable workflow executions. Such algorithms often assume that accurate estimates are available, but such estimates are difficult to gen ..."
Abstract
- Add to MetaCart
(Show Context)
Abstract: Estimates of task runtime, disk space usage, and memory consumption are commonly used by scheduling and resource provisioning algorithms to support efficient and reliable workflow executions. Such algorithms often assume that accurate estimates are available, but these estimates are difficult to generate in practice. In this work, we first profile five real scientific workflows, collecting fine-grained information such as process I/O, runtime, memory usage, and CPU utilization. We then propose a method to automatically characterize workflow task requirements based on these profiles. Our method estimates task runtime, disk space, and peak memory consumption based on the size of the tasks' input data. It looks for correlations between the parameters of a dataset, and if no correlation is found, the dataset is divided into smaller subsets using a clustering technique. Task estimates are generated based on the ratio of parameter to input data size if they are correlated, or based on the probability distribution function of the parameter otherwise. We then propose an online estimation process based on the MAPE-K loop, in which task executions are monitored and estimates are updated as more information becomes available. Experimental results show that our online estimation process yields much more accurate predictions than an offline approach, where all task requirements are estimated prior to workflow execution.
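A minimal sketch of the correlation-versus-distribution idea described above, assuming runtime is the parameter being estimated. The correlation threshold, the per-byte ratio, and the fallback to the sample mean (instead of a fitted probability distribution or the clustering step) are simplifications, not the authors' implementation; statistics.correlation requires Python 3.10+.

# Hedged sketch (not the authors' implementation): estimate a task requirement
# (e.g. runtime) from input size when the two are correlated, otherwise fall
# back to the empirical mean of past observations.
from statistics import correlation, mean

def estimate(input_sizes, observed_values, new_input_size, corr_threshold=0.8):
    """Predict a requirement for a task with `new_input_size` bytes of input,
    using (size, value) observations collected from earlier executions."""
    if len(input_sizes) < 2:
        return mean(observed_values) if observed_values else None
    r = correlation(input_sizes, observed_values)   # Pearson correlation
    if abs(r) >= corr_threshold:
        # Correlated: estimate via the average value-per-byte ratio.
        ratio = mean(v / s for v, s in zip(observed_values, input_sizes))
        return ratio * new_input_size
    # Uncorrelated: fall back to the mean of the observed values
    # (the paper uses the parameter's probability distribution instead).
    return mean(observed_values)

# Example: runtimes grow roughly linearly with input size, so the estimate
# for a 3 MB input comes out at about 30 seconds.
sizes = [1e6, 2e6, 4e6]             # input sizes in bytes
runtimes = [10.0, 19.5, 41.0]       # observed runtimes in seconds
print(estimate(sizes, runtimes, 3e6))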