, 2006
The Grid is rapidly emerging as the means for coordinated resource sharing and problem solving in multi-institutional virtual organizations while providing dependable, consistent, pervasive access to global resources. The emergence of computational Grids, and the potential for seamless aggregation of and interaction between distributed services and resources, has led to the start of a new era of computing. The very large number and heterogeneous nature of grid computing resources make resource management a significant challenge. Resource management scenarios often include resource discovery, resource monitoring, resource inventories, resource provisioning, fault isolation, a variety of autonomic capabilities, and service level management activities. Of these, fault tolerance is one of the main research areas. The probability of a fault occurring increases as the number of resources in the grid increases. To date, no system is fully fault tolerant. In this research our main focus is on the development of a fault tolerance system for
Job-Site Level Fault Tolerance for Cluster and Grid environments
- in Proceedings of IEEE International Conference on Cluster Computing (Cluster) 2005
, 2005
In order to adopt high performance clusters and grid computing for mission critical applications, fault tolerance is a necessity. Fault tolerance in distributed systems is commonly achieved with checkpoint-recovery or with job replication on alternative resources in case of a system outage. The first approach depends on the system's MTTR, while the second depends on the availability of alternative sites to run replicas. There is a need to complement these approaches by proactively handling failures at the job-site level, ensuring high system availability with no loss of user-submitted jobs. This paper discusses a novel fault tolerance technique that enables job-site recovery in Beowulf cluster-based grid environments, whereas existing techniques give up on a failed system and seek alternative resources. Our results suggest a sizable aggregate performance improvement from an implementation of our method in Globus-enabled HA-OSCAR. The technique, called "Smart Failover", provides a transparent and graceful recovery mechanism that saves job states in a local job-manager queue and transfers those states to the backup server periodically and on critical system events. Thus, whenever a failover occurs, the backup server is able to restart the jobs from their last saved state.
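The replication idea described in this abstract can be illustrated with a minimal sketch. It is not the HA-OSCAR implementation: the class names, the `sync_every` parameter, and the use of an update counter in place of a timer are all assumptions made for illustration.

```python
import copy


class BackupServer:
    """Hypothetical standby node holding the last replicated job states."""

    def __init__(self):
        self.saved_states = {}

    def receive(self, states):
        # Store a deep copy so later changes on the primary do not leak in.
        self.saved_states = copy.deepcopy(states)

    def take_over(self):
        # On failover, jobs restart from their last replicated state.
        return copy.deepcopy(self.saved_states)


class PrimaryJobManager:
    """Hypothetical primary node that keeps job states in a local queue."""

    def __init__(self, backup, sync_every=3):
        self.backup = backup
        self.sync_every = sync_every  # assumption: count-based "periodic" sync
        self.job_states = {}
        self.updates_since_sync = 0

    def update_job(self, job_id, progress, critical=False):
        self.job_states[job_id] = {"progress": progress}
        self.updates_since_sync += 1
        # Replicate periodically, or immediately on a critical event.
        if critical or self.updates_since_sync >= self.sync_every:
            self.sync()

    def sync(self):
        self.backup.receive(self.job_states)
        self.updates_since_sync = 0


backup = BackupServer()
primary = PrimaryJobManager(backup, sync_every=3)
primary.update_job("job-1", 10)
primary.update_job("job-2", 5)
primary.update_job("job-1", 20)  # third update triggers a sync
primary.update_job("job-2", 50)  # not replicated before the simulated failure
recovered = backup.take_over()
```

In this sketch the last update to `job-2` is lost on failover, which shows why the abstract also transfers state on critical system events rather than relying on the periodic transfer alone.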
A Taxonomy of Desktop Grids and its Mapping to State of the Art Systems
Desktop Grid has emerged as an attractive computing paradigm for high throughput applications. However, building such systems is complicated due to resources' heterogeneity, failures, non-dedication, volatility, and lack of trust, since they (that is, desktop computers) are at the edge of the Internet and owned by different individuals. Therefore, it is important to understand how these distinct characteristics impact on architecture, execution model, resource management, and scheduling. In this article, we investigate architectural elements and then provide a new taxonomy
QoS-based Scheduling of Workflows on Global Grids
, 2007
Grid computing has emerged as a global cyber-infrastructure for the next generation of e-Science applications by integrating large-scale, distributed and heterogeneous resources. Scientific communities are utilizing Grids to share, manage and process large data sets. In order to support complex scientific experiments, distributed resources such as computational devices, data, applications, and scientific instruments need to be orchestrated while managing the application workflow operations within Grid environments. This thesis investigates properties of Grid workflow management systems, and presents a workflow engine and algorithms for mapping scientific workflow applications to Grid resources based on specified QoS (Quality of Service) constraints. To address workflow application scheduling in Grid computing, the thesis has made the following contributions:
• proposed a taxonomy of workflow management systems for Grid computing.
• developed a workflow engine which leverages tuple spaces to provide event-based execution management.
• developed deadline and budget distribution strategies based on the workload and dependency of tasks.
• developed algorithms for scheduling workflows with QoS constraints using genetic algorithms.
• leveraged multi-objective evolutionary algorithms (MOEAs) for workflow execution planning to generate a set of trade-off alternative scheduling solutions.
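As a rough illustration of what workload-based deadline distribution can look like, the sketch below splits an overall deadline across the stages of a purely sequential pipeline in proportion to each stage's workload. This is an assumed simplification, not the thesis's strategy, which also accounts for task dependencies; the function name and the example workloads are hypothetical.

```python
def distribute_deadline(workloads, overall_deadline):
    """Assign each sequential stage a cumulative sub-deadline.

    workloads: dict mapping stage name to estimated workload, in
    execution order (hypothetical units, e.g. millions of instructions).
    """
    total = sum(workloads.values())
    sub_deadlines = {}
    elapsed = 0.0
    for stage, load in workloads.items():
        # Each stage receives a time share proportional to its workload.
        elapsed += overall_deadline * load / total
        sub_deadlines[stage] = elapsed
    return sub_deadlines


pipeline = {"stage_in": 10, "compute": 70, "stage_out": 20}
deadlines = distribute_deadline(pipeline, 200.0)
```

A scheduler could then pick, for each stage, the cheapest resource able to finish before that stage's sub-deadline.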
Who needs a scheduler?
, 2008
This position paper advocates the need for scheduling. Even if the resources at our disposal were to become abundant and cheap, not to say unlimited and free (a prospect that is not guaranteed), we would still need to assign the right task to the right device. We give several simple examples of situations where resource selection and allocation are mandatory. Finally, we present our views on the important algorithmic challenges that need to be addressed in the future.
A Taxonomy of Desktop Grids and its Mapping to State-of-the-Art Systems
Desktop Grid has emerged as an attractive computing paradigm for high throughput applications. In Desktop Grid systems, numerous desktop computers owned by different individuals are employed as computational resources at the edge of the Internet. Accordingly, building such systems is complicated due to resources’ heterogeneity, failures, non-dedication, volatility and lack of trust. Therefore, it is important to comprehend how these distinct characteristics impact on architecture, execution model, resource management, and scheduling. In this article, architectural elements are investigated and then a new taxonomy of Desktop Grids is proposed. The taxonomy
Static Worksharing Strategies for Heterogeneous Computers with Unrecoverable
One has a large workload that is “divisible” (its constituent work’s granularity can be adjusted arbitrarily) and one has access to p remote computers that can assist in computing the workload. How can one best utilize the computers? Two features complicate this question. First, the remote computers may differ from one another in speed. Second, each remote computer is subject to interruptions of known likelihood that kill all work in progress on it. One wishes to orchestrate sharing the workload with the remote computers in a way that maximizes the expected amount of work completed. We deal with three distinct problem instances. The simplest problem ignores communication costs, but considers a heterogeneous set of resources that may differ in speed. The other two problems account for communication costs, first with identical remote computers, and then with computers that may differ in speed. We provide exact expressions for the optimal work expectation for all three problems. For the first two problems we provide explicit, closed-form expressions; for the last (and most general) problem, we provide a recurrence for computing this optimal value.
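To make the objective concrete, here is a small sketch of the expected work completed by a single remote computer that receives one chunk, with communication costs ignored. The risk model is an assumption introduced for illustration: the probability that the computer has been interrupted by time t is taken to be min(1, kappa * t). The function names and the parameter `kappa` are hypothetical, and the abstract's own expressions cover the general multi-computer problems.

```python
def expected_work(chunk, speed, kappa):
    """Expected work completed when one chunk is sent to one computer.

    The chunk counts only if the computer survives until it finishes,
    since an interruption kills all work in progress.
    """
    finish_time = chunk / speed
    survival = max(0.0, 1.0 - kappa * finish_time)  # assumed linear risk
    return chunk * survival


def best_single_chunk(speed, kappa):
    """Chunk size maximizing chunk * (1 - kappa * chunk / speed)."""
    return speed / (2.0 * kappa)


speed, kappa = 2.0, 0.01
best = best_single_chunk(speed, kappa)
best_value = expected_work(best, speed, kappa)
```

Under this assumed model a larger chunk is not always better: past `best`, the growing risk of losing the whole chunk outweighs the extra work it would contribute.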
A Taxonomy of Autonomic Application Management in Grids
In this paper, we propose a taxonomy that characterizes and classifies different components of autonomic application management in Grids. We also survey several representative Grid systems developed by various projects world-wide to demonstrate the comprehensiveness of the taxonomy. The taxonomy not only highlights the similarities and differences of state-of-the-art technologies utilized in autonomic application management from the perspective of Grid computing, but also identifies the areas that require further research initiatives.
Fault Tolerance in Grid Computing: State of the Art and Open Issues
Fault tolerance is an important property for large scale computational grid systems, where geographically distributed nodes co-operate to execute a task. In order to achieve a high level of reliability and availability, the grid infrastructure should be foolproof and fault tolerant. Since the failure of resources can be fatal to job execution, a fault tolerance service is essential to satisfy QoS requirements in grid computing. Commonly used techniques for providing fault tolerance are job checkpointing and replication. Both techniques reduce the amount of work lost due to changing system availability but can introduce significant runtime overhead. The overhead depends largely on the length of the checkpointing interval and the chosen number of replicas, respectively. For complex scientific workflows, where tasks execute in a well-defined order, reliability is another major challenge because of the unreliable nature of grid resources.
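The dependence of checkpointing overhead on the interval length can be sketched with Young's classical first-order approximation, which is not taken from this abstract. It assumes a fixed checkpoint cost and failures with a known mean time between failures (MTBF); the function names and the numbers below are illustrative assumptions.

```python
import math


def overhead_fraction(interval, checkpoint_cost, mtbf):
    """First-order estimate of time lost per unit of useful work.

    checkpoint_cost / interval: time spent writing checkpoints.
    interval / (2 * mtbf): expected rework after a failure, assuming a
    failure lands on average halfway through an interval.
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


def young_interval(checkpoint_cost, mtbf):
    """Interval minimizing the estimate above: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


# Illustrative values in arbitrary time units.
cost, mtbf = 2.0, 100.0
interval = young_interval(cost, mtbf)
overhead = overhead_fraction(interval, cost, mtbf)
```

Checkpointing more often than `interval` wastes time writing checkpoints, while checkpointing less often loses more work per failure.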
Reliability in Grid Computing Systems
In recent years, grid technology has emerged as an important tool for solving compute-intensive problems within the scientific community and in industry. To further the development and adoption of this technology, researchers and practitioners from different disciplines have collaborated to produce standard specifications for implementing large-scale, interoperable grid systems. The focus of this activity has been the Open Grid Forum, but other standards development organizations have also produced specifications that are used in grid systems. To date, these specifications have provided the basis for a growing number of operational grid systems used in scientific and industrial applications. However, if the growth of grid technology is to continue, it will be important that grid systems also provide high reliability. In particular, it will be critical to ensure that grid systems are reliable as they continue to grow in scale, exhibit greater dynamism, and become more heterogeneous in composition. Ensuring grid system reliability in turn requires that the specifications used to build these systems fully support reliable grid services. This study surveys work on grid reliability that has been done in recent years and reviews progress made toward achieving these goals. The survey identifies important issues and problems that researchers are working

