Results 1 - 10 of 50
Distributed Computing in Practice: The Condor Experience
, 2005
"... Since 1984, the Condor project has enabled ordinary users to do extraordinary computing. Today, the project continues to explore the social and technical problems of cooperative computing on scales ranging from the desktop to the world-wide computational Grid. In this paper, we provide the history a ..."
Abstract
-
Cited by 551 (8 self)
- Add to MetaCart
Since 1984, the Condor project has enabled ordinary users to do extraordinary computing. Today, the project continues to explore the social and technical problems of cooperative computing on scales ranging from the desktop to the world-wide computational Grid. In this paper, we provide the history and philosophy of the Condor project and describe how it has interacted with other projects and evolved along with the field of distributed computing. We outline the core components of the Condor system and describe how the technology of computing must correspond to social structures. Throughout, we reflect on the lessons of experience and chart the course travelled by research ideas as they grow into production systems.
Explicit Control in a Batch-Aware Distributed File System
"... We present the design, implementation, and evaluation of the Batch-Aware Distributed File System (BAD-FS), a system designed to orchestrate large, I/O-intensive batch workloads on remote computing clusters distributed across the wide area. BAD-FS consists of two novel components: a storage layer whi ..."
Abstract
-
Cited by 62 (3 self)
- Add to MetaCart
(Show Context)
We present the design, implementation, and evaluation of the Batch-Aware Distributed File System (BAD-FS), a system designed to orchestrate large, I/O-intensive batch workloads on remote computing clusters distributed across the wide area. BAD-FS consists of two novel components: a storage layer that exposes control of traditionally fixed policies such as caching, consistency, and replication; and a scheduler that exploits this control as needed for different users and workloads. By extracting these controls from the storage layer and placing them in an external scheduler, BAD-FS manages both storage and computation in a coordinated way while gracefully dealing with cache consistency, fault-tolerance, and space management issues in an application-specific manner. Using both microbenchmarks and real applications, we demonstrate the performance benefits of explicit control, delivering excellent end-to-end performance across the wide area.
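The key idea is that policies usually fixed inside a file system (caching, consistency, replication) become knobs an external scheduler turns per workload. A minimal sketch of that control split, with invented names rather than the BAD-FS API:

# Hypothetical sketch of the BAD-FS idea: storage policies are decided by
# an external scheduler per workload, not hard-coded in the storage layer.
# All names and policy strings here are invented for illustration.

from dataclasses import dataclass

@dataclass
class VolumePolicy:
    caching: str        # e.g. "cache-on-first-read" for shared batch inputs
    consistency: str    # e.g. "job-scoped": revalidate only between jobs
    replicas: int       # replicate only data that is costly to recreate

def plan_policies(workload):
    """Map each logical volume of a batch workload to a storage policy."""
    policies = {}
    for vol in workload["volumes"]:
        if vol["type"] == "batch-input":      # read by every job: cache aggressively
            policies[vol["name"]] = VolumePolicy("cache-on-first-read", "job-scoped", 1)
        elif vol["type"] == "pipeline-temp":  # intermediate data: no replication,
            policies[vol["name"]] = VolumePolicy("write-back", "none", 1)  # rerun on loss
        else:                                 # final output: push home, keep it safe
            policies[vol["name"]] = VolumePolicy("write-through", "strict", 2)
    return policies

workload = {"volumes": [
    {"name": "genome-db", "type": "batch-input"},
    {"name": "stage1-out", "type": "pipeline-temp"},
    {"name": "results", "type": "output"},
]}
print(plan_policies(workload))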
Chirp: A practical global file system for cluster and grid computing
- Journal of Grid Computing
"... Abstract. Traditional distributed filesystem technologies designed for local and campus area networks do not adapt well to wide area grid computing environments. To address this problem, we have designed the Chirp distributed filesystem, which is designed from the ground up to meet the needs of grid ..."
Abstract
-
Cited by 30 (13 self)
- Add to MetaCart
(Show Context)
Traditional distributed filesystem technologies designed for local and campus area networks do not adapt well to wide area grid computing environments. To address this problem, we have built the Chirp distributed filesystem, designed from the ground up to meet the needs of grid computing. Chirp is easily deployed without special privileges and provides strong and flexible security mechanisms, tunable consistency semantics, and clustering to increase capacity and throughput. We demonstrate that many of these features also provide order-of-magnitude performance increases over wide area networks. We describe three applications in bioinformatics, biometrics, and gamma ray physics that each employ Chirp to attack large-scale data-intensive problems.
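A rough sketch of the unprivileged-deployment idea: a file service any ordinary user can run, exporting one directory and enforcing a per-directory access-control list. The ACL format and checks below are illustrative assumptions, not Chirp's actual protocol:

# Illustrative sketch (not the real Chirp protocol): serve files from a
# user-owned directory, checking a per-directory ".acl" file on each read.

import os

EXPORT_ROOT = os.path.expanduser("~/export")    # no root needed: any directory works

def load_acl(directory):
    """Read a '.acl' file mapping principals to rights, e.g. 'alice rwl'."""
    acl = {}
    try:
        with open(os.path.join(directory, ".acl")) as f:
            for line in f:
                parts = line.split()
                if len(parts) == 2:
                    acl[parts[0]] = set(parts[1])
    except FileNotFoundError:
        pass                                    # no ACL file: deny everyone
    return acl

def read_file(principal, relpath):
    full = os.path.realpath(os.path.join(EXPORT_ROOT, relpath))
    if not full.startswith(os.path.realpath(EXPORT_ROOT)):
        raise PermissionError("path escapes export root")
    if "r" not in load_acl(os.path.dirname(full)).get(principal, set()):
        raise PermissionError("ACL denies read")
    with open(full, "rb") as f:
        return f.read()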
stdchk: A checkpoint storage system for desktop grid computing
- In Proc. ICDCS
, 2008
"... ..."
(Show Context)
CODO: Firewall Traversal by Cooperative On-Demand Opening
- 14th IEEE Symposium on High Performance Distributed Computing (HPDC-14), Research Triangle Park. http://www.cs.wisc.edu/~sschang/papers/CODO-hpdc.pdf
, 2005
"... Firewalls and network address translators (NATs) cause significant connectivity problems together with their benefits. Many ideas to solve these problems have been explored both in academia and in industry. Yet, no single system solves the problem entirely. Considering diverse and even conflicting u ..."
Abstract
-
Cited by 20 (2 self)
- Add to MetaCart
(Show Context)
Firewalls and network address translators (NATs) bring security benefits but also cause significant connectivity problems. Many ideas for solving these problems have been explored in both academia and industry, yet no single system solves the problem entirely. Considering the diverse and even conflicting use cases and requirements of organizations, we propose an integrated approach that provides a suite of mechanisms and allows communicating peers to choose the best available mechanism. As an important step toward this final goal, we categorize previous efforts and briefly analyze each category in terms of supported use cases, security impact, performance, and so forth. We then introduce a new firewall traversal system, called CODO, that solves the connectivity problem more securely than other systems in its category.
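The category CODO belongs to places a cooperating agent at the firewall that opens pinholes only for authorized connections and only while they are needed. A hedged sketch of a client-side library for such an agent; the agent address and message format are assumptions, not CODO's wire protocol:

# Sketch of cooperative on-demand opening (invented message format): before
# accepting a connection, the server-side library asks a firewall agent to
# allow one authorized peer through, and releases the hole when done.

import json, socket

FIREWALL_AGENT = ("fw-agent.example.org", 9001)   # hypothetical agent address

def request_pinhole(listen_port, peer_addr, credential):
    """Ask the agent to allow peer_addr -> listen_port temporarily."""
    msg = {"op": "open", "port": listen_port, "peer": peer_addr,
           "cred": credential}           # agent authenticates/authorizes this
    with socket.create_connection(FIREWALL_AGENT, timeout=5) as s:
        s.sendall(json.dumps(msg).encode())
        reply = json.loads(s.recv(4096))
    if reply.get("status") != "ok":
        raise PermissionError("agent refused pinhole: %s" % reply)
    return reply["lease_id"]             # used later to close the hole

def release_pinhole(lease_id):
    with socket.create_connection(FIREWALL_AGENT, timeout=5) as s:
        s.sendall(json.dumps({"op": "close", "lease": lease_id}).encode())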
PDS: A Virtual Execution Environment for Software Deployment
- In Proceedings of the 1st International Conference on Virtual Execution Environments
, 2005
"... The Progressive Deployment System (PDS) is a virtual execution environment and infrastructure designed specifically for deploying software, or “assets”, on demand while enabling management from a central location. PDS intercepts a select subset of system calls on the target machine to provide a part ..."
Abstract
-
Cited by 15 (0 self)
- Add to MetaCart
The Progressive Deployment System (PDS) is a virtual execution environment and infrastructure designed specifically for deploying software, or “assets”, on demand while enabling management from a central location. PDS intercepts a select subset of system calls on the target machine to provide a partial virtualization at the operating system level. This enables an asset’s install-time environment to be reproduced virtually while otherwise not isolating the asset from peer applications on the target machine. Asset components, or “shards”, are fetched as they are needed (or they may be pre-fetched), enabling the asset to be progressively deployed by overlapping deployment with execution. Cryptographic digests are used to eliminate redundant shards within and among assets, which enables more efficient deployment. A framework is provided for intercepting interfaces above the operating system (e.g., Java class loading), enabling optimizations requiring semantic awareness not present at the OS level. The paper presents the design of PDS, motivates its “porous isolation model” with respect to the challenges of software deployment, and presents measurements of PDS’s execution characteristics.
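The digest-based shard deduplication can be pictured as a content-addressed cache: a shard's name is the hash of its bytes, so identical components within and across assets are fetched and stored once. The server URL and cache path below are hypothetical, not PDS's implementation:

# Sketch of content-addressed shards: fetch on first use (progressive
# deployment), verify against the digest, and share across all assets.

import hashlib, os, urllib.request

CACHE_DIR = "/var/tmp/shard-cache"                  # assumed local cache location
SHARD_SERVER = "https://deploy.example.org/shards"  # hypothetical central server

def fetch_shard(digest):
    """Return shard bytes, downloading only on first use."""
    path = os.path.join(CACHE_DIR, digest)
    if not os.path.exists(path):                    # demand fetch: only when executed
        data = urllib.request.urlopen(f"{SHARD_SERVER}/{digest}").read()
        if hashlib.sha256(data).hexdigest() != digest:
            raise IOError("shard digest mismatch")
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)                           # cached once, shared by all assets
    with open(path, "rb") as f:
        return f.read()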
A Checkpoint Storage System for Desktop Grid Computing
, 2007
"... Abstract — Checkpointing is an indispensable technique to provide fault tolerance for long-running high-throughput applications like those running on desktop grids. This article argues that a checkpoint storage system, optimized to operate in these environments, can offer multiple benefits: reduce t ..."
Abstract
-
Cited by 13 (5 self)
- Add to MetaCart
(Show Context)
Checkpointing is an indispensable technique to provide fault tolerance for long-running high-throughput applications like those running on desktop grids. This article argues that a checkpoint storage system, optimized to operate in these environments, can offer multiple benefits: reduce the load on a traditional file system, offer high performance through specialization, and, finally, optimize data management by taking checkpoint application semantics into account. Such a storage system can present a unifying abstraction to checkpoint operations, while hiding the fact that there are no dedicated resources to store the checkpoint data. We prototype stdchk, a checkpoint storage system that uses scavenged disk space from participating desktops to build a low-cost storage system, offering a traditional file system interface for easy integration with applications. This article presents the stdchk architecture, key performance optimizations, and its support for incremental checkpointing and increased data availability. Our evaluation confirms that the stdchk approach is viable in a desktop grid setting and offers a low-cost storage system with desirable performance characteristics: high write throughput as well as reduced storage space and network effort to save checkpoint images.
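One way to picture the incremental checkpointing and the savings in storage space and network effort: chunk each image into fixed-size blocks and ship only blocks whose hashes were not in the previous image. A toy sketch, where the block size and hashing scheme are assumptions:

# Incremental checkpointing sketch: successive images of a long-running job
# usually differ in few blocks, so unchanged blocks are never resent.

import hashlib

BLOCK = 1 << 20  # assumed 1 MiB blocks

def blocks(data):
    for off in range(0, len(data), BLOCK):
        yield data[off:off + BLOCK]

def incremental_checkpoint(image, prev_hashes, store):
    """Write new blocks to `store`; return this image's block-hash manifest."""
    manifest = []
    for blk in blocks(image):
        h = hashlib.sha1(blk).hexdigest()
        if h not in prev_hashes:      # unchanged blocks cost no storage or network
            store[h] = blk
        manifest.append(h)
    return manifest

store = {}
v1 = incremental_checkpoint(b"A" * BLOCK + b"B" * BLOCK, set(), store)
v2 = incremental_checkpoint(b"A" * BLOCK + b"C" * BLOCK, set(v1), store)
print(len(store))  # 3 blocks stored, not 4: the 'A' block was reused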
Distributed File System Virtualization Techniques Supporting On-Demand Virtual Machine Environments for Grid Computing
- Cluster Computing
, 2006
"... Abstract. This paper presents a data management solution which allows fast Virtual Machine (VM) instantiation and efficient run-time execution to support VMs as execution environments in Grid computing. It is based on novel distributed file system virtualization techniques and is unique in that: (1) ..."
Abstract
-
Cited by 9 (8 self)
- Add to MetaCart
(Show Context)
This paper presents a data management solution which allows fast Virtual Machine (VM) instantiation and efficient run-time execution to support VMs as execution environments in Grid computing. It is based on novel distributed file system virtualization techniques and is unique in that: (1) it provides on-demand cross-domain access to VM state for unmodified VM monitors; (2) it enables private file system channels for VM instantiation by secure tunneling and session-key based authentication; (3) it supports user-level and write-back disk caches, per-application caching policies and middleware-driven consistency models; and (4) it leverages application-specific meta-data associated with files to expedite data transfers. The paper reports on its performance in wide-area setups using VMware-based VMs. Results show that the solution delivers performance over 30% better than native NFS and with warm caches it can bring the application-perceived overheads below 10% compared to a local-disk setup. The solution also allows a VM with 1.6 GB virtual disk and 320 MB virtual memory to be cloned within 160 seconds for the first clone and within 25 seconds for subsequent clones.
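Item (3), the user-level write-back cache, can be sketched as follows; `remote` stands for the wide-area file service, and the flush point is where middleware-driven consistency would kick in. All names are invented for illustration:

# Write-back cache sketch: reads fetch VM-state blocks on demand over the
# wide area; writes stay local until middleware decides to flush.

class WriteBackCache:
    def __init__(self, remote):
        self.remote = remote          # object providing read_block/write_block
        self.clean, self.dirty = {}, {}

    def read(self, blk):
        if blk in self.dirty:
            return self.dirty[blk]
        if blk not in self.clean:     # on-demand: only accessed VM state moves
            self.clean[blk] = self.remote.read_block(blk)
        return self.clean[blk]

    def write(self, blk, data):
        self.dirty[blk] = data        # buffered locally; nothing crosses the WAN yet

    def flush(self):
        for blk, data in self.dirty.items():
            self.remote.write_block(blk, data)   # middleware-driven consistency point
        self.clean.update(self.dirty)
        self.dirty.clear()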
C-MPI: A DHT Implementation for Grid and HPC Environments (Preprint ANL/MCS-P1746-0410)
, 2010
"... We describe a new implementation of a distributed hash table for use as a distributed data service for grid and high performance computing. The distributed data structure can offer an alternative to existing checkpointing, caching, and communication strategies due to its inherent survivability and s ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
(Show Context)
We describe a new implementation of a distributed hash table for use as a distributed data service for grid and high performance computing. The distributed data structure can offer an alternative to existing checkpointing, caching, and communication strategies due to its inherent survivability and scalability. The effective use of such an implementation in a high performance setting faces many challenges, including maintaining good performance, offering wide compatibility with diverse architectures, and handling multiple fault modes. The implementation described here, called Content-MPI (C-MPI), employs a layered software design built on MPI functionality, and offers a scalable data store that is fault tolerant to the extent of the capability of the MPI implementation.
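The heart of a DHT like this is a deterministic key-to-owner mapping that every process can compute locally, with no communication. A sketch using consistent hashing over MPI ranks (plain Python standing in for the MPI layer; not C-MPI's actual code):

# Consistent-hashing sketch: keys hash onto a ring, each rank owns the ring
# segment clockwise of its own point, so any rank can locate any key.

import bisect, hashlib

def h(x):
    return int(hashlib.sha1(x.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nranks):
        # each rank is placed on the ring at the hash of its identifier
        self.points = sorted((h(f"rank-{r}"), r) for r in range(nranks))
        self.keys = [p for p, _ in self.points]

    def owner(self, key):
        """Rank responsible for `key`: first ring point clockwise of hash(key)."""
        i = bisect.bisect(self.keys, h(key)) % len(self.points)
        return self.points[i][1]

ring = Ring(nranks=64)
print(ring.owner("checkpoint/job42/step7"))  # same answer on every rank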
Data-Driven Batch Scheduling
, 2005
"... In this paper, we develop data-driven strategies for batch computing schedulers. Current CPU-centric batch schedulers ignore the data needs within workloads and execute them by linking them transparently and directly to their needed data. When scheduled on remote computational resources, this elegan ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
(Show Context)
In this paper, we develop data-driven strategies for batch computing schedulers. Current CPU-centric batch schedulers ignore the data needs of workloads, executing jobs by linking them transparently and directly to the data they need. When jobs are scheduled on remote computational resources, this elegant solution of direct data access can incur an order-of-magnitude performance penalty for data-intensive workloads. Adding data-awareness to batch schedulers allows careful coordination of data and CPU allocation, thereby reducing the cost of remote execution. We offer new techniques by which batch schedulers can become data-driven. Such systems can use our analytical predictive models to select one of the four data-driven scheduling policies that we have created. Through simulation, we demonstrate the accuracy of our predictive models and show how they can reduce time to completion for some workloads by as much as 80%.
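The flavor of using a predictive model to pick a scheduling policy can be shown with a toy cost model; the policy names and formulas below are illustrative assumptions, not the paper's four policies or its analytical models:

# Toy policy selection: estimate completion time under each candidate
# data-placement policy and pick the minimum.

def predict(policy, jobs, input_gb, wan_mbps, lan_mbps, cpu_hours):
    wan_s = input_gb * 8000 / wan_mbps   # seconds to move the input over the WAN
    lan_s = input_gb * 8000 / lan_mbps
    compute_s = cpu_hours * 3600
    if policy == "remote-io":            # every job re-reads input over the WAN
        return jobs * wan_s + compute_s
    if policy == "stage-once":           # transfer once, share via local storage
        return wan_s + jobs * lan_s + compute_s
    if policy == "local-only":           # run where the data already lives
        return compute_s * 4             # assumed: fewer CPUs at the home site
    raise ValueError(policy)

policies = ["remote-io", "stage-once", "local-only"]
best = min(policies, key=lambda p: predict(p, jobs=100, input_gb=10,
                                           wan_mbps=100, lan_mbps=1000,
                                           cpu_hours=50))
print(best)  # "stage-once" wins for this data-intensive, many-job workload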