Results 1 - 10 of 16
Makeflow: A Portable Abstraction for Data Intensive Computing on Clusters, Clouds, and Grids
"... In recent years, there has been a renewed interest in languages and systems for large scale distributed computing. Unfortunately, most systems available to the end user use a custom description language tightly coupled to a specific runtime implementation, making it difficult to transfer application ..."
Cited by 7 (2 self)
In recent years, there has been a renewed interest in languages and systems for large scale distributed computing. Unfortunately, most systems available to the end user use a custom description language tightly coupled to a specific runtime implementation, making it difficult to transfer applications between systems. To address this problem, we introduce Makeflow, a simple system for expressing and running a data-intensive workflow across multiple execution engines without requiring changes to the application or workflow description. Makeflow allows any user familiar with basic Unix Make syntax to generate a workflow and run it on one of many supported execution systems. Furthermore, in order to assess the performance characteristics of the various execution engines available to users and assist them in selecting one for use, we introduce Workbench, a suite of benchmarks designed for analyzing common workflow patterns. We evaluate Workbench on two physical architectures – the first a storage cluster with local disks and a slower network and the second a high performance computing cluster with a central parallel filesystem and fast network – using a variety of execution engines. We conclude by demonstrating three applications that use Makeflow to execute data intensive applications consisting of thousands of jobs.
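The abstract's claim that a workflow is written in plain Unix Make syntax can be illustrated with a minimal sketch. The file and program names below are invented for illustration and are not from the paper; each rule simply names a job's output files, its input files, and the command that produces the outputs (in real Make syntax the command line is tab-indented):

    # Hypothetical two-job workflow expressed as Make-style rules.
    # Rule form: outputs : inputs, then the command on the next line.
    stats.txt: results.dat analyze.py
        python analyze.py results.dat > stats.txt

    results.dat: input.dat simulate
        ./simulate input.dat > results.dat

As with Make, the dependency graph follows from the rules themselves, which is what lets the same description be dispatched unchanged to any of the supported execution engines.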
Grid Deployment of Legacy Bioinformatics Applications with Transparent Data Access
"... Although grid computing offers great potential for executing large-scale bioinformatics applications, practical deployment is constrained by legacy interfaces. Most widely deployed bioinformatics were designed long before grid computing arose, and thus are created, tested, and validated in the fami ..."
Cited by 5 (0 self)
Although grid computing offers great potential for executing large-scale bioinformatics applications, practical deployment is constrained by legacy interfaces. Most widely deployed bioinformatics applications were designed long before grid computing arose, and thus are created, tested, and validated in the familiar environment of a workstation. Most perform simple local I/O and have no facility for interfacing with a distributed system. Because of these limitations, users of bioinformatics applications are generally constrained to creating large local clustered systems in order to perform data analysis. In order to deploy these applications in wide-area grid systems, users require a transparent mechanism of attaching legacy interfaces to grid I/O systems. We have explored this problem by deploying several bioinformatics databases and programs for protein sequence analysis on the European EGEE grid. Using tools for transparent adaptation, we have connected legacy applications to the logical namespace provided by a replica manager, and compared the performance of remote access versus file staging. For common bioinformatics applications, we find that remote access has performance equal to or better than simple file staging, with the added advantage that users are freed from stating the data needs of applications in advance.
Web & Grid Technologies in Bioinformatics, Computational and Systems Biology: A Review
"... Abstract: The acquisition of biological data, ranging from molecular characterization and simulations (e.g. protein folding dynamics), to systems biology endeavors (e.g. whole organ simulations) all the way up to ecological observations (e.g. as to ascertain climate change’s impact on the biota) is ..."
Cited by 3 (3 self)
The acquisition of biological data, ranging from molecular characterization and simulations (e.g. protein folding dynamics), to systems biology endeavors (e.g. whole organ simulations), all the way up to ecological observations (e.g. to ascertain climate change’s impact on the biota), is growing at unprecedented speed. The use of computational and networking resources is thus unavoidable. As the datasets become bigger and the acquisition technology more refined, the biologist is empowered to ask deeper and more complex questions. These, in turn, drive a runoff effect where large research consortia emerge that span beyond organizations and national boundaries. Thus the need for reliable, robust, certified, curated, accessible, secure and timely data processing and management becomes entrenched within, and crucial to, 21st century biology. Furthermore, the proliferation of biotechnologies and advances in biological sciences has produced a strong drive for new informatics solutions, both at the basic science and technological levels. The previously unknown situation of dealing with, on the one hand, (potentially) exabytes of data, much of which is noisy or carries large experimental errors or theoretical uncertainties, and, on the other hand, large quantities of data that require automated, computationally intense analysis and processing, has produced important innovations in web and grid technology. In this paper we present a trace of these technological changes in Web and Grid technology, including details of emerging infrastructures, standards, languages and tools, as they apply to bioinformatics, computational biology and systems biology. A major focus of this technological review is to collate up-to-date information regarding the design and implementation of
Makeflow: A portable abstraction for cluster, cloud, and grid computing
, 2011
"... In recent years, there has been a renewed interest in languages and systems for large scale distributed computing. Unfortunately, most systems available to the end user use a custom description language tightly coupled to a specific runtime implementation, making it difficult to transfer application ..."
Cited by 2 (1 self)
In recent years, there has been a renewed interest in languages and systems for large scale distributed computing. Unfortunately, most systems available to the end user use a custom description language tightly coupled to a specific runtime implementation, making it difficult to transfer applications between systems. To address this problem, we introduce Makeflow, a simple system for expressing and running a data-intensive workflow across multiple execution engines without requiring changes to the application or workflow description. Makeflow allows any user familiar with basic Unix Make syntax to generate a workflow and run it on one of many supported execution systems. Furthermore, in order to assess the performance characteristics of the various execution engines available to users and to assist them in determining which engine to use, we introduce Workbench, a suite of benchmarks designed to compare the performance of various execution engines. We evaluate Workbench on two physical architectures – a storage cluster and a high performance computing cluster – using a variety of execution engines. We conclude by demonstrating three applications that use Makeflow to execute data intensive applications consisting of thousands of jobs.
A COMPILER TOOLCHAIN FOR DISTRIBUTED DATA INTENSIVE SCIENTIFIC WORKFLOWS
, 2012
"... SCIENTIFIC WORKFLOWS by ..."
(Show Context)
Murphy: An Environment for Advance Identification of Run-time Failures
, 2012
"... Applications do not typically view the kernel as a source of bad input. However, the kernel can behave in unusual (yet permissible) ways for which applications are badly unprepared. We present Murphy, a language-agnostic tool that helps developers discover and isolate run-time failures in their prog ..."
Cited by 1 (1 self)
Applications do not typically view the kernel as a source of bad input. However, the kernel can behave in unusual (yet permissible) ways for which applications are badly unprepared. We present Murphy, a language-agnostic tool that helps developers discover and isolate run-time failures in their programs by simulating difficult-to-reproduce but completely-legitimate interactions between the application and the kernel. Murphy makes it easy to enable or disable sets of kernel interactions, called gremlins, so developers can focus on the failure scenarios that are important to them. Gremlins are implemented using the ptrace interface, intercepting and potentially modifying an application's system call invocation while requiring no invasive changes to the host machine. We show how to use Murphy in a variety of modes to find different classes of errors, present examples of the kernel interactions that are tested, and explain how to apply delta debugging techniques to isolate the code causing the failure. While our primary goal was the development of a tool to assist in new software development, we successfully demonstrate that Murphy also has the capability to find bugs in hardened, widely-deployed software.
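Murphy itself is not shown here, but the ptrace mechanism the abstract describes can be sketched. The minimal C program below (Linux/x86_64, illustrative only) stops a traced child at every system-call boundary, which is exactly the point at which a gremlin could inspect or rewrite the call before resuming the application:

    /* Sketch of ptrace-based system-call interception (not Murphy's code). */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ptrace.h>
    #include <sys/user.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(int argc, char *argv[])
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
            return 1;
        }

        pid_t child = fork();
        if (child == 0) {
            /* Child: request tracing, then run the target program. */
            ptrace(PTRACE_TRACEME, 0, NULL, NULL);
            execvp(argv[1], &argv[1]);
            perror("execvp");
            _exit(127);
        }

        int status;
        waitpid(child, &status, 0);      /* stop caused by the exec */

        while (1) {
            /* Resume until the next system-call entry or exit stop. */
            if (ptrace(PTRACE_SYSCALL, child, NULL, NULL) == -1)
                break;
            waitpid(child, &status, 0);
            if (WIFEXITED(status))
                break;

            /* A gremlin would inspect or modify registers here; this
               sketch only reports the system-call number (orig_rax). */
            struct user_regs_struct regs;
            ptrace(PTRACE_GETREGS, child, NULL, &regs);
            fprintf(stderr, "syscall %lld\n", (long long)regs.orig_rax);
        }
        return 0;
    }

Each call produces two stops (entry and exit), so a tool like Murphy can alter either the arguments going into the kernel or the result coming back, without any modification to the traced program or the host machine.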
Enforcing Murphy’s Law for Advance Identification of Run-time Failures
"... Applications do not typically view the kernel as a source of bad input. However, the kernel can behave in unusual (yet permissible) ways for which applications are badly unprepared. We present Murphy, a language-agnostic tool that helps developers discover and isolate run-time fail-ures in their pro ..."
Cited by 1 (0 self)
Applications do not typically view the kernel as a source of bad input. However, the kernel can behave in unusual (yet permissible) ways for which applications are badly unprepared. We present Murphy, a language-agnostic tool that helps developers discover and isolate run-time failures in their programs by simulating difficult-to-reproduce but completely-legitimate interactions between the application and the kernel. Murphy makes it easy to enable or disable sets of kernel interactions, called gremlins, so developers can focus on the failure scenarios that are important to them. Gremlins are implemented using the ptrace interface, intercepting and potentially modifying an application’s system call invocation while requiring no invasive changes to the host machine. We show how to use Murphy in a variety of modes to find different classes of errors, present examples of the kernel interactions that are tested, and explain how to apply delta debugging techniques to isolate the code causing the failure. While our primary goal was the development of a tool to assist in new software development, we successfully demonstrate that Murphy also has the capability to find bugs in hardened, widely-deployed software.
Fine-Grained Access Control in the Chirp Distributed File System
"... Abstract—Although the distributed filesystem is a widely used technology in local area networks, it has seen less use on the wide area networks that connect clusters, clouds, and grids. One reason for this is access control: existing filesystem technologies require either the client machine to be fu ..."
Although the distributed filesystem is a widely used technology in local area networks, it has seen less use on the wide area networks that connect clusters, clouds, and grids. One reason for this is access control: existing filesystem technologies require either the client machine to be fully trusted, or the client process to hold a high value user credential, neither of which is practical in large scale systems. To address this problem, we have designed a system for fine-grained access control which dramatically reduces the amount of trust required of a batch job accessing a distributed filesystem. We have implemented this system in the context of the Chirp user-level distributed filesystem used in clusters, clouds, and grids, but the concepts can be applied to almost any other storage system. The system is evaluated to show that performance and scalability are similar to other authentication methods. The paper concludes with a discussion of integrating the authentication system into workflow systems. Keywords: distributed; filesystem; authentication; grid; proxy; ticket.
Search Should Be a System Call
"... Conventional operating systems are designed with theassumptionthatapplicationsarerelativelyclose totheirdata. However,asvirtualizationofsystems, networks, and storage becomes more widespread, the typical latency between applications and data has increased significantly. This increase in latency can ..."
Conventional operating systems are designed with the assumption that applications are relatively close to their data. However, as virtualization of systems, networks, and storage becomes more widespread, the typical latency between applications and data has increased significantly. This increase in latency can have a dramatic effect on application performance, particularly on common system-level tools that perform frequent metadata searches. To address this problem, we propose that metadata search be elevated to a first-class system call within the kernel. We present three implementations of the concept: in an operating system kernel, in a ptrace sandbox, and in a user-level filesystem. We demonstrate that there are many opportunities to easily exploit the search capability in standard system-level tools with modest coding effort and no change in user behavior. We evaluate the performance of common applications using the search system call with varying levels of virtualization, showing reductions in system call traffic ranging from 5 to 95 percent.
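The abstract does not give the interface of the proposed call, so the C fragment below is purely hypothetical: the syscall number and argument layout are invented here only to show what invoking "search as a system call" might look like from an application. On a stock kernel the call fails cleanly with ENOSYS.

    /* Hypothetical caller-side sketch; neither the number nor the
       argument layout comes from the paper. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    /* Deliberately out-of-range number so a stock kernel rejects it
       with ENOSYS instead of running some unrelated system call. */
    #define SYS_search_hypothetical 100000

    int main(void)
    {
        char matches[4096];

        /* Imagined semantics: find names under /var/log matching "*.err"
           and write them into the buffer, returning the match count. */
        long n = syscall(SYS_search_hypothetical, "/var/log", "*.err",
                         matches, sizeof(matches));
        if (n < 0)
            perror("search");
        else
            printf("%ld matching paths\n", n);
        return 0;
    }

The paper's three actual implementations (in-kernel, ptrace sandbox, and user-level filesystem) would each back such an entry point differently; the sketch shows only the calling side.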