CiteSeerX

Adapting bioinformatics applications for heterogeneous systems: a case study. Concurrency and Computation: Practice and Experience (2012)

by I. Lanc, P. Bui, D. Thain, S. Emrich

Results 1 - 3 of 3

A COMPILER TOOLCHAIN FOR DISTRIBUTED DATA INTENSIVE SCIENTIFIC WORKFLOWS

by Peter Bui, 2012
"... SCIENTIFIC WORKFLOWS by ..."
Abstract - Cited by 1 (0 self) - Add to MetaCart
SCIENTIFIC WORKFLOWS by
(Show Context)

Citation Context

...toolchain that enables users to package complex executables for use in distributed workflows. It is currently being used by collaborators to simplify and package a variety of bioinformatics workflows [67, 114] which incorporate many executables and libraries that need to work across multiple distributed systems. [truncated; the remainder of this snippet is figure residue from an execution-time (seconds) benchmark comparing Convert, Starch, and Starch_Keep]
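The context above describes packaging a complex executable together with the libraries it needs so it can run across distributed systems. The toolchain itself is not reproduced in this snippet; the sketch below illustrates only the underlying idea, using ldd to discover shared-library dependencies and bundling them into an archive. All names are assumptions, and the sketch ignores dlopen'd plugins, data files, and the dynamic loader itself.

    import os
    import re
    import subprocess
    import tarfile

    def package_executable(binary, archive="app.tar.gz"):
        """Bundle a binary with the shared libraries that ldd reports,
        so it can run on a worker node lacking those libraries.
        A sketch only, not the toolchain the citation describes."""
        ldd = subprocess.run(["ldd", binary], capture_output=True,
                             text=True, check=True)
        # ldd lines look like: "libm.so.6 => /lib/.../libm.so.6 (0x...)"
        libs = re.findall(r"=> (/\S+)", ldd.stdout)
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(binary, arcname=os.path.basename(binary))
            for lib in libs:
                tar.add(lib, arcname="lib/" + os.path.basename(lib))
        return archive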

Notre Dame, IN

by Andrew Thrasher, Douglas Thain, Scott Emrich, Zachary Musgrave
"... Abstract—Next generation sequencing technologies have enabled various entities, ranging from large sequencing centers to individual laboratories, to sequence organisms of choice and analyze them on demand. Sequencing and analysis, however, is only part of the equation: to learn about a certain organ ..."
Abstract - Add to MetaCart
Next generation sequencing technologies have enabled various entities, ranging from large sequencing centers to individual laboratories, to sequence organisms of choice and analyze them on demand. Sequencing and analysis, however, are only part of the equation: to learn about a certain organism, scientists need to annotate it. Each of these problems is highly parallel at a basic level of computation; however, only a few applications support even a single parallelization framework such as MPI. Ideally, because of the overall increasing demand for computational analysis and the inherent parallelism available in these problems, applications should utilize a generic parallel framework to take advantage of a large variety of computing systems; this would enable labs of various sizes to harness the computing power available to them without forcing them to invest in a particular type of batch system. Here we describe modifications made to one particular tool, MAKER, a genome annotation tool provided as both a serial application and an MPI application. Our modifications enable it to run without MPI and to utilize a wide variety of distributed computing platforms. Furthermore, our proposed parallel framework allows for easy explicit data transfer, which helps overcome a major limitation of bioinformatics tools that generally rely on a shared filesystem. The distributed computing framework we chose can be used, even during early stages of development, to run bioinformatics tools on clusters, grids, and clouds.

Keywords: distributed computing; bioinformatics
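The abstract emphasizes explicit data transfer: each task names the files it reads and writes so the framework can ship them to workers without a shared filesystem. The abstract does not name the framework here; a minimal sketch assuming the Work Queue Python API from CCTools (developed by the same group), with a hypothetical annotation command and hypothetical file names:

    # A minimal sketch of explicit data transfer, assuming the Work Queue
    # Python API (CCTools); the command and file names are hypothetical,
    # not the paper's actual implementation.
    from work_queue import WorkQueue, Task

    q = WorkQueue(port=9123)  # workers connect to the master on this port

    for contig in ["contig1.fasta", "contig2.fasta"]:
        out = contig.replace(".fasta", ".out")
        t = Task("annotate.sh %s > %s" % (contig, out))
        # Each file a task reads or writes is declared explicitly, so the
        # master ships inputs to workers and fetches outputs back without
        # any shared filesystem.
        t.specify_input_file("annotate.sh")
        t.specify_input_file(contig)
        t.specify_output_file(out)
        q.submit(t)

    while not q.empty():
        t = q.wait(5)  # returns a completed task, or None on timeout
        if t:
            print("task done, exit status", t.return_status)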

ACCELERATING COMPARATIVE GENOMICS WORKFLOWS IN A DISTRIBUTED ENVIRONMENT WITH OPTIMIZED DATA PARTITIONING AND WORKFLOW FUSION

by Olivia Choudhury, Nicholas L. Hazekamp, Douglas Thain, Scott J. Emrich
"... Abstract. The advent of next generation sequencing technology has generated massive amounts of biological data at unprecen-dented rates. Comparative genomics applications often require compute-intensive tools for subsequent analysis of high throughput data. Although cloud computing infrastructure pl ..."
Abstract - Add to MetaCart
The advent of next generation sequencing technology has generated massive amounts of biological data at unprecedented rates. Comparative genomics applications often require compute-intensive tools for subsequent analysis of high-throughput data. Although cloud computing infrastructure plays an important role in this respect, the pressure from such computationally expensive tasks can be further alleviated using efficient data partitioning and workflow fusion. Here, we implement a workflow-based model for parallelizing the data-intensive tasks of genome alignment and variant calling with BWA and GATK’s HaplotypeCaller. We explore three approaches to partitioning data (granularity-based, individual-based, and alignment-based) and how each affects the run time. We observe granularity-based partitioning for BWA and alignment-based partitioning for HaplotypeCaller to be the optimal choices for the pipeline. We further discuss the methods and impact of workflow fusion on performance by considering different levels of fusion and how they affect our results. We identify the various open problems encountered, such as understanding the extent of parallelism, using heterogeneous environments without a shared file system, and determining the granularity of inputs, and provide insights into addressing them. Finally, we report significant performance improvements, from 12 days to under 2 hours, while running the BWA-GATK pipeline using partitioning and fusion.

Keywords: genome alignment; variant calling; workflow fusion; data partitioning; performance
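Granularity-based partitioning, as the abstract applies it to BWA, splits the input reads into fixed-size chunks so each chunk can be aligned independently; the chunk size sets the trade-off between per-task overhead and available parallelism. A minimal sketch of such a splitter (the chunk size and file naming are assumptions, not the paper's implementation):

    import itertools

    def split_fastq(path, reads_per_chunk=100_000):
        """Split a FASTQ file into chunks of reads_per_chunk reads.
        Each FASTQ record is exactly 4 lines; returns the chunk file names."""
        chunks = []
        with open(path) as fq:
            for i in itertools.count():
                # Take the next reads_per_chunk records (4 lines each).
                block = list(itertools.islice(fq, 4 * reads_per_chunk))
                if not block:
                    break
                chunk_name = "%s.chunk%04d" % (path, i)
                with open(chunk_name, "w") as out:
                    out.writelines(block)
                chunks.append(chunk_name)
        return chunks

    # Each chunk can then be aligned independently, e.g.
    #   bwa mem ref.fasta reads.fastq.chunk0000 > chunk0000.sam
    # and the per-chunk outputs merged (e.g. with samtools merge)
    # before variant calling.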

Citation Context

...rably reduce run time. As the underlying algorithms of both tools support coarse-grained parallelism, the trade-off between accuracy and speedup can be mitigated by creating a parallelizable workflow [14]. Such a workflow can harness resources from distributed systems by dividing the workload into independent tasks and executing the tasks in parallel. In this regard, one of the key factors responsible f...
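The workflow the context describes divides the workload into independent tasks and executes them in parallel. On a real distributed system that dispatch would go through a workflow engine; the sketch below shows the same task structure on a single node, with an illustrative bwa invocation standing in for any per-chunk pipeline stage:

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    def align_chunk(chunk):
        """One independent task: align a read chunk against the reference.
        The bwa command is illustrative; any per-chunk stage fits here."""
        out = chunk.replace(".fastq", ".sam")
        with open(out, "w") as sam:
            subprocess.run(["bwa", "mem", "ref.fasta", chunk],
                           stdout=sam, check=True)
        return out

    # Chunks produced by the partitioning step; names are hypothetical.
    chunks = ["reads.chunk%04d.fastq" % i for i in range(8)]

    # The tasks share no state, so they can run concurrently; threads
    # suffice here because the work happens in external processes.
    with ThreadPoolExecutor(max_workers=4) as pool:
        sam_files = list(pool.map(align_chunk, chunks))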
