Results 1 - 10 of 28
Twister: A runtime for iterative MapReduce
- In The First International Workshop on MapReduce and its Applications, 2010
"... MapReduce programming model has simplified the implementation of many data parallel applications. The simplicity of the programming model and the quality of services provided by many implementations of MapReduce attract a lot of enthusiasm among distributed computing communities. From the years of e ..."
Abstract
-
Cited by 159 (13 self)
- Add to MetaCart
(Show Context)
The MapReduce programming model has simplified the implementation of many data-parallel applications. The simplicity of the programming model and the quality of services provided by many implementations of MapReduce have attracted considerable enthusiasm among distributed computing communities. From years of experience in applying MapReduce to various scientific applications, we have identified a set of extensions to the programming model and improvements to its architecture that expand the applicability of MapReduce to more classes of applications. In this paper, we present the programming model and the architecture of Twister, an enhanced MapReduce runtime that supports iterative MapReduce computations efficiently. We also show performance comparisons of Twister with other similar runtimes, such as Hadoop and DryadLINQ, for large-scale data-parallel applications.
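To make the iterative pattern concrete, below is a minimal Python sketch of the kind of driver loop such a runtime supports. It is not Twister's API; map_fn, reduce_fn, and the centroid example are illustrative stand-ins. The key point of the pattern is that static input partitions are loaded once and reused across iterations, instead of being re-read by a fresh MapReduce job each time.

```python
# A sketch of iterative MapReduce: static data stays resident across
# iterations, and only the small dynamic parameter (here, a centroid)
# moves. This illustrates the pattern, not Twister's actual interface.
from functools import reduce
from multiprocessing import Pool

def map_fn(args):
    points, centroid = args              # static partition + dynamic parameter
    near = [x for x in points if abs(x - centroid) < 10.0]
    return (sum(near), len(near))        # partial stats for the new centroid

def reduce_fn(a, b):
    return (a[0] + b[0], a[1] + b[1])    # merge partial (sum, count) pairs

def run(partitions, centroid, max_iters=20, tol=1e-6):
    with Pool() as pool:                 # worker processes persist across iterations
        for _ in range(max_iters):
            stats = pool.map(map_fn, [(p, centroid) for p in partitions])
            total, count = reduce(reduce_fn, stats, (0.0, 0))
            new_centroid = total / count if count else centroid
            if abs(new_centroid - centroid) < tol:
                break                    # converged: stop iterating
            centroid = new_centroid
    return centroid

if __name__ == "__main__":
    data = [[1.0, 2.0, 2.5], [3.0, 3.5, 50.0]]   # partitioned once, reused
    print(run(data, centroid=2.0))
```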
High Performance Parallel Computing with Cloud and Cloud Technologies
"... We present our experiences in applying, developing, and evaluating cloud and cloud technologies. First, we present our experience in applying Hadoop and DryadLINQ to a series of data/compute intensive applications and then compare them with a novel MapReduce runtime developed by us, named CGL-MapRed ..."
Abstract
-
Cited by 49 (14 self)
- Add to MetaCart
(Show Context)
We present our experiences in applying, developing, and evaluating cloud and cloud technologies. First, we present our experience in applying Hadoop and DryadLINQ to a series of data- and compute-intensive applications, and then compare their performance with that of a novel MapReduce runtime we developed, named CGL-MapReduce, and with MPI. Preliminary applications are developed for particle physics, bioinformatics, clustering, and matrix multiplication. We identify the basic execution units of the MapReduce programming model and categorize the runtimes according to their characteristics. MPI versions of the applications are used where the contrast in performance needs to be highlighted. We discuss the applications' structure, their mapping to parallel architectures of different types, and the performance of these applications. Next, we present a performance analysis of MPI parallel applications on virtualized resources.
Cloud Technologies for Bioinformatics Applications
- IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2010
"... Executing large number of independent jobs or jobs comprising of large number of tasks that perform minimal intertask communication is a common requirement in many domains. Various technologies ranging from classic job schedulers to latest cloud technologies such as MapReduce can be used to execute ..."
Abstract
-
Cited by 46 (12 self)
- Add to MetaCart
Executing a large number of independent jobs, or jobs comprising many tasks that perform minimal intertask communication, is a common requirement in many domains. Various technologies, ranging from classic job schedulers to the latest cloud technologies such as MapReduce, can be used to execute these “many tasks” in parallel. In this paper, we present our experience in applying two cloud technologies, Apache Hadoop and Microsoft DryadLINQ, to two bioinformatics applications with the above characteristics. The applications are a pairwise Alu sequence alignment application and an EST (Expressed Sequence Tag) sequence assembly program. First, we compare the performance of these cloud technologies on the above applications and also compare them with a traditional MPI implementation in one application. Next, we analyze the effect of inhomogeneous data on the scheduling mechanisms of the cloud technologies. Finally, we present a comparison of the performance of the cloud technologies on virtual and non-virtual hardware platforms.
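As a concrete illustration of the “many-tasks” pattern, the sketch below fans a set of independent, communication-free tasks over a process pool. run_alignment is a hypothetical stand-in for one pairwise alignment task; nothing here is specific to Hadoop or DryadLINQ.

```python
# A minimal sketch of many-task execution: a large set of independent,
# CPU-bound jobs with no intertask communication, run in parallel.
from concurrent.futures import ProcessPoolExecutor, as_completed

def run_alignment(pair_id):
    # hypothetical stand-in for one independent alignment task
    return pair_id, sum(i * i for i in range(10_000))

def run_many_tasks(task_ids, workers=8):
    results = {}
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_alignment, t) for t in task_ids]
        for fut in as_completed(futures):     # gather in completion order
            task_id, score = fut.result()
            results[task_id] = score
    return results

if __name__ == "__main__":
    print(len(run_many_tasks(range(100))))
```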
Load Balancing for MapReduce-based Entity Resolution
"... Abstract — The effectiveness and scalability of MapReducebased implementations of complex data-intensive tasks depend on an even redistribution of data between map and reduce tasks. In the presence of skewed data, sophisticated redistribution approaches thus become necessary to achieve load balancin ..."
Abstract
-
Cited by 15 (5 self)
- Add to MetaCart
(Show Context)
The effectiveness and scalability of MapReduce-based implementations of complex data-intensive tasks depend on an even redistribution of data between map and reduce tasks. In the presence of skewed data, sophisticated redistribution approaches become necessary to achieve load balancing among all reduce tasks to be executed in parallel. For the complex problem of entity resolution, we propose and evaluate two approaches for such skew handling and load balancing. The approaches support blocking techniques to reduce the search space of entity resolution, utilize a preprocessing MapReduce job to analyze the data distribution, and distribute the entities of large blocks among multiple reduce tasks. The evaluation on a real cloud infrastructure shows the value and effectiveness of the proposed load balancing approaches.
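The sketch below illustrates the skew-handling idea in plain Python: a preprocessing pass counts entities per blocking key, and oversized blocks are split into chunks that a greedy policy spreads over the least-loaded reduce tasks. The size threshold and the greedy assignment are illustrative assumptions, not the paper's exact algorithms.

```python
# Skew handling for blocked entity resolution: count block sizes in a
# preprocessing pass, then split large blocks across several reducers.
from collections import Counter

def analyze_blocks(entities, blocking_key):
    # preprocessing "job": data distribution per blocking key
    return Counter(blocking_key(e) for e in entities)

def assign_reducers(block_sizes, num_reducers, max_per_task):
    plan, loads = {}, [0] * num_reducers
    for key, size in sorted(block_sizes.items(), key=lambda kv: -kv[1]):
        chunks = -(-size // max_per_task)          # ceil division
        plan[key] = []
        for _ in range(chunks):
            r = min(range(num_reducers), key=loads.__getitem__)
            plan[key].append(r)                    # this chunk goes to reducer r
            loads[r] += min(size, max_per_task)
            size -= max_per_task
    return plan

if __name__ == "__main__":
    entities = [{"surname": "smith"}] * 9 + [{"surname": "lee"}] * 2
    sizes = analyze_blocks(entities, blocking_key=lambda e: e["surname"])
    print(assign_reducers(sizes, num_reducers=3, max_per_task=4))
```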
Coordinating Computation and I/O in Massively Parallel Sequence Search
"... With the explosive growth of genomic information, the searching of sequence databases has emerged as one of the most computation- and data-intensive scientific applications. Our previous studies suggested that parallel genomic sequence-search possesses highly irregular computation and I/O patterns. ..."
Abstract
-
Cited by 7 (2 self)
- Add to MetaCart
With the explosive growth of genomic information, the searching of sequence databases has emerged as one of the most computation- and data-intensive scientific applications. Our previous studies suggested that parallel genomic sequence search possesses highly irregular computation and I/O patterns. Effectively addressing these run-time irregularities is thus the key to designing scalable sequence-search tools on massively parallel computers. While the computation scheduling for irregular scientific applications and the optimization of noncontiguous file accesses have been well studied independently, little attention has been paid to the interplay between the two. In this paper, we systematically investigate the computation and I/O scheduling for data-intensive, irregular scientific applications within the context of genomic sequence search. Our study reveals that the lack of coordination between computation scheduling and I/O optimization could result in severe performance issues. We then propose an integrated scheduling approach that effectively improves sequence-search throughput by gracefully coordinating the dynamic load balancing of computation and high-performance noncontiguous I/O.
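One way to picture the coordination problem is the sketch below: each worker drains a queue of search tasks ordered by database file offset, so its reads stay nearly sequential, and an idle worker steals from the tail of the most loaded queue so the victim's remaining reads remain in order. The task representation and stealing policy are illustrative assumptions, not the paper's integrated scheduler.

```python
# Coordinating load balancing with I/O order: per-worker min-heaps keyed
# by file offset keep each worker's reads close to sequential.
import heapq

def make_queue(tasks):
    # tasks: list of (offset, query) tuples; popping in heap order means
    # scanning the database file front to back instead of seeking randomly
    heap = list(tasks)
    heapq.heapify(heap)
    return heap

def next_task(queues, worker):
    if queues[worker]:
        return heapq.heappop(queues[worker])
    # idle: steal from the most loaded worker, taking a leaf of its heap
    # (a high-offset task) so the victim's remaining reads stay in order
    victim = max(queues, key=lambda w: len(queues[w]))
    if queues[victim]:
        return queues[victim].pop()
    return None

if __name__ == "__main__":
    queues = {
        "w0": make_queue([(4096, "q1"), (0, "q0"), (8192, "q2")]),
        "w1": make_queue([]),
    }
    print(next_task(queues, "w0"))   # (0, 'q0'): lowest offset first
    print(next_task(queues, "w1"))   # stolen from the tail of w0's queue
```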
Challenges and Approaches for Distributed Workflow-Driven Analysis of Large-Scale Biological Data [Vision Paper]
"... Next-generation DNA sequencing machines are generating a very large amount of sequence data with applications in many scientific challenges and placing unprecedented de-mands on traditional single-processor bioinformatics algo-rithms. Middleware and technologies for scientific work-flows and data-in ..."
Abstract
-
Cited by 6 (3 self)
- Add to MetaCart
(Show Context)
Next-generation DNA sequencing machines are generating a very large amount of sequence data with applications in many scientific challenges and placing unprecedented demands on traditional single-processor bioinformatics algorithms. Middleware and technologies for scientific workflows and data-intensive computing promise new capabilities to enable rapid analysis of next-generation sequence data. Based on this motivation and our previous experiences in bioinformatics and distributed scientific workflows, we are creating a Kepler Scientific Workflow System module, called “bioKepler”, that facilitates the development of Kepler workflows for integrated execution of bioinformatics applications in distributed environments. This vision paper discusses the challenges related to next-generation sequencing data, explains the approaches taken in bioKepler to help with analysis of such data, and presents preliminary results demonstrating these approaches.
Optimizing Load Balancing and Data-Locality with Data-aware Scheduling
"... Abstract—Load balancing techniques (e.g. work stealing) are important to obtain the best performance for distributed task scheduling systems that have multiple schedulers making scheduling decisions. In work stealing, tasks are randomly migrated from heavy-loaded schedulers to idle ones. However, fo ..."
Abstract
-
Cited by 5 (3 self)
- Add to MetaCart
(Show Context)
Load balancing techniques (e.g., work stealing) are important for obtaining the best performance in distributed task scheduling systems that have multiple schedulers making scheduling decisions. In work stealing, tasks are randomly migrated from heavily loaded schedulers to idle ones. However, for data-intensive applications where tasks are dependent and task execution involves processing a large amount of data, migrating tasks blindly yields poor data-locality and incurs significant data-transferring overhead. This work improves work stealing by using both dedicated and shared queues. Tasks are organized in queues based on task data size and location. We implement our technique in MATRIX, a distributed task scheduler for many-task computing. We leverage a distributed key-value store to organize and scale the task metadata, task dependencies, and data-locality information. We evaluate the improved work stealing technique with both applications and micro-benchmarks structured as directed acyclic graphs. Results show that the proposed data-aware work stealing technique performs well.
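A minimal sketch of the dedicated/shared queue idea follows: tasks whose input data is large and already local go to a dedicated queue that stealing never touches, while cheap-to-move tasks go to a shared queue that idle schedulers may steal from. The 64 MB threshold and the task fields are assumptions for illustration, not MATRIX's actual policy.

```python
# Data-aware work stealing with dedicated (local-only) and shared
# (stealable) queues, organized by task data size and location.
from collections import deque

LARGE_DATA_BYTES = 64 * 1024 * 1024   # assumed threshold: 64 MB

class Scheduler:
    def __init__(self, node):
        self.node = node
        self.dedicated = deque()   # data-local tasks: never stolen
        self.shared = deque()      # movable tasks: steal targets

    def submit(self, task):
        # task: dict with "data_size" (bytes) and "data_node" (location)
        if task["data_size"] >= LARGE_DATA_BYTES and task["data_node"] == self.node:
            self.dedicated.append(task)
        else:
            self.shared.append(task)

    def next_local(self):
        # prefer data-local work, then movable work
        if self.dedicated:
            return self.dedicated.popleft()
        if self.shared:
            return self.shared.popleft()
        return None

    def steal_from(self, victim):
        # steal only from the victim's shared queue, from the tail
        return victim.shared.pop() if victim.shared else None

if __name__ == "__main__":
    a, b = Scheduler("n0"), Scheduler("n1")
    a.submit({"data_size": 128 * 2**20, "data_node": "n0"})  # large, local -> dedicated
    a.submit({"data_size": 1024, "data_node": "n3"})         # small/remote -> shared
    print(b.steal_from(a))   # only the shared task can be stolen
```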
Design Patterns for Scientific Applications in DryadLINQ CTP
- to appear in Proceedings of The Second International Workshop on Data Intensive Computing in the Clouds (DataCloud-2) 2011, The International Conference for High Performance Computing, Networking, Storage and Analysis (SC11)
"... The design and implementation of higher level data flow programming language interfaces are becoming increasingly important for data intensive computation. DryadLINQ is a declarative, data-centric language that enables programmers to address the Big Data issue in the Windows Platform. DryadLINQ has ..."
Abstract
-
Cited by 4 (3 self)
- Add to MetaCart
(Show Context)
The design and implementation of higher-level data-flow programming language interfaces are becoming increasingly important for data-intensive computation. DryadLINQ is a declarative, data-centric language that enables programmers to address the Big Data issue on the Windows platform. DryadLINQ has been successfully used in a wide range of applications over the last five years. The latest release of DryadLINQ was published as a Community Technology Preview (CTP) in December 2010 and contains new features and interfaces that can be customized to achieve better performance within applications and better usability for developers. This paper presents three design patterns in DryadLINQ CTP that are applicable to a large class of scientific applications, exemplified by SW-G, Matrix-Matrix Multiplication, and PageRank with real data.
HyMR: a Hybrid MapReduce Workflow System
- Proceedings of the Third ECMLS Workshop of the ACM HPDC 2012 conference, 2012
"... Various distributed computing models have been developed for high performance computing to process increasing computational data. Among them, MapReduce is one of the most popular choices and widely used. Several distributed workflow systems already exist to solve the problem which contains several M ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
(Show Context)
Various distributed computing models have been developed for high performance computing to process ever-growing volumes of data. Among them, MapReduce is one of the most popular and widely used. Several distributed workflow systems already exist to handle workflows that contain several MapReduce jobs. However, they have limited support for features such as fault tolerance and efficient execution of iterative applications inside the workflow. In this paper, we describe HyMR, a hybrid MapReduce workflow system built on two different MapReduce frameworks. HyMR greatly improves the performance of data processing over workflow systems based on a single MapReduce framework. HyMR optimizes scheduling for individual jobs and supports fault tolerance for the entire workflow pipeline. A distributed file system is used for fast data sharing between jobs. We also compare a pipeline using HyMR to the workflow model based on a single MapReduce framework. The results show that the hybrid model achieves higher efficiency.
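The hybrid dispatch idea can be sketched as follows: each job in a pipeline declares whether it is iterative, and a driver routes it to a suitable runtime, chaining outputs through a shared file system path. run_on_hadoop and run_on_twister are hypothetical placeholders, not HyMR's interfaces.

```python
# A sketch of hybrid workflow dispatch: iterative jobs go to an
# iterative-friendly runtime, the rest to a fault-tolerant one, with a
# distributed file system path passed between consecutive jobs.
def run_on_hadoop(job, input_path):
    print(f"[hadoop]  {job['name']} <- {input_path}")   # placeholder runner
    return f"/dfs/{job['name']}/out"

def run_on_twister(job, input_path):
    print(f"[twister] {job['name']} <- {input_path}")   # placeholder runner
    return f"/dfs/{job['name']}/out"

def run_pipeline(jobs, input_path):
    path = input_path
    for job in jobs:
        runner = run_on_twister if job["iterative"] else run_on_hadoop
        path = runner(job, path)          # output of one job feeds the next
    return path

if __name__ == "__main__":
    pipeline = [
        {"name": "preprocess", "iterative": False},
        {"name": "mds", "iterative": True},   # e.g. an iterative algorithm
    ]
    print(run_pipeline(pipeline, "/dfs/raw"))
```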