All-Pairs: An Abstraction for Data-Intensive Computing on Campus Grids

by Moretti et al.
Venue: IEEE Trans. Parallel Distrib. Syst.
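
For orientation, the abstraction evaluates a user-supplied function over the cross product of two data sets and collects the results as a matrix. A minimal sequential sketch of that contract, with illustrative names (all_pairs, F) rather than the system's actual interface:

```python
# Minimal sketch of the All-Pairs pattern: evaluate F(a, b) for every
# pairing of elements from sets A and B, yielding a result matrix.
# The cited system partitions exactly this work across many grid nodes;
# this sequential version only pins down what is computed.

def all_pairs(A, B, F):
    """Return the |A| x |B| matrix M with M[i][j] = F(A[i], B[j])."""
    return [[F(a, b) for b in B] for a in A]

# Example: pairwise absolute differences between two small sets.
if __name__ == "__main__":
    A = [1.0, 2.0, 3.0]
    B = [0.5, 2.5]
    M = all_pairs(A, B, lambda a, b: abs(a - b))
    print(M)  # [[0.5, 1.5], [1.5, 0.5], [2.5, 1.5]]
```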
Citing documents, results 1 - 10 of 28:

Twister: A runtime for iterative MapReduce

by Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-hee Bae, Judy Qiu, Geoffrey Fox - In The First International Workshop on MapReduce and its Applications, 2010
"... MapReduce programming model has simplified the implementation of many data parallel applications. The simplicity of the programming model and the quality of services provided by many implementations of MapReduce attract a lot of enthusiasm among distributed computing communities. From the years of e ..."
Abstract - Cited by 159 (13 self)
The MapReduce programming model has simplified the implementation of many data-parallel applications. The simplicity of the programming model and the quality of services provided by many implementations of MapReduce have attracted a lot of enthusiasm among distributed computing communities. From years of experience in applying MapReduce to various scientific applications, we identified a set of extensions to the programming model and improvements to its architecture that will expand the applicability of MapReduce to more classes of applications. In this paper, we present the programming model and the architecture of Twister, an enhanced MapReduce runtime that supports iterative MapReduce computations efficiently. We also show performance comparisons of Twister with other similar runtimes, such as Hadoop and DryadLINQ, for large-scale data-parallel applications.
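
The iterative extension the abstract describes amounts to a driver loop around map, reduce, and combine phases, with a small mutable state carried between iterations while the large input stays static. A framework-free sketch of that control flow using a 1-D k-means example; all names are illustrative, and this is not Twister's actual API:

```python
# Sketch of the iterative MapReduce control flow that runtimes such as
# Twister support natively: static input is mapped each iteration
# together with the current (small, mutable) state; reduce merges the
# partial results; combine produces the next state; the loop stops on
# convergence. A real runtime caches the static data in long-running
# tasks instead of re-reading it; that optimization is elided here.

def iterative_mapreduce(static_data, state, map_fn, reduce_fn,
                        combine_fn, converged, max_iters=50):
    for _ in range(max_iters):
        partials = [map_fn(record, state) for record in static_data]
        new_state = combine_fn(reduce_fn(partials))
        if converged(state, new_state):
            return new_state
        state = new_state
    return state

# Example: 1-D k-means with k = 2 expressed in this style.
def kmeans_map(x, centers):
    # Assign x to its nearest center; emit (center_index, x).
    i = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
    return (i, x)

def kmeans_reduce(pairs):
    # Group assigned points by center index.
    groups = {}
    for i, x in pairs:
        groups.setdefault(i, []).append(x)
    return groups

def kmeans_combine(groups):
    # New centers are the means of each group.
    return [sum(xs) / len(xs) for _, xs in sorted(groups.items())]

data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]
final = iterative_mapreduce(
    data, [0.0, 10.0], kmeans_map, kmeans_reduce, kmeans_combine,
    lambda a, b: max(abs(x - y) for x, y in zip(a, b)) < 1e-6)
print(final)  # approximately [1.0, 9.0667]
```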

Citation Context

...Pairwise Distance Calculation: Calculating similarity or dissimilarity between each element of a data set and each element in another data set is a common problem, generally known as an All-Pairs [22] problem. The application we have selected calculates the Smith-Waterman-Gotoh (SW-G) [23] distance (say ...

High Performance Parallel Computing with Cloud and Cloud Technologies

by Jaliya Ekanayake, Xiaohong Qiu, Thilina Gunarathne, Scott Beason, Geoffrey Fox
"... We present our experiences in applying, developing, and evaluating cloud and cloud technologies. First, we present our experience in applying Hadoop and DryadLINQ to a series of data/compute intensive applications and then compare them with a novel MapReduce runtime developed by us, named CGL-MapRed ..."
Abstract - Cited by 49 (14 self)
We present our experiences in applying, developing, and evaluating clouds and cloud technologies. First, we present our experience in applying Hadoop and DryadLINQ to a series of data/compute-intensive applications, and then compare them with a novel MapReduce runtime we developed, named CGL-MapReduce, and with MPI. Preliminary applications are developed for particle physics, bioinformatics, clustering, and matrix multiplication. We identify the basic execution units of the MapReduce programming model and categorize the runtimes according to their characteristics. MPI versions of the applications are used where the contrast in performance needs to be highlighted. We discuss the applications' structure, their mapping to parallel architectures of different types, and the performance of these applications. Next, we present a performance analysis of MPI parallel applications on virtualized resources.

Citation Context

...the authors compare the performance with Hadoop for the tera-sort application. Sphere stores intermediate data in files, and hence is susceptible to higher overheads for iterative applications. All-Pairs [26] is an abstraction that can be used to solve the common problem of comparing all the elements in one data set with all the elements in another data set by applying a given function. This problem can be im...

Cloud Technologies for Bioinformatics Applications

by Jaliya Ekanayake, Thilina Gunarathne, Judy Qiu - IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2010
"... Executing large number of independent jobs or jobs comprising of large number of tasks that perform minimal intertask communication is a common requirement in many domains. Various technologies ranging from classic job schedulers to latest cloud technologies such as MapReduce can be used to execute ..."
Abstract - Cited by 46 (12 self)
Executing a large number of independent jobs, or jobs comprising a large number of tasks that perform minimal intertask communication, is a common requirement in many domains. Various technologies, ranging from classic job schedulers to the latest cloud technologies such as MapReduce, can be used to execute these “many tasks” in parallel. In this paper, we present our experience in applying two cloud technologies, Apache Hadoop and Microsoft DryadLINQ, to two bioinformatics applications with the above characteristics. The applications are a pairwise Alu sequence alignment application and an EST (Expressed Sequence Tag) sequence assembly program. First, we compare the performance of these cloud technologies using the above applications, and also compare them with a traditional MPI implementation in one application. Next, we analyze the effect of inhomogeneous data on the scheduling mechanisms of the cloud technologies. Finally, we present a comparison of the performance of the cloud technologies on virtual and non-virtual hardware platforms.

Load Balancing for MapReduce-based Entity Resolution

by Lars Kolb, Andreas Thor, Erhard Rahm
"... Abstract — The effectiveness and scalability of MapReducebased implementations of complex data-intensive tasks depend on an even redistribution of data between map and reduce tasks. In the presence of skewed data, sophisticated redistribution approaches thus become necessary to achieve load balancin ..."
Abstract - Cited by 15 (5 self)
Abstract - The effectiveness and scalability of MapReduce-based implementations of complex data-intensive tasks depend on an even redistribution of data between map and reduce tasks. In the presence of skewed data, sophisticated redistribution approaches thus become necessary to achieve load balancing among all reduce tasks to be executed in parallel. For the complex problem of entity resolution, we propose and evaluate two approaches for such skew handling and load balancing. The approaches support blocking techniques to reduce the search space of entity resolution, utilize a preprocessing MapReduce job to analyze the data distribution, and distribute the entities of large blocks among multiple reduce tasks. The evaluation on a real cloud infrastructure shows the value and effectiveness of the proposed load balancing approaches.
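
The balancing objective can be illustrated without a cluster: enumerate the comparison pairs implied by each block and spread them evenly over reduce tasks, instead of hashing whole blocks to reducers. A hedged, centralized sketch of that idea with illustrative names; the paper's actual algorithms distribute this planning via a preprocessing MapReduce job:

```python
# Sketch of skew-aware load balancing for MapReduce-based entity
# resolution: blocking keys group candidate entities, every within-block
# pair must be compared, and the pairs of large blocks are spread across
# reducers rather than landing on a single one.
# Illustrative only; not the paper's exact algorithms.

from itertools import combinations

def balanced_pair_assignment(records, key_fn, num_reducers):
    """Assign each within-block comparison pair to a reducer, keeping
    per-reducer pair counts (the dominant cost) as even as possible."""
    blocks = {}
    for r in records:
        blocks.setdefault(key_fn(r), []).append(r)

    load = [0] * num_reducers
    assignment = [[] for _ in range(num_reducers)]
    for key, members in blocks.items():
        for pair in combinations(members, 2):
            # Greedy: place the pair on the currently least-loaded reducer.
            target = min(range(num_reducers), key=load.__getitem__)
            assignment[target].append((key, pair))
            load[target] += 1
    return assignment, load

# Example with one heavily skewed block: naive block-to-reducer hashing
# would put all 4950 pairs of block "a" on one reducer.
records = [("a", i) for i in range(100)] + [("b", i) for i in range(5)]
plan, load = balanced_pair_assignment(records, lambda r: r[0], 4)
print(load)  # [1240, 1240, 1240, 1240]
```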

Citation Context

...milar documents, set-similarity joins [6] for efficient string similarity computation in databases, pairwise distance computation [16] for clustering complex objects, and all-pairs matrix computation [17] for scientific computing. All approaches follow an idea similar to ER with blocking: one or more signatures (e.g., tokens or terms) are generated per object (e.g., document) to avoid the computatio...

Coordinating Computation and I/O in Massively Parallel Sequence Search

by Heshan Lin, Xiaosong Ma, Wuchun Feng, Nagiza F. Samatova
"... With the explosive growth of genomic information, the searching of sequence databases has emerged as one of the most computation- and data-intensive scientific applications. Our previous studies suggested that parallel genomic sequence-search possesses highly irregular computation and I/O patterns. ..."
Abstract - Cited by 7 (2 self)
With the explosive growth of genomic information, the searching of sequence databases has emerged as one of the most computation- and data-intensive scientific applications. Our previous studies suggested that parallel genomic sequence search possesses highly irregular computation and I/O patterns. Effectively addressing these run-time irregularities is thus the key to designing scalable sequence-search tools on massively parallel computers. While the computation scheduling for irregular scientific applications and the optimization of noncontiguous file accesses have been well studied independently, little attention has been paid to the interplay between the two. In this paper, we systematically investigate computation and I/O scheduling for data-intensive, irregular scientific applications within the context of genomic sequence search. Our study reveals that the lack of coordination between computation scheduling and I/O optimization can result in severe performance issues. We then propose an integrated scheduling approach that effectively improves sequence-search throughput by gracefully coordinating the dynamic load balancing of computation and high-performance noncontiguous I/O.

Challenges and Approaches for Distributed Workflow-Driven Analysis of Large-Scale Biological Data [Vision Paper]

by Ilkay Altintas, Jianwu Wang, Daniel Crawl, Weizhong Li
"... Next-generation DNA sequencing machines are generating a very large amount of sequence data with applications in many scientific challenges and placing unprecedented de-mands on traditional single-processor bioinformatics algo-rithms. Middleware and technologies for scientific work-flows and data-in ..."
Abstract - Cited by 6 (3 self)
Next-generation DNA sequencing machines are generating a very large amount of sequence data with applications in many scientific challenges and placing unprecedented demands on traditional single-processor bioinformatics algorithms. Middleware and technologies for scientific workflows and data-intensive computing promise new capabilities to enable rapid analysis of next-generation sequence data. Based on this motivation and our previous experiences in bioinformatics and distributed scientific workflows, we are creating a Kepler Scientific Workflow System module, called “bioKepler”, that facilitates the development of Kepler workflows for integrated execution of bioinformatics applications in distributed environments. This vision paper discusses the challenges related to next-generation sequencing data, explains the approaches taken in bioKepler to help with analysis of such data, and presents preliminary results demonstrating these approaches.

Citation Context

...cution in distributed environments, e.g., MapReduce [29] and MasterSlave [28]. We are extending this generic higher-order actor set to support more data-parallel execution patterns, such as All-Pairs [18], and Match and CoGroup in PACT [5]. These higher-order actors will be reused to build domain-specific bioActors. • bioKepler: The bioKepler module contains a specialized set of actors, namely bioActors, ...

Optimizing Load Balancing and Data-Locality with Data-aware Scheduling

by Ke Wang, Xiaobing Zhou, Tonglin Li, Dongfang Zhao, Michael Lang, Ioan Raicu, Hortonworks Inc
"... Abstract—Load balancing techniques (e.g. work stealing) are important to obtain the best performance for distributed task scheduling systems that have multiple schedulers making scheduling decisions. In work stealing, tasks are randomly migrated from heavy-loaded schedulers to idle ones. However, fo ..."
Abstract - Cited by 5 (3 self)
Abstract - Load balancing techniques (e.g., work stealing) are important for obtaining the best performance from distributed task scheduling systems that have multiple schedulers making scheduling decisions. In work stealing, tasks are randomly migrated from heavily loaded schedulers to idle ones. However, for data-intensive applications where tasks are dependent and task execution involves processing a large amount of data, migrating tasks blindly yields poor data locality and incurs significant data-transfer overhead. This work improves work stealing by using both dedicated and shared queues. Tasks are organized in queues based on task data size and location. We implement our technique in MATRIX, a distributed task scheduler for many-task computing. We leverage a distributed key-value store to organize and scale the task metadata, task dependencies, and data locality. We evaluate the improved work stealing technique with both applications and micro-benchmarks structured as directed acyclic graphs. Results show that the proposed data-aware work stealing technique performs well. Keywords: data-intensive computing; data-aware scheduling; work stealing; key-value stores; many-task computing
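
The dedicated/shared queue mechanism is compact to sketch: tasks whose input data is large stay pinned to the node holding that data, small tasks go to a shared queue, and idle schedulers steal only from shared queues, so stealing never destroys expensive locality. An illustrative sketch with assumed names and an assumed size threshold; MATRIX's real implementation differs in detail:

```python
# Sketch of data-aware work stealing with dedicated and shared queues:
# big tasks stay where their data lives; only small tasks are eligible
# to be stolen by idle nodes. Names and threshold are illustrative.

from collections import deque

DATA_SIZE_THRESHOLD = 64 * 1024 * 1024  # assumed cutoff: 64 MB

class Node:
    def __init__(self, name):
        self.name = name
        self.dedicated = deque()  # locality-pinned tasks; never stolen
        self.shared = deque()     # cheap-to-move tasks; stealable

def submit(task, nodes):
    """task = (task_id, data_node, data_size_bytes)."""
    _, data_node, data_size = task
    home = nodes[data_node]
    if data_size >= DATA_SIZE_THRESHOLD:
        home.dedicated.append(task)  # moving the data would cost more
    else:
        home.shared.append(task)

def next_task(me, nodes):
    """Local work first; otherwise steal from another shared queue."""
    if me.dedicated:
        return me.dedicated.popleft()
    if me.shared:
        return me.shared.popleft()
    for other in nodes.values():     # victim selection: first non-empty
        if other is not me and other.shared:
            return other.shared.pop()  # steal from the opposite end
    return None

nodes = {n: Node(n) for n in ("n1", "n2")}
submit(("t1", "n1", 512 * 1024 * 1024), nodes)  # big: pinned to n1
submit(("t2", "n1", 4 * 1024), nodes)           # small: stealable
print(next_task(nodes["n2"], nodes))  # idle n2 steals t2, never t1
```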

Citation Context

...t the extreme case where the locality is infinitely large, there would be only one file on one compute node; eventually all the tasks need to be run on that node. 2) All-Pairs in Biometrics: All-Pairs [31] is a common benchmark for data-intensive applications that describes the behavior of a new function on sets A and B. For example, in biometrics, it is very important to find out the covariance o...

Design Patterns for Scientific Applications in DryadLINQ CTP

by Hui Li, Yang Ruan, Yuduo Zhou, Judy Qiu, Geoffrey Fox - in Proceedings of the Second International Workshop on Data Intensive Computing in the Clouds (DataCloud-2), at the International Conference for High Performance Computing, Networking, Storage and Analysis (SC11), 2011
"... The design and implementation of higher level data flow programming language interfaces are becoming increasingly important for data intensive computation. DryadLINQ is a declarative, data-centric language that enables programmers to address the Big Data issue in the Windows Platform. DryadLINQ has ..."
Abstract - Cited by 4 (3 self)
The design and implementation of higher-level data-flow programming language interfaces are becoming increasingly important for data-intensive computation. DryadLINQ is a declarative, data-centric language that enables programmers to address the Big Data issue on the Windows platform. DryadLINQ has been successfully used in a wide range of applications over the last five years. The latest release of DryadLINQ was published as a Community Technology Preview (CTP) in December 2010 and contains new features and interfaces that can be customized to achieve better performance within applications and better usability for developers. This paper presents three design patterns in DryadLINQ CTP that are applicable to a large class of scientific applications, exemplified by SW-G, Matrix-Matrix Multiplication, and PageRank with real data.

Citation Context

...on for the many subsets of the input records. The workflow of the three distributed grouped aggregation approaches is shown in Figure 2. 3.1. Pleasingly Parallel Application: The Alu clustering problem [11][12] is one of the most challenging problems in sequence clustering, because Alus represent the largest repeat families in the human genome. About one million copies of the Alu sequence exist in t...

HyMR: a Hybrid MapReduce Workflow System

by Yang Ruan, Zhenhua Guo, Yuduo Zhou, Judy Qiu, Geoffrey Fox - Proceedings of the Third ECMLS Workshop of ACM HPDC 2012 conference, 2012
"... Various distributed computing models have been developed for high performance computing to process increasing computational data. Among them, MapReduce is one of the most popular choices and widely used. Several distributed workflow systems already exist to solve the problem which contains several M ..."
Abstract - Cited by 2 (2 self)
Various distributed computing models have been developed for high-performance computing to process ever-increasing volumes of data. Among them, MapReduce is one of the most popular and widely used. Several distributed workflow systems already exist for problems that comprise several MapReduce jobs. However, they have limited support for features such as fault tolerance and efficient execution of iterative applications inside the workflow. In this paper, we describe HyMR: a hybrid MapReduce workflow system built on two different MapReduce frameworks. HyMR greatly improves the performance of data processing over workflow systems based on a single MapReduce framework. HyMR optimizes scheduling for individual jobs and supports fault tolerance for the entire workflow pipeline. A distributed file system is used for fast data sharing between jobs. We also compare a pipeline using HyMR to the workflow model based on a single MapReduce framework. The results show that the hybrid model is more efficient.
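
The hybrid idea reduces to a thin workflow driver: each stage declares which MapReduce runtime suits it (an iterative runtime for iterative stages, a fault-tolerant single-pass runtime otherwise), and stages hand off data through a shared distributed file system. A hedged sketch of such a driver; runner names, commands, and paths below are illustrative placeholders, not HyMR's interfaces:

```python
# Sketch of a hybrid MapReduce workflow driver in the spirit of HyMR:
# each stage is tagged with the runtime best suited to it, and stages
# exchange data via paths on a shared distributed file system.

import subprocess

RUNNERS = {
    # Placeholder launch commands; a real system would invoke the
    # frameworks' own job-submission clients here.
    "hadoop":  lambda inp, out: ["run-hadoop-job.sh", inp, out],
    "twister": lambda inp, out: ["run-twister-job.sh", inp, out],
}

def run_pipeline(stages, initial_input):
    """stages: list of (stage_name, runtime, output_path)."""
    current = initial_input
    for name, runtime, out_path in stages:
        cmd = RUNNERS[runtime](current, out_path)
        print(f"[{name}] on {runtime}: {' '.join(cmd)}")
        subprocess.run(cmd, check=True)  # fail the pipeline on error
        current = out_path               # next stage reads this output
    return current

# Example: single-pass alignment on Hadoop, then iterative clustering
# on an iterative runtime, sharing data through a DFS mount.
pipeline = [
    ("pairwise-alignment", "hadoop",  "/dfs/align-out"),
    ("mds-clustering",     "twister", "/dfs/cluster-out"),
]
# run_pipeline(pipeline, "/dfs/input.fasta")  # needs the runner scripts
```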

Citation Context

...Pairwise sequence alignment (PSA) does all-pairs sequence alignment over a given sequence dataset, which is usually in FASTA format, and the result is generated as an all-pairs dissimilarity matrix [16]. There are many pairwise sequence alignment algorithms, such as Smith-Waterman-Gotoh (SWG) [17] and Needleman-Wunsch [18]. In this particular pipeline, we use the SWG algorithm for each pair's sequenc...

to appear

by S Kanemura, S Moretti, K Odagiri - in Proceedings of 5th International Linear Collider Workshop (LCWS 2000), Fermilab, 2000
"... ..."
Abstract - Cited by 1 (1 self)
Abstract not found

Citation Context

...pute half of the result matrix. Previous researchers have studied All-Pairs theoretically [28] and on small clusters [4]. Our contribution is to scale the problem up to hundreds of nodes in the cloud [16]. As with the previous abstraction, the user provides a “function” in the form of a program that compares two input files. The data sets A and B are text files listing the remaining files to be compar...
