Data-Intensive Supercomputing: The Case for DISC (2007)

by R E Bryant

Results 1 - 10 of 16

Automatic Optimization of Parallel Dataflow Programs

by Christopher Olston, Benjamin Reed, Adam Silberstein, Utkarsh Srivastava
Cited by 14 (0 self)
Large-scale parallel dataflow systems, e.g., Dryad and Map-Reduce, have attracted significant attention recently. High-level dataflow languages such as Pig Latin and Sawzall are being layered on top of these systems, to enable faster program development and more maintainable code. These languages engender greater transparency in program structure, and open up opportunities for automatic optimization. This paper proposes a set of optimization strategies for this context, drawing on and extending techniques from the database community.

Data-intensive file systems for internet services: A rose by any other name

by Wittawat Tantisiriroj, Swapnil Patil, Garth Gibson, 2008
Cited by 10 (3 self)

MapReduce Optimization Using Regulated Dynamic Prioritization

by Thomas Sandholm, Kevin Lai
Cited by 9 (0 self)
We present a system for allocating resources in shared data and compute clusters that improves MapReduce job scheduling in three ways. First, the system uses regulated and user-assigned priorities to offer different service levels to jobs and users over time. Second, the system dynamically adjusts resource allocations to fit the requirements of different job stages. Finally, the system automatically detects and eliminates bottlenecks within a job. We show experimentally using real applications that users can optimize not only job execution time but also the cost-benefit ratio or prioritization efficiency of a job using these three strategies. Our approach relies on a proportional share mechanism that continuously allocates virtual machine resources. Our experimental results show an 11-31% improvement in completion time and a 4-187% improvement in prioritization efficiency for different classes of MapReduce jobs. We further show that delay-intolerant users gain even more from our system.

Tritonsort: A balanced large-scale sorting system

by Alexander Rasmussen, George Porter, Michael Conley, Harsha V. Madhyastha, Radhika Niranjan Mysore, Alexander Pucher, Amin Vahdat - In USENIX NSDI’11, 2011
Cited by 8 (3 self)
... sorting system. It is designed to process large datasets, and has been evaluated against as much as 100 TB of input data spread across 832 disks in 52 nodes at a rate of 0.916 TB/min. When evaluated against the annual Indy GraySort sorting benchmark, TritonSort is 60% better in absolute performance and has over six times the per-node efficiency of the previous record holder. In this paper, we describe the hardware and software architecture necessary to operate TritonSort at this level of efficiency. Through careful management of system resources to ensure cross-resource balance, we are able to sort data at approximately 80% of the disks’ aggregate sequential write speed. We believe the work holds a number of lessons for balanced system design and for scale-out architectures in general. While many interesting systems are able to scale linearly with additional servers, per-server performance can lag behind per-server capacity by more than an order of magnitude. Bridging the gap between high scalability and high performance would enable either significantly cheaper systems that are able to do the same work or provide the ability to address significantly larger problem sets with the same infrastructure.

Does erasure coding have a role to play in my data center?

by Zhe Zhang, Amey Deshpande, Xiaosong Ma, Eno Thereska, Dushyanth Narayanan
Cited by 2 (0 self)
Today replication has become the de facto standard for storing data within and across data centers that process data-intensive workloads. Erasure coding (a form of software RAID), although heavily researched and theoretically more space efficient than replication, has complex tradeoffs which are not well understood by practitioners. Today’s data centers have diverse foreground and background data-intensive workloads, and getting these tradeoffs right is becoming increasingly important. Through a series of realistic data center deployment scenarios and workload characteristics, coupled with the implementation of a prototype Hadoop library with erasure coding functionalities, we revisit traditional metrics (performance and dollar cost), present new tradeoffs (power proportionality and complexity) and make recommendations on directions worth researching.

Enabling Computational Steering with an Asynchronous-Iterative Computation Framework

by Alexandre di Costanzo, Chao Jin, Carlos A. Varela, Rajkumar Buyya
Cited by 1 (0 self)
In this paper, we present a framework that enables scientists to steer computations executing over large-scale grid computing environments. By using computational steering, users can dynamically control their simulations or computations to reach expected results more efficiently. The framework supports steerable applications by introducing an asynchronous iterative MapReduce programming model that is deployed using Hadoop over a set of virtual machines executing on a multi-cluster grid. To tolerate the heterogeneity between different sites, results are collected asynchronously and users can dynamically interact with their computations to adjust the area of interest. According to users’ dynamic interaction, the framework can redistribute the computational overload between the heterogeneous sites and explore the user’s interest area by using more powerful sites when possible. With our framework, the bottleneck induced by synchronisation between different sites is considerably avoided, and therefore the response to users’ interaction is satisfied more efficiently. We illustrate and evaluate this framework with a scientific application that aims to fit models of the Milky Way galaxy structure to stars observed by the Sloan Digital Sky Survey.

MapReduce Programming Model for .NET-based Distributed Computing

by Chao Jin, Rajkumar Buyya
Recently, many data-center-scale computer systems have been built to meet the high storage and processing demands of data-intensive and compute-intensive applications. MapReduce is one of the most popular programming models designed to support the development of such applications. It was initially proposed by Google for simplifying the development of large-scale web search applications in data centers and has been proposed to form the basis of a “data center computer”. This technical report presents a realization of MapReduce for .NET-based data centers, including the programming model and runtime system. The design and implementation of MapReduce.NET are described and its performance evaluation is presented.

MapReduce Programming Model for .NET-based Cloud Computing

by Chao Jin, Rajkumar Buyya
Recently, many large-scale computer systems have been built to meet the high storage and processing demands of compute- and data-intensive applications. MapReduce is one of the most popular programming models designed to support the development of such applications. It was initially created by Google for simplifying the development of large-scale web search applications in data centers and has been proposed to form the basis of a ‘data center computer’. This paper presents a realization of MapReduce for .NET-based data centers, including the programming model and the runtime system. The design and implementation of MapReduce.NET are described and its performance evaluation is presented.

Efficiency Matters!

by Eric Anderson, Joseph Tucek
Current data intensive scalable computing (DISC) systems, although scalable, achieve embarrassingly low rates of processing per node. We feel that current DISC systems have repeated a mistake of old high-performance systems: focusing on scalability without considering efficiency. This poor efficiency comes with issues in reliability, energy, and cost. As the gap between theoretical performance and what is actually achieved has become glaringly large, we feel there is a pressing need to rethink the design of future data intensive computing and carefully consider the direction of future research.

Evaluating SPLASH-2 Applications Using MapReduce

by Shengkai Zhu, Zhiwei Xiao, Haibo Chen, Rong Chen, Weihua Zhang, Binyu Zang
MapReduce has become prevalent for running data-parallel applications. By hiding non-functional aspects such as parallelism, fault tolerance and load balancing from programmers, MapReduce significantly simplifies the programming of large clusters. Because of these features, researchers have also explored the use of MapReduce in other application domains, such as machine learning, textual retrieval and statistical translation, among others. In this paper, we study the feasibility of running typical supercomputing applications using the MapReduce framework. We port two applications (Water Spatial and Radix Sort) from the Stanford SPLASH-2 suite to MapReduce. By thoroughly evaluating them in Hadoop, an open-source MapReduce framework for clusters, we analyze their major performance bottlenecks in the MapReduce framework. Based on this, we also provide several suggestions for enhancing the MapReduce framework to suit these applications.