Results 1 - 10
of
50
Mapreduce: a flexible data processing tool
- Commun. ACM
"... contributed articles doi:10.1145/1629175.1629198 MapReduce advantages over parallel databases include storage-system independence and fine-grain fault tolerance for large jobs. ..."
Abstract
-
Cited by 24 (0 self)
- Add to MetaCart
contributed articles doi:10.1145/1629175.1629198 MapReduce advantages over parallel databases include storage-system independence and fine-grain fault tolerance for large jobs.
Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing)
"... MapReduce is a computing paradigm that has gained a lot of attention in recent years from industry and research. Unlike parallel DBMSs, MapReduce allows non-expert users to run complex analytical tasks over very large data sets on very large clusters and clouds. However, this comes at a price: MapRe ..."
Abstract
-
Cited by 16 (1 self)
- Add to MetaCart
MapReduce is a computing paradigm that has gained a lot of attention in recent years from industry and research. Unlike parallel DBMSs, MapReduce allows non-expert users to run complex analytical tasks over very large data sets on very large clusters and clouds. However, this comes at a price: MapReduce processes tasks in a scan-oriented fashion. Hence, the performance of Hadoop — an open-source implementation of MapReduce — often does not match the one of a well-configured parallel DBMS. In this paper we propose a new type of system named Hadoop++: it boosts task performance without changing the Hadoop framework at all (Hadoop does not even ‘notice it’). To reach this goal, rather than changing a working system (Hadoop), we inject our technology at the right places through UDFs only and affect Hadoop from inside. This has three important consequences: First, Hadoop++ significantly
Dremel: Interactive Analysis of Web-Scale Datasets
"... Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of da ..."
Abstract
-
Cited by 13 (0 self)
- Add to MetaCart
Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand node instances of the system. 1.
The Performance of MapReduce: An In-depth Study
"... MapReduce has been widely used for large-scale data analysis in the Cloud. The system is well recognized for its elastic scalability and fine-grained fault tolerance although its performance has been noted to be suboptimal in the database context. According to a recent study [19], Hadoop, an open so ..."
Abstract
-
Cited by 13 (0 self)
- Add to MetaCart
MapReduce has been widely used for large-scale data analysis in the Cloud. The system is well recognized for its elastic scalability and fine-grained fault tolerance although its performance has been noted to be suboptimal in the database context. According to a recent study [19], Hadoop, an open source implementation of MapReduce, is slower than two state-of-the-art parallel database systems in performing a variety of analytical tasks by a factor of 3.1 to 6.5. MapReduce can achieve better performance with the allocation of more compute nodes from the cloud to speed up computation; however, this approach of “renting more nodes ” is not cost effective in a pay-as-you-go environment. Users desire an economical elastically scalable data processing system, and therefore, are interested in whether MapReduce can offer both elastic scalability and efficiency. In this paper, we conduct a performance study of MapReduce (Hadoop) on a 100-node cluster of Amazon EC2 with various levels of parallelism. We identify five design factors that affect the performance of Hadoop, and investigate alternative but known methods for each factor. We show that by carefully tuning these factors, the overall performance of Hadoop can be improved by a factor of 2.5 to 3.5 for the same benchmark used in [19], and is thus more comparable to that of parallel database systems. Our results show that it is therefore possible to build a cloud data processing system that is both elastically scalable and efficient. 1.
Starfish: A Self-tuning System for Big Data Analytics
- In CIDR
, 2011
"... Timely and cost-effective analytics over “Big Data ” is now a key ingredient for success in many businesses, scientific and engineering disciplines, and government endeavors. The Hadoop software stack—which consists of an extensible MapReduce execution engine, pluggable distributed storage engines, ..."
Abstract
-
Cited by 11 (3 self)
- Add to MetaCart
Timely and cost-effective analytics over “Big Data ” is now a key ingredient for success in many businesses, scientific and engineering disciplines, and government endeavors. The Hadoop software stack—which consists of an extensible MapReduce execution engine, pluggable distributed storage engines, and a range of procedural to declarative interfaces—is a popular choice for big data analytics. Most practitioners of big data analytics—like computational scientists, systems researchers, and business analysts—lack the expertise to tune the system to get good performance. Unfortunately, Hadoop’s performance out of the box leaves much to be desired, leading to suboptimal use of resources, time, and money (in payas-you-go clouds). We introduce Starfish, a self-tuning system for big data analytics. Starfish builds on Hadoop while adapting to user needs and system workloads to provide good performance automatically, without any need for users to understand and manipulate the many tuning knobs in Hadoop. While Starfish’s system architecture is guided by work on self-tuning database systems, we discuss how new analysis practices over big data pose new challenges; leading us to different design choices in Starfish. 1.
Manimal: Relational Optimization for Data-Intensive Programs
"... The MapReduce distributed programming framework is very popular, but currently lacks the optimization techniques that have been standard with relational database systems for many years. This paper proposes Manimal, which uses static code analysis to detect MapReduce program semantics and thereby ena ..."
Abstract
-
Cited by 8 (1 self)
- Add to MetaCart
The MapReduce distributed programming framework is very popular, but currently lacks the optimization techniques that have been standard with relational database systems for many years. This paper proposes Manimal, which uses static code analysis to detect MapReduce program semantics and thereby enable wholly-automatic optimization of MapReduce programs. For example, a programmer’s map function that emits data only when an if... statement holds true is essentially encoding a selection condition; code analysis can detect and characterize these conditions. If Manimal has an appropriate index available, it can then alter MapReduce execution to use it. Manimal can address many different optimization opportunities, including projections, structure-aware data compression, and others. However, this paper illustrates the system by focusing on one: efficient selection. We give a static analysis algorithm that can detect selections in user programs, and cover how Manimal can employ a B+Tree to execute these selections efficiently at runtime. Testing Manimal on several standard MapReduce programs, we show that selection alone can automatically reduce a standard program’s runtime to 63 % of conventional MapReduce execution time on identical hardware. We also give an in-depth discussion of other optimization targets and detection techniques. 1.
Osprey: Implementing MapReduce-Style Fault Tolerance in a Shared-Nothing Distributed Database
"... Abstract — In this paper, we describe a scheme for tolerating and recovering from mid-query faults in a distributed shared nothing database. Rather than aborting and restarting queries, our system, Osprey, divides running queries into subqueries, and replicates data such that each subquery can be re ..."
Abstract
-
Cited by 8 (0 self)
- Add to MetaCart
Abstract — In this paper, we describe a scheme for tolerating and recovering from mid-query faults in a distributed shared nothing database. Rather than aborting and restarting queries, our system, Osprey, divides running queries into subqueries, and replicates data such that each subquery can be rerun on a different node if the node initially responsible fails or returns too slowly. Our approach is inspired by the fault tolerance properties of MapReduce, in which map or reduce jobs are greedily assigned to workers, and failed jobs are rerun on other workers. Osprey is implemented using a middleware approach, with only a small amount of custom code to handle cluster coordination. Each node in the system is a discrete database system running on a separate machine. Data, in the form of tables, is partitioned amongst database nodes and each partition is replicated on several nodes, using a technique called chained declustering [1]. A coordinator machine acts as a standard SQL interface to users; it transforms an input SQL query into a set of subqueries that are then executed on the nodes. Each subquery represents only a small fraction of the total execution of the query; worker nodes are assigned a new subquery as they finish their current one. In this greedy-approach, the amount of work lost due to node failure is small (at most one subquery’s work), and the system is automatically load balanced, because slow nodes will be assigned fewer subqueries. We demonstrate Osprey’s viability as a distributed system for a small data warehouse data set and workload. Our experiments show that the overhead introduced by the middleware is small compared to the workload, and that the system shows promising load balancing and fault tolerance properties. I.
Automatic Optimization for MapReduce Programs
"... The MapReduce distributed programming framework has become popular, despite evidence that current implementations are inefficient, requiring far more hardware than a traditional relational databases to complete similar tasks. MapReduce jobs are amenable to many traditional database query optimizatio ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
The MapReduce distributed programming framework has become popular, despite evidence that current implementations are inefficient, requiring far more hardware than a traditional relational databases to complete similar tasks. MapReduce jobs are amenable to many traditional database query optimizations (B+Trees for selections, column-storestyle techniques for projections, etc), but existing systems do not apply them, substantially because free-form user code obscures the true data operation being performed. For example, a selection in SQL is easily detected, but a selection in a MapReduce program is embedded in Java code along with lots of other program logic. We could ask the programmer to provide explicit hints about the program’s data semantics, but one of MapReduce’s attractions is precisely that it does not ask the user for such information. This paper covers Manimal, which automatically analyzes MapReduce programs and applies appropriate dataaware optimizations, thereby requiring no additional help at all from the programmer. We show that Manimal successfully detects optimization opportunities across a range of data operations, and that it yields speedups of up to 1,121% on previously-written MapReduce programs. 1.

