• Documents
  • Authors
  • Tables
  • Other Seers ▼
    RefSeer AckSeer CollabSeer SeerSeer
  • Log in
  • Sign up
  • MetaCart

CiteSeerX logo

Advanced Search Include Citations
Advanced Search Include Citations | Disambiguate

Cheetah: A High Performance, Custom Data Warehouse on Top of MapReduce

by Songting Chen
Add To MetaCart

Tools

Sorted by:
Results 1 - 7 of 7

RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems

by Yongqiang He, Rubao Lee, Yin Huai, Zheng Shao, Namit Jain, Xiaodong Zhang, Zhiwei Xu
"... MapReduce-based data warehouse systems are playing important roles of supporting big data analytics to understand quickly the dynamics of user behavior trends and their needs in typical Web service providers and social network sites (e.g., Facebook). In such a system, the data placement structure is ..."
Abstract - Cited by 5 (2 self) - Add to MetaCart
MapReduce-based data warehouse systems are playing important roles of supporting big data analytics to understand quickly the dynamics of user behavior trends and their needs in typical Web service providers and social network sites (e.g., Facebook). In such a system, the data placement structure is a critical factor that can affect the warehouse performance in a fundamental way. Based on our observations and analysis of Facebook production systems, we have characterized four requirements for the data placement structure: (1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) strong adaptivity to highly dynamic workload patterns. We have examined three commonly accepted data placement structures in conventional databases, namely rowstores, column-stores, and hybrid-stores in the context of large data analysis using MapReduce. We show that they are not very suitable for big data processing in distributed systems. In this paper, we present a big data placement structure called RCFile (Record Columnar File) and its implementation in the Hadoop system. With intensive experiments, we show the effectiveness of RCFile in satisfying the four requirements. RCFile has been chosen in Facebook data warehouse system as the default option. It has also been adopted by Hive and Pig, the two most widely used data analysis systems developed in Facebook and Yahoo! I.

Efficient Processing of Data Warehousing Queries in a Split Execution Environment

by Kamil Bajda-pawlikowski, Daniel J. Abadi, Avi Silberschatz, Erik Paulson, Hadapt Inc
"... Hadapt is a start-up company currently commercializing the Yale University research project called HadoopDB. The company focuses on building a platform for Big Data analytics in the cloud by introducing a storage layer optimized for structured data and by providing a framework for executing SQL quer ..."
Abstract - Cited by 1 (0 self) - Add to MetaCart
Hadapt is a start-up company currently commercializing the Yale University research project called HadoopDB. The company focuses on building a platform for Big Data analytics in the cloud by introducing a storage layer optimized for structured data and by providing a framework for executing SQL queries efficiently. This work considers processing data warehousing queries over very large datasets. Our goal is to maximize performance while, at the same time, not giving up fault tolerance and scalability. We analyze the complexity of this problem in the split execution environment of HadoopDB. Here, incoming queries are examined; parts of the query are pushed down and executed inside the higher performing database layer; and the rest of the query is processed in a more generic MapReduce framework. In this paper, we discuss in detail performance-oriented query execution strategies for data warehouse queries in split execution environments, with particular focus on join and aggregation operations. The efficiency of our techniques is demonstrated by running experiments using the TPC-H benchmark with 3TB of data. In these experiments we compare our results with a standard commercial parallel database and an open-source MapReduce implementation featuring a SQL interface (Hive). We show that HadoopDB successfully competes with other systems.

ColumnOriented Storage Techniques for MapReduce

by Avrilia Floratou, Jignesh M. Patel, Eugene J. Shekita, Sandeep Tata
"... Users of MapReduce often run into performance problems when they scale up their workloads. Many of the problems they encounter can be overcome by applying techniques learned from over three decades of research on parallel DBMSs. However, translating these techniques to a Map-Reduce implementation su ..."
Abstract - Add to MetaCart
Users of MapReduce often run into performance problems when they scale up their workloads. Many of the problems they encounter can be overcome by applying techniques learned from over three decades of research on parallel DBMSs. However, translating these techniques to a Map-Reduce implementation such as Hadoop presents unique challenges that can lead to new design choices. This paper describes how column-oriented storage techniques can be incorporated in Hadoop in a way that preserves its popular programming APIs. We show that simply using binary storage formats in Hadoop can provide a 3x performance boost over the naive use of text files. We then introduce a column-oriented storage format that is compatible with the replication and scheduling constraints of Hadoop and show that it can speed up MapReduce jobs on real workloads by an order of magnitude. We also show that dealing with complex column types such as arrays, maps, and nested records, which are common in MapReduce jobs, can incur significant CPU overhead. Finally, we introduce a novel skip list column format and lazy record construction strategy that avoids deserializing unwanted records to provide an additional 1.5x performance boost. Experiments on a real intranet crawl are used to show that our column-oriented storage techniques can improve the performance of the map phase in Hadoop by as much as two orders of magnitude. 1.

Parallel Data Processing with MapReduce: A Survey

by Kyong-ha Lee, Yoon-joon Lee, Hyunsik Choi, Yon Dohn Chung, Bongki Moon
"... A prominent parallel data processing tool MapReduce is gaining significant momentum from both industry and academia as the volume of data to analyze grows rapidly. While MapReduce is used in many areas where massive data analysis is required, there are still debates on its performance, efficiency pe ..."
Abstract - Add to MetaCart
A prominent parallel data processing tool MapReduce is gaining significant momentum from both industry and academia as the volume of data to analyze grows rapidly. While MapReduce is used in many areas where massive data analysis is required, there are still debates on its performance, efficiency per node, and simple abstraction. This survey intends to assist the database and open source communities in understanding various technical aspects of the MapReduce framework. In this survey, we characterize the MapReduce framework and discuss its inherent pros and cons. We then introduce its optimization strategies reported in the recent literature. We also discuss the open issues and challenges raised on parallel data analysis with MapReduce. 1.

CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop 1

by Mohamed Y. Eltabakh, Yuanyuan Tian, Fatma Özcan, Rainer Gemulla, Aljoscha Krettek, John Mcpherson
"... Hadoop has become an attractive platform for large-scale data analytics. In this paper, we identify a major performance bottleneck of Hadoop: its lack of ability to colocate related data on the same set of nodes. To overcome this bottleneck, we introduce CoHadoop, a lightweight extension of Hadoop t ..."
Abstract - Add to MetaCart
Hadoop has become an attractive platform for large-scale data analytics. In this paper, we identify a major performance bottleneck of Hadoop: its lack of ability to colocate related data on the same set of nodes. To overcome this bottleneck, we introduce CoHadoop, a lightweight extension of Hadoop that allows applications to control where data are stored. In contrast to previous approaches, Co-Hadoop retains the flexibility of Hadoop in that it does not require users to convert their data to a certain format (e.g., a relational database or a specific file format). Instead, applications give hints to CoHadoop that some set of files are related and may be processed jointly; CoHadoop then tries to colocate these files for improved efficiency. Our approach is designed such that the strong fault tolerance properties of Hadoop are retained. Colocation can be used to improve the efficiency of many operations, including indexing, grouping, aggregation, columnar storage, joins, and sessionization. We conducted a detailed study of joins and sessionization in the context of log processing—a common use case for Hadoop—, and propose efficient map-only algorithms that exploit colocated data partitions. In our experiments, we observed that CoHadoop outperforms both plain Hadoop and previous work. In particular, our approach not only performs better than repartition-based algorithms, but also outperforms map-only algorithms that do exploit data partitioning but not colocation. 1.

An Empirical Analysis of Scheduling Techniques for Real-time Cloud-based Data Processing

by Linh T. X, Phan Zhuoyao, Zhang Qi, Zheng Boon, Thau Loo, Insup Lee
"... Abstract—In this paper, we explore the challenges and needs of current cloud infrastructures, to better support cloud-based data-intensive applications that are not only latency-sensitive but also require strong timing guarantees. These applications have strict deadlines (e.g., to perform time-depen ..."
Abstract - Add to MetaCart
Abstract—In this paper, we explore the challenges and needs of current cloud infrastructures, to better support cloud-based data-intensive applications that are not only latency-sensitive but also require strong timing guarantees. These applications have strict deadlines (e.g., to perform time-dependent mission critical tasks or to complete real-time control decisions using a human-in-the-loop), and deadline misses are undesirable. To highlight the challenges in this space, we provide a case study of the online scheduling of MapReduce jobs executed by Hadoop. Our evaluations on Amazon EC2 show that the existing Hadoop scheduler is ill-equipped to handle jobs with deadlines. However, by adapting existing multiprocessor scheduling techniques for the cloud environment, we observe significant performance improvements in minimizing missed deadlines and tardiness. Based on our case study, we discuss a range of challenges in this domain posed by virtualization and scale, and propose our research agenda centered around the application of advanced real-time scheduling techniques in the cloud environment. I.

Information

by Jens Dittrich, Stefan Richter, Stefan Schuh, Alekh Jindal, Jörg Schad
"... Yellow elephants are slow. A major reason is that they consume their inputs entirely before responding to an elephant rider’s orders. Some clever riders have trained their yellow elephants to only consume parts of the inputs before responding. However, the teaching time to make an elephant do that i ..."
Abstract - Add to MetaCart
Yellow elephants are slow. A major reason is that they consume their inputs entirely before responding to an elephant rider’s orders. Some clever riders have trained their yellow elephants to only consume parts of the inputs before responding. However, the teaching time to make an elephant do that is high. So high that the teaching lessons often do not pay off. We take a different approach. We make elephants aggressive; only this will make them very fast. We propose HAIL (Hadoop Aggressive Indexing Library), an enhancement of HDFS and Hadoop MapReduce that dramatically improves runtimes of several classes of MapReduce jobs. HAIL changes the upload pipeline of HDFS in order to create different clustered indexes on each data block replica. An interesting feature of HAIL is that we typically create a win-win situation: we improve both data upload to HDFS and the runtime of the actual Hadoop MapReduce job. In terms of data upload, HAIL improves over HDFS by up to 60 % with the default replication factor of three. In terms of query execution, we demonstrate that HAIL runs up to 68x faster than Hadoop. In our experiments, we use six clusters including physical and EC2 clusters of up to100 nodes. A series of scalability experiments also demonstrates the superiority of HAIL. 1.
The National Science Foundation
  • About CiteSeerX
  • Submit Documents
  • Privacy Policy
  • Help
  • Data
  • Source
  • Contact Us

Developed at and hosted by The College of Information Sciences and Technology

© 2007-2010 The Pennsylvania State University