J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, 2008

Results 1 - 10 of 3,439

Bigtable: A distributed storage system for structured data

by Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber - in Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Volume 7, 2006
"... Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications ..."
Abstract - Cited by 1028 (4 self) - Add to MetaCart
Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.

Citation Context

...back into Bigtable, but it does allow various forms of data transformation, filtering based on arbitrary expressions, and summarization via a variety of operators. Bigtable can be used with MapReduce [12], a framework for running large-scale parallel computations developed at Google. We have written a set of wrappers that allow a Bigtable to be used both as an input source and as an output target for ...
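
The "simple data model" named in the abstract is, per the Bigtable paper, a sparse, sorted, multi-dimensional map from (row key, column name, timestamp) to an uninterpreted byte string. A minimal in-memory sketch of that model follows; the class and method names are illustrative, not Bigtable's actual API.

```python
import bisect

class MiniBigtable:
    """Toy model: a sorted map keyed by (row, column, -timestamp),
    so the newest version of a cell sorts first."""

    def __init__(self):
        self._keys = []   # sorted list of (row, column, -timestamp)
        self._vals = {}   # key -> bytes

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def get(self, row, column):
        """Return the newest value stored in one cell, or None."""
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._vals[self._keys[i]]
        return None

# Row and column names follow the paper's webtable example.
t = MiniBigtable()
t.put("com.cnn.www", "contents:", 3, b"<html>...v3")
t.put("com.cnn.www", "contents:", 2, b"<html>...v2")
print(t.get("com.cnn.www", "contents:"))  # b'<html>...v3' (newest wins)
```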

Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers

by Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein , 2010
"... ..."
Abstract - Cited by 1001 (20 self) - Add to MetaCart
Abstract not found
(Show Context)

Citation Context

...GL [GL05], GraphLab [LGK+10], and Pregel [MAB+10], among others. Since all three follow the general outline above, we refer the reader to the individual papers for details. 9.4 MapReduce: MapReduce [DG08] is a popular programming model for distributed batch processing of very large datasets. It has been widely used in industry and academia, and its adoption has been bolstered by the open source projec...

Above the Clouds: A Berkeley View of Cloud Computing

by Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H. Katz, Andrew Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Matei Zaharia , 2009
"... personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires pri ..."
Abstract - Cited by 955 (14 self) - Add to MetaCart
personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission. Acknowledgement The RAD Lab's existence is due to the generous support of the founding members Google, Microsoft, and Sun Microsystems and of the affiliate members Amazon Web Services, Cisco Systems, Facebook, Hewlett-

Citation Context

..., Microsoft and others, were already doing so. Equally important, these companies also had to develop scalable software infrastructure (such as MapReduce, the Google File System, BigTable, and Dynamo [16, 20, 14, 17]) and the operational expertise to armor their datacenters against potential physical and electronic attacks. Therefore, a necessary but not sufficient condition for a company to become a Cloud Comput...

Cloud Computing and Emerging IT Platforms: Vision, Hype, and Reality for Delivering Computing as the 5th Utility

by Rajkumar Buyya, Chee Shin Yeo, Srikumar Venugopal, James Broberg, Ivona Brandic , 2008
"... With the significant advances in Information and Communications Technology (ICT) over the last half century, there is an increasingly perceived vision that computing will one day be the 5th utility (after water, electricity, gas, and telephony). This computing utility, like all other four existing u ..."
Abstract - Cited by 656 (63 self) - Add to MetaCart
With the significant advances in Information and Communications Technology (ICT) over the last half century, there is an increasingly perceived vision that computing will one day be the 5th utility (after water, electricity, gas, and telephony). This computing utility, like all other four existing utilities, will provide the basic level of computing service that is considered essential to meet the everyday needs of the general community. To deliver this vision, a number of computing paradigms have been proposed, of which the latest one is known as Cloud computing. Hence, in this paper, we define Cloud computing and provide the architecture for creating Clouds with market-oriented resource allocation by leveraging technologies such as Virtual Machines (VMs). We also provide insights on market-based resource management strategies that encompass both customer-driven service management and computational risk management to sustain Service Level Agreement (SLA)-oriented resource allocation. In addition, we reveal our early thoughts on interconnecting Clouds for dynamically creating global Cloud exchanges and markets. Then, we present some representative Cloud platforms, especially those developed in industries along with our current work towards realizing market-oriented resource allocation of Clouds as realized in Aneka enterprise Cloud technology. Furthermore, we highlight the difference between High Performance Computing (HPC) workload and Internet-based services workload. We also describe a meta-negotiation infrastructure to establish global Cloud

Citation Context

... increase prices when the availability of nodes is low and a lower β to reduce prices when there are more unused nodes which will otherwise be wasted. 8.2.2 Internet-based Services Workload: MapReduce [40] is one of the most popular programming models designed for data centers. It was originally proposed by Google to handle large-scale web search applications and has been proved to be an effective prog...
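
As a hypothetical sketch of the utilization-sensitive pricing rule this context alludes to (the function, thresholds, and β values here are our own illustration, not the paper's model):

```python
# Hypothetical utilization-sensitive pricing: charge a premium when
# few nodes are free, discount when capacity sits idle.
def price_per_node_hour(base_price, free_fraction,
                        beta_high=1.5, beta_low=0.8):
    beta = beta_high if free_fraction < 0.2 else beta_low
    return base_price * beta

print(price_per_node_hour(0.10, free_fraction=0.05))  # scarce  -> ~0.15
print(price_per_node_hour(0.10, free_fraction=0.60))  # surplus -> ~0.08
```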

Pig Latin: A Not-So-Foreign Language for Data Processing

by Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins
"... There is a growing need for ad-hoc analysis of extremely large data sets, especially at internet companies where innovation critically depends on being able to analyze terabytes of data collected every day. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively e ..."
Abstract - Cited by 607 (13 self) - Add to MetaCart
There is a growing need for ad-hoc analysis of extremely large data sets, especially at internet companies where innovation critically depends on being able to analyze terabytes of data collected every day. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively expensive at this scale. Besides, many of the people who analyze this data are entrenched procedural programmers, who find the declarative, SQL style to be unnatural. The success of the more procedural map-reduce programming model, and its associated scalable implementations on commodity hardware, is evidence of the above. However, the map-reduce paradigm is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain and reuse. We describe a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL and the low-level, procedural style of map-reduce. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source map-reduce implementation. We give a few examples of how engineers at Yahoo! are using Pig to dramatically reduce the time required for the development and execution of their data analysis tasks, compared to using Hadoop directly. We also report on a novel debugging environment that comes integrated with Pig, which can lead to even higher productivity gains. Pig is an open-source, Apache-incubator project, and available for general use.

Citation Context

...or code, toward writing declarative queries in SQL, which they often find unnatural and overly restrictive. As evidence of the above, programmers have been flocking to the more procedural map-reduce [4] programming model. A map-reduce program essentially performs a group-by-aggregation in parallel over a cluster of machines. The programmer provides a map function that dictates how the grouping is per...
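
A minimal single-process sketch of the group-by-aggregation pattern described above: a map function emits key-value pairs, the framework groups them by key, and a reduce function aggregates each group. This illustrates the model only; it is not Hadoop or Google's implementation.

```python
from collections import defaultdict

def map_fn(document):
    # Dictates how grouping is performed: one (word, 1) pair per token.
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Aggregates each group: total occurrences of one word.
    return (word, sum(counts))

def mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)            # the "shuffle": group by key
    for record in inputs:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [reduce_fn(k, vs) for k, vs in sorted(groups.items())]

docs = ["the quick brown fox", "the lazy dog"]
print(mapreduce(docs, map_fn, reduce_fn))
# [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```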

The Landscape of Parallel Computing Research: A View from Berkeley

by Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, Katherine A. Yelick - Technical Report, UC Berkeley, 2006
"... ..."
Abstract - Cited by 487 (25 self) - Add to MetaCart
Abstract not found

VL2: A Scalable and Flexible Data Center Network

by Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, David A. Maltz, Parveen Patel, Sudipta Sengupta - ACM SIGCOMM Computer Communication Review, 2009
"... Abstract To be agile and cost e ective, data centers should allow dynamic resource allocation across large server pools. In particular, the data center network should enable any server to be assigned to any service. To meet these goals, we present VL, a practical network architecture that scales t ..."
Abstract - Cited by 461 (12 self) - Add to MetaCart
To be agile and cost effective, data centers should allow dynamic resource allocation across large server pools. In particular, the data center network should enable any server to be assigned to any service. To meet these goals, we present VL2, a practical network architecture that scales to support huge data centers with uniform high capacity between servers, performance isolation between services, and Ethernet layer-2 semantics. VL2 uses (1) flat addressing to allow service instances to be placed anywhere in the network, (2) Valiant Load Balancing to spread traffic uniformly across network paths, and (3) end-system based address resolution to scale to large server pools, without introducing complexity to the network control plane. VL2's design is driven by detailed measurements of traffic and fault data from a large operational cloud service provider. VL2's implementation leverages proven network technologies, already available at low cost in high-speed hardware implementations, to build a scalable and reliable network architecture. As a result, VL2 networks can be deployed today, and we have built a working prototype. We evaluate the merits of the VL2 design using measurement, analysis, and experiments. Our VL2 prototype shuffles 2.7 TB of data among 75 servers in 395 seconds, sustaining a rate that is 94% of the maximum possible.
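
A toy sketch of the Valiant Load Balancing idea mentioned in the abstract: each flow is bounced off a randomly chosen intermediate switch, which spreads traffic uniformly across paths regardless of the demand matrix. The switch names and two-hop path model here are illustrative, not VL2's actual topology.

```python
import random
from collections import Counter

# Four hypothetical intermediate switches in the core.
INTERMEDIATES = ["int-1", "int-2", "int-3", "int-4"]

def vlb_path(src_tor, dst_tor):
    """Bounce the flow off a random intermediate switch."""
    return [src_tor, random.choice(INTERMEDIATES), dst_tor]

# Even if every flow targets the same destination, the random hop
# spreads load evenly across the core in expectation.
load = Counter(vlb_path("tor-A", "tor-B")[1] for _ in range(10_000))
print(load)  # roughly 2,500 flows per intermediate switch
```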

Power provisioning for a warehouse-sized computer

by Xiaobo Fan, Wolf-Dietrich Weber, Luiz André Barroso - ACM SIGARCH Computer Architecture News, 2007
"... ABSTRACT Large-scale Internet services require a computing infrastructure that can be appropriately described as a warehouse-sized computing system. The cost of building datacenter facilities capable of delivering a given power capacity to such a computer can rival the recurring energy consumption ..."
Abstract - Cited by 450 (2 self) - Add to MetaCart
Large-scale Internet services require a computing infrastructure that can be appropriately described as a warehouse-sized computing system. The cost of building datacenter facilities capable of delivering a given power capacity to such a computer can rival the recurring energy consumption costs themselves. Therefore, there are strong economic incentives to operate facilities as close as possible to maximum capacity, so that the non-recurring facility costs can be best amortized. That is difficult to achieve in practice because of uncertainties in equipment power ratings and because power consumption tends to vary significantly with the actual computing activity. Effective power provisioning strategies are needed to determine how much computing equipment can be safely and efficiently hosted within a given power budget. In this paper we present the aggregate power usage characteristics of large collections of servers (up to 15 thousand) for different classes of applications over a period of approximately six months. Those observations allow us to evaluate opportunities for maximizing the use of the deployed power capacity of datacenters, and assess the risks of over-subscribing it. We find that even in well-tuned applications there is a noticeable gap (7-16%) between achieved and theoretical aggregate peak power usage at the cluster level (thousands of servers). The gap grows to almost 40% in whole datacenters. This headroom can be used to deploy additional compute equipment within the same power budget with minimal risk of exceeding it. We use our modeling framework to estimate the potential of power management schemes to reduce peak power and energy usage. We find that the opportunities for power and energy savings are significant, but greater at the cluster-level (thousands of servers) than at the rack-level (tens). Finally, we argue that systems need to be power efficient across the activity range, and not only at peak performance levels.
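
A back-of-the-envelope reading of the datacenter-level figure above: if achieved peak power sits roughly 40% below the theoretical aggregate, the same power budget can host correspondingly more machines. Only the 40% gap comes from the abstract; the budget and per-server peak in this sketch are hypothetical.

```python
# Oversubscription headroom from the ~40% datacenter-level gap.
facility_budget_kw = 10_000              # hypothetical provisioned capacity
gap = 0.40                               # achieved vs. theoretical peak

theoretical_peak_kw = 0.25               # hypothetical per-server peak rating
achieved_peak_kw = theoretical_peak_kw * (1 - gap)

servers_naive = facility_budget_kw / theoretical_peak_kw
servers_oversubscribed = facility_budget_kw / achieved_peak_kw
print(int(servers_naive), int(servers_oversubscribed))  # 40000 66666
```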

The Eucalyptus open-source cloud-computing system

by Daniel Nurmi, Rich Wolski, Chris Grzegorczyk, Graziano Obertelli, Sunil Soman, Lamia Youseff, Dmitrii Zagorodnov - in Proceedings of Cloud Computing and Its Applications [Online]
"... Cloud computing systems fundamentally provide access to large pools of data and computational resources through a variety of interfaces similar in spirit to existing grid and HPC resource management and programming systems. These types of systems offer a new programming target for scalable applicati ..."
Abstract - Cited by 415 (9 self) - Add to MetaCart
Cloud computing systems fundamentally provide access to large pools of data and computational resources through a variety of interfaces similar in spirit to existing grid and HPC resource management and programming systems. These types of systems offer a new programming target for scalable application developers and have gained popularity over the past few years. However, most cloud computing systems in operation today are proprietary, rely upon infrastructure that is invisible to the research community, or are not explicitly designed to be instrumented and modified by systems researchers. In this work, we present EUCALYPTUS, an open-source software framework for cloud computing that implements what is commonly referred to as Infrastructure as a Service (IaaS): systems that give users the ability to run and control entire virtual machine instances deployed across a variety of physical resources. We outline the basic principles of the EUCALYPTUS design, detail important operational aspects of the system, and discuss architectural trade-offs that we have made in order to allow Eucalyptus to be portable, modular, and simple to use on infrastructure commonly found within academic settings. Finally, we provide evidence that EUCALYPTUS enables users familiar with existing Grid and HPC systems to explore new cloud computing functionality while maintaining access to existing, familiar application development software and Grid middleware.

Citation Context

...cation policies. Thanks in part to the new facilities provided by virtualization platforms, a large number of systems have been built using these technologies for providing scalable Internet services [4, 1, 8, 10, 11, 19, 37] that share many common characteristics: they must be able to rapidly scale up and down as workload fluctuates, support a large number of users requiring resources “on-demand”, and provide ...
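
Eucalyptus exposes an IaaS interface for running and controlling VM instances. As a hedged illustration, the sketch below drives an EC2-compatible endpoint with boto3; the endpoint URL, credentials, image ID, and instance type are placeholders, not real values.

```python
import boto3

# Hypothetical EC2-compatible endpoint for a private IaaS deployment.
ec2 = boto3.client(
    "ec2",
    endpoint_url="https://cloud.example.edu:8773/services/compute",
    region_name="eucalyptus",
    aws_access_key_id="EXAMPLE_KEY",
    aws_secret_access_key="EXAMPLE_SECRET",
)

# Launch one VM instance and report its ID.
resp = ec2.run_instances(ImageId="emi-12345678", MinCount=1, MaxCount=1,
                         InstanceType="m1.small")
print("started", resp["Instances"][0]["InstanceId"])
```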

Detecting influenza epidemics using search engine query data.

by Jeremy Ginsberg, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski, Larry Brilliant - Nature, 2009
"... ..."
Abstract - Cited by 413 (1 self) - Add to MetaCart
Abstract not found