Spark: Cluster Computing with Working Sets
Cited by 21 (5 self)
MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters. However, most of these systems are built around an acyclic data flow model that is not suitable for other popular applications. This paper focuses on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis tools. We propose a new framework called Spark that supports these applications while retaining the scalability and fault tolerance of MapReduce. To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
CIEL: a universal execution engine for distributed data-flow computing
- in Proceedings of the 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI). USENIX
Cited by 19 (6 self)
This paper introduces CIEL, a universal execution engine for distributed data-flow programs. Like previous execution engines, CIEL masks the complexity of distributed programming. Unlike those systems, a CIEL job can make data-dependent control-flow decisions, which enables it to compute iterative and recursive algorithms. We have also developed Skywriting, a Turing-complete scripting language that runs directly on CIEL. The execution engine provides transparent fault tolerance and distribution to Skywriting scripts and high-performance code written in other programming languages. We have deployed CIEL on a cloud computing platform, and demonstrate that it achieves scalable performance for both iterative and non-iterative algorithms.
Mesos: A platform for fine-grained resource sharing in the data center
- UC Berkeley
, 2010
Cited by 18 (6 self)
We present Mesos, a platform for sharing commodity clusters between multiple diverse cluster computing frameworks, such as Hadoop and MPI. Sharing improves cluster utilization and avoids per-framework data replication. Mesos shares resources in a fine-grained manner, allowing frameworks to achieve data locality by taking turns reading data stored on each machine. To support the sophisticated schedulers of today’s frameworks, Mesos introduces a distributed two-level scheduling mechanism called resource offers. Mesos decides how many resources to offer each framework, while frameworks decide which resources to accept and which computations to run on them. Our experimental results show that Mesos can achieve near-optimal locality when sharing the cluster among diverse frameworks, can scale up to 50,000 nodes, and is resilient to node failures.
Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing
, 2011
Cited by 11 (3 self)
We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.
Dominant resource fairness: Fair allocation of multiple resource types
, 2011
Cited by 9 (2 self)
We consider the problem of fair resource allocation in a system containing different resource types, where each user may have different demands for each resource. To address this problem, we propose Dominant Resource Fairness (DRF), a generalization of max-min fairness to multiple resource types. We show that DRF, unlike other possible policies, satisfies several highly desirable properties. First, DRF incentivizes users to share resources, by ensuring that no user is better off if resources are equally partitioned among them. Second, DRF is strategy-proof, as a user cannot increase her allocation by lying about her requirements. Third, DRF is envy-free, as no user would want to trade her allocation with that of another user. Finally, DRF allocations are Pareto efficient, as it is not possible to improve the allocation of a user without decreasing the allocation of another user. We have implemented DRF in the Mesos cluster resource manager, and show that it leads to better throughput and fairness than the slot-based fair sharing schemes in current cluster schedulers.
Disk-Locality in Datacenter Computing Considered Irrelevant
Cited by 4 (2 self)
Data center computing is becoming pervasive in many organizations. Computing frameworks such as MapReduce [17], Hadoop [6] and Dryad [25] split jobs into small tasks that are run on the cluster’s compute nodes.
Online Aggregation for Large MapReduce Jobs
Cited by 1 (0 self)
In online aggregation, a database system processes a user’s aggregation query in an online fashion. At all times during processing, the system gives the user an estimate of the final query result, with confidence bounds that become tighter over time. In this paper, we consider how online aggregation can be built into a MapReduce system for large-scale data processing. Given the MapReduce paradigm’s close relationship with cloud computing (in that one might expect a large fraction of MapReduce jobs to be run in the cloud), online aggregation is a very attractive technology. Since large-scale cloud computations are typically pay-as-you-go, a user can monitor the accuracy obtained in an online fashion, and then save money by killing the computation early once sufficient accuracy has been obtained.
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
Verifiable Resource Accounting for Cloud Computing Services
Cloud computing offers users the potential to reduce operating and capital expenses by leveraging the amortization benefits offered by large, managed infrastructures. However, the black-box and dynamic nature of the cloud infrastructure makes it difficult for users to reason about the expenses that their applications incur. At the same time, the profitability of cloud providers depends on their ability to multiplex several customer applications to maintain high utilization levels. However, this multiplexing may cause providers to incorrectly attribute resource consumption to customers or implicitly bear additional costs, thereby reducing their cost-effectiveness. Our position in this paper is that for cloud computing as a paradigm to be sustainable in the long term, we need a systematic approach for verifiable resource accounting. Verifiability here means that cloud customers can be assured that (a) their applications indeed physically consumed the resources they were charged for and (b) this consumption was justified based on an agreed policy. As a first step toward this vision, in this paper we articulate the challenges and opportunities for realizing such a framework.
Towards Synthesizing Realistic Workload Traces for Studying the Hadoop Ecosystem
Designing cloud computing setups is a challenging task. It involves understanding the impact of a plethora of parameters ranging from cluster configuration, partitioning, networking characteristics, and the targeted applications’ behavior. The design space, and the scale of the clusters, make it cumbersome and error-prone to test different cluster configurations using real setups. Thus, the community is increasingly relying on simulations and models of cloud setups to infer system behavior and the impact of design choices. The accuracy of the results from such approaches depends on the accuracy and realistic nature of the workload traces employed. Unfortunately, few cloud workload traces are available (in the public domain). In this paper, we present the key steps towards analyzing the traces that have been made public, e.g., from Google, and inferring lessons that can be used to design realistic cloud workloads as well as enable thorough quantitative studies of Hadoop design. Moreover, we leverage the lessons learned from the traces to undertake two case studies: (i) Evaluating Hadoop job schedulers; and (ii) Quantifying the impact of shared storage on Hadoop system performance. Keywords: Cloud computing, Performance analysis, Design optimization, Software performance modeling

